# **WORD VECTORS**

Word vectors represent each word numerically as a vector that captures how the word is used or what it means. They are also known as *word embeddings*.

Similar words have similar vectors.

For example: girl, boy, king, queen, prince, princess, woman, man.

<img src = "https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2018/01/one-hot-word-embedding-vectors-768x276.png" width = "500"/>


<img src = "https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2018/01/3-dimensional-word-embeddings-example.png" width = "500" />

These vectors can be used as input features for ML algorithms.
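To make the idea that similar words have similar vectors concrete, here is a minimal sketch with hand-picked 3-dimensional vectors. The values and the meaning of each dimension (royalty, femininity, age) are assumptions chosen for illustration; real embeddings are learned from text and their dimensions are not directly interpretable.

```python
import numpy as np

# Hypothetical vectors: (royalty, femininity, age). Illustration only.
toy_vectors = {
    "king":  np.array([0.95, 0.05, 0.7]),
    "queen": np.array([0.95, 0.95, 0.6]),
    "boy":   np.array([0.05, 0.05, 0.1]),
    "girl":  np.array([0.05, 0.95, 0.1]),
}

def nearest(word):
    """Return the other word whose vector is closest in Euclidean distance."""
    others = [w for w in toy_vectors if w != word]
    return min(others, key=lambda w: np.linalg.norm(toy_vectors[word] - toy_vectors[w]))

print(nearest("king"))
print(nearest("boy"))
```

With these made-up values, each word's nearest neighbour is the word that differs from it in only one of the three properties.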

In [None]:
!spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [None]:
import spacy

# Load the small English model downloaded above
nlp = spacy.load("en_core_web_sm")

In [None]:
import numpy as np

# Disable pipeline components that are not needed
text = "These vectors can be used as features for machine learning models."
with nlp.disable_pipes():
    # One row per token in the sentence
    vectors = np.array([token.vector for token in nlp(text)])

vectors.shape    

(12, 96)

In [None]:
vectors[1]

array([-1.3429472 , -2.240938  , -3.213147  , -2.2139177 ,  0.3327363 ,
       -0.38399982,  1.5048683 ,  1.6689196 ,  0.406124  ,  0.8455851 ,
        0.37744203,  5.2982664 ,  0.9432245 , -1.171901  , -0.8502863 ,
       -0.6752609 , -3.6783633 ,  0.939524  ,  4.992652  ,  2.2854311 ,
       -1.3392693 ,  1.1788974 ,  0.33695132, -2.2711327 ,  3.1466722 ,
        0.89087856, -3.6001291 , -5.9416976 , -1.8832479 ,  0.89249367,
        1.4950736 ,  1.1052675 , -2.3659415 , -4.6182976 ,  4.0185924 ,
       -3.0996244 ,  1.230273  ,  2.3710747 , -1.6388221 ,  2.7848885 ,
        2.7772512 ,  3.45545   , -2.0276263 , -5.5858855 ,  1.2401128 ,
        0.8873438 ,  0.74940133, -2.9127874 ,  1.016153  ,  1.7863564 ,
       -1.530396  , -0.83653593, -2.3035686 , -1.2321254 ,  0.5537652 ,
        0.24352777,  0.61954427, -0.5755199 , -1.5466686 ,  0.5532625 ,
       -0.5666108 ,  5.241022  , -0.24276963, -0.67211723,  4.1474586 ,
        2.0798903 , -1.5468581 , -4.217361  ,  4.335536  , -0.85

However, a single word vector says little on its own; for word embeddings to be useful, the vectors of an entire document are needed.

# Classification models

scikit-learn can be used to train them.

There are many ways to combine the word vectors of a document into features for a model. A simple approach that generally works well is to average them.

spaCy computes this average vector and exposes it as doc.vector
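The averaging step can be sketched with plain NumPy. The token vectors below are made-up numbers standing in for what a model would produce, so this illustrates the idea rather than spaCy's own implementation.

```python
import numpy as np

# Hypothetical token vectors for a three-token document (one row per token).
token_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
])

# Averaging over the token axis yields one fixed-length vector per document,
# no matter how many tokens the document contains.
doc_vector = token_vectors.mean(axis=0)

print(doc_vector.shape)
print(doc_vector)
```

Because every document ends up with a vector of the same length, the results can be stacked into a single feature matrix for a classifier.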

In [None]:
import pandas as pd

spam = pd.read_csv('/content/spam.csv')

with nlp.disable_pipes():
    doc_vectors = np.array([nlp(text).vector for text in spam.text])
    
doc_vectors.shape

(5572, 96)

In [None]:
print(spam.iloc[0:10])

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
5  spam  FreeMsg Hey there darling it's been 3 week's n...
6   ham  Even my brother is not like to speak with me. ...
7   ham  As per your request 'Melle Melle (Oru Minnamin...
8  spam  WINNER!! As a valued network customer you have...
9  spam  Had your mobile 11 months or more? U R entitle...


In [None]:
spam.iloc[0, 1]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [None]:
doc_vectors[0]

array([ 5.2395266e-01, -1.2636948e-01, -1.3072444e-01,  5.1887769e-02,
        2.2692590e+00,  1.0398932e+00,  5.4671544e-01,  1.2430009e-01,
        9.4206452e-01,  8.9603591e-01,  7.3549652e-01, -4.7702563e-01,
        7.5878668e-01, -1.2702343e-01, -3.5136822e-01, -3.1757599e-01,
       -6.8197894e-01,  4.9975929e-01, -6.1296284e-01, -1.3100659e+00,
       -4.1902740e-02, -5.8360463e-01, -3.8601562e-01, -4.8014584e-01,
       -2.1765793e-02,  2.2383940e-01,  1.9851983e-01, -1.0353637e+00,
        8.6129165e-01, -4.8231784e-01, -6.0692739e-02, -4.8968062e-01,
        3.8796452e-01,  1.3823952e-01,  1.8447034e-01, -4.1515124e-01,
        1.7807012e+00, -8.9802718e-01, -5.2345365e-01, -5.0079030e-01,
        1.1541905e+00, -6.8005815e-02, -4.1152763e-01, -2.0579782e+00,
       -7.2456425e-01, -7.1640790e-02,  5.3028065e-01, -3.9056882e-01,
       -8.5317832e-01,  4.1951501e-01,  5.8040822e-01, -3.5850009e-01,
       -2.0739418e-01,  4.0847859e-03, -2.1024301e+00, -5.5083964e-02,
      

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, spam.label,
                                                    test_size=0.1, random_state=1)

Let's use an SVM.

In [None]:
from sklearn.svm import LinearSVC

# dual=False speeds up training (preferred when n_samples > n_features)
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print("Accuracy: ", svc.score(X_test, y_test) * 100, "%" )

Accuracy:  94.44444444444444 %
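To read this accuracy in terms of individual messages, the arithmetic can be worked out as a sketch. It assumes that a fractional test_size is rounded up to a whole number of samples, which is consistent with the printed value.

```python
import math

n_samples = 5572   # rows in doc_vectors, from the shape printed earlier
test_size = 0.1    # fraction passed to train_test_split

# Assumption: the fractional test size is rounded up to whole samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

reported_accuracy = 0.9444444444444444
n_correct = round(reported_accuracy * n_test)

print(n_train, n_test, n_correct)
```

Under that assumption the score corresponds to 17 out of every 18 held-out messages being classified correctly.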


# Document similarity

Similar documents should have similar vectors. One way to measure this is cosine similarity, the cosine of the angle between two vectors $a$ and $b$:

$\cos \theta = \frac{a \cdot b}{\|a\| \, \|b\|}$

which ranges from -1 to 1.

In [None]:
def cosine_similarity(a, b):
    return a.dot(b)/np.sqrt(a.dot(a) * b.dot(b))
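As a quick check of the stated range, the same function can be applied to hand-picked vectors (these are illustrative and not part of the notebook's data):

```python
import numpy as np

def cosine_similarity(a, b):
    return a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

a = np.array([1.0, 2.0])

# Scaling a vector changes its length but not its direction.
same_direction = cosine_similarity(a, 3 * a)
# [-2, 1] is perpendicular to [1, 2] because their dot product is zero.
orthogonal = cosine_similarity(a, np.array([-2.0, 1.0]))
# Negating a vector flips its direction.
opposite = cosine_similarity(a, -a)

print(same_direction, orthogonal, opposite)
```

The three cases land at the top, middle, and bottom of the range, and vector length plays no role in the result.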

In [None]:
a = nlp("REPLY NOW FOR FREE TEA").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector
cosine_similarity(a, b)

0.5459551

In [None]:
c = nlp("HERE IS YOUR FREE GREEN TEA").vector
cosine_similarity(a, c)

0.7245726

## Exercises

We will work on sentiment analysis of restaurant reviews.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy

In [None]:
# Read the file

review_data = pd.read_csv('/content/yelp_ratings.csv')
review_data.head()

Unnamed: 0,text,stars,sentiment
0,Total bill for this horrible service? Over $8G...,1.0,0
1,I *adore* Travis at the Hard Rock's new Kelly ...,5.0,1
2,I have to say that this office really has it t...,5.0,1
3,Went in for a lunch. Steak sandwich was delici...,5.0,1
4,Today was my second out of three sessions I ha...,1.0,0


In [None]:
# Build the vectors for the first 150 reviews only
# Computing the vectors for the whole dataset takes more than 20 minutes
reviews = review_data[:150]
with nlp.disable_pipes():
    vectors = np.array([nlp(review.text).vector for idx, review in reviews.iterrows()])
    
vectors.shape

(150, 96)

Train the model on the vectors

In [None]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, review_data.sentiment[:150], test_size=0.1, random_state=1)

# Create the LinearSVC model
model = LinearSVC(random_state=1, dual=False)
# Train the model
model.fit(X_train, y_train)

print("Accuracy: ", model.score(X_test, y_test) * 100)


Accuracy:  86.66666666666667


### Find the most similar reviews
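The cell that follows subtracts the mean vector before computing similarities. A sketch with made-up 2-dimensional vectors suggests why this helps: when all documents share a large common component, raw cosine similarities bunch up near 1, and centering spreads them out. The vectors and the `shared` offset are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    return a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

# Hypothetical document vectors: a large shared component plus small differences.
shared = np.array([10.0, 10.0])
docs = shared + np.array([
    [ 1.0, 0.0],
    [ 0.9, 0.1],
    [-1.0, 0.0],
])
query = shared + np.array([1.0, 0.1])

# Without centering, the shared component dominates every comparison.
raw_sims = np.array([cosine_similarity(query, d) for d in docs])

# Subtract the mean of the document vectors from documents and query alike.
mean = docs.mean(axis=0)
centered_docs = docs - mean
centered_sims = np.array([cosine_similarity(query - mean, d) for d in centered_docs])

print(raw_sims)
print(centered_sims)
```

In this toy setup the raw similarities are nearly indistinguishable, while the centered ones clearly separate the documents that differ from the query.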

In [None]:
review = """I absolutely love this place. The 360 degree glass windows with the 
Yerba buena garden view, tea pots all around and the smell of fresh tea everywhere 
transports you to what feels like a different zen zone within the city. I know 
the price is slightly more compared to the normal American size, however the food 
is very wholesome, the tea selection is incredible and I know service can be hit 
or miss often but it was on point during our most recent visit. Definitely recommend!

I would especially recommend the butternut squash gyoza."""

def cosine_similarity(a, b):
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))

review_vec = nlp(review).vector

# Center the document vectors
# Compute the mean of the vectors
vec_mean = vectors.mean(axis=0)
# Subtract the mean from the vectors
centered = vectors - vec_mean

# Compute the similarity to each document
# Make sure to subtract the mean from the review vector as well
sims = np.array([cosine_similarity(review_vec - vec_mean, vec) for vec in centered])

# Get the index of the most similar document
most_similar = sims.argmax()

print(review_data.iloc[most_similar].text)

Yes... the Boba Tea explosion is in full force. I have been to Lee Lee International Supermarket in Chandler many times, but I never noticed this little gem next to it until a couple years ago. Boba Tea House has serving up some of the best boba tea in the Valley long before it became a big thing. They have a fantastic array of flavors and drink choices to choose like fruit slushes, snow, milk tea, pudding, mango jelly, coffee jelly, etc. They even have snacks like popcorn chicken, fried tofu, and fries. The staff is super friendly and the prices are reasonable. I still laugh at my friends who have no idea what Boba Tea is or are too afraid to suck up one of those chewy ball things. LOL. In case you didn't know, Boba Tea is a flavored tea (usually with milk) to which chewy tapioca balls or fruit jellies are added. I think they are super delicious. Today I got the Blueberry Milk Boba Tea and it made for the perfect snack in the middle of my day. Another favorite of my mine is the honeyd