### NLP TP3

The dataset consists of reviews from the 'tripadvisor' site (a company that lets users plan trips through online travel agencies).

In [1]:
import os
import platform

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import multiprocessing
from gensim.models import Word2Vec

from keras.preprocessing.text import text_to_word_sequence

Loading the dataset

In [2]:
df = pd.read_csv('dataset/tripadvisor_reviews.csv')

df.head()

Unnamed: 0,rating_review,review_full
0,5,"Totally in love with the Auro of the place, re..."
1,5,I went this bar 8 days regularly with my husba...
2,5,We were few friends and was a birthday celebra...
3,5,Fatjar Cafe and Market is the perfect place fo...
4,5,"Hey Guys, if you are craving for pizza and sea..."


In [3]:
df['review_full'][0]

"Totally in love with the Auro of the place, really beautiful and quite fancy at the same time. The ambience is very pure and gives a sense of positivity throughout. Outdoor and indoor interior are quite quaint and cute. Love the open kitchen idea and there whole marketplace ideology. Due to coronovirus they specifically use disposable cutlery to keep the pandemic in mind taking all the precautionary measures from the beginning of the place with the mask on their staff and using good sanitisation. The food is really amazing specially the pizza straight from the oven and the hummus and pita bread are quite delicious too. If you're looking for a classy yet soothing Italian place in Delhi,Fatjar is a go to for you!"

Count of reviews for each rating (1 to 5)

In [4]:
df['rating_review'].value_counts()

rating_review
5    72390
4    50248
3    15936
2     4552
1     4455
Name: count, dtype: int64

Select the reviews with rating 1

In [5]:
sele_rating1 = df[df['rating_review']==1]

sele_rating1

Unnamed: 0,rating_review,review_full
115,1,The service was OK and the food seemed good. E...
174,1,Worst experience we had in a bar so far. Cockt...
223,1,Good Food My Chinese Group Like This Restauran...
547,1,I went to Kylin to celebrate my Mom & Dad's an...
608,1,The food at Coast Cafe last night was horrible...
...,...,...
146206,1,Annamaya was such a disappointment.The food is...
147092,1,We made reservations for our party of 14 to ce...
147333,1,I did booking with dine out and Went to the re...
147493,1,Food served very late and cold. Did not have e...


Of the selected reviews, only the first 1000 are used, because of the computational load the full set would represent

In [6]:
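div_reviews = []

# Split each of the first 1000 rating-1 reviews into sentences on '.'
for i in range(1000):
    div_reviews.extend(sele_rating1.iloc[i, 1].split('.'))
    # Drop the last fragment produced by split()
    # (the empty string when the review ends with a period)
    div_reviews.pop()
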

df_reviews = pd.DataFrame(div_reviews)

print(len(div_reviews))

df_reviews.head()



8483


Unnamed: 0,0
0,The service was OK and the food seemed good
1,Except that my wife ate scallops and got a ga...
2,Worst experience we had in a bar so far
3,Cocktails with almost no alcohol inside
4,Clearly let them notice about it and staff di...
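The loop above splits on `'.'` and then calls `pop()`, which removes the last fragment of each review. That fragment is empty only when the review ends with a period; otherwise a real sentence is discarded. A minimal sketch of an alternative splitter is shown here. The function name `split_sentences` and the sample text are hypothetical, and using it would change the sentence count reported above.

```python
import re

def split_sentences(review):
    """Split a review into sentences, keeping a final sentence that lacks a period."""
    # Split on runs of '.', '!' or '?', then trim whitespace and drop empty fragments.
    parts = re.split(r"[.!?]+", review)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical review text, not taken from the dataset.
sample = "Great pizza. Slow service"
print(split_sentences(sample))
```

Filtering empty fragments instead of popping the last one keeps the final sentence whether or not the review ends with punctuation.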


In [7]:
print("Number of documents:", df_reviews.shape[0])

Number of documents: 8483


In [8]:
sentence_tokens = []
# Go through every row and turn each sentence into
# a sequence of words (this could also be done with NLTK or spaCy)
for _, row in df_reviews.iterrows():
    sentence_tokens.append(text_to_word_sequence(row[0]))

In [9]:
len(sentence_tokens[1])

10

In [10]:
# Let's take a look
sentence_tokens[1]

['except',
 'that',
 'my',
 'wife',
 'ate',
 'scallops',
 'and',
 'got',
 'a',
 'gastroenteritis']
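The token list above comes from `text_to_word_sequence`, which lowercases the text, strips punctuation, and splits it into words. A rough standard-library approximation of that behaviour is sketched here. The name `simple_word_sequence` is hypothetical, and the filter set is an assumption modelled on the Keras defaults (punctuation removed, apostrophes kept).

```python
import string

# Characters to strip. Keeping the apostrophe is an assumption that mirrors
# the default filter set of Keras' text_to_word_sequence.
FILTERS = string.punctuation.replace("'", "") + "\t\n"

def simple_word_sequence(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

# Hypothetical sentence, not taken from the dataset.
print(simple_word_sequence("The food was cold, and the staff didn't care!"))
```

Each sentence becomes a list of lowercase tokens, which is the input format `Word2Vec` expects.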

In [20]:
from gensim.models.callbacks import CallbackAny2Vec
# By default gensim does not report the loss at each epoch during training
# We subclass the callback to get this information
class callback(CallbackAny2Vec):
    """
    Callback to print loss after each epoch
    """
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        if self.epoch == 0:
            print('Loss after epoch {}: {}'.format(self.epoch, loss))
        else:
            print('Loss after epoch {}: {}'.format(self.epoch, loss- self.loss_previous_step))
        self.epoch += 1
        self.loss_previous_step = loss

In [38]:
# Create the model that generates the vectors
# In this case we use the Skip-gram architecture
w2v_model = Word2Vec(min_count=5,    # minimum word frequency to be included in the vocabulary
                     window=2,       # number of words before and after the predicted one
                     vector_size=50, # dimensionality of the vectors
                     negative=20,    # number of negative samples (0 disables negative sampling)
                     workers=1,      # increase this value if more cores are available
                     sg=1)           # model 0:CBOW  1:skipgram
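To illustrate what `window=2` means in Skip-gram, the sketch below lists the (center, context) pairs produced by a fixed symmetric window. It is an idealized illustration: the function `skipgram_pairs` is hypothetical, and the library's actual sampling (for example, window shrinking or subsampling) may differ.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # first context index
        hi = min(len(tokens), i + window + 1)   # one past the last context index
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["except", "that", "my", "wife"]
for pair in skipgram_pairs(tokens, window=2):
    print(pair)
```

Each pair is one training example: the model learns to predict the context word from the center word.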

In [39]:
# Build the vocabulary from the tokens
w2v_model.build_vocab(sentence_tokens)

In [40]:
# Number of rows/docs found in the corpus
print("Number of docs in the corpus:", w2v_model.corpus_count)

Number of docs in the corpus: 8483


In [41]:
# Number of distinct words kept in the vocabulary
print("Number of distinct words in the corpus:", len(w2v_model.wv))

Number of distinct words in the corpus: 1892
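The vocabulary size reported here counts only the words that pass the `min_count=5` threshold, not every distinct token in the sentences. A minimal sketch of that filtering step, using a hypothetical helper `build_vocab` and toy sentences rather than the real corpus:

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Return the words that occur at least min_count times, most frequent first."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [word for word, n in counts.most_common() if n >= min_count]

# Toy tokenized sentences, not taken from the dataset.
toy = [["food", "was", "cold"], ["food", "was", "late"], ["food", "cold"]]
print(build_vocab(toy, min_count=2))
```

Raising `min_count` shrinks the vocabulary and removes rare words whose vectors would be trained on very few examples.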


In [42]:
# Train the model that generates the vectors
# using our callback
w2v_model.train(sentence_tokens,
                 total_examples= w2v_model.corpus_count,
                 epochs=50,
                 compute_loss = True,
                 callbacks=[callback()]
                 )

Loss after epoch 0: 881340.5625
Loss after epoch 1: 634855.9375
Loss after epoch 2: 595770.5
Loss after epoch 3: 513201.5
Loss after epoch 4: 510547.0
Loss after epoch 5: 505898.0
Loss after epoch 6: 502111.0
Loss after epoch 7: 471652.5
Loss after epoch 8: 464798.5
Loss after epoch 9: 463657.5
Loss after epoch 10: 459887.0
Loss after epoch 11: 457241.0
Loss after epoch 12: 455366.0
Loss after epoch 13: 454708.0
Loss after epoch 14: 451367.5
Loss after epoch 15: 450208.0
Loss after epoch 16: 430635.5
Loss after epoch 17: 424050.0
Loss after epoch 18: 424414.0
Loss after epoch 19: 425453.0
Loss after epoch 20: 421016.0
Loss after epoch 21: 419370.0
Loss after epoch 22: 418900.0
Loss after epoch 23: 418706.0
Loss after epoch 24: 415868.0
Loss after epoch 25: 416420.0
Loss after epoch 26: 413922.0
Loss after epoch 27: 414489.0
Loss after epoch 28: 412765.0
Loss after epoch 29: 412481.0
Loss after epoch 30: 412063.0
Loss after epoch 31: 409192.0
Loss after epoch 32: 409317.0
Loss after epo

(3287702, 5181300)
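The callback subtracts the previous value because `get_latest_training_loss()` returns a running total accumulated over the epochs of a `train()` call, not a per-epoch figure. The differencing can be sketched in isolation as follows. The helper `per_epoch_losses` and the numbers are hypothetical, not taken from the run above.

```python
def per_epoch_losses(cumulative):
    """Convert a running loss total into per-epoch increments."""
    losses = []
    previous = 0.0
    for total in cumulative:
        losses.append(total - previous)  # loss contributed by this epoch only
        previous = total
    return losses

# Hypothetical running totals after epochs 0, 1 and 2.
running = [100.0, 170.0, 220.0]
print(per_epoch_losses(running))  # [100.0, 70.0, 50.0]
```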

In [43]:
# Words MOST related to...:
w2v_model.wv.most_similar(positive=["except"], topn=5)

[('12', 0.6140331029891968),
 ('acceptable', 0.6028549671173096),
 ('equivalent', 0.5951762795448303),
 ('first', 0.5649888515472412),
 ('spread', 0.5616462826728821)]
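The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of the measure in plain Python, using hypothetical 2-dimensional vectors rather than the trained 50-dimensional embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors: identical direction, then orthogonal.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

A score near 1 means the two words appear in similar contexts in the training sentences; it does not by itself mean they are synonyms.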

In [44]:
from sklearn.decomposition import IncrementalPCA    
from sklearn.manifold import TSNE                   
import numpy as np                                  

def reduce_dimensions(model):
    num_dimensions = 2  

    vectors = np.asarray(model.wv.vectors)
    #labels = np.asarray(model.wv.index2word)
    labels = np.asarray(model.wv.index_to_key)  

    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

In [45]:
# Plot the embeddings in 2D
import plotly.graph_objects as go
import plotly.express as px

x_vals, y_vals, labels = reduce_dimensions(w2v_model)

MAX_WORDS=200
fig = px.scatter(x=x_vals[:MAX_WORDS], y=y_vals[:MAX_WORDS], text=labels[:MAX_WORDS])
fig.show() # this is for plotly in Colab

Conclusions:

- The higher the embedding dimensionality, the lower the loss per epoch.

- Similarity is observed among several key words such as should, great, make, place, etc.

- The embedding vector size should not be reduced below 50, which is the stable balance point between the computational load and the loss reached.

- Increasing the context window size above 4 slightly degrades the performance of the embedding.
