# Assignment 2: Natural Language Processing

__Due date: May 8, 2023__

The goal of this assignment is to apply the theoretical concepts covered in class in the NLP module. The assignment consists of 2 notebooks, to be submitted together through the submission task enabled on the Campus Virtual.

The most important part of this assignment is not the Python code, but the analysis of the data and models you build and the reasoned explanations of each decision you make. __Code snippets or plots without any context or explanation will not be graded__.

Finally, remember to set the `random_state` parameter in every function that makes random decisions so that the results are reproducible (that is, they do not vary between runs).

In [1]:
RANDOM_STATE = 1234
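
As a small illustration of why a fixed seed matters, the sketch below uses only the standard library: two generators seeded with the same value produce the same shuffle. The helper name `shuffled_copy` is hypothetical and not part of the assignment.

```python
import random

RANDOM_STATE = 1234

def shuffled_copy(items, seed):
    """Return a shuffled copy of `items` using a private, seeded generator."""
    rng = random.Random(seed)   # independent of the global random state
    result = list(items)
    rng.shuffle(result)
    return result

first = shuffled_copy(range(10), RANDOM_STATE)
second = shuffled_copy(range(10), RANDOM_STATE)
print(first == second)  # same seed, same order
```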

# Part 1: Sentiment analysis with word embeddings


__Group number: 20__

__Student names: Alejandro Barrachina Argudo and Juan Pablo Corella Martín__

## 1) Loading the dataset

The file `IMDB_Dataset.csv` contains movie reviews classified into 2 different categories (positive/negative).

This dataset was created from the "IMDB Dataset of 50K Movie Reviews", which contains 50,000 movie reviews, each labeled with a positive or negative sentiment.

Show one example of each class.

Study the dataset. Which words appear most often? Would it make sense to normalize the corpus in some way?

Split the data, keeping 80% for training and the remaining 20% for testing, using the `train_test_split` function from sklearn.


In [2]:
# access to Google Drive

# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
import pandas as pd
import numpy as np


In [4]:
# imbd_file = '/content/drive/MyDrive/IA2/p3/IMDB_Dataset.csv'
imbd_file = './IMDB_Dataset.csv'

df=pd.read_csv(imbd_file)
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
print("Positive:\n" + df.review[0])
print("Negative:\n" + df.review[3])

Positive:
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due 

It makes sense to normalize the corpus, since the extracted text includes HTML tags, punctuation marks and uppercase letters.

In [6]:
import nltk
import re
from os import path, getcwd

nltk_path = path.join(getcwd(), 'data')
nltk.data.path.append(nltk_path)
# nltk.download('stopwords', nltk_path)

wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')


def normalize_document(doc):
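    # replace HTML line breaks with spaces, then remove special characters
    doc = re.sub('<br />', ' ', doc)
    # flags must be passed by keyword: the fourth positional argument of re.sub is `count`
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    # lower case and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()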
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc


normalize_corpus = np.vectorize(normalize_document)
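
One detail of `re.sub` is worth spelling out: its fourth positional parameter is `count`, not `flags`, so flags are safest passed by keyword. The sketch below, on a made-up string, shows how a flag value interpreted as a count caps the number of substitutions.

```python
import re

text = "a!b!c!d!"

# re.I is an integer flag; used as `count`, it limits the number of replacements
as_count = int(re.I)
capped = re.sub(r'[^a-z]', '', text, count=as_count)  # stops after `as_count` replacements
full = re.sub(r'[^a-z]', '', text, flags=re.I)        # every non-letter is removed

print(as_count, capped, full)
```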

In [82]:
df_normalized = df.copy(True)
df_normalized.review = normalize_corpus(df.review)
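
The assignment asks which words appear most often. A minimal way to answer that, sketched here on a made-up toy corpus rather than on `df_normalized.review`, is to count whitespace-separated tokens with `collections.Counter`.

```python
from collections import Counter

# Toy normalized reviews (made up); in the notebook this would be df_normalized.review
toy_reviews = [
    "great film great acting",
    "boring film bad plot",
    "great plot",
]

# Count every token across all reviews
counts = Counter(word for review in toy_reviews for word in review.split())
top_words = counts.most_common(2)
print(top_words)
```

Running the same count separately on the positive and negative reviews would show which frequent words are specific to each class.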

In [83]:
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

max_words = 1500    # keep the 1500 most frequent words
max_comment_length = 20 # sequences are 20 words long; shorter ones are padded with zeros

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df_normalized.review)

sequences = tokenizer.texts_to_sequences(df_normalized.review)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
max_words = len(word_index)  # from here on, max_words is the full vocabulary size (no longer 1500)

data = pad_sequences(sequences, maxlen=max_comment_length)
word_index.keys()

Found 162148 unique tokens.
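
To make the tokenizing and padding step more concrete, here is a rough pure-Python sketch of what these two calls do with their default settings, as far as we understand them: words are ranked by frequency, only ids below `num_words` are kept, and sequences are truncated and zero-padded at the front. The helper names `build_word_index` and `to_padded_sequence` are hypothetical and are not Keras functions.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank: 1 = most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(text, word_index, num_words, maxlen):
    """Keep words ranked below num_words, keep the last maxlen ids, left-pad with 0."""
    ids = [word_index[w] for w in text.split()
           if w in word_index and word_index[w] < num_words]
    ids = ids[-maxlen:]                       # truncate from the front
    return [0] * (maxlen - len(ids)) + ids    # pad at the front

texts = ["good film good plot", "bad film"]
index = build_word_index(texts)
print(index)
print(to_padded_sequence("good film awful plot", index, num_words=3, maxlen=4))
```

Note that, with `num_words=1500` and a length of 20, only the last 20 in-vocabulary words of each review reach the model.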




In [84]:
tokenizer = Tokenizer(num_words=max_words)
df_posit = df_normalized[df_normalized["sentiment"] == 'positive']
tokenizer.fit_on_texts(df_posit.review)
words_posit = tokenizer.word_index

tokenizer = Tokenizer(num_words=max_words)
df_neg = df_normalized[df_normalized["sentiment"] == 'negative']
tokenizer.fit_on_texts(df_neg.review)
words_neg = tokenizer.word_index

In [85]:
# tensorflow works with numeric variables,
# so we replace the sentiment labels with 1 (positive) and 0 (negative)
df_normalized["sentiment"] = df_normalized["sentiment"].map({"positive": 1, "negative": 0})

In [12]:
# Split the data, keeping 80% for training and the remaining 20% for testing,
# using the `train_test_split` function from sklearn.
#------------------------------------------------------------------------------

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data, df_normalized.sentiment, test_size=0.20,
                                                    random_state=RANDOM_STATE, stratify = df_normalized.sentiment)

print("Training texts:", len(y_train))
print("Test texts:", len(y_test))

Training texts: 40000
Test texts: 10000
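
The `stratify` argument keeps the proportion of positive and negative reviews the same in both parts. The idea can be sketched by hand with the standard library; this is an illustration of the concept, not sklearn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), splitting each class separately."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)   # same share taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 10 + [0] * 10   # toy labels: 10 positive, 10 negative
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=1234)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```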


## 2) Studying the effect of different word embedding configurations on the task

Use different word embedding configurations and discuss the results obtained.



In [13]:
# We fix the embedding size to 50 dimensions (the size of the GloVe vectors used)

embedding_dim = 50

a) Without pre-trained embeddings

In [33]:
# MODEL 1. WITHOUT PRE-TRAINED EMBEDDINGS

from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model1 = Sequential()
# We specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs


model1.add(Embedding(max_words, embedding_dim, input_length=max_comment_length))
# After the Embedding layer, our activations have shape `(batch_size, max_comment_length, embedding_dim)`.

# We flatten the 3D tensor of embeddings into a 2D tensor of shape `(batch_size, max_comment_length * embedding_dim)`

model1.add(Flatten())

# We add the classifier on top
model1.add(Dense(1, activation='sigmoid'))

model1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model1.summary()

print("Fitting...")
history = model1.fit(x_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(x_test, y_test))

print("Evaluating...")
score1 = model1.evaluate(x_test, y_test)

print("Accuracy: %.2f%%" % (score1[1]*100))

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 20, 50)            8107400   
                                                                 
 flatten_3 (Flatten)         (None, 1000)              0         
                                                                 
 dense_3 (Dense)             (None, 1)                 1001      
                                                                 
Total params: 8,108,401
Trainable params: 8,108,401
Non-trainable params: 0
_________________________________________________________________
Fitting...
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Evaluating...
Accuracy: 72.51%


Using the embedding layer directly, we obtain an accuracy of 72.51%.
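
The parameter counts in the model summary can be checked by hand from the vocabulary size, the embedding dimension and the sequence length:

```python
vocab_size = 162148        # unique tokens reported by the tokenizer
embedding_dim = 50
max_comment_length = 20

embedding_params = vocab_size * embedding_dim           # one vector per token id
flatten_units = max_comment_length * embedding_dim      # 20 * 50 inputs to the classifier
dense_params = flatten_units * 1 + 1                    # weights plus one bias
total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, even though only the ids below 1500 ever appear in the input sequences.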

---------

b) Pre-trained, frozen word embeddings.

In [14]:
import numpy as np

embeddings_index = {}
with open('./glove.6B.50d.txt', encoding="UTF-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [29]:
# Build the embedding matrix

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word) # GloVe vector for this word
    if i < max_words:   # one word fewer is actually used, since index zero is reserved
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
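
It can be useful to know how many vocabulary words actually receive a pre-trained vector. The sketch below shows the idea with made-up toy dictionaries standing in for `word_index` and `embeddings_index`. As an assumption of this sketch, the matrix has `len(word_index) + 1` rows so that no word is dropped, which differs from the `max_words` rows used in the notebook.

```python
# Toy stand-ins (made up): the notebook builds these from the tokenizer and GloVe
toy_word_index = {"film": 1, "good": 2, "snorefest": 3}
toy_embeddings = {"film": [0.1, 0.2], "good": [0.3, 0.4]}
dim = 2

# Row 0 is reserved for padding, so the matrix gets len(word_index) + 1 rows
matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
missing = []
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is None:
        missing.append(word)       # its row stays all-zeros
    else:
        matrix[i] = list(vector)

coverage = 1 - len(missing) / len(toy_word_index)
print(missing, round(coverage, 2))
```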

In [41]:
# Define the model and load the GloVe word embeddings
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model2 = Sequential()
model2.add(Embedding(max_words, embedding_dim, input_length=max_comment_length))
model2.add(Flatten())
model2.add(Dense(1, activation='sigmoid'))
model2.summary()

model2.layers[0].set_weights([embedding_matrix])
model2.layers[0].trainable = False  # The pre-trained weights will not be updated during training

model2.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

print("Fitting...")
history = model2.fit(x_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(x_test, y_test))

print("Evaluating...")
score2 = model2.evaluate(x_test, y_test)

print("Accuracy: %.2f%%" % (score2[1]*100))

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 20, 50)            8107400   
                                                                 
 flatten_5 (Flatten)         (None, 1000)              0         
                                                                 
 dense_5 (Dense)             (None, 1)                 1001      
                                                                 
Total params: 8,108,401
Trainable params: 8,108,401
Non-trainable params: 0
_________________________________________________________________
Fitting...
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Evaluating...
Accuracy: 69.85%


In this case the accuracy is 69.85%, worse than the previous model, although training takes much less time thanks to the pre-trained embeddings.
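
A note on the summary printed for this model: `summary()` is called before the embedding layer is frozen, which is presumably why every parameter is still listed as trainable. Assuming the layer is frozen when the model is compiled, the split would work out as follows, which also suggests why training is faster:

```python
embedding_params = 162148 * 50   # GloVe matrix, frozen in this configuration
dense_params = 20 * 50 + 1       # classifier weights and bias, still trained

trainable = dense_params
non_trainable = embedding_params
print(trainable, non_trainable)
```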

c) Pre-trained word embeddings, not frozen.

In [42]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model3 = Sequential()
model3.add(Embedding(max_words, embedding_dim, input_length=max_comment_length))
model3.add(Flatten())
model3.add(Dense(1, activation='sigmoid'))
model3.summary()

model3.layers[0].set_weights([embedding_matrix])
model3.layers[0].trainable = True

model3.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model3.fit(x_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(x_test, y_test))

score3 = model3.evaluate(x_test, y_test)

print("Accuracy: %.2f%%" % (score3[1]*100))

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 20, 50)            8107400   
                                                                 
 flatten_6 (Flatten)         (None, 1000)              0         
                                                                 
 dense_6 (Dense)             (None, 1)                 1001      
                                                                 
Total params: 8,108,401
Trainable params: 8,108,401
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Accuracy: 73.21%


In this last case the accuracy is 73.21%, the best result.

## 3) Final analysis

Analyze the best classifier in detail. Find one misclassified example of each class and justify the error. Can you think of a way to fix it?

Compare the results obtained with and without word embeddings.


In [43]:
print("Without pre-trained word embeddings")
print("Accuracy: %.2f%%" % (score1[1]*100))
print("With frozen pre-trained word embeddings")
print("Accuracy: %.2f%%" % (score2[1]*100))
print("With unfrozen pre-trained word embeddings")
print("Accuracy: %.2f%%" % (score3[1]*100))

Without pre-trained word embeddings
Accuracy: 72.51%
With frozen pre-trained word embeddings
Accuracy: 69.85%
With unfrozen pre-trained word embeddings
Accuracy: 73.21%


TODO: explain that the results are worse in the second model because it uses general-purpose word embeddings that cannot be readjusted to such a large text corpus, which is fixed in the third model.

Let's look for misclassified texts.

In [None]:
new_weights = model3.layers[0].get_weights()[0]
np.save("layers.npy", new_weights)

In [72]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

new_weights = np.load("layers.npy")

model3 = Sequential()
model3.add(Embedding(max_words, embedding_dim, input_length=max_comment_length))
model3.add(Flatten())
model3.add(Dense(1, activation='sigmoid'))
model3.summary()

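# Only the embedding weights are restored here; the Dense layer keeps its
# random initialization, so this rebuilt model is not the trained classifier.
model3.layers[0].set_weights([new_weights])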

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 20, 50)            8107400   
                                                                 
 flatten_3 (Flatten)         (None, 1000)              0         
                                                                 
 dense_3 (Dense)             (None, 1)                 1001      
                                                                 
Total params: 8,108,401
Trainable params: 8,108,401
Non-trainable params: 0
_________________________________________________________________


In [68]:
# The returned predictions lie in the range [0, 1].
# The closer to 0, the more confident the model is that the review is negative;
# the closer to 1, the more confident it is that the review is positive.
predictions = model3.predict(x_test)
predict_rounded = predictions.round().astype("int")

df_out = pd.DataFrame(y_test).reset_index() # we want to keep the index for lookups
df_out["predicted"] = predict_rounded
df_out



Unnamed: 0,index,sentiment,predicted
0,38449,0,0
1,12019,0,0
2,40165,0,0
3,29860,0,0
4,11645,0,1
...,...,...,...
9995,9407,0,0
9996,23528,0,0
9997,23810,0,0
9998,48985,1,0
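
Once true and predicted labels sit side by side, the misclassified rows of each kind can be picked out directly. The sketch below uses made-up toy values, not the notebook's predictions, and thresholds the sigmoid output at 0.5.

```python
# Toy values (made up): true labels, row ids, and sigmoid outputs
y_true = [0, 1, 1, 0]
row_ids = [10, 11, 12, 13]
probs = [0.10, 0.30, 0.80, 0.99]

predicted = [1 if p > 0.5 else 0 for p in probs]   # threshold at 0.5

# positive reviews predicted as negative, and the reverse
false_negatives = [r for r, t, p in zip(row_ids, y_true, predicted) if t == 1 and p == 0]
false_positives = [r for r, t, p in zip(row_ids, y_true, predicted) if t == 0 and p == 1]
accuracy = sum(t == p for t, p in zip(y_true, predicted)) / len(y_true)
print(false_negatives, false_positives, accuracy)
```

Sorting the errors by how far the probability is from 0.5 would surface the most confident mistakes first.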


In [92]:
print(words_posit.keys())
print(words_neg.keys())



In [69]:
# Actually positive, classified as negative
print(df.review[48985])
print(df_normalized.review[48985])
print("Predicted value: ", str(predictions[9998]))

When you start watching this animation-masterpiece, you quickly notice, that it's a European production. Although the Europeans have (sadly) integrated some of the clichés you would normally find in an American production of this kind, most are missing. One of these is that there is an overwhelming evil that only our (very few and very unlikely) heroes can vanquish. Another is that one of the group is only in the business for the money, is greedy, runs away when the heat is up but somehow gives in to his better nature. This movie would have been better off without both.<br /><br />The movie is based upon a TV-Series that was out four years before the movie. Unlike the movie, the TV-Series is a cartoon and not a computer animation. At first I thought the computer would kill the charm and character of the plot but I was quickliy convinced: Whoever did the animation knew his or her stuff! Although the characters are obviously fictional (in stills they don't even look real), they seem as a

In [96]:
for word in df_normalized.review[48985].split():
    idx_all = str(word_index[word]) if word in word_index else "None"
    idx_pos = str(words_posit[word]) if word in words_posit else "None"
    idx_neg = str(words_neg[word]) if word in words_neg else "None"
    print(word+" -> ", "all:", idx_all + ";", "posit:", idx_pos + ";", "neg:", idx_neg)

start ->  all: 257; posit: 308; neg: 224
watching ->  all: 55; posit: 78; neg: 50
animationmasterpiece ->  all: 160328; posit: 108264; neg: None
quickly ->  all: 795; posit: 857; neg: 707
notice ->  all: 1256; posit: 1272; neg: 1207
european ->  all: 1801; posit: 1695; neg: 1953
production ->  all: 250; posit: 280; neg: 231
although ->  all: 153; posit: 129; neg: 209
europeans ->  all: 7751; posit: 8302; neg: 7334
sadly ->  all: 927; posit: 1264; neg: 725
integrated ->  all: 9919; posit: 8325; neg: 12807
clichs ->  all: 1508; posit: 2670; neg: 1048
would ->  all: 7; posit: 13; neg: 8
normally ->  all: 1660; posit: 1935; neg: 1452
find ->  all: 69; posit: 59; neg: 89
american ->  all: 197; posit: 174; neg: 238
production ->  all: 250; posit: 280; neg: 231
kind ->  all: 138; posit: 159; neg: 119
missing ->  all: 870; posit: 989; neg: 764
one ->  all: 3; posit: 3; neg: 3
overwhelming ->  all: 4175; posit: 3569; neg: 5016
evil ->  all: 356; posit

In [98]:
"film" in df_normalized.review[48985].split()

False

In [70]:
# Actually negative, classified as positive
print(df.review[11645])
print(df_normalized.review[11645])
print("Predicted value: ", str(predictions[4]))

No offense to anyone who saw this and liked it, but I hated it! It dragged on and on and there was not a very good plot, also, too simple and the acting was so so...<br /><br />I would give this snorefest a 2 at the most
offense anyone saw liked hated dragged good plot also simple acting would give snorefest
Predicted value:  [0.9951201]


In [97]:
for word in df_normalized.review[11645].split():
    idx_all = str(word_index[word]) if word in word_index else "None"
    idx_pos = str(words_posit[word]) if word in words_posit else "None"
    idx_neg = str(words_neg[word]) if word in words_neg else "None"
    print(word+" -> ", "all:", idx_all + ";", "posit:", idx_pos + ";", "neg:", idx_neg)

offense ->  all: 7502; posit: 11303; neg: 5642
anyone ->  all: 146; posit: 161; neg: 141
saw ->  all: 106; posit: 87; neg: 135
liked ->  all: 306; posit: 238; neg: 419
hated ->  all: 1595; posit: 2503; neg: 1162
dragged ->  all: 3177; posit: 4732; neg: 2402
good ->  all: 5; posit: 5; neg: 7
plot ->  all: 37; posit: 55; neg: 22
also ->  all: 17; posit: 11; neg: 28
simple ->  all: 470; posit: 334; neg: 729
acting ->  all: 36; posit: 51; neg: 23
would ->  all: 7; posit: 13; neg: 8
give ->  all: 99; posit: 120; neg: 93
snorefest ->  all: 24100; posit: 95177; neg: 16091
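
One possible reading of these ranks, assuming the sequences were built with `num_words=1500` as in the tokenizer cell: words ranked 1500 or worse in the whole corpus never reach the model. The sketch below copies the combined-corpus ranks from the output above and splits the words accordingly.

```python
# Combined-corpus ranks copied from the output above
ranks = {
    "offense": 7502, "anyone": 146, "saw": 106, "liked": 306, "hated": 1595,
    "dragged": 3177, "good": 5, "plot": 37, "also": 17, "simple": 470,
    "acting": 36, "would": 7, "give": 99, "snorefest": 24100,
}
num_words = 1500   # assumed: the vocabulary size used when building the sequences

kept = [w for w, r in ranks.items() if r < num_words]
dropped = [w for w, r in ranks.items() if r >= num_words]
print("kept:", kept)
print("dropped:", dropped)
```

Under that assumption, the clearly negative words (`hated`, `dragged`, `snorefest`) are dropped while `liked` and `good` remain, which may explain the confident positive prediction. A larger vocabulary or longer sequences would be one thing to try.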


TODO: look at where the high values are, and so on (although we are asked to analyze the best model specifically, all of them give very similar results, so these observations apply to all three models).

TODO (Alex): if there is time, look at the relationships between words; the vectorizations were lost when the document was closed by accident. If not, never mind.

Let's see what the relationships between words look like.

In [48]:
from gensim.models import KeyedVectors

words = list(word_index.keys())[:-1] # one word fewer was stored in the matrix
gensim_vector = KeyedVectors(embedding_dim)
gensim_vector.add_vectors(words, new_weights[1:]) # row 0 is a placeholder

In [50]:
word_index.keys()



In [52]:
gensim_vector.most_similar(positive=['bad'], topn=10)

  dists = dot(self.vectors[clip_start:clip_end], mean) / self.norms[clip_start:clip_end]


[('worse', 0.8878378868103027),
 ('unfortunately', 0.8650501370429993),
 ('really', 0.8486315011978149),
 ('little', 0.8427671194076538),
 ('bit', 0.8359227776527405),
 ('things', 0.8306117653846741),
 ('nothing', 0.8246172666549683),
 ('gone', 0.8238307237625122),
 ('trouble', 0.8222425580024719),
 ('thing', 0.821161150932312)]
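
`most_similar` ranks words by the cosine similarity between their vectors. A minimal pure-Python version of that measure is sketched below. The zero-vector guard is an addition of this sketch: all-zero rows (for example, words without a GloVe vector that were never updated) would otherwise cause a division by zero, which may be what the warning fragment in the output above refers to.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # guard against division by zero
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 2.0], [0.0, 0.0]))   # zero vector
```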

TODO: compare with the results from the other notebook