# Lab 4: Natural Language Processing

__Due date: May 14, 2024__

The goal of this lab is to apply the theoretical concepts covered in class in the NLP module.

The most important part of this lab is not the Python code but the analysis of the data and of the models you build, together with a reasoned explanation of every decision you make. __Code snippets or plots without any context or explanation will not be graded__.

Finally, remember to set the `random_state` parameter in every function that makes random decisions so that the results are reproducible (that is, they do not vary between runs).

# 1) Loading the dataset

The file `spam.csv` contains SMS messages labelled as spam or ham (legitimate).

Show one example of each class.

Study the dataset. Which words appear most often? Would it make sense to normalize the corpus in some way?

Split the data into 60% for training, 20% for validation, and the remaining 20% for test. Check that the class distribution is similar across the partitions.
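One way to obtain a 60/20/20 partition is to call `train_test_split` twice with `stratify`. The sketch below uses a synthetic label vector rather than the real corpus; the names `X_demo`, `y_demo`, `X_rest`, and `X_val` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_STATE = 1234

# Synthetic stand-in for the corpus: 100 "messages", 80% ham (1) and 20% spam (0).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 80 + [0] * 20)

# Step 1: hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=RANDOM_STATE, stratify=y_demo)

# Step 2: split the remaining 80% into 75% train / 25% validation,
# which corresponds to 60% / 20% of the full data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=RANDOM_STATE, stratify=y_rest)

# Compare the class distribution across the three partitions.
for name, y_part in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(f"{name}: {len(y_part)} examples, ham fraction = {y_part.mean():.2f}")
```

Stratifying both calls keeps the ham/spam ratio roughly constant in all three partitions, which matters for a dataset as imbalanced as this one.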

In [32]:
RANDOM_STATE = 1234

In [4]:
# Mount Google Drive to access the data
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [81]:
import pandas as pd
import numpy as np
import re
import nltk
import matplotlib.pyplot as plt
df1 = pd.read_csv('/content/drive/MyDrive/spam.csv')
df1

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5568,ham,Will �_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,
5571,ham,Rofl. Its true to its name,,,


In [82]:
# Keep only the label (v1) and message (v2) columns by dropping
# the three trailing unnamed columns, then discard rows with missing values.
df1 = df1.iloc[:, :-3]
df1 = df1.dropna()

df1

Unnamed: 0,v1,v2
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,0,This is the 2nd time we have tried 2 contact u...
5568,1,Will �_ b going to esplanade fr home?
5569,1,"Pity, * was in mood for that. So...any other s..."
5570,1,The guy did some bitching but I acted like i'd...


In [83]:
for clase in df1['v1'].unique():
    if not pd.isna(clase):
        # Take the first message of each class and print it
        primer_elemento = df1[df1['v1'] == clase].iloc[0]
        print(primer_elemento)

v1                                                    1
v2    Go until jurong point, crazy.. Available only ...
Name: 0, dtype: object
v1                                                    0
v2    Free entry in 2 a wkly comp to win FA Cup fina...
Name: 2, dtype: object


In [84]:
print(df1['v1'].value_counts())

v1
1    4825
0     747
Name: count, dtype: int64


Messages labelled 'ham' (coded as 1, 4825 messages) are far more frequent than 'spam' messages (coded as 0, 747 messages), so the dataset is imbalanced. We need to keep this in mind when drawing conclusions, since they may not be as representative as we would like.
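A quick way to see what the imbalance means in practice is to compute the accuracy of a trivial classifier that always predicts the majority class. The snippet below is a small illustrative calculation using the counts printed above; the variable names are not part of the notebook.

```python
# Class counts taken from the value_counts() output above.
n_ham, n_spam = 4825, 747

# Accuracy of a trivial classifier that always predicts the majority class (ham).
baseline_accuracy = n_ham / (n_ham + n_spam)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Any model should be judged against this baseline of roughly 87%, and metrics computed on the spam class (precision, recall, F1) are more informative than accuracy alone.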

Next, we normalize the corpus by removing special characters, converting to lowercase, and removing stop words.

In [85]:
wpt = nltk.WordPunctTokenizer()
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # Remove everything except ASCII letters and whitespace.
    # Flags must be passed by keyword: the fourth positional argument of re.sub is `count`.
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # Tokenize the document
    tokens = wpt.tokenize(doc)
    # Filter stop words out of the document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Re-create the document from the filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [86]:
df1_normalizado = normalize_corpus(df1['v2'].values)


In [87]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv_matrix = cv.fit_transform(df1_normalizado)

# Total occurrences of each vocabulary term across the corpus
# (summing the sparse matrix directly avoids building a dense array)
suma_columnas = np.asarray(cv_matrix.sum(axis=0)).ravel()
indice_columna_maxima = np.argmax(suma_columnas)

nombres_columnas = cv.get_feature_names_out()

print(nombres_columnas[indice_columna_maxima])
print(np.max(suma_columnas))

call
579


The most frequent word is `call`, which appears 579 times.
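The task asks which words appear most often, so a ranked list is more informative than the single top word. The sketch below counts whitespace-separated tokens with `collections.Counter` on a toy corpus; `most_common_words` and `toy_corpus` are hypothetical names, and in the notebook the function could be called on `df1_normalizado` instead. Counts may differ slightly from `CountVectorizer`, whose default token pattern ignores single-character tokens.

```python
from collections import Counter

def most_common_words(documents, n=10):
    """Return the n most frequent whitespace-separated tokens in the corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts.most_common(n)

# Toy normalized corpus (hypothetical); in the notebook this would be df1_normalizado.
toy_corpus = [
    "free entry win cup call now",
    "call later ok",
    "please call free prize",
]
top_words = most_common_words(toy_corpus, n=3)
print(top_words)
```

Running the same count separately on the ham and spam subsets would show which words are characteristic of each class.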

# 2) Bag-of-words representation

Choose a bag-of-words representation, justify the choice, and apply it.
Show an example before and after applying the representation. Explain the changes.

# 3) Apply 3 machine learning algorithms to solve the task

Justify why you chose them.
Tune each model with respect to one hyperparameter you consider appropriate. Justify your choice.
Explain the results obtained.
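The sketch below shows one way to tune a single hyperparameter per model against a validation split. The three model families, the candidate values, and the toy data are assumptions chosen for illustration, not a prescribed solution. F1 is computed on the spam class, which is coded as 0 in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy data (hypothetical); label 0 = spam, 1 = ham, as in this notebook.
train_docs = ["free prize call now", "win cash free entry", "see you later",
              "ok call me home", "free cash win now", "lunch later ok"]
y_train = [0, 0, 1, 1, 0, 1]
val_docs = ["free entry win", "see you home"]
y_val = [0, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # fit on training data only
X_val = vectorizer.transform(val_docs)

# One hyperparameter per model family: smoothing (alpha) or regularization strength (C).
candidates = {
    "MultinomialNB": [(a, MultinomialNB(alpha=a)) for a in (0.1, 0.5, 1.0)],
    "LogisticRegression": [(c, LogisticRegression(C=c, random_state=1234))
                           for c in (0.1, 1.0, 10.0)],
    "LinearSVC": [(c, LinearSVC(C=c, random_state=1234)) for c in (0.1, 1.0, 10.0)],
}

best = {}
for name, grid in candidates.items():
    for value, model in grid:
        model.fit(X_train, y_train)
        # F1 on the spam class (label 0), measured on the validation split.
        score = f1_score(y_val, model.predict(X_val), pos_label=0, zero_division=0)
        if name not in best or score > best[name][1]:
            best[name] = (value, score)

for name, (value, score) in best.items():
    print(f"{name}: best value = {value}, validation F1 (spam) = {score:.2f}")
```

Selecting hyperparameters on the validation split keeps the test split untouched until the final comparison.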

# 4) Build neural networks with Keras using word embeddings in different ways

Justify your decisions and explain the results obtained.

In [88]:
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

max_words = 1500
max_comment_length = 20



df1['v1'] = df1['v1'].replace({'ham': 1, 'spam': 0})


tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df1.v2)

sequences = tokenizer.texts_to_sequences(df1.v2)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
max_words = len(word_index)

data = pad_sequences(sequences, maxlen=max_comment_length)

Found 8916 unique tokens.


In [89]:
print(df1.v2[4])
print(data[4])
print(df1.v1[4])

Nah I don't think he goes to usf, he lives around here though
[   0    0    0    0    0    0    0    0 1017    1   98  107   69  487
    2  955   69  219  111  471]
1
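In the output above, the 12 token indices of message 4 are preceded by eight zeros, because `pad_sequences` pads (and truncates) at the front by default. The pure-Python function below is a hypothetical re-implementation of that default behaviour, written only to make the rule explicit; it is not the Keras code.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the default behaviour of pad_sequences: pad and truncate at the front."""
    truncated = list(sequence)[-maxlen:]          # keep the last maxlen tokens
    return [value] * (maxlen - len(truncated)) + truncated

print(pad_pre([5, 8, 2], maxlen=5))               # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # [3, 4, 5, 6, 7]
```

With `max_comment_length = 20`, any message longer than 20 tokens loses its first tokens, so it is worth checking the distribution of sequence lengths before fixing this value.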


In [90]:
from sklearn.model_selection import train_test_split

# Random seed for the split
rs = 333

x_train, x_test, y_train, y_test = train_test_split(data, df1.v1, test_size=0.20, random_state=rs, stratify = df1.v1)

print("Training texts:", len(y_train))
print("Test texts:", len(y_test))

Training texts: 4457
Test texts: 1115


In [91]:
# Fix the embedding size at 50 dimensions

embedding_dim = 50

### MODEL 1

In [92]:
# MODEL 1. NO PRE-TRAINED EMBEDDINGS

from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model1 = Sequential()
# We specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs
model1.add(Embedding(max_words, embedding_dim, input_length=max_comment_length))
# After the Embedding layer, our activations have shape `(samples, max_comment_length, embedding_dim)`.

# We flatten the 3D tensor of embeddings into a 2D tensor of shape `(samples, max_comment_length * embedding_dim)`
model1.add(Flatten())

# We add the classifier on top
model1.add(Dense(1, activation='sigmoid'))

model1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model1.summary()

history = model1.fit(x_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(x_test, y_test))

score1 = model1.evaluate(x_test, y_test)

print("Accuracy: %.2f%%" % (score1[1]*100))

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_9 (Embedding)     (None, 20, 50)            445800    
                                                                 
 flatten_9 (Flatten)         (None, 1000)              0         
                                                                 
 dense_9 (Dense)             (None, 1)                 1001      
                                                                 
Total params: 446801 (1.70 MB)
Trainable params: 446801 (1.70 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Accuracy: 98.83%


### PRE-TRAINED EMBEDDINGS

In [93]:
import os
import numpy as np

glove_dir = '/content/drive/MyDrive/'

embeddings_index = {}
# Read the GloVe file line by line; each line is a word followed by its vector.
with open(os.path.join(glove_dir, 'glove.6B.50d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 0 word vectors.
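The output above reports 0 word vectors, which means `embeddings_index` is empty and the `embedding_matrix` built in the next cell will contain only zeros. A likely cause is an empty or incomplete `glove.6B.50d.txt` at that path, although the notebook does not show enough to confirm it. The sketch below illustrates a loading guard and a vocabulary coverage check; `parse_glove_lines`, `vocabulary_coverage`, and the in-memory demo data are hypothetical names introduced for illustration.

```python
import numpy as np

def parse_glove_lines(lines, expected_dim=50):
    """Parse GloVe-format text lines ("word v1 v2 ...") into a dict of vectors."""
    index = {}
    for line in lines:
        values = line.split()
        if len(values) != expected_dim + 1:
            continue  # skip blank or malformed lines
        index[values[0]] = np.asarray(values[1:], dtype="float32")
    return index

def vocabulary_coverage(word_index, embeddings_index):
    """Fraction of tokenizer vocabulary words that have a pre-trained vector."""
    if not word_index:
        return 0.0
    found = sum(1 for word in word_index if word in embeddings_index)
    return found / len(word_index)

# Tiny in-memory stand-in for the GloVe file (3-dimensional vectors for brevity).
demo_lines = ["call 0.1 0.2 0.3", "free 0.4 0.5 0.6", ""]
demo_embeddings = parse_glove_lines(demo_lines, expected_dim=3)
demo_word_index = {"call": 1, "free": 2, "wkly": 3}

coverage = vocabulary_coverage(demo_word_index, demo_embeddings)
print(f"Loaded {len(demo_embeddings)} vectors, vocabulary coverage = {coverage:.0%}")

# Guard that could be placed right after loading the real file.
if len(demo_embeddings) == 0:
    raise ValueError("No word vectors were loaded; check the GloVe file path and contents.")
```

Failing early when nothing is loaded avoids training Models 2 and 3 on an all-zero embedding matrix without noticing.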


In [94]:
embedding_dim = 50

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

### MODEL 2

## Define the model

We will use the same model architecture as before:

In [95]:
# MODEL 2. FROZEN PRE-TRAINED EMBEDDINGS

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model2 = Sequential()
model2.add(Embedding(max_words, embedding_dim, input_length=max_comment_length))
model2.add(Flatten())
model2.add(Dense(1, activation='sigmoid'))
model2.summary()

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_10 (Embedding)    (None, 20, 50)            445800    
                                                                 
 flatten_10 (Flatten)        (None, 1000)              0         
                                                                 
 dense_10 (Dense)            (None, 1)                 1001      
                                                                 
Total params: 446801 (1.70 MB)
Trainable params: 446801 (1.70 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


## Load the GloVe embeddings into the model


The `Embedding` layer has a single weight matrix: a 2D float matrix where each entry `i` is the word vector meant to be associated with
index `i`. Simple enough. Let's just load the GloVe matrix we prepared into our `Embedding` layer, the first layer in our model:

In [96]:
model2.layers[0].set_weights([embedding_matrix])
model2.layers[0].trainable = False

### Train and evaluate

Let's compile our model and train it:

In [97]:
model2.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model2.fit(x_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(x_test, y_test))

score2 = model2.evaluate(x_test, y_test)
print("Accuracy: %.2f%%" % (score2[1]*100))

Accuracy: 86.64%


### MODEL 3

In [98]:
# MODEL 3. PRE-TRAINED EMBEDDINGS, NOT FROZEN

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model3 = Sequential()
model3.add(Embedding(max_words, embedding_dim, input_length=max_comment_length))
model3.add(Flatten())
model3.add(Dense(1, activation='sigmoid'))
model3.summary()

model3.layers[0].set_weights([embedding_matrix])
model3.layers[0].trainable = True

model3.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model3.fit(x_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(x_test, y_test))

score3 = model3.evaluate(x_test, y_test)

Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_11 (Embedding)    (None, 20, 50)            445800    
                                                                 
 flatten_11 (Flatten)        (None, 1000)              0         
                                                                 
 dense_11 (Dense)            (None, 1)                 1001      
                                                                 
Total params: 446801 (1.70 MB)
Trainable params: 446801 (1.70 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


### SUMMARY OF RESULTS

In [99]:
print("Without pre-trained word embeddings")
print("Accuracy: %.2f%%" % (score1[1]*100))
print("With frozen pre-trained word embeddings")
print("Accuracy: %.2f%%" % (score2[1]*100))
print("With pre-trained word embeddings, not frozen")
print("Accuracy: %.2f%%" % (score3[1]*100))

Without pre-trained word embeddings
Accuracy: 98.83%
With frozen pre-trained word embeddings
Accuracy: 86.64%
With pre-trained word embeddings, not frozen
Accuracy: 98.83%
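These figures are consistent with the "Found 0 word vectors" output of the GloVe loading cell. With an all-zero frozen embedding, every message maps to the same flattened vector, so the dense layer can only learn its bias and outputs one constant probability. An accuracy of 86.64% is what a constant 'ham' prediction would score if 966 of the 1115 test messages are ham, which is an inference from the class ratio rather than a number the notebook prints. Model 3 starts from the same zeros but is trainable, so it effectively learns its embeddings from scratch, as Model 1 does. The NumPy sketch below, with made-up dense weights, illustrates the constant-output effect.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim, seq_len = 100, 50, 20

# Frozen embedding matrix that is all zeros, as when no GloVe vectors are loaded.
embedding_matrix = np.zeros((vocab_size, embedding_dim))

# Arbitrary dense-layer parameters (hypothetical values, for illustration only).
dense_weights = rng.normal(size=(seq_len * embedding_dim,))
dense_bias = 1.5

def predict(sequences):
    """Embedding lookup -> Flatten -> Dense(1, sigmoid), written out in NumPy."""
    embedded = embedding_matrix[sequences]                 # (batch, seq_len, dim)
    flattened = embedded.reshape(len(sequences), -1)       # (batch, seq_len * dim)
    logits = flattened @ dense_weights + dense_bias
    return 1.0 / (1.0 + np.exp(-logits))

# Five random "messages": different token ids, identical predictions.
batch = rng.integers(0, vocab_size, size=(5, seq_len))
predictions = predict(batch)
print(predictions)
```

Whatever the dense weights are, the prediction depends only on the bias, so the frozen model cannot distinguish spam from ham until real vectors are loaded.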


# 5) Apply the models you built to the test data and compare them.

Compute recall, precision, and F1.
Discuss which model is the best, which is the worst, and why.
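A minimal sketch of the metric computation with scikit-learn follows, using toy labels rather than real model output. Because this notebook codes spam as 0 and ham as 1, `pos_label=0` is needed to get the metrics of the spam class; the default would report them for ham.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy labels (hypothetical); 0 = spam, 1 = ham, as in this notebook.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

# Metrics for the spam class: pos_label=0 because spam is coded as 0 here.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label=0, average="binary")
print(f"spam precision = {precision:.3f}, recall = {recall:.3f}, f1 = {f1:.3f}")

# Per-class breakdown for both labels.
print(classification_report(y_true, y_pred, target_names=["spam", "ham"]))
```

For the Keras models, the sigmoid outputs would first have to be converted to class labels, for example by thresholding the predicted probabilities at 0.5 (an assumed threshold that could itself be tuned on the validation split).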