# Data

We will use the IMDB dataset for movie review classification. The goal is to detect whether a review has **positive** or **negative** sentiment.

Download the dataset from this [link](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

Download the word2vec vectors from this [link](https://drive.google.com/file/d/1XusPRjsCVcIdCQ2hQDDWcH_wayfn4nWb/view?usp=sharing).

-> To run this notebook in Colab, upload the files to a **data** folder at the root of your personal Drive.


In [2]:
# from google.colab import drive
# drive.mount("/content/drive")

! cp "/content/drive/My Drive/data/IMDB_Dataset.zip" .
! unzip -q IMDB_Dataset.zip
! rm IMDB_Dataset.zip
! ls

'cp' is not recognized as an internal or external command,
operable program or batch file.
'unzip' is not recognized as an internal or external command,
operable program or batch file.
'rm' is not recognized as an internal or external command,
operable program or batch file.
'ls' is not recognized as an internal or external command,
operable program or batch file.
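
The errors above come from running Unix shell commands (`cp`, `unzip`, `rm`, `ls`) on Windows. A portable alternative is to do the extraction with Python's standard library. The following is a minimal sketch; the helper name `extract_dataset` and the example path are assumptions for illustration, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, dest_dir="."):
    """Extract a zip archive into dest_dir and return the extracted file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Hypothetical usage, assuming the archive sits next to the notebook:
# print(extract_dataset("IMDB_Dataset.zip"))
```

Because it only relies on `zipfile` and `pathlib`, the same cell should behave the same way on Colab, Linux, and Windows.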


In [3]:
import pandas as pd
imdb_data = pd.read_csv("IMDB Dataset.csv")

#sentiment count
print(imdb_data.columns)
imdb_data['sentiment'].value_counts()

# Convert positive and negative into binary classes (1-0)
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()

sentiment_data = lb.fit_transform(imdb_data["sentiment"])
imdb_data['sentiment'] = sentiment_data

Index(['review', 'sentiment'], dtype='object')


In [4]:
imdb_data.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
5,"Probably my all-time favorite movie, a story o...",1
6,I sure would like to see a resurrection of a u...,1
7,"This show was an amazing, fresh & innovative i...",0
8,Encouraged by the positive comments about this...,0
9,If you like original gut wrenching laughter yo...,1


# Imports


In [5]:
%pip install nltk
#,bs4

Note: you may need to restart the kernel to use updated packages.


In [6]:
import re
from bs4 import BeautifulSoup

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Initial Preprocessing

As with any NLP task, we start by preprocessing the data: removing words that are not useful, special characters, and so on.

In particular, an initial look at the dataset (inspecting random examples) suggests three tasks:



1.   Remove HTML tags (we will use BeautifulSoup for this)
2.   Remove text between square brackets (using the regular expression ```\[[^]]*\]```)
3.   Remove special characters, using a regex that drops every character that is neither a letter nor a digit (```[^a-zA-Z0-9\s]```)
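
To see what the regular expressions from steps 2 and 3 do, here is a small sketch on a made-up string (the sample text is an assumption for illustration). It also shows why the letter range should be written `A-Z`: the range `A-z` also spans the ASCII characters between `Z` and `a`, so brackets, backslash, caret, underscore, and backtick would slip through.

```python
import re

BRACKETS = re.compile(r"\[[^]]*\]")        # text between square brackets
NOT_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")  # anything but letters, digits, whitespace
LOOSE = re.compile(r"[^a-zA-z0-9\s]")      # same pattern with the A-z range

sample = "Great film [spoiler inside]! 10/10, would_watch again."

no_brackets = BRACKETS.sub(" ", sample)
clean = NOT_ALNUM.sub(" ", no_brackets)
loose = LOOSE.sub(" ", no_brackets)

print(no_brackets)
print(clean)
print(loose)  # the underscore in "would_watch" survives here
```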



![Alt text](image.png)

![Alt text](image-1.png)

In [7]:
def strip_html(text):
  soup = BeautifulSoup(text,"html.parser")
  return soup.get_text().strip()

def remove_between_square_brackets(text):
  p = re.compile(r'\[[^]]*\]')  # raw string so the backslashes reach the regex engine untouched
  return p.sub(' ', text)

def remove_special_characters(text):
  p = re.compile('[^a-zA-Z0-9 ]')
  return p.sub(' ', text)

def low_level_preproc(text):
  return remove_special_characters(remove_between_square_brackets(strip_html(text)))

#Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(low_level_preproc)



# High-Level Preprocessing

Once the text is clean and workable, we run a second, higher-level preprocessing pass. This time we want to:



1.   Convert all text to lowercase
2.   Remove stop words (using nltk)
3.   Lemmatize using nltk's WordNetLemmatizer

All of this requires working with **tokens**, that is, individual words. Here we split on **whitespace**, although better strategies could be used.
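
As a small illustration of why single-space splitting is only a baseline: after cleaning, removed characters leave runs of spaces, and `split(" ")` turns those into empty tokens. The sketch below compares it with `str.split()` and with a simple regex tokenizer; the sample string is made up.

```python
import re

text = "wonderful little production  filming technique"  # note the double space

naive = text.split(" ")                 # splits on every single space
compact = text.split()                  # splits on runs of whitespace
regex = re.findall(r"[a-z0-9]+", text)  # keeps only lowercase alphanumeric runs

print(naive)
print(compact)
print(regex)
```

The empty strings produced by the first variant are harmless for TF-IDF, but they count as out-of-vocabulary tokens when looking words up in an embedding table.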



In [8]:
all_stopwords = set(stopwords.words("english"))

def remove_stop_words(full_text_line):
  tokens = full_text_line.split(" ")
  valid_tokens = [word for word in tokens if word not in all_stopwords]
  return valid_tokens

def lemmatize(tokens):
  lemmatizer = WordNetLemmatizer()
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
  return lemmatized_tokens

def high_level_preproc(text):
  return " ".join(lemmatize(remove_stop_words(text)))

#Apply function on review column
imdb_data['review'] = imdb_data['review'].str.lower()
imdb_data['review'] = imdb_data['review'].apply(high_level_preproc)

In [10]:
imdb_data['review'].head(10)

0    one reviewer mentioned watching 1 oz episode h...
1    wonderful little production  filming technique...
2    thought wonderful way spend time hot summer we...
3    basically family little boy  jake  think zombi...
4    petter mattei  love time money  visually stunn...
5    probably time favorite movie  story selflessne...
6    sure would like see resurrection dated seahunt...
7    show amazing  fresh   innovative idea 70 first...
8    encouraged positive comment film looking forwa...
9    like original gut wrenching laughter like movi...
Name: review, dtype: object

# Modeling

To start modeling, we first split the dataset.

In [11]:
#split the dataset  
#train dataset
train_reviews = imdb_data.review[:40000]
train_sentiments = imdb_data.sentiment[:40000]

#test dataset
test_reviews = imdb_data.review[40000:]
test_sentiments = imdb_data.sentiment[40000:]


print("Train set:", train_reviews.shape, train_sentiments.shape)
print("Test set:", test_reviews.shape, test_sentiments.shape)

Train set: (40000,) (40000,)
Test set: (10000,) (10000,)


We will generate vectors for the reviews using TF-IDF (sklearn). We will use the ```max_features``` parameter, which controls how many words are considered when building the vectors (ordered by frequency). We then use that vector representation to train and test a logistic regression model (LogisticRegression). In particular, we will start with 300 features; we will see why later.


We will train the model for at most 500 iterations and use l2 as the regularizer.
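
To make the TF-IDF step concrete, here is a from-scratch sketch on a toy corpus. It follows the smoothed formulation that scikit-learn documents as the default for `TfidfVectorizer`, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, and keeps only the `max_features` most frequent terms. The function `tfidf`, its tie-breaking rule, and the toy documents are assumptions for illustration, not the library's implementation.

```python
import math
from collections import Counter


def tfidf(docs, max_features=None):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Corpus-wide term counts decide which terms survive max_features.
    totals = Counter(tok for toks in tokenized for tok in toks)
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    vocab = sorted(ranked[:max_features])

    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


docs = ["good movie good plot", "bad movie", "good acting"]
vocab, rows = tfidf(docs, max_features=3)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Terms that appear in fewer documents get a larger idf, so a rare word weighs more than a word that shows up in almost every review.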

![Alt text](image-3.png)

In [33]:
# Your vectorization code
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=6000,max_df=0.9)
vectorizer.fit(train_reviews)
X_train = vectorizer.transform(train_reviews)
vectorizer.get_feature_names_out()

array(['00', '000', '10', ..., 'zero', 'zombie', 'zone'], dtype=object)

In [34]:
X_train = vectorizer.transform(train_reviews)
X_test = vectorizer.transform(test_reviews)

![Alt text](image-4.png)

![Alt text](image-5.png)

### Your model code


In [35]:
clf = LogisticRegression(random_state=0).fit(X_train, train_sentiments)

In [36]:
y_hat = clf.predict(X_test)
accuracy_score(test_sentiments, y_hat)

0.8883

In [37]:
vectorizer.get_feature_names_out()

array(['00', '000', '10', ..., 'zero', 'zombie', 'zone'], dtype=object)

# Pre-trained Vectors
Now let's see whether we can beat the model's performance using deep learning. First we will train the same model using pre-trained Word2Vec embeddings (loaded with gensim).

Then we will feed those embeddings to an MLP and see whether we can beat the previous performance.

*An embedding is a vector representation of our words: each word has an associated vector.*
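
A minimal sketch of the idea, with toy 3-dimensional vectors standing in for Word2Vec (the words, values, and the helper `sentence_embedding` are made up for illustration): a sentence vector is the average of its word vectors, and unknown words fall back to the mean of all known vectors.

```python
toy_vectors = {
    "good": [1.0, 0.0, 0.0],
    "bad": [-1.0, 0.0, 0.0],
    "movie": [0.0, 1.0, 0.0],
}


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Fallback for out-of-vocabulary words: the mean of every known vector.
fallback = mean(list(toy_vectors.values()))


def sentence_embedding(text):
    """Average the vectors of the whitespace-separated tokens in text."""
    tokens = text.split()
    return mean([toy_vectors.get(token, fallback) for token in tokens])


print(sentence_embedding("good movie"))
print(sentence_embedding("good unknownword"))
```

Averaging discards word order, so "not good" and "good not" map to the same vector; that is one reason this representation can trail a bag-of-words model on sentiment tasks.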

In [None]:
%pip install gensim

In [18]:
import gensim

In [22]:
w2v = gensim.models.KeyedVectors.load_word2vec_format("word2vec.txt", binary=False)

In [23]:
mean_vector = np.mean(w2v.vectors, axis=0)

In [38]:
def get_sentence_embedding(text):
  tokens = text.split(" ")
  embeddings = [w2v[token] if token in w2v else mean_vector for token in tokens]
  return np.mean(np.array(embeddings), axis=0)


train_vectors = [get_sentence_embedding(sent) for sent in train_reviews]
test_vectors = [get_sentence_embedding(sent) for sent in test_reviews]

In [43]:
w2v['Mariano']

array([ 1.99898e-02,  1.32136e-02,  7.41993e-02, -3.01541e-02,
       -5.62424e-02, -1.00288e-01, -8.43636e-02,  5.88682e-03,
        5.96305e-02, -2.96458e-02,  5.86141e-02, -5.35319e-02,
       -9.75772e-02,  5.14991e-02,  2.89682e-02,  8.67353e-02,
        1.43655e-01,  1.33830e-02,  3.50668e-02,  1.08419e-01,
       -1.55508e-04, -1.15873e-01,  8.47024e-02, -3.81161e-02,
        7.72486e-02, -1.87192e-02,  1.99898e-02, -2.65966e-02,
       -9.55443e-02,  1.69405e-02,  9.68996e-02,  1.90580e-02,
        7.86038e-02, -3.84549e-02,  3.42198e-02, -2.50719e-02,
        7.92815e-02,  2.02439e-02,  1.10960e-02, -1.73471e-01,
        2.40555e-02,  2.81212e-02,  5.96305e-02,  2.49025e-02,
       -9.21562e-02, -6.56444e-03,  2.45637e-02,  5.01438e-02,
        4.91274e-02,  6.50515e-02, -6.13246e-02,  6.64067e-02,
       -2.25308e-02, -1.05709e-01,  2.64272e-02,  1.19939e-01,
       -6.94560e-03, -2.52413e-02, -2.47331e-02,  1.62629e-02,
       -4.87886e-02,  2.50719e-02, -8.50412e-02, -6.132

In [48]:
from numpy import dot
from numpy.linalg import norm

a = w2v['king'] - w2v['man'] + w2v['woman']
b = w2v['queen'] 

# Cosine similarity: dot product divided by the product of both norms
distancia = dot(a, b) / (norm(a) * norm(b))

distancia

0.71181965
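
For reference, the cosine similarity between two vectors is their dot product divided by the product of their norms, cos(a, b) = a·b / (‖a‖ ‖b‖). A dependency-free sketch with made-up vectors (the helper name `cosine_similarity` is an assumption):

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

The value depends only on direction, not on vector length, which makes it a similarity (higher means closer) rather than a distance.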

In [47]:
result1 = w2v.most_similar(positive=['woman',"king"],negative='men',topn=1)
result2 = w2v.most_similar(positive=['woman',"king"],negative='man',topn=1)

print(result1)
print(result2)

[('queen', 0.5957394242286682)]
[('queen', 0.7118193507194519)]


In [49]:
clf = LogisticRegression(random_state=0).fit(train_vectors, train_sentiments)
y_hat = clf.predict(test_vectors)
accuracy_score(test_sentiments, y_hat)

0.8326

# Deep Learning


MLP: we will build an MLP to tackle the same problem. The design is up to you, but you should be able to obtain better test performance than the previous models.


In [50]:
# Imports
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(DEVICE)

torch.manual_seed(42)
torch.backends.cudnn.deterministic = True

cuda:0


In [51]:
def train_epoch(training_model, loader, criterion, optim):
    training_model.train()
    epoch_loss = 0.0
    all_labels = []
    all_predictions = []
    
    for data, labels in loader:
      all_labels.extend(labels.numpy())  

      optim.zero_grad()

      predictions = training_model(data.to(DEVICE))
      all_predictions.extend(torch.argmax(predictions, dim=1).cpu().numpy())

      loss = criterion(predictions, labels.to(DEVICE))
      
      loss.backward()
      optim.step()

      epoch_loss += loss.item()

    return epoch_loss / len(loader), accuracy_score(all_labels, all_predictions) * 100


def validation_epoch(val_model, loader, criterion):
    val_model.eval()
    epoch_loss = 0.0
    all_labels = []
    all_predictions = []
    
    with torch.no_grad():
      for data, labels in loader:
        all_labels.extend(labels.numpy())  

        predictions = val_model(data.to(DEVICE))
        all_predictions.extend(torch.argmax(predictions, dim=1).cpu().numpy())

        loss = criterion(predictions, labels.to(DEVICE))

        epoch_loss += loss.item()

    return epoch_loss / len(loader), accuracy_score(all_labels, all_predictions) * 100
  

def train_model(model, train_loader, test_loader, criterion, optim, number_epochs):
  train_history = []
  test_history = []
  accuracy_history = []

  for epoch in range(number_epochs):
      start_time = time.time()

      train_loss, train_acc = train_epoch(model, train_loader, criterion, optim)
      train_history.append(train_loss)
      print("Training epoch {} | Loss {:.6f} | Accuracy {:.2f}% | Time {:.2f} seconds"
            .format(epoch + 1, train_loss, train_acc, time.time() - start_time))

      start_time = time.time()
      test_loss, acc = validation_epoch(model, test_loader, criterion)
      test_history.append(test_loss)
      accuracy_history.append(acc)
      print("Validation epoch {} | Loss {:.6f} | Accuracy {:.2f}% | Time {:.2f} seconds"
            .format(epoch + 1, test_loss, acc, time.time() - start_time))

In [57]:
len(train_sentiments)

40000

In [78]:
class MLP(nn.Module):

  def __init__(self, in_features):
    super(MLP, self).__init__()
    self.linear1 = nn.Linear(in_features, 256)
    self.linear2 = nn.Linear(256, 128)
    self.linear3 = nn.Linear(128, 64)
    self.linear4 = nn.Linear(64, 2)
    
    # Your implementation

  def forward(self, new_input):
    x = self.linear1(new_input)
    x = F.relu(x)
    x = self.linear2(x)
    x = F.relu(x)
    x = self.linear3(x)
    x = F.relu(x)
    x = self.linear4(x)
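    # Explicit dim=1 normalizes across the class dimension.
    # Note: nn.CrossEntropyLoss applies log-softmax itself and is normally fed raw logits,
    # so with that loss this softmax is redundant.
    x = F.softmax(x, dim=1)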
    return x
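
One detail worth checking when pairing a softmax output layer with `nn.CrossEntropyLoss`: that loss applies log-softmax to whatever it receives. If the model already outputs probabilities in [0, 1], the loss cannot go below ln(1 + e^-1) ≈ 0.313 for two classes, even for a perfectly confident correct prediction. The pure-Python sketch below reproduces the arithmetic; `cross_entropy` is a simplified single-example stand-in written for illustration, not the PyTorch function.

```python
import math


def softmax(values):
    """Exponentiate and normalize so the outputs sum to 1."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(inputs, target):
    """Negative log of the softmax probability assigned to the target class."""
    return -math.log(softmax(inputs)[target])


logits = [8.0, -8.0]     # raw scores, confident in class 0
probs = softmax(logits)  # what a softmax output layer would emit

print(cross_entropy(logits, 0))  # loss on raw logits: close to 0
print(cross_entropy(probs, 0))   # loss on probabilities: stuck near 0.313
```

A training loss that plateaus well above zero is consistent with this floor; returning the raw output of the last linear layer removes it.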
  

In [82]:
modelo = MLP(in_features=300).to(DEVICE)
from torchsummary import summary
summary(modelo,input_size=(300,))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                  [-1, 256]          77,056
            Linear-2                  [-1, 128]          32,896
            Linear-3                   [-1, 64]           8,256
            Linear-4                    [-1, 2]             130
Total params: 118,338
Trainable params: 118,338
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.45
Estimated Total Size (MB): 0.46
----------------------------------------------------------------




In [83]:
loss_function = nn.CrossEntropyLoss().to(DEVICE)
optimizer = torch.optim.Adam(modelo.parameters(), lr=0.001)
BATCH_SIZE = 32

In [84]:
# Dataloaders
train_vectors = [get_sentence_embedding(sent) for sent in train_reviews]
test_vectors = [get_sentence_embedding(sent) for sent in test_reviews]

train_targets = torch.Tensor(train_sentiments.to_numpy()).long()
# Stack the list of arrays into one ndarray first; building a tensor from a list of arrays is slow.
train_dataset = TensorDataset(torch.Tensor(np.array(train_vectors)), train_targets)
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, pin_memory=True, num_workers=2)

test_targets = torch.Tensor(test_sentiments.to_numpy()).long()
test_dataset = TensorDataset(torch.Tensor(np.array(test_vectors)), test_targets)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, pin_memory=True, num_workers=2)



In [85]:
train_model(modelo, train_dataloader, test_dataloader, loss_function, optimizer, 10)

Training epoch 1 | Loss 0.502030 | Accuracy 80.05% | Time 10.81 seconds
Validation epoch 1 | Loss 0.463419 | Accuracy 84.13% | Time 4.80 seconds
Training epoch 2 | Loss 0.467858 | Accuracy 83.76% | Time 9.55 seconds
Validation epoch 2 | Loss 0.457293 | Accuracy 85.04% | Time 6.03 seconds
Training epoch 3 | Loss 0.463534 | Accuracy 84.22% | Time 16.24 seconds
Validation epoch 3 | Loss 0.455353 | Accuracy 85.30% | Time 4.41 seconds
Training epoch 4 | Loss 0.460398 | Accuracy 84.66% | Time 10.74 seconds
Validation epoch 4 | Loss 0.454226 | Accuracy 85.36% | Time 5.21 seconds
Training epoch 5 | Loss 0.459233 | Accuracy 84.78% | Time 10.51 seconds
Validation epoch 5 | Loss 0.455182 | Accuracy 85.17% | Time 4.91 seconds
Training epoch 6 | Loss 0.458877 | Accuracy 84.79% | Time 11.69 seconds
Validation epoch 6 | Loss 0.454190 | Accuracy 85.34% | Time 5.24 seconds
Training epoch 7 | Loss 0.457351 | Accuracy 84.98% | Time 11.83 seconds
Validation epoch 7 | Loss 0.453391 | Accuracy 85.28% | Time 4.73 seconds
Training epoch 8 | Loss 0.456364 | Accuracy 85.08% | Time 9.35 seconds
Validation epoch 8 | Loss 0.453176 | Accuracy 85.31% | Time 4.03 seconds
Training epoch 9 | Loss 0.455276 | Accuracy 85.20% | Time 8.73 seconds
Validation epoch 9 | Loss 0.453261 | Accuracy 85.33% | Time 3.98 seconds
Training epoch 10 | Loss 0.454850 | Accuracy 85.26% | Time 8.42 seconds
Validation epoch 10 | Loss 0.452747 | Accuracy 85.49% | Time 3.89 seconds


# Exploration

Explore other preprocessing, tokenization, and vectorization techniques to see whether you can beat the models presented in class.


In [86]:
class MLPRegularizada(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.linear1 = nn.Linear(in_features, 256)
        self.batch_norm1 = nn.BatchNorm1d(256)
        self.dropout1 = nn.Dropout(p=0.5)
        self.linear2 = nn.Linear(256, 128)
        self.batch_norm2 = nn.BatchNorm1d(128)
        self.dropout2 = nn.Dropout(p=0.5)
        self.linear3 = nn.Linear(128, 64)
        self.batch_norm3 = nn.BatchNorm1d(64)
        self.dropout3 = nn.Dropout(p=0.5)
        self.linear4 = nn.Linear(64, 2)

    def forward(self, x):
        x = self.dropout1(F.leaky_relu(self.batch_norm1(self.linear1(x))))
        x = self.dropout2(F.leaky_relu(self.batch_norm2(self.linear2(x))))
        x = self.dropout3(F.leaky_relu(self.batch_norm3(self.linear3(x))))
        x = F.softmax(self.linear4(x), dim=1)
        return x

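# Note: this instantiates the plain MLP defined earlier, not MLPRegularizada;
# use MLPRegularizada(in_features=300) to train the regularized variant.
modelo = MLP(in_features=300).to(DEVICE)
# This optimizer (with weight decay) is replaced by the one created in the next cell.
optimizer = torch.optim.Adam(modelo.parameters(), lr=0.001, weight_decay=1e-5)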


In [87]:
loss_function = nn.CrossEntropyLoss().to(DEVICE)
optimizer = torch.optim.Adam(modelo.parameters(), lr=0.001)
BATCH_SIZE = 32

In [88]:
train_model(modelo, train_dataloader, test_dataloader, loss_function, optimizer, 10)

Training epoch 1 | Loss 0.502825 | Accuracy 79.95% | Time 8.52 seconds
Validation epoch 1 | Loss 0.464235 | Accuracy 84.07% | Time 4.11 seconds
Training epoch 2 | Loss 0.467938 | Accuracy 83.83% | Time 9.27 seconds
Validation epoch 2 | Loss 0.456960 | Accuracy 85.07% | Time 4.50 seconds
Training epoch 3 | Loss 0.463606 | Accuracy 84.19% | Time 10.24 seconds
Validation epoch 3 | Loss 0.455397 | Accuracy 85.22% | Time 5.23 seconds
Training epoch 4 | Loss 0.460731 | Accuracy 84.55% | Time 11.95 seconds
Validation epoch 4 | Loss 0.454225 | Accuracy 85.44% | Time 5.45 seconds
Training epoch 5 | Loss 0.459785 | Accuracy 84.69% | Time 8.57 seconds
Validation epoch 5 | Loss 0.454379 | Accuracy 85.31% | Time 3.82 seconds
Training epoch 6 | Loss 0.459118 | Accuracy 84.82% | Time 8.40 seconds
Validation epoch 6 | Loss 0.455029 | Accuracy 85.21% | Time 3.87 seconds
Training epoch 7 | Loss 0.458407 | Accuracy 84.86% | Time 8.40 seconds
Validation epoch 7 | Loss 0.454169 | Accuracy 85.44% | Time 3.82 seconds
Training epoch 8 | Loss 0.456813 | Accuracy 85.05% | Time 8.27 seconds
Validation epoch 8 | Loss 0.453916 | Accuracy 85.39% | Time 3.92 seconds
Training epoch 9 | Loss 0.455625 | Accuracy 85.13% | Time 8.30 seconds
Validation epoch 9 | Loss 0.453582 | Accuracy 85.40% | Time 3.80 seconds
Training epoch 10 | Loss 0.455128 | Accuracy 85.22% | Time 8.33 seconds
Validation epoch 10 | Loss 0.453530 | Accuracy 85.36% | Time 3.78 seconds


### CNN

In [100]:
class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes, kernel_sizes, num_filters):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (k, embed_dim)) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)

    def forward(self, x):
        x = x.long()
        x = self.embedding(x).unsqueeze(1)  # add a channel dimension
        x = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        x = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in x]
        x = torch.cat(x, 1)
        x = self.dropout(x)
        return self.fc(x)
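
`nn.Embedding` looks up rows by integer token index, so a model like `TextCNN` needs each review encoded as a sequence of vocabulary indices, padded to a common length, rather than a single averaged vector. Below is a minimal sketch of that encoding step in plain Python; `build_vocab`, `encode`, the `<pad>`/`<unk>` tokens, and the sizes are assumptions for illustration.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"


def build_vocab(texts, max_size=10000):
    """Map the most frequent tokens to indices; 0 is padding, 1 is unknown."""
    counts = Counter(token for text in texts for token in text.split())
    vocab = {PAD: 0, UNK: 1}
    for token, _ in counts.most_common(max_size - len(vocab)):
        vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, max_len):
    """Turn a text into exactly max_len indices, truncating or padding."""
    ids = [vocab.get(token, vocab[UNK]) for token in text.split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


texts = ["good movie good plot", "bad movie"]
vocab = build_vocab(texts)
batch = [encode(text, vocab, max_len=6) for text in texts]
print(vocab)
print(batch)
```

With this encoding, `vocab_size` would be `len(vocab)`, and `max_len` must be at least the largest kernel size so every convolution has a valid window. The resulting lists can be wrapped with `torch.tensor(batch, dtype=torch.long)` before building a `TensorDataset`.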


In [1]:
modelo = TextCNN(vocab_size=300, embed_dim=300, kernel_sizes= [4], num_filters=64, num_classes=2).to(DEVICE)
loss_function = nn.CrossEntropyLoss().to(DEVICE)
optimizer = torch.optim.Adam(modelo.parameters(), lr=0.001)
BATCH_SIZE = 16

NameError: name 'TextCNN' is not defined

In [103]:
train_model(modelo, train_dataloader, test_dataloader, loss_function, optimizer, 10)