# Data and Preprocessing

We will use the IMDB dataset for movie review classification. The goal is to detect whether a review has **positive** or **negative** sentiment.

Download the dataset from this [link](https://drive.google.com/file/d/1i0bBI4p80AxsLgnWcXkxVT65AahIzePu/view?usp=sharing) and upload it to a **data** folder at the root of your personal Drive.


In [None]:
import pandas as pd
from google.colab import drive
drive.mount("/content/drive")

! cp "/content/drive/My Drive/data/IMDB_Dataset.zip" .
! unzip -q IMDB_Dataset.zip
! rm IMDB_Dataset.zip
! ls

In [None]:
import re
import time
from itertools import chain
from bs4 import BeautifulSoup
from collections import Counter

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
nltk.download('wordnet')

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(DEVICE)

torch.manual_seed(42)
torch.backends.cudnn.deterministic = True

In [None]:
imdb_data = pd.read_csv("IMDB Dataset.csv")

#sentiment count
print(imdb_data.columns)
imdb_data['sentiment'].value_counts()

# Convert positive and negative into binary classes (1-0)
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()

sentiment_data = lb.fit_transform(imdb_data["sentiment"])
imdb_data['sentiment'] = sentiment_data

In [None]:
def strip_html(text):
  soup = BeautifulSoup(text, "html.parser")
  return soup.get_text()


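def remove_between_square_brackets(text):
  # Raw string so that \[ and \] are passed to the regex engine unchanged
  return re.sub(r'\[[^]]*\]', '', text)


def remove_special_characters(text):
  # Keep only ASCII letters and whitespace.
  # Note: A-Z, not A-z; the range A-z also matches [ \ ] ^ _ and `
  pattern = r'[^a-zA-Z\s]'
  text = re.sub(pattern, '', text)
  return text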


def low_level_preproc(text):
  text = strip_html(text)
  text = remove_between_square_brackets(text)
  text = remove_special_characters(text)
  return text

#Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(low_level_preproc)

In [None]:
all_stopwords = set(stopwords.words("english"))

def remove_stop_words(full_text_line):
  tokens = full_text_line.split()
  tokens = [tok for tok in tokens if tok not in all_stopwords]

  return " ".join(tokens)


def lemmatize(text):
  wnl= WordNetLemmatizer()
  lemas = [wnl.lemmatize(word) for word in text.split()]

  return " ".join(lemas)


def high_level_preproc(text):
  text = remove_stop_words(text)
  return lemmatize(text)


#Apply function on review column
imdb_data['review'] = imdb_data['review'].str.lower()
imdb_data['review'] = imdb_data['review'].apply(high_level_preproc)

In [None]:
#split the dataset  
#train dataset
train_reviews = imdb_data.review[:40000]
train_sentiments = imdb_data.sentiment[:40000]

#test dataset
test_reviews = imdb_data.review[40000:]
test_sentiments = imdb_data.sentiment[40000:]


print("Train set:", train_reviews.shape, train_sentiments.shape)
print("Test set:", test_reviews.shape, test_sentiments.shape)

# Vocabulary and Encoding

We will build a vocabulary for the problem so that each word can be represented by a unique integer. This lets us represent a review as a list of ints (which the model will later map to word embeddings).

Keep in mind that we will want to pad our inputs so that all reviews have the same length. We will use 0 for padding, so the word indices in our vocabulary must start at 1.



In [None]:
def make_vocab(all_texts, max_vocab_size, oov_token="<OOV>"):
  # Count the number of occurrences of each word

  # Create vocab containing max_vocab_size tokens
  # Add the out of vocabulary at the end

  # Map from word to int index in vocab

  return vocab_to_int
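
One possible way to fill in `make_vocab` is sketched below. It is only a sketch: it assumes reviews are tokenized by whitespace, that `max_vocab_size` counts regular words and the `<OOV>` token is added on top of them, and that indices start at 1 so that 0 stays free for padding.

```python
from collections import Counter


def make_vocab(all_texts, max_vocab_size, oov_token="<OOV>"):
    # Count the number of occurrences of each word across all texts
    counts = Counter(word for text in all_texts for word in text.split())

    # Keep the max_vocab_size most frequent words, then append the OOV token
    # (assumption: the OOV token does not count towards max_vocab_size)
    vocab = [word for word, _ in counts.most_common(max_vocab_size)]
    vocab.append(oov_token)

    # Map each word to an int index, starting at 1 (0 is reserved for padding)
    return {word: idx for idx, word in enumerate(vocab, start=1)}


# Toy usage on three tiny "reviews"
demo_vocab = make_vocab(["good movie", "bad movie", "good good plot"], max_vocab_size=2)
print(demo_vocab)
```

Under these assumptions the mapping has `max_vocab_size + 1` entries, so an embedding layer that also reserves row 0 for padding would need `len(vocab) + 1` rows.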

In [None]:
vocab_mapping = make_vocab(train_reviews, 100_000)

Now we will implement a function that turns a review string into a list of integers, each giving the position of a word in the vocabulary. If a word is not in the vocabulary, we use the index of `"<OOV>"`.

In [None]:
def get_review_features(text, word_to_idx):

  return indices
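
A minimal sketch of the lookup described above. The extra `oov_token` parameter is an assumption added for illustration (the stub's signature does not have it), and the toy vocabulary is hypothetical.

```python
def get_review_features(text, word_to_idx, oov_token="<OOV>"):
    # Index used for any word that is missing from the vocabulary
    oov_index = word_to_idx[oov_token]

    # Whitespace tokenization, matching how the reviews were preprocessed
    return [word_to_idx.get(word, oov_index) for word in text.split()]


# Hypothetical toy vocabulary: "boring" is not in it, so it maps to <OOV>
toy_vocab = {"good": 1, "movie": 2, "<OOV>": 3}
demo_indices = get_review_features("good boring movie", toy_vocab)
print(demo_indices)
```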

Next we implement sentence padding. Although the models can handle sequences of any length, we want training and inference time to be bounded and not depend on input length. As a small optimization we will use **left padding**, that is, prepend 0s to a sequence until it has `max_sequence_length` elements.

Padding on the left helps the model learn because the useful information appears at the end of the sequence, so the model does not have to remember 3 words after having seen 100 zeros.

For example, the sequence

`[117, 18, 128]`

becomes:

`[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]`

instead of

`[117, 18, 128, 0, 0, 0, 0, 0, 0, 0]`, which would force the model to remember the first 3 inputs in order to predict anything.


In [None]:
def pad_features(review_ints, sequence_length):

  return features
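
A possible implementation of left padding is sketched below. The text does not say what to do with reviews longer than `sequence_length`; keeping only the first `sequence_length` tokens is an assumption made here, and keeping the last ones would be an equally valid choice.

```python
def pad_features(review_ints, sequence_length):
    # Assumption: reviews longer than sequence_length are cut to their first tokens
    truncated = review_ints[:sequence_length]

    # Prepend zeros until the sequence has exactly sequence_length elements
    n_padding = sequence_length - len(truncated)
    return [0] * n_padding + truncated


padded_short = pad_features([117, 18, 128], 10)
padded_long = pad_features([1, 2, 3, 4], 2)
print(padded_short)
print(padded_long)
```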

In [None]:
def get_review_representation(review_text, word_to_idx, max_sequence_length):
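  # Encode the review as vocabulary indices, then left-pad to a fixed length
  return pad_features(get_review_features(review_text, word_to_idx), max_sequence_length)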

# Transforming the texts into vectors

In [None]:
MAX_SEQUENCE_LENGTH = 100

train_vectors = train_reviews.apply(lambda x: get_review_representation(x, vocab_mapping, MAX_SEQUENCE_LENGTH))
test_vectors = test_reviews.apply(lambda x: get_review_representation(x, vocab_mapping, MAX_SEQUENCE_LENGTH))

train_vectors = np.array([vec for vec in train_vectors])
test_vectors = np.array([vec for vec in test_vectors])

# Training and inference code

Same old..


In [None]:
def train_epoch(training_model, loader, criterion, optim):
    training_model.train()
    epoch_loss = 0.0
    all_labels = []
    all_predictions = []
    
    for data, labels in loader:
      all_labels.extend(labels.numpy())  

      optim.zero_grad()

      predictions = training_model(data.to(DEVICE))
      all_predictions.extend(torch.argmax(predictions, dim=1).cpu().numpy())

      loss = criterion(predictions, labels.to(DEVICE))
      
      loss.backward()
      optim.step()

      epoch_loss += loss.item()

    return epoch_loss / len(loader), accuracy_score(all_labels, all_predictions) * 100


def validation_epoch(val_model, loader, criterion):
    val_model.eval()
    epoch_loss = 0.0
    all_labels = []
    all_predictions = []
    
    with torch.no_grad():
      for data, labels in loader:
        all_labels.extend(labels.numpy())  

        predictions = val_model(data.to(DEVICE))
        all_predictions.extend(torch.argmax(predictions, dim=1).cpu().numpy())

        loss = criterion(predictions, labels.to(DEVICE))

        epoch_loss += loss.item()

    return epoch_loss / len(loader), accuracy_score(all_labels, all_predictions) * 100
  

def train_model(model, train_loader, test_loader, criterion, optim, number_epochs):
  train_history = []
  test_history = []
  accuracy_history = []

  for epoch in range(number_epochs):
      start_time = time.time()

      # Use the optimizer passed in as `optim`, not a global `optimizer`
      train_loss, train_acc = train_epoch(model, train_loader, criterion, optim)
      train_history.append(train_loss)
      print("Training epoch {} | Loss {:.6f} | Accuracy {:.2f}% | Time {:.2f} seconds"
            .format(epoch + 1, train_loss, train_acc, time.time() - start_time))

      start_time = time.time()
      test_loss, acc = validation_epoch(model, test_loader, criterion)
      test_history.append(test_loss)
      accuracy_history.append(acc)
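      print("Validation epoch {} | Loss {:.6f} | Accuracy {:.2f}% | Time {:.2f} seconds"
            .format(epoch + 1, test_loss, acc, time.time() - start_time))

  # Return the collected histories so they can be inspected or plotted
  return train_history, test_history, accuracy_history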

# Model


In [None]:
class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(SentimentRNN, self).__init__()
        # You need an Embedding layer and an RNN.
        # The rest (and its hyperparameters) is up to you.
        # 2 outputs

    def forward(self, x):
        return x
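
As a reference point, here is one minimal way the pieces named in the comments could fit together: an embedding layer, a plain `nn.RNN`, and a linear layer with 2 outputs. The class name `SentimentRNNSketch`, the use of `padding_idx=0`, and the choice to classify from the final hidden state are assumptions for illustration, not the expected solution.

```python
import torch
import torch.nn as nn


class SentimentRNNSketch(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        # padding_idx=0 keeps the embedding of the padding token fixed at zeros
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        # 2 outputs: one logit per class, as expected by nn.CrossEntropyLoss
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        embedded = self.embedding(x)    # (batch, seq_len, embedding_dim)
        _, hidden = self.rnn(embedded)  # hidden: (1, batch, hidden_dim)
        return self.fc(hidden[-1])      # (batch, 2)


# Shape check on a random batch of 4 sequences of length 10
demo_model = SentimentRNNSketch(vocab_size=50, embedding_dim=8, hidden_dim=16)
demo_batch = torch.randint(0, 50, (4, 10))
demo_logits = demo_model(demo_batch)
print(demo_logits.shape)
```

Here `vocab_size` has to cover every index that can appear in the input, including the padding index 0. With left padding, the final hidden state is computed right after the last real word of the review.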

# Training

In [None]:
# BATCH_SIZE must be defined before the DataLoaders that use it
BATCH_SIZE = 32
loss_function = nn.CrossEntropyLoss().to(DEVICE)

In [None]:
train_targets = torch.Tensor(train_sentiments.to_numpy()).long()
train_dataset = TensorDataset(torch.LongTensor(train_vectors), train_targets)
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, pin_memory=True, num_workers=2)

test_targets = torch.Tensor(test_sentiments.to_numpy()).long()
test_dataset = TensorDataset(torch.LongTensor(test_vectors), test_targets)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, pin_memory=True, num_workers=2)

In [None]:
# Instantiate a model
# Create an optimizer
# Train (a plain RNN should be able to exceed 60% accuracy)
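
Before training on the full dataset, it can help to check that a model, optimizer, and loss run end to end on a tiny synthetic batch. The sketch below does that with a hypothetical stand-in model (`TinyRNN`), random data, and Adam with a learning rate of `1e-3`; all of these are assumptions, not the notebook's prescribed setup.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(42)


class TinyRNN(nn.Module):
    # Hypothetical stand-in for SentimentRNN, kept small for a quick check
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        _, hidden = self.rnn(self.embedding(x))
        return self.fc(hidden[-1])


# Synthetic data: 64 "reviews" of 10 token ids each, with random binary labels
fake_vectors = torch.randint(1, 20, (64, 10))
fake_targets = torch.randint(0, 2, (64,))
fake_loader = DataLoader(TensorDataset(fake_vectors, fake_targets), batch_size=32, shuffle=True)

smoke_model = TinyRNN(vocab_size=20, embedding_dim=8, hidden_dim=16)
smoke_optimizer = torch.optim.Adam(smoke_model.parameters(), lr=1e-3)  # lr is an assumption
smoke_criterion = nn.CrossEntropyLoss()

smoke_losses = []
for data, labels in fake_loader:
    smoke_optimizer.zero_grad()
    loss = smoke_criterion(smoke_model(data), labels)
    loss.backward()
    smoke_optimizer.step()
    smoke_losses.append(loss.item())

print(smoke_losses)
```

In the notebook itself, the same pattern would be: move the model to `DEVICE`, build the optimizer from `model.parameters()`, and pass both, together with `train_dataloader`, `test_dataloader`, and `loss_function`, to `train_model`.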

# Model improvements



1. Can we improve performance by using a GRU or an LSTM?
2. What happens if we use **bidirectional** cells?
3. What happens if we increase the number of **layers** in our recurrent cells?
4. What if we use pretrained vectors (W2V, GloVe)?
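
The questions above can be explored with a variant like the following sketch, which swaps the plain RNN for a stacked bidirectional LSTM and optionally initializes the embedding layer from a pretrained matrix. The class name `BiLSTMSketch`, the `pretrained` argument, and the random matrix standing in for real W2V or GloVe vectors are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class BiLSTMSketch(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers=2, pretrained=None):
        super().__init__()
        if pretrained is not None:
            # pretrained: float tensor of shape (vocab_size, embedding_dim)
            self.embedding = nn.Embedding.from_pretrained(pretrained, freeze=False, padding_idx=0)
        else:
            self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        # Forward and backward states are concatenated, hence 2 * hidden_dim
        self.fc = nn.Linear(2 * hidden_dim, 2)

    def forward(self, x):
        _, (hidden, _) = self.lstm(self.embedding(x))
        # hidden: (num_layers * 2, batch, hidden_dim); the last two entries are
        # the top layer's forward and backward final states
        final = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.fc(final)


# Random matrix standing in for real pretrained vectors (assumption)
fake_pretrained = torch.randn(50, 8)
bi_model = BiLSTMSketch(vocab_size=50, embedding_dim=8, hidden_dim=16, pretrained=fake_pretrained)
bi_logits = bi_model(torch.randint(0, 50, (4, 10)))
print(bi_logits.shape)
```

Replacing `nn.LSTM` with `nn.GRU` only requires changing the unpacking in `forward`, since a GRU returns a single hidden state instead of a `(hidden, cell)` pair. One caveat with left padding: the backward direction finishes on the padding positions, so its final state may carry less information than the forward one.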


