Team members:
- Daniel Carmona
- Consuelo Rojas

In this assignment you will build a neural network that classifies messages as spam or not spam. The first step is to download the data:

In [1]:
# !wget https://www.ivan-sipiran.com/downloads/spam.csv

The data comes in a CSV file with two columns, "text" and "label". The "text" column holds the message text and the "label" column holds the labels "ham" and "spam". A "ham" message is one that is not considered spam.

# Assignment
The goal of the assignment is to build a neural network that classifies the provided data. To do this you must:

*   Implement whatever preprocessing of the data you consider necessary.
*   Split the data into 70% training, 10% validation, and 20% test.
*   Use the training and validation data for your experiments, and use the test set only to report the final result.

For the network design you may use a recurrent neural network or a transformer-based network. The goal of the assignment is not to reach the highest possible performance, but to understand which design decisions affect the solution to a problem like this one. What is required (as always) is that you discuss your results and the decisions you made.



# Libraries

In [2]:
import pandas as pd
import time
import numpy as np
import random

# NLP
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('stopwords')
nltk.download('punkt')

# Preprocessing
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.feature_extraction.text import CountVectorizer

# Models
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
torch.cuda.empty_cache()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
  warn(f"Failed to load image Python extension: {e}")


# Preprocessing

In [3]:
df = pd.read_csv('spam.csv').dropna() #, encoding='latin-1')
display(df.head())

Unnamed: 0,text,label
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham


In [4]:
print(df["label"].unique())
# print(df["label"].value_counts())

['ham' 'spam' '#&gt' 'URL&gt' 'DECIMAL&gt' 'l 09064012103 box334sk38ch'
 '  &lt' ':(' ')'
 ' they wer askd 2 sit in an aeroplane. Aftr they sat they wer told dat the plane ws made by their students. Dey all hurried out of d plane.. Bt only 1 didnt move... He said:\\if it is made by my students'
 'TIME&gt' ' successful day.' ' wish U Merry Xmas...'
 " have to reach before 4'o clock so call me plz" ' sweet dreams'
 ' HAVE A NICE SLEEP..SWEET DREAMS..'
 ' best friends.. GOODEVENING Dear..:)' " don't worry."
 ' Finally It Becomes Part Of Your Life.. Follow It.. Happy Morning &amp'
 "i'm in luv wid u. Blue" 'Take care' 'abel'
 ') reminds me i still need 2go.did u c d little thing i left in the lounge?'
 " it kills me that u don't care enough to stop me..." '_'
 ' send this to ur frndZ &amp' ' fletcher now'
 ' I love u Grl: Hogolo Boy: gold chain kodstini Grl: Agalla Boy: necklace madstini Grl: agalla Boy: Hogli 1 mutai eerulli kodthini! Grl: I love U kano'
 '-) Good morning.. keep smiling:-

In [5]:
# Keep only the rows whose label is a valid class ("ham" or "spam")
df2 = df[df["label"].isin(["ham", "spam"])]
print(df2["label"].value_counts())

ham     4617
spam     746
Name: label, dtype: int64


# Tokenizer

In [6]:
# Stop words to remove from the messages
stop_words = stopwords.words('english')

# Tokenizer that drops stop words and applies Porter stemming
class StemmerTokenizer:
    def __init__(self):
        self.ps = PorterStemmer()
    def __call__(self, doc):
        doc_tok = word_tokenize(doc)
        doc_tok = [t for t in doc_tok if t not in stop_words]
        return [self.ps.stem(t) for t in doc_tok]

In [7]:
bow = CountVectorizer(tokenizer= StemmerTokenizer(), ngram_range=(1,2))
df_bow = bow.fit_transform(df2["text"])

df_bow = pd.DataFrame(df_bow.toarray(), columns=bow.get_feature_names_out())

In [8]:
df_bow = SelectPercentile(f_classif, percentile=70).fit_transform(df_bow, df2["label"])
df_bow = pd.DataFrame(df_bow)
df_bow.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,29522,29523,29524,29525,29526,29527,29528,29529,29530,29531
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
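
Both `bow` and `SelectPercentile` above are fitted on the full dataset, before the split. One alternative, sketched here on synthetic count data, is to fit the selector on the training rows only and then apply it to held-out rows, so the held-out labels never influence which features are kept. The array sizes, the random data, and the row-based split are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)

# Synthetic stand-in for a bag-of-words matrix: 60 messages x 20 count features
X_all = rng.poisson(1.0, size=(60, 20))
y_all = np.array([0, 1] * 30)

# Hypothetical split: first 40 rows for training, last 20 held out
X_tr, y_tr = X_all[:40], y_all[:40]
X_held = X_all[40:]

# Fit the selector on training rows only, then reuse it on held-out rows
selector = SelectPercentile(f_classif, percentile=70)
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_held_sel = selector.transform(X_held)

print(X_tr_sel.shape, X_held_sel.shape)
```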


In [9]:
X_train, X_test, y_train, y_test = train_test_split(df_bow, df2["label"], test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=1) # 0.125 x 0.8 = 0.1

print("train: \n", y_train.value_counts())
print("val: \n", y_val.value_counts())
print("test: \n", y_test.value_counts())

train: 
 ham     3245
spam     508
Name: label, dtype: int64
val: 
 ham     453
spam     84
Name: label, dtype: int64
test: 
 ham     919
spam    154
Name: label, dtype: int64


# Training Functions

In [10]:
'''Initializes all pseudo-random number generator seeds.
Call this function to reset the random number generators.'''
def iniciar_semillas():
  SEED = 1234

  random.seed(SEED)
  np.random.seed(SEED)
  torch.manual_seed(SEED)
  torch.cuda.manual_seed(SEED)
  torch.backends.cudnn.deterministic = True

# Computes the accuracy. Predictions and labels are assumed to be tensors on the GPU
def calculate_accuracy(y_pred, y):
  top_pred = y_pred.argmax(1, keepdim=True)
  correct = top_pred.eq(y.view_as(top_pred)).sum()
  acc = correct.float()/y.shape[0]
  return acc

'''Trains a model for one epoch. Parameters:
    -model: a neural network
    -iterator: an iterator over the training data (usually created with a DataLoader)
    -optimizer: the optimizer used for training
    -criterion: the loss function
    -device: the device used for training

Returns the average loss and average accuracy of the epoch (averaged over all batches)'''
def train_one_epoch(model, iterator, optimizer, criterion, device):
  epoch_loss = 0
  epoch_acc = 0

  # Put the network in training mode so that training-only layers
  # (dropout, for instance) are active
  model.train()

  # Training loop
  for x, y in iterator:
    # Inputs are used as embedding indices and labels as class indices,
    # so both are cast to int64 before moving them to the device
    x = x.to(torch.int64).to(device)
    y = y.to(torch.int64).to(device)

    # Fresh zero-initialized hidden and cell states, sized for this batch
    hidden_state = model.init_hidden(x.shape[0])
    hidden_state = tuple([each.data for each in hidden_state])

    optimizer.zero_grad() # Clean gradients
    y_pred, hidden_state = model(x, hidden_state) # Forward pass
    loss = criterion(y_pred, y) # Compute the loss
    acc = calculate_accuracy(y_pred, y) # Compute the accuracy

    loss.backward() # Compute gradients

    nn.utils.clip_grad_norm_(model.parameters(), 5) # Clip the gradient norm at 5
    optimizer.step() # Apply update rules

    epoch_loss += loss.item()
    epoch_acc += acc.item()

  return epoch_loss / len(iterator), epoch_acc / len(iterator)

'''Evaluates a neural network on a dataset. Parameters:
    -model: a neural network
    -iterator: an iterator over the evaluation data (usually created with a DataLoader)
    -criterion: the loss function
    -device: the device used for evaluation
Returns the average loss and average accuracy over the dataset (averaged over all batches)'''
def evaluate(model, iterator, criterion, device):
  epoch_loss = 0
  epoch_acc = 0

  # Put the network in evaluation mode: PyTorch disables features
  # reserved for training (dropout, for instance)
  model.eval()

  with torch.no_grad(): # Disable the autograd engine (saves computation and memory)
    for x, y in iterator:
      x = x.to(torch.int64).to(device)
      y = y.to(torch.int64).to(device)

      hidden_state = model.init_hidden(x.shape[0])
      hidden_state = tuple([each.data for each in hidden_state])
      y_pred, hidden_state = model(x, hidden_state)

      # No squeeze() here: it would drop the batch dimension of a single-sample batch
      loss = criterion(y_pred, y)
      acc = calculate_accuracy(y_pred, y)

      epoch_loss += loss.item()
      epoch_acc += acc.item()
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

# Computes the elapsed time between two timestamps
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

'''Runs the complete training of a network. Parameters:
    -network: the neural network
    -device: the device used for training
    -optimizer: the optimizer used for training
    -criterion: the loss function
    -train_loader: the dataloader with the training data
    -val_loader: the dataloader with the validation data
    -test_loader: the dataloader with the test data
    -name: name used to save to disk the network with the best validation accuracy
    -epochs: number of training epochs'''
def train_complete(network, device, optimizer, criterion, train_loader, val_loader, test_loader, name, epochs=10):
  
  # Move the network and the loss function to the device
  network = network.to(device)
  criterion = criterion.to(device)

  # Number of training epochs (set through the `epochs` argument)
  EPOCHS = epochs

  best_valid_acc = float('-inf')

  for epoch in range(EPOCHS):
    
    start_time = time.time()

    #Train + validation cycles  
    train_loss, train_acc = train_one_epoch(network, train_loader, optimizer, criterion, device)
    valid_loss, valid_acc = evaluate(network, val_loader, criterion, device)
    
    # If this epoch gives a higher validation accuracy, save the model
    if valid_acc > best_valid_acc:
      best_valid_acc = valid_acc
      torch.save(network.state_dict(), f'{name}.pt')
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
  
  # Once training ends, load the best saved model and compute the test accuracy
  network.load_state_dict(torch.load(f'{name}.pt'))

  test_loss , test_acc = evaluate(network, test_loader, criterion, device)
  print(f'Test Loss: {test_loss:.3f} | Mejor test acc: {test_acc*100:.2f}%')
  return network
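
Note that `train_one_epoch` and `evaluate` average the per-batch accuracies, so a smaller final batch counts as much as a full one. A small sketch with made-up numbers (the helper names are hypothetical) shows how this differs from a per-sample average:

```python
def mean_over_batches(batch_accs):
    # What epoch_acc / len(iterator) computes: every batch has equal weight
    return sum(batch_accs) / len(batch_accs)

def mean_over_samples(batch_accs, batch_sizes):
    # Weight each batch accuracy by the number of samples in the batch
    correct = sum(a * n for a, n in zip(batch_accs, batch_sizes))
    return correct / sum(batch_sizes)

# Made-up example: two full batches of 15 and a final batch of 3
accs = [0.8, 0.8, 1.0]
sizes = [15, 15, 3]

print(round(mean_over_batches(accs), 4))
print(round(mean_over_samples(accs, sizes), 4))
```

This matters whenever the dataset size is not a multiple of the batch size. For example, 3753 rows with a batch size of 15 leave a final batch of 3 samples.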

# Dataloaders

In [11]:
le = preprocessing.LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)
y_test = le.transform(y_test)
y_val = le.transform(y_val)

# list(le.inverse_transform(y_train))

In [12]:
# Hyper-parameters 
batch_size = 15

from torch.utils.data import TensorDataset, DataLoader

tensor_x_train = torch.Tensor(X_train.to_numpy()) # convert to a torch tensor
tensor_y_train = torch.Tensor(y_train)
train_dataset = TensorDataset(tensor_x_train,tensor_y_train) # create the dataset
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False) # create the dataloader

tensor_x_val = torch.Tensor(X_val.to_numpy()) # convert to a torch tensor
tensor_y_val = torch.Tensor(y_val)
val_dataset = TensorDataset(tensor_x_val,tensor_y_val) # create the dataset
val_loader = DataLoader(dataset=val_dataset, batch_size=batch_size, shuffle=False) # create the dataloader

tensor_x_test = torch.Tensor(X_test.to_numpy()) # convert to a torch tensor
tensor_y_test = torch.Tensor(y_test)
test_dataset = TensorDataset(tensor_x_test,tensor_y_test) # create the dataset
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False) # create the dataloader
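
As a hedged sketch of the same pattern on toy tensors: labels can be created as `torch.long` directly, which is the dtype `nn.CrossEntropyLoss` expects for class indices, and `shuffle=True` is a common choice for the training loader (an assumption here; the loaders above keep `shuffle=False`).

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy data: 10 samples with 4 features each and binary labels (illustrative only)
toy_x = torch.zeros(10, 4)
toy_y = torch.tensor([0, 1, 0, 0, 1, 0, 0, 0, 1, 0], dtype=torch.long)

toy_dataset = TensorDataset(toy_x, toy_y)
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

# 10 samples with batch_size=4 give batches of 4, 4 and 2 samples
batch_shapes = [tuple(xb.shape) for xb, yb in toy_loader]
print(batch_shapes)
```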

# Model

In [15]:
class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # Embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True)

        # Dropout
        self.dropout = nn.Dropout(drop_prob)

        # Linear layer and sigmoid
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
                
        # Keep only the output of the last LSTM time step
        lstm_out = lstm_out[:,-1,:]

        # Dropout and fully connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)

        # Sigmoid (note: nn.CrossEntropyLoss applies log-softmax itself and expects raw logits)
        sig_out = self.sig(out)

        # Return the sigmoid output and the last hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        # Create two new tensors of size n_layers x batch_size x hidden_dim,
        # initialized to zero, for the LSTM hidden state and cell state.
        # new_zeros allocates them with the dtype and device of the model weights.
        weight = next(self.parameters())
        hidden = (weight.new_zeros(self.n_layers, batch_size, self.hidden_dim),
                  weight.new_zeros(self.n_layers, batch_size, self.hidden_dim))
        return hidden
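
An `nn.Embedding` layer looks up one vector per integer index, and the LSTM then reads those vectors in order. In the pipeline above the inputs are bag-of-words count rows cast to `int64`, so the indices seen by the embedding are counts rather than word identities. One alternative input, sketched here in plain Python, is a padded sequence of vocabulary indices per message. The helper names, the reserved indices 0 (padding) and 1 (unknown word), and the pre-tokenized toy messages are all assumptions made for illustration.

```python
def build_vocab(tokenized_docs):
    # Index 0 is reserved for padding and 1 for out-of-vocabulary tokens
    vocab = {"<pad>": 0, "<unk>": 1}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(doc, vocab, max_len):
    # Map tokens to indices, truncate to max_len, then left-pad with zeros
    # so the last time step (the one the model reads) holds a real token
    ids = [vocab.get(token, vocab["<unk>"]) for token in doc][:max_len]
    return [vocab["<pad>"]] * (max_len - len(ids)) + ids

train_docs = [["free", "prize", "now"], ["call", "me", "now"]]
vocab = build_vocab(train_docs)

print(encode(["free", "call"], vocab, max_len=4))
print(encode(["win", "prize", "now", "call", "me"], vocab, max_len=4))
```

With this encoding, `vocab_size` would be `len(vocab)` and each row fed to the embedding would have `max_len` entries instead of one entry per vocabulary column.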

# Training

In [17]:
torch.cuda.empty_cache()
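# Note: this binds a Python variable named CUDA_LAUNCH_BLOCKING; it does not set the environment variable of that name
CUDA_LAUNCH_BLOCKING=1.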

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_on_gpu=torch.cuda.is_available()
print(device)

cuda


In [18]:
vocab_size = len(df_bow.columns) + 1 # +1 for zero padding + our word tokens
output_size = 2
embedding_dim = 200 #100 
hidden_dim = 64 #64
n_layers = 2
lr = 0.001

model_RNN = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
criterion = nn.CrossEntropyLoss()  #nn.NLLLoss()
optimizer = torch.optim.Adam(model_RNN.parameters(), lr=lr, weight_decay=1e-5)
#optimizer = torch.optim.SGD(model_RNN.parameters(), lr=lr, momentum=0.9)
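
`nn.CrossEntropyLoss` applies log-softmax internally and expects raw logits, while `SentimentRNN.forward` returns sigmoid outputs in [0, 1]. The following sketch, using made-up values, compares the loss for a confidently correct prediction with and without the sigmoid. With two classes squashed into [0, 1], the loss cannot go below log(1 + e^-1), roughly 0.313.

```python
import math
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

# A made-up, very confident and correct prediction for class 0
logits = torch.tensor([[20.0, -20.0]])
target = torch.tensor([0])

loss_on_logits = loss_fn(logits, target).item()
loss_on_sigmoid = loss_fn(torch.sigmoid(logits), target).item()

# Lower bound when both outputs are restricted to [0, 1]: outputs (1, 0) for class 0
floor = math.log(1 + math.exp(-1))

print(loss_on_logits, loss_on_sigmoid, floor)
```

One possible design choice, not the one used in this notebook, is to return the output of `self.fc` directly and let the criterion handle the normalization.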


In [19]:
model_RNN = train_complete(model_RNN, device, optimizer, criterion, train_loader, val_loader, test_loader, "model_RNN", epochs=4)

Epoch: 01 | Epoch Time: 0m 48s
	Train Loss: 0.462 | Train Acc: 86.37%
	 Val. Loss: 0.470 |  Val. Acc: 84.35%
Epoch: 02 | Epoch Time: 0m 47s
	Train Loss: 0.450 | Train Acc: 86.40%
	 Val. Loss: 0.470 |  Val. Acc: 84.35%
Epoch: 03 | Epoch Time: 0m 47s
	Train Loss: 0.449 | Train Acc: 86.40%
	 Val. Loss: 0.470 |  Val. Acc: 84.35%
Epoch: 04 | Epoch Time: 0m 47s
	Train Loss: 0.449 | Train Acc: 86.40%
	 Val. Loss: 0.470 |  Val. Acc: 84.35%
Test Loss: 0.458 | Mejor test acc: 85.58%
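
The accuracies above can be compared against a predictor that always answers "ham". The sketch below uses the class counts printed after the split, and additionally assumes a model whose sigmoid outputs are saturated at (1, 0) for every message; the loss it prints is the cross-entropy such a constant predictor would obtain.

```python
import math

# Class counts printed after the split above: (ham, spam)
split_counts = {"train": (3245, 508), "val": (453, 84), "test": (919, 154)}

# Cross-entropy on outputs (1, 0): softmax gives ham probability 1 / (1 + e^-1)
loss_if_ham = math.log(1 + math.exp(-1))   # sample is actually ham
loss_if_spam = math.log(1 + math.exp(1))   # sample is actually spam

baseline = {}
for split, (ham, spam) in split_counts.items():
    total = ham + spam
    acc = ham / total
    loss = (ham * loss_if_ham + spam * loss_if_spam) / total
    baseline[split] = (acc, loss)
    print(f"{split}: always-ham acc = {acc*100:.2f}% | loss = {loss:.3f}")
```

These baseline values lie within about 0.1 percentage points of the reported accuracies and within 0.001 of the reported losses (small gaps are expected, since the training functions average per batch). This suggests that the trained network predicts "ham" for every message, which is consistent with the validation metrics not changing across epochs.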
