Team members:
- Daniel Carmona
- Consuelo Rojas

In this assignment you will build a neural network that classifies messages as spam or not spam. The first step is to download the data:

In [83]:
!wget https://www.ivan-sipiran.com/downloads/spam.csv

--2022-11-29 16:04:20--  https://www.ivan-sipiran.com/downloads/spam.csv
Resolving www.ivan-sipiran.com (www.ivan-sipiran.com)... 66.96.149.31
Connecting to www.ivan-sipiran.com (www.ivan-sipiran.com)|66.96.149.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 471781 (461K)
Saving to: 'spam.csv.6'


The data comes in a CSV file with two columns, "text" and "label". The "text" column contains the message text and the "label" column contains the labels "ham" and "spam". A "ham" message is a message that is not considered spam.

# Assignment
The goal of the assignment is to build a neural network that classifies the provided data. To achieve this you must:



*   Implement whatever data preprocessing you consider necessary.
*   Split the data into 70% training, 10% validation, and 20% test.
*   Use the training and validation data for your experiments, and use the test set only to report the final result.

For the network design you may use a recurrent neural network or a transformer-based network. The goal of the assignment is not to reach the highest possible performance, but to understand which design decisions affect the solution of a problem like this one. What is required (as always) is that you discuss your results and the decisions you made.



# Libraries

In [84]:
import pandas as pd
import time
import numpy as np
import random
from sklearn.model_selection import train_test_split

# NLP
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('stopwords')
nltk.download('punkt')

# Preprocessing
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer

# Classification
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Daniel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Daniel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Preprocessing

In [85]:
df = pd.read_csv('spam.csv').dropna() #, encoding='latin-1')
display(df.head())

Unnamed: 0,text,label
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham


In [86]:
print(df["label"].unique())
# print(df["label"].value_counts())

['ham' 'spam' '#&gt' 'URL&gt' 'DECIMAL&gt' 'l 09064012103 box334sk38ch'
 '  &lt' ':(' ')'
 ' they wer askd 2 sit in an aeroplane. Aftr they sat they wer told dat the plane ws made by their students. Dey all hurried out of d plane.. Bt only 1 didnt move... He said:\\if it is made by my students'
 'TIME&gt' ' successful day.' ' wish U Merry Xmas...'
 " have to reach before 4'o clock so call me plz" ' sweet dreams'
 ' HAVE A NICE SLEEP..SWEET DREAMS..'
 ' best friends.. GOODEVENING Dear..:)' " don't worry."
 ' Finally It Becomes Part Of Your Life.. Follow It.. Happy Morning &amp'
 "i'm in luv wid u. Blue" 'Take care' 'abel'
 ') reminds me i still need 2go.did u c d little thing i left in the lounge?'
 " it kills me that u don't care enough to stop me..." '_'
 ' send this to ur frndZ &amp' ' fletcher now'
 ' I love u Grl: Hogolo Boy: gold chain kodstini Grl: Agalla Boy: necklace madstini Grl: agalla Boy: Hogli 1 mutai eerulli kodthini! Grl: I love U kano'
 '-) Good morning.. keep smiling:-

In [87]:
# Keep only the rows with a valid label; malformed rows carry text fragments as labels
df2 = df[df["label"].isin(["ham", "spam"])].reset_index(drop=True)
print(df2["label"].value_counts())

ham     4617
spam     746
Name: label, dtype: int64
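
The counts above show a clear class imbalance, so raw accuracy needs a reference point. As a rough sketch (the variable names are illustrative and the weighting heuristic is an assumption, not something the notebook does), the accuracy of always predicting "ham" and a set of inverse-frequency class weights can be computed directly from those counts:

```python
# Class counts taken from the value_counts() output above.
counts = {"ham": 4617, "spam": 746}
total = sum(counts.values())

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(counts.values()) / total

# Inverse-frequency weights: total / (n_classes * count). This is one common
# heuristic (an assumption here) that could be passed to a weighted loss.
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"majority baseline accuracy: {majority_baseline:.4f}")
print({label: round(w, 3) for label, w in class_weights.items()})
```

Any model trained on this data should be judged against that baseline of roughly 86%, not against 50%.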


# Tokenizer

In [88]:
# Stop words to remove (a set makes membership tests fast)
stop_words = set(stopwords.words('english'))

# Tokenizer with stemming
class StemmerTokenizer:
    def __init__(self):
        self.ps = PorterStemmer()
    def __call__(self, doc):
        doc_tok = word_tokenize(doc)
        doc_tok = [t for t in doc_tok if t not in stop_words]
        return [self.ps.stem(t) for t in doc_tok]

In [89]:
bow = CountVectorizer(tokenizer= StemmerTokenizer(), ngram_range=(1,2))
df_bow = bow.fit_transform(df2["text"])

df_bow = pd.DataFrame(df_bow.toarray(), columns=bow.get_feature_names_out())

In [90]:
df_bow = SelectPercentile(f_classif, percentile=90).fit_transform(df_bow, df2["label"])
df_bow = pd.DataFrame(df_bow)
df_bow.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,37960,37961,37962,37963,37964,37965,37966,37967,37968,37969
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
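
The cells above fit `CountVectorizer` and `SelectPercentile` on the full dataset before it is split, so the vocabulary and the selected features also depend on validation and test messages. A possible alternative, sketched here on toy data (the texts and variable names are hypothetical), is to fit both steps on the training portion only and reuse them on held-out messages:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the training and held-out messages (hypothetical data).
train_texts = ["win a free prize now", "free entry win cash",
               "are we meeting today", "see you at lunch today"]
train_labels = ["spam", "spam", "ham", "ham"]
heldout_texts = ["free cash prize", "lunch tomorrow?"]

# Vectorizer and feature selector chained so they are always fitted together.
features = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    SelectPercentile(f_classif, percentile=90),
)

# Fit on training data only; held-out data is only transformed.
X_train_toy = features.fit_transform(train_texts, train_labels)
X_heldout_toy = features.transform(heldout_texts)

print(X_train_toy.shape, X_heldout_toy.shape)
```

Words that appear only in held-out messages (such as "tomorrow" here) are simply ignored, which mirrors what happens with unseen messages at prediction time.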


In [91]:
X_train, X_test, y_train, y_test = train_test_split(df_bow, df2["label"], test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=1) # 0.125 x 0.8 = 0.1

print("train: \n", y_train.value_counts())
print("val: \n", y_val.value_counts())
print("test: \n", y_test.value_counts())

train: 
 ham     3245
spam     508
Name: label, dtype: int64
val: 
 ham     453
spam     84
Name: label, dtype: int64
test: 
 ham     919
spam    154
Name: label, dtype: int64
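
The spam share differs between the three partitions above (about 13.5% in train, 15.6% in validation, and 14.4% in test). If a more even distribution is wanted, `train_test_split` accepts a `stratify` argument. The following sketch uses synthetic labels with the same class counts as the filtered dataset; it illustrates the option and is not the split used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the filtered dataset.
labels = np.array(["ham"] * 4617 + ["spam"] * 746)
indices = np.arange(len(labels))

# 80% / 20% first, then 12.5% of the remaining 80% for validation (0.125 x 0.8 = 0.1).
idx_trainval, idx_test = train_test_split(
    indices, test_size=0.2, random_state=1, stratify=labels)
idx_train, idx_val = train_test_split(
    idx_trainval, test_size=0.125, random_state=1, stratify=labels[idx_trainval])

# Report the size and spam share of every partition.
for name, idx in [("train", idx_train), ("val", idx_val), ("test", idx_test)]:
    spam_share = np.mean(labels[idx] == "spam")
    print(f"{name}: n={len(idx)}, spam share={spam_share:.3f}")
```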


# Training Functions

In [92]:
'''Initializes every pseudo-random number generator seed.
Call this function to reset the random number generators.'''
def iniciar_semillas():
  SEED = 1234

  random.seed(SEED)
  np.random.seed(SEED)
  torch.manual_seed(SEED)
  torch.cuda.manual_seed(SEED)
  torch.backends.cudnn.deterministic = True

#Computes the accuracy. Predictions and labels are assumed to be tensors on the same device
def calculate_accuracy(y_pred, y):
  top_pred = y_pred.argmax(1, keepdim=True)
  correct = top_pred.eq(y.view_as(top_pred)).sum()
  acc = correct.float()/y.shape[0]
  return acc

'''Trains a model for one epoch. Parameters:
    -model: a neural network
    -iterator: an iterator over the training data (usually created with a DataLoader)
    -optimizer: the optimizer used for training
    -criterion: the loss function
    -device: the device used for training

Returns the average loss and the average accuracy of the epoch (averaged over all batches)'''
def train_one_epoch(model, iterator, optimizer, criterion, device):
  epoch_loss = 0
  epoch_acc = 0

  #Put the network in training mode so that training-only layers
  #(dropout, for instance) are active
  model.train()
  
  #Training loop
  for (x, y) in iterator:
    x = x.to(device) #Data
    y = y.long().to(device) #Labels
        
    optimizer.zero_grad() #Clean gradients
             
    y_pred = model(x) #Feed the network with data
        
    loss = criterion(y_pred, y) #Compute the loss
       
    acc = calculate_accuracy(y_pred, y) #Compute the accuracy
        
    loss.backward() #Compute gradients
        
    optimizer.step() #Apply update rules
        
    epoch_loss += loss.item()
    epoch_acc += acc.item()
        
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

'''Evaluates a neural network on a held-out dataset. Parameters:
    -model: a neural network
    -iterator: an iterator over the evaluation data (usually created with a DataLoader)
    -criterion: the loss function
    -device: the device used for evaluation
Returns the average loss and the average accuracy over the dataset (averaged over all batches)'''
def evaluate(model, iterator, criterion, device):
  epoch_loss = 0
  epoch_acc = 0

  #We put the network in testing mode
  #In this mode, Pytorch doesn't use features only reserved for 
  #training (dropout for instance)    
  model.eval()
    
  with torch.no_grad(): #disable the autograd engine (save computation and memory)
        
    for (x, y) in iterator:
      x = x.to(device)
      y = y.long().to(device)

      y_pred= model(x)

      loss = criterion(y_pred, y)

      acc = calculate_accuracy(y_pred, y)

      epoch_loss += loss.item()
      epoch_acc += acc.item()
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

#Computes the elapsed time between two timestamps
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

'''Runs the complete training of a network. Parameters:
    -network: the neural network
    -device: the device used for training
    -optimizer: the optimizer used for training
    -train_loader: the dataloader with the training data
    -val_loader: the dataloader with the validation data
    -test_loader: the dataloader with the test data
    -name: name used to save the network with the best validation accuracy to disk
    -epochs: number of training epochs'''
def train_complete(network, device, optimizer, train_loader, val_loader, test_loader, name, epochs=10):
  
  #Move the network and the loss function to the device
  network = network.to(device)
  criterion = nn.CrossEntropyLoss()
  criterion = criterion.to(device)

  #Number of training epochs (set by the epochs argument, 10 by default)
  EPOCHS = epochs

  best_valid_acc = float('-inf')

  for epoch in range(EPOCHS):
    
    start_time = time.time()

    #Train + validation cycles  
    train_loss, train_acc = train_one_epoch(network, train_loader, optimizer, criterion, device)
    valid_loss, valid_acc = evaluate(network, val_loader, criterion, device)
    
    #If this epoch yields a higher validation accuracy, save the model
    if valid_acc > best_valid_acc:
     best_valid_acc = valid_acc
     torch.save(network.state_dict(), f'{name}.pt')
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
  
  #Once training is finished, load the best saved model and compute the test accuracy
  network.load_state_dict(torch.load(f'{name}.pt'))

  test_loss, test_acc = evaluate(network, test_loader, criterion, device)
  print(f'Test Loss: {test_loss:.3f} | Test Acc (best val. model): {test_acc*100:.2f}%')
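
Both `train_one_epoch` and `evaluate` average the per-batch accuracies, which gives a smaller final batch the same weight as a full one. The sketch below (hypothetical helper names, synthetic logits) contrasts that average with the fraction of correctly classified samples, using a deliberately extreme case:

```python
import torch

def batch_mean_accuracy(batches):
    """Mean of per-batch accuracies, as computed by the functions above."""
    accs = [(pred.argmax(1) == y).float().mean().item() for pred, y in batches]
    return sum(accs) / len(accs)

def sample_accuracy(batches):
    """Fraction of correctly classified samples over the whole iterator."""
    correct = sum((pred.argmax(1) == y).sum().item() for pred, y in batches)
    total = sum(y.shape[0] for _, y in batches)
    return correct / total

# Hypothetical logits: a full batch of 4 (all correct) and a final batch of 1 (wrong).
batches = [
    (torch.tensor([[2.0, 0.0]] * 4), torch.tensor([0, 0, 0, 0])),
    (torch.tensor([[2.0, 0.0]]), torch.tensor([1])),
]

print(batch_mean_accuracy(batches))  # 0.5
print(sample_accuracy(batches))      # 0.8
```

With 537 validation samples and a batch size of 32, the last batch holds 25 samples, so in this notebook the two numbers should differ only slightly.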

# Model

In [93]:
# To create the network we must inherit from nn.Module
class GruNet(nn.Module):
    def __init__(self,
                 input_dim,
                 embedding_dim,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):

        super().__init__()

        # Embedding layer: maps integer ids in [0, input_dim) to vectors
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        # GRU layer (inter-layer dropout only applies when n_layers > 1)
        self.gru = nn.GRU(embedding_dim, hidden_dim, n_layers, batch_first=True,
                          dropout=dropout if n_layers > 1 else 0,
                          bidirectional=bidirectional)
        # Output layer
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        # Dropout
        self.dropout = nn.Dropout(dropout)

    # Define the operations the layers apply to the input in the forward pass.
    def forward(self, text):
        embedded = self.embedding(text)       # (batch, seq_len, embedding_dim)
        outputs, hidden = self.gru(embedded)  # hidden: (n_layers * n_directions, batch, hidden_dim)
        # Classify from the final hidden state of the top layer so the output
        # has shape (batch, output_dim), as CrossEntropyLoss expects.
        if self.gru.bidirectional:
            last_hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)
        else:
            last_hidden = hidden[-1]
        predictions = self.fc(self.dropout(last_hidden))
        return predictions
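
For message-level classification, `nn.CrossEntropyLoss` expects one score vector per message, that is, a tensor of shape (batch, n_classes). Which tensor to hand to the linear layer therefore depends on the shapes returned by `nn.GRU`. The following sketch, with arbitrary small dimensions chosen only for illustration, prints those shapes:

```python
import torch
import torch.nn as nn

batch_size, seq_len, emb_dim, hidden_dim, n_layers = 4, 7, 5, 3, 2

gru = nn.GRU(emb_dim, hidden_dim, n_layers, batch_first=True)
embedded = torch.randn(batch_size, seq_len, emb_dim)  # stand-in for the embedding output

outputs, hidden = gru(embedded)

print(outputs.shape)     # per-time-step states of the top layer
print(hidden.shape)      # final state of every layer
print(hidden[-1].shape)  # final state of the top layer: one vector per message

# For a unidirectional GRU, the last time step of `outputs` equals hidden[-1].
print(torch.allclose(outputs[:, -1, :], hidden[-1]))
```

Feeding `outputs` to the linear layer yields one prediction per time step, which suits tagging tasks; feeding `hidden[-1]` yields one prediction per message.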

In [94]:
# Vocabulary size. Remember that the inputs here are bag-of-words count vectors.
INPUT_DIM = len(df_bow.columns)
EMBEDDING_DIM = 300  # embedding dimension
HIDDEN_DIM = 256  # hidden dimension of the GRU layers
OUTPUT_DIM = 2  # number of classes

N_LAYERS = 2  # number of layers
DROPOUT = 0.3
BIDIRECTIONAL = False

# Create the model.
modelo_3 = GruNet(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                         N_LAYERS, BIDIRECTIONAL, DROPOUT) #, PAD_IDX)

model_name_3 = 'Gru_Model'  # name under which the model will be saved
n_epochs_3 = 10


# Loss: Cross Entropy
# TAG_PAD_IDX = NER_TAGS.vocab.stoi[NER_TAGS.pad_token]
# criterion_3 = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)
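
`nn.Embedding` looks up integer token indices, and its first argument is the vocabulary size, so a recurrent model is usually fed sequences of token ids instead of bag-of-words counts. The sketch below shows one possible way to build such inputs. The helper names, the reserved ids, and the whitespace tokenization are all assumptions made for illustration and are not part of this notebook:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX, UNK_IDX = 0, 1  # hypothetical reserved ids for padding and unknown tokens

def build_vocab(tokenized_texts):
    """Assign an integer id to every token seen in the training texts."""
    vocab = {"<pad>": PAD_IDX, "<unk>": UNK_IDX}
    for tokens in tokenized_texts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Map tokens to ids, falling back to UNK_IDX for unseen tokens."""
    return torch.tensor([vocab.get(tok, UNK_IDX) for tok in tokens], dtype=torch.long)

# Whitespace tokenization keeps the sketch dependency-free.
texts = ["free entry win cash", "ok see you", "win cash now please call"]
tokenized = [t.split() for t in texts]

vocab = build_vocab(tokenized)
batch = pad_sequence([encode(t, vocab) for t in tokenized],
                     batch_first=True, padding_value=PAD_IDX)

print(len(vocab))   # vocabulary size, i.e. the embedding's input_dim
print(batch.shape)  # (number of messages, longest message length)
```

With inputs like these, the embedding could also be created with `padding_idx=PAD_IDX` so that padding positions map to a zero vector.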

# Dataloaders

In [95]:
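# Hyper-parameters 
batch_size = 32

# CrossEntropyLoss expects integer class ids, so map the string labels first
label_to_id = {"ham": 0, "spam": 1}

# A DataLoader needs an indexable dataset of (features, label) pairs, not a dict
def make_dataset(X, y):
    # long dtype because the first layer of the model is nn.Embedding
    data = torch.tensor(X.to_numpy(), dtype=torch.long)
    labels = torch.tensor(y.map(label_to_id).to_numpy(), dtype=torch.long)
    return torch.utils.data.TensorDataset(data, labels)

train_data = make_dataset(X_train, y_train)
val_data = make_dataset(X_val, y_val)
test_data = make_dataset(X_test, y_test)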

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_data, 
                                           batch_size=batch_size, 
                                           shuffle=True)

val_loader = torch.utils.data.DataLoader(dataset=val_data,
                                         batch_size=batch_size,
                                         shuffle=False)

test_loader = torch.utils.data.DataLoader(dataset=test_data, 
                                          batch_size=batch_size, 
                                          shuffle=False)
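
The training functions iterate with `for (x, y) in iterator`, so every loader has to yield (features, labels) tensor pairs. A minimal sketch of that contract, using `TensorDataset` on synthetic data (the sizes are arbitrary), could look like this:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins: 10 samples with 6 integer features and binary labels.
features = torch.randint(0, 3, (10, 6))
labels = torch.randint(0, 2, (10,))

# TensorDataset pairs the i-th feature row with the i-th label.
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

batch_shapes = []
for x, y in loader:
    batch_shapes.append((tuple(x.shape), tuple(y.shape)))
    print(x.shape, y.shape)

print(len(loader))  # number of batches: ceil(10 / 4)
```

The last batch is smaller than the others unless `drop_last=True` is passed to the `DataLoader`.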

# Training

In [96]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [97]:
momentum = 0.9
log_interval = 100
learning_rate=0.01