<a href="https://colab.research.google.com/github/PedroAdair/CNN_news_detection_spanish/blob/main/CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Centro de Investigación en Matemáticas A.C.
## Unidad Monterrey
### Selected Topics in Data Science, Assignment 3
#### Pedro Adair Gallegos Avila

# Problem 1: On CNNs

# Problem 2: CNN-based news classifier by country

In [None]:
import os 
import re
import pandas as pd
import numpy as np
import torch

In [None]:
"""
NLTK is the library I selected to preprocess my texts.
"""
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk import WordPunctTokenizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

# Spanish stop words
es_stop = set(nltk.corpus.stopwords.words('spanish'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


The first step is to load our dataset, which is split into train/test. For this we use the functions provided by the professor.

In [None]:
def get_texts_from_dir(cat_dir): 
  texts = []
  data_dir = cat_dir 
  category_index = {} 
  categories = []
  for category_name in sorted(os.listdir(data_dir)):
    category_id = len(category_index)
    category_index[category_name] = category_id
    category_path = os.path.join(data_dir, category_name) 
    for f_name in sorted(os.listdir(category_path)):
      f_path = os.path.join(category_path, f_name) 
      # The context manager closes the file even if reading fails
      with open(f_path, "r", encoding="utf8") as f:
        texts += [f.read()]
      categories += [category_id]
  print("%d files loaded from %s" % (len(texts), cat_dir)) 
  return texts, categories, category_index  

In [None]:
tr_txt, tr_y, tr_y_ind = get_texts_from_dir("/content/drive/MyDrive/NLP/train")
te_txt, te_y, te_y_ind = get_texts_from_dir("/content/drive/MyDrive/NLP/test")

2250 files loaded from /content/drive/MyDrive/NLP/train
1000 files loaded from /content/drive/MyDrive/NLP/test


### Step 0: Preprocessing the news articles

First, I preprocess the text, applying the following steps:


*   Remove special characters (`*`, `$`, `#`, etc.).
*   Convert to lowercase.
*   Collapse multiple spaces into a single one.
*   ...



In [None]:
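# NOTE: WordNetLemmatizer relies on the English WordNet, so most Spanish
# tokens pass through it unchanged.
stemmer = WordNetLemmatizer()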
word_punctuation_tokenizer = nltk.WordPunctTokenizer()

def preprocess_text(document):
        # Remove all the special characters
        document = re.sub(r'\W', ' ', str(document))

        # remove all single characters
        document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

        # Remove a single character at the start of the text
        document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

        # Substituting multiple spaces with single space
        document = re.sub(r'\s+', ' ', document, flags=re.I)

        # Removing prefixed 'b'
        document = re.sub(r'^b\s+', '', document)

        # Converting to Lowercase
        document = document.lower()

        # Lemmatization
        tokens = document.split()
        tokens = [stemmer.lemmatize(word) for word in tokens]
        tokens = [word for word in tokens if word not in es_stop]
        tokens = [word for word in tokens if len(word) > 3]

        preprocessed_text = ' '.join(tokens)

        return preprocessed_text

In [None]:
# Preprocess all documents so that uninformative words are not embedded
final_corpus = [preprocess_text(document) for document in tr_txt if document.strip() !='']

In [None]:
# Length (in characters) of a text before and after preprocessing
print("Text length before preprocessing:", len(tr_txt[0]))
print("Text length after preprocessing:", len(final_corpus[0]))

Text length before preprocessing: 3728
Text length after preprocessing: 2612
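
The cleaning steps listed above can be illustrated on a single sentence. This is a minimal sketch rather than the notebook's `preprocess_text`: it skips NLTK entirely, and `basic_clean`, the sample sentence, and the tiny stop-word set are hypothetical names and data chosen only for illustration.

```python
import re

def basic_clean(text, stop_words, min_len=4):
    """Regex-only cleaning: strip symbols, collapse spaces, lowercase, filter."""
    text = re.sub(r'\W', ' ', text)            # special characters -> spaces
    text = re.sub(r'\s+', ' ', text).strip()   # collapse repeated whitespace
    text = text.lower()
    tokens = [w for w in text.split()
              if w not in stop_words and len(w) >= min_len]
    return ' '.join(tokens)

# Hypothetical sample sentence and stop-word set (not taken from the dataset)
sample = "¡El   presidente anunció $500 millones para la educación!"
stops = {"el", "la", "para"}

cleaned = basic_clean(sample, stops)
print(cleaned)  # presidente anunció millones educación
```

Note that `\W` is Unicode-aware for Python 3 strings, so accented letters such as `ó` are kept, while the length filter also drops short numbers such as `500`.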


### Step 1: Tokenize Text Data and Build the Vocabulary

These preprocessed texts are what I tokenize in order to build the vocabulary (depending on the embedding I am going to use).

In [None]:
def tokenize(texts):
    """Tokenize the texts, build the vocabulary, and find the maximum text length.
    
    Args:
        texts (List[str]): List of texts (news articles)
    
    Returns:
        tokenized_texts (List[List[str]]): List of token lists
        word2idx (Dict): Vocabulary built from the corpus
        max_len (int): Length in tokens of the longest text (so that all encoded articles can be padded to the same size)
    """

    max_len = 0
    tokenized_texts = []
    word2idx = {}

    # Add the <pad> and <unk> tokens to the vocabulary.
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1

    # Build the vocabulary, assigning indices starting from 2
    idx = 2
    for sent in texts:
        tokenized_sent = word_tokenize(sent)

        # Add `tokenized_sent` to `tokenized_texts`
        tokenized_texts.append(tokenized_sent)

        # Add new token to `word2idx`
        for token in tokenized_sent:
            if token not in word2idx:
                word2idx[token] = idx
                idx += 1

        # Update `max_len`
        max_len = max(max_len, len(tokenized_sent))

    return tokenized_texts, word2idx, max_len

In [None]:
tokenized_texts, word2idx, max_len = tokenize(final_corpus)

In [None]:
print("Vocabulary size:", len(word2idx))

Vocabulary size: 240140
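
To make the index assignment concrete, here is a small sketch of the same vocabulary-building idea on a toy corpus. It assumes plain whitespace tokenization instead of NLTK's `word_tokenize`, and `build_vocab` and `toy_corpus` are hypothetical names used only for this illustration.

```python
def build_vocab(texts):
    """Return (tokenized_texts, word2idx, max_len) using whitespace tokens."""
    word2idx = {'<pad>': 0, '<unk>': 1}
    tokenized_texts = [text.split() for text in texts]
    for tokens in tokenized_texts:
        for token in tokens:
            # setdefault only inserts unseen tokens; len() is the next free index
            word2idx.setdefault(token, len(word2idx))
    max_len = max((len(tokens) for tokens in tokenized_texts), default=0)
    return tokenized_texts, word2idx, max_len

# Toy corpus (assumption: already preprocessed, lowercase text)
toy_corpus = ["gobierno anuncia reforma", "reforma fiscal"]
toy_tokens, toy_vocab, toy_max_len = build_vocab(toy_corpus)

print(toy_vocab)
print(toy_max_len)  # 3
```

The repeated word `reforma` receives a single index, and the two reserved tokens keep indices 0 and 1.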


Next, we need to encode the preprocessed texts so that they can serve as input to our CNN.
For this, every encoded text must have the same length. The strategy is to give all of them the length of the longest (tokenized) text, appending the padding token **`<pad>`** until they reach that size.

In [None]:
def encode(tokenized_texts, word2idx, max_len):
    """Pad each text to `max_len` and encode its tokens as vocabulary indices.

    Returns:
        input_ids (np.array): Array of token indices in the vocabulary, with
                              shape (N, max_len).
    """

    input_ids = []
    for tokenized_sent in tokenized_texts:
        # Pad texts to max_len (without modifying the input list in place)
        padded_sent = tokenized_sent + ['<pad>'] * (max_len - len(tokenized_sent))

        # Encode tokens to input_ids; tokens missing from the vocabulary map to <unk>
        input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_sent]
        input_ids.append(input_id)
    
    return np.array(input_ids)

In [None]:
input_ids = encode(tokenized_texts, word2idx, max_len)

In [None]:
input_ids.shape

(2250, 5823)
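
The padding and encoding step can be checked by hand on a toy input. The sketch below is pure Python and makes two assumptions that go beyond the notebook: out-of-vocabulary tokens fall back to `<unk>`, and texts longer than `max_len` are truncated (useful when encoding texts that were not used to compute `max_len`). `pad_and_encode` and the toy data are hypothetical.

```python
def pad_and_encode(tokenized_texts, word2idx, max_len):
    """Encode tokens as indices and right-pad every row to max_len."""
    pad_id, unk_id = word2idx['<pad>'], word2idx['<unk>']
    rows = []
    for tokens in tokenized_texts:
        tokens = tokens[:max_len]                      # assumed truncation rule
        ids = [word2idx.get(t, unk_id) for t in tokens]
        ids += [pad_id] * (max_len - len(ids))         # right padding
        rows.append(ids)
    return rows

toy_vocab = {'<pad>': 0, '<unk>': 1, 'reforma': 2, 'fiscal': 3}
toy_texts = [["reforma", "fiscal", "nueva"], ["fiscal"]]

encoded = pad_and_encode(toy_texts, toy_vocab, max_len=4)
print(encoded)  # [[2, 3, 1, 0], [3, 0, 0, 0]]
```

Every row has the same length, which is what allows the rows to be stacked into a rectangular array such as the `(2250, 5823)` one above.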

In [None]:
from tqdm.notebook import tqdm as tqdm_notebook  # `tqdm.tqdm_notebook` is deprecated

def load_pretrained_vectors(word2idx, fname):
    """Load pretrained vectors and create embedding layers.
    
    Args:
        word2idx (Dict): Vocabulary built from the corpus
        fname (str): Path to pretrained vector file

    Returns:
        embeddings (np.array): Embedding matrix with shape (N, d) where N is
            the size of word2idx and d is embedding dimension
    """

    print("Loading pretrained vectors...")
    fin = open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())

    # Initialize random embeddings
    embeddings = np.random.uniform(-0.25, 0.25, (len(word2idx), d))  # fill the matrix with random numbers
    embeddings[word2idx['<pad>']] = np.zeros((d,))  # the '<pad>' row is set to zeros

    # Load pretrained vectors
    count = 0
    for line in tqdm_notebook(fin):
        tokens = line.rstrip().split(' ')
        word = tokens[0]
        if word in word2idx:
            count += 1
            embeddings[word2idx[word]] = np.array(tokens[1:], dtype=np.float32)

    print(f"There are {count} / {len(word2idx)} pretrained vectors found.")

    return embeddings
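
The function above expects the textual `.vec` layout: a header line with the number of words and the dimension, followed by one `word v1 v2 ...` line per word. The sketch below reproduces the same logic on an in-memory file so it can run without downloading any vectors. It is a simplified pure-Python stand-in (lists instead of NumPy arrays), and `load_vec`, `fake_vec`, and the toy vocabulary are hypothetical.

```python
import io
import random

def load_vec(handle, word2idx, seed=0):
    """Build an embedding table from a .vec-style text stream."""
    n_words, dim = map(int, handle.readline().split())   # header: "<n> <d>"
    rng = random.Random(seed)
    # Words without a pretrained vector keep a small random initialization
    table = [[rng.uniform(-0.25, 0.25) for _ in range(dim)]
             for _ in range(len(word2idx))]
    table[word2idx['<pad>']] = [0.0] * dim               # padding row is all zeros
    found = 0
    for line in handle:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        if word in word2idx:
            table[word2idx[word]] = [float(v) for v in values]
            found += 1
    return table, found

# Made-up two-word vector file with 3 dimensions
fake_vec = io.StringIO("2 3\ncasa 0.1 0.2 0.3\nperro 0.4 0.5 0.6\n")
toy_vocab = {'<pad>': 0, '<unk>': 1, 'casa': 2, 'gato': 3}

table, found = load_vec(fake_vec, toy_vocab)
print(found, len(table), len(table[0]))  # 1 4 3
```

Only `casa` is in both the file and the vocabulary, so it is the only row overwritten; `gato` keeps its random vector, which mirrors the "found / total" counts printed in the runs below.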

#### Building the embeddings

For this assignment (as well as for the Amazon reviews) I will use the three embedding models covered in the course: FastText, GloVe, and Word2Vec.

FastText

In [None]:
embeddings_FastText = load_pretrained_vectors(word2idx, "/content/drive/MyDrive/NLP/FastText/embeddings-l-model.vec")
embeddings_FastText = torch.tensor(embeddings_FastText)

Loading pretrained vectors...


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


0it [00:00, ?it/s]

There are 158198 / 240140 pretrained vectors found.


In [None]:
embeddings_FastText.shape

torch.Size([240140, 300])

In [None]:
#Word2Vec
embeddings_Word2Vec = load_pretrained_vectors(word2idx, "/content/drive/MyDrive/T3/SBW-vectors-300-min5.txt")
embeddings_Word2Vec = torch.tensor(embeddings_Word2Vec)

Loading pretrained vectors...


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


0it [00:00, ?it/s]

There are 122480 / 240140 pretrained vectors found.


GloVe

In [None]:
# GloVe
embeddings_Glove = load_pretrained_vectors(word2idx, "/content/drive/MyDrive/T3/glove-sbwc.i25.vec")
embeddings_Glove = torch.tensor(embeddings_Glove)

Loading pretrained vectors...


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`


0it [00:00, ?it/s]

There are 150800 / 240140 pretrained vectors found.


#### Building the DataLoader

batch_size=50

In [None]:
from torch.utils.data import (TensorDataset, DataLoader, RandomSampler, SequentialSampler)

In [None]:
def data_loader(train_inputs, val_inputs, train_labels, val_labels,
                batch_size=50):
    """Convert train and validation sets to torch.Tensors and load them to
    DataLoader.
    """

    # Convert data type to torch.Tensor
    train_inputs, val_inputs, train_labels, val_labels =\
    tuple(torch.tensor(data) for data in
          [train_inputs, val_inputs, train_labels, val_labels])

    # `batch_size` is taken from the function argument (default: 50)

    # Create DataLoader for training data
    train_data = TensorDataset(train_inputs, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    # Create DataLoader for validation data
    val_data = TensorDataset(val_inputs, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

    return train_dataloader, val_dataloader

In [None]:
from sklearn.model_selection import train_test_split

# Train Test Split
train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    input_ids,tr_y , test_size=0.25, random_state=42)

# Load data to PyTorch DataLoader
train_dataloader, val_dataloader = \
data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size=50)
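
As a sanity check on the split, the sizes can be worked out by hand. The arithmetic below assumes that `train_test_split` rounds the validation size up (my understanding of its behavior for a float `test_size`) and that the `DataLoader` keeps the final partial batch, which is its default. The variable names are hypothetical.

```python
import math

n_samples = 2250    # training files reported by get_texts_from_dir
test_size = 0.25
batch_size = 50

n_val = math.ceil(test_size * n_samples)   # assumed rounding rule
n_train = n_samples - n_val

train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)
last_val_batch = n_val - (val_batches - 1) * batch_size

print(n_train, n_val)                              # 1687 563
print(train_batches, val_batches, last_val_batch)  # 34 12 13
```

Under these assumptions the last validation batch holds only 13 samples, which matters when accuracies are averaged per batch.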

### Step 2: Building the CNN

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
class CNN_NLP(nn.Module):
    """A 1D Convolutional Neural Network for Sentence Classification."""
    def __init__(self,
                 pretrained_embedding=None,
                 freeze_embedding=False,
                 vocab_size=None,
                 embed_dim=300,
                 filter_sizes=[7, 6, 9],
                 num_filters=[100, 100, 100],
                 num_classes=5,
                 dropout=0.5):
        """
        The constructor for CNN_NLP class.

        Args:
            pretrained_embedding (torch.Tensor): Pretrained embeddings with
                shape (vocab_size, embed_dim)
            freeze_embedding (bool): Set to False to fine-tune pretrained
                vectors. Default: False
            vocab_size (int): Needs to be specified when pretrained word
                embeddings are not used.
            embed_dim (int): Dimension of word vectors. Needs to be specified
                when pretrained word embeddings are not used. Default: 300
            filter_sizes (List[int]): List of filter sizes. Default: [7, 6, 9]
            num_filters (List[int]): List of number of filters, has the same
                length as `filter_sizes`. Default: [100, 100, 100]
            num_classes (int): Number of classes. Default: 5
            dropout (float): Dropout rate. Default: 0.5
        """

        super(CNN_NLP, self).__init__()

        #1. Embedding layer
        # Case: a pretrained embedding is provided
        if pretrained_embedding is not None:
            self.vocab_size, self.embed_dim = pretrained_embedding.shape
            self.embedding = nn.Embedding.from_pretrained(pretrained_embedding, freeze=freeze_embedding)
        else:
            self.embed_dim = embed_dim
            self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                          embedding_dim=self.embed_dim,
                                          padding_idx=0,
                                          max_norm=5.0)
        #2. Convolution layers
        self.conv1d_list = nn.ModuleList([
            nn.Conv1d(in_channels=self.embed_dim,
                      out_channels=num_filters[i],
                      kernel_size=filter_sizes[i])
            for i in range(len(filter_sizes))
        ])
        #3. Fully-connected layer and Dropout
        self.fc = nn.Linear(np.sum(num_filters), num_classes)

        #4. dropout 
        self.dropout = nn.Dropout(p=dropout)


    def forward(self, input_ids):
        """Perform a forward pass through the network.

        Args:
            input_ids (torch.Tensor): A tensor of token ids with shape
                (batch_size, max_sent_length)

        Returns:
            logits (torch.Tensor): Output logits with shape (batch_size,
                num_classes)
        """

        # Get embeddings from `input_ids`. Output shape: (b, max_len, embed_dim)
        x_embed = self.embedding(input_ids).float()

        # Permute `x_embed` to match input shape requirement of `nn.Conv1d`.
        # Output shape: (b, embed_dim, max_len)
        x_reshaped = x_embed.permute(0, 2, 1)

        # Apply CNN and ReLU. Output shape: (b, num_filters[i], L_out)
        x_conv_list = [F.relu(conv1d(x_reshaped)) for conv1d in self.conv1d_list]

        # Max pooling. Output shape: (b, num_filters[i], 1)
        x_pool_list = [F.max_pool1d(x_conv, kernel_size=x_conv.shape[2])
            for x_conv in x_conv_list]
        
        # Concatenate x_pool_list to feed the fully connected layer.
        # Output shape: (b, sum(num_filters))
        x_fc = torch.cat([x_pool.squeeze(dim=2) for x_pool in x_pool_list],
                         dim=1)
        
        # Compute logits. Output shape: (b, n_classes)
        logits = self.fc(self.dropout(x_fc))

        return logits
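
The shape comments in `forward` can be verified with a little arithmetic. The sketch below uses the output-length formula documented for `nn.Conv1d` and plugs in this notebook's `max_len` and default filter sizes; `conv1d_out_len` is a hypothetical helper, not part of PyTorch.

```python
def conv1d_out_len(l_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution (formula from the nn.Conv1d docs)."""
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

max_len = 5823                 # second dimension of input_ids above
filter_sizes = [7, 6, 9]
num_filters = [100, 100, 100]

conv_lengths = [conv1d_out_len(max_len, k) for k in filter_sizes]
fc_in = sum(num_filters)       # global max pooling leaves one value per filter

print(conv_lengths)  # [5817, 5818, 5815]
print(fc_in)         # 300
```

Because max pooling spans the whole `L_out` axis, the differing convolution lengths do not matter: each branch contributes exactly `num_filters[i]` features to the fully connected layer.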

In [None]:
import torch.optim as optim

def initilize_model(pretrained_embedding=None,
                    freeze_embedding=False,
                    vocab_size=None,
                    embed_dim=300,
                    filter_sizes=[7, 6, 9], #original [3, 4, 5]
                    num_filters=[100, 100, 100],
                    num_classes=5,
                    dropout=0.5,
                    learning_rate=0.01):
    """Instantiate a CNN model and an optimizer."""

    assert (len(filter_sizes) == len(num_filters)), "filter_sizes and \
    num_filters need to be of the same length."

    # Instantiate CNN model
    cnn_model = CNN_NLP(pretrained_embedding=pretrained_embedding,
                        freeze_embedding=freeze_embedding,
                        vocab_size=vocab_size,
                        embed_dim=embed_dim,
                        filter_sizes=filter_sizes,
                        num_filters=num_filters,
                        num_classes=num_classes,
                        dropout=dropout)
    
    # Send model to `device` (GPU/CPU)
    cnn_model.to(device)

    # Instantiate SGD optimizer
    optimizer = optim.SGD(cnn_model.parameters(),
                               lr=learning_rate,momentum=0.05)

    return cnn_model, optimizer
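
For intuition about the optimizer settings, the following sketch applies the classical momentum update to a one-dimensional toy problem, using the same `lr=0.25` and `momentum=0.05` values as the runs below. It follows my understanding of how `torch.optim.SGD` combines the velocity and the gradient when dampening and Nesterov are off; `sgd_momentum_step` and the quadratic objective are hypothetical.

```python
def sgd_momentum_step(param, grad, velocity, lr, momentum):
    """One SGD step with classical (non-Nesterov) momentum."""
    velocity = momentum * velocity + grad   # accumulate past gradients
    param = param - lr * velocity           # move against the velocity
    return param, velocity

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 1
x, v = 1.0, 0.0
history = []
for _ in range(5):
    grad = 2.0 * x
    x, v = sgd_momentum_step(x, grad, v, lr=0.25, momentum=0.05)
    history.append(x)

print(history)
```

With such a small momentum coefficient the velocity is dominated by the current gradient, so the behavior stays close to plain SGD.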

In [None]:
import random
import time

# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility."""

    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, optimizer, train_dataloader, val_dataloader=None, epochs=10):
    """Train the CNN model."""
    
    # Tracking best validation accuracy
    best_accuracy = 0

    # Start training loop
    print("Start training...\n")
    print(f"{'Epoch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
    print("-"*60)

    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================

        # Tracking time and loss
        t0_epoch = time.time()
        total_loss = 0

        # Put the model into the training mode
        model.train()

        for step, batch in enumerate(train_dataloader):
            # Load batch to GPU
            b_input_ids, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Update parameters
            optimizer.step()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        # =======================================
        #               Evaluation
        # =======================================
        if val_dataloader is not None:
            # After the completion of each training epoch, measure the model's
            # performance on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Track the best accuracy
            if val_accuracy > best_accuracy:
                best_accuracy = val_accuracy

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            print(f"{epoch_i + 1:^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            
    print("\n")
    print(f"Training complete! Best accuracy: {best_accuracy:.2f}%.")

def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's
    performance on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled
    # during the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy
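
One detail of `evaluate` is that it averages the per-batch accuracies, so a smaller final batch counts as much as a full one. The toy comparison below shows how this can differ from the accuracy computed over all samples; the function names and the `(correct, size)` pairs are made up for illustration.

```python
def batch_mean_accuracy(batches):
    """Mean of per-batch accuracies, as computed in evaluate()."""
    return sum(100.0 * correct / size for correct, size in batches) / len(batches)

def sample_weighted_accuracy(batches):
    """Accuracy over all samples, regardless of batch boundaries."""
    total_correct = sum(correct for correct, _ in batches)
    total = sum(size for _, size in batches)
    return 100.0 * total_correct / total

# Hypothetical results: a full batch of 50 and a final batch of 13
toy_batches = [(40, 50), (13, 13)]

per_batch = batch_mean_accuracy(toy_batches)
per_sample = sample_weighted_accuracy(toy_batches)
print(round(per_batch, 2), round(per_sample, 2))  # 90.0 84.13
```

With many full batches and a single short one the gap is much smaller than in this two-batch example, but it explains why reported accuracies may not be exact multiples of `1 / n_val`.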

### Step 3: Training the different CNN configurations with embeddings

Since training is the most computationally expensive part, it is the only one we run on the GPU.

In [None]:
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

#device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla T4


#### Model without transfer learning

In [None]:
set_seed(42)  
cnn_rand, optimizer = initilize_model(vocab_size=len(word2idx),
                                      embed_dim=300,
                                      learning_rate=0.25,
                                      dropout=0.5)
# Takes roughly 4 to 5 minutes (20 epochs at about 13 s each)
train(cnn_rand, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   2.902345   |  1.870407  |   19.06   |   18.80  
   2    |   1.651878   |  1.372272  |   43.94   |   12.46  
   3    |   1.352889   |  1.790661  |   56.46   |   12.64  
   4    |   0.864116   |  1.072754  |   58.01   |   12.81  
   5    |   0.467790   |  1.275605  |   47.68   |   13.00  
   6    |   0.285812   |  0.895876  |   67.29   |   13.20  
   7    |   0.189145   |  0.987490  |   63.63   |   13.43  
   8    |   0.144791   |  0.913058  |   66.29   |   13.61  
   9    |   0.118859   |  0.891463  |   64.51   |   13.85  
  10    |   0.086746   |  0.922402  |   67.41   |   14.00  
  11    |   0.082495   |  0.955410  |   68.27   |   13.86  
  12    |   0.057228   |  0.909804  |   64.35   |   13.72  
  13    |   0.046798   |  0.885539  |   66.51   |   13.69  
  14    |   0.053101   |  0.941372  |   66.96   |   13.71  
  15    |   0.046822

#### Models with FastText

##### Transfer-learning model with unfrozen FastText

In [None]:
# CNN-non-static: fastText pretrained word vectors are fine-tuned during training, using the SGD optimizer.
set_seed(42)
cnn_FT_descongelado, optimizer = initilize_model(pretrained_embedding=embeddings_FastText,
                                            freeze_embedding=False,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_FT_descongelado, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   1.697728   |  1.527699  |   29.76   |   15.35  
   2    |   1.320394   |  1.088348  |   68.38   |   15.59  
   3    |   0.942693   |  0.967022  |   64.27   |   15.54  
   4    |   0.587022   |  0.759043  |   75.41   |   15.29  
   5    |   0.387299   |  0.796224  |   73.10   |   15.20  
   6    |   0.244358   |  0.834356  |   69.99   |   15.25  
   7    |   0.152729   |  0.676685  |   75.77   |   15.36  
   8    |   0.111130   |  0.702182  |   75.91   |   15.39  
   9    |   0.071565   |  0.681294  |   76.24   |   15.33  
  10    |   0.057133   |  0.709138  |   76.58   |   15.31  
  11    |   0.040875   |  0.737786  |   75.27   |   15.30  
  12    |   0.031480   |  0.683813  |   77.74   |   15.31  
  13    |   0.034923   |  0.668537  |   76.41   |   15.30  
  14    |   0.024339   |  0.689228  |   77.41   |   15.31  
  15    |   0.022598

##### Transfer-learning model with frozen FastText

In [None]:
# CNN-static: fastText pretrained word vectors are kept frozen during training.
# SGD optimizer with momentum=0.05 (classical momentum; `nesterov` is left at its default of False).
# Filter sizes: [7, 6, 9]
set_seed(42)
cnn_FT_congelado, optimizer = initilize_model(pretrained_embedding=embeddings_FastText,
                                            freeze_embedding=True,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_FT_congelado, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   1.697808   |  1.527705  |   30.23   |   10.73  
   2    |   1.320893   |  1.090536  |   68.05   |   10.74  
   3    |   0.946182   |  0.945285  |   66.10   |   10.77  
   4    |   0.589823   |  0.775185  |   74.24   |   10.73  
   5    |   0.391174   |  0.811221  |   72.60   |   10.73  
   6    |   0.244770   |  0.836334  |   70.65   |   10.73  
   7    |   0.156520   |  0.684125  |   76.27   |   10.73  
   8    |   0.116348   |  0.695498  |   76.74   |   10.73  
   9    |   0.072727   |  0.679860  |   76.27   |   10.74  
  10    |   0.058949   |  0.709124  |   76.58   |   10.73  
  11    |   0.042707   |  0.702628  |   75.77   |   10.73  
  12    |   0.032946   |  0.680841  |   76.24   |   10.75  
  13    |   0.033870   |  0.662249  |   76.91   |   10.75  
  14    |   0.025259   |  0.687512  |   77.08   |   10.73  
  15    |   0.023382

#### Models with GloVe

##### Transfer-learning model with unfrozen GloVe

In [None]:
set_seed(42)
cnn_Glove_descongelado, optimizer = initilize_model(pretrained_embedding=embeddings_Glove,
                                            freeze_embedding=False,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_Glove_descongelado, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   1.784248   |  1.470333  |   32.90   |   15.59  
   2    |   1.285337   |  1.127403  |   62.72   |   15.65  
   3    |   0.901063   |  1.155124  |   55.49   |   15.42  
   4    |   0.461553   |  0.920625  |   67.41   |   15.24  
   5    |   0.247919   |  0.877979  |   67.91   |   15.23  
   6    |   0.129137   |  0.908447  |   68.58   |   15.33  
   7    |   0.088332   |  0.836383  |   70.24   |   15.41  
   8    |   0.065097   |  0.898813  |   68.24   |   15.39  
   9    |   0.049564   |  0.842117  |   68.94   |   15.38  
  10    |   0.039615   |  0.886042  |   69.58   |   15.34  
  11    |   0.036421   |  0.875027  |   69.24   |   15.32  
  12    |   0.030070   |  0.888481  |   67.58   |   15.30  
  13    |   0.028746   |  0.853048  |   68.58   |   15.33  
  14    |   0.020661   |  0.846773  |   70.08   |   15.35  
  15    |   0.016160

##### Transfer-learning model with frozen GloVe

In [None]:
set_seed(42)
cnn_Glove_congelado, optimizer = initilize_model(pretrained_embedding=embeddings_Glove,
                                            freeze_embedding=True,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_Glove_congelado, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   1.784824   |  1.474001  |   32.40   |   10.62  
   2    |   1.297701   |  1.124493  |   63.58   |   10.71  
   3    |   0.898814   |  1.146242  |   56.96   |   10.89  
   4    |   0.466982   |  0.920381  |   66.27   |   10.93  
   5    |   0.251917   |  0.883979  |   67.44   |   10.79  
   6    |   0.131746   |  0.894949  |   68.24   |   10.69  
   7    |   0.087063   |  0.872704  |   71.72   |   10.63  
   8    |   0.066119   |  0.928833  |   66.60   |   10.62  
   9    |   0.050937   |  0.849552  |   71.22   |   10.66  
  10    |   0.039579   |  0.909317  |   70.38   |   10.71  
  11    |   0.039085   |  0.857885  |   71.05   |   10.75  
  12    |   0.025914   |  0.882029  |   69.72   |   10.73  
  13    |   0.025012   |  0.849495  |   71.05   |   10.73  
  14    |   0.021522   |  0.847735  |   70.74   |   10.74  
  15    |   0.016679

#### Models with Word2Vec

##### Transfer-learning model with unfrozen Word2Vec

In [None]:
set_seed(42)
cnn_W2V_descongelado, optimizer = initilize_model(pretrained_embedding=embeddings_Word2Vec,
                                            freeze_embedding=False,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_W2V_descongelado, optimizer, train_dataloader, val_dataloader, epochs=20)


Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   1.618980   |  1.620095  |   19.33   |   15.27  
   2    |   1.581350   |  1.587210  |   28.73   |   15.50  
   3    |   1.517161   |  1.542086  |   34.21   |   15.62  
   4    |   1.401760   |  1.508827  |   34.71   |   15.42  
   5    |   1.297745   |  1.502216  |   37.68   |   15.26  
   6    |   1.129275   |  1.500030  |   37.18   |   15.23  
   7    |   0.945362   |  1.425271  |   40.87   |   15.30  
   8    |   0.738905   |  1.380465  |   44.87   |   15.39  
   9    |   0.559340   |  1.360952  |   46.04   |   15.37  
  10    |   0.393842   |  1.363100  |   45.04   |   15.36  
  11    |   0.283393   |  1.335589  |   44.21   |   15.32  
  12    |   0.202568   |  1.302149  |   47.85   |   15.32  
  13    |   0.152474   |  1.303223  |   49.51   |   15.31  
  14    |   0.114092   |  1.289274  |   49.18   |   15.30  
  15    |   0.102521

##### Transfer-learning model with frozen Word2Vec

In [None]:
set_seed(42)
cnn_W2V_congelado, optimizer = initilize_model(pretrained_embedding=embeddings_Word2Vec,
                                            freeze_embedding=True,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_W2V_congelado, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   1.618991   |  1.620120  |   19.33   |   10.74  
   2    |   1.582052   |  1.588106  |   28.56   |   10.73  
   3    |   1.519676   |  1.545048  |   34.87   |   10.74  
   4    |   1.407127   |  1.511000  |   35.04   |   10.75  
   5    |   1.306250   |  1.503754  |   37.51   |   10.76  
   6    |   1.141716   |  1.502073  |   37.01   |   10.76  
   7    |   0.964340   |  1.431219  |   41.21   |   10.74  
   8    |   0.764023   |  1.388179  |   44.68   |   10.74  
   9    |   0.588444   |  1.370532  |   44.21   |   10.73  
  10    |   0.420380   |  1.375874  |   44.71   |   10.75  
  11    |   0.306461   |  1.347252  |   44.37   |   10.74  
  12    |   0.221151   |  1.310915  |   48.35   |   10.73  
  13    |   0.167255   |  1.315927  |   48.21   |   10.73  
  14    |   0.125688   |  1.304233  |   46.71   |   10.73  
  15    |   0.111310

### Step 4: Evaluating the best models

Training the networks above yields the following table, which compares each model, whether its embeddings were fine-tuned, its accuracy, and the epoch at which it reached that result. Several of the architectures needed many epochs to reach their best results, and some did not reach them until the very end, which may indicate that they need more training epochs to achieve better results.

All networks were run with the same hyperparameters.


Model | Best validation accuracy | Epoch
--- | --- | ---
No transfer learning | 69.77% | 18
FastText, unfrozen | 77.74% | 12
FastText, frozen | 77.08% | 14
Word2Vec, unfrozen | 50.49% | 19
Word2Vec, frozen | 50.49% | 20
GloVe, unfrozen | 70.24% | 7
GloVe, frozen | 71.72% | 7

From the above, FastText was the most suitable model for this type of data with its particular preprocessing, while GloVe reached its best results in the fewest epochs. It is also worth noting that the model without transfer learning gave satisfactory results, coming within about two percentage points of GloVe (69.77% versus 70.24% and 71.72%).
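
As a quick check of this comparison, the rows of the table can be sorted by accuracy. This is only a minimal sketch: the `results` list transcribes the table above, and the variable names are illustrative rather than part of the notebook.

```python
# Rows transcribed from the results table: (model, accuracy %, best epoch).
results = [
    ("No transfer learning", 69.77, 18),
    ("FastText unfrozen", 77.74, 20),
    ("FastText frozen", 77.08, 14),
    ("Word2Vec unfrozen", 50.49, 19),
    ("Word2Vec frozen", 50.49, 20),
    ("GloVe unfrozen", 70.24, 7),
    ("GloVe frozen", 71.72, 7),
]

# Sort from highest to lowest accuracy.
ranked = sorted(results, key=lambda row: row[1], reverse=True)

# Model(s) whose best result came at the earliest epoch.
earliest_epoch = min(row[2] for row in results)
earliest = [row[0] for row in results if row[2] == earliest_epoch]

for name, acc, epoch in ranked:
    print(f"{name:<22} {acc:6.2f}%  (epoch {epoch})")
print("Earliest peak:", earliest)
```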

### Step 5: Conclusions

#Problem 3: Category and sentiment classification on the Amazon Reviews Corpus


We extract what we need from the reviews (the review text, the rating, and the product category), only for the product categories of interest.

In [None]:
#train=dataset['train'][:'review_body','stars','product_category']
train= df_train.loc[:,['review_body','stars','product_category']]
train_tmp = pd.DataFrame()  # empty placeholder DataFrame

In [None]:
categorias = ['home', 'wireless', 'toy', 'sports', 'pc', 'home_improvement', 'electronics']

def select_by_category(dataset_file, category_name):
  data_reducido = dataset_file.loc[:, ['review_body','stars','product_category']]
  data_reducido = data_reducido[data_reducido['product_category']==category_name]
  return (data_reducido)

train_tmp = pd.DataFrame()  # empty placeholder DataFrame

In [None]:
t1 = select_by_category(train,categorias[0])
t2 = select_by_category(train,categorias[1])
t3 = select_by_category(train,categorias[2])
t4 = select_by_category(train,categorias[3])
t5 = select_by_category(train,categorias[4])
t6 = select_by_category(train,categorias[5])
t7 = select_by_category(train,categorias[6])

In [None]:
# Concatenate the review texts of all seven categories, in category order
training = [text for part in (t1, t2, t3, t4, t5, t6, t7) for text in part['review_body']]

In [None]:
# Stack the per-category DataFrames in the same order as `training`
train_clean = pd.concat([t1, t2, t3, t4, t5, t6, t7])

In [None]:
print(len(t1)+len(t2)+len(t3)+len(t4)+len(t5)+len(t6)+len(t7))
print(len(training))

112139
112139
112139


Now we convert the star ratings to negative (0) and positive (1)

In [None]:
def stars_to_sentiment(df_star):
  df = df_star['stars']
  encoded_labels = np.array([1 if label >= 3 else 0 for label in df])
  return encoded_labels

In [None]:
train_clean['stars'] = stars_to_sentiment(train_clean)
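
To make the threshold explicit, here is a small self-contained sketch of the same mapping applied to a toy DataFrame (the sample ratings are made up for illustration). Note that with `label >= 3`, three-star reviews are counted as positive.

```python
import numpy as np
import pandas as pd

def stars_to_sentiment(df_star):
    """Map star ratings to 0 (negative) or 1 (positive), as in the cell above."""
    stars = df_star['stars']
    return np.array([1 if label >= 3 else 0 for label in stars])

# Toy data, made up for illustration only.
toy = pd.DataFrame({'stars': [1, 2, 3, 4, 5]})
toy_sentiment = stars_to_sentiment(toy)
print(toy_sentiment.tolist())
```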

Now I convert the product categories to numeric codes

In [None]:
train_clean.product_category = pd.Categorical(train_clean.product_category)  # change the column dtype to categorical

In [None]:
train_clean['product_category'] = train_clean.product_category.cat.codes
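
When no categories are given, `pd.Categorical` infers them in sorted order, so the integer codes follow alphabetical order rather than the order of the `categorias` list. The sketch below illustrates this on a toy column (values chosen for illustration). It also shows one possible way to pin the mapping by passing `categories=` explicitly, which keeps the codes consistent between splits even if one split is missing a category.

```python
import pandas as pd

categorias = ['home', 'wireless', 'toy', 'sports', 'pc',
              'home_improvement', 'electronics']

# Toy column, for illustration only.
toy = pd.Series(['wireless', 'home', 'electronics', 'home'])

# Default behavior: categories are inferred and sorted alphabetically.
inferred = pd.Categorical(toy)
inferred_codes = list(inferred.codes)

# Explicit categories: codes follow the order of `categorias`.
fixed = pd.Categorical(toy, categories=categorias)
fixed_codes = list(fixed.codes)

print(list(inferred.categories), inferred_codes)
print(fixed_codes)
```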

In [None]:
y_train_star = train_clean['stars']
y_train_cat = train_clean['product_category']
X_train = training

In [None]:
val= df_validation.loc[:,['review_body','stars','product_category']]
val_tmp = pd.DataFrame()  # empty placeholder DataFrame

In [None]:
v1 = select_by_category(val,categorias[0])
v2 = select_by_category(val,categorias[1])
v3 = select_by_category(val,categorias[2])
v4 = select_by_category(val,categorias[3])
v5 = select_by_category(val,categorias[4])
v6 = select_by_category(val,categorias[5])
v7 = select_by_category(val,categorias[6])

In [None]:
# Stack the per-category validation DataFrames in category order
val_clean = pd.concat([v1, v2, v3, v4, v5, v6, v7])

In [None]:
# Concatenate the validation review texts of all seven categories
validation = [text for part in (v1, v2, v3, v4, v5, v6, v7) for text in part['review_body']]

Stars

In [None]:
val_clean['stars'] = stars_to_sentiment(val_clean)

Categories

In [None]:
val_clean.product_category = pd.Categorical(val_clean.product_category)  # change the column dtype to categorical
val_clean['product_category'] = val_clean.product_category.cat.codes

We obtain the validation set and its labels for both classification tasks

In [None]:
y_val_star = val_clean['stars']
y_val_cat  = val_clean['product_category']
X_val      = val_clean['review_body']

In [None]:
print(len(v1)+len(v2)+len(v3)+len(v4)+len(v5)+len(v6)+len(v7))
print(len(y_val_cat))
print(len(validation))

2817
2817
2817


In [None]:
texts  = training 
labels = y_train_cat 

In [None]:
print( len(training)) 
print(len(y_train_cat))

112139
112139


# Problem 3: Product and sentiment classifier for the "Amazon reviews" dataset

## Step 0: Downloading and preprocessing the data

In [None]:
#!pip install datasets
# Must be run every time the code is executed from scratch

In [None]:
# Download the corpus
from datasets import load_dataset, get_dataset_config_names
from IPython.display import display, HTML
dataset_name = "amazon_reviews_multi"
dataset = load_dataset(path=dataset_name, name="es")

In [None]:
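# Convert each split to a pandas DataFrame so the boolean-mask filtering below works
train = dataset['train'].to_pandas()
validation = dataset['validation'].to_pandas()
test = dataset['test'].to_pandas()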

In [None]:
train1=train[train['product_category']=='home']
train2=train[train['product_category']=='wireless']
train3=train[train['product_category']=='toy']
train4=train[train['product_category']=='sports']
train5=train[train['product_category']=='pc']
train6=train[train['product_category']=='home_improvement']
train7=train[train['product_category']=='electronics']

In [None]:
# Concatenate texts and category names of all seven categories, in the same order
train_parts = [train1, train2, train3, train4, train5, train6, train7]
training = [text for part in train_parts for text in part['review_body']]
labels = [label for part in train_parts for label in part['product_category']]


In [None]:
# Map each category name to its integer id (any other value falls back to 6)
label_to_id = {'home': 0, 'wireless': 1, 'toy': 2, 'sports': 3,
               'pc': 4, 'home_improvement': 5, 'electronics': 6}
labels = [label_to_id.get(label, 6) for label in labels]

In [None]:
stemmer = WordNetLemmatizer()
word_punctuation_tokenizer = nltk.WordPunctTokenizer()

def preprocess_text(document):
        # Remove all the special characters
        document = re.sub(r'\W', ' ', str(document))

        # remove all single characters
        document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

        # Remove a single character at the start (unescaped ^ anchors to the start of the string)
        document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

        # Substituting multiple spaces with single space
        document = re.sub(r'\s+', ' ', document, flags=re.I)

        # Removing prefixed 'b'
        document = re.sub(r'^b\s+', '', document)

        # Converting to Lowercase
        document = document.lower()

        # Lemmatization
        tokens = document.split()
        tokens = [stemmer.lemmatize(word) for word in tokens]
        tokens = [word for word in tokens if word not in es_stop]
        tokens = [word for word in tokens if len(word) > 3]

        preprocessed_text = ' '.join(tokens)

        return preprocessed_text
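
One detail of the start-of-string pattern is worth a quick check: in a regular expression, `^` anchors the match to the start of the string, whereas `\^` matches a literal caret character. The sketch below compares the two on a made-up sample string.

```python
import re

sample = "a producto muy bueno"  # made-up sample text

# Escaped caret: looks for a literal '^' character, so nothing is removed here.
escaped = re.sub(r'\^[a-zA-Z]\s+', ' ', sample)

# Unescaped caret: anchors at the start and removes the leading single letter.
anchored = re.sub(r'^[a-zA-Z]\s+', ' ', sample)

print(repr(escaped))
print(repr(anchored))
```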

We preprocess the text

In [None]:
final_corpus = [preprocess_text(document) for document in texts if document.strip() !='']
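
Because empty documents are dropped at this step, the label list would need to be filtered with the same condition. Otherwise texts and labels can fall out of step. A minimal sketch of one way to do this, using made-up data and hypothetical names (`toy_texts`, `toy_labels`) rather than the notebook's variables:

```python
# Made-up data for illustration; the second document is blank.
toy_texts = ["muy buen producto", "   ", "no funciona"]
toy_labels = [0, 3, 1]

# Keep only the (text, label) pairs whose text is non-empty.
kept = [(t, y) for t, y in zip(toy_texts, toy_labels) if t.strip() != '']
kept_texts = [t for t, _ in kept]
kept_labels = [y for _, y in kept]

print(kept_texts, kept_labels)
```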

In [None]:
# Tokenize, build vocabulary, encode tokens
tokenized_texts, word2idx, max_len = tokenize(final_corpus)
input_ids = encode(tokenized_texts, word2idx, max_len)

#### Building the DataLoader

In [None]:
from torch.utils.data import (TensorDataset, DataLoader, RandomSampler, SequentialSampler)

In [None]:
def data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size=50):
    """Convert train and validation sets to torch.Tensors and load them to
    DataLoader.
    """

    # Convert data type to torch.Tensor
    train_inputs, val_inputs, train_labels, val_labels =tuple(torch.tensor(data) for data in [train_inputs, val_inputs, train_labels, val_labels])

    # batch_size is taken from the function argument (default: 50)

    # Create DataLoader for training data
    train_data = TensorDataset(train_inputs, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    # Create DataLoader for validation data
    val_data = TensorDataset(val_inputs, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

    return train_dataloader, val_dataloader

In [None]:
from sklearn.model_selection import train_test_split

# Train Test Split
train_inputs, val_inputs, train_labels, val_labels = train_test_split(input_ids, labels, test_size=0.1, random_state=42)

# Load data to PyTorch DataLoader
train_dataloader, val_dataloader = data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size=50)

#### Building the embeddings

FastText

In [None]:
embeddings_FastText = load_pretrained_vectors(word2idx, "/content/drive/MyDrive/NLP/FastText/embeddings-l-model.vec")
embeddings_FastText = torch.tensor(embeddings_FastText)

Loading pretrained vectors...


There are 37092 / 39720 pretrained vectors found.


Word2Vec

In [None]:
#Word2Vec
embeddings_Word2Vec = load_pretrained_vectors(word2idx, "/content/drive/MyDrive/T3/SBW-vectors-300-min5.txt")
embeddings_Word2Vec = torch.tensor(embeddings_Word2Vec)

Loading pretrained vectors...


There are 34022 / 39720 pretrained vectors found.


GloVe

In [None]:
#GloVe
embeddings_Glove = load_pretrained_vectors(word2idx, "/content/drive/MyDrive/T3/glove-sbwc.i25.vec")
embeddings_Glove = torch.tensor(embeddings_Glove)

Loading pretrained vectors...


There are 34952 / 39720 pretrained vectors found.
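
The three loading logs above can be summarized as vocabulary coverage, that is, the fraction of the 39720 vocabulary words for which a pretrained vector was found. A short sketch using only the counts printed above (the variable names are illustrative):

```python
vocab_size = 39720
found = {"FastText": 37092, "Word2Vec": 34022, "GloVe": 34952}

# Fraction of vocabulary words that have a pretrained vector.
coverage = {name: n / vocab_size for name, n in found.items()}

for name, frac in sorted(coverage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<9} {frac:.1%}")
```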


##Step 1: Building the CNN for product-category classification
We use the same network as in Problem 2, adjusted to 7 classes (one per product type)

In [None]:
class CNN_NLP(nn.Module):
    """A 1D Convolutional Neural Network for Sentence Classification."""
    def __init__(self,
                 pretrained_embedding=None,
                 freeze_embedding=False,
                 vocab_size=None,
                 embed_dim=300,
                 filter_sizes=[3, 4, 5],
                 num_filters=[100, 100, 100],
                 num_classes=7,
                 dropout=0.5):
        """
        The constructor for CNN_NLP class.

        Args:
            pretrained_embedding (torch.Tensor): Pretrained embeddings with
                shape (vocab_size, embed_dim)
            freeze_embedding (bool): Set to False to fine-tune pretrained
                vectors. Default: False
            vocab_size (int): Needs to be specified when pretrained word
                embeddings are not used.
            embed_dim (int): Dimension of word vectors. Need to be specified
                when pretrained word embeddings are not used. Default: 300
            filter_sizes (List[int]): List of filter sizes. Default: [3, 4, 5]
            num_filters (List[int]): List of number of filters, has the same
                length as `filter_sizes`. Default: [100, 100, 100]
            num_classes (int): Number of classes. Default: 7
            dropout (float): Dropout rate. Default: 0.5
        """

        super(CNN_NLP, self).__init__()
        # Embedding layer
        if pretrained_embedding is not None:
            self.vocab_size, self.embed_dim = pretrained_embedding.shape
            self.embedding = nn.Embedding.from_pretrained(pretrained_embedding,
                                                          freeze=freeze_embedding)
        else:
            self.embed_dim = embed_dim
            self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                          embedding_dim=self.embed_dim,
                                          padding_idx=0,
                                          max_norm=5.0)
        # Conv Network
        self.conv1d_list = nn.ModuleList([
            nn.Conv1d(in_channels=self.embed_dim,
                      out_channels=num_filters[i],
                      kernel_size=filter_sizes[i])
            for i in range(len(filter_sizes))
        ])
        # Fully-connected layer and Dropout
        self.fc = nn.Linear(np.sum(num_filters), num_classes)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, input_ids):
        """Perform a forward pass through the network.

        Args:
            input_ids (torch.Tensor): A tensor of token ids with shape
                (batch_size, max_sent_length)

        Returns:
            logits (torch.Tensor): Output logits with shape (batch_size,
                n_classes)
        """

        # Get embeddings from `input_ids`. Output shape: (b, max_len, embed_dim)
        x_embed = self.embedding(input_ids).float()

        # Permute `x_embed` to match input shape requirement of `nn.Conv1d`.
        # Output shape: (b, embed_dim, max_len)
        x_reshaped = x_embed.permute(0, 2, 1)

        # Apply CNN and ReLU. Output shape: (b, num_filters[i], L_out)
        x_conv_list = [F.relu(conv1d(x_reshaped)) for conv1d in self.conv1d_list]

        # Max pooling. Output shape: (b, num_filters[i], 1)
        x_pool_list = [F.max_pool1d(x_conv, kernel_size=x_conv.shape[2])
            for x_conv in x_conv_list]
        
        # Concatenate x_pool_list to feed the fully connected layer.
        # Output shape: (b, sum(num_filters))
        x_fc = torch.cat([x_pool.squeeze(dim=2) for x_pool in x_pool_list],
                         dim=1)
        
        # Compute logits. Output shape: (b, n_classes)
        logits = self.fc(self.dropout(x_fc))

        return logits
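
The shapes in the comments above follow from the standard output-length formula for a 1D convolution with stride 1 and no padding, `L_out = L_in - k + 1`. The sketch below works through the arithmetic for the default filter sizes. `max_len = 50` is an arbitrary value chosen for illustration, not the notebook's actual sequence length.

```python
filter_sizes = [3, 4, 5]
num_filters = [100, 100, 100]
max_len = 50  # assumed sequence length, for illustration only

# Output length of each Conv1d (stride 1, no padding): L_out = L_in - k + 1
conv_lengths = [max_len - k + 1 for k in filter_sizes]

# Max-over-time pooling collapses each feature map to length 1, so the
# concatenated vector fed to the linear layer has sum(num_filters) features.
fc_in_features = sum(num_filters)

print(conv_lengths)
print(fc_in_features)
```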

In [None]:
def initilize_model(pretrained_embedding=None,
                    freeze_embedding=False,
                    vocab_size=None,
                    embed_dim=300,
                    filter_sizes=[3, 4, 5],
                    num_filters=[100, 100, 100],
                    num_classes=7,
                    dropout=0.5,
                    learning_rate=0.01):
    """Instantiate a CNN model and an optimizer."""

    assert (len(filter_sizes) == len(num_filters)), "filter_sizes and \
    num_filters need to be of the same length."

    # Instantiate CNN model
    cnn_model = CNN_NLP(pretrained_embedding=pretrained_embedding,
                        freeze_embedding=freeze_embedding,
                        vocab_size=vocab_size,
                        embed_dim=embed_dim,
                        filter_sizes=filter_sizes,
                        num_filters=num_filters,
                        num_classes=num_classes,
                        dropout=dropout)
    
    # Send model to `device` (GPU/CPU)
    cnn_model.to(device)

    # Instantiate Adadelta optimizer
    optimizer = optim.Adadelta(cnn_model.parameters(),
                               lr=learning_rate,
                               rho=0.95)

    return cnn_model, optimizer

In [None]:
# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility."""

    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, optimizer, train_dataloader, val_dataloader=None, epochs=20):
    """Train the CNN model."""
    
    # Tracking best validation accuracy
    best_accuracy = 0

    # Start training loop
    print("Start training...\n")
    print(f"{'Epoch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
    print("-"*60)

    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================

        # Tracking time and loss
        t0_epoch = time.time()
        total_loss = 0

        # Put the model into the training mode
        model.train()

        for step, batch in enumerate(train_dataloader):
            # Load batch to GPU
            b_input_ids, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Update parameters
            optimizer.step()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        # =======================================
        #               Evaluation
        # =======================================
        if val_dataloader is not None:
            # After the completion of each training epoch, measure the model's
            # performance on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Track the best accuracy
            if val_accuracy > best_accuracy:
                best_accuracy = val_accuracy

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            print(f"{epoch_i + 1:^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            
    print("\n")
    print(f"Training complete! Best accuracy: {best_accuracy:.2f}%.")

def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's
    performance on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled
    # during the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy
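
Note that `evaluate` averages per-batch accuracies, which equals the overall accuracy exactly only when every batch has the same size. When the last batch is smaller, the two figures can differ. A small sketch with made-up batch results illustrates the difference:

```python
# Made-up example: (number of correct predictions, batch size) per batch.
# The last batch is smaller, as happens when the dataset size is not a
# multiple of the batch size.
batches = [(40, 50), (45, 50), (1, 2)]

# What `evaluate` computes: the mean of the per-batch accuracies.
mean_of_batches = sum(c / n for c, n in batches) / len(batches) * 100

# Sample-weighted accuracy over the whole set.
overall = sum(c for c, _ in batches) / sum(n for _, n in batches) * 100

print(f"{mean_of_batches:.2f} vs {overall:.2f}")
```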

##Step 2: Training (category case)

In [None]:
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

#device = torch.device("cpu")

No GPU available, using the CPU instead.


#### Model without transfer learning

In [None]:
set_seed(42)
mod1, optimizer = initilize_model(vocab_size=len(word2idx),
                                      embed_dim=300,
                                      learning_rate=0.25,
                                      dropout=0.5)
train(mod1, optimizer, train_dataloader, val_dataloader, epochs=20)

#### Models with FastText

##### Transfer-learning model with unfrozen FastText

In [None]:
# CNN-non-static: fastText pretrained word vectors are fine-tuned during training (Adadelta optimizer).
set_seed(42)
cnn_FT_descongelado, optimizer = initilize_model(pretrained_embedding=embeddings_FastText,
                                            freeze_embedding=False,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_FT_descongelado, optimizer, train_dataloader, val_dataloader, epochs=20)

##### Transfer-learning model with frozen FastText

In [None]:
# CNN-static: fastText pretrained word vectors are kept frozen during training (Adadelta optimizer).
set_seed(42)
cnn_FT_congelado, optimizer = initilize_model(pretrained_embedding=embeddings_FastText,
                                            freeze_embedding=True,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_FT_congelado, optimizer, train_dataloader, val_dataloader, epochs=20)
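
The only difference between the frozen and unfrozen variants is the `freeze` flag passed to `nn.Embedding.from_pretrained`, which controls whether the embedding weights receive gradients. A minimal sketch, with a tiny random matrix standing in for the real pretrained vectors:

```python
import torch
import torch.nn as nn

# Tiny random matrix standing in for pretrained vectors (illustration only).
toy_vectors = torch.randn(10, 4)

frozen = nn.Embedding.from_pretrained(toy_vectors, freeze=True)
unfrozen = nn.Embedding.from_pretrained(toy_vectors, freeze=False)

# Frozen embeddings are excluded from gradient updates; unfrozen ones are
# fine-tuned together with the rest of the network.
print(frozen.weight.requires_grad)
print(unfrozen.weight.requires_grad)
```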

#### Models with Word2Vec

##### Transfer-learning model with unfrozen Word2Vec

In [None]:
# CNN-non-static: Word2Vec pretrained word vectors are fine-tuned during training (Adadelta optimizer).
set_seed(42)
cnn_W2V_descongelado, optimizer = initilize_model(pretrained_embedding=embeddings_Word2Vec,
                                            freeze_embedding=False,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_W2V_descongelado, optimizer, train_dataloader, val_dataloader, epochs=20)

##### Transfer-learning model with frozen Word2Vec

In [None]:
# CNN-static: Word2Vec pretrained word vectors are kept frozen during training (Adadelta optimizer).
set_seed(42)
cnn_W2V_congelado, optimizer = initilize_model(pretrained_embedding=embeddings_Word2Vec,
                                            freeze_embedding=True,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_W2V_congelado, optimizer, train_dataloader, val_dataloader, epochs=20)

#### Models with GloVe

##### Transfer-learning model with unfrozen GloVe

In [None]:
set_seed(42)
cnn_Glove_descongelado, optimizer = initilize_model(pretrained_embedding=embeddings_Glove,
                                            freeze_embedding=False,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_Glove_descongelado, optimizer, train_dataloader, val_dataloader, epochs=20)

##### Transfer-learning model with frozen GloVe

In [None]:
set_seed(42)
cnn_Glove_congelado, optimizer = initilize_model(pretrained_embedding=embeddings_Glove,
                                            freeze_embedding=True,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_Glove_congelado, optimizer, train_dataloader, val_dataloader, epochs=20)

## Step 3: Evaluation of the best models