<a href="https://www.kaggle.com/code/angevalli/sentiment-classification/notebook" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a> <a href="https://colab.research.google.com/drive/1kZdKcvVXTebR4gh-PKFN4bIq7wxL_oMd" target="_blank"><img align="left" alt="Colab" title="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a>

# Dataset

In [None]:
!wget 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
!tar -xvf /content/aclImdb_v1.tar.gz

## Text classification with Pytorch

In [None]:
import torch
import torch.nn as nn

The main reason we use PyTorch is its ```autograd``` package. ```torch.Tensor``` objects have an attribute ```.requires_grad```; when it is set to ```True```, PyTorch tracks all operations applied to the tensor. Once the computation is finished, you can call ```.backward()``` and all the gradients are computed automatically (and accumulated in the ```.grad``` attribute).

One easy way to cut a tensor from the computational graph once it is no longer needed is to use ```.detach()```.
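As a small illustration of ```.detach()```, the sketch below (with made-up values) shows that a detached tensor keeps its value but is treated as a constant during backpropagation:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
y = w * 3.0               # tracked: y is part of the graph
y_detached = y.detach()   # same value (6.0), but cut from the graph

print(y.requires_grad)           # True
print(y_detached.requires_grad)  # False

# Gradients only flow through the tracked branch:
# z = y * constant(6.0), so dz/dw = 6 * dy/dw = 6 * 3 = 18
z = y * y_detached
z.backward()
print(w.grad)  # tensor(18.)
```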

In [None]:
x = torch.tensor(1., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

# Build a computational graph.
y = w * x + b    # y = 2 * x + 3

# Compute gradients.
y.backward()

# Print out the gradients.
print(x.grad)    # x.grad = 2 
print(w.grad)    # w.grad = 1 
print(b.grad)    # b.grad = 1 

In [None]:
x = torch.randn(10, 3)
y = torch.randn(10, 2)

# Build a fully connected layer.
linear = nn.Linear(3, 2)
for name, p in linear.named_parameters():
    print(name)
    print(p)

# Build loss function - Mean Square Error
criterion = nn.MSELoss()

# Forward pass.
pred = linear(x)

# Compute loss.
loss = criterion(pred, y)
print('Initial loss: ', loss.item())

# Backward pass.
loss.backward()

# Print out the gradients.
print ('dL/dw: ', linear.weight.grad) 
print ('dL/db: ', linear.bias.grad)

In [None]:
# You can perform gradient descent manually, with an in-place update ...
linear.weight.data.sub_(0.01 * linear.weight.grad.data)
linear.bias.data.sub_(0.01 * linear.bias.grad.data)

# Print out the loss after 1-step gradient descent.
pred = linear(x)
loss = criterion(pred, y)
print('Loss after one update: ', loss.item())

In [None]:
# Use the optim package to define an Optimizer that will update the weights of the model.
optimizer = torch.optim.SGD(linear.parameters(), lr=0.01)

# By default, gradients are accumulated in buffers (i.e., not overwritten) whenever .backward()
# is called. Before the backward pass, we need to use the optimizer object to zero all of the
# gradients.
optimizer.zero_grad()
loss.backward()

# Calling the step function on an Optimizer makes an update to its parameters
optimizer.step()

# Print out the loss after the second step of gradient descent.
pred = linear(x)
loss = criterion(pred, y)
print('Loss after two updates: ', loss.item())
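To see why zeroing the gradients matters, here is a minimal sketch of gradient accumulation on a single scalar parameter (the variable names are illustrative only):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.item()     # 2.0

(2 * w).backward()        # no zeroing: the new gradient is added to the old one
second = w.grad.item()    # 4.0

w.grad.zero_()            # what optimizer.zero_grad() does for each parameter
(2 * w).backward()
third = w.grad.item()     # 2.0

print(first, second, third)
```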

### Tools for data processing 

In [None]:
toy_corpus = ['I walked down down the boulevard',
              'I walked down the avenue',
              'I ran down the boulevard',
              'I walk down the city',
              'I walk down the the avenue']

toy_categories = [0, 0, 1, 0, 0]

In [None]:
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    # A pytorch dataset class for holding data for a text classification task.
    def __init__(self, data, categories):
        # Upon creating the Dataset object, store the data in an attribute
        # Split the text data and labels from each other
        self.X, self.Y = [], []
        for x, y in zip(data, categories):
            # We will probably need to preprocess the data - have it done in a separate method
            # We do it here because we might need corpus-wide info to do the preprocessing 
            # For example, cutting all examples to the same length
            self.X.append(self.preprocess(x))
            self.Y.append(y)
                
    # Method allowing you to preprocess data                      
    def preprocess(self, text):
        text_pp = text.lower().strip()
        return text_pp
    
    # Overriding the method __len__ so that len(CustomDatasetName) returns the number of data samples                     
    def __len__(self):
        return len(self.Y)
   
    # Overriding the method __getitem__ so that CustomDatasetName[i] returns the i-th sample of the dataset                      
    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]

In [None]:
toy_dataset = CustomDataset(toy_corpus, toy_categories)

In [None]:
print(len(toy_dataset))
for i in range(len(toy_dataset)):
    print(toy_dataset[i])

```torch.utils.data.DataLoader``` is an iterable over a ```Dataset``` that provides very useful features:
- Batching the data
- Shuffling the data
- Loading the data in parallel using multiprocessing workers

It can be created very simply from a ```Dataset```. Continuing on our simple example: 

In [None]:
toy_dataloader = DataLoader(toy_dataset, batch_size = 2, shuffle = True)

In [None]:
for e in range(3):
    print("Epoch:" + str(e))
    for x, y in toy_dataloader:
        print("Batch: " + str(x) + "; labels: " + str(y))  

Epoch:0
Batch: ('i walked down down the boulevard', 'i walked down the avenue'); labels: tensor([0, 0])
Batch: ('i walk down the city', 'i walk down the the avenue'); labels: tensor([0, 0])
Batch: ('i ran down the boulevard',); labels: tensor([1])
Epoch:1
Batch: ('i ran down the boulevard', 'i walk down the city'); labels: tensor([1, 0])
Batch: ('i walk down the the avenue', 'i walked down down the boulevard'); labels: tensor([0, 0])
Batch: ('i walked down the avenue',); labels: tensor([0])
Epoch:2
Batch: ('i walked down the avenue', 'i walk down the the avenue'); labels: tensor([0, 0])
Batch: ('i walk down the city', 'i ran down the boulevard'); labels: tensor([0, 1])
Batch: ('i walked down down the boulevard',); labels: tensor([0])


### Data processing of a text dataset

Now, we would like to apply what we saw to our case, and **create a specific class** ```TextClassificationDataset``` **inheriting** ```Dataset``` that will:
- Create a vocabulary from the data (use what we saw in the previous lab session)
- Preprocess the data using this vocabulary, adding whatever we need for our PyTorch model
- Have a ```__getitem__``` method that allows us to use the class with a ```DataLoader``` to easily build batches.
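The first two steps can be sketched in plain Python on a toy corpus. The helper name ```build_toy_vocab``` and the data are hypothetical; index 0 is left free for padding and unknown words are redirected to an ```UNK``` entry:

```python
from collections import Counter

def build_toy_vocab(tokenized_docs, min_freq=1):
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # most_common sorts by decreasing count; ties keep first-seen order
    kept = [tok for tok, c in counts.most_common() if c >= min_freq]
    word2idx = {tok: i for i, tok in enumerate(kept, start=1)}  # 0 is reserved for padding
    word2idx["UNK"] = len(word2idx) + 1
    return word2idx

docs = [["i", "walk", "down", "the", "city"],
        ["i", "ran", "down", "the", "avenue"]]
word2idx = build_toy_vocab(docs, min_freq=2)
encoded = [[word2idx.get(tok, word2idx["UNK"]) for tok in doc] for doc in docs]

print(word2idx)  # {'i': 1, 'down': 2, 'the': 3, 'UNK': 4}
print(encoded)   # [[1, 4, 2, 3, 4], [1, 4, 2, 3, 4]]
```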

In [None]:
import os
import sys
from torch.nn import functional as F
import numpy as np
import random

from nltk import word_tokenize
from torch.nn.utils.rnn import pad_sequence

import re # Regex for clean_and_tokenize
import time # For evaluating durations

First, we get the filenames and the corresponding categories: 

In [None]:
from glob import glob
filenames_neg = sorted(glob(os.path.join('.', 'data', 'imdb1', 'neg', '*.txt')))
filenames_pos = sorted(glob(os.path.join('.', 'data', 'imdb1', 'pos', '*.txt')))
filenames = filenames_neg + filenames_pos

# The first half of the list holds the paths of negative reviews, and the second half those of positive ones
# We create the labels as an array of length len(filenames) filled with 1, and set the first half to 0
# (np.int was removed in NumPy 1.24, so we use an explicit integer dtype)
categories = np.ones(len(filenames), dtype=np.int64)
categories[:len(filenames_neg)] = 0

print("%d documents" % len(filenames))

25000 documents


We will need to create a ```TextClassificationDataset``` and a ```DataLoader``` for the training data, the validation data, and the testing data. We need to implement a function that will help us split the data in three, according to the proportions we give as input.

In [None]:
# Create a function allowing you to simply shuffle then split the filenames and categories into the desired
# proportions for a training, validation and testing set. 
def get_splits(x, y, splits):
    """
    The idea is to use an index list as reference:
    Indexes = [0 1 2 3 4 5 6 7 8 9]
    To shuffle it randomly:
    Indexes = [7 1 5 0 2 9 8 6 4 3]
    We need 'splits' to contain 2 values. Assuming those are = (0.8, 0.1), we'll have:
    Train_indexes = [7 1 5 0 2 9 8 6]
    Valid_indexes = [4]
    Test_indexes = [3]
    """
    # Create an index list and shuffle it - use the function random.shuffle
    indexes = list(range(len(x)))
    random.shuffle(indexes)

    # Find the two indexes we'll use to cut the lists from the splits
    train_cut = int(splits[0]*len(x))
    valid_cut = int((splits[0]+splits[1])*len(x))
    train_indexes = indexes[:train_cut]
    valid_indexes = indexes[train_cut:valid_cut]
    test_indexes = indexes[valid_cut:]
    
    # Do the cutting (careful: you can't use a list as index for a list - this only works with tensors)
    # (you need to use list comprehensions - or go through numpy)
    train_x, train_y = [x[i] for i in train_indexes], y[train_indexes]
    valid_x, valid_y = [x[i] for i in valid_indexes], y[valid_indexes]
    test_x, test_y = [x[i] for i in test_indexes], y[test_indexes]
    return (train_x, train_y), (valid_x, valid_y), (test_x, test_y)

In [None]:
# Fixed seed for reproducibility
random.seed(42)

# Choose the training, validation, testing splits
splits = (0.8, 0.1)
(train_f, train_c), (valid_f, valid_c), (test_f, test_c) = get_splits(filenames, categories, splits)

In [None]:
REDUCT = 50 # Keep only the first 1/REDUCT of each (already shuffled) split

reduct_len_train = int(len(train_f) / REDUCT)
reduct_len_valid = int(len(valid_f) / REDUCT)
reduct_len_test = int(len(test_f) / REDUCT)
(train_f, train_c) = (train_f[:reduct_len_train], train_c[:reduct_len_train])
(valid_f, valid_c) = (valid_f[:reduct_len_valid], valid_c[:reduct_len_valid])
(test_f, test_c) = (test_f[:reduct_len_test], test_c[:reduct_len_test])

# Check the new lengths of the datasets
print("Training samples:", len(train_f))
print("Validation samples:", len(valid_f))
print("Testing samples:", len(test_f))

Training samples: 400
Validation samples: 50
Testing samples: 50


We can now implement our ```TextClassificationDataset``` class, which we will build from:
- A list of paths to the IMDB files in the training set: ```paths_to_files```
- A list of the corresponding categories: ```categories```

We will add three optional arguments:
- First, a way to input a vocabulary (so that we can re-use the training vocabulary on the validation and test ```TextClassificationDataset```). By default, the value of the argument is ```None```.
- In order to work with batches, we will need to have sequences of the same size. That can be done via **padding**, but we will still need to limit the size of documents to a ```max_length``` (to avoid having batches of huge sequences that are mostly empty because of one very long document). Let's set it to 100 by default.
- Lastly, a ```min_freq``` that indicates how many times a word must appear to be included in the vocabulary. 

The idea behind **padding** is to transform a list of PyTorch tensors (possibly of different lengths) into a two-dimensional tensor, which we can see as a batch. With ```batch_first=True```, the first dimension indexes the sequences and the size of the second dimension is the length of the longest tensor; the shorter ones are **padded** with a chosen symbol: here, we choose 0. 

In [None]:
tensor_1 = torch.LongTensor([1, 4, 5])
tensor_2 = torch.LongTensor([2])
tensor_3 = torch.LongTensor([6, 7])

In [None]:
tensor_padded = pad_sequence([tensor_1, tensor_2, tensor_3], batch_first=True, padding_value=0)
print(tensor_padded)

tensor([[1, 4, 5],
        [2, 0, 0],
        [6, 7, 0]])
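Padding does not have to be applied to the whole dataset at once. As an alternative sketch (not what this notebook does), a hypothetical ```pad_collate``` function can pad each batch separately to the length of its own longest sequence; it could then be passed to a ```DataLoader``` through its ```collate_fn``` argument:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # batch is a list of (sequence_tensor, label) pairs, as returned by __getitem__
    sequences, labels = zip(*batch)
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, torch.tensor(labels, dtype=torch.float)

toy_batch = [
    (torch.LongTensor([1, 4, 5]), 0),
    (torch.LongTensor([2]), 1),
]
padded, labels = pad_collate(toy_batch)
print(padded)  # tensor([[1, 4, 5], [2, 0, 0]])
print(labels)  # tensor([0., 1.])
```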


In [None]:
def clean_and_tokenize(text):
    """
    Cleaning a document with:
        - Lowercase        
        - Removing numbers with regular expressions
        - Removing punctuation with regular expressions
        - Removing other artifacts
    And separate the document into words by simply splitting at spaces
    Params:
        text (string): a sentence or a document
    Returns:
        tokens (list of strings): the list of tokens (word units) forming the document
    """
    # Lowercase
    text = text.lower()
    # Remove numbers
    text = re.sub(r"[0-9]+", "", text)
    # Remove punctuation
    REMOVE_PUNCT = re.compile(r"[.;:!'?,\"()\[\]]")
    text = REMOVE_PUNCT.sub("", text)
    # Remove HTML artifacts specific to the corpus we're going to work with
    REPLACE_HTML = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
    text = REPLACE_HTML.sub(" ", text)
    
    tokens = text.split()        
    return tokens
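To make the effect of each cleaning step concrete, here is a compact, self-contained copy of the same logic (renamed ```toy_clean_and_tokenize``` to mark it as an illustration) applied to two made-up strings:

```python
import re

def toy_clean_and_tokenize(text):
    text = text.lower()
    text = re.sub(r"[0-9]+", "", text)                         # numbers
    text = re.sub(r"[.;:!'?,\"()\[\]]", "", text)              # punctuation
    text = re.sub(r"(<br\s*/><br\s*/>)|(\-)|(\/)", " ", text)  # corpus artifacts
    return text.split()

tokens = toy_clean_and_tokenize("This movie was GREAT!<br /><br />10/10 - loved it.")
print(tokens)  # ['this', 'movie', 'was', 'great', 'loved', 'it']

# A single <br /> is not matched by the first alternative: only its slash is replaced
single_br = toy_clean_and_tokenize("good<br />bad")
print(single_br)  # ['good<br', '>bad']
```

The second call shows a limitation of the pattern: only doubled line breaks are removed cleanly, which appears to be the form targeted for this corpus.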

In [None]:
class TextClassificationDataset(Dataset):
    def __init__(self, paths_to_files, categories, vocab=None,
                 max_length=100, min_freq=5):
        # Read all files and put the data in a list of strings
        start_time = time.monotonic() 
        self.data = [open(f, encoding="utf8").read() for f in paths_to_files]
        print("Load data content in {}s ({} samples)".format(
            time.monotonic() - start_time, len(paths_to_files)
        ))

        # Set the maximum length we will keep for the sequences
        self.max_length = max_length
        
        # Allow to import a vocabulary (for valid/test datasets, that will use the training vocabulary)
        if vocab is not None:
            self.word2idx, self.idx2word = vocab
        else:
            # If no vocabulary imported, build it (and reverse)
            self.word2idx, self.idx2word = self.build_vocab(self.data, min_freq)
        
        # We then need to tokenize the data ...
        tokenized_data = [
            clean_and_tokenize(sent) for sent in self.data
        ]
        # Transform words into lists of indexes ... (use the .get() method to redirect unknown words to the UNK token)
        unk_id = self.word2idx['UNK']
        indexed_data = [
            [self.word2idx.get(token, unk_id) for token in tok_sent]
            for tok_sent in tokenized_data
        ]
        # And transform this list of lists into a list of Pytorch LongTensors
        tensor_data = [
            torch.LongTensor(indexed_sent)
            for indexed_sent in indexed_data
        ]
        # And the categories into a FloatTensor
        tensor_y = torch.FloatTensor(categories)
        # To finally cut it when it's above the maximum length
        cut_tensor_data = [
            tensor_sent[:self.max_length]
            for tensor_sent in tensor_data
        ]
        
        # Now, we need to use the pad_sequence function to have the whole dataset represented as one tensor,
        # containing sequences of the same length. We choose the padding_value to be 0, and we want the
        # batch dimension to be the first dimension
        self.tensor_data = pad_sequence(
            cut_tensor_data,
            batch_first=True,
            padding_value=0
        )
        self.tensor_y = tensor_y
        
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # The iterator just gets one particular example with its category
        # The dataloader will take care of the shuffling and batching
        if torch.is_tensor(idx):
            idx = idx.tolist()
        return self.tensor_data[idx], self.tensor_y[idx] 
    
    def build_vocab(self, corpus, count_threshold):
        """
        We output word_index, a dictionary containing words 
        and their corresponding indexes as {word : indexes} 
        But also the reverse, which is a dictionary {indexes: word}
        We add a UNK token that we need when encountering unknown words
        We also choose '0' to represent the padding index, so begin the vocabulary index at 1 ! 
        """
        # Collect and count all distinct words
        word_counts = {}
        for sent in corpus:
            tok_sent = clean_and_tokenize(sent)
            for token in tok_sent:
                if token in word_counts:
                    word_counts[token] += 1
                else:
                    word_counts[token] = 1

        # Keep only the words that appear frequently enough
        filtered_word_counts = []
        for token, count in word_counts.items():
            if count >= count_threshold:
                filtered_word_counts.append([token, count])

        # Sort the words by frequency (optional)
        filtered_word_counts.sort(key= lambda x: x[1], reverse=True)

        # Build the vocabulary dictionaries
        word_index = dict(zip(
            [word_count[0] for word_count in filtered_word_counts], # token
            range(1, 1+len(filtered_word_counts)) # index
        ))
        idx_word = dict(zip(
            range(1, 1+len(filtered_word_counts)), # index
            [word_count[0] for word_count in filtered_word_counts]  # token
        ))

        # Add UNK token
        word_index['UNK'] = 1+len(filtered_word_counts)
        idx_word[1+len(filtered_word_counts)] = 'UNK'

        return word_index, idx_word
    
    def get_vocab(self):
        # A simple way to get the training vocab when building the valid/test 
        return self.word2idx, self.idx2word

In [None]:
# Parameters
MAX_LENGTH = 100
MIN_FREQ = 5

# Build dataset and vocabulary
training_dataset = TextClassificationDataset(
    train_f, train_c,
    max_length=MAX_LENGTH, min_freq=MIN_FREQ
)
training_word2idx, training_idx2word = training_dataset.get_vocab()
#print("Vocabulary:", training_word2idx) # for debugging
print("Size of vocabulary:", len(training_word2idx))

Load data content in 182.27123725100006s (400 samples)
Size of vocabulary: 2168


In [None]:
valid_dataset = TextClassificationDataset(valid_f, valid_c, (training_word2idx, training_idx2word))
test_dataset = TextClassificationDataset(test_f, test_c, (training_word2idx, training_idx2word))

Load data content in 21.20292835700002s (50 samples)
Load data content in 23.243693093000047s (50 samples)


In [None]:
training_dataloader = DataLoader(training_dataset, batch_size=200, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=25)
test_dataloader = DataLoader(test_dataset, batch_size=25)

## A simple averaging model

Now, we will implement a simple averaging model in PyTorch. For each model we implement, we need to create a class which inherits from ```nn.Module``` and redefine the ```__init__``` method as well as the ```forward``` method.

In [None]:
# Models are usually implemented as custom nn.Module subclass
# We need to redefine the __init__ method, which creates the object
# We also need to redefine the forward method, which transform the input into outputs

class AveragingModel(nn.Module):    
    def __init__(self, embedding_dim, vocabulary_size):
        super().__init__()
        # Create an embedding object. Be careful with padding - index 0 is reserved for it, so you need to increase the vocabulary size by one !
        # Look into the arguments of the nn.Embedding class
        self.embeddings = nn.Embedding(
            num_embeddings=vocabulary_size + 1,
            embedding_dim=embedding_dim
        ) # Completed
        # Create a linear layer that will transform the mean of the embeddings into a classification score
        self.linear = nn.Linear(
            in_features=embedding_dim,
            out_features=1
        ) # Completed
        
        # No need for a sigmoid here, it is included in the criterion ! 
        
    def forward(self, inputs):
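        # Remember: the inputs have shape batch_size * seq_length; their embeddings have shape
        # batch_size * seq_length * embedding_dim
        # First, take the mean of the embeddings of the document (padding positions are included in this mean)
        x = torch.mean(self.embeddings(inputs), 1) # Completed
        # Then make it go through the linear layer and remove the extra last dimension with .squeeze(-1)
        # (a bare .squeeze() would also drop the batch dimension for a batch of size 1)
        o = self.linear(x).squeeze(-1) # Completed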
        return o
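Because the sequences are padded with index 0, a plain mean over the sequence dimension also averages the padding vectors. One possible refinement, sketched here with a hypothetical ```masked_mean``` helper and toy sizes, is to average over real tokens only:

```python
import torch
import torch.nn as nn

def masked_mean(embedded, token_ids, pad_idx=0):
    # embedded: (batch, seq_len, dim); token_ids: (batch, seq_len)
    mask = (token_ids != pad_idx).unsqueeze(-1).float()  # (batch, seq_len, 1)
    summed = (embedded * mask).sum(dim=1)                # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)              # avoid division by zero
    return summed / counts

token_ids = torch.LongTensor([[1, 2, 0, 0], [3, 1, 2, 3]])
toy_embeddings = nn.Embedding(4, 5, padding_idx=0)
avg = masked_mean(toy_embeddings(token_ids), token_ids)
print(avg.shape)  # torch.Size([2, 5])
```

For the first row, the result is the mean of the vectors of tokens 1 and 2 only; the two padding positions no longer dilute it.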

In [None]:
import torch.optim as optim

In [None]:
model = AveragingModel(300, len(training_word2idx))
#model = PretrainedAveragingModel(GloveEmbeddings, fine_tuning=False) # to run pretrained Glove (class and embeddings are defined further below)

# Create an optimizer
opt = optim.Adam(model.parameters(), lr=0.0025, betas=(0.9, 0.999))
# The criterion is a binary cross entropy loss based on logits - meaning that the sigmoid is integrated into the criterion
criterion = nn.BCEWithLogitsLoss()
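To check what "the sigmoid is integrated into the criterion" means, the following sketch compares ```nn.BCEWithLogitsLoss``` on raw scores with ```nn.BCELoss``` applied after an explicit sigmoid (the values are arbitrary):

```python
import torch
import torch.nn as nn

logits = torch.tensor([-1.5, 0.0, 2.0])
targets = torch.tensor([0.0, 1.0, 1.0])

loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_two_steps = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_with_logits.item(), loss_two_steps.item())
print(torch.allclose(loss_with_logits, loss_two_steps, atol=1e-6))  # True
```

The fused version is generally preferred because it is more numerically stable for large positive or negative scores.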

In [None]:
# Implement a training function, which will train the model with the corresponding optimizer and criterion,
# with the appropriate dataloader, for one epoch.

def train_epoch(model, opt, criterion, dataloader):
    model.train()
    losses = []
    for i, (x, y) in enumerate(dataloader):
        opt.zero_grad()
        # (1) Forward
        pred = model(x) # Completed
        # (2) Compute the loss 
        loss = criterion(pred, y) # Completed
        # (3) Compute gradients by backpropagating the loss
        loss.backward() # Completed
        # (4) Update weights with the optimizer
        opt.step() # Completed     
        losses.append(loss.item())
        # Count the number of correct predictions in the batch - here, you'll need to use the sigmoid
        num_corrects = sum(torch.round(torch.sigmoid(pred)) == y) # Completed
        acc = 100.0 * num_corrects/len(y)
        
        if (i%20 == 0):
            print("Batch " + str(i) + " : training loss = " + str(loss.item()) + "; training acc = " + str(acc.item()))
    return losses

In [None]:
# Same for the evaluation ! We don't need the optimizer here. 
def eval_model(model, criterion, evalloader):
    model.eval()
    total_epoch_loss = 0
    total_epoch_acc = 0
    with torch.no_grad():
        for i, (x, y) in enumerate(evalloader):
            pred = model(x) # Completed
            loss = criterion(pred, y) # Completed
            num_corrects = sum(torch.round(torch.sigmoid(pred)) == y) # Completed
            acc = 100.0 * num_corrects/len(y)
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()

    return total_epoch_loss/(i+1), total_epoch_acc/(i+1)

In [None]:
# A function which will help you run experiments quickly - with an early_stopping option when necessary. 
def experiment(model, opt, criterion, num_epochs = 5, early_stopping = True):
    train_losses = []
    if early_stopping:
        best_valid_loss = float("inf")
    print("Beginning training...")
    start_time = time.monotonic() # ADDED line
    for e in range(num_epochs):
        print("Epoch " + str(e+1) + ":")
        train_losses += train_epoch(model, opt, criterion, training_dataloader)
        valid_loss, valid_acc = eval_model(model, criterion, valid_dataloader)
        print("Epoch " + str(e+1) + " : Validation loss = " + str(valid_loss) + "; Validation acc = " + str(valid_acc))
        if early_stopping:
            if valid_loss < best_valid_loss:
                best_valid_loss = valid_loss
            else:
                print("Early stopping.")
                break  
    test_loss, test_acc = eval_model(model, criterion, test_dataloader)
    print("Epoch " + str(e+1) + " : Test loss = " + str(test_loss) + "; Test acc = " + str(test_acc))
    print("Total duration: {} epochs in {}s".format( # ADDED line
        e+1,
        time.monotonic() - start_time,
    ))
    return train_losses
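The early stopping rule used in ```experiment``` stops at the first epoch whose validation loss does not improve. A common variant, sketched below in plain Python with a hypothetical ```patience``` parameter and made-up losses, tolerates a few bad epochs in a row before stopping:

```python
def epochs_before_stop(valid_losses, patience=2):
    best, bad_epochs = float("inf"), 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best, bad_epochs = loss, 0     # improvement: reset the counter
        else:
            bad_epochs += 1                # no improvement this epoch
            if bad_epochs >= patience:
                return epoch, best
    return len(valid_losses), best

losses = [0.70, 0.68, 0.69, 0.67, 0.68, 0.69, 0.66]
print(epochs_before_stop(losses, patience=1))  # (3, 0.68)
print(epochs_before_stop(losses, patience=2))  # (6, 0.67)
```

With ```patience=1``` the behaviour matches the rule in ```experiment```; a larger value lets training survive a noisy validation epoch, which is likely to matter on a validation set of only 50 samples.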

In [None]:
train_losses = experiment(model, opt, criterion)

Beginning training...
Epoch 1:
Batch 0 : training loss = 0.6941554546356201; training acc = 51.5
Epoch 1 : Validation loss = 0.6843913793563843; Validation acc = 60.0
Epoch 2:
Batch 0 : training loss = 0.6950291991233826; training acc = 50.5
Epoch 2 : Validation loss = 0.6839759647846222; Validation acc = 58.0
Epoch 3:
Batch 0 : training loss = 0.6867280006408691; training acc = 55.5
Epoch 3 : Validation loss = 0.6836389899253845; Validation acc = 58.0
Epoch 4:
Batch 0 : training loss = 0.6868730187416077; training acc = 52.5
Epoch 4 : Validation loss = 0.6832371056079865; Validation acc = 58.0
Epoch 5:
Batch 0 : training loss = 0.6686186194419861; training acc = 59.0
Epoch 5 : Validation loss = 0.6832980513572693; Validation acc = 60.0
Early stopping.
Epoch 5 : Test loss = 0.6848028898239136; Test acc = 54.0
Total duration: 5 epochs in 0.6132919039999933s


### With Glove embeddings: 

Now, we integrate pre-trained word embeddings into our model

In [None]:
import gensim.downloader as api
loaded_glove_model = api.load("glove-wiki-gigaword-300")
loaded_glove_embeddings = loaded_glove_model.vectors

In [None]:
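def get_glove_adapted_embeddings(glove_model, input_voc):
    # Map each vocabulary index to the row of the corresponding word in the GloVe matrix
    # (gensim >= 4.0 exposes this mapping as key_to_index; older versions used .vocab[w].index)
    index_dict = {i: glove_model.key_to_index[w] for w, i in input_voc.items()
                  if w in glove_model.key_to_index}
    # Row 0 (padding) and words missing from GloVe (including UNK) stay at zero
    embeddings = np.zeros((len(input_voc)+1, glove_model.vectors.shape[1]))
    for i, ind in index_dict.items():
        embeddings[i] = glove_model.vectors[ind]
    return embeddings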

GloveEmbeddings = get_glove_adapted_embeddings(loaded_glove_model, training_word2idx)

In [None]:
print(GloveEmbeddings.shape)
# We should check that the "padding" vector is at zero
print(GloveEmbeddings[0])

Here, implement a ```PretrainedAveragingModel``` very similar to the previous model, using the ```nn.Embedding``` class method ```from_pretrained()``` to initialize the embeddings from a numpy array converted to a tensor. Use its ```freeze``` argument (or, equivalently, the ```requires_grad_``` method on the embedding weights) to specify whether the model must fine-tune the embeddings or not! 
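As a quick check of how freezing works, this sketch builds two embedding layers from a small random matrix (a stand-in for the GloVe array, not real data) and inspects whether their weights will receive gradients:

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical 4-word, 3-dimensional embedding table; row 0 plays the role of padding
toy_matrix = np.zeros((4, 3), dtype=np.float32)
toy_matrix[1:] = np.random.rand(3, 3)

frozen = nn.Embedding.from_pretrained(torch.tensor(toy_matrix), freeze=True)
tuned = nn.Embedding.from_pretrained(torch.tensor(toy_matrix), freeze=False)
print(frozen.weight.requires_grad)  # False
print(tuned.weight.requires_grad)   # True

# requires_grad_ switches the behaviour in place after construction
frozen.weight.requires_grad_(True)
print(frozen.weight.requires_grad)  # True
```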

In [None]:
class PretrainedAveragingModel(nn.Module):
    # Completed
    def __init__(self, pretrained_embeddings, fine_tuning=False):
        """
        Initialize Averaging model with pretrained embeddings weights
        
        pretrained_embeddings: pretrained embeddings weights
        fine_tuning (bool): If False, the embeddings tensor does not get updated
            in the learning process. Defaults to False. Equivalent to
            embedding.weight.requires_grad = False. See 'freeze' argument
            in https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html
        """
        super().__init__()
        # Create an embedding object from pretrained
        self.embeddings = nn.Embedding.from_pretrained(
            embeddings=torch.Tensor(pretrained_embeddings),
            freeze=not fine_tuning
        )
        # Create a linear layer that will transform the mean of the embeddings into a classification score
        embedding_dim = pretrained_embeddings.shape[1] # =300 for Glove
        self.linear = nn.Linear(
            in_features=embedding_dim,
            out_features=1
        )

    def forward(self, inputs):
        # The inputs have shape batch_size * seq_length; their embeddings have shape batch_size * seq_length * embedding_dim
        # First, take the mean of the embeddings of the document
        x = torch.mean(self.embeddings(inputs), 1)
        # Then make it go through the linear layer and remove the extra last dimension with .squeeze(-1)
        o = self.linear(x).squeeze(-1)
        return o

# LSTM Cells in pytorch

In [None]:
# Create a toy example of LSTM: 
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5

# LSTMs expect inputs having 3 dimensions:
# - The first dimension is the temporal dimension, along which we (in our case) have the different words
# - The second dimension is the batch dimension, along which we stack the independent sequences of a batch
# - The third dimension is the feature dimension, along which are the features of the vector representing the words

# In our toy case, we have inputs and outputs containing 3 features (third dimension !)
# We created a sequence of 5 different inputs (first dimension !)
# We don't use batches (the second dimension will have one element)

# We need an initial hidden state, of the right sizes for dimension 2/3, but with only one temporal element:
# Here, it is:
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
# We create a tuple of two tensors because we use LSTMs: they carry two states
# (hidden state and cell state). Read: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
# If we used a classic RNN, we would simply have:
# hidden = torch.randn(1, 1, 3)

# The naive way of applying a lstm to inputs is to apply it one step at a time, and loop through the sequence
for i in inputs:
    # After each step, hidden contains the hidden states (remember, it's a tuple of two states).
    out, hidden = lstm(i.view(1, 1, -1), hidden)
    
# Alternatively, we can do the entire sequence all at once.
# The first value returned by LSTM is all of the Hidden states throughout the sequence.
# The second is just the most recent Hidden state and Cell state (you can compare the values)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence, for each temporal step
# "hidden" will allow you to continue the sequence and backpropagate later, with another sequence
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # Re-initialize
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

tensor([[[-0.1571, -0.4284,  0.2217]],

        [[ 0.2212, -0.3245, -0.2939]],

        [[ 0.3902,  0.0851, -0.0344]],

        [[ 0.5499,  0.2773,  0.0102]],

        [[ 0.4572,  0.1448, -0.2210]]], grad_fn=<StackBackward>)
(tensor([[[ 0.4572,  0.1448, -0.2210]]], grad_fn=<StackBackward>), tensor([[[ 0.9262,  0.2277, -0.3543]]], grad_fn=<StackBackward>))
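The dataset in this notebook stores batches as batch_size * seq_length tensors, so the models further below use ```batch_first=True```. The sketch here (toy sizes, random inputs) shows the resulting shapes and confirms that, for a single-layer unidirectional LSTM, the last time step of ```out``` is the final hidden state:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)
batch = torch.randn(2, 5, 3)         # (batch, seq_len, features)
out, (h_n, c_n) = toy_lstm(batch)    # initial states default to zeros

print(out.shape)  # torch.Size([2, 5, 4])
print(h_n.shape)  # torch.Size([1, 2, 4])
print(torch.allclose(out[:, -1, :], h_n[0]))  # True
```

Note that ```h_n``` keeps the layout num_layers * batch_size * hidden_dim even when ```batch_first=True```.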


### Creating our own LSTM Model

We'll now implement an LSTM model, taking the same inputs and also outputting a score for the sentence.

In [None]:
class LSTMModel(nn.Module):
    def __init__(self, embedding_dim, vocabulary_size, hidden_dim,
          embeddings=None, fine_tuning=False):
      super().__init__()
      self.hidden_dim = hidden_dim

      if embeddings is not None :
        self.embeddings = nn.Embedding.from_pretrained(embeddings=torch.Tensor(embeddings), freeze=not fine_tuning)
        embedding_dim=embeddings.shape[1]
      else :
        self.embeddings = nn.Embedding(
            vocabulary_size+1, embedding_dim
        )
      self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
      self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, inputs):
      embeddings = self.embeddings(inputs)
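      # last_hidden_state has shape num_layers * batch_size * hidden_dim
      _, (last_hidden_state, _) = self.lstm(embeddings)
      # Keep the last layer's state and remove only the trailing dimension of size 1
      linear_out = self.linear(last_hidden_state[-1]).squeeze(-1)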
      return linear_out
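With padded inputs, the final hidden state is computed after the LSTM has also read the padding tokens. One way to avoid this, shown as a sketch under the assumption that index 0 is the padding index and that every sequence has at least one real token, is to pack the batch with ```pack_padded_sequence``` before feeding it to the LSTM:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
toy_embedding = nn.Embedding(10, 6, padding_idx=0)
toy_lstm = nn.LSTM(6, 8, batch_first=True)

token_ids = torch.LongTensor([[4, 2, 0, 0], [1, 3, 5, 7]])
lengths = (token_ids != 0).sum(dim=1)   # tensor([2, 4])
packed = pack_padded_sequence(toy_embedding(token_ids), lengths.cpu(),
                              batch_first=True, enforce_sorted=False)
_, (h_n, _) = toy_lstm(packed)
print(h_n.shape)  # torch.Size([1, 2, 8])

# h_n[0, 0] is the state after the 2 real tokens of the first sequence
_, (h_short, _) = toy_lstm(toy_embedding(token_ids[:1, :2]))
print(torch.allclose(h_n[0, 0], h_short[0, 0], atol=1e-6))  # True
```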

## Run LSTM model without pretraining

In [None]:
model = LSTMModel(
    embedding_dim=300,
    vocabulary_size=len(training_word2idx),
    hidden_dim=300
)
# Create an optimizer
opt = optim.Adam(model.parameters(), lr=0.0025, betas=(0.9, 0.999))
# The criterion is a binary cross entropy loss based on logits - meaning that the sigmoid is integrated into the criterion
criterion = nn.BCEWithLogitsLoss()

train_losses = experiment(model, opt, criterion)

Beginning training...
Epoch 1:
Batch 0 : training loss = 0.6963910460472107; training acc = 47.5
Epoch 1 : Validation loss = 0.6882959306240082; Validation acc = 56.0
Epoch 2:
Batch 0 : training loss = 0.627260684967041; training acc = 67.5
Epoch 2 : Validation loss = 0.6987157166004181; Validation acc = 54.0
Early stopping.
Epoch 2 : Test loss = 0.7395232021808624; Test acc = 50.0
Total duration: 2 epochs in 5.50652630400009s


## Run LSTM model with pretrained embeddings (fixed)

In [None]:
model = LSTMModel(
    embedding_dim=300,
    vocabulary_size=len(training_word2idx),
    hidden_dim=300,
    embeddings=GloveEmbeddings
)
# Create an optimizer
opt = optim.Adam(model.parameters(), lr=0.0025, betas=(0.9, 0.999))
# The criterion is a binary cross entropy loss based on logits - meaning that the sigmoid is integrated into the criterion
criterion = nn.BCEWithLogitsLoss()

train_losses = experiment(model, opt, criterion)

Beginning training...
Epoch 1:
Batch 0 : training loss = 0.6923812627792358; training acc = 52.0
Epoch 1 : Validation loss = 0.6822963654994965; Validation acc = 54.0
Epoch 2:
Batch 0 : training loss = 0.6815677881240845; training acc = 53.5
Epoch 2 : Validation loss = 0.6804623901844025; Validation acc = 56.0
Epoch 3:
Batch 0 : training loss = 0.6675193309783936; training acc = 58.5
Epoch 3 : Validation loss = 0.6831582188606262; Validation acc = 60.0
Early stopping.
Epoch 3 : Test loss = 0.7196337878704071; Test acc = 44.0
Total duration: 3 epochs in 6.908174721000023s


## Run LSTM model with pretrained embeddings and fine-tuning

In [None]:
model = LSTMModel(
    embedding_dim=300,
    vocabulary_size=len(training_word2idx),
    hidden_dim=300,
    embeddings=GloveEmbeddings,
    fine_tuning=True
)
# Create an optimizer
opt = optim.Adam(model.parameters(), lr=0.0025, betas=(0.9, 0.999))
# The criterion is a binary cross entropy loss based on logits - meaning that the sigmoid is integrated into the criterion
criterion = nn.BCEWithLogitsLoss()

train_losses = experiment(model, opt, criterion)

Beginning training...
Epoch 1:
Batch 0 : training loss = 0.6927673816680908; training acc = 49.5
Epoch 1 : Validation loss = 0.6823270320892334; Validation acc = 52.0
Epoch 2:
Batch 0 : training loss = 0.6571475863456726; training acc = 56.5
Epoch 2 : Validation loss = 0.6990417838096619; Validation acc = 58.0
Early stopping.
Epoch 2 : Test loss = 0.7328164875507355; Test acc = 44.0
Total duration: 2 epochs in 5.366575243999932s


## And with a CNN?

Reference: *Using convolution neural networks to classify text in pytorch*, Blog Post [link](https://tzuruey.medium.com/using-convolution-neural-networks-to-classify-text-in-pytorch-3b626a42c3ca)
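
To make the architecture easier to follow, here is a small torch-free sketch that traces the tensor shape after each step of the `forward` method of the `CNNModel` defined in this section. The helper name `cnn_shapes` is hypothetical; the default sizes are the defaults of `CNNModel`, and the batch size of 200 and sequence length of 100 are assumptions taken from the hyperparameters used later in the grid search.

```python
def cnn_shapes(batch_size, seq_len, embedding_dim=300,
               window_size=16, filter_multiplier=64):
    """Return (step, shape) pairs mirroring CNNModel.forward.

    Hypothetical helper: it only does shape arithmetic, no tensors involved.
    """
    if seq_len < window_size:
        raise ValueError("sequence shorter than the convolution window")
    # Convolution with stride 1 and no padding: one output per window position
    n_windows = seq_len - window_size + 1
    return [
        ("inputs", (batch_size, seq_len)),
        ("embeddings", (batch_size, seq_len, embedding_dim)),
        ("unsqueeze(1)", (batch_size, 1, seq_len, embedding_dim)),
        ("conv", (batch_size, filter_multiplier, n_windows, 1)),
        ("squeeze(3)", (batch_size, filter_multiplier, n_windows)),
        ("max_pool1d + squeeze(2)", (batch_size, filter_multiplier)),
        ("linear", (batch_size, 1)),
        ("squeeze", (batch_size,)),
    ]


for step, shape in cnn_shapes(batch_size=200, seq_len=100):
    print(f"{step:<24} {shape}")
```

Because the kernel covers the full embedding width, the convolution slides only along the sequence, and the max pooling keeps one value per filter whatever the sequence length. Note that with a batch of a single example, the final `torch.squeeze` would also drop the batch dimension.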

In [None]:
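class CNNModel(nn.Module):
    def __init__(self, embedding_dim, vocabulary_size,
                 embeddings=None, fine_tuning=False, window_size: int = 16,
                 filter_multiplier=64):
        super().__init__()
        if embeddings is not None:
            # Use the pretrained vectors; they stay frozen unless fine_tuning is True
            self.embeddings = nn.Embedding.from_pretrained(
                embeddings=torch.Tensor(embeddings), freeze=not fine_tuning
            )
            embedding_dim = embeddings.shape[1]
        else:
            # +1 because index 0 is reserved for padding
            self.embeddings = nn.Embedding(vocabulary_size + 1, embedding_dim)
        # Each filter spans window_size consecutive tokens and the full embedding width
        self.conv = nn.Conv2d(
            in_channels=1, out_channels=filter_multiplier,
            kernel_size=(window_size, embedding_dim)
        )
        self.linear = nn.Linear(filter_multiplier, out_features=1)

    def forward(self, inputs):
        x = self.embeddings(inputs)   # (batch, seq_len, embedding_dim)
        x = torch.unsqueeze(x, 1)     # (batch, 1, seq_len, embedding_dim)
        x = self.conv(x)              # (batch, filters, seq_len - window_size + 1, 1)
        x = torch.squeeze(x, 3)       # (batch, filters, seq_len - window_size + 1)
        x = F.relu(x)
        # Max-over-time pooling: keep the strongest response of each filter
        x = F.max_pool1d(x, x.shape[2]).squeeze(2)  # (batch, filters)
        x = self.linear(x)            # (batch, 1)
        return torch.squeeze(x)       # (batch,)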

## Train CNN model without pretrained embeddings

In [None]:
model = CNNModel(embedding_dim=300, vocabulary_size=len(training_word2idx))
# Create an optimizer
opt = optim.Adam(model.parameters(), lr=0.0025, betas=(0.9, 0.999))
# The criterion is a binary cross entropy loss based on logits - meaning that the sigmoid is integrated into the criterion
criterion = nn.BCEWithLogitsLoss()

train_losses = experiment(model, opt, criterion)

Beginning training...
Epoch 1:
Batch 0 : training loss = 0.7702906131744385; training acc = 52.5
Epoch 1 : Validation loss = 0.8369641900062561; Validation acc = 40.0
Epoch 2:
Batch 0 : training loss = 0.5364144444465637; training acc = 73.0
Epoch 2 : Validation loss = 1.0656010806560516; Validation acc = 54.0
Early stopping.
Epoch 2 : Test loss = 0.9522664546966553; Test acc = 58.0
Total duration: 2 epochs in 6.500939182000138s


## Train CNN model with pretrained embeddings but without fine-tuning

In [None]:
model = CNNModel(embedding_dim=300, vocabulary_size=len(training_word2idx), embeddings=GloveEmbeddings, fine_tuning=False)
# Create an optimizer
opt = optim.Adam(model.parameters(), lr=0.0025, betas=(0.9, 0.999))
# The criterion is a binary cross entropy loss based on logits - meaning that the sigmoid is integrated into the criterion
criterion = nn.BCEWithLogitsLoss()

train_losses = experiment(model, opt, criterion)

Beginning training...
Epoch 1:
Batch 0 : training loss = 0.6925008893013; training acc = 52.0
Epoch 1 : Validation loss = 0.7378141582012177; Validation acc = 46.0
Epoch 2:
Batch 0 : training loss = 0.6293485760688782; training acc = 50.0
Epoch 2 : Validation loss = 0.6675723791122437; Validation acc = 54.0
Epoch 3:
Batch 0 : training loss = 0.5026382207870483; training acc = 69.5
Epoch 3 : Validation loss = 0.7164798080921173; Validation acc = 48.0
Early stopping.
Epoch 3 : Test loss = 0.7474651336669922; Test acc = 44.0
Total duration: 3 epochs in 6.60121419699999s


## Train CNN model with pretrained embeddings and fine-tuning

In [None]:
model = CNNModel(embedding_dim=300, vocabulary_size=len(training_word2idx), embeddings=GloveEmbeddings, fine_tuning=True)
# Create an optimizer
opt = optim.Adam(model.parameters(), lr=0.0025, betas=(0.9, 0.999))
# The criterion is a binary cross entropy loss based on logits - meaning that the sigmoid is integrated into the criterion
criterion = nn.BCEWithLogitsLoss()

train_losses = experiment(model, opt, criterion)

Beginning training...
Epoch 1:
Batch 0 : training loss = 0.7140135169029236; training acc = 48.0
Epoch 1 : Validation loss = 0.7727560698986053; Validation acc = 54.0
Epoch 2:
Batch 0 : training loss = 0.7112321257591248; training acc = 57.5
Epoch 2 : Validation loss = 0.6959434747695923; Validation acc = 46.0
Epoch 3:
Batch 0 : training loss = 0.6918337941169739; training acc = 49.5
Epoch 3 : Validation loss = 0.6977367699146271; Validation acc = 46.0
Early stopping.
Epoch 3 : Test loss = 0.7032230496406555; Test acc = 40.0
Total duration: 3 epochs in 9.441260064000062s


# Results



## Grid Search

In this section, we provide an implementation of a grid search over our different models. In particular, we investigate the influence of the vocabulary size for different dataset sizes. Outputs are saved in the files `grid_search_<run>.csv` (one file per run) in the current working directory.

The size of the training dataset is chosen in `[400, 800, 1600]`, which corresponds to reduction factors of `[50, 25, 12.5]` relative to the initial training set. The size of the vocabulary is varied through the `min_freq` parameter, which takes the values `[1, 3, 5, 10, 20]`. For reproducibility, dataset splits are made after seeding Python's standard `random` library with 42. On Google Colab (CPU only), the average runtime is about 10 min for the Averaging models (with the three embedding settings), about 20 min for the LSTM models and about 15 min for the CNN models.

In all these experiments, the hyperparameters are fixed as follows:
- number of epochs: 10
- early stopping: True
- batch size: 200 for the training set, 25 for the validation and test sets
- optimizer: Adam, with `betas=(0.9, 0.999)`
- learning rate: 0.0025
- maximum length of the samples (`max_length`): 100
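
As a rough sketch of how large this grid is, the snippet below enumerates every configuration explored in one run. The list values come from the description above; `FULL_TRAIN_SIZE = 20_000` is an assumption inferred from 400 training samples at a reduction factor of 50, and the variable names are only illustrative.

```python
from itertools import product

reduction_factors = [50, 25, 12.5]
min_frequencies = [1, 3, 5, 10, 20]
model_types = ["Average", "LSTM", "CNN"]
# (embedding, fine-tuning) settings tried for every model type
embedding_modes = [
    ("No embedding", "No fine-tuning"),
    ("GloveEmbeddings", "No fine-tuning"),
    ("GloveEmbeddings", "Fine-tuning"),
]

# Assumption: size of the full training split, inferred from 400 * 50
FULL_TRAIN_SIZE = 20_000

grid = list(product(reduction_factors, min_frequencies, model_types, embedding_modes))
print("trainings per run:", len(grid))  # 3 * 5 * 3 * 3

for factor in reduction_factors:
    print("reduction factor", factor, "->", int(FULL_TRAIN_SIZE / factor), "training samples")
```

Each run therefore trains 135 models, which is why the datasets are kept small.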

## Modified versions of `TextClassificationDataset` and `experiment` to avoid loading data multiple times and to retrieve validation and test accuracy

In [None]:
# Parameters
MAX_LENGTH = 100
MIN_FREQ = 5

class TextClassificationDataset(Dataset):
    def __init__(self, paths_to_files, categories, vocab=None,
            max_length=MAX_LENGTH, min_freq=MIN_FREQ, data=None):
        # Read all files and put the data in a list of strings
        # MODIFIED: avoid loading data multiple times
        if data is None:
          self.data = [open(f, encoding="utf8").read() for f in paths_to_files]
        else:
          self.data = data
        
        # Set the maximum length we will keep for the sequences
        self.max_length = max_length
        
        # Allow to import a vocabulary (for valid/test datasets, that will use the training vocabulary)
        if vocab is not None:
            self.word2idx, self.idx2word = vocab
        else:
            # If no vocabulary imported, build it (and reverse)
            self.word2idx, self.idx2word = self.build_vocab(self.data, min_freq)
        
        # We then need to tokenize the data .. 
        tokenized_data = [
            clean_and_tokenize(sent) for sent in self.data
        ]
        # Transform words into lists of indexes ... (use the .get() method to redirect unknown words to the UNK token)
        unk_id = self.word2idx['UNK']
        indexed_data = [
            [self.word2idx.get(token, unk_id) for token in tok_sent]
            for tok_sent in tokenized_data
        ]
        # And transform this list of lists into a list of Pytorch LongTensors
        tensor_data = [
            torch.LongTensor(indexed_sent)
            for indexed_sent in indexed_data
        ]
        # And the categories into a FloatTensor
        tensor_y = torch.FloatTensor(categories)
        # To finally cut it when it's above the maximum length
        cut_tensor_data = [
            tensor_sent[:self.max_length]
            for tensor_sent in tensor_data
        ]
        
        # Now, we need to use the pad_sequence function to have the whole dataset represented as one tensor,
        # containing sequences of the same length. We choose the padding_value to be 0, and we want the
        # batch dimension to be the first dimension 
        self.tensor_data = pad_sequence(
            cut_tensor_data,
            batch_first=True,
            padding_value=0
        )
        self.tensor_y = tensor_y
        
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # The iterator just gets one particular example with its category
        # The dataloader will take care of the shuffling and batching
        if torch.is_tensor(idx):
            idx = idx.tolist()
        return self.tensor_data[idx], self.tensor_y[idx] 
    
    def build_vocab(self, corpus, count_threshold):
        """
        We output word_index, a dictionary containing words 
        and their corresponding indexes as {word : indexes} 
        But also the reverse, which is a dictionary {indexes: word}
        We add a UNK token that we need when encountering unknown words
        We also choose '0' to represent the padding index, so begin the vocabulary index at 1 ! 
        """
        # Collect and count all distinct words
        word_counts = {}
        for sent in corpus:
            tok_sent = clean_and_tokenize(sent)
            for token in tok_sent:
                if token in word_counts:
                    word_counts[token] += 1
                else:
                    word_counts[token] = 1

        # Keep only the words that appear frequently enough
        filtered_word_counts = []
        for token, count in word_counts.items():
            if count >= count_threshold:
                filtered_word_counts.append([token, count])

        # Sort the words by frequency (optional)
        filtered_word_counts.sort(key= lambda x: x[1], reverse=True)

        # Build the vocabulary dictionaries
        word_index = dict(zip(
            [word_count[0] for word_count in filtered_word_counts],  # token
            range(1, 1+len(filtered_word_counts))  # index
        ))
        idx_word = dict(zip(
            range(1, 1+len(filtered_word_counts)),  # index
            [word_count[0] for word_count in filtered_word_counts]  # token
        ))

        # Add UNK token
        word_index['UNK'] = 1+len(filtered_word_counts)
        idx_word[1+len(filtered_word_counts)] = 'UNK'
        return word_index, idx_word
    
    def get_vocab(self):
        # A simple way to get the training vocab when building the valid/test 
        return self.word2idx, self.idx2word

def experiment_gridsearch(model, opt, criterion,
        num_epochs=10, early_stopping=True):
    """
    Modified version of 'experiment' function
    which returns the final validation and test accuracies
    """
    train_losses = []
    if early_stopping: 
        best_valid_loss = 10. 
    print("Beginning training...")
    for e in range(num_epochs):
        print("Epoch " + str(e+1) + ":")
        train_losses += train_epoch(model, opt, criterion, training_dataloader)
        valid_loss, valid_acc = eval_model(model, criterion, valid_dataloader)
        print("Epoch " + str(e+1) + " : Validation loss = " + str(valid_loss) + "; Validation acc = " + str(valid_acc))
        if early_stopping:
            if valid_loss < best_valid_loss:
                best_valid_loss = valid_loss
            else:
                print("Early stopping.")
                break  
    test_loss, test_acc = eval_model(model, criterion, test_dataloader)
    return valid_acc, test_acc

## Define function to perform trainings

In [None]:
def trainings(model_type, vocabulary_size, embeddings=None, fine_tuning=False):
  """Declare and train a model"""
  # Initialize the model
  if model_type == 'Average' and embeddings is None:
    model = AveragingModel(300, vocabulary_size)
  elif model_type == 'Average' and embeddings is not None:
    model = PretrainedAveragingModel(embeddings, fine_tuning=fine_tuning)
  elif model_type == 'LSTM':
    model = LSTMModel(
        embedding_dim=300, vocabulary_size=vocabulary_size,
        hidden_dim=300, embeddings=embeddings, fine_tuning=fine_tuning
    )
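  elif model_type == 'CNN':
    model = CNNModel(
        embedding_dim=300, vocabulary_size=vocabulary_size,
        embeddings=embeddings, fine_tuning=fine_tuning
    )
  else:
    # Fail early instead of hitting an unbound 'model' below
    raise ValueError("Unknown model_type: {}".format(model_type))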

  opt = optim.Adam(model.parameters(), lr=0.0025, betas=(0.9, 0.999))
  criterion = nn.BCEWithLogitsLoss()

  # Train the model
  valid_acc, test_acc = experiment_gridsearch(model, opt, criterion, num_epochs=10, early_stopping=True)
  return valid_acc, test_acc

## Load GloVe embeddings

In [None]:
loaded_glove_model = api.load("glove-wiki-gigaword-300")
loaded_glove_embeddings = loaded_glove_model.vectors

## Perform trainings for grid search

In [None]:
# Range for parameters
reduction_factors = [50, 25, 12.5]
min_frequences = [1, 3, 5, 10, 20]
nb_runs = 3

table = []

# Main loop
for run in range(nb_runs) :
  fichier_grid = open("grid_search_{}.csv".format(run), "a")
  for reduction_factor in reduction_factors:
    # Retrieve data
    random.seed(42)
    (train_f, train_c), (valid_f, valid_c), (test_f, test_c) = get_splits(filenames, categories, splits)
    reduct_len_train = int(len(train_f) / reduction_factor)
    reduct_len_valid = int(len(valid_f) / reduction_factor)
    reduct_len_test = int(len(test_f) / reduction_factor)
    (train_f, train_c) = (train_f[:reduct_len_train], train_c[:reduct_len_train])
    (valid_f, valid_c) = (valid_f[:reduct_len_valid], valid_c[:reduct_len_valid])
    (test_f, test_c) = (test_f[:reduct_len_test], test_c[:reduct_len_test])
    data = [open(f, encoding="utf8").read() for f in train_f]

    # Loop over minimum frequency
    for min_freq in min_frequences:
      # Build dataset, splits and vocabulary
      training_dataset = TextClassificationDataset(train_f, train_c, min_freq=min_freq, data=data)
      training_word2idx, training_idx2word = training_dataset.get_vocab()

      valid_dataset = TextClassificationDataset(valid_f, valid_c, (training_word2idx, training_idx2word))
      test_dataset = TextClassificationDataset(test_f, test_c, (training_word2idx, training_idx2word))
      size_voc = len(training_word2idx)

      training_dataloader = DataLoader(training_dataset, batch_size=200, shuffle=True)
      valid_dataloader = DataLoader(valid_dataset, batch_size=25)
      test_dataloader = DataLoader(test_dataset, batch_size=25)

      GloveEmbeddings = get_glove_adapted_embeddings(loaded_glove_model, training_word2idx)

      ## Average Model:
      # Without pre-training:
      valid_acc, test_acc = trainings(
          model_type='Average',
          vocabulary_size=size_voc,
          embeddings=None
      )
      fichier_grid.write(" Reduction factor = {}, Min Freq = {}, Model_type = {}, Embedding = {}, Fine-tuning = {}, Valid_Acc = {}, Test_Acc = {} \n".format(reduction_factor, min_freq, 'Average', 'No', 'No', valid_acc, test_acc))
      table.append([reduction_factor, min_freq, 'Average', 'No embedding', 'No fine-tuning', valid_acc, test_acc])
      # With pre-training and without fine-tuning:
      valid_acc, test_acc = trainings(
          model_type='Average',
          vocabulary_size=size_voc,
          embeddings=GloveEmbeddings,
          fine_tuning=False
      )
      fichier_grid.write(" Reduction factor = {}, Min Freq = {}, Model_type = {}, Embedding = {}, Fine-tuning = {}, Valid_Acc = {}, Test_Acc = {} \n".format(reduction_factor, min_freq, 'Average', 'Yes', 'No', valid_acc, test_acc))
      table.append([reduction_factor, min_freq, 'Average', 'GloveEmbeddings', 'No fine-tuning', valid_acc, test_acc])
      # With pre-training and with fine-tuning:
      valid_acc, test_acc = trainings(
          model_type='Average',
          vocabulary_size=size_voc,
          embeddings=GloveEmbeddings,
          fine_tuning=True)
      fichier_grid.write(" Reduction factor = {}, Min Freq = {}, Model_type = {}, Embedding = {}, Fine-tuning = {}, Valid_Acc = {}, Test_Acc = {} \n".format(reduction_factor, min_freq, 'Average', 'Yes', 'Yes', valid_acc, test_acc))
      table.append([reduction_factor, min_freq, 'Average', 'GloveEmbeddings', 'Fine-tuning', valid_acc, test_acc])

      ## LSTM:
      # Without pre-training:
      valid_acc, test_acc = trainings(
          model_type='LSTM',
          vocabulary_size=size_voc,
          embeddings=None
      )
      fichier_grid.write(" Reduction factor = {}, Min Freq = {}, Model_type = {}, Embedding = {}, Fine-tuning = {}, Valid_Acc = {}, Test_Acc = {} \n".format(reduction_factor, min_freq, 'LSTM', 'No', 'No', valid_acc, test_acc))
      table.append([reduction_factor, min_freq, 'LSTM', 'No embedding', 'No fine-tuning', valid_acc, test_acc])
      # With pre-training and without fine-tuning:
      valid_acc, test_acc = trainings(
          model_type='LSTM',
          vocabulary_size=size_voc,
          embeddings=GloveEmbeddings,
          fine_tuning=False
      )
      fichier_grid.write(" Reduction factor = {}, Min Freq = {}, Model_type = {}, Embedding = {}, Fine-tuning = {}, Valid_Acc = {}, Test_Acc = {} \n".format(reduction_factor, min_freq, 'LSTM', 'Yes', 'No', valid_acc, test_acc))
      table.append([reduction_factor, min_freq, 'LSTM', 'GloveEmbeddings', 'No fine-tuning', valid_acc, test_acc])
      # With pre-training and with fine-tuning:
      valid_acc, test_acc = trainings(
          model_type='LSTM',
          vocabulary_size=size_voc,
          embeddings=GloveEmbeddings,
          fine_tuning=True)
      fichier_grid.write(" Reduction factor = {}, Min Freq = {}, Model_type = {}, Embedding = {}, Fine-tuning = {}, Valid_Acc = {}, Test_Acc = {} \n".format(reduction_factor, min_freq, 'LSTM', 'Yes', 'Yes', valid_acc, test_acc))
      table.append([reduction_factor, min_freq, 'LSTM', 'GloveEmbeddings', 'Fine-tuning', valid_acc, test_acc])

      ## CNN:
      # Without pre-training:
      valid_acc, test_acc = trainings(
          model_type='CNN',
          vocabulary_size=size_voc,
          embeddings=None
      )
      fichier_grid.write(" Reduction factor = {}, Min Freq = {}, Model_type = {}, Embedding = {}, Fine-tuning = {}, Valid_Acc = {}, Test_Acc = {} \n".format(reduction_factor, min_freq, 'CNN', 'No', 'No', valid_acc, test_acc))
      table.append([reduction_factor, min_freq, 'CNN', 'No embedding', 'No fine-tuning', valid_acc, test_acc])
      # With pre-training and without fine-tuning:
      valid_acc, test_acc = trainings(
          model_type='CNN',
          vocabulary_size=size_voc,
          embeddings=GloveEmbeddings,
          fine_tuning=False
      )
      fichier_grid.write(" Reduction factor = {}, Min Freq = {}, Model_type = {}, Embedding = {}, Fine-tuning = {}, Valid_Acc = {}, Test_Acc = {} \n".format(reduction_factor, min_freq, 'CNN', 'Yes', 'No', valid_acc, test_acc))
      table.append([reduction_factor, min_freq, 'CNN', 'GloveEmbeddings', 'No fine-tuning', valid_acc, test_acc])
      # With pre-training and with fine-tuning:
      valid_acc, test_acc = trainings(
          model_type='CNN',
          vocabulary_size=size_voc,
          embeddings=GloveEmbeddings,
          fine_tuning=True
      )
      fichier_grid.write(" Reduction factor = {}, Min Freq = {}, Model_type = {}, Embedding = {}, Fine-tuning = {}, Valid_Acc = {}, Test_Acc = {} \n".format(reduction_factor, min_freq, 'CNN', 'Yes', 'Yes', valid_acc, test_acc))
      table.append([reduction_factor, min_freq, 'CNN', 'GloveEmbeddings', 'Fine-tuning', valid_acc, test_acc])
  fichier_grid.close()

The output stream was truncated and only contains the last 5000 lines.
Epoch 6 : Validation loss = 0.49053118005394936; Validation acc = 81.0
Epoch 7:
Batch 0 : training loss = 0.06259764730930328; training acc = 100.0
Epoch 7 : Validation loss = 0.5039448104798794; Validation acc = 82.0
Early stopping.
RUN COURANT = 0
Beginning training...
Epoch 1:
Batch 0 : training loss = 0.6905484795570374; training acc = 52.5
Epoch 1 : Validation loss = 0.6907703205943108; Validation acc = 54.5
Epoch 2:
Batch 0 : training loss = 0.6763463020324707; training acc = 70.5
Epoch 2 : Validation loss = 0.6841566637158394; Validation acc = 55.0
Epoch 3:
Batch 0 : training loss = 0.668154776096344; training acc = 66.5
Epoch 3 : Validation loss = 0.6763539984822273; Validation acc = 58.0
Epoch 4:
Batch 0 : training loss = 0.6418187022209167; training acc = 75.0
Epoch 4 : Validation loss = 0.6756289452314377; Validation acc = 57.0
Epoch 5:
Batch 0 : training loss = 0.6405028700828552; tr

## Analysis

Fixed and varying parameters are detailed in the previous section. For reference, here is the correspondence between the tested parameters and the size of the vocabulary.

*Size of the vocabulary for different min_freq and size of dataset* 

min_freq \ Training samples| 400 | 800 |1600
---------------------------|-----|-----|-----
1                          |11177|16683|24372
3                          |3400 |5726 |9162
5                          |2071 |3536 |5947
10                         |1036 |1903 |3267
20                         |528  |1004 |1794

**Note**: these values depend on which samples end up in the dataset, that is, on the split obtained with random seed 42.
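
To illustrate how `min_freq` drives the numbers in this table, here is a minimal sketch on a made-up toy corpus. The helper `vocabulary_size` is hypothetical; it follows the same rule as `build_vocab` (keep tokens seen at least `min_freq` times, then add `UNK`), but works on already tokenized sentences.

```python
from collections import Counter


def vocabulary_size(tokenized_corpus, min_freq):
    """Number of vocabulary entries, counting the UNK token (hypothetical helper)."""
    counts = Counter(token for sent in tokenized_corpus for token in sent)
    kept = [token for token, count in counts.items() if count >= min_freq]
    return len(kept) + 1  # +1 for UNK


# Made-up sentences, only for illustration
toy_corpus = [
    "a great great movie".split(),
    "a bad movie".split(),
    "not a great plot".split(),
]

for min_freq in (1, 2, 3):
    print("min_freq =", min_freq, "-> vocabulary size =", vocabulary_size(toy_corpus, min_freq))
```

Even on three sentences the vocabulary shrinks quickly as the threshold rises, which is the same trend the table shows at scale.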

### Hypotheses

The Averaging models are bag-of-words models: they do not take the order of the words into account. For example, "... is not good" could be interpreted as a positive review since "good" carries a strong positive meaning. In contrast, an LSTM can handle sequential data (and negations). A CNN should also take advantage of the positions of the words. Hence, we would expect better results from the LSTM or CNN models.
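
The order-invariance of averaging can be checked with a small sketch. The 2-dimensional vectors in `toy_embeddings` are made up for the example and have nothing to do with the GloVe vectors or the trained models.

```python
# Hypothetical 2-d embeddings, chosen only for illustration
toy_embeddings = {
    "not": (-1.0, 0.0),
    "good": (1.5, 0.5),
    "bad": (-1.5, 0.5),
    "but": (0.0, 0.25),
}


def average_embedding(tokens):
    """Mean of the token vectors, which is all an averaging model sees."""
    vectors = [toy_embeddings[token] for token in tokens]
    dim = len(vectors[0])
    return tuple(sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim))


negative = average_embedding("not good but bad".split())
positive = average_embedding("not bad but good".split())
print(negative == positive)
```

Both sentences contain the same tokens, so they get exactly the same representation although their sentiment is opposite. A model that reads the sequence in order does not have this limitation.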

We expect better performance, or at least faster convergence, from models using pretrained embeddings because these already provide a meaningful representation of words. However, the meaning captured by the pretrained vectors may not fully match our task. Fine-tuning should therefore be necessary to bring out the positive/negative meaning in the token embeddings.

There is a trade-off between the size of the vocabulary and the number of training samples. With more documents in the training set, a given word is more likely to occur several times. If we add every word we encounter to the vocabulary (`min_freq=1`), it will contain useless words and the computation time is higher. Conversely, we miss useful words if `min_freq` is too high: some words are rare only because few samples happen to contain them, not because they are uninformative.
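
One way to look at the cost of a high `min_freq` is the share of tokens in unseen text that fall back to `UNK`. The sketch below measures it on made-up token lists; `unk_rate` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def unk_rate(train_tokens, eval_tokens, min_freq):
    """Fraction of eval tokens missing from a vocabulary built with min_freq."""
    counts = Counter(train_tokens)
    vocab = {token for token, count in counts.items() if count >= min_freq}
    unknown = sum(1 for token in eval_tokens if token not in vocab)
    return unknown / len(eval_tokens)


# Made-up data, only for illustration
train_tokens = "the film was great the cast was superb the plot was dull".split()
eval_tokens = "the plot was superb".split()

for min_freq in (1, 2, 4):
    print("min_freq =", min_freq, "-> UNK rate =", unk_rate(train_tokens, eval_tokens, min_freq))
```

In this toy case the informative words ("plot", "superb") are the first to disappear, while frequent function words survive the longest.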

## Results

For each table, we present the results as *Final Validation Accuracy \ Testing Accuracy*, rounded to the nearest tenth. The former helps select the best hyperparameters (in our case, the value of `min_freq`); the latter assesses the final model and its ability to generalize to new data. The results below are accuracies averaged over 3 runs for each model.
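
The averaging over runs can be sketched as follows, assuming the line format written by the grid search loop. The function `average_runs` and the regular expression are illustrative, and the three sample lines contain made-up accuracies, not actual results.

```python
import re
from collections import defaultdict

# Matches one line as written by the grid search loop
LINE_RE = re.compile(
    r"Reduction factor = (?P<factor>[\d.]+), Min Freq = (?P<min_freq>\d+), "
    r"Model_type = (?P<model>\w+), Embedding = (?P<emb>\w+), "
    r"Fine-tuning = (?P<ft>\w+), Valid_Acc = (?P<valid>[\d.]+), "
    r"Test_Acc = (?P<test>[\d.]+)"
)


def average_runs(lines):
    """Average validation/test accuracy per configuration, rounded to one decimal."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # valid sum, test sum, count
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip lines that do not follow the expected format
        key = (float(m["factor"]), int(m["min_freq"]), m["model"], m["emb"], m["ft"])
        sums[key][0] += float(m["valid"])
        sums[key][1] += float(m["test"])
        sums[key][2] += 1
    return {
        key: (round(valid / n, 1), round(test / n, 1))
        for key, (valid, test, n) in sums.items()
    }


# Made-up lines standing in for the same configuration in three runs
sample_lines = [
    " Reduction factor = 50, Min Freq = 1, Model_type = CNN, Embedding = Yes, Fine-tuning = No, Valid_Acc = 50.0, Test_Acc = 40.0 \n",
    " Reduction factor = 50, Min Freq = 1, Model_type = CNN, Embedding = Yes, Fine-tuning = No, Valid_Acc = 60.0, Test_Acc = 50.0 \n",
    " Reduction factor = 50, Min Freq = 1, Model_type = CNN, Embedding = Yes, Fine-tuning = No, Valid_Acc = 70.0, Test_Acc = 60.0 \n",
]
print(average_runs(sample_lines))
```

In practice the lines would be read from the per-run output files and concatenated before calling the function.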

### Averaging models

*Averaging model: not-pretrained embedding* (Valid_acc \ Test_acc)

min_freq \ Training samples| 400       | 800       | 1600     
---------------------------|-----------|-----------|-----------
1                          |57.3 \ 60.7|59.7 \ 57.7|**72.7 \ 72.3**
3                          |**64.7 \ 62.0**|59.3 \ 61.7|69.3 \ 70.2
5                          |56.0 \ 57.3|59.7 \ 60.3|72.2 \ 71.7
10                         |54.0 \ 60.7|56.7 \ 62.3|70.3 \ 73.0
20                         |52.7 \ 54.7|56.7 \ 61.0|67.8 \ 71.3

*Averaging model: pretrained embedding, without fine-tuning*

min_freq \ Training samples| 400       | 800       |1600      
---------------------------|-----------|-----------|-----------
1                          |56.7 \ 62.7|**68.0 \ 66.7**|63.5 \ 65.5
3                          |56.0 \ 64.7|62.0 \ 68.7|53.2 \ 55.3
5                          |54.0 \ 59.3|58.7 \ 61.3|**64.3 \ 61.3**
10                         |54.0 \ 59.3|58.0 \ 60.0|55.3 \ 58.0
20                         |56.0 \ 60.7|56.7 \ 63.7|58.0 \ 60.2

*Averaging model: pretrained embedding, with fine-tuning*

min_freq \ Training samples| 400       | 800       |1600      
---------------------------|-----------|-----------|-----------
1                          |**70.7 \ 68.0**|68.0 \ 67.3|76.7 \ 77.0
3                          |66.0 \ 66.7|**70.0 \ 68.7**|**77.0 \ 77.7**
5                          |62.7 \ 68.7|69.3 \ 68.7|75.8 \ 76.8
10                         |64.7 \ 68.7|68.0 \ 68.0|75.7 \ 77.3
20                         |60.7 \ 69.3|60.0 \ 61.3|73.3 \ 75.2

Overall, bigger vocabularies tend to give better results. The best results are obtained with the fine-tuned pretrained embeddings, with a maximum testing accuracy of 77.7%. This setting also gives the best results on the smallest datasets and is more robust when `min_freq` is higher, which implies smaller vocabularies.

Using pretrained embeddings without fine-tuning gives the least satisfactory results here. This could mean that the GloVe embeddings, used as they are, are not adapted enough to our task.

### LSTM models

*LSTM model: not-pretrained embedding* (Valid_acc \ Test_acc)

min_freq \ Training samples| 400       | 800       |1600      
---------------------------|-----------|-----------|-----------
1                          |54.7 \ 54.0|52.7 \ 52.7|54.8 \ 56.0
3                          |53.3 \ 51.3|52.0 \ 54.3|53.5 \ 53.8
5                          |58.0 \ 52.7|48.3 \ 52.3|54.0 \ 53.7
10                         |54.0 \ 49.3|49.7 \ 50.7|51.8 \ 52.5
20                         |52.7 \ 48.0|48.7 \ 49.0|**55.2 \ 54.2**

*LSTM model: pretrained embedding, without fine-tuning*

min_freq \ Training samples| 400       | 800       |1600      
---------------------------|-----------|-----------|-----------
1                          |54.0 \ 46.7|56.7 \ 54.0|**60.0 \ 56.5**
3                          |55.3 \ 41.3|54.7 \ 54.7|56.3 \ 50.8
5                          |55.3 \ 43.3|54.0 \ 48.7|57.8 \ 55.3
10                         |56.0 \ 42.7|52.7 \ 53.7|53.8 \ 51.2
20                         |58.0 \ 47.3|50.3 \ 49.3|53.8 \ 53.3

*LSTM model: pretrained embedding, with fine-tuning*

min_freq \ Training samples| 400       | 800       |1600      
---------------------------|-----------|-----------|-----------
1                          |61.3 \ 43.3|55.0 \ 52.0|**56.2 \ 50.8**
3                          |60.0 \ 40.0|54.7 \ 56.7|54.5 \ 52.5
5                          |62.0 \ 44.0|55.3 \ 53.0|53.2 \ 49.8
10                         |57.3 \ 44.0|53.7 \ 52.0|52.0 \ 53.0
20                         |60.0 \ 43.3|55.7 \ 50.7|53.3 \ 53.2

The performance obtained is surprisingly unsatisfactory. The `min_freq` parameter does not seem to play a very discriminating role here, as the best accuracy values are found for both small and large numbers of documents.

Regarding testing accuracy, the models struggle to generalize. They may be overfitting the data: 10 epochs may be too many. However, early stopping should prevent that, and even the validation accuracies are not very good. We may explain this phenomenon by the number of samples we chose: among the 25 000 documents available, the largest number of samples we selected is 1600. The conclusions might be different with more samples, but that requires more computation time than we had.

Considering the results of the other models on the same datasets, we would simply have expected better results from the LSTM models.


### CNN models

*CNN model: not-pretrained embedding*

min_freq \ Training samples| 400       | 800       |1600      
---------------------------|-----------|-----------|-----------
1                          |57.3 \ 56.0|55.3 \ 58.7|60.3 \ 63.8
3                          |48.7 \ 48.3|59.7 \ 61.3|**64.3 \ 64.5**
5                          |54.7 \ 58.7|54.3 \ 57.3|56.2 \ 55.8
10                         |45.3 \ 50.7|52.0 \ 55.7|59.0 \ 59.8
20                         |49.3 \ 55.3|52.0 \ 52.7|54.5 \ 60.0

*CNN model: pretrained embedding, without fine-tuning*

min_freq \ Training samples| 400       | 800       |1600      
---------------------------|-----------|-----------|-----------
1                          |**61.3 \ 59.3**|**68.0 \ 67.0**|**76.2 \ 78.5**
3                          |49.3 \ 45.3|52.7 \ 51.0|71.8 \ 71.2
5                          |58.0 \ 52.0|65.7 \ 64.0|67.7 \ 66.3
10                         |48.7 \ 47.3|58.0 \ 59.0|69.3 \ 72.0
20                         |59.3 \ 52.7|49.7 \ 48.0|69.8 \ 69.7

*CNN model: pretrained embedding, with fine-tuning*

min_freq \ Training samples| 400       | 800       |1600      
---------------------------|-----------|-----------|-----------
1                          |**56.0 \ 54.0**|**77.7 \ 73.7**|66.5 \ 66.0
3                          |49.3 \ 46.7|54.3 \ 55.7|51.2 \ 52.3
5                          |48.0 \ 48.0|58.0 \ 59.7|**68.5 \ 68.5**
10                         |46.0 \ 42.7|46.3 \ 45.7|56.2 \ 57.5
20                         |47.3 \ 44.0|71.0 \ 67.3|67.7 \ 65.0

CNN models show satisfying results. Once again, they seem to favor bigger vocabularies.

This time, the best embedding seems to be the pretrained one, but **without fine-tuning**. This differs from the conclusion for the Averaging models. It could be explained by the fact that, in this case, the convolutional layer is expressive enough to exploit the features of the pretrained GloVe embeddings well.

### In a nutshell


With every model we implemented, and regardless of the vocabulary size, the best results are obtained with a higher number of training samples. The longer the training runs, the higher the accuracy we get. We did our grid search with at most 10 epochs and with the `early_stopping` parameter set to `True`. With `early_stopping` on, the `experiment` function stops training as soon as the validation loss fails to strictly decrease from one epoch to the next. Training the models for more epochs on more samples would allow them to generalize better; we would then have found more diversity in the output scores, and the conclusions might vary.
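
The effect of this strict stopping rule can be sketched without any training. In the snippet below, `stopping_epoch` is a hypothetical helper: with `patience=0` it mirrors the rule used in the notebook, while the `patience` argument is an assumed, more tolerant variant that the notebook does not implement. The first loss list is taken from the LSTM log above (rounded); the second is made up.

```python
def stopping_epoch(valid_losses, patience=0):
    """Return the 1-based epoch after which training stops.

    patience=0 mirrors the notebook: stop at the first epoch whose validation
    loss is not strictly lower than the best one seen so far.
    patience>0 is an assumed variant that tolerates that many bad epochs in a row.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                return epoch
    return len(valid_losses)  # never triggered: all epochs were run


# Validation losses of the LSTM without pretraining (rounded from the log)
print(stopping_epoch([0.6883, 0.6987]))

# Made-up curve with one noisy epoch in the middle
noisy = [0.70, 0.69, 0.695, 0.66, 0.65]
print(stopping_epoch(noisy), stopping_epoch(noisy, patience=1))
```

With the strict rule, a single noisy epoch ends training even if the loss would have kept decreasing afterwards, which may partly explain why most runs above stop after 2 or 3 epochs.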

The LSTM model we implemented performs the worst in the context of our experiment, with the hyperparameters we fixed. In the literature, the LSTM architecture is used for POS tagging and sentiment classification, often combined with a softmax activation function for multi-labelling. Therefore, considering other optimization methods and other criteria would also be useful to get a better view of the text classification problem. The CNN architecture is widely used for classification, especially image classification, while the LSTM architecture, and RNN architectures more generally, have better predictive power on sequences, as they carry information through time and can thus better infer what comes next.

In conclusion, in the context of our experiments, we found that on the largest dataset analysed (1600 training samples), the two best-performing models are:
- Averaging model with min_freq=3, pretrained and fine-tuned embedding
- CNN model, min_freq=1, pretrained embedding without fine-tuning
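
The selection rule used throughout the analysis (choose `min_freq` on validation accuracy, then report the test accuracy of that choice) can be written in a few lines. The values below are copied from the 1600-sample column of the fine-tuned Averaging table; the function name `select_by_validation` is illustrative.

```python
# min_freq -> (validation accuracy, test accuracy), from the fine-tuned
# Averaging model table, 1600 training samples
results_1600 = {
    1: (76.7, 77.0),
    3: (77.0, 77.7),
    5: (75.8, 76.8),
    10: (75.7, 77.3),
    20: (73.3, 75.2),
}


def select_by_validation(results):
    """Pick the setting with the best validation accuracy and report its test accuracy."""
    best_setting = max(results, key=lambda setting: results[setting][0])
    return best_setting, results[best_setting][1]


best_min_freq, test_acc = select_by_validation(results_1600)
print(best_min_freq, test_acc)
```

Selecting on validation accuracy rather than on test accuracy keeps the reported test score an honest estimate of generalization.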
