# Text classification with PyTorch

The goal of this small lab is twofold: to introduce PyTorch for processing textual data, and to reproduce the results of the previous lab using PyTorch.

In [3]:
import torch
import torch.nn as nn

## 1 - A (very small) introduction to PyTorch

PyTorch tensors are very similar to NumPy arrays, with the added benefit of being usable on GPU. For a short tutorial on various methods to create tensors of particular types, see [this link](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py).
The important things to note are that tensors can be created uninitialized or from lists, and that it is very easy to convert a NumPy array into a PyTorch tensor, and vice versa.

In [4]:
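# An integer argument gives the size: 5 uninitialized values (arbitrary memory contents)
a = torch.LongTensor(5)
# A list argument gives the content: a tensor holding the single value 5
b = torch.LongTensor([5])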

print(a)
print(b)

tensor([140640838406320, 140640838923760, 140640839003424, 140640844647504,
                      0])
tensor([5])
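
The NumPy conversion mentioned above can be sketched as follows. This is a minimal illustration with made-up variable names: on CPU, `torch.from_numpy` and `Tensor.numpy()` both share memory with their source rather than copying it, so an in-place change on one side is visible on the other.

```python
import numpy as np
import torch

array = np.array([1.0, 2.0, 3.0])

# NumPy -> PyTorch: the tensor shares memory with the array (no copy)
tensor = torch.from_numpy(array)

# PyTorch -> NumPy: again a view on the same memory (CPU tensors only)
back = tensor.numpy()

# An in-place change of the array is visible through the tensor
array[0] = 10.0
print(tensor)
print(back)
```

If an independent copy is needed, `torch.tensor(array)` copies the data instead of sharing it.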


In [5]:
a = torch.FloatTensor([2])
b = torch.FloatTensor([3])

print(a + b)

tensor([5.])


Our main interest in using PyTorch is the ```autograd``` package. ```torch.Tensor``` objects have an attribute ```.requires_grad```; if it is set to True, all operations on the tensor are tracked. When you finish your computation, you can call ```.backward()``` and all the gradients are computed automatically (and stored in the ```.grad``` attribute).

One easy way to cut a tensor from the computational graph once it is not needed anymore is to use ```.detach()```.
More information on automatic differentiation in PyTorch can be found at [this link](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py).


In [6]:
x = torch.tensor(1., requires_grad=True)
w = torch.tensor(2., requires_grad=True)
b = torch.tensor(3., requires_grad=True)

# Build a computational graph.
y = w * x + b    # y = 2 * x + 3

# Compute gradients.
y.backward()

# Print out the gradients.
print(x.grad)    # x.grad = 2
print(w.grad)    # w.grad = 1
print(b.grad)    # b.grad = 1

tensor(2.)
tensor(1.)
tensor(1.)
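
The effect of ```.detach()``` can be illustrated with a small sketch (the variable names are illustrative). The detached tensor keeps its value but is treated as a constant, so it does not contribute to the gradient.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2          # tracked: part of the computational graph
z = y.detach()      # same value as y, but cut from the graph

# Only the tracked branch contributes to the gradient:
# d(y + 2 * z)/dx = dy/dx = 2 * x, since z is treated as a constant
out = y + 2 * z
out.backward()

print(z.requires_grad)
print(x.grad)
```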


The building blocks of the models we can use are mostly in the `torch.nn` module of `PyTorch`. We can put these blocks together to create complex networks.

We can use `nn.Linear(size_in, size_out)` to create a linear layer. It takes a tensor of dimensions `(size_batch, *, size_in)` and outputs a tensor of dimensions `(size_batch, *, size_out)`. The `*` denotes that there can be an arbitrary number of dimensions in between. The linear layer performs the operation `xA^T + b`, where `A` (of shape `(size_out, size_in)`) and `b` are initialized randomly.

In [7]:
x = torch.randn(10, 3)

# Build a fully connected layer.
linear = nn.Linear(3, 2)
for name, p in linear.named_parameters():
    print(name)
    print(p)
    
# Forward pass.
linear_output = linear(x)
linear_output

weight
Parameter containing:
tensor([[-0.4882, -0.1790,  0.3123],
        [ 0.1453,  0.1891, -0.4097]], requires_grad=True)
bias
Parameter containing:
tensor([0.4561, 0.3127], requires_grad=True)


tensor([[-1.0513e-01,  8.1647e-01],
        [ 4.2545e-01,  9.5018e-02],
        [ 9.1606e-01, -1.1864e-03],
        [ 9.7760e-01, -8.7732e-02],
        [ 3.9352e-01, -1.7189e-01],
        [-9.4455e-01,  1.3338e+00],
        [ 1.3715e+00, -3.5865e-01],
        [ 5.3682e-01,  5.1226e-01],
        [ 8.0884e-01, -3.6995e-02],
        [-2.8448e-01,  7.4699e-01]], grad_fn=<AddmmBackward0>)
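
As a quick check of what the linear layer computes, the sketch below compares the layer output with the same operation written by hand, on an input that has one extra dimension between the batch and feature dimensions. The seed and shapes are arbitrary choices made for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

layer = nn.Linear(3, 2)
x = torch.randn(4, 5, 3)   # (size_batch, *, size_in) with one extra dimension

out = layer(x)
# Same computation written by hand: weight has shape (size_out, size_in)
manual = x @ layer.weight.T + layer.bias

print(out.shape)
print(torch.allclose(out, manual, atol=1e-6))
```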

We can also use the `nn` module to apply activation functions to our tensors, in order to add non-linearity to our network.

In [8]:
# Activation functions operate on each element separately, so the output tensor
# has the same shape as the tensor we pass in.
sigmoid = nn.Sigmoid()
output = sigmoid(linear_output)
output

tensor([[0.4737, 0.6935],
        [0.6048, 0.5237],
        [0.7142, 0.4997],
        [0.7266, 0.4781],
        [0.5971, 0.4571],
        [0.2800, 0.7915],
        [0.7976, 0.4113],
        [0.6311, 0.6253],
        [0.6919, 0.4908],
        [0.4294, 0.6785]], grad_fn=<SigmoidBackward0>)

In [9]:
# We can then create blocks of layers
block = nn.Sequential(
    nn.Linear(3, 2),
    nn.Sigmoid()
)

input = torch.ones(1,4,3)
output = block(input)
output.shape

torch.Size([1, 4, 2])

We can define a loss that we want to optimize. We can either define the loss ourselves, or use one of the predefined loss functions in PyTorch, such as `nn.MSELoss()`. We can then compute the gradient!

In [10]:
y = torch.randn(1, 4, 2)

# Build loss function - Mean Square Error
criterion = nn.MSELoss()

# Compute loss.
loss = criterion(output, y)
print('Initial loss: ', loss.item())

# Backward pass.
loss.backward()

# Print out the gradients.
print('dL/dw: ', block[0].weight.grad)
print('dL/db: ', block[0].bias.grad)

Initial loss:  1.3380049467086792
dL/dw:  tensor([[0.1866, 0.1866, 0.1866],
        [0.1249, 0.1249, 0.1249]])
dL/db:  tensor([0.1866, 0.1249])
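
Defining the loss ourselves only requires writing it with tensor operations, so that autograd can differentiate it. The sketch below uses a hypothetical helper `my_mse` and random data to check that it agrees with `nn.MSELoss()` (with its default mean reduction).

```python
import torch
import torch.nn as nn

def my_mse(prediction, target):
    # Mean over all elements of the squared difference
    return ((prediction - target) ** 2).mean()

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
prediction = torch.randn(1, 4, 2, requires_grad=True)
target = torch.randn(1, 4, 2)

custom_loss = my_mse(prediction, target)
reference_loss = nn.MSELoss()(prediction, target)

# The hand-written loss is differentiable like any other tensor expression
custom_loss.backward()
print(custom_loss.item(), reference_loss.item())
print(prediction.grad.shape)
```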


Instead of using the predefined modules, we can also build our own by extending the `nn.Module` class.

To create a custom module, the first thing to do is to extend `nn.Module`. We can then:
- Initialize our parameters in the `__init__` function, starting with a call to the `__init__` function of the super class.
- Define layers as class attributes: every attribute that is an `nn` module object is registered as a submodule, and its parameters can be learned during training.

All classes extending `nn.Module` are expected to implement a `forward(x)` function, where `x` is a tensor. This is the function that is called when an input is passed to our module, such as in `model(x)`.

In [11]:
class SimpleModel(nn.Module):
    def __init__(self, input_size, output_size):
        # Call to the __init__ function of the super class
        super(SimpleModel, self).__init__()

        # Bookkeeping: Saving the initialization parameters
        self.input_size = input_size 
        self.output_size = output_size 

        # Defining of our model
        # There isn't anything specific about the naming of `self.model`. It could
        # be something arbitrary.
        self.model = nn.Sequential(
            nn.Linear(self.input_size, self.output_size),
            nn.Sigmoid()
        )

    def forward(self, x):
        output = self.model(x)
        return output

We can replace `nn.Sequential` by defining the individual layers in the `__init__` method and connecting them in the `forward` method.

In [12]:
class SimpleModel(nn.Module):
    def __init__(self, input_size, output_size):
        # Call to the __init__ function of the super class
        super(SimpleModel, self).__init__()

        # Bookkeeping: Saving the initialization parameters
        self.input_size = input_size 
        self.output_size = output_size 

        # Defining of our layers
        self.linear = nn.Linear(self.input_size, self.output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        linear = self.linear(x)
        output = self.sigmoid(linear)
        return output

In [13]:
# Make a sample input
input = torch.randn(2, 5)

print(input)
# Create our model
model = SimpleModel(5, 3)

# Pass our input through our model
model(input)

tensor([[-0.9912, -0.7427, -1.7878,  0.6246,  0.7314],
        [ 0.6982,  1.0264, -2.1627,  0.0996,  0.2866]])


tensor([[0.4501, 0.4115, 0.6404],
        [0.3764, 0.4531, 0.6393]], grad_fn=<SigmoidBackward0>)

In [14]:
list(model.named_parameters())

[('linear.weight',
  Parameter containing:
  tensor([[ 7.1670e-02, -7.3716e-02,  4.1136e-01,  2.5472e-04,  3.1683e-01],
          [ 1.9045e-01, -7.5062e-02, -1.7595e-02,  2.2684e-01, -2.0937e-01],
          [-3.4354e-01,  2.5099e-01, -1.8495e-01, -1.0614e-02, -1.2780e-01]],
         requires_grad=True)),
 ('linear.bias',
  Parameter containing:
  tensor([ 0.3197, -0.2446,  0.1923], requires_grad=True))]
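
To make the registration rule concrete, here is a small sketch with a hypothetical module `ScaleAndShift`: attributes that are `nn.Parameter` objects or `nn` modules show up in `named_parameters()`, whereas a plain tensor attribute does not and would therefore not be updated by an optimizer.

```python
import torch
import torch.nn as nn

class ScaleAndShift(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, size):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(size))  # registered as a parameter
        self.shift = torch.zeros(size)                # plain tensor: NOT registered
        self.linear = nn.Linear(size, 1)              # submodule: its parameters are registered

    def forward(self, x):
        return self.linear(x * self.scale + self.shift)

toy_model = ScaleAndShift(3)
names = [name for name, _ in toy_model.named_parameters()]
num_params = sum(p.numel() for p in toy_model.parameters())
print(names)
print(num_params)
```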

How do we update the parameters of our models? The `torch.optim` module contains optimizers that we can use: usually, `optim.SGD` and `optim.Adam`.

When initializing an optimizer, we pass it our model parameters, telling it which values to optimize. Optimizers also have a learning rate (`lr`) parameter, which determines how big an update will be made at every step.

In [15]:
# Creating input and output
x = torch.randn(2, 3)
y = torch.randn(2, 1)

# Creating the model
model = SimpleModel(3, 1)
# And applying the forward
pred = model(x)
print(pred)

# Creating the loss and computing the gradient
criterion = nn.MSELoss()
loss = criterion(pred, y)
print('Initial Loss: ', loss.item())
loss.backward()

# You can perform gradient descent manually, with an in-place update ...
model.linear.weight.data.sub_(0.01 * model.linear.weight.grad.data)
model.linear.bias.data.sub_(0.01 * model.linear.bias.grad.data)

# Print out the loss after 1-step gradient descent.
pred = model(x)
loss = criterion(pred, y)
print('Loss after one update: ', loss.item())

tensor([[0.5706],
        [0.5070]], grad_fn=<SigmoidBackward0>)
Initial Loss:  3.154512882232666
Loss after one update:  3.13792085647583


In [16]:
# Use the optim package to define an Optimizer that will update the weights of the model.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# By default, gradients are accumulated in buffers (i.e., not overwritten) whenever .backward()
# is called. Before the backward pass, we need to use the optimizer object to zero all of the
# gradients.
optimizer.zero_grad()
loss.backward()

# Calling the step function on an Optimizer makes an update to its parameters
optimizer.step()

# Print out the loss after the second step of gradient descent.
pred = model(x)
loss = criterion(pred, y)
print('Loss after two updates: ', loss.item())

Loss after two updates:  3.121342897415161
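
Putting the pieces together, the usual pattern is to repeat the zero_grad / forward / backward / step sequence in a loop. The following is a minimal sketch on synthetic data; the data, the model size, the learning rate and the number of steps are arbitrary choices made for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

# Synthetic regression data: y is a fixed linear function of x
x = torch.randn(64, 3)
true_w = torch.tensor([[1.0], [-2.0], [0.5]])
y = x @ true_w

toy_model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.05)

losses = []
for step in range(100):
    optimizer.zero_grad()                 # reset the accumulated gradients
    loss = criterion(toy_model(x), y)     # forward pass and loss
    loss.backward()                       # compute the gradients
    optimizer.step()                      # update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])
```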


## 2 - Tools for data processing 

```torch.utils.data.Dataset``` is an abstract class representing a dataset. Your custom dataset should inherit ```Dataset``` and override the following methods:
- ```__len__``` so that ```len(dataset)``` returns the size of the dataset.
- ```__getitem__``` to support the indexing such that ```dataset[i]``` can be used to get the i-th sample

Here is a toy example: 

In [17]:
toy_corpus = ['I walked down down the boulevard',
              'I walked down the avenue',
              'I ran down the boulevard',
              'I walk down the city',
              'I walk down the the avenue']

toy_categories = [0, 0, 1, 0, 0]

In [18]:
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    # A pytorch dataset class for holding data for a text classification task.
    def __init__(self, data, categories):
        # Upon creating the Dataset object, store the data in an attribute
        # Split the text data and labels from each other
        self.X, self.Y = [], []
        for x, y in zip(data, categories):
            # We will probably need to preprocess the data - have it done in a separate method
            # We do it here because we might need corpus-wide info to do the preprocessing
            # For example, cutting all examples to the same length
            self.X.append(self.preprocess(x))
            self.Y.append(y)
                
    # Method allowing you to preprocess data                      
    def preprocess(self, text):
        text_pp = text.lower().strip()
        return text_pp
    
    # Overriding the method __len__ so that len(CustomDatasetName) returns the number of data samples                     
    def __len__(self):
        return len(self.Y)
   
    # Overriding the method __getitem__ so that CustomDatasetName[i] returns the i-th sample of the dataset                      
    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]

In [19]:
toy_dataset = CustomDataset(toy_corpus, toy_categories)

In [20]:
print(len(toy_dataset))
for i in range(len(toy_dataset)):
    print(toy_dataset[i])

5
('i walked down down the boulevard', 0)
('i walked down the avenue', 0)
('i ran down the boulevard', 1)
('i walk down the city', 0)
('i walk down the the avenue', 0)


```torch.utils.data.DataLoader``` is an iterable over a dataset, which provides very useful features:
- Batching the data
- Shuffling the data
- Loading the data in parallel using multiprocessing workers

It can be created very simply from a ```Dataset```. Continuing with our simple example:

In [21]:
toy_dataloader = DataLoader(toy_dataset, batch_size = 2, shuffle = True)

In [22]:
for e in range(3):
    print("Epoch:" + str(e))
    for x, y in toy_dataloader:
        print("Batch: " + str(x) + "; labels: " + str(y))  

Epoch:0
Batch: ('i walk down the city', 'i ran down the boulevard'); labels: tensor([0, 1])
Batch: ('i walked down the avenue', 'i walk down the the avenue'); labels: tensor([0, 0])
Batch: ('i walked down down the boulevard',); labels: tensor([0])
Epoch:1
Batch: ('i ran down the boulevard', 'i walked down the avenue'); labels: tensor([1, 0])
Batch: ('i walk down the the avenue', 'i walked down down the boulevard'); labels: tensor([0, 0])
Batch: ('i walk down the city',); labels: tensor([0])
Epoch:2
Batch: ('i walked down down the boulevard', 'i ran down the boulevard'); labels: tensor([0, 1])
Batch: ('i walk down the city', 'i walk down the the avenue'); labels: tensor([0, 0])
Batch: ('i walked down the avenue',); labels: tensor([0])
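
As the output above shows, the last batch of each epoch is smaller when the dataset size is not a multiple of the batch size. The sketch below uses a toy `TensorDataset` (made up for the illustration) to show this, together with the `drop_last` option that discards the incomplete batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 5 examples with 2 features each, and one label per example
features = torch.arange(10).view(5, 2)
labels = torch.tensor([0, 0, 1, 0, 0])
tensor_dataset = TensorDataset(features, labels)

# 5 examples with batch_size=2 give batches of sizes 2, 2 and 1
loader = DataLoader(tensor_dataset, batch_size=2, shuffle=False)
batch_sizes = [len(y) for _, y in loader]

# drop_last=True discards the final, incomplete batch
loader_drop = DataLoader(tensor_dataset, batch_size=2, shuffle=False, drop_last=True)

print(batch_sizes, len(loader), len(loader_drop))
```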


## 3 - Data processing of a text dataset

Now, we would like to apply what we saw to our case, and **create a specific class** ```TextClassificationDataset``` **inheriting** ```Dataset``` that will:
- Create a vocabulary from the data (use what we saw in the previous lab)
- Preprocess the data using this vocabulary, adding whatever we need for our PyTorch model
- Have a ```__getitem__``` method that allows us to use the class with a ```DataLoader``` to easily build batches.

In [23]:
import os
import sys
import os.path as op
from torch.nn import functional as F
import numpy as np
import random

from nltk import word_tokenize
from torch.nn.utils.rnn import pad_sequence

First, we get the filenames and the corresponding categories: 

In [33]:
# For those on google colab: you can download the files directly with this:
import gdown
gdown.download("http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", output="aclImdb_v1.tar.gz", quiet=False)
!tar xzf /content/aclImdb_v1.tar.gz

In [34]:
from glob import glob
# We get the files from the path: ./aclImdb/train/neg for negative reviews, and ./aclImdb/train/pos for positive reviews
train_filenames_neg = sorted(glob(op.join('.', 'aclImdb','train', 'neg', '*.txt')))
train_filenames_pos = sorted(glob(op.join('.', 'aclImdb','train', 'pos', '*.txt')))

# Each file contains a review that consists of one line of text: we put these strings in two lists, which we concatenate
train_texts_neg = [open(f, encoding="utf8").read() for f in train_filenames_neg]
train_texts_pos = [open(f, encoding="utf8").read() for f in train_filenames_pos]
train_texts = train_texts_neg + train_texts_pos

# The first half of the elements of the list are strings of negative reviews, and the second half positive ones
# We create the labels as an array of shape (len(train_texts),) filled with 1, and change the first half to 0
train_labels = np.ones(len(train_texts), dtype=int)
train_labels[:len(train_texts_neg)] = 0

Example of one document:

In [36]:
open("./aclImdb/train/neg/0_3.txt", encoding="utf8").read()

"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

We can use a function from sklearn, ```train_test_split```, to separate data into training and validation sets:



In [37]:
from sklearn.model_selection import train_test_split

In [38]:
train_texts_splt, val_texts, train_labels_splt, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

We can now implement our ```TextClassificationDataset``` class, which we will build from:
- The list of review texts of the training set: ```data```
- The list of the corresponding categories: ```categories```

We will add three optional arguments:
- First, a way to input a vocabulary (so that we can re-use the training vocabulary for the validation and test ```TextClassificationDataset```). By default, the value of the argument is ```None```.
- In order to work with batches, we will need to have sequences of the same size. That can be done via **padding**, but we will still need to limit the size of documents to a ```max_length``` (to avoid having batches of huge sequences that are mostly empty because of one very long document). Let's set it to 100 by default.
- Lastly, a ```min_freq``` that indicates how many times a word must appear in the corpus to be included in the vocabulary.

The idea behind **padding** is to transform a list of PyTorch tensors (possibly of different lengths) into a two-dimensional tensor, which we can see as a batch. With ```batch_first=True```, the first dimension is the number of sequences and the second is the length of the longest one; shorter sequences are **padded** with a chosen symbol: here, we choose 0.

In [39]:
tensor_1 = torch.LongTensor([1, 4, 5])
tensor_2 = torch.LongTensor([2])
tensor_3 = torch.LongTensor([6, 7])

In [40]:
tensor_padded = pad_sequence([tensor_1, tensor_2, tensor_3], batch_first=True, padding_value = 0)
print(tensor_padded)

tensor([[1, 4, 5],
        [2, 0, 0],
        [6, 7, 0]])
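
Combining the two ideas above (cutting to a maximum length, then padding) can be sketched as follows. The value `max_length = 4` and the sequences are illustrative; the boolean mask is an extra that lets us recover which positions hold real tokens, assuming index 0 is reserved for padding.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

max_length = 4  # illustrative value (the lab uses 100 by default)

sequences = [torch.LongTensor([1, 4, 5, 2, 3, 9]),
             torch.LongTensor([2]),
             torch.LongTensor([6, 7])]

# Cut the sequences that are too long, then pad the shorter ones with 0
truncated = [seq[:max_length] for seq in sequences]
batch = pad_sequence(truncated, batch_first=True, padding_value=0)

# True where there is a real token, False where there is padding
mask = batch != 0
lengths = mask.sum(dim=1)

print(batch)
print(lengths)
```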


<div class='alert alert-block alert-info'>
            Code:</div>

In [49]:
class TextClassificationDataset(Dataset):
    def __init__(self, data, categories, vocab = None, max_length = 100, min_freq = 5):
        self.data = data
        # Set the maximum length we will keep for the sequences
        self.max_length = max_length

        # Allow to import a vocabulary (for valid/test datasets, that will use the training vocabulary)
        if vocab is not None:
            self.word2idx, self.idx2word = vocab
        else:
            # If no vocabulary imported, build it (and reverse)
            self.word2idx, self.idx2word = self.build_vocab(self.data, min_freq)

        # We then need to tokenize the data ...
        tokenized_data = [word_tokenize(text) for text in self.data]

        # Transform words into lists of indexes (the .get() method redirects unknown words to the UNK token)
        unk_idx = self.word2idx['UNK']
        indexed_data = [[self.word2idx.get(w, unk_idx) for w in sent] for sent in tokenized_data]
        # And transform this list of lists into a list of PyTorch LongTensors
        tensor_data = [torch.LongTensor(sent) for sent in indexed_data]
        # And the categories into a FloatTensor
        tensor_y = torch.FloatTensor(categories)
        # To finally cut the sequences that are above the maximum length
        cut_tensor_data = [tensor[:self.max_length] for tensor in tensor_data]

        # Now, we use the pad_sequence function to represent the whole dataset as one tensor
        # containing sequences of the same length. We choose the padding_value to be 0, and we want
        # the batch dimension to be the first dimension
        self.tensor_data = pad_sequence(cut_tensor_data, batch_first=True, padding_value=0)
        self.tensor_y = tensor_y
        
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # The iterator just gets one particular example with its category
        # The dataloader will take care of the shuffling and batching
        if torch.is_tensor(idx):
            idx = idx.tolist()
        return self.tensor_data[idx], self.tensor_y[idx] 
    
    def build_vocab(self, corpus, count_threshold):
        # Same as in previous work
        word_count = {}
        for sentence in corpus:
            tokens = word_tokenize(sentence)
            for token in tokens:
                if token not in word_count:
                    word_count[token] = 1
                else:
                    word_count[token] += 1
        filtered_word_counts = {word: count for word, count in word_count.items() if count >= count_threshold}        
        words = sorted(filtered_word_counts.keys(), key=word_count.get, reverse=True) + ['UNK']  
        # But we need to shift the indexes by 1 to put the padding symbol to 0
        word_index = {words[i] : i+1 for i in range(len(words))}
        idx_word = {i+1 : words[i] for i in range(len(words))}
        return word_index, idx_word
    
    def get_vocab(self):
        # A simple way to get the training vocab when building the valid/test 
        return self.word2idx, self.idx2word
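
The vocabulary logic can be tried in isolation on a toy corpus. The sketch below is a simplified variant, not the class method itself: it uses `collections.Counter` and whitespace splitting instead of `word_tokenize` (an assumption made to keep the example dependency-free), and the function name and corpus are illustrative.

```python
from collections import Counter

def build_toy_vocab(corpus, min_freq=1):
    # Count the words (whitespace tokenization, a simplification of word_tokenize)
    counts = Counter(token for sentence in corpus for token in sentence.split())
    # Keep frequent words, most frequent first, and add the UNK token at the end
    words = [w for w, c in counts.most_common() if c >= min_freq] + ['UNK']
    # Indexes are shifted by 1 so that 0 stays free for the padding symbol
    word2idx = {w: i + 1 for i, w in enumerate(words)}
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

toy_sentences = ['I walked down the avenue', 'I ran down the boulevard']
word2idx, idx2word = build_toy_vocab(toy_sentences, min_freq=2)

# Words below the frequency threshold or never seen are mapped to UNK
encoded = [word2idx.get(t, word2idx['UNK']) for t in 'I walked down the street'.split()]
print(word2idx)
print(encoded)
```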

In [50]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/nico/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [51]:
training_dataset = TextClassificationDataset(train_texts_splt, train_labels_splt)
training_word2idx, training_idx2word = training_dataset.get_vocab()

In [52]:
valid_dataset = TextClassificationDataset(val_texts, val_labels, (training_word2idx, training_idx2word))

In [46]:
training_dataloader = DataLoader(training_dataset, batch_size = 200, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size = 25)

In [53]:
print(valid_dataset[1])

(tensor([   14,   146,  2667,   172,   742,     2,  2364,   317,    17,    47,
        16980,  4274,     3,   306,   314,     2,    14,   598,     7,   885,
           18,    41,     3,    50,    17,     1,   444,   719,  1463,  1612,
           25,     6,     1,  1454,    17,     2,    44, 31668,    68,     3,
        31668,    22, 31668, 17759,     2, 12172,  8410,   585,     4,  1056,
            6,  2654,     2,    15,    17,    45,   770,    25,     2,     7,
          172,    46,  1137,    18,   266,     6,   572,     3,   125,    75,
           33,    69, 31668,   149,  2910,    88,   213,     2,    33,    69,
          182,    96, 13481,    58,   160,    35,    56,    43,     1,  4237,
          456,    53,    43,     1,  3558,     6,     1,   175,    13,     1]), tensor(1.))


In [54]:
example_batch = next(iter(training_dataloader))
print(example_batch[0].size())
print(example_batch[1].size())

torch.Size([200, 2818])
torch.Size([200])


## 4 - A simple averaging model

Now, we will implement in PyTorch what we did in the previous lab: a simple averaging model. For each model we will implement, we need to create a class which inherits from ```nn.Module``` and redefine the ```__init__``` method as well as the ```forward``` method.

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
class AveragingModel(nn.Module):
    def __init__(self, embedding_dim, vocabulary_size):
        super().__init__()
        # Create an embedding object. Be careful with padding - you need to increase the vocabulary size by one!
        # Look into the arguments of the nn.Embedding class
        self.embeddings = nn.Embedding(vocabulary_size + 1, embedding_dim, padding_idx=0)
        # Create a linear layer that will transform the mean of the embeddings into a classification score
        self.linear = nn.Linear(embedding_dim, 1)

        # No need for a sigmoid, it is integrated into the criterion!

    def forward(self, inputs):
        # Remember: the inputs are word indexes of shape batch_size * seq_length;
        # after the embedding layer, they become batch_size * seq_length * embedding_dim
        # First, take the mean of the embeddings of the document (one possible completion:
        # a plain mean over the sequence dimension, in which padding positions count as zero vectors)
        x = self.embeddings(inputs).mean(dim=1)
        # Then make it go through the linear layer and remove the extra dimension with the method .squeeze()
        o = self.linear(x).squeeze(-1)
        return o
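
A plain mean over the sequence dimension also counts the padding positions, which shrinks the average of short documents. As a hedged alternative (a variant, not necessarily what the lab expects), the sketch below computes a mean over the real tokens only, assuming index 0 is the padding symbol. The sizes and indexes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

toy_embeddings = nn.Embedding(10, 4, padding_idx=0)
toy_inputs = torch.tensor([[1, 2, 0, 0],
                           [3, 4, 5, 6]])

vectors = toy_embeddings(toy_inputs)                  # (batch, seq_length, embedding_dim)
mask = (toy_inputs != 0).unsqueeze(-1).float()        # (batch, seq_length, 1)

# Sum the embeddings of real tokens and divide by the number of real tokens
# (clamp avoids a division by zero for an all-padding row)
masked_mean = (vectors * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

print(masked_mean.shape)
```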

In [None]:
import torch.optim as optim

In [None]:
model = AveragingModel(300, len(training_word2idx))
# Create an optimizer
opt = optim.Adam(model.parameters(), lr=0.0025, betas=(0.9, 0.999))
# The criterion is a binary cross entropy loss based on logits - meaning that the sigmoid is integrated into the criterion
criterion = nn.BCEWithLogitsLoss()
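
To see what "the sigmoid is integrated into the criterion" means, the sketch below compares `nn.BCEWithLogitsLoss` on raw scores with `nn.BCELoss` applied after an explicit sigmoid. The scores and targets are made up for the illustration.

```python
import torch
import torch.nn as nn

logits = torch.tensor([1.5, -0.3, 0.0])    # raw scores, as output by the linear layer
targets = torch.tensor([1.0, 0.0, 1.0])

# Sigmoid and binary cross entropy in a single, numerically stabler step
with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# The same quantity computed in two steps
two_steps = nn.BCELoss()(torch.sigmoid(logits), targets)

print(with_logits.item(), two_steps.item())
```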

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
# Implement a training function, which will train the model with the corresponding optimizer and criterion,
# with the appropriate dataloader, for one epoch.

def train_epoch(model, opt, criterion, dataloader):
    model.train()
    losses = []
    for i, (x, y) in enumerate(dataloader):
        opt.zero_grad()
        # (1) Forward
        pred = model(x)
        # (2) Compute the loss
        loss = criterion(pred, y)
        # (3) Compute the gradients by backpropagating the loss
        loss.backward()
        # (4) Update weights with the optimizer
        opt.step()
        losses.append(loss.item())
        # Count the number of correct predictions in the batch - here, you'll need to use the sigmoid
        num_corrects = ((torch.sigmoid(pred) > 0.5).float() == y).sum()
        acc = 100.0 * num_corrects / len(y)

        if i % 20 == 0:
            print("Batch " + str(i) + " : training loss = " + str(loss.item()) + "; training acc = " + str(acc.item()))
    return losses
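
The accuracy computation can be checked on a tiny hand-made batch. In this sketch (scores and labels are made up), the sigmoid turns the raw scores into probabilities, which are thresholded at 0.5 and compared with the labels.

```python
import torch

toy_logits = torch.tensor([2.0, -1.0, 0.5, -3.0])   # raw scores from the model
toy_labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

# Probabilities above 0.5 are predicted as class 1
toy_preds = (torch.sigmoid(toy_logits) > 0.5).float()

num_corrects = (toy_preds == toy_labels).sum()
acc = 100.0 * num_corrects / len(toy_labels)

print(toy_preds)
print(acc.item())
```

Since the sigmoid of 0 is 0.5, thresholding the probability at 0.5 is equivalent to thresholding the raw score at 0.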

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
# Same for the evaluation! We don't need the optimizer here.
def eval_model(model, criterion, evalloader):
    model.eval()
    total_epoch_loss = 0
    total_epoch_acc = 0
    with torch.no_grad():
        for i, (x, y) in enumerate(evalloader):
            pred = model(x)
            loss = criterion(pred, y)
            num_corrects = ((torch.sigmoid(pred) > 0.5).float() == y).sum()
            acc = 100.0 * num_corrects / len(y)
            total_epoch_loss += loss.item()
            total_epoch_acc += acc.item()

    # Average over the number of batches
    return total_epoch_loss/(i+1), total_epoch_acc/(i+1)

In [None]:
# A function which will help you run experiments quickly - with an early_stopping option when necessary.
def experiment(model, opt, criterion, num_epochs = 5, early_stopping = True):
    train_losses = []
    # Start from +infinity so that the first validation loss always counts as an improvement
    best_valid_loss = float('inf')
    print("Beginning training...")
    for e in range(num_epochs):
        print("Epoch " + str(e+1) + ":")
        train_losses += train_epoch(model, opt, criterion, training_dataloader)
        valid_loss, valid_acc = eval_model(model, criterion, valid_dataloader)
        print("Epoch " + str(e+1) + " : Validation loss = " + str(valid_loss) + "; Validation acc = " + str(valid_acc))
        if early_stopping:
            if valid_loss < best_valid_loss:
                best_valid_loss = valid_loss
            else:
                print("Early stopping.")
                break  
    return train_losses
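
The early stopping rule used here stops at the first epoch where the validation loss does not improve. A common variant waits for a few epochs without improvement before stopping. The sketch below illustrates that idea on a made-up list of validation losses; the function name and the `patience` parameter are hypothetical and not part of the lab code.

```python
def stopping_epoch(valid_losses, patience=1):
    # Returns the index of the epoch at which training would stop
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Never triggered: training runs until the last epoch
    return len(valid_losses) - 1

toy_valid_losses = [0.70, 0.62, 0.64, 0.60, 0.61, 0.63, 0.59]

# patience=1 reproduces the rule of the experiment function
print(stopping_epoch(toy_valid_losses, patience=1))
print(stopping_epoch(toy_valid_losses, patience=2))
```

With `patience=1`, training stops at the first small fluctuation, even though the loss later reaches a lower value; a larger patience trades extra epochs for robustness to such noise.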

In [None]:
train_losses = experiment(model, opt, criterion)

In [None]:
import matplotlib.pyplot as plt
plt.plot(train_losses)

### With GloVe embeddings:

Now, we would like to integrate pre-trained word embeddings into our model! Let's reuse the functions from the previous lab:

In [None]:
import gensim.downloader as api
loaded_glove_model = api.load("glove-wiki-gigaword-300")
loaded_glove_embeddings = loaded_glove_model.vectors

In [None]:
def get_glove_adapted_embeddings(glove_model, input_voc):
    keys = {i: glove_model.key_to_index.get(w, None) for w, i in input_voc.items()}
    index_dict = {i: key for i, key in keys.items() if key is not None}
    embeddings = np.zeros((len(input_voc)+1,glove_model.vectors.shape[1]))
    for i, ind in index_dict.items():
        embeddings[i] = glove_model.vectors[ind]
    return embeddings

GloveEmbeddings = get_glove_adapted_embeddings(loaded_glove_model, training_word2idx)

In [None]:
print(GloveEmbeddings.shape)
# We should check that the "padding" vector is at zero
print(GloveEmbeddings[0])

Here, implement a ```PretrainedAveragingModel``` very similar to the previous model, using the ```nn.Embedding``` method ```from_pretrained()``` to initialize the embeddings from the pre-trained matrix (converted to a FloatTensor). Use the ```requires_grad_``` method (or the ```freeze``` argument of ```from_pretrained()```) to specify whether the model must fine-tune the embeddings or not!

<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
class PretrainedAveragingModel(nn.Module):
    # To complete
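
One possible completion is sketched below, under assumptions: the constructor signature is inferred from the calls that follow in this lab (embedding dimension, vocabulary size, weight tensor, fine-tuning flag), the argument names are ours, and the forward pass uses a plain mean over the sequence dimension. The toy weights at the end only serve to exercise the class.

```python
import torch
import torch.nn as nn

class PretrainedAveragingModel(nn.Module):
    def __init__(self, embedding_dim, vocabulary_size, weights, fine_tuning=False):
        super().__init__()
        # weights: FloatTensor of shape (vocabulary_size + 1, embedding_dim), row 0 being the padding vector
        self.embeddings = nn.Embedding.from_pretrained(weights, padding_idx=0)
        # from_pretrained freezes the embeddings by default; re-enable gradients if we fine-tune
        self.embeddings.weight.requires_grad_(fine_tuning)
        self.linear = nn.Linear(embedding_dim, 1)

    def forward(self, inputs):
        # Mean of the embeddings over the sequence dimension, then a linear score
        x = self.embeddings(inputs).mean(dim=1)
        return self.linear(x).squeeze(-1)

# Toy check with random "pre-trained" vectors (5 words + padding, dimension 4)
toy_weights = torch.randn(6, 4)
toy_weights[0] = 0.0
frozen_model = PretrainedAveragingModel(4, 5, toy_weights, False)
tuned_model = PretrainedAveragingModel(4, 5, toy_weights, True)

scores = frozen_model(torch.tensor([[1, 2, 0], [3, 4, 5]]))
print(scores.shape)
print(frozen_model.embeddings.weight.requires_grad, tuned_model.embeddings.weight.requires_grad)
```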

In [None]:
model_pre_trained = PretrainedAveragingModel(300, len(training_word2idx), torch.FloatTensor(GloveEmbeddings), True)
opt_pre_trained = optim.Adam(model_pre_trained.parameters(), lr=0.0025, betas=(0.9, 0.999))

In [None]:
train_losses = experiment(model_pre_trained, opt_pre_trained, criterion)

In [None]:
model_pre_trained_no_fine_tuning = PretrainedAveragingModel(300, len(training_word2idx), torch.FloatTensor(GloveEmbeddings), False)
opt_pre_trained_no_fine_tuning = optim.Adam(model_pre_trained_no_fine_tuning.parameters(), lr=0.0025, betas=(0.9, 0.999))

In [None]:
train_losses = experiment(model_pre_trained_no_fine_tuning, opt_pre_trained_no_fine_tuning, criterion)