# Exercise Lecture 15: Neural Sequence Tagging

In this assignment, we train a model that detects noun phrases (NPs) referring to visual entities, using the Flickr30k Entities corpus as training data.

In this corpus, each word is labelled (B) if it starts an NP, (I) if it occurs inside an NP, and (O) otherwise. There is one word and one label per line. Sentences are separated by blank lines.

Data file:  f30kE-captions-bio.txt 

In [2]:
cd ..

C:\experiments\cours nlp\data science\lecture15


## Pre-processing

#### Exercise 1 - Creating a list of lists of tokens (one list per sentence) and the corresponding lists of labels

* From the input file, create two lists called "texts" and "labels"
* "texts" is a list of lists, each inner list containing the tokens of one sentence
* "labels" contains the corresponding list of labels for each sentence in "texts"

In [None]:
texts = []
labels = []
with open('f30kE-captions-bio/f30kE-captions-bio.txt') as f:
    # Tokens and labels of the sentence currently being read
    cur_text = []
    cur_labels = []
    for line in f:  # Process the file one line at a time
        line = line.strip()
        if line:  # Non-empty line: we are still inside the current sentence
            text, label = line.split()
            cur_text.append(text)  # Add the current token and label to it
            cur_labels.append(label)
        else:  # Blank line: we have reached the end of a sentence
            texts.append(cur_text)  # Store the sentence's tokens and labels in the global lists
            labels.append(cur_labels)
            cur_text = []  # Reset the current-sentence lists (a new sentence starts)
            cur_labels = []
    if cur_text:  # Keep the last sentence if the file does not end with a blank line
        texts.append(cur_text)
        labels.append(cur_labels)
print(labels)
print(texts)

#### Exercise 2 -  Mapping labels to integers and sequence of labels to sequence of integers

* Create a dictionary label2int which maps each label to a distinct integer
* Apply this dictionary to the list of labels extracted in the previous exercise 

**Hint:** We did this in the preceding lab session.

In [None]:
all_labels = set()  # A set is like a list, but it holds no duplicates and has no fixed order
for sent in labels:  # For each sentence
    for label in sent:
        all_labels.add(label)  # Add every label of the sentence to the set
        
print(all_labels)  # All the distinct labels in the corpus (B, I and O)

label2int = dict()
for i, label in enumerate(sorted(all_labels)):  # Sorting makes the mapping identical on every run
    label2int[label] = i

print(label2int)

# A nested list comprehension keeps this short. int_labels holds,
# for each sentence (for sent in labels), a list containing, for each label of the sentence (for label in sent),
# the label converted to an integer (label2int[label]).
# Otherwise we would need a loop similar to the one in the previous exercise.
int_labels = [[label2int[label] for label in sent] for sent in labels]
print(int_labels)
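
The `perf` function provided further down masks out positions whose label is 0, which assumes that 0 is reserved for padding. As a minimal sketch of one way to build such a mapping, the helper below keeps index 0 free for a padding symbol. The name `build_mapping`, the `<pad>` symbol and the toy labels are assumptions made for illustration; the solution in this notebook instead appends `<eos>` at the end of the dictionary.

```python
# Hypothetical padding symbol; the notebook itself appends '<eos>' as the last index
PAD = '<pad>'

def build_mapping(sequences, pad=PAD):
    """Map each distinct item to an integer, keeping 0 for the padding symbol."""
    items = sorted({item for seq in sequences for item in seq})
    mapping = {pad: 0}
    for i, item in enumerate(items, start=1):
        mapping[item] = i
    return mapping

toy_labels = [['B', 'I', 'O'], ['O', 'B', 'I', 'I']]  # invented example
toy_label2int = build_mapping(toy_labels)
toy_int_labels = [[toy_label2int[label] for label in sent] for sent in toy_labels]
print(toy_label2int)   # {'<pad>': 0, 'B': 1, 'I': 2, 'O': 3}
print(toy_int_labels)  # [[1, 2, 3], [3, 1, 2, 2]]
```

With this layout, a tensor initialised with zeros is already filled with the padding index, so no explicit padding step is needed.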

#### Exercise 3 - Convert the tokens to integers

* Similarly, define a token2int dictionary mapping each token in your corpus to an integer, and use this dictionary to convert the texts in the list "texts" (cf. Exercise 1) into lists of integers, each integer representing a token

* **IMPORTANT** make sure to lowercase the tokens as the pre-trained embeddings we'll be using only include lowercased tokens. 

In [None]:
# Almost exactly the same as the previous exercise
all_tokens = set()
for sent in texts:
    for token in sent:
        all_tokens.add(token.lower())  # Make sure to lowercase
        
print(all_tokens)

token2int = dict()
for i, token in enumerate(sorted(all_tokens)):  # Sorting makes the mapping identical on every run
    token2int[token] = i

print(token2int)

int_tokens = [[token2int[token.lower()] for token in sent] for sent in texts]  # Lowercase here as well
print(int_tokens)

#### Exercise 4 - Create the reverse dictionaries (int2label, int2token) to map integer labels and integer tokens back to labels and tokens 

This is useful to be able to inspect results later on.

In [None]:
# "Classic" version
int2label = dict()
for key, value in label2int.items():
    int2label[value] = key  # Invert the dictionary, i.e. swap keys and values
    
int2token = dict()  # Same for the tokens
for key, value in token2int.items():
    int2token[value] = key

    
# Version using a dictionary comprehension, which is more concise
int2label = {value: key for key, value in label2int.items()}
int2token = {value: key for key, value in token2int.items()}


print(int2label)
print(int2token)
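
As a quick illustration of why the reverse dictionaries are handy, the sketch below encodes a sentence and decodes it back. The toy vocabulary is invented and `decode` is a hypothetical helper, not part of the exercise.

```python
toy_token2int = {'a': 0, 'dog': 1, 'runs': 2}  # invented vocabulary
toy_int2token = {i: tok for tok, i in toy_token2int.items()}

def decode(indices, int2item, unknown='<unk>'):
    """Turn a list of integers back into symbols, tolerating indices with no entry."""
    return [int2item.get(i, unknown) for i in indices]

encoded = [toy_token2int[tok] for tok in ['a', 'dog', 'runs']]
decoded = decode(encoded, toy_int2token)
print(encoded)                         # [0, 1, 2]
print(decoded)                         # ['a', 'dog', 'runs']
print(decode([0, 99], toy_int2token))  # ['a', '<unk>']
```

Using `.get` with a fallback avoids a `KeyError` for indices that were added after the reverse dictionary was built, such as a padding index.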

## Creating training and validation data

##### PyTorch import and key constants (PROVIDED)

- `max_len` is the maximum sentence length
- `batch_size` is the batch size
- `embed_size` is the size of the pre-trained embeddings (word vectors). We use fastText pre-trained embeddings of size 300.
- `hidden_size` is the size of the RNN hidden state

In [9]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

max_len = 16
batch_size = 64
embed_size = 300
hidden_size = 128

#### Exercise 5 - Creating tensors

To work with PyTorch, all data must be converted to tensors.

* X and Y are tensors of integers of size (number of sentences, max sentence length) initialised with 0 (this is the value of the padding symbol)
* For each x and y, compute the length and truncate any instance longer than the max sentence length to that length
* We populate the zero tensors with the integer data from the pre-processing exercises

**Hint:** You did this in the previous lab session

In [10]:
# Create a tensor of shape len(texts) × max_len, filled with 0
X = torch.zeros(len(texts), max_len, dtype=torch.long)

# Fill the tensor row by row with our texts converted to integers
for i, int_text in enumerate(int_tokens):  # For each text
    if len(int_text) < max_len:  # If the text is too short
        # Pad it up to max_len with the index len(token2int),
        # which is the index given to the padding token '<eos>' in Exercise 8
        int_text = int_text + [len(token2int)] * (max_len - len(int_text))
    X[i] = torch.LongTensor(int_text[:max_len])  # Fill the corresponding row (truncating to max_len tokens)

# Same for Y
Y = torch.zeros(len(texts), max_len, dtype=torch.long)
for i, int_label in enumerate(int_labels):
    if len(int_label) < max_len:
        int_label = int_label + [len(label2int)] * (max_len - len(int_label))
    Y[i] = torch.LongTensor(int_label[:max_len])

print(X.size())
print(Y.size())

print(X.size())
print(Y.size())


torch.Size([5500, 16])
torch.Size([5500, 16])
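
The truncate-and-pad step can also be wrapped in a small reusable function. The sketch below is one possible way to do it, assuming a hypothetical helper name `pad_to_tensor` and invented toy sequences; the padding value is a parameter, so it works with 0 or with any other padding index.

```python
import torch

def pad_to_tensor(sequences, max_len, pad_value=0):
    """Truncate or right-pad integer sequences into a (len(sequences), max_len) LongTensor."""
    out = torch.full((len(sequences), max_len), pad_value, dtype=torch.long)
    for i, seq in enumerate(sequences):
        seq = seq[:max_len]  # truncate sequences that are too long
        out[i, :len(seq)] = torch.tensor(seq, dtype=torch.long)
    return out

toy = [[5, 6, 7], [8, 9, 10, 11, 12, 13]]  # invented integer sequences
padded = pad_to_tensor(toy, max_len=4)
print(padded)
```

The first row is padded with one extra value and the second is cut to its first four items.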


#### Exercise  6 - Create train and validation data

* Split X into two parts, one called X_train which consists of the first 5000 items and the other called X_valid which includes the rest of the data
* Do the same for Y

In [11]:
X_train = X[:5000]
X_valid = X[5000:]

Y_train = Y[:5000]
Y_valid = Y[5000:]

#### Exercise  7 - Use torch DataLoader to split training and validation data into batches

**Hint:** This was provided in the previous lab sessions

In [12]:
# Copied from lab session (TD) 16

from torch.utils.data import TensorDataset, DataLoader
train_set = TensorDataset(X_train, Y_train)
valid_set = TensorDataset(X_valid, Y_valid)

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size)
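
With the sizes used here (5000 training and 500 validation sentences, batches of 64), the number of batches per epoch can be worked out directly. The arithmetic below assumes the `DataLoader` default `drop_last=False`, so the last, smaller batch is kept.

```python
import math

batch_size = 64
sizes = {'train': 5000, 'valid': 500}  # 5500 sentences split at index 5000

batches = {}
for name, n in sizes.items():
    n_batches = math.ceil(n / batch_size)
    last_batch = n - (n_batches - 1) * batch_size  # size of the final, incomplete batch
    batches[name] = (n_batches, last_batch)
    print(name, n_batches, last_batch)
```

This gives 79 training batches (the last one holding 8 sentences) and 8 validation batches (the last one holding 52).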

## Using pre-trained fastText embeddings

Instead of learning word embeddings with the network, we use pre-trained fastText embeddings. These are available [here](https://fasttext.cc/docs/en/english-vectors.html).

However, these cover several million words and the files are large. Instead, we'll use a smaller version which is restricted to the corpus vocabulary and is available on arche (wiki.en.filtered.vec). Each line in that file contains a token followed by the fastText embedding of that token (300 dimensions).
E.g., auditorium -0.054196 -0.37375 ....

#### Exercise 8
To use the pre-trained embeddings, we do the following:

* first, we create a tensor of size (vocab_size, embedding_size) with embedding_size = 300 and values 0 (use torch.zeros method)

* we read each line of wiki.en.filtered.vec into a list "tokens" whose first element is the word and whose second is the pre-trained fastText embedding

* If the word is in our vocabulary, we set the corresponding row (use your token2int dictionary) of our zero tensor to its fastText embedding.

**N.B.** The elements of a fastText embedding need to be converted to float, so you'll need to do something like

torch.FloatTensor([float(x) for x in FSEmbedding])

when setting the row for that word in the tensor to the fastText embedding

In [23]:
# First step
token2int['<eos>'] = len(token2int)  # Add the padding token to our dictionary
label2int['<eos>'] = len(label2int)
vocab_size = len(token2int)
embeddings = torch.zeros(vocab_size, embed_size)

tokens = []

with open('wiki.en.filtered/wiki.en.filtered.vec') as f:
    for line in f:  # For each line of the file
        fields = line.rstrip().split(' ')  # Split once; rstrip drops the trailing newline
        token, embed = fields[0], fields[1:]  # Read the token and its embedding
        tokens.append((token, embed))  # Add them to the list
        
        if token in token2int:  # If the token is in the vocabulary
            embeddings[token2int[token]] = torch.FloatTensor([float(x) for x in embed])  # Copy it into the tensor

print(embeddings.shape)

torch.Size([4596, 300])
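
The parsing step can be illustrated without the real file. The sketch below is a minimal example on invented lines (only the first two values of the `auditorium` line come from the example above; everything else is made up, and the vectors are cut to 3 dimensions). It also skips any line that does not carry the expected number of values, which guards against the `count dimension` header line found at the top of the full fastText `.vec` files; whether the filtered file has such a header is an assumption to check. `load_vectors` is a hypothetical helper name.

```python
def load_vectors(lines, vocab, dim):
    """Return {token: list of floats} for the tokens of `vocab` found in `lines`."""
    found = {}
    for line in lines:
        fields = line.rstrip().split(' ')
        token, values = fields[0], fields[1:]
        if len(values) != dim:  # e.g. a "count dim" header line
            continue
        if token in vocab:
            found[token] = [float(x) for x in values]
    return found

toy_lines = [
    '2 3\n',                                # invented header: 2 words, 3 dimensions
    'auditorium -0.054196 -0.37375 0.1\n',  # shortened, partly invented vector
    'zebra 0.5 0.25 0.125\n',               # invented word that is not in the vocabulary
]
toy_vocab = {'auditorium': 0, 'stage': 1}
vectors = load_vectors(toy_lines, toy_vocab, dim=3)
print(vectors)
```

Words of the vocabulary that never appear in the file (here `stage`) simply keep their all-zero row in the embedding tensor.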


## Create, train and evaluate your neural network

#### Exercise  9 - Define your neural network (TODO: Provide missing values indicated by ??)

As in the preceding exercise sheet on neural classification, we define our RNN network as a subclass of PyTorch's `nn.Module`.

Our RNN consists of three layers:
* the embedding layer, which maps each token in the input to its fastText embedding
* a GRU layer: the recurrent layer
* a decision layer, which maps each input token to a label

##### Padding
If the input sentence is shorter than the maximum length, the remaining positions are filled with the integer associated with the padding symbol `<eos>`. To exclude padding symbols from the learning process (they are uninformative), include the padding_idx=token2int['<eos>'] option in the definition of the embedding layer and the "bias=False" option in the definition of the GRU layer. With a zero input embedding and no bias, padding positions feed no signal of their own into the GRU, and a hidden state that is zero stays zero.
    
##### Pre-trained Embeddings
To ensure that the pre-trained word embeddings are used:
* set the `weight` attribute of the embedding layer to the pre-trained embeddings
* use `requires_grad=False` to freeze the embedding layer so that the fastText embeddings are not modified during learning.   
If you do not use this option, the embeddings are fine-tuned during training.
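
PyTorch also offers `nn.Embedding.from_pretrained`, which builds the layer and loads the weights in one call. The sketch below uses an invented 4-word vocabulary with 3-dimensional vectors (the values and `pad_idx` are made up for illustration). It is an alternative to setting `weight` by hand, not the approach this exercise asks for.

```python
import torch
import torch.nn as nn

# Invented embedding matrix; the last row stands for the padding symbol
toy_embeddings = torch.tensor([[0.1, 0.2, 0.3],
                               [0.4, 0.5, 0.6],
                               [0.7, 0.8, 0.9],
                               [0.0, 0.0, 0.0]])
pad_idx = 3  # assumed padding index for this toy example

# freeze=True is equivalent to requires_grad=False on the weight
embed = nn.Embedding.from_pretrained(toy_embeddings, freeze=True, padding_idx=pad_idx)

batch = torch.tensor([[0, 2, pad_idx]])  # one sentence of three positions
vectors = embed(batch)
print(vectors.shape)               # (batch, sequence length, embedding size)
print(embed.weight.requires_grad)  # False: the embeddings are frozen
```

Passing `freeze=False` instead would let training fine-tune the vectors.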

In [None]:
# The lines to complete were "self.embed = nn.Embedding(???, ???, padding_idx=token2int['<eos>'])"
# and "output, hidden = self.rnn(???)"
class RNN(nn.Module):
    def __init__(self):
        super().__init__()
        # The first argument of an Embedding layer must be the vocabulary size
        # and the second must be the embedding dimension,
        # so here they are vocab_size and embed_size
        self.embed = nn.Embedding(vocab_size, embed_size, padding_idx=token2int['<eos>'])
        self.embed.weight = nn.Parameter(embeddings, requires_grad=False)
        self.rnn = nn.GRU(embed_size, hidden_size, bias=False, num_layers=1, bidirectional=False, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.decision = nn.Linear(hidden_size * 1 * 1, len(label2int))
        
    def forward(self, x):
        embed = self.embed(x)
        # Run the recurrent layer on the output of the previous layer,
        # i.e. the embeddings computed on the line above
        output, hidden = self.rnn(embed)
        return self.decision(self.dropout(output))

rnn_model = RNN()
rnn_model

#### Define a function to evaluate the performance of the model (PROVIDED)

* `CrossEntropyLoss` takes as input a 2D matrix of scores of shape (batch_size * sequence_length, num_labels)
* The shape of the scores is adjusted accordingly
* References are reshaped to (batch_size * sequence_length)
* The max used to compute predictions is taken over the last dimension of the y_scores tensor
* To ignore padding symbols when computing the score, we create a matrix "mask" which contains 1 for all non-null elements of the Y matrix and 0 otherwise

In [26]:
def perf(model, loader):
    criterion = nn.CrossEntropyLoss()
    model.eval()
    total_loss = correct = num_loss = num_perf = 0
    for x, y in loader:
      with torch.no_grad():
        y_scores = model(x)
        loss = criterion(y_scores.view(y.size(0) * y.size(1), -1), y.view(y.size(0) * y.size(1)))
        y_pred = torch.max(y_scores, 2)[1]
        mask = (y != 0)
        correct += torch.sum((y_pred.data == y) * mask)
        total_loss += loss.item()
        num_loss += len(y)
        num_perf += torch.sum(mask).item()
    return total_loss / num_loss, correct.item() / num_perf

perf(rnn_model, valid_loader)

(0.02226072406768799, 0.4908746556473829)
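
Note that `perf` treats label 0 as padding, whereas the tensors built in this notebook pad Y with the extra index `len(label2int)`. The sketch below shows a padding-aware variant of the loss and accuracy computation. It assumes that the padding label index is 3 (the value `label2int['<eos>']` takes once B, I and O occupy 0 to 2), and the scores are invented.

```python
import torch
import torch.nn as nn

pad_label = 3  # assumed padding label index, e.g. label2int['<eos>'] in this notebook

# Invented scores for 1 sentence of 4 positions and 4 classes
y_scores = torch.tensor([[[2.0, 0.1, 0.1, 0.1],
                          [0.1, 2.0, 0.1, 0.1],
                          [0.1, 2.0, 0.1, 0.1],
                          [0.1, 0.1, 0.1, 2.0]]])
y = torch.tensor([[0, 1, 2, pad_label]])  # the last position is padding

# ignore_index leaves padding positions out of the loss
criterion = nn.CrossEntropyLoss(ignore_index=pad_label)
loss = criterion(y_scores.view(-1, y_scores.size(-1)), y.view(-1))

# The mask leaves padding positions out of the accuracy
y_pred = y_scores.argmax(dim=2)
mask = (y != pad_label)
accuracy = ((y_pred == y) & mask).sum().item() / mask.sum().item()
print(loss.item(), accuracy)
```

Here two of the three real positions are predicted correctly, and the padding position counts towards neither the loss nor the accuracy.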

#### Exercise 10 - Define the training function

In [27]:
# Copied from the previous lab session (TD)

def fit(model, epochs):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters())

    # Training loop
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        num_samples = 0
        
        for x_data, y_data in train_loader:
            x_data = x_data.to(device)
            y_data = y_data.to(device)
            optimizer.zero_grad()
            y_scores = model(x_data)
            loss = criterion(y_scores.transpose(1, 2), y_data)  # Changes made here: class scores moved to dimension 1
            num_samples += len(y_data)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        valid_loss, valid_acc = perf(model, valid_loader)
        print(f'Epoch {epoch + 1}/{epochs} | Train loss: {total_loss/num_samples:.4f} | Valid loss: {valid_loss:.4f} | Acc: {valid_acc:.4%}')
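
The training loop passes `y_scores.transpose(1, 2)` to the loss, while `perf` flattens the scores with `view`. Both are accepted by `CrossEntropyLoss`, which takes either scores of shape (N, C) with targets of shape (N), or scores of shape (N, C, d) with targets of shape (N, d). A small check on random data (sizes invented) that the two forms give the same mean loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, num_labels = 2, 5, 4  # invented sizes
y_scores = torch.randn(batch, seq_len, num_labels)
y = torch.randint(0, num_labels, (batch, seq_len))

criterion = nn.CrossEntropyLoss()

# Form used in perf: flatten to (batch * seq_len, num_labels) and (batch * seq_len)
loss_flat = criterion(y_scores.view(batch * seq_len, -1), y.view(batch * seq_len))

# Form used in fit: move the class dimension to position 1, giving (batch, num_labels, seq_len)
loss_transposed = criterion(y_scores.transpose(1, 2), y)

same = torch.allclose(loss_flat, loss_transposed)
print(same)
```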


In [28]:
device = torch.device('cpu')
fit(rnn_model, 25)

Epoch 1/25 | Train loss: 0.0085 | Valid loss: 0.0024 | Acc: 94.5764%
Epoch 2/25 | Train loss: 0.0016 | Valid loss: 0.0014 | Acc: 97.7445%
Epoch 3/25 | Train loss: 0.0011 | Valid loss: 0.0012 | Acc: 98.0544%
Epoch 4/25 | Train loss: 0.0009 | Valid loss: 0.0011 | Acc: 97.7961%
Epoch 5/25 | Train loss: 0.0007 | Valid loss: 0.0010 | Acc: 98.3471%
Epoch 6/25 | Train loss: 0.0007 | Valid loss: 0.0010 | Acc: 98.0372%
Epoch 7/25 | Train loss: 0.0006 | Valid loss: 0.0010 | Acc: 98.0716%
Epoch 8/25 | Train loss: 0.0006 | Valid loss: 0.0010 | Acc: 98.2782%
Epoch 9/25 | Train loss: 0.0005 | Valid loss: 0.0009 | Acc: 98.2955%
Epoch 10/25 | Train loss: 0.0005 | Valid loss: 0.0010 | Acc: 98.2955%
Epoch 11/25 | Train loss: 0.0005 | Valid loss: 0.0010 | Acc: 98.3988%
Epoch 12/25 | Train loss: 0.0004 | Valid loss: 0.0009 | Acc: 98.5709%
Epoch 13/25 | Train loss: 0.0004 | Valid loss: 0.0009 | Acc: 98.5537%
Epoch 14/25 | Train loss: 0.0004 | Valid loss: 0.0010 | Acc: 98.1921%
Epoch 15/25 | Train loss: 0.0

## Apply your model to a sentence

Accuracy scores can be misleading. We also need to look at the predictions on some example sentences.

#### Exercise 11 - Apply your model to a sentence

We define a `tag_sentence` function which:

* takes as input the learned model and a sentence identifier i
* retrieves from the data tensor X_valid the tensor for the i-th sentence (call it "sentence")
* retrieves from the data tensor Y_valid the tensor of labels for the i-th sentence
* puts the model into evaluation mode
* executes the model on the sentence tensor ("sentence")
* extracts the top predictions (use argmax)
* prints out the list of predicted tags
   - use t.item() to get a value out of a tensor
   - use your dictionary int2label to print out the results

In [137]:
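def tag_sentence(model, i):
    sentence = X_valid[i]
    labels = Y_valid[i]
    model.eval()
    with torch.no_grad():  # No gradients are needed for inference
        y_scores = model(sentence.unsqueeze(0))[0]  # Add a batch dimension, then drop it again
    y_pred = y_scores.argmax(1)  # Pick the class with the highest score for each token
    print('TOKEN'.ljust(10), 'PRED'.ljust(5), 'TRUE')
    print('-'*20)
    for j, pred in enumerate(y_pred):
        print(
              # The padding index has no entry in the reverse dictionaries, hence .get with a default
              int2token.get(sentence[j].item(), '<eos>').ljust(10),
              int2label.get(pred.item(), 'N/A').ljust(5),
              int2label.get(labels[j].item(), 'N/A')
        )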

In [139]:
tag_sentence(rnn_model, 15)

TOKEN      PRED  TRUE
--------------------
some       B     B
people     I     I
stare      O     O
into       O     O
the        B     B
distance   I     I
as         O     O
a          B     B
barber     I     I
gives      O     O
an         B     B
asian      I     I
man        I     I
a          B     B
haircut    I     I
.          O     O
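
Since the goal is to detect noun phrases, it helps to turn the tag sequence back into phrases. The sketch below does this for the predictions printed above; `bio_to_chunks` is a hypothetical helper, and it assumes that an I tag with no open phrase is simply ignored.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into phrases: B opens a phrase, I extends it, anything else closes it."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == 'B':
            if current:  # a new phrase starts right after another one
                chunks.append(' '.join(current))
            current = [token]
        elif tag == 'I' and current:
            current.append(token)
        else:  # O, padding, or a stray I with no open phrase
            if current:
                chunks.append(' '.join(current))
            current = []
    if current:  # flush a phrase that runs to the end of the sentence
        chunks.append(' '.join(current))
    return chunks

# Tokens and predicted tags copied from the output above
tokens = 'some people stare into the distance as a barber gives an asian man a haircut .'.split()
tags = 'B I O O B I O B I O B I I B I O'.split()
chunks = bio_to_chunks(tokens, tags)
print(chunks)
# ['some people', 'the distance', 'a barber', 'an asian man', 'a haircut']
```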

