# Exercise Lecture "15: Neural Sequence Tagging"

In this assignment, we learn a model which can detect noun phrases referring to visual entities given the **Flicker30k** entities corpus as training data.

In this corpus, each word is labelled with either **(B)** if that word starts an NP, **(I)** if it occurs within an NP and **(O)** otherwise. There is one word and one label per line. Sentences are separated by blank lines. 

> Data file:  f30kE-captions-bio.txt 

You will also extend the RNN model defined in the previous Exercise by downloading pretrained English **FastText** embeddings, and selecting the set of pretrained vectors associated with the vocabulary of the _f30kE-captions-bio.txt_ file, to use as the embedding layer weights for the RNN model.

> Pretrained FastText: wiki-news-300d-1M.vec.zip downloadable [here](https://fasttext.cc/docs/en/english-vectors.html) NOTE: this zip file is 2.26GB after unzipping.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Pre-processing

#### Exercise 1 - Creating a list of lists of tokens (one list per sentence) and the corresponding lists of labels

* From the input file, create two lists called "texts" and "labels"
    * "text" is a list of lists, each list containing the tokens of a sentence
    * "labels" contains the list of lists of labels for each sentence in "text"

Note: each line in the file is a token and its label. Sentences are separated by an empty line. Your processing should take this into account to recover the full sentence. 

In [2]:
texts = []
labels = []
src = '/content/drive/MyDrive/NLP Masters/M2/Data Science/Data Science Labs/tmp/f30kE-captions-bio.txt'
with open(src) as f:
  cur_text = [] ##current text
  cur_label = [] ## current label
  for line in f: ## for all lines in the document
    line = line.strip() ## strip the line
    if line: ## if the line is not empty
      text, label = line.split() 
      cur_text.append(text) ## Add the word to current text
      cur_label.append(label) ## Add the label to current label
    if not line.strip(): ## We are at the end of the string.
      texts.append(cur_text) ## We append the information.
      labels.append(cur_label)
      cur_text = []
      cur_label = []

print(f"Texts:\n{texts}")
print(f"Labels:\n{labels}")

Texts:
Labels:
[['B', 'I', 'I', 'I', 'I', 'I', 'O', 'O', 'B', 'I', 'O', 'O', 'B', 'I', 'I', 'O'], ['B', 'O', 'B', 'I', 'O', 'B', 'I', 'I', 'I', 'O', 'B', 'I', 'O'], ['B', 'I', 'O', 'O', 'O', 'B', 'I', 'O'], ['B', 'I', 'I', 'O', 'O', 'B', 'I', 'O', 'B', 'I', 'O', 'O', 'O', 'B', 'O', 'O'], ['B', 'I', 'O', 'O', 'O', 'B', 'I', 'O', 'B', 'I', 'O'], ['B', 'O', 'O', 'B', 'I', 'I', 'I'], ['B', 'I', 'O', 'B', 'I', 'I', 'I', 'O', 'O', 'B', 'I'], ['B', 'I', 'O', 'O', 'B', 'I', 'O', 'O', 'B', 'I', 'O'], ['B', 'I', 'I', 'I', 'O', 'O', 'B', 'I', 'O', 'B', 'I', 'O', 'O', 'B', 'O'], ['B', 'I', 'O', 'O', 'B', 'I', 'O', 'O', 'B', 'I', 'I', 'O', 'B', 'I', 'O'], ['B', 'I', 'O', 'O', 'B', 'I', 'I', 'I', 'O'], ['B', 'I', 'O', 'B', 'I', 'I', 'O', 'O', 'B', 'I', 'O'], ['B', 'I', 'O', 'O', 'B', 'I', 'I', 'O', 'O', 'B', 'I', 'O'], ['B', 'I', 'I', 'O', 'O', 'B', 'O', 'B', 'I', 'O'], ['B', 'I', 'O', 'B', 'I', 'O', 'B', 'I', 'O'], ['B', 'I', 'O', 'O', 'O', 'B', 'I', 'I', 'O'], ['B', 'I', 'I', 'O', 'O', 'O', 'B', '

#### Exercise 2 -  Mapping labels to integers and sequence of labels to sequence of integers

* Create a dictionary label2int which maps each label to a distinct integer
* Apply this dictionary to the list of labels extracted in the previous exercise 
* Make sure to include an "\<eos>" token in your vocabulary

**Hint:** We did this in the preceding lab session.

In [5]:
label2int = dict()

all_labels = set()
for sent in labels:
  for label in sent: 
    all_labels.add(label)

print(all_labels)

for i, label in enumerate(all_labels):
  label2int[label] = i

print(label2int)

int_labels = [[label2int[label] for label in sent] for sent in labels]
print(int_labels)

{'B', 'I', 'O'}
{'B': 0, 'I': 1, 'O': 2}
[[0, 1, 1, 1, 1, 1, 2, 2, 0, 1, 2, 2, 0, 1, 1, 2], [0, 2, 0, 1, 2, 0, 1, 1, 1, 2, 0, 1, 2], [0, 1, 2, 2, 2, 0, 1, 2], [0, 1, 1, 2, 2, 0, 1, 2, 0, 1, 2, 2, 2, 0, 2, 2], [0, 1, 2, 2, 2, 0, 1, 2, 0, 1, 2], [0, 2, 2, 0, 1, 1, 1], [0, 1, 2, 0, 1, 1, 1, 2, 2, 0, 1], [0, 1, 2, 2, 0, 1, 2, 2, 0, 1, 2], [0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 1, 2, 2, 0, 2], [0, 1, 2, 2, 0, 1, 2, 2, 0, 1, 1, 2, 0, 1, 2], [0, 1, 2, 2, 0, 1, 1, 1, 2], [0, 1, 2, 0, 1, 1, 2, 2, 0, 1, 2], [0, 1, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2], [0, 1, 1, 2, 2, 0, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 1, 2, 2, 2, 0, 1, 1, 2], [0, 1, 1, 2, 2, 2, 0, 1, 2, 2, 2, 0, 2], [0, 1, 2, 2, 2, 2, 0, 1, 1, 2], [0, 1, 2, 2, 2, 0, 1, 1, 2], [0, 1, 2, 0, 1, 1, 2, 2, 0, 1, 1, 2, 0, 1, 1, 2], [0, 1, 1, 1, 2, 2, 0, 2, 0, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2, 2, 2, 0, 1, 1, 2], [0, 1, 1, 2, 2, 0, 1, 2, 2, 0, 1, 1, 2], [0, 2, 0, 2, 0, 2, 0, 2], [0, 1, 2, 0, 1, 1, 2, 2, 2, 2, 0, 1, 2], [0, 1, 1, 1, 2, 0, 1, 2, 0, 2, 2], [0,

#### Exercise 3 - Convert the tokens to integers

* Similarly define a token2int dictionary mapping each token in your corpus to an integer and use this dictionary to convert the texts in the list "texts" (cf. Exercise 1 above) into lists of integers, each integer representing a token

**IMPORTANT** make sure to lowercase the tokens as the pre-trained embeddings we'll be using only include lowercased tokens. 

In [6]:
token2int = dict()

all_tokens = set()
for sent in texts:
  for token in sent: 
    all_tokens.add(token.lower())

print(all_tokens)

for i, token in enumerate(all_tokens):
  token2int[token] = i

print(token2int)

int_tokens = [[token2int[token.lower()] for token in sent] for sent in texts]
print(int_tokens)

[[907, 4170, 4052, 2333, 3349, 1894, 1004, 2286, 645, 3211, 2054, 717, 907, 900, 1293, 2363], [4, 3302, 4008, 3481, 1620, 907, 430, 4052, 3532, 2845, 3773, 2265, 2363], [907, 2236, 616, 1620, 1300, 907, 1247, 2363], [907, 3202, 2171, 3548, 980, 180, 319, 1823, 645, 3745, 841, 717, 501, 3714, 2078, 2363], [3054, 3168, 3804, 4265, 2286, 907, 4224, 2378, 645, 2014, 2363], [3342, 182, 980, 998, 2496, 3323, 4253], [907, 4, 2845, 907, 1828, 4223, 1688, 2622, 717, 645, 833], [907, 1770, 3906, 182, 3714, 352, 1620, 1795, 3714, 859, 2363], [907, 4170, 4052, 3342, 4136, 3763, 907, 1258, 2286, 645, 387, 406, 3492, 3164, 2363], [907, 532, 3906, 2286, 645, 3381, 2622, 2939, 907, 4292, 1019, 2845, 645, 387, 2363], [907, 1770, 2508, 3302, 907, 1260, 4052, 800, 2363], [907, 3321, 3302, 907, 1180, 276, 585, 3763, 907, 3843, 2363], [907, 4, 2164, 3429, 907, 1480, 3960, 1823, 644, 907, 45, 2363], [907, 55, 2171, 3906, 2845, 2273, 700, 2496, 2248, 2363], [907, 532, 2568, 907, 836, 2286, 907, 3860, 2363], 

#### Exercise 4 - Create the reverse dictionaries for labels and tokens

- create 2 dictionaries (int2label, int2token) to map integer labels and integer tokens back to labels and tokens 

This will be useful to be able to inspect results later on.

In [7]:
int2label = {value: key for key, value in label2int.items()}
int2token = {value: key for key, value in token2int.items()}

print(int2label)
print(int2token)

{0: 'B', 1: 'I', 2: 'O'}


## Creating training and validation data

##### Pytorch import and key constants (PROVIDED)

- `max_len` is the maximum sentence length (set to this to the size of the longest sequence in the _f30kE-captions-bio.txt_ corpus)
- `batch_size` is the batch size
- `embed_size` is the size of the pre-trained embeddings (word vectors). We use fasttext pre-trained embeddings of size 300.
- `hidden_size` is the size of the RNN hidden state

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

max_len = max([len(l) for l in labels])
print('Max Length set at:', max_len)
batch_size = 64
embed_size = 300
hidden_size = 128

Max Length set at: 16


#### Exercise 5  - Creating tensors

To work with pytorch, all data must be converted to tensors. Do the following

* Create **X** and **Y**, which are tensors of dtype "long" that is initialised with 0s. It should be of the following size: (number of sentences, max sentence length) (cf. CS_pytorch)
* We populate the zeros tensors with the input data from exercise 1.1: i.e.
> For each element of "texts" and labels", compute the length and if they are < max_length, make sure to pad them with \<eos> to max_length 

**Hint:** You did this in the previous lab session

In [9]:
## We create tables of dimension len(text) * max_length and fill with 0
X = torch.zeros(len(texts), max_len, dtype=torch.long)

## We fill the tensor line by line with the texts converted to integer
for i, int_text in enumerate(int_tokens): ## For each text
    if len(int_text) < max_len: ## In case the text is shorter than what needed
        ## Let it get to the good length
        int_text = int_text + [len(token2int)] * (max_len - len(int_text)) 

    ## We fill it.  
    X[i] = torch.LongTensor(int_text[:max_len]) 

## Same process for Y
Y = torch.zeros(len(texts), max_len, dtype=torch.long)
for i, int_label in enumerate(int_labels):
    if len(int_label) < max_len:
        int_label = int_label + [len(label2int)] * (max_len - len(int_label))
    Y[i] = torch.LongTensor(int_label[:max_len])

print(X.size())
print(Y.size())


torch.Size([5500, 16])
torch.Size([5500, 16])


#### Exercise  6 - Create train and validation data

* Split X into two parts, one called X_train which consists of the first 5000 items and the other called X_valid which includes the rest of the data
* Do the same for Y

In [10]:
X_train = X[:5000]
X_valid = X[5000:]

Y_train = Y[:5000]
Y_valid = Y[5000:]

#### Exercise  7 - Use torch DataLoader to split training and validation data into batches

**Hint:** This was provided in the previous lab sessions

In [11]:
from torch.utils.data import TensorDataset, DataLoader
train_set = TensorDataset(X_train, Y_train)
valid_set = TensorDataset(X_valid, Y_valid)

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size)

## Using pre-trained Fasttext embeddings

Instead of learning word embeddings using the network, we will use pre-trained Fasttext embeddings. These are available [here](https://fasttext.cc/docs/en/english-vectors.html).

However the pre-trained Fasttext embeddings cover several millions words and the full file is large. Instead we'll learn to extract a smaller subset that is restricted to the corpus vocabulary. Each line in that file contains a token followed by the Fasttext embedding of that token(300 dimensions). 
> e.g., auditorium -0.054196 -0.37375 ....

#### Exercise 8 (PROVIDED)

* read the .vec file holding the FastText embeddings. Handle each line (a word and associated vector)
* extract the set of vectors corresponding to the vocabulary of the _f30kE-captions-bio.txt_ corpus

In [12]:
# from the FastText site, with amendments to check for vocab_set (set for faster check)
def load_vectors(fname, vocab_set):
    fin = open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        if tokens[0] not in vocab_set: continue
        # map each element (str) to float. 
        # set to list type (else it remain a map generator, and can be spent once called)
        # avoid situation where students rerun cell and throws an error (after generator spent)
        data[tokens[0]] = list(map(float, tokens[1:])) 
    return data

# using https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
src = '/content/drive/MyDrive/NLP Masters/M2/Data Science/Data Science Labs/tmp/wiki.en.filtered.vec'
ft_vec_dict = load_vectors(src, set(token2int))
print(f'{len(ft_vec_dict)}, {len(token2int)}')

4595, 4595


#### Exercise 9: Associating pretrained vectors to our vocabulary
To use the pre-trained embeddings, we do the following:

* Firstly, we create a zeros  tensor of size (vocab_size, embedding_size) with values 0 (use torch.zeros method). The embedding_size should match the dimensions of the FastText embeddings. 

* Secondly, for each word in the _f30kE-captions-bio.txt_ corpus vocabulary, retrieve the FastText embedding from *ft_vec_dict* above.

    * If the word is in our vocabulary we set the corresponding index (use your token2int dictionary) in our zeros tensor to the corresponding fasttext embedding. 
    * **NOTE**: If the word in our vocabulary is **not** available in the set of vectors, we set it to a vector of random values (use torch.rand_like()). 


In [14]:
token2int['<eos>'] = len(token2int) ## Add the empty token to our dictionnary
label2int['<eos>'] = len(label2int)
vocab_size = len(token2int)
embeddings = torch.zeros(vocab_size, embed_size)

tokens = []

with open(src) as f:
    for line in f: 
        token = line.split(' ')[0]  ## We read the token and its embedding
        embed = line.split(' ')[1:]
        tokens.append((token, embed)) ## We add them to the list.
        
        if token in token2int.keys():  ## In the case if the token is in the dict
            embeddings[token2int[token]] = torch.FloatTensor([float(x) for x in embed]) 
        elif token != '':  # add a check for empty string
            embeddings[token2int[token]] = torch.rand_like(embeddings[token2int[token]])

print(embeddings)


tensor([[ 0.2507, -0.1397, -0.2025,  ...,  0.5051,  0.6004,  0.1544],
        [ 0.2136,  0.0621, -0.2979,  ...,  0.4043,  0.3377,  0.2278],
        [-0.0539,  0.3328,  0.0437,  ...,  0.3503,  0.1910,  0.2237],
        ...,
        [-0.0798, -0.2114,  0.0474,  ...,  0.1259,  0.1861,  0.1219],
        [-0.2484, -0.2459, -0.4016,  ...,  0.0757,  0.3146, -0.3994],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])


## Create, train and evaluate your neural network

#### Exercise  10 - Define your neural network (TODO: Provide missing values indicated by ??)

As in the preceding Exercise sheet on neural classification, we define our RNN network as a subclass of pytorch RNN module. 

Our RNN consist of three layers:
* the embedding layer: wich maps each token in the input to its fasttext embedding
* A GRU layer: the recurrent layers
* A decision layer which maps each input token to a label

##### Padding
If the input sentence is shorter than the maximum length, the remaining positions are filled with 0, the integer associated with the padding symbol. To exclude padding symbols  from the learning process (they are uninformative), include the padding_idx=vocab['<eos>'] option in the definition of the embedding layer and the 
"bias=False" option in the definition of the GRU layer. This forces the GRU hidden state to be null for all padding symbols. 
    
##### Pre-trained Embeddings
To ensure that the pretrained word embeddings are used:
* set the `weight` attribute of the embedding layer to the pretrained embeddings
* Use `requires_grad=False` to freeze the embedding layer so that the fasttext embeddings are not modified during learning.   
If you do not use this option the embeddings are fine tuned during training. 

In [20]:
class RNN(nn.Module):
    def __init__(self, pretrained_weights, num_layers, hidden_size , bidirectional = False, 
                 padding_idx = token2int['<eos>']):
        super().__init__()
        self.embed = nn.Embedding(pretrained_weights.size(0), 
                                  pretrained_weights.size(1), 
                                  padding_idx=token2int['<eos>'])
        ## pretrained_weights.size(0) = vocab size
        ## pretrained_weights.size(1) = embeddings size
        self.embed.weight = nn.Parameter(pretrained_weights, requires_grad=False)
        embed_size = self.embed.weight.size(-1)
        
        self.rnn = nn.GRU(embed_size, hidden_size, bias=False, num_layers=1, 
                            bidirectional=bidirectional, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        num_directions = 2 if bidirectional else 1
        self.decision = nn.Linear(hidden_size * num_layers * num_directions, len(label2int))
        
    def forward(self, x):
        embed = self.embed(x)
        output, hidden = self.rnn(embed)
        return self.decision(self.dropout(output))

rnn_model = RNN(embeddings,  num_layers = 1, hidden_size = hidden_size, 
                padding_idx = token2int['<eos>']+1)
rnn_model

RNN(
  (embed): Embedding(4596, 300, padding_idx=4595)
  (rnn): GRU(300, 128, bias=False, batch_first=True)
  (dropout): Dropout(p=0.3, inplace=False)
  (decision): Linear(in_features=128, out_features=4, bias=True)
)

#### Exercise 11: Define a function to evaluate the performance of the model (PROVIDED)

* the CrossEntropyLoss takes as input 2D matrices of shape (batch_size * sequence_length, num_labels)
* Scores shape is adjusted accordingly
* References are reshaped to (batch_size * sequence_length).
* the max used to compute predictions applies to the last dimension of the y_scores tensors
* To ignore padding symbols when computing the score, we create a matrix "mask" which contains 1 for all non nul elements of the Y matrix and O otherwise

In [21]:
def perf(model, loader, pad_idx = token2int['<eos>']):
    criterion = nn.CrossEntropyLoss()
    model.eval()
    total_loss, correct, num_loss, num_perf = 0, 0, 0, 0
    for x, y in loader:
        with torch.no_grad():
            y_scores = model(x) # should be of shape (bsz, max_len, vocab_size)
            
            # reshape y to a long sequence instead of batches of sequences
            target = y.view(y.size(0) * y.size(1)) # or y.view(1,-1).squeeze()
            scores = y_scores.view(y.size(0) * y.size(1), -1)
            
            loss = criterion(scores, target)
            
            y_pred = torch.max(y_scores, -1)[1] # argmax on last dim (vocab)

            mask = (y != pad_idx-1) # ignore pads in original target
            
            correct += torch.sum((y_pred.data == y) * mask)
            
            total_loss += loss.item()
            num_loss += len(y)
            num_perf += torch.sum(mask).item()
    return total_loss / num_loss, correct.item() / num_perf

perf(rnn_model, valid_loader)

(0.021974591732025147, 0.260375)

#### Exercise 12 - Define the training function

In [22]:
## Same code from previous lab

def fit(model, epochs):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters())

    # Training loop
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        num_samples = 0
        
        for x_data, y_data in train_loader:
            optimizer.zero_grad()
            y_scores = model(x_data)
            loss = criterion(y_scores.transpose(1, 2), y_data) # Modifications faites ici
            num_samples += len(y_data)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

In [23]:
fit(rnn_model, 25)

## Apply your model to a sentence

Accuracy scores might be deceiving. We also need to look at the predictions on some example sentences. 

#### Exercise 13: Apply your model to a sentence

We define a `tag_sentence` function which:

* takes a input the learned model and a sentence identifier i
* retrieves from the data tensor X_valid the tensor for the i-th sentence (call it "sentence")
* retrieves from the data tensor Y_valid the tensor of labels for the i-th sentence 
* put the model into evaluation mode
* execute the model on the sentence tensor ("sentence")
* extract the top predictions (use argmax)
* print out the list of predicted tags 
   - use t.item() to get a value out of a tensor
   - use your dictionary int2label (cf. Exercise 4 above) to print out the results

In [None]:
## Takes a input the learned model and a sentence identifier i
def tag_sentence(model, i):
  
    int2label[3] = 'N/A'
    ## retrieves from the data tensor X_valid the tensor for the i-th sentence 
    ## (call it "sentence")
    sentence = X_valid[i]

    ## retrieves from the data tensor Y_valid the tensor of labels 
    ## for the i-th sentence
    labels = Y_valid[i]

    ## put the model into evaluation mode
    model.eval()

    
    y_scores = model(sentence)
    y_pred = y_scores.argmax(1) ## We choose the highest number of classes 
    print('TOKEN'.ljust(10), 'PRED'.ljust(5), 'TRUE')
    print('-'*20)
    for j, pred in enumerate(y_pred):
        print(
              int2token[sentence[j].item()].ljust(10),
              int2label[pred.item()].ljust(5),
              int2label[labels[j].item()]
        )

In [25]:
tag_sentence(rnn_model, 15)

TOKEN      PRED  TRUE
--------------------
some       B     B
people     I     I
stare      O     O
into       O     O
the        B     B
distance   I     I
as         O     O
a          B     B
barber     I     I
gives      O     O
an         B     B
asian      I     I
man        I     I
a          I     B
haircut    I     I
.          O     O
