# Exercise Lecture "15: Neural Sequence Tagging"

In this assignment, we train a model that detects noun phrases (NPs) referring to visual entities, using the **Flickr30k** Entities corpus as training data.

In this corpus, each word is labelled with either **(B)** if that word starts an NP, **(I)** if it occurs within an NP and **(O)** otherwise. There is one word and one label per line. Sentences are separated by blank lines. 
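
To make the labelling scheme concrete, here is a minimal sketch that groups tokens back into noun phrases from their B/I/O labels. The helper name `bio_to_chunks` is hypothetical (it is not part of the exercise); the example sentence is the first sentence of the corpus.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into noun phrases according to their B/I/O tags."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                  # a new NP starts here
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:    # continue the NP that is open
            current.append(token)
        else:                           # "O" (or a stray "I") closes any open NP
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:                         # flush an NP that ends the sentence
        chunks.append(" ".join(current))
    return chunks


tokens = "A group of 5 scuba divers talk on the surface next to a barrier island .".split()
tags = "B I I I I I O O B I O O B I I O".split()
chunks = bio_to_chunks(tokens, tags)
print(chunks)
```

The three phrases recovered here are the spans the tagger is expected to predict for this sentence.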

> Data file:  f30kE-captions-bio.txt 

You will also extend the RNN model defined in the previous Exercise by downloading pretrained English **FastText** embeddings, and selecting the set of pretrained vectors associated with the vocabulary of the _f30kE-captions-bio.txt_ file, to use as the embedding layer weights for the RNN model.

> Pretrained FastText: wiki-news-300d-1M.vec.zip, downloadable [here](https://fasttext.cc/docs/en/english-vectors.html). NOTE: the file is 2.26GB after unzipping.


## Pre-processing

#### Exercise 1 - Creating a list of lists of tokens (one list per sentence) and the corresponding lists of labels

* From the input file, create two lists called "text" and "labels"
    * "text" is a list of lists, each list containing the tokens of a sentence
    * "labels" contains the list of lists of labels for each sentence in "text"

Note: each line in the file is a token and its label. Sentences are separated by an empty line. Your processing should take this into account to recover the full sentence. 

In [1]:
import pandas as pd

# Load the data
with open("f30kE-captions-bio.txt", "r") as f:
    text, labels = [], []

    # Split the data into sentences
    sentences = f.read().split("\n\n")

    for sentence in sentences:
        text_l, labels_l = [], []
        
        # Split the sentence into words
        for word in sentence.split("\n"):
            split_ = word.split(" ")
            if len(split_) > 1:
                token, label = split_[0], split_[1]
                text_l.append(token)
                labels_l.append(label)
            
        text_l.append("\n")
        labels_l.append("EOS")
        
        text.append(text_l)
        labels.append(labels_l)

In [2]:
len(text), len(labels)

(5501, 5501)

In [3]:
text[0], labels[0]

(['A',
  'group',
  'of',
  '5',
  'scuba',
  'divers',
  'talk',
  'on',
  'the',
  'surface',
  'next',
  'to',
  'a',
  'barrier',
  'island',
  '.',
  '\n'],
 ['B',
  'I',
  'I',
  'I',
  'I',
  'I',
  'O',
  'O',
  'B',
  'I',
  'O',
  'O',
  'B',
  'I',
  'I',
  'O',
  'EOS'])

#### Exercise 2 -  Mapping labels to integers and sequence of labels to sequence of integers

* Create a dictionary label2int which maps each label to a distinct integer
* Apply this dictionary to the list of labels extracted in the previous exercise 
* Make sure to include an "\<eos>" token in your vocabulary

**Hint:** We did this in the preceding lab session.

In [4]:
label2int = {"O": 0, "B": 1, "I": 2, "EOS": 3}
int_labels = [[label2int[label] for label in labels_l] for labels_l in labels]

In [5]:
labels[0], int_labels[0]

(['B',
  'I',
  'I',
  'I',
  'I',
  'I',
  'O',
  'O',
  'B',
  'I',
  'O',
  'O',
  'B',
  'I',
  'I',
  'O',
  'EOS'],
 [1, 2, 2, 2, 2, 2, 0, 0, 1, 2, 0, 0, 1, 2, 2, 0, 3])

#### Exercise 3 - Convert the tokens to integers

* Similarly, define a token2int dictionary mapping each token in your corpus to an integer, and use this dictionary to convert the sentences in the list "text" (cf. Exercise 1 above) into lists of integers, each integer representing a token

**IMPORTANT** make sure to lowercase the tokens as the pre-trained embeddings we'll be using only include lowercased tokens. 
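
The steps above can be sketched on a toy corpus as follows. All `toy_*` names and the two sentences are hypothetical. One design choice worth noting: the iteration order of a Python set of strings can change from one interpreter run to the next, so this sketch sorts the vocabulary to keep the token indices reproducible. This is an optional variation, not something the exercise requires.

```python
# Toy stand-in for the list of tokenised sentences built in Exercise 1 (hypothetical data).
toy_text = [["A", "dog", "runs", "."], ["a", "Dog", "sleeps", "."]]

# Lowercase first, then deduplicate; sorting makes the indices reproducible.
toy_vocab = sorted({token.lower() for sentence in toy_text for token in sentence})
toy_token2int = {token: idx for idx, token in enumerate(toy_vocab)}

# Convert every sentence into a list of integers.
toy_int_text = [[toy_token2int[token.lower()] for token in sentence]
                for sentence in toy_text]

print(toy_token2int)
print(toy_int_text)
```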

In [6]:
voc_set = set([word.lower() for text_l in text for word in text_l])
voc_list = list(voc_set)

print("Vocabulary size:", len(voc_set))
print("some words:", voc_list[:100])

Vocabulary size: 4596
some words: ['house', 'roaming', 'snowshoes', 'helps', 'dunks', 'specific', 'sparing', 'shrubs', 'outdated', 'crowded', 'day', 'chooses', 'him', 'competitive', 'violins', 'curve', 'boarding', 'matching', 'swims', 'tournament', 'hiker', 'raincoats', 'what', 'loaf', 'got', 'fashioned', 'crocodile', 'tandem', 'card', 'biden', '&', 'drying', 'zara', 'twirling', 'wheelchair', 'courtyard', 'homeless', 'animal', 'well-lit', 'oceanside', 'instructor', 'dancer', 'attendant', 'scooter', 'sandy', 'flame', 'law', 'almost', 'surrounding', 'aa', 'some', 'strips', 'produce', 'puppet', 'geometrical', 'bean', 'walking', 'chair', 'ties', 'macaroni', 'rail', 'playground', 'rear', 'urban', 'actions', 'white-haired', 'snowbank', 'stance', 'paddling', 'cyclist', 'rodeo', 'attending', 'blackboard', 'alone', 'interesting', 'exchange', 'similarly', 'scrubbing', 'floral', 'lead', 'racetrack', 'commuters', 'dread', 'welds', 'shoulders', 'challenged', 'pitch', 'waffle', 'pizza', 'punch', 'br

In [7]:
tokens2int = {token:idx for idx,token in enumerate(voc_list)}
int_text = [[tokens2int[token.lower()] for token in text_l] for text_l in text]

In [8]:
tokens2int["\n"]

4031

In [9]:
text[0], int_text[0]

(['A',
  'group',
  'of',
  '5',
  'scuba',
  'divers',
  'talk',
  'on',
  'the',
  'surface',
  'next',
  'to',
  'a',
  'barrier',
  'island',
  '.',
  '\n'],
 [1371,
  3511,
  3686,
  2980,
  4486,
  1378,
  222,
  3911,
  2286,
  1341,
  2715,
  4255,
  1371,
  3260,
  2247,
  2213,
  4031])

#### Exercise 4 - Create the reverse dictionaries for labels and tokens

- create 2 dictionaries (int2label, int2token) to map integer labels and integer tokens back to labels and tokens 

This will be useful to be able to inspect results later on.

In [10]:
# create reverse mappings
int2tokens = {idx:token for token,idx in tokens2int.items()}
int2label = {idx:label for label,idx in label2int.items()}

## Creating training and validation data

##### PyTorch import and key constants (PROVIDED)

- `max_len` is the maximum sentence length (set this to the length of the longest sequence in the _f30kE-captions-bio.txt_ corpus)
- `batch_size` is the batch size
- `embed_size` is the size of the pre-trained embeddings (word vectors). We use FastText pre-trained embeddings of size 300.
- `hidden_size` is the size of the RNN hidden state

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

max_len = max([len(l) for l in labels])
print('Max Length set at:', max_len)
batch_size = 64
embed_size = 300
hidden_size = 128

Max Length set at: 17


#### Exercise 5  - Creating tensors

To work with PyTorch, all data must be converted to tensors. Do the following:

* Create **X** and **Y**, two tensors of dtype "long" initialised with 0s. Each should be of the following size: (number of sentences, max sentence length) (cf. CS_pytorch)
* Populate these zero tensors with the integer-encoded data from the previous exercises, i.e.
> For each element of "text" and "labels", compute the length and, if it is < max_length, make sure to pad it with \<eos> to max_length 

**Hint:** You did this in the previous lab session
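
A minimal sketch of padding with the \<eos> index rather than with 0, on hypothetical toy data. It assumes the padding symbol has its own index (`toy_eos_idx` below; in this notebook that would be `tokens2int["\n"]` for X and `label2int["EOS"]` for Y) and starts from a tensor filled with that index using `torch.full`.

```python
import torch

# Hypothetical integer-encoded sentences of different lengths.
toy_int_text = [[5, 2, 7, 1], [4, 1], [6, 3, 1]]
toy_eos_idx = 1                       # assumed index of the <eos> / padding symbol
toy_max_len = max(len(sentence) for sentence in toy_int_text)

# Start from a tensor filled with the padding index instead of zeros.
toy_X = torch.full((len(toy_int_text), toy_max_len), toy_eos_idx, dtype=torch.long)

# Copy each sentence into the left part of its row; the rest stays padded.
for i, sentence in enumerate(toy_int_text):
    toy_X[i, :len(sentence)] = torch.tensor(sentence, dtype=torch.long)

print(toy_X)
```

With this layout, the value found in padded positions is the same index that is later passed as `padding_idx` and used to build the evaluation mask.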

In [12]:
# Create tensors
X = torch.zeros((len(int_text), max_len), dtype=torch.long)
Y = torch.zeros((len(int_text), max_len), dtype=torch.long)

# Populate tensors with data
for i, (int_text_l, int_labels_l) in enumerate(zip(int_text, int_labels)):
    X[i, :len(int_text_l)] = torch.LongTensor(int_text_l)
    Y[i, :len(int_labels_l)] = torch.LongTensor(int_labels_l)

In [13]:
X

tensor([[1371, 3511, 3686,  ..., 2247, 2213, 4031],
        [2731,  209, 1777,  ...,    0,    0,    0],
        [1371, 3715, 3456,  ...,    0,    0,    0],
        ...,
        [1371, 2791, 3217,  ..., 2213, 4031,    0],
        [1371, 2700, 3308,  ..., 2058, 2213, 4031],
        [4031,    0,    0,  ...,    0,    0,    0]])

In [14]:
Y

tensor([[1, 2, 2,  ..., 2, 0, 3],
        [1, 0, 1,  ..., 0, 0, 0],
        [1, 2, 0,  ..., 0, 0, 0],
        ...,
        [1, 2, 2,  ..., 0, 3, 0],
        [1, 2, 0,  ..., 0, 0, 3],
        [3, 0, 0,  ..., 0, 0, 0]])

#### Exercise  6 - Create train and validation data

* Split X into two parts, one called X_train which consists of the first 5000 items and the other called X_valid which includes the rest of the data
* Do the same for Y

In [15]:
X_train = X[:5000]
Y_train = Y[:5000]

X_valid = X[5000:]
Y_valid = Y[5000:]

print(X_train.shape, X_valid.shape)

torch.Size([5000, 17]) torch.Size([501, 17])


#### Exercise  7 - Use torch DataLoader to split training and validation data into batches

**Hint:** This was provided in the previous lab sessions

In [16]:
from torch.utils.data import TensorDataset, DataLoader

# the TensorDataset is a ready to use class to represent your data as list of tensors. 
# Note that input_features and labels must match on the length of the first dimension
train_set = TensorDataset(X_train, Y_train)
valid_set = TensorDataset(X_valid, Y_valid)

# DataLoader batches the data (shuffling it if shuffle=True) and can load it in parallel using multiprocessing workers

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size)

## Using pre-trained Fasttext embeddings

Instead of learning word embeddings using the network, we will use pre-trained FastText embeddings. These are available [here](https://fasttext.cc/docs/en/english-vectors.html).

However, the pre-trained FastText embeddings cover about a million words and the full file is large. Instead, we extract a smaller subset restricted to the corpus vocabulary. Each line in that file contains a token followed by the FastText embedding of that token (300 dimensions). 
> e.g., auditorium -0.054196 -0.37375 ....
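
As an illustration of this file format, the sketch below writes a miniature `.vec` file to a temporary directory and reads it back, keeping only the tokens of a small vocabulary. The file content, the `toy_*` names and the 4-dimensional vectors are hypothetical. The first line of a `.vec` file is a header giving the number of words and the vector dimension, so it is read separately here.

```python
import os
import tempfile

# Hypothetical miniature .vec file: a header line, then one token per line.
toy_vec = "3 4\nthe 0.1 0.2 0.3 0.4\ndog -0.5 0.0 0.5 1.0\ncat 0.9 0.8 0.7 0.6\n"
toy_vocab = {"dog", "cat", "unicorn"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.vec")
    with open(path, "w", encoding="utf-8") as f:
        f.write(toy_vec)

    toy_vectors = {}
    with open(path, "r", encoding="utf-8") as f:
        # header: number of words and vector dimension
        n_words, dim = map(int, f.readline().split())
        for line in f:
            token, *values = line.rstrip().split(" ")
            if token in toy_vocab:          # keep only the corpus vocabulary
                toy_vectors[token] = [float(v) for v in values]

print(n_words, dim, sorted(toy_vectors))
```

Tokens of the vocabulary that have no line in the file ("unicorn" here) are simply absent from the resulting dictionary.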

#### Exercise 8 (PROVIDED)

* read the .vec file holding the FastText embeddings. Handle each line (a word and associated vector)
* extract the set of vectors corresponding to the vocabulary of the _f30kE-captions-bio.txt_ corpus

In [17]:
# Adapted from the FastText site; vocab_set is a set so that membership checks are fast
def load_vectors(fname, vocab_set):
    data = {}
    # the context manager closes the file once all lines have been read
    with open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        for line in fin:
            tokens = line.rstrip().split(' ')
            if tokens[0] not in vocab_set:
                continue
            # convert the vector components from str to float and store them as a list
            # (a bare map object would be exhausted after one pass and break cell reruns)
            data[tokens[0]] = list(map(float, tokens[1:]))
    return data

# using https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
ft_vec_dict = load_vectors('wiki.en.filtered.vec', set(tokens2int))
print(f'{len(ft_vec_dict)}, {len(tokens2int)}')

4595, 4596


#### Exercise 9: Associating pretrained vectors to our vocabulary
To use the pre-trained embeddings, we do the following:

* Firstly, we create a tensor of zeros of size (vocab_size, embedding_size) (use the torch.zeros method). The embedding_size should match the dimensions of the FastText embeddings. 

* Secondly, for each word in the _f30kE-captions-bio.txt_ corpus vocabulary, we look up its FastText embedding in *ft_vec_dict* above.

    * If the word has a FastText vector, we set the row at the corresponding index (use your token2int dictionary) in our zeros tensor to that vector. 
    * **NOTE**: If the word in our vocabulary is **not** available in the set of vectors, we set its row to a vector of random values (e.g. with torch.rand_like() or torch.randn()). 


In [18]:
Tokens2Vec = torch.zeros((len(voc_list), embed_size), dtype=torch.float)

for i, token in enumerate(voc_list):

    # if the token has a fasttext vector, use it
    if token in ft_vec_dict:
        Tokens2Vec[i, :] = torch.FloatTensor(ft_vec_dict[token])

    # else, create a random vector
    else:
        Tokens2Vec[i, :] = torch.randn(embed_size)

In [19]:
Tokens2Vec

tensor([[ 0.3295, -0.3003,  0.1981,  ...,  0.1219,  0.0210, -0.0637],
        [-0.4088,  0.1644, -0.0666,  ...,  0.5124,  0.1331,  0.0500],
        [ 0.3054, -0.2421, -0.4033,  ...,  0.5293,  0.3725,  0.6178],
        ...,
        [ 0.0280,  0.3455, -0.4234,  ...,  0.1321,  0.7562,  0.2801],
        [-0.0907, -0.4747,  0.0026,  ...,  0.4564,  0.1347, -0.0868],
        [-0.1047,  0.1041,  0.0323,  ...,  0.3182, -0.3078,  0.4265]])

## Create, train and evaluate your neural network

#### Exercise  10 - Define your neural network (TODO: Provide missing values indicated by ??)

As in the preceding Exercise sheet on neural classification, we define our RNN network as a subclass of PyTorch's `nn.Module`. 

Our RNN consists of three layers:
* the embedding layer: which maps each token in the input to its FastText embedding
* a GRU layer: the recurrent layer
* a decision layer which maps each input token to a label

##### Padding
If the input sentence is shorter than the maximum length, the remaining positions are filled with 0, the integer associated with the padding symbol. To exclude padding symbols from the learning process (they are uninformative), include the `padding_idx=vocab['<eos>']` option in the definition of the embedding layer and the `bias=False` option in the definition of the GRU layer. Together, a zero padding embedding and the absence of bias keep the GRU hidden state null as long as it has only seen padding symbols. 
    
##### Pre-trained Embeddings
To ensure that the pretrained word embeddings are used:
* set the `weight` attribute of the embedding layer to the pretrained embeddings
* Use `requires_grad=False` to freeze the embedding layer so that the FastText embeddings are not modified during learning.   
If you do not use this option, the embeddings are fine-tuned during training. 
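
The two points above (padding and frozen pretrained embeddings) can be tried out on a toy example. This is a minimal sketch under stated assumptions: the weight matrix, sizes and `toy_*` names are hypothetical, and it uses `nn.Embedding.from_pretrained` as an alternative to assigning the `weight` attribute by hand. The padding row is zeroed explicitly, on the assumption that supplied weights are used as given.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical pretrained matrix: 5 tokens, 4 dimensions; index 0 is the padding symbol.
toy_weights = torch.randn(5, 4)
toy_pad_idx = 0
toy_weights[toy_pad_idx] = 0.0        # zero the padding row explicitly

# freeze=True sets requires_grad=False on the embedding weights.
toy_embed = nn.Embedding.from_pretrained(toy_weights, freeze=True,
                                         padding_idx=toy_pad_idx)
toy_gru = nn.GRU(input_size=4, hidden_size=3, bias=False, batch_first=True)

# A sequence made only of padding symbols: (batch=1, length=6).
only_padding = torch.full((1, 6), toy_pad_idx, dtype=torch.long)
with torch.no_grad():
    output, hidden = toy_gru(toy_embed(only_padding))

print(toy_embed.weight.requires_grad)   # whether the embeddings are trainable
print(output.abs().max().item())        # largest hidden value over the padding sequence
```

With zero inputs, a zero initial state and no bias, the GRU output should stay at zero. Once the GRU has read real tokens, its state is non-zero and keeps evolving over trailing padding, which is one reason the evaluation masks padded positions.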

In [20]:
Tokens2Vec.size()

torch.Size([4596, 300])

In [21]:
num_layers = 1

In [22]:
class RNN(nn.Module):
    def __init__(self, pretrained_weights, hidden_size, num_layers, bidirectional=False,
                 padding_idx=tokens2int['\n']):
        super().__init__()
        # embedding layer sized from the pretrained matrix: (vocab_size, embed_size)
        self.embed = nn.Embedding(pretrained_weights.size(0), pretrained_weights.size(1),
                                  padding_idx=padding_idx)
        # load the pretrained vectors and freeze them
        self.embed.weight = nn.Parameter(pretrained_weights, requires_grad=False)
        embed_size = self.embed.weight.size(-1)

        self.rnn = nn.GRU(embed_size, hidden_size, bias=False, num_layers=num_layers,
                          bidirectional=bidirectional, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        num_directions = 2 if bidirectional else 1
        # the GRU output has hidden_size * num_directions features per token
        self.decision = nn.Linear(hidden_size * num_directions, len(label2int))

    def forward(self, x):
        embed = self.embed(x)
        output, hidden = self.rnn(embed)
        return self.decision(self.dropout(output))

rnn_model = RNN(Tokens2Vec, hidden_size, num_layers=num_layers, padding_idx=tokens2int['\n'])
rnn_model

RNN(
  (embed): Embedding(4596, 300, padding_idx=4031)
  (rnn): GRU(300, 128, bias=False, batch_first=True)
  (dropout): Dropout(p=0.3, inplace=False)
  (decision): Linear(in_features=128, out_features=4, bias=True)
)

#### Exercise 11: Define a function to evaluate the performance of the model (PROVIDED)

* `CrossEntropyLoss` takes as input a 2D matrix of scores of shape (batch_size * sequence_length, num_labels)
* The shape of the scores is adjusted accordingly
* References are reshaped to (batch_size * sequence_length).
* The max used to compute predictions applies to the last dimension of the y_scores tensor
* To ignore padding symbols when computing the score, we create a matrix "mask" which contains 1 for all non-padding elements of the Y matrix and 0 otherwise
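
The masking step can be sketched on a toy batch as follows. The tensors and the name `toy_pad_label` are hypothetical. Since the mask is computed on the label tensor, the sketch assumes that the padding value is a label index (the index used to fill padded positions of Y), not a token index.

```python
import torch

toy_pad_label = 3      # assumed: label index used to fill padded positions of Y

# Hypothetical gold labels and predictions for 2 sentences of length 4.
y_true = torch.tensor([[1, 2, 0, 3],
                       [1, 0, 3, 3]])
y_pred = torch.tensor([[1, 2, 2, 0],
                       [1, 0, 0, 3]])

mask = y_true != toy_pad_label                    # True on real tokens only
correct = ((y_pred == y_true) & mask).sum().item()
total = mask.sum().item()
accuracy = correct / total

print(correct, total, accuracy)
```

Only the five non-padded positions count here, so a correct prediction on a padded position does not inflate the accuracy.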

In [23]:
def perf(model, loader, pad_idx):
    criterion = nn.CrossEntropyLoss()
    model.eval()
    total_loss, correct, num_loss, num_perf = 0, 0, 0, 0
    for x, y in loader:
        
        with torch.no_grad():
            y_scores = model(x)
            # should be of shape (bsz, max_len, num_labels)
            
            # reshape y to a long sequence instead of batches of sequences
            target = y.view(y.size(0) * y.size(1)) # or y.view(1,-1).squeeze()
            scores = y_scores.view(y.size(0) * y.size(1), -1)
            
            loss = criterion(scores, target)
            
            y_pred = torch.max(y_scores, -1)[1] # argmax on last dim (labels)

            mask = (y != pad_idx) # ignore pads in original target
            
            correct += torch.sum((y_pred.data == y) * mask)
            
            total_loss += loss.item()
            num_loss += len(y)
            num_perf += torch.sum(mask).item()
    return total_loss / num_loss, correct.item() / num_perf

pad_idx = tokens2int['\n']
perf(rnn_model, valid_loader, pad_idx)

(0.022857621996226662, 0.12809674768110837)

#### Exercise 12 - Define the training function

In [26]:
def fit(model, epochs):

    # define the loss
    criterion = nn.CrossEntropyLoss()
    # define the optimiser
    optimiser = optim.Adam(model.parameters(), lr=0.001)
    # iterate over epochs    
    for epoch in range(epochs):
        
        # set the model in training mode
        model.train()
        # initialize the total_loss to 0
        total_loss = 0

        # iterate over batches
        for x, y in train_loader:
                
            # reset the gradients
            optimiser.zero_grad()
            # predict the batch scores
            y_scores = model(x)

            # calculate the loss
            loss = criterion(y_scores.transpose(1,2), y)

            # Compute the gradients (backpropagation)
            loss.backward()

            # Update the weights (optimization)
            optimiser.step()

            # Update the batch loss
            total_loss += loss.item()

        print("Epoch: ", epoch, ", Total loss: ", total_loss)
        print(" - Train: ", perf(model, train_loader, pad_idx))
        print(" - Validation: ", perf(model, valid_loader, pad_idx))

In [27]:
# train the model for 10 epochs
fit(rnn_model, 10)

Epoch:  0 , Total loss:  7.406806465238333
 - Train:  (0.0009451055426150561, 0.9806705882352941)
 - Validation:  (0.001241931539333747, 0.9748737818480686)
Epoch:  1 , Total loss:  4.3728285119868815
 - Train:  (0.0007108289811760187, 0.9857764705882353)
 - Validation:  (0.0010171270968314416, 0.9789832100504873)
Epoch:  2 , Total loss:  3.602778360247612
 - Train:  (0.0006021221715956926, 0.9882)
 - Validation:  (0.0009320896572457578, 0.9820359281437125)
Epoch:  3 , Total loss:  3.225078333169222
 - Train:  (0.0005310760729014874, 0.9893176470588235)
 - Validation:  (0.000893633507801863, 0.9825055770811318)
Epoch:  4 , Total loss:  2.86615295894444
 - Train:  (0.0004869469454512, 0.9907294117647059)
 - Validation:  (0.0008609373412446348, 0.9830926382529059)
Epoch:  5 , Total loss:  2.603946018964052
 - Train:  (0.00043190165236592294, 0.9919764705882353)
 - Validation:  (0.0008579653923858902, 0.9834448749559704)
Epoch:  6 , Total loss:  2.344331029802561
 - Train:  (0.00040228365

## Apply your model to a sentence

Accuracy scores might be deceiving. We also need to look at the predictions on some example sentences. 

#### Exercise 13: Apply your model to a sentence

We define a `tag_sentence` function which:

* takes as input the learned model and a sentence identifier i
* retrieves from the data tensor X_valid the tensor for the i-th sentence (call it "sentence")
* retrieves from the data tensor Y_valid the tensor of labels for the i-th sentence 
* puts the model into evaluation mode
* executes the model on the sentence tensor ("sentence")
* extracts the top predictions (use argmax)
* prints out the list of predicted tags 
   - use t.item() to get a value out of a tensor
   - use your dictionary int2label (cf. Exercise 4 above) to print out the results
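
The steps listed above could be sketched as follows. This is one possible shape for the function, not the reference solution. To keep the block runnable on its own, it uses hypothetical stand-ins (`ToyTagger`, `toy_X_valid`, `toy_Y_valid`, `toy_int2label`) and an untrained model, so the predicted tags are arbitrary. In the notebook, the trained `rnn_model`, `X_valid`, `Y_valid` and `int2label` would be used instead.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for int2label, X_valid and Y_valid.
toy_int2label = {0: "O", 1: "B", 2: "I", 3: "EOS"}
toy_X_valid = torch.tensor([[4, 2, 7, 1], [5, 1, 1, 1]])
toy_Y_valid = torch.tensor([[1, 2, 0, 3], [1, 3, 3, 3]])


class ToyTagger(nn.Module):
    """Untrained embedding + GRU + linear tagger, for illustration only."""

    def __init__(self, vocab_size=8, embed_size=6, hidden_size=5, num_labels=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.decision = nn.Linear(hidden_size, num_labels)

    def forward(self, x):
        output, _ = self.rnn(self.embed(x))
        return self.decision(output)


def tag_sentence(model, i, X=toy_X_valid, Y=toy_Y_valid, int2label=toy_int2label):
    sentence = X[i].unsqueeze(0)           # add a batch dimension: (1, max_len)
    gold = Y[i]                            # gold labels of the i-th sentence
    model.eval()                           # evaluation mode (disables dropout)
    with torch.no_grad():
        scores = model(sentence)           # (1, max_len, num_labels)
    predictions = scores.argmax(dim=-1).squeeze(0)
    predicted_tags = [int2label[t.item()] for t in predictions]
    gold_tags = [int2label[t.item()] for t in gold]
    print("predicted:", predicted_tags)
    print("gold:     ", gold_tags)
    return predicted_tags, gold_tags


predicted_tags, gold_tags = tag_sentence(ToyTagger(), 0)
```

Printing the gold tags next to the predicted ones makes it easier to spot where the model splits or merges noun phrases incorrectly.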

In [None]:
### RUN YOUR SOLUTION HERE ###