# NLP Assignment: Extraction of Named Entities
Author: Pierre Nugues

In this assignment, you will create a system to extract named entities from a text. You will use the CoNLL 2003 dataset and you will train your models with PyTorch.

Be aware that, by default, PyTorch recurrent networks expect tensors with the sequence dimension first and the batch dimension second. To have a batch ordering similar to what we saw during the course, you must use the `batch_first=True` argument. See https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html and https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
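
The effect of `batch_first=True` can be checked on random data. The following is a minimal sketch; all the dimensions are arbitrary values chosen for illustration and are not those of the assignment:

```python
import torch
import torch.nn as nn

# Toy dimensions, chosen only for illustration
batch_size, seq_len, emb_dim, hidden_dim = 4, 7, 50, 16

# With batch_first=True, the input is (batch, sequence, features)
x = torch.randn(batch_size, seq_len, emb_dim)

lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim, batch_first=True)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # one hidden vector per token of each sentence
print(h_n.shape)     # final hidden state; its layout is not affected by batch_first
```

The output keeps the batch in the first dimension, which is the layout used by the padded matrices later in the notebook.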

Before you start the assignment, please run the prerequisites notebook. The 100d vectors should give better results than the 50d vectors, but they take longer to train. Start with the 50d vectors. Then, optionally, run the experiments with the 100d vectors if your machine is fast enough.

## Objectives

The objectives of this assignment are to:
* Write a program to recognize named entities in text
* Learn how to manage a text data set
* Apply recurrent neural networks to text with PyTorch
* Know what word embeddings are
* Write a short report of 2 to 3 pages on your experiments. This report is mandatory to pass the assignment.

## Organization and location

You can work alone or collaborate with another student:
* Each group will have to write Python programs to recognize named entities in text.
* You will have to experiment with different architectures, namely RNN and LSTM, and compare the results you obtained.
* Each student will have to write an individual report on these experiments.

## Preliminaries

### Imports
For the vector and matrix operations, use PyTorch only. __Do not use NumPy__.

In [58]:
import matplotlib.pyplot as plt
from tqdm import tqdm
import random
import os


import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import TensorDataset, DataLoader

import conlleval

### Seeds
Making things reproducible

In [59]:
random.seed(1234)
torch.manual_seed(1234)

<torch._C.Generator at 0x13bbe4030>

### Constants

In [60]:
EPOCHS = 20
MOD = 2
LSTM_HIDDEN_DIM = 98 if MOD == 2 else 64
LSTM_LAYERS = 3 if MOD == 2 else 2
DROPOUT = 0.4 if MOD == 2 else 0.3
EMB_LARGE = True # GloVe 50 or 100
FREEZE_EMBS = True
LARGE_MEM = False
LSTM = True # Toggle between LSTM and RNN
RETRAIN = 0

In [61]:
config = {'EPOCHS': EPOCHS,
'LSTM_HIDDEN_DIM': LSTM_HIDDEN_DIM,
'LSTM_LAYERS': LSTM_LAYERS,
'DROPOUT': DROPOUT,
'EMB_LARGE': EMB_LARGE,
'FREEZE_EMBS': FREEZE_EMBS,
'LSTM': LSTM}

### The datasets

You may need to adjust the paths to load the datasets from your machine.

In [62]:
train_file = "./conll2003/train.txt"
val_file = "./conll2003/valid.txt"
test_file = "./conll2003/test.txt"

## Reading the files

You will now convert the dataset into a Python data structure. Read the functions below to load the datasets. They store the corpus in a list of sentences. Each sentence is a list of rows, where each row is a dictionary.

In [63]:
def read_sentences(file):
    """
    Creates a list of sentences from the corpus
    Each sentence is a string
    :param file:
    :return:
    """
    # The context manager closes the file once it has been read
    with open(file) as f:
        text = f.read().strip()
    sentences = text.split('\n\n')
    return sentences

In [64]:
def split_rows(sentences, column_names):
    """
    Creates a list of sentences where each sentence is a list of lines
    Each line is a dictionary of columns
    :param sentences:
    :param column_names:
    :return:
    """
    new_sentences = []
    for sentence in sentences:
        rows = sentence.split('\n')
        sentence = [dict(zip(column_names, row.split())) for row in rows]
        new_sentences.append(sentence)
    return new_sentences

### Loading dictionaries

The CoNLL 2003 files have four columns: the wordform, `form`, its predicted part of speech, `ppos`, the predicted tag denoting the syntactic group, also called the chunk tag, `pchunk`, and finally the named entity tag, `ner`.
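
As a small illustration of this layout, the sketch below turns two rows of a sentence into dictionaries. The two-row string is a toy excerpt written inline, so the snippet does not need the dataset files:

```python
column_names = ['form', 'ppos', 'pchunk', 'ner']

# Two rows in the four-column layout, one token per line
sentence = "EU NNP B-NP B-ORG\nrejects VBZ B-VP O"

# Each row is split on whitespace and paired with the column names
rows = [dict(zip(column_names, row.split())) for row in sentence.split('\n')]
print(rows[0])
print(rows[1]['ner'])
```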

In [65]:
column_names = ['form', 'ppos', 'pchunk', 'ner']

We load the corpus as a list of sentences, where each sentence is a list of dictionaries

In [66]:
train_sentences = read_sentences(train_file)
train_dict = split_rows(train_sentences, column_names)

val_sentences = read_sentences(val_file)
val_dict = split_rows(val_sentences, column_names)

train_dict[1]

[{'form': 'EU', 'ppos': 'NNP', 'pchunk': 'B-NP', 'ner': 'B-ORG'},
 {'form': 'rejects', 'ppos': 'VBZ', 'pchunk': 'B-VP', 'ner': 'O'},
 {'form': 'German', 'ppos': 'JJ', 'pchunk': 'B-NP', 'ner': 'B-MISC'},
 {'form': 'call', 'ppos': 'NN', 'pchunk': 'I-NP', 'ner': 'O'},
 {'form': 'to', 'ppos': 'TO', 'pchunk': 'B-VP', 'ner': 'O'},
 {'form': 'boycott', 'ppos': 'VB', 'pchunk': 'I-VP', 'ner': 'O'},
 {'form': 'British', 'ppos': 'JJ', 'pchunk': 'B-NP', 'ner': 'B-MISC'},
 {'form': 'lamb', 'ppos': 'NN', 'pchunk': 'I-NP', 'ner': 'O'},
 {'form': '.', 'ppos': '.', 'pchunk': 'O', 'ner': 'O'}]

## Embeddings

### Reading the embeddings

Adjust your folders

In [67]:
if EMB_LARGE:
    embedding_file = './glove/glove.6B.100d.txt'
    EMBEDDING_DIM = 100
else:
    embedding_file = './glove/glove.6B.50d.txt'
    EMBEDDING_DIM = 50

Apply the function below, which reads the GloVe embeddings and stores them in a dictionary, where the keys are the words and the values are the embedding vectors.

In [68]:
def read_embeddings(file):
    """
    Return the embeddings in the form of a dictionary
    :param file:
    :return:
    """
    embeddings = {}  # maps each word to its embedding vector
    with open(file, encoding='utf8') as glove:
        for line in glove:  # one word and its vector per line
            values = line.strip().split()  # strip surrounding whitespace, then split on whitespace
            word = values[0]  # the first field is the word
            vector = torch.FloatTensor(list(map(float, values[1:])))  # the remaining fields are the vector
            embeddings[word] = vector
    return embeddings

In [69]:
# We read the embeddings
embeddings_dict = read_embeddings(embedding_file)
embedded_words = sorted(list(embeddings_dict.keys()))

In [70]:
'# words in embedding dictionary: {}'.format(len(embedded_words))

'# words in embedding dictionary: 400000'

### Understanding the embeddings

In [71]:
embedded_words[100000:100010]

['chording',
 'chordoma',
 'chordophones',
 'chords',
 'chore',
 'chorea',
 'chorene',
 'choreograph',
 'choreographed',
 'choreographer']

In [72]:
embeddings_dict['chords'][:20]

tensor([-0.5197,  1.0395,  0.2092,  0.1629,  0.7209,  0.8152, -0.3464, -0.7665,
        -0.4958,  0.2463,  0.4409,  0.3770, -0.1640,  0.2775,  0.1656,  0.4387,
        -1.0887,  0.1266,  0.6692,  0.3578])

#### Embedding Matrix
For the vectors in `embeddings_dict`, create a unique `E` matrix of the embeddings. To keep track of the word index, also create an `emb_word_idx` dictionary that associates each row index with its corresponding word.

To build `E`, you may first store the vectors in a list and then use `torch.stack()` to convert it into a tensor.

In [73]:
# Write your code
E = []
emb_word_idx = {}
# enumerate() supplies the row index, so no manual counter is needed
for index, (word, vector) in enumerate(embeddings_dict.items()):
    E.append(vector)
    emb_word_idx[index] = word
E = torch.stack(E)


In [74]:
emb_word_idx[21359]

'chords'

In [75]:
E[21359][:20]

tensor([-0.5197,  1.0395,  0.2092,  0.1629,  0.7209,  0.8152, -0.3464, -0.7665,
        -0.4958,  0.2463,  0.4409,  0.3770, -0.1640,  0.2775,  0.1656,  0.4387,
        -1.0887,  0.1266,  0.6692,  0.3578])

Normalize the rows so that each row has a norm of 1

In [76]:
# Write your code here
E = F.normalize(E, dim=1) # Normalises over rows in E.

Using a cosine similarity, write a `closest(target_word_embeddings, embeddings, count=10)` function that returns the indices of the `count` rows closest to a given vector `target_word_embeddings`.

Remember that:
$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{||\mathbf{u}|| ||\mathbf{v}||}
$$
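
One consequence is worth spelling out, assuming the rows of `E` have already been normalized as in the previous step. For a unit-norm row $\mathbf{e}_i$, the denominator reduces to $||\mathbf{u}||$, so one matrix-vector product gives all the similarities at once. The symbol $\mathbf{s}$ is introduced here only for this illustration:

$$
\cos(\mathbf{u}, \mathbf{e}_i) = \frac{\mathbf{u} \cdot \mathbf{e}_i}{||\mathbf{u}||}, \quad \text{hence} \quad \mathbf{s} = \frac{E\mathbf{u}}{||\mathbf{u}||}
$$

Since $||\mathbf{u}||$ is the same for every row, ranking the rows by $E\mathbf{u}$ alone yields the same order.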

In [77]:
# Write your code here
def closest(target_word_emb, E, count=10):
    # Cosine similarity between the target vector and every row of E
    similarity = F.cosine_similarity(target_word_emb.unsqueeze(0), E, dim=1)
    # Indices of the `count` most similar rows, most similar first.
    # The target word itself is not excluded, so it ranks first when it is a row of E.
    return torch.topk(similarity, count).indices.tolist()

Using the `closest()` function find the words closest to _table_, _france_, and _sweden_.

In [78]:
embeddings_dict['table'][:20]

tensor([-0.6145,  0.8969,  0.5677,  0.3910, -0.2244,  0.4904,  0.1087,  0.2741,
        -0.2383, -0.5215,  0.7355, -0.3265,  0.5130,  0.3241, -0.4671,  0.6805,
        -0.2550, -0.0405, -0.5442, -1.0548])

In [79]:
closest(embeddings_dict['table'], E, count=10)

[1801, 7221, 241, 2389, 927, 437, 3162, 220, 187, 3216]

In [80]:
list(map(emb_word_idx.get, closest(embeddings_dict['table'], E, count=10)))

['table',
 'tables',
 'place',
 'bottom',
 'room',
 'side',
 'sit',
 'top',
 'here',
 'pool']

## Extracting the ${X}$ and ${Y}$ Lists of Symbols from the Datasets

For each sentence, you will build an input sequence, $\mathbf{x}$, corresponding to the words and an output one, $\mathbf{y}$, corresponding to the NER tags.

Write a `build_sequences(corpus_dict, key_x='form', key_y='ner', tolower=True)` function that, for each sentence, returns the $\mathbf{x}$ and $\mathbf{y}$ lists of symbols consisting of words and NER tags. Set the words in lower case if `tolower` is true.

For the 2nd sentence of the training set, you should have:<br/>
`x = ['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.']`

`y = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']`

In [81]:
# Write your code
def build_sequences(corpus_dict, key_x='form', key_y='ner', tolower=True):
    x = []
    y = []
    for sentence in corpus_dict:
        words = [row[key_x] for row in sentence]
        if tolower:
            words = [word.lower() for word in words]
        x.append(words)
        y.append([row[key_y] for row in sentence])
    return x, y

In [82]:
X_train_symbs, Y_train_symbs = build_sequences(train_dict, key_x='form', key_y='ner')
X_val_symbs, Y_val_symbs = build_sequences(val_dict, key_x='form', key_y='ner')

In [83]:
X_train_symbs[1]

['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.']

In [84]:
Y_train_symbs[1]

['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']

## Vocabulary

Create a vocabulary of all the words observed in the training set as well as in GloVe. You should find 402,595 different words. You will proceed in two steps.

First, extract the list of unique words `words` from the CoNLL training set and the list of NER tags, `ner`. You will sort them.

In [85]:
# Write your code: List of words and tags in CoNLL

# The sets remove the duplicates; sorting gives a reproducible order
words = sorted({word for sentence in X_train_symbs for word in sentence})
tags = sorted({tag for sentence in Y_train_symbs for tag in sentence})

In [86]:
print('# words seen in training corpus:', len(words))
print('# NER tags seen:', len(tags))

# words seen in training corpus: 21010
# NER tags seen: 9


In [87]:
words[4000:4010]

['adequate',
 'adige',
 'adj',
 'adjourned',
 'adjust',
 'adjusted',
 'adjusting',
 'adjustments',
 'adkins',
 'administer']

In [88]:
tags[:10]

['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']

Then, merge the list of unique CoNLL words with the words in the embeddings file. You will sort this list.

In [89]:
# Write your code: Add vocabulary of embedded words
# The set union removes the duplicates and leaves `embedded_words` unmodified
vocabulary_words = sorted(set(embedded_words) | set(words))

In [90]:
print('# words in the vocabulary: embeddings and corpus:', len(vocabulary_words))

# words in the vocabulary: embeddings and corpus: 402595


In [91]:
vocabulary_words[200000:200010]

['jmurray',
 'jmw',
 'jmy',
 'jn',
 'jn-4',
 'jna',
 'jnana',
 'jnanpith',
 'jnc',
 'jne']

## Index

Create the indices `word2idx`, `tag2idx` and inverted indices `idx2word`, `idx2tag` for the words and the tags: i.e. you will associate each word with a number. You will use index 0 for the padding symbol and 1 for unknown words. This means that your first word will start at index 2. For the tags, you will start at index 1.
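
The indexing convention can be tried on a toy vocabulary first. The sketch below uses a three-word vocabulary and two tags, all made up for illustration:

```python
# Toy vocabulary and tag set, for illustration only
vocabulary_words = ['boycott', 'eu', 'lamb']
tags = ['B-ORG', 'O']

# Word indices start at 2: 0 is reserved for padding and 1 for unknown words
word2idx = {word: idx for idx, word in enumerate(vocabulary_words, start=2)}
# Tag indices start at 1: 0 is reserved for padding
tag2idx = {tag: idx for idx, tag in enumerate(tags, start=1)}

# Inverted indices
idx2word = {idx: word for word, idx in word2idx.items()}
idx2tag = {idx: tag for tag, idx in tag2idx.items()}

# A word outside the vocabulary falls back to index 1
sentence = ['eu', 'boycott', 'zzz']
print([word2idx.get(word, 1) for word in sentence])
```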

In [92]:
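# Write your code:
# Caution: "0" is also an ordinary token of the vocabulary, so the assignment below
# overrides its index with 0; a dedicated padding symbol would avoid this collision
padding_symbol = "0"
word2idx = {word: index for index, word in enumerate(vocabulary_words, start=2)}
word2idx[padding_symbol] = 0

tag2idx = {tag: index for index, tag in enumerate(tags, start=1)}
idx2word = {index: word for word, index in word2idx.items()}
# Index 1, reserved for unknown words, has no entry in idx2word
idx2tag = {index: tag for tag, index in tag2idx.items()}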

The word indices

In [93]:
print(list(word2idx.items())[:25])

[('!', 2), ('!!', 3), ('!!!', 4), ('!!!!', 5), ('!!!!!', 6), ('!?', 7), ('!?!', 8), ('"', 9), ('#', 10), ('##', 11), ('###', 12), ('#a', 13), ('#aabccc', 14), ('#b', 15), ('#c', 16), ('#cc', 17), ('#ccc', 18), ('#cccccc', 19), ('#ccccff', 20), ('#d', 21), ('#daa', 22), ('#dcdcdc', 23), ('#e', 24), ('#f', 25), ('#faf', 26)]


The tag indices

In [94]:
print(tag2idx)

{'B-LOC': 1, 'B-MISC': 2, 'B-ORG': 3, 'B-PER': 4, 'I-LOC': 5, 'I-MISC': 6, 'I-ORG': 7, 'I-PER': 8, 'O': 9}


## Embedding Matrix

Create a PyTorch tensor of dimensions $(M, N)$, where $M$ is the size of the vocabulary, i.e. the unique words in the training set and the words in GloVe plus the two special symbols, and $N$ is the dimension of the embeddings.
The padding symbol and the unknown word symbol will be part of the vocabulary at index 0 and 1, respectively.

Initialize the matrix with random values using `torch.rand()`.

In [95]:
# We add two rows for the padding symbol at index 0 and unknown words at index 1
embedding_matrix = torch.rand((len(vocabulary_words) + 2, EMBEDDING_DIM))/10 - 0.05 # range: [-0.05, 0.05)
# embedding_matrix = torch.rand((len(vocabulary_words) + 2, EMBEDDING_DIM))
# embedding_matrix = torch.zeros((len(vocabulary_words) + 2, EMBEDDING_DIM))

The shape of your matrix is: (402597, 100) or (402597, 50)

In [96]:
embedding_matrix.shape

torch.Size([402597, 100])

Fill the matrix with the GloVe embeddings when available. This means: Replace the random vector with an embedding when available. You will use the indices from the previous section. You will call `out_of_embeddings` the list of words in CoNLL, but not in the embedding list.
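
The filling step can be sketched on a tiny example. The names `pretrained`, `matrix`, and `missing`, as well as the vectors, are hypothetical stand-ins for the GloVe dictionary, the embedding matrix, and `out_of_embeddings`:

```python
import torch

torch.manual_seed(0)
emb_dim = 3

# Hypothetical pretrained vectors, standing in for GloVe
pretrained = {'eu': torch.tensor([1.0, 2.0, 3.0])}
# Toy index: rows 0 and 1 are reserved for padding and unknown words
word2idx = {'eu': 2, 'zzz': 3}

# Small random values in [-0.05, 0.05)
matrix = torch.rand((len(word2idx) + 2, emb_dim)) / 10 - 0.05

missing = []
for word, idx in word2idx.items():
    if word in pretrained:
        matrix[idx] = pretrained[word]  # overwrite the random row
    else:
        missing.append(word)            # the row keeps its random values

print(matrix[2])
print(missing)
```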

In [97]:
# Write your code
out_of_embeddings = []

for word in vocabulary_words:
    index = word2idx[word] # row of the word in the embedding matrix
    try:
        embedding_matrix[index] = embeddings_dict[word] # replace the random vector with the GloVe vector
    except KeyError:
        out_of_embeddings.append(word) # the word is not in embeddings_dict
out_of_embeddings = sorted(set(out_of_embeddings))

In [98]:
len(out_of_embeddings)

2595

In [99]:
out_of_embeddings[-10:]

['zelezarny',
 'zhilan',
 'zieger',
 'zighayer',
 'zilinskiene',
 'zirka-nibas',
 'zuleeg',
 'zundra',
 'zwingmann',
 'zyrecha']

Embeddings of the padding symbol, idx 0, random numbers

In [100]:
embedding_matrix[0][:10]

tensor([-0.6149,  0.9273,  0.5583,  0.0057, -0.6717,  0.6119,  0.9923,  0.2764,
        -0.6489, -0.5167])

Embeddings of the word _table_, the GloVe values

In [101]:
embedding_matrix[word2idx['table']][:10]

tensor([-0.6145,  0.8969,  0.5677,  0.3910, -0.2244,  0.4904,  0.1087,  0.2741,
        -0.2383, -0.5215])

Embeddings of _zwingmann_, a word in CoNLL 2003, but not in GloVe, random numbers

In [102]:
embedding_matrix[word2idx['zwingmann']][:10]

tensor([-0.0150,  0.0476,  0.0197, -0.0334, -0.0267,  0.0237,  0.0041, -0.0454,
         0.0163,  0.0111])

## Creating the ${X}$ and ${Y}$ Sequences

You will now create the input and output sequences with numerical indices. First, convert the
${X}_\text{train\_symbs}$ and ${Y}_\text{train\_symbs}$
lists of symbols into lists of numbers using the indices you created. Call them `X_train_idx` and `Y_train_idx`.

In [103]:
# Write your code
# We create the parallel sequences of indexes
def mapword(x):
    # Unknown words are mapped to index 1
    return word2idx.get(x, 1)

def maptag(x):
    # Unknown tags are mapped to the padding index 0
    return tag2idx.get(x, 0)

X_train_idx = [list(map(mapword, sentence)) for sentence in X_train_symbs]
Y_train_idx = [list(map(maptag, sentence)) for sentence in Y_train_symbs]

Do the same for the validation set. Be aware that some words may be unknown.

In [104]:
# Write your code
# We create the parallel sequences of indexes
X_val_idx = [list(map(mapword, sentence)) for sentence in X_val_symbs]
Y_val_idx = [list(map(maptag, sentence)) for sentence in Y_val_symbs]


Word indices of the first three sentences

In [105]:
print(X_train_idx[:3])
print(X_val_idx[:3])

[[935], [142143, 307143, 161836, 91321, 363368, 83766, 85852, 218260, 936], [284434, 79019]]
[[935], [113351, 679, 221875, 354360, 275584, 63471, 364505, 49150, 192163, 381011, 936], [227217, 15431]]


NER tag indices of the first three sentences

In [106]:
print(Y_train_idx[:3])
print(Y_val_idx[:3])

[[9], [3, 9, 2, 9, 9, 9, 2, 9, 9], [4, 8]]
[[9], [9, 9, 3, 9, 9, 9, 9, 9, 9, 9, 9], [1, 9]]


Now, pad the sentences using the `pad_sequence` function. After padding, the second sentence should look like this (the indices are not necessarily the same):


```
x = tensor([142143, 307143, 161836,  91321, 363368,  83766,  85852, 218260,    936,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0])
y = tensor([3, 9, 2, 9, 9, 9, 2, 9, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

You will call the results `X_train_padded` and `Y_train_padded`. Do the same for the validation set.
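
Before padding the full corpus, the behavior of `pad_sequence` can be checked on a few short sequences. This is a minimal sketch with made-up indices:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy index sequences of different lengths
seqs = [torch.LongTensor([5, 6, 7]),
        torch.LongTensor([8]),
        torch.LongTensor([9, 10])]

# The padding value defaults to 0, which matches the padding index used here
padded = pad_sequence(seqs, batch_first=True)
print(padded)
print(padded.shape)
```

Every row is extended with zeros to the length of the longest sequence, and the batch is the first dimension.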

In [107]:
X_train_idx = list(map(torch.LongTensor, X_train_idx))
Y_train_idx = list(map(torch.LongTensor, Y_train_idx))

X_val_idx = list(map(torch.LongTensor, X_val_idx))
Y_val_idx = list(map(torch.LongTensor, Y_val_idx))

In [108]:
# Write your code here
X_train_padded = pad_sequence(X_train_idx, batch_first=True)
Y_train_padded = pad_sequence(Y_train_idx, batch_first=True)

X_val_padded = pad_sequence(X_val_idx, batch_first=True)
Y_val_padded = pad_sequence(Y_val_idx, batch_first=True)

In [109]:
X_train_padded[1]

tensor([142143, 307143, 161836,  91321, 363368,  83766,  85852, 218260,    936,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0,      0,      0,      0,      0,
             0,      0,      0,      0,      0])

In [110]:
Y_train_padded[1]

tensor([3, 9, 2, 9, 9, 9, 2, 9, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## Network Architecture

Create your network consisting of one embedding layer, a simple recurrent neural network, either RNN or LSTM, and a linear layer. You will initialize the embedding layer with `embedding_matrix` using `from_pretrained()`. Use 128 as the number of RNN/LSTM units. You may try other configurations afterwards.

In [111]:
class RNNModel(nn.Module):

    def __init__(self, embedding_matrix, rnn_units, nbr_classes, freeze_embs=True, num_layers=1, bidi_lstm=True):
        super().__init__()
        self.emb_layer = nn.Embedding.from_pretrained(embedding_matrix, freeze=freeze_embs)
        self.dropout_layer = nn.Dropout(p=DROPOUT, inplace=False)
        self.rnn_layer = nn.RNN(input_size=embedding_matrix.size(1), dropout=DROPOUT, hidden_size=rnn_units, batch_first=True, num_layers=num_layers, bidirectional=bidi_lstm)
        self.dropout_layer_2 = nn.Dropout(p=DROPOUT, inplace=False)
        rnn_units = rnn_units if not bidi_lstm else rnn_units * 2
        self.linear = nn.Linear(rnn_units, nbr_classes)
        # self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.emb_layer(x)
        x = self.dropout_layer(x)
        x, _ = self.rnn_layer(x)
        x = F.relu(x)
        x = self.dropout_layer_2(x)
        x = self.linear(x)
        # output = self.softmax(x)
        output = x
        return output

In [112]:
# Write your code
class Model(nn.Module):

    def __init__(self, embedding_matrix, lstm_units, nbr_classes, freeze_embs=True, num_layers=1, bidi_lstm=False):
        super().__init__()
        self.emb_layer = nn.Embedding.from_pretrained(embedding_matrix, freeze=freeze_embs, padding_idx=0)
        self.dropout_layer = nn.Dropout(p=DROPOUT, inplace=False)
        self.lstm_layer = nn.LSTM(input_size=embedding_matrix.size(1), dropout=DROPOUT, hidden_size=lstm_units, batch_first=True, num_layers=num_layers, bidirectional=bidi_lstm)

        l = lstm_units if not bidi_lstm else lstm_units * 2

        self.linear = nn.Linear(l, nbr_classes)
        # self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.emb_layer(x)
        x = self.dropout_layer(x)
        x, _ = self.lstm_layer(x)
        x = F.leaky_relu(x)
        #x = F.relu(x)
        x = self.dropout_layer(x)
        x = self.linear(x)

        #output = self.softmax(x)

        return x

Create your model

In [113]:
if LSTM:
    model = Model(embedding_matrix,
                LSTM_HIDDEN_DIM,
                len(tags) + 1,
                freeze_embs=FREEZE_EMBS,
                num_layers=LSTM_LAYERS,
                bidi_lstm=True)
else:
    model = RNNModel(embedding_matrix,
                LSTM_HIDDEN_DIM,
                len(tags) + 1,
                freeze_embs=FREEZE_EMBS,
                num_layers=LSTM_LAYERS,)

In [114]:
if not RETRAIN:
    if LSTM:
        state_dict = torch.load(f"./outputs/LSTM/LSTM_model_{MOD}.pth", map_location=torch.device('cpu'))
    else:
        state_dict = torch.load(f"./outputs/RNN/RNN_model_{MOD}.pth", map_location=torch.device('cpu'))
    model.load_state_dict(state_dict)
model.eval()

Model(
  (emb_layer): Embedding(402597, 100, padding_idx=0)
  (dropout_layer): Dropout(p=0.4, inplace=False)
  (lstm_layer): LSTM(100, 98, num_layers=3, batch_first=True, dropout=0.4, bidirectional=True)
  (linear): Linear(in_features=196, out_features=10, bias=True)
)

In [115]:
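# Select the fastest available device: CUDA GPU, then Apple MPS, then CPU
if torch.cuda.is_available():
    device = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
model.to(device)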

Model(
  (emb_layer): Embedding(402597, 100, padding_idx=0)
  (dropout_layer): Dropout(p=0.4, inplace=False)
  (lstm_layer): LSTM(100, 98, num_layers=3, batch_first=True, dropout=0.4, bidirectional=True)
  (linear): Linear(in_features=196, out_features=10, bias=True)
)

Write the loss `loss_fn` and optimizer `optimizer`.

Note that to compute the loss, you need to discard the padding symbols from the results. To do this, specify their index with the `ignore_index` argument:
https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
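
The effect of `ignore_index` can be verified on a toy example. In the sketch below, the scores and targets are made up, and class 0 plays the role of the padding tag:

```python
import torch
import torch.nn as nn

# Toy scores for 4 positions and 3 classes; class 0 is the padding tag
logits = torch.tensor([[0.1, 2.0, 0.3],
                       [0.2, 0.1, 1.5],
                       [3.0, 0.0, 0.0],
                       [3.0, 0.0, 0.0]])
target = torch.tensor([1, 2, 0, 0])  # the last two positions are padding

# Loss over all the positions, with the padding ones ignored
loss_ignored = nn.CrossEntropyLoss(ignore_index=0)(logits, target)
# Loss over the two real positions only
loss_real = nn.CrossEntropyLoss()(logits[:2], target[:2])

print(loss_ignored.item(), loss_real.item())
```

Both values agree: the padded positions contribute neither to the sum nor to the averaging denominator.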

In [None]:
# Write your code
loss_fn = nn.CrossEntropyLoss(ignore_index=0)    # cross entropy loss
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
loss_fn.to(device)

## Data Loaders

In [None]:
X_train = torch.LongTensor(X_train_padded).to(device)
Y_train = torch.LongTensor(Y_train_padded).to(device)

X_val = torch.LongTensor(X_val_padded).to(device)
Y_val = torch.LongTensor(Y_val_padded).to(device)

In [None]:
dataset = TensorDataset(X_train, Y_train)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

## A Few Experiments

### Flattening the tensors

In [None]:
Y_train.size()

In [None]:
Y_train.view(-1)

In [None]:
Y_train.view(-1).size()

### Applying the Model

We apply the model to the whole training set. You can do it in one shot with the statements below. This can use up all your memory. Do not do it if you do not have a lot of memory.

In [None]:
if RETRAIN:
    if LARGE_MEM:
        with torch.no_grad():
            Y_train_pred = model(X_train)

It is preferable to use smaller batches instead. This is less legible but safer.

In [None]:
def batch_inference(model, X, batchsize=2048): # Original batch size 2048
    with torch.no_grad():
        partial = []
        for i in range(0, X.shape[0], batchsize):
            partial.append(model(X[i:i+batchsize].to(device)))

    return torch.vstack(partial)

In [None]:
if RETRAIN:
    if not LARGE_MEM:
        Y_train_pred = batch_inference(model, X_train)

In [None]:
if RETRAIN:
    Y_train_pred.size()

In [None]:
if RETRAIN:
    Y_train_pred.view(-1, Y_train_pred.size()[-1]).size()

## Training the Model

We create a dictionary to store the accuracy and the loss. You will compute them in the training loop. You should exclude the padding symbols from your counts. To do this, use a multiplicative mask with the terms `Y_train > 0` or `Y_val > 0`. This is not critical though, as you will evaluate the final results with another script.
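
The masking can be sketched on two short padded sequences. The tag indices and predictions below are made up for illustration:

```python
import torch

# Gold tags, padded with 0
Y = torch.tensor([[3, 9, 2, 0, 0],
                  [4, 8, 0, 0, 0]])
# Predicted tags; the predictions at padded positions are irrelevant
Y_pred = torch.tensor([[3, 9, 9, 0, 0],
                       [4, 8, 5, 1, 0]])

mask = Y > 0                                      # True for the real tokens only
correct = torch.sum((Y_pred == Y) & mask).item()  # correct predictions on real tokens
total = torch.sum(mask).item()                    # number of real tokens
accuracy = correct / total
print(correct, total, accuracy)
```

Without the mask, the padded positions where prediction and target are both 0 would count as correct and distort the accuracy.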

In [None]:
history = {}
history['accuracy'] = []
history['loss'] = []
history['val_accuracy'] = []
history['val_loss'] = []

We fit the model

In [None]:
# Write your code
if RETRAIN:
    for epoch in range(EPOCHS):
        train_loss = 0.0
        train_correct = 0
        token_cnt = 0
        batch_cnt = 0
        model.train()

        # TRAINING
        loop = tqdm(dataloader, desc=f'Epoch {epoch+1}/{EPOCHS}', leave=True)
        for data, target in loop:
            data = data.to(device)
            target = target.to(device)

            output = model(data)

            loss = loss_fn(output.view(-1, output.shape[-1]), target.view(-1))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            batch_cnt += 1

            # Running accuracy over the real tokens only: the mask discards the padding positions
            mask = target > 0
            predicted = torch.argmax(output, dim=-1)
            train_correct += torch.sum((predicted == target) & mask).item()
            token_cnt += torch.sum(mask).item()
            loop.set_postfix(loss=train_loss/batch_cnt, accuracy=train_correct/token_cnt)

        # EVAL
        model.eval()
        with torch.no_grad():
            # Accuracy on the training set, padding excluded
            acc = torch.sum((torch.argmax(batch_inference(model, X_train), dim=-1) == Y_train) & (Y_train > 0))
            history['accuracy'] += [acc.item()/torch.sum(Y_train > 0).item()]
            history['loss'] += [train_loss/batch_cnt]

            # One forward pass on the validation set serves both the loss and the accuracy
            y_val_pred = model(X_val)
            loss = loss_fn(y_val_pred.view(-1, y_val_pred.shape[-1]), Y_val.view(-1))
            history['val_loss'] += [loss.item()]
            acc = torch.sum((torch.argmax(y_val_pred, dim=-1) == Y_val) & (Y_val > 0))
            history['val_accuracy'] += [acc.item()/torch.sum(Y_val > 0).item()]
        torch.cuda.empty_cache()


And we visualize the training curves. We compare them with a validation set.

In [None]:
if RETRAIN:
    acc = [value.cpu() if isinstance(value, torch.Tensor) else value for value in history['accuracy']]
    loss = [value.cpu() if isinstance(value, torch.Tensor) else value for value in history['loss']]
    val_acc = [value.cpu() if isinstance(value, torch.Tensor) else value for value in history['val_accuracy']]
    val_loss = [value.cpu() if isinstance(value, torch.Tensor) else value for value in history['val_loss']]

    print(len(acc))
    print(len(val_acc))

    epochs = range(1, len(acc) + 1)
    plt.plot(epochs, acc, 'bo', label='Training accuracy')
    plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
    plt.title('Training and validation accuracies')
    plt.legend()

    plt.figure()
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation losses')
    plt.legend()

    plt.show()

In [None]:
# if not RETRAIN:
#     state_dict = torch.load("./outputs/LSTM/LSTM_model_2.pth", map_location=torch.device('cpu'))
#     model.load_state_dict(state_dict)
# model.eval()

We try the model on a test sentence

In [None]:
sentence = 'The United States might collapsez .'.lower().split()

Convert the sentence words to indices

In [None]:
# Write your code
# The indexes or the unknown word idx
sentence_word_idxs = [mapword(word) for word in sentence]
print(sentence_word_idxs)

The indices. Note the 1 for the unknown word _collapsez_.

In [None]:
print('Sentence', sentence)
print('Sentence word indexes', sentence_word_idxs)

Predict the tags. Call the variable `sent_tag_predictions`

In [None]:
# Write your code
sentence_word_idxs = torch.tensor(sentence_word_idxs, dtype=torch.long).to(device)
sent_tag_predictions = model(sentence_word_idxs)

In [None]:
sent_tag_predictions.shape

The estimated tag probabilities for the first word

In [None]:
F.softmax(sent_tag_predictions[0], dim=-1)

In [None]:
torch.argmax(F.softmax(sent_tag_predictions, dim=-1), dim=-1)

We apply argmax to select the tag

In [None]:
for word_nbr, tag_predictions in enumerate(sent_tag_predictions):
    if int(sentence_word_idxs[word_nbr]) in idx2word:
        print(idx2word[int(sentence_word_idxs[word_nbr])], end=': ')
    else:
        print(sentence[word_nbr], '/ukn', end=': ')
    print(idx2tag.get(int(torch.argmax(F.softmax(tag_predictions, dim=-1), dim=-1))))
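
The softmax function preserves the ordering of the scores, so applying argmax to the raw model outputs should select the same tag as applying it to the probabilities. The sketch below checks this on toy logits. The tensor values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Toy scores for 2 words and 3 tags
toy_logits = torch.tensor([[2.0, -1.0, 0.5],
                           [0.1, 0.2, 3.0]])
toy_probs = F.softmax(toy_logits, dim=-1)

tags_from_logits = torch.argmax(toy_logits, dim=-1)
tags_from_probs = torch.argmax(toy_probs, dim=-1)
print(tags_from_logits, tags_from_probs)
```

The softmax is still useful when you want to inspect the probabilities themselves, as in the cells above.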

## Evaluating the Model

In [None]:
test_sentences = read_sentences(test_file)
test_dict = split_rows(test_sentences, column_names)
test_dict[1:2]

We create the ${X}$ and ${Y}$ sequences of symbols

In [None]:
X_test_symbs, Y_test_symbs = build_sequences(test_dict, key_x='form', key_y='ner')
print('X_test:', X_test_symbs[1])
print('Y_test', Y_test_symbs[1])

Convert the ${X}$ symbol sequence into an index sequence and pad it. Call the results `X_test_idx` and `X_test_padded`.

In [None]:
# Write your code
X_test_idx = []
for x in X_test_symbs:
    # We map the unknown words to index 1
    x_idx = list(map(lambda a: word2idx.get(a, 1), x))
    X_test_idx += [x_idx]

In [None]:
X_test_idx = list(map(torch.LongTensor, X_test_idx))  # A list can be reused; a map object is consumed after one pass

In [None]:
X_test_padded = pad_sequence(X_test_idx, batch_first=True)

In [None]:
print('X_test_padded:', X_test_padded[1])

In [None]:
X_test_padded.shape
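
To see what the padding does, here is a small sketch with three toy index sequences of different lengths. The values are illustrative assumptions. With `batch_first=True`, the result has one row per sentence, and the shorter sequences are completed with zeros.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy sentences of lengths 3, 1, and 2
toy_idx = [[5, 8, 3], [7], [4, 2]]
toy_tensors = [torch.LongTensor(seq) for seq in toy_idx]

toy_padded = pad_sequence(toy_tensors, batch_first=True)
print(toy_padded.shape)
print(toy_padded)
```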

Predict the NER tags. Call the result `Y_test_hat_probs`

In [None]:
# Write your code
Y_test_hat_probs = batch_inference(model, X_test_padded)

In [None]:
print('Predictions', Y_test_hat_probs[1])

In [None]:
Y_test_hat_probs = F.softmax(Y_test_hat_probs, dim=-1)

In [None]:
Y_test_hat_probs[1]

We now predict the whole test set and we store the results in each dictionary with the key `pner`

In [None]:
for sent, y_hat_probs in zip(test_dict, Y_test_hat_probs):
    sent_len = len(sent)
    y_hat_probs = y_hat_probs[:sent_len]
    y_hat = torch.argmax(y_hat_probs, dim=-1) # This statement sometimes predicts 0 (the padding symbol)
    # y_hat = torch.argmax(y_hat_probs[:, 1:], dim=-1) + 1 # Never predicts 0
    for word, ner_hat in zip(sent, y_hat):
        word['pner'] = idx2tag.get(int(ner_hat))
        if word['pner'] is None:
            print(sent)
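
The commented-out alternative in the loop above excludes column 0 before applying argmax and then shifts the result by one, so that the padding symbol is never predicted. A sketch on toy probabilities, which are illustrative assumptions, shows the difference between the two variants:

```python
import torch

# Toy probabilities for 2 words and 3 classes (class 0 is the padding symbol)
toy_probs = torch.tensor([[0.50, 0.30, 0.20],
                          [0.10, 0.20, 0.70]])

plain = torch.argmax(toy_probs, dim=-1)                # may return 0
no_pad = torch.argmax(toy_probs[:, 1:], dim=-1) + 1    # never returns 0
print(plain, no_pad)
```

For the first word, the plain argmax selects the padding class, while the restricted variant falls back to the best real tag.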

A sentence example: `ner` is the hand annotation and `pner` is the prediction.

In [None]:
test_dict[1]

We save the test set in a file to evaluate the performance of our model.

In [None]:
column_names = ['form', 'ppos', 'pchunk', 'ner', 'pner']

In [None]:
def save(file, corpus_dict, column_names):
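    """
    Saves the corpus in a file, one token per line with space-separated columns
    and an empty line after each sentence
    :param file: path of the output file
    :param corpus_dict: list of sentences, each a list of dictionaries (one per token)
    :param column_names: keys to write for each token; a missing key is written as '_'
    :return: None
    """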
    with open(file, 'w', encoding='utf8') as f_out:
        for sentence in corpus_dict:
            sentence_lst = []
            for row in sentence:
                items = map(lambda x: str(row.get(x, '_')), column_names)  # Convert to string
                sentence_lst.append(' '.join(items) + '\n')  # Append to list
            sentence_lst.append('\n')  # Add empty line after each sentence
            f_out.write(''.join(sentence_lst))
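
The file layout produced by `save` can be illustrated on a toy corpus. The sketch below builds the same text in memory instead of writing it to disk. The function `format_corpus`, the toy sentences, and the reduced column list are illustrative assumptions.

```python
def format_corpus(corpus_dict, column_names):
    """Build the text that a save function in this style would write (illustrative)."""
    lines = []
    for sentence in corpus_dict:
        for row in sentence:
            # A missing key is written as '_'
            lines.append(' '.join(str(row.get(col, '_')) for col in column_names))
        lines.append('')  # Empty line after each sentence
    return '\n'.join(lines) + '\n'


toy_corpus = [
    [{'form': 'Japan', 'ner': 'B-LOC', 'pner': 'B-LOC'},
     {'form': 'wins', 'ner': 'O', 'pner': 'O'}],
    [{'form': 'Hello', 'ner': 'O'}],  # 'pner' is missing on purpose
]
toy_columns = ['form', 'ner', 'pner']

toy_text = format_corpus(toy_corpus, toy_columns)
print(toy_text)
```

The last two columns of each line hold the hand annotation and the prediction, which is the layout the evaluation step reads.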

In [None]:

if LSTM:
    outfile = f'outputs/LSTM/lstm_model_{MOD}.out'
else:
    outfile = f'outputs/RNN/RNN_model_{MOD}.out'
if RETRAIN:
    save(outfile, test_dict, column_names)

In [None]:
with open(outfile, encoding='utf8') as f_in:
    lines = f_in.read().splitlines()
res = conlleval.evaluate(lines)
chunker_score = res['overall']['chunks']['evals']['f1']
chunker_score

# First test of LSTM gave 0.8911 - {'EPOCHS': 10, 'LSTM_HIDDEN_DIM': 64, 'LSTM_LAYERS': 2, 'DROPOUT': 0.3, 'EMB_LARGE': True, 'FREEZE_EMBS': True}
# Second test of LSTM gave 0.9079 - {'EPOCHS': 20, 'LSTM_HIDDEN_DIM': 98, 'LSTM_LAYERS': 3, 'DROPOUT': 0.3, 'EMB_LARGE': True, 'FREEZE_EMBS': True}
# First test of RNN gave 0.8247 - {'EPOCHS': 10, 'RNN_HIDDEN_DIM': 64, 'RNN_LAYERS': 2, 'DROPOUT': 0.3, 'EMB_LARGE': True, 'FREEZE_EMBS': True}
# Second test of RNN gave 0.8533 - {'EPOCHS': 20, 'RNN_HIDDEN_DIM': 98, 'RNN_LAYERS': 3, 'DROPOUT': 0.4, 'EMB_LARGE': True, 'FREEZE_EMBS': True}
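
The F1 score above is computed on whole entities (chunks), not on individual tags: a predicted entity counts as correct only if its boundaries and its type both match the annotation. The sketch below illustrates this idea on one toy sentence. It assumes IOB-style tags such as `B-LOC`, `I-LOC`, and `O`. The functions `extract_chunks` and `chunk_f1` are hypothetical helpers and not the implementation used by `conlleval`.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) spans found in an IOB tag sequence."""
    chunks = set()
    start, chunk_type = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # The sentinel closes the last chunk
        prefix, _, tag_type = tag.partition('-')
        # Close the open chunk unless the current tag continues it
        if start is not None and (prefix != 'I' or tag_type != chunk_type):
            chunks.add((start, i, chunk_type))
            start, chunk_type = None, None
        # Open a new chunk on B-, or on an I- that does not continue a chunk
        if prefix == 'B' or (prefix == 'I' and start is None):
            start, chunk_type = i, tag_type
    return chunks


def chunk_f1(gold_tags, pred_tags):
    """Chunk-level F1 for one pair of tag sequences (illustrative)."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


toy_gold = ['B-LOC', 'I-LOC', 'O', 'B-PER']
toy_pred = ['B-LOC', 'I-LOC', 'O', 'O']
toy_f1 = chunk_f1(toy_gold, toy_pred)
print(toy_f1)
```

Here the prediction finds one of the two annotated entities, so the precision is 1, the recall is 0.5, and the F1 score is about 0.67, even though three of the four tags are correct.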

In [None]:
if RETRAIN:
    print(config)  # A bare expression inside an if block is not displayed

In [None]:
if RETRAIN:
    if LSTM:
        torch.save(model.state_dict(), f'outputs/LSTM/LSTM_model_{MOD}.pth')
    else:
        torch.save(model.state_dict(), f'outputs/RNN/RNN_model_{MOD}.pth')
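
Saving a `state_dict` stores only the parameters, so reloading requires creating a model with the same architecture first. A minimal round trip can be sketched as below. It uses a small `nn.Linear` layer as a stand-in for the tagger and a temporary directory; both are illustrative assumptions.

```python
import os
import tempfile

import torch
import torch.nn as nn

toy_model = nn.Linear(4, 2)  # Stand-in for the tagger

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_model.pth')
    torch.save(toy_model.state_dict(), toy_path)

    # Recreate the same architecture, then load the saved parameters
    restored = nn.Linear(4, 2)
    state_dict = torch.load(toy_path, map_location=torch.device('cpu'))
    restored.load_state_dict(state_dict)

restored.eval()  # Switch to evaluation mode before inference
weights_match = torch.equal(toy_model.weight, restored.weight)
print(weights_match)
```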

## Experiments

You will carry out experiments with two different recurrent networks: RNN and LSTM. You will also try at least two sets of parameters per network. In your report, you will present your results in a table like this one:

|Method|Parameters|Score|
|------|-----|-----|
|Baseline|  xx | xx |
|RNN|  xx |xx |
|RNN |  xx |xx |
|LSTM |  xx |xx |
|LSTM |  xx |xx |

The baseline is the one from the CoNLL 2003 shared task. See here: https://aclanthology.org/W03-0419.pdf

You need to reach an F1 score of at least 80% (0.80 as computed by `conlleval` above) to pass the lab.

## Turning in your assignment

Now you are done with the program. To complete this assignment, you will:
1. Write a short individual report on your program. You will describe the architecture you used, the different experiments you carried out, and your results.


Submit your report as well as your notebook (for archiving purposes) to Canvas: https://canvas.education.lu.se/. To write your report, you can either
1. Write your text directly in Canvas, or
2. Use LaTeX and Overleaf (www.overleaf.com). This will probably help you structure your text. You will then upload a PDF file to Canvas.
