# Classroom 6 - Training a Named Entity Recognition Model with an LSTM

The classroom today is primarily geared towards preparing you for Assignment 4 which you'll be working on after today. The notebook is split into three main parts to get you thinking. You should work through these sections in groups together in class. 

If you have any questions or things you don't understand, make a note of them so you can remember to ask - or, even better, post them to Slack!

If you get through everything here, make a start on the assignment. If you don't, don't worry about it - but I suggest you finish all of the exercises here before starting the assignment.

## 1. A very short intro to NER
Named entity recognition (NER), also known as named entity extraction or entity identification, is the task of finding named entities in unstructured text and classifying them into predefined categories such as names, medical codes, quantities or similar.

The most common variant is the [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) format, which uses the categories person (PER), organization (ORG), location (LOC) and miscellaneous (MISC); the last covers cases such as nationalities. For example:

*Hello my name is $Ross_{PER}$ I live in $Aarhus_{LOC}$ and work at $AU_{ORG}$.*

For example, let's see how this works with ```spaCy```. NB: you might need to remember to install a ```spaCy``` model:

```python -m spacy download en_core_web_sm```

In [13]:
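import spacy

# Importing a language model using spacy.
nlp = spacy.load("en_core_web_sm")

# Run the pipeline on an example text to get the `doc` used in the cells below.
# NB: this sentence is reconstructed from the token output further down; the original text may have been longer.
doc = nlp("Hello my name is Ross. I live in Denmark and work at Aarhus University.")

def tprint(obj):
    print("is an object of type", type(obj))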


In [14]:
print("nlp")
tprint(nlp)
print("doc")
tprint(doc)

nlp
is an object of type <class 'spacy.lang.en.English'>
doc
is an object of type <class 'spacy.tokens.doc.Doc'>


In [11]:
from spacy import displacy
displacy.render(doc, style="ent")

print("Here, we're importing displacy, and using its .render method upon the doc object from before, in ent-style, to get the above output.")

Here, we're importing displacy, and using its .render method upon the doc object from before, in ent-style, to get the above output.


## Tagging standards
There are several tagging standards for NER. The most widely used is the BIO format (also written IOB), which frames the task as token classification: each token is labelled as the beginning of, inside, or outside an entity.

Words marked with *O* are not part of a named entity. Tags which start with *B-\** mark the first token of an entity (e.g. *B-ORG* for *Aarhus* in *Aarhus University*), while tags which start with *I-\** mark the continuation of that entity (e.g. *I-ORG* for *University*).

    B = Beginning
    I = Inside
    O = Outside
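
To make the tagging scheme concrete, here is a minimal sketch that turns entity spans into one BIO tag per token. The function name `spans_to_bio` and the example annotations are made up for illustration; they are not part of spaCy or of the assignment code.

```python
def spans_to_bio(tokens, spans):
    """Turn entity spans into one BIO tag per token.

    spans: list of (start, end, label) token offsets, with end exclusive.
    """
    tags = ["O"] * len(tokens)          # everything starts outside an entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the same entity
    return tags


tokens = ["I", "work", "at", "Aarhus", "University", "in", "Denmark"]
spans = [(3, 5, "ORG"), (6, 7, "LOC")]  # hypothetical gold annotations
for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(token, tag)
```

Note that a single-token entity such as *Denmark* still gets a *B-* tag; *I-* tags only appear when an entity spans more than one token.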

<details>
<summary>Q: What other formats and standards are available? What kinds of entities do they make it possible to tag?</summary>
<br>
You can see more examples in the spaCy documentation for their [different models](https://spacy.io/models/en).
</details>


In [17]:
for t in doc: # For each token (which is a string-like spacy.tokens.token.Token object) in doc:
    #tprint(t)
    if t.ent_type:
        print(t, f"{t.ent_iob_}-{t.ent_type_}")
    else:
        print(t, t.ent_iob_)

Hello O
my O
name O
is O
Ross B-PERSON
. O
I O
live O
in O
Denmark B-GPE
and O
work O
at O
Aarhus B-ORG
University I-ORG

### Some challenges with NER
While NER is usually framed as above, this formulation has some limitations.

For instance, the entity Aarhus University refers both to the location Aarhus and to the university within it. Nested NER (N-NER) therefore argues that it would be more correct to tag it in a nested fashion as \[\[$Aarhus_{LOC}$\] $University$\]$_{ORG}$ (Plank, 2020).

Related tasks include named entity linking, which is the task of linking an entity to e.g. a Wikipedia entry. Here you have to know both that something is an entity and which entity it is (if it is a defined entity at all).

In this assignment, we'll be using Bi-LSTMs to train an NER model on a predefined data set which uses IOB tags of the kind we outlined above.

## 2. Training in batches

When you trained your document classifier for the last assignment, you probably noticed that the neural network was quite brittle. Small changes in the hyperparameters could cause massive changes in performance. Likewise, you probably noticed that they tend to substantially overfit the training data and underperform on the validation and test data.

One way we can get around this is by processing the data in smaller chunks known as *batches*. 

<details>
<summary>Q: Why might it be a good idea to train on batches, rather than the whole dataset?</summary>
<br>
These batches are usually small (something like 32 instances at a time) but they have a couple of important effects on training:

- The instances in a batch can be processed in parallel rather than sequentially. This can result in a substantial speed-up from a computational perspective.
- Similarly, smaller batch sizes make it easier to fit training data into memory.
- Lastly, gradient estimates from small batches are noisy, which has a regularizing effect and can thus lead to less overfitting.

In this assignment, we're going to be using batches of data to train our NER model. To do that, we first have to prepare our batches for training. You can read more about batching in [this blog post](https://machinelearningmastery.com/how-to-control-the-speed-and-stability-of-training-neural-networks-with-gradient-descent-batch-size/).

</details>



In [18]:
# this allows us to look one step up in the directory
# for importing custom modules from src
import sys
sys.path.append("..")
from src.util import batch
from src.LSTM import RNN
from src.embedding import gensim_to_torch_embedding

# numpy and pytorch
import numpy as np
import torch
from torch import nn

# loading data and embeddings
from datasets import load_dataset
import gensim.downloader as api

We can download the dataset using the ```load_dataset()``` function we've already seen. Here we take only the training data.

When you've downloaded the dataset, you're welcome to save a local copy so that we don't need to download it again every time the code runs.

Q: What do the ```train.features``` values refer to?

In [19]:
# DATASET
dataset = load_dataset("conllpp")


Dataset conllpp downloaded and prepared to /home/coder/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2. Subsequent calls will reuse this data.



In [20]:

# inspect the dataset
print("dataset")
tprint(dataset)
print(dataset)


dataset
is an object of type <class 'datasets.dataset_dict.DatasetDict'>
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


In [24]:
# inspect train
train = dataset["train"]

print(train["tokens"][:5])
tprint(train["tokens"][0])
print(train["ner_tags"][:5])

# get number of classes
num_classes = train.features["ner_tags"].feature.num_classes

[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['Peter', 'Blackburn'], ['BRUSSELS', '1996-08-22'], ['The', 'European', 'Commission', 'said', 'on', 'Thursday', 'it', 'disagreed', 'with', 'German', 'advice', 'to', 'consumers', 'to', 'shun', 'British', 'lamb', 'until', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'can', 'be', 'transmitted', 'to', 'sheep', '.'], ['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']]
is an object of type <class 'list'>
[[3, 0, 7, 0, 0, 0, 7, 0, 0], [1, 2], [5, 0], [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0]]


In [28]:
print(num_classes)
print(train.features["ner_tags"].feature)
print(train.features["chunk_tags"].feature)
print(train.features["pos_tags"].feature)

9
ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None)
ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None)
ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None)
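
The integer ```ner_tags``` are just positions in the ```ClassLabel``` name list printed above. As a small sketch, we can decode the first training sentence by hand; the label list and the example sentence are copied from the outputs in this notebook, while the variable names are made up for illustration.

```python
# Label names in index order, copied from the ClassLabel output above.
ner_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# First training sentence and its integer tags, copied from the earlier output.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tag_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Each integer is an index into the list of label names.
decoded = [ner_names[i] for i in tag_ids]
for token, tag in zip(tokens, decoded):
    print(f"{token:10} {tag}")
```

With the real dataset, ```train.features["ner_tags"].feature.int2str(tag_ids)``` should give the same mapping without copying the list by hand.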


We then use ```gensim``` to get some pretrained word embeddings for the input layer to the model. 

In this example, we're going to use a GloVe model pretrained on Wikipedia, with 50 dimensions.

I've provided a helper function to take the ```gensim``` embeddings and prepare them for ```pytorch```.

In [29]:
# CONVERTING EMBEDDINGS
model = api.load("glove-wiki-gigaword-50")
print("Loaded!")

Loaded!


In [56]:
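import importlib  # the old `imp` module is deprecated and was removed in Python 3.12
import src.embedding as embedding
importlib.reload(embedding)  # pick up edits to src/embedding.py without restarting the kernel
from src.embedding import gensim_to_torch_embedding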
# convert gensim word embedding to torch word embedding
embedding_layer, vocab = gensim_to_torch_embedding(model)


True


In [55]:
#print("model")
#tprint(model)
#print("which looks like this when printed as a whole. Evidently, it can be indexed.")
#print(model)
#print("The first index (model[1])")
#tprint(model[1])
#print("and looks like this!")
#print(model[1])
# print("In short, this GloVe model holds 400000 word vectors with 50 dimensions each,")
# print("i.e. every word is a point in a 50-dimensional semantic space. Neat!")
keytoindex = model.key_to_index
tprint(keytoindex)
len(keytoindex) # 400002 entries (not displayed, since it is not the last expression in the cell)
print(keytoindex["car"]) # The word "car" has index 569. In other words, row 569 of the pretrained
# embedding matrix is the vector that encodes "car" in the semantic vector space!

is an object of type <class 'dict'>
569


In [39]:
#print(model.vectors)
tprint(model.vectors)
model.vectors.shape # Model contains a 400000 x 50 dimensional matrix, with each row being a word vector!

is an object of type <class 'numpy.ndarray'>


(400000, 50)
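
Notice that the matrix above has 400000 rows while the vocabulary printed earlier has 400002 entries. That is consistent with the helper adding two special tokens, ```UNK``` and ```PAD```. The sketch below shows one way this could be done on a toy vocabulary. How the real ```gensim_to_torch_embedding()``` initialises those two rows is an assumption here (mean vector for ```UNK```, zeros for ```PAD```), so check ```src/embedding.py``` for what it actually does.

```python
import numpy as np

# Toy stand-ins for model.key_to_index and model.vectors (3 words, 4 dimensions).
key_to_index = {"the": 0, "car": 1, "lamb": 2}
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)

# Assumption: UNK gets the mean vector and PAD gets zeros; the real helper may differ.
unk_vector = vectors.mean(axis=0, keepdims=True)
pad_vector = np.zeros((1, vectors.shape[1]), dtype=vectors.dtype)
extended = np.vstack([vectors, unk_vector, pad_vector])

vocab = dict(key_to_index)
vocab["UNK"] = len(vocab)   # index 3, the first row after the original words
vocab["PAD"] = len(vocab)   # index 4, the last row

print(extended.shape)           # two more rows than `vectors`
print(extended[vocab["car"]])   # row lookup: the vector for "car"
```

An embedding layer is essentially this row lookup: a token index goes in, the corresponding row of the matrix comes out.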

### Preparing a batch

The first thing we want to do is to shuffle our dataset before training. 

Why might it be a good idea to shuffle the data?

In [59]:
# shuffle dataset
shuffled_train = dataset["train"].shuffle(seed=1) # Reproducible seed is nice. Dataset has a builtin shuffle function. Neat.
shuffled_train["tokens"][:5] # peek at the first five shuffled sentences



[['--',
  'FLNC',
  'Corsican',
  'nationalist',
  'movement',
  'announces',
  'end',
  'of',
  'truce',
  'after',
  'last',
  'night',
  "'s",
  'attacks',
  '.'],
 ['4-6', '7-6', '(', '7-4', ')'],
 ['Dealers',
  'said',
  'that',
  'the',
  'volume',
  'of',
  'longer-term',
  'government',
  'paper',
  'declined',
  'due',
  'to',
  'market',
  'nervousness',
  '.'],
 ['MONTREAL', 'AT', 'SAN', 'FRANCISCO'],
 ['"',
  'I',
  'would',
  'love',
  'to',
  'speak',
  'about',
  'everything',
  ',',
  '"',
  'said',
  'Simpson',
  ',',
  'who',
  'vowed',
  'after',
  'his',
  'acquittal',
  'to',
  'find',
  'the',
  'killers',
  'and',
  'offered',
  'a',
  'substantial',
  'reward',
  '.']]

Next, we want to bundle the shuffled training data into smaller batches of predefined size. I've written a small utility function here to help. 

<details>
<summary>Q: Can you explain how the ```batch()``` function works?</summary>
<br>
 Hint: Check out [this link](https://realpython.com/introduction-to-python-generators/).
</details>
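
If you get stuck, here is a minimal sketch of what a generator-based batching function can look like. This is a stand-in written for illustration, not necessarily the same code as ```batch()``` in ```src/util.py```.

```python
def batch(dataset, batch_size):
    """Yield successive slices of `dataset`, each at most `batch_size` long."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]   # the last slice may be shorter


sentences = [["a"], ["b", "c"], ["d"], ["e", "f"], ["g"]]
batches = batch(sentences, batch_size=2)   # nothing is computed yet
print(next(batches))                       # the first two sentences
print([len(b) for b in batches])           # sizes of the remaining batches
```

Because it is a generator, batches are produced one at a time when you ask for them (with ```next()``` or a ```for``` loop), and once a batch has been consumed it is gone.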



In [63]:
batch_size = 32
batches_tokens = batch(shuffled_train["tokens"], batch_size)
batches_tags = batch(shuffled_train["ner_tags"], batch_size)

In [64]:
# Inspecting vocab

len(model.key_to_index)

400002

Next, we want to use the ```tokens_to_idx()``` function below on our batches.

<details>
<summary>Q: What is this function doing? Why is it doing it?</summary>
<br>
We're making everything lowercase and relying on a special, arbitrary token called ```UNK``` in the vocabulary. ```UNK``` means "unknown" and is used to replace out-of-vocabulary tokens in the data - i.e. tokens that don't appear in the vocabulary of the pretrained word embeddings.
</details>


In [65]:
def tokens_to_idx(tokens, vocab=model.key_to_index):
    """
    - Write documentation for this function including type hints for each argument and return statement
    
    - What does the .get method do?
    Gets an index from model.key_to_index for the token, assuming it exists in the vocabulary.
    - Why lowercase?
    Because the model only has lowercase words.
    """

    
    return [vocab.get(t.lower(), vocab["UNK"]) for t in tokens]

# This is a beautiful, synergetic marriage of vocab (a dictionary), the .get method of a dictionary, and a list comprehension.
# Basically, give me a list of token indexes (i.e., which row of the embedding matrix does each token correspond to?), in the same order as the tokens appear.
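
To see the fallback behaviour of ```.get``` in isolation, here is a small sketch on a toy vocabulary. The names ```toy_vocab``` and ```tokens_to_idx_demo``` are made up for this example; the logic mirrors the function above.

```python
from typing import Dict, List

# Toy vocabulary; in the notebook this role is played by the GloVe vocabulary.
toy_vocab: Dict[str, int] = {"the": 0, "lamb": 1, "UNK": 2, "PAD": 3}


def tokens_to_idx_demo(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map each token to its vocabulary index, falling back to the UNK index."""
    unk = vocab["UNK"]
    return [vocab.get(token.lower(), unk) for token in tokens]


ids = tokens_to_idx_demo(["The", "British", "lamb"], toy_vocab)
print(ids)  # "British" is not in the toy vocabulary, so it maps to UNK
```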

We'll check below that everything is working as expected by testing it on a single batch.

In [74]:
# sample using only the first batch
batch_tokens = next(batches_tokens)
batch_tags = next(batches_tags)
batch_tok_idx = [tokens_to_idx(sent) for sent in batch_tokens] # DOUBLE LIST COMPREHENSION!

#print(batch_tok_idx)
batch_tok_idx[2]

(['THURSDAY', ',', 'AUGUST', '29TH', 'SCHEDULE'], ['Zenith', 'lands', '$', '1', 'billion', 'contract', ',', 'plans', '$', '100', 'million', 'plant', '.'], ['Corporate', 'America', 'taking', 'new', 'view', 'on', 'compensation', '.'], ['Puchon', '3', 'Chonan', '0', '(', 'halftime', '1-0', ')'], ['A', 'drought', 'scoured', 'the', 'steppes', 'in', 'May', 'and', 'June', ',', 'stunting', 'the', 'growing', 'wheat', '.'], ['--', 'Paris', 'newsroom', '+33', '1', '4221', '5452'], ['Dole', 'also', 'said', 'he', 'opposed', 'California', 'Proposition', '215', 'which', ',', 'if', 'approved', 'by', 'voters', ',', 'would', 'allow', 'the', 'cultivation', 'of', 'marijuana', 'plants', 'for', 'medicinal', 'uses', '.'], ['England', '326', 'all', 'out'], ['But', 'analysts', 'noted', 'that', 'Sierra', 'still', 'has', 'much', 'painful', 'work', 'ahead', 'of', 'it', ',', 'including', 'cutting', 'as', 'many', 'as', '150', 'jobs', 'from', 'its', 'workforce', ',', 'which', 'currently', 'has', '500', 'people', ','

[1668, 453, 582, 50, 1139, 13, 3531, 2]

As with document classification, our model needs all input sequences in a batch to have the same length, but our sentences vary in length. To get around this we do a couple of different steps.

- Find the length of the longest sequence in the batch
- Pad shorter sequences to the max length using an arbitrary token like ```PAD```
- Give the ```PAD``` token a new label ```-1``` to differentiate it from the other labels

In [70]:
# compute length of longest sentence in batch
batch_max_len = max([len(s) for s in batch_tok_idx])

Q: Can you figure out the logic of what is happening in the next two cells?

In [72]:
batch_input = vocab["PAD"] * np.ones((batch_size, batch_max_len))
#vocab["PAD"] is an integer. Specifically, 400001 in this case. It is multiplied by an array of 1's, with 32 rows and max sequence length columns.

batch_labels = -1 * np.ones((batch_size, batch_max_len))
# Same here, only we're multiplying with -1.
print(batch_input)
print(batch_labels)


[[400001. 400001. 400001. ... 400001. 400001. 400001.]
 [400001. 400001. 400001. ... 400001. 400001. 400001.]
 [400001. 400001. 400001. ... 400001. 400001. 400001.]
 ...
 [400001. 400001. 400001. ... 400001. 400001. 400001.]
 [400001. 400001. 400001. ... 400001. 400001. 400001.]
 [400001. 400001. 400001. ... 400001. 400001. 400001.]]
[[-1. -1. -1. ... -1. -1. -1.]
 [-1. -1. -1. ... -1. -1. -1.]
 [-1. -1. -1. ... -1. -1. -1.]
 ...
 [-1. -1. -1. ... -1. -1. -1.]
 [-1. -1. -1. ... -1. -1. -1.]
 [-1. -1. -1. ... -1. -1. -1.]]


In [26]:
# copy the data to the numpy array
for i in range(batch_size): # For 0, 1, 2, ..., 31
    tok_idx = batch_tok_idx[i] # Token indexes from the i'th sequence in the batch.
    tags = batch_tags[i] # Token tags for the i'th sequence in the batch.
    size = len(tok_idx) # Length of token indexes = length of sequence.

    batch_input[i][:size] = tok_idx # The batch_input's i'th row (sequence) is filled up with token indexes, until there are no more.
    batch_labels[i][:size] = tags # Same with batch labels.
    # The rest are "PAD" indexed and -1 labelled (-1 in ner-tags means pad, based on our current definitions)
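
The padding steps above can also be wrapped in one function. This is a sketch with a made-up name (```pad_batch```) and toy inputs. Unlike the loop above, it takes the number of rows from the batch itself, so it also works when the final batch has fewer than ```batch_size``` sentences, and it creates integer arrays directly.

```python
import numpy as np


def pad_batch(batch_tok_idx, batch_tags, pad_idx, pad_label=-1):
    """Pad token indices and tags to the longest sentence in the batch."""
    n_rows = len(batch_tok_idx)                      # works for a short final batch too
    max_len = max(len(sent) for sent in batch_tok_idx)
    inputs = np.full((n_rows, max_len), pad_idx, dtype=np.int64)
    labels = np.full((n_rows, max_len), pad_label, dtype=np.int64)
    for i, (tok_idx, tags) in enumerate(zip(batch_tok_idx, batch_tags)):
        inputs[i, :len(tok_idx)] = tok_idx           # real tokens first, padding after
        labels[i, :len(tags)] = tags
    return inputs, labels


# Two toy sentences of length 3 and 1; 99 stands in for vocab["PAD"].
inputs, labels = pad_batch([[5, 8, 2], [7]], [[3, 0, 0], [1]], pad_idx=99)
print(inputs)
print(labels)
```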

The last step is to convert the arrays into ```pytorch``` tensors, ready for the NN model.

In [27]:
# since all data are indices, we convert them to torch LongTensors (integers)
batch_input, batch_labels = torch.LongTensor(batch_input), torch.LongTensor(
    batch_labels
)

With our data now batched and processed, we want to run it through our RNN the same way as when we trained a classifier. Note that this cell is incomplete and won't yet run; that's part of the assignment!

Q: Why is ```output_dim = num_classes + 1```?

In [None]:
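import importlib  # the old `imp` module is deprecated and was removed in Python 3.12
import src.LSTM as LSTM
importlib.reload(LSTM)  # pick up edits to src/LSTM.py without restarting the kernel
from src.LSTM import RNN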


# CREATE MODEL
model = RNN(
    embedding_layer=embedding_layer, output_dim = num_classes + 1, hidden_dim_size = 256
)

# FORWARD PASS
X = batch_input
y = model(X)

loss = model.loss_fn(outputs=y, labels=batch_labels)

# etc, etc
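
When you get to the loss, remember that the padded positions carry the label ```-1``` and should not count towards it. One common way to handle this is to mask them out. The sketch below does so in plain ```numpy``` on made-up numbers; it is not necessarily how ```model.loss_fn``` in ```src/LSTM.py``` is written, and the name ```masked_nll``` is hypothetical.

```python
import numpy as np


def masked_nll(log_probs, labels, pad_label=-1):
    """Mean negative log-likelihood over positions whose label is not padding.

    log_probs: (n_tokens, n_classes) log-probabilities, flattened over the batch.
    labels:    (n_tokens,) integer labels, with pad_label marking padding.
    """
    mask = labels != pad_label
    safe_labels = np.where(mask, labels, 0)            # any valid index; masked out below
    picked = log_probs[np.arange(len(labels)), safe_labels]
    return -(picked * mask).sum() / mask.sum()


log_probs = np.log(np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]))
labels = np.array([0, 1, -1])                          # the last position is padding
print(masked_nll(log_probs, labels))
```

Only the first two positions contribute here, so changing the predictions for the padded position leaves the loss unchanged.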



## 3. Creating an LSTM with ```pytorch```

In the file [LSTM.py](../src/LSTM.py), I've already created an LSTM for you using ```pytorch```. Take some time to read through the code and make sure you understand how it's built up.

Some questions for you to discuss in groups:

- How is an LSTM layer created using ```pytorch```? How does the code compare to the classifier code you wrote last week?
- What's going on with that weird bit that says ```@staticmethod```?
  - [This might help](https://realpython.com/instance-class-and-static-methods-demystified/).
- On the forward pass, we use ```log_softmax()``` to make output predictions. What is this, and how does it relate to the output from the sigmoid function that we used in the document classification?
- How would we make this LSTM model *bidirectional* - i.e. make it a Bi-LSTM? 
  - Hint: Check the documentation for the LSTM layer on the ```pytorch``` website.
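
To get you started on the ```log_softmax()``` question, here is a small sketch in plain Python with made-up scores. It is only meant to show the arithmetic, not how ```pytorch``` implements it: sigmoid turns one score into one probability, while softmax turns a whole list of scores into probabilities that sum to one, and ```log_softmax``` is simply the log of that.

```python
import math


def sigmoid(x):
    """Squash one score into (0, 1): suited to a single yes/no decision."""
    return 1.0 / (1.0 + math.exp(-x))


def log_softmax(scores):
    """Log of the softmax over a list of scores, computed in a numerically stable way."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


scores = [2.0, 0.5, -1.0]                      # made-up scores for three classes
log_probs = log_softmax(scores)
probs = [math.exp(lp) for lp in log_probs]

print(sum(probs))                              # the class probabilities sum to 1 (up to rounding)
print(log_probs.index(max(log_probs)))         # index of the most likely class
print(sigmoid(2.0))                            # one probability for one binary decision
```

A useful sanity check: with exactly two classes and scores ```[x, 0]```, the softmax probability of the first class equals ```sigmoid(x)```, so softmax can be seen as the multi-class generalisation of the sigmoid.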