# Lab 6: LSTM for POS tagging

Long Short-Term Memory networks (LSTMs) are a variant of Recurrent Neural Networks (RNNs).

RNNs represent the prior context using recurrent network connections between some of their nodes, so that output from these nodes affects the subsequent input to the same nodes.

However, RNNs are infamous for the vanishing gradient problem. LSTMs address this problem with gates that control the flow of information into and out of the units in their layers.
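
For reference, one common formulation of these gates (the one given in the PyTorch `nn.LSTM` documentation) is sketched below. Here $x_t$ is the input at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the sigmoid function and $\odot$ element-wise multiplication. These symbols are notation introduced for this sketch only; they are not used elsewhere in the lab.

$$
\begin{aligned}
i_t &= \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_t &= \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_t &= \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_t &= \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The input gate $i_t$, forget gate $f_t$ and output gate $o_t$ decide what is written to, kept in and read from the cell state. The additive update of $c_t$ is generally credited with letting gradients survive over longer sequences than in a plain RNN.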

In today's lab, we will be working on a part-of-speech (POS) tagging task using an LSTM network.
More specifically:
1) We will work on a simple LSTM network. You can experiment with different values for:
   * word embedding and hidden layer dimensions
   * number of training epochs
2) We will use a 10% sample of the Penn Treebank corpus as our training set.
3) We will perform some simple preprocessing on our training data in order to transform it into tensors of word indices, which the embedding layer of our LSTM network turns into word embeddings. You are free to experiment further, e.g. by lowercasing the text, removing punctuation, etc.
4) Finally, we will test our LSTM POS tagger on a few sentences by comparing its predictions to those of the NLTK POS tagger.

You may need to install a few libraries first (there are relevant comments in the cells that potentially require installation of some libraries).

In [1]:
# install pytorch by uncommenting the next line:
# !pip install torch

import torch
import torch.nn as nn # network layers and loss functions
import torch.nn.functional as F # log_softmax function
import torch.optim as optim # optimisation algorithm

torch.manual_seed(1)

<torch._C.Generator at 0x112309330>

### Load the training data

We will work on a subset of the Penn Treebank dataset for POS tagging that is available through NLTK.
Each sentence in ```treebank_chunk``` (the dataset) is represented as a list of tuples, each tuple consisting of a word and its POS tag.

In particular, the first sentence of the dataset looks like this:

```[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]```

[Here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) is a list of the Penn Treebank POS tags.

We want to transform each sentence into a tuple that consists of a list of the sentence's words and a list of their corresponding POS tags.

This is the purpose of the function ```prepare_treebank_data``` below:

In [2]:
# install nltk
# !pip install nltk

from nltk.corpus import treebank_chunk  # we will use treebank_chunk.tagged_sents()

# Represent each sentence found in treebank_chunk as a tuple,
# consisting of a list of words and a list of corresponding POS tags

def prepare_treebank_data(data):
    # build the list locally, so that calling the function again
    # does not append the same sentences a second time
    treebank_data = []
    for sentence in data:
        words = []
        tags = []
        for word, tag in sentence:
            words.append(word)
            tags.append(tag)
        treebank_data.append((words, tags))
    return treebank_data


# load our training data as list of sentences
# each sentence is a tuple
# each tuple contains a list of the sentence's words, and a list of their corresponding POS tags.

training_data = prepare_treebank_data(treebank_chunk.tagged_sents())

# create our vocabulary
vocabulary = set()
for words, _ in training_data:
    vocabulary.update(words)

You can run ```help(treebank_chunk)``` to see what other methods are available apart from ```tagged_sents```.

In [3]:
# help(treebank_chunk)
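
As a side note on the word/tag split performed by ```prepare_treebank_data```: the same transformation can be sketched with ```zip```. This is only an alternative illustration on the first few tokens of the sentence shown earlier; the variable name ```tagged_sentence``` is made up for this example.

```python
# First few (word, tag) pairs of the opening sentence shown earlier in this lab.
tagged_sentence = [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD')]

# zip(*pairs) transposes a list of pairs into one tuple of words and one tuple of tags.
words, tags = zip(*tagged_sentence)
words, tags = list(words), list(tags)

print(words)  # ['Pierre', 'Vinken', ',', '61']
print(tags)   # ['NNP', 'NNP', ',', 'CD']
```

Note that this shortcut fails on an empty sentence, because there is nothing to unpack, whereas the explicit loop simply returns two empty lists.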

### Preprocessing

Now that we have loaded our training set, we need to preprocess each sentence before it is passed as input to the LSTM.

Every word must be assigned a unique identifier (i.e., an integer value). These identifiers are then used to represent each sentence. For example, the sentence ```That day was a wonderful day``` could be represented as ```309 12 5 0 98 12```.

Finally, we have to transform these numerical values into tensor values, using ```torch.tensor(idxs, dtype=torch.long)```, where ```idxs``` is a list of the integer values corresponding to each word of the sentence. In the previous example, it would be ```[309, 12, 5, 0, 98, 12]```.
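
The two steps can be sketched on the example sentence alone. The mapping ```toy_word_to_idx``` is a hypothetical stand-in built from this single sentence, so the indices differ from the illustrative ```309 12 5 0 98 12``` above; what matters is that both occurrences of ```day``` receive the same index.

```python
import torch

sentence = "That day was a wonderful day".split()

# Toy mapping built from this one sentence only
# (in the lab the mapping is built from the whole training set).
toy_word_to_idx = {}
for word in sentence:
    if word not in toy_word_to_idx:
        toy_word_to_idx[word] = len(toy_word_to_idx)

idxs = [toy_word_to_idx[w] for w in sentence]
sentence_tensor = torch.tensor(idxs, dtype=torch.long)

print(idxs)                   # [0, 1, 2, 3, 4, 1]
print(sentence_tensor.shape)  # torch.Size([6])
```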

Each sentence can be transformed into a tensor using the ```preprocess_sequence``` function defined below. However, we need to create two dictionaries first:
* a dictionary ```word_to_idx``` that maps a vocabulary word into an integer value, and
* a dictionary ```tag_to_idx``` that maps a POS tag into an integer value.

In [4]:
def preprocess_sequence(seq, to_idx):
    '''
    It processes a sequence of words/tags by transforming them into numerical values, 
    based on a {word : index} dictionary.
    
    Input:
    seq: a sequence of words or POS tags in the form of a list
    to_idx: a dictionary from word or POS tag to index value
    
    Output:
    a torch tensor with numerical values that correspond to the words or POS tags of the input sequence.
    '''
    
    idxs = [to_idx[w] for w in seq]  
    
    # You can apply further preprocessing steps if you want, 
    # such as change into lowercase or remove punctuation.

    return torch.tensor(idxs, dtype=torch.long)



word_to_idx = {}
tag_to_idx = {}

# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_idx:  # word has not been assigned an index yet
            word_to_idx[word] = len(word_to_idx)  # Assign each word with a unique index
    for tag in tags:
        if tag not in tag_to_idx:
            tag_to_idx[tag] = len(tag_to_idx)

Does ```word_to_idx``` have the same length as ```vocabulary```?

If it does not, and you have not applied any preprocessing step that changes the size of the vocabulary (such as punctuation removal), there is probably an error in your code.

If you have removed words from the vocabulary using punctuation removal or some other preprocessing step, you should then update the vocabulary.

Additionally, if you remove punctuation, you should also remove it from the tag sequences.
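
One possible way to keep words and tags aligned while removing punctuation is sketched below. The helper ```drop_punctuation``` is hypothetical (it is not part of the lab code), and it assumes that a token counts as punctuation when every one of its characters is in ```string.punctuation```.

```python
import string

def drop_punctuation(words, tags):
    """Remove tokens made up only of punctuation, together with their tags."""
    kept = [(w, t) for w, t in zip(words, tags)
            if not all(ch in string.punctuation for ch in w)]
    kept_words = [w for w, _ in kept]
    kept_tags = [t for _, t in kept]
    return kept_words, kept_tags

words = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join']
tags = ['NNP', 'NNP', ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB']

clean_words, clean_tags = drop_punctuation(words, tags)
print(clean_words)  # ['Pierre', 'Vinken', '61', 'years', 'old', 'will', 'join']
print(clean_tags)   # ['NNP', 'NNP', 'CD', 'NNS', 'JJ', 'MD', 'VB']

# The vocabulary has to be rebuilt from the filtered sentences.
toy_vocabulary = set(clean_words)
print(len(toy_vocabulary))  # 7
```

Filtering pairs rather than two separate lists guarantees that both sequences keep the same length. Under this definition a token such as ```$``` is dropped as well, while ```Nov.``` and ```n't``` are kept, which may or may not be what you want.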

### Create the network

Now, we will build our LSTM network step-by-step:
1) We will create a simple word embedding layer (dimensions: vocabulary size, embedding dimension).
2) The output of the embedding layer is the input to our LSTM. The LSTM outputs hidden states, with dimensionality hidden_dim.
3) We then need a linear layer to create a mapping from the hidden state space (dimensionality: hidden_dim) to the POS tag space (dimensionality: equal to the number of POS tags in our training set).
4) Log softmax turns the output of the linear layer into log-probabilities over the POS tags; for each word, the best-scoring tag is the prediction.
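
The four steps above can be traced on a toy example to see how the tensor shapes change. The sizes and the word indices below are arbitrary values chosen for illustration, and the layers are untrained, so the predicted tags are meaningless.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes chosen for illustration only.
vocab_size, embedding_dim, hidden_dim, tagset_size = 20, 8, 16, 5

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([3, 7, 1, 0], dtype=torch.long)  # a 4-word sentence as indices

embeds = embedding(sentence)                  # (4, 8): one vector per word
lstm_out, _ = lstm(embeds.unsqueeze(1))       # (4, 1, 16): (seq_len, batch=1, hidden_dim)
tag_space = hidden2tag(lstm_out.squeeze(1))   # (4, 5): one score per tag for each word
tag_scores = F.log_softmax(tag_space, dim=1)  # (4, 5): log-probabilities over the tags

print(embeds.shape, lstm_out.shape, tag_scores.shape)
print(tag_scores.argmax(dim=1))  # index of the best-scoring tag for each word
```

By default ```nn.LSTM``` expects input of shape (sequence length, batch size, input size), which is why a batch dimension of 1 is added before the LSTM and removed after it.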

In [5]:
class POS_Tagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(POS_Tagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

### Training stage

We will first set the embedding and hidden dimensions, as well as the number of training epochs.

We also need to choose a loss function and an [optimisation algorithm](https://pytorch.org/docs/stable/optim.html) (and its learning rate). Some options for loss function are [cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) and [negative log likelihood loss](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html).
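
The relationship between the two losses can be checked numerically. The sketch below uses random toy scores rather than the lab's model: cross entropy loss on raw scores equals negative log likelihood loss on their log softmax. Since the ```POS_Tagger``` defined earlier already returns log softmax values, ```nn.NLLLoss``` is the conventional pairing for it; ```nn.CrossEntropyLoss``` yields the same value because applying log softmax to log-probabilities leaves them unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
raw_scores = torch.randn(4, 5)         # 4 words, 5 possible tags (toy values)
targets = torch.tensor([0, 3, 1, 4])   # gold tag index for each word

log_probs = F.log_softmax(raw_scores, dim=1)

ce_on_raw = nn.CrossEntropyLoss()(raw_scores, targets)
nll_on_log_probs = nn.NLLLoss()(log_probs, targets)
ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, targets)

print(torch.allclose(ce_on_raw, nll_on_log_probs, atol=1e-5))  # True
print(torch.allclose(ce_on_raw, ce_on_log_probs, atol=1e-5))   # True
```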

In [6]:
EMBEDDING_DIM = 8   # try different values
HIDDEN_DIM = 16    # try different values

VOCAB_SIZE = len(word_to_idx) # should be the same as len(vocabulary), otherwise there is some error
TAGS_COUNT = len(tag_to_idx)

EPOCHS_NUM = 5   # try different values (maybe between 10-40)

model = POS_Tagger(EMBEDDING_DIM, HIDDEN_DIM, VOCAB_SIZE, TAGS_COUNT)

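# forward() already returns log-probabilities, so nn.NLLLoss() is the usual pairing;
# nn.CrossEntropyLoss() applies log_softmax once more, which leaves log-probabilities unchanged
loss_function = nn.CrossEntropyLoss() # or nn.NLLLoss() for negative log likelihood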

optimizer = optim.SGD(model.parameters(), lr=0.1)  # stochastic gradient descent with learning rate = 0.1

In [7]:
# train the model

for epoch in range(EPOCHS_NUM):

    model.train()
    for sentence, tags in training_data:

        model.zero_grad() # pytorch accumulates gradients; clear them before training each instance

        sentence_in = preprocess_sequence(sentence, word_to_idx) # turn sentences into tensors of word indices
        targets = preprocess_sequence(tags, tag_to_idx) # now repeat for POS tags per sentence

        predictions = model(sentence_in) # run the forward pass (defined earlier in the POS_Tagger class)

        # Compute the loss, gradients, and update the parameters by calling optimizer.step()
        loss = loss_function(predictions, targets)
        loss.backward()
        optimizer.step()


    model.eval()
    loss = 0
    with torch.no_grad():
        for sentence, tags in training_data:
            sentence_in = preprocess_sequence(sentence, word_to_idx)
            targets = preprocess_sequence(tags, tag_to_idx)
            y_pred = model(sentence_in)
            loss += loss_function(y_pred, targets)
        print("Epoch %d: Loss: %.4f" % (epoch, loss))


Epoch 0: Loss: 7088.1719
Epoch 1: Loss: 5983.7241
Epoch 2: Loss: 5381.6421
Epoch 3: Loss: 4965.8486
Epoch 4: Loss: 4631.2202


### Summary of the model, showing the layers, their dimensions and the number of parameters

In [8]:
print(model)

POS_Tagger(
  (word_embeddings): Embedding(11993, 8)
  (lstm): LSTM(8, 16)
  (hidden2tag): Linear(in_features=16, out_features=46, bias=True)
)


In [9]:
# more detailed summary of the model
# it requires installation of the torchinfo library
# (you can install it by uncommenting the next line)

#!pip install torchinfo

from torchinfo import summary
summary(model)

Layer (type:depth-idx)                   Param #
POS_Tagger                               --
├─Embedding: 1-1                         95,944
├─LSTM: 1-2                              1,664
├─Linear: 1-3                            782
Total params: 98,390
Trainable params: 98,390
Non-trainable params: 0

In [10]:
# even more detailed summary

summary(model, verbose=2, row_settings=["var_names"])

Layer (type (var_name))                  Param #
POS_Tagger (POS_Tagger)                  --
├─Embedding (word_embeddings)            95,944
│    └─weight                            └─95,944
├─LSTM (lstm)                            1,664
│    └─weight_ih_l0                      ├─512
│    └─weight_hh_l0                      ├─1,024
│    └─bias_ih_l0                        ├─64
│    └─bias_hh_l0                        └─64
├─Linear (hidden2tag)                    782
│    └─weight                            ├─736
│    └─bias                              └─46
Total params: 98,390
Trainable params: 98,390
Non-trainable params: 0
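
The parameter counts in the summary can be reproduced by hand. The sketch below takes the sizes from the output above (11,993 words, embedding dimension 8, hidden dimension 16, 46 tags); the factor 4 reflects the four weight blocks of an LSTM layer (input gate, forget gate, cell candidate and output gate).

```python
vocab_size, embedding_dim, hidden_dim, tagset_size = 11993, 8, 16, 46

# Embedding layer: one vector of size embedding_dim per vocabulary word.
embedding_params = vocab_size * embedding_dim

# LSTM layer: four blocks, each with input-to-hidden and hidden-to-hidden
# weights plus two bias vectors.
weight_ih = 4 * hidden_dim * embedding_dim
weight_hh = 4 * hidden_dim * hidden_dim
bias_ih = 4 * hidden_dim
bias_hh = 4 * hidden_dim
lstm_params = weight_ih + weight_hh + bias_ih + bias_hh

# Linear layer: a weight matrix plus one bias per tag.
linear_params = hidden_dim * tagset_size + tagset_size

total_params = embedding_params + lstm_params + linear_params
print(embedding_params, lstm_params, linear_params, total_params)  # 95944 1664 782 98390
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the size of this model.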



### Compare with NLTK POS tagger

We will now see how our LSTM POS tagger and the POS tagger of the NLTK library perform on a few example sentences. Note that the POS tagger of NLTK does not always predict the correct POS tag, so you should not consider its predictions as gold labels.

In the training stage, we used ```tag_to_idx``` to transform POS tags into numerical values. Running our model with the example sentences as input will return values that correspond to POS tags. We should therefore change these values into POS tag names using a dictionary ```idx_to_tag```.

In [11]:
# do *not* try to make changes to the test sentences before reaching the end of the notebook

test_sentences = [
    'London is the capital of England .',
    'Paris is the capital city of France .',
    'They are my best friends .',
    'This was their favorite toy .',
    'I will be reading this book today , tomorrow and the day after .',
    'And now for something completely different',
    'We successfully completed the task',
    'The company had a net loss of $ 2 million .',
    'The discussions are still in preliminary stages , and the specific details have n\'t been worked out',
    'But for small American companies , it also provides a growing source of capital and even marketing help .',
]

In [12]:
# we need a mapping (idx_to_tag) from tag ids to tag names
# this is basically the reverse of tag_to_idx

idx_to_tag = {tag_to_idx[k]:k for k in tag_to_idx}

In [15]:
from nltk import pos_tag

for s in test_sentences:
    sentence_in = preprocess_sequence(s.split(), word_to_idx)
    with torch.no_grad():  # inference only, so no gradients need to be tracked
        output = model(sentence_in)
    _, predicted_tags = output.max(dim = -1)

    preds = []
    for tag in predicted_tags:
        preds.append(idx_to_tag[int(tag)])

    print(s)
    print('LSTM:', preds)
    print('NLTK:', [nltk_tag[1] for nltk_tag in pos_tag(s.split())])
    print()

London is the capital of England .
LSTM: ['NNP', 'VBZ', 'DT', 'NN', 'IN', 'JJ', '.']
NLTK: ['NNP', 'VBZ', 'DT', 'NN', 'IN', 'NNP', '.']

Paris is the capital city of France .
LSTM: ['NNP', 'VBZ', 'DT', 'NN', 'NN', 'IN', 'NNP', '.']
NLTK: ['NNP', 'VBZ', 'DT', 'NN', 'NN', 'IN', 'NNP', '.']

They are my best friends .
LSTM: ['PRP', 'VBP', 'NNS', 'NNP', 'NN', '.']
NLTK: ['PRP', 'VBP', 'PRP$', 'JJS', 'NNS', '.']

This was their favorite toy .
LSTM: ['DT', 'VBD', 'VBN', 'RB', 'NN', '.']
NLTK: ['DT', 'VBD', 'PRP$', 'JJ', 'NN', '.']

I will be reading this book today , tomorrow and the day after .
LSTM: ['PRP', 'MD', 'VB', 'NNS', 'DT', 'NN', 'NN', ',', 'JJ', 'CC', 'DT', 'NN', 'IN', '.']
NLTK: ['PRP', 'MD', 'VB', 'VBG', 'DT', 'NN', 'NN', ',', 'NN', 'CC', 'DT', 'NN', 'IN', '.']

And now for something completely different
LSTM: ['CC', 'RB', 'IN', 'NN', 'JJ', 'NN']
NLTK: ['CC', 'RB', 'IN', 'NN', 'RB', 'JJ']

We successfully completed the task
LSTM: ['PRP', 'NN', 'VBD', 'DT', 'NN']
NLTK: ['PRP', 'R

The results depend on the training parameters and hyperparameters. Remember that NLTK is not always right; for example, in the sentence "Paris is the capital city of France ." both the NLTK tagger and our POS tagger agree that 'capital' is a noun (NN), whereas here it is an adjective (JJ).
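
To put a number on how often the two taggers agree, one could compute the fraction of matching positions. The helper ```agreement``` below is a hypothetical addition, not part of the lab code. Keep in mind that it measures agreement with NLTK, not accuracy, since the NLTK tags are not gold labels.

```python
def agreement(tags_a, tags_b):
    """Fraction of positions at which two equally long tag sequences agree."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must have the same length")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)

# Predictions copied from the output above for 'London is the capital of England .'
lstm_tags = ['NNP', 'VBZ', 'DT', 'NN', 'IN', 'JJ', '.']
nltk_tags = ['NNP', 'VBZ', 'DT', 'NN', 'IN', 'NNP', '.']

print(round(agreement(lstm_tags, nltk_tags), 3))  # 0.857
```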

If you try to add more test sentences, you may get a KeyError. Try to think about why this happens...

Some new test sentences raise a KeyError because they contain words that do not exist in our vocabulary and, as a result, in our ```word_to_idx``` mapping. Our implementation therefore cannot handle such sentences. Can you think of possible ways to solve this problem?
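
One common option, sketched here under assumptions, is to reserve an index for unknown words and fall back to it at lookup time. The token ```<UNK>```, the helper ```to_indices``` and the toy sentences are all hypothetical and not part of the lab code.

```python
UNK = '<UNK>'  # hypothetical placeholder token for words not seen in training

# Toy training sentences standing in for training_data.
toy_sentences = [['London', 'is', 'the', 'capital', '.'],
                 ['Paris', 'is', 'the', 'capital', '.']]

toy_word_to_idx = {UNK: 0}  # reserve index 0 for unknown words
for sent in toy_sentences:
    for word in sent:
        if word not in toy_word_to_idx:
            toy_word_to_idx[word] = len(toy_word_to_idx)

def to_indices(seq, to_idx):
    """Map each word to its index, falling back to the UNK index for unseen words."""
    return [to_idx.get(w, to_idx[UNK]) for w in seq]

print(to_indices('Berlin is the capital .'.split(), toy_word_to_idx))  # [0, 2, 3, 4, 5]
```

The lookup alone removes the KeyError, but the embedding for ```<UNK>``` is only trained if some training words are mapped to it as well, for example by replacing rare words with ```<UNK>``` before training.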

This is also a good opportunity to think about our ```tag_to_idx``` mapping. Did you implement it based on the list of POS tags linked at the beginning, or based on the POS tags that exist in our training set? Are they the same?