# RNN & BiLSTM for PoS Tagging

## Introduction

In this series we'll be building a machine learning model that produces an output for every element in an input sequence, using PyTorch and TorchText. Specifically, we will be inputting a sequence of text and the model will output a part-of-speech (PoS) tag for each token in the input text. This can also be used for named entity recognition (NER), where the output for each token will be what type of entity, if any, the token is.

In this notebook, we'll be implementing a multi-layer bi-directional RNN and LSTM (BiLSTM) to predict PoS tags using the Universal Dependencies English Web Treebank (UDPOS) dataset.

## Preparing Data

First, let's import the necessary Python modules.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext import data  
from torchtext import datasets  

import spacy
import numpy as np

import time
import random

Next, we'll set the random seeds for reproducibility.

In [None]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

One of the key parts of TorchText is the `Field`. The `Field` handles how your dataset is processed.

Our `TEXT` field handles how the text that we need to tag is dealt with. All we do here is set `lower = True` which lowercases all of the text.

Next we'll define the `Fields` for the tags. This dataset actually has two different sets of tags, [universal dependency (UD) tags](https://universaldependencies.org/u/pos/) and [Penn Treebank (PTB) tags](https://www.sketchengine.eu/penn-treebank-tagset/). We'll only train our model on the UD tags, but will load the PTB tags to show how they could be used instead.

`UD_TAGS` defines how the UD tags are processed. Our `TEXT` vocabulary - which we'll build later - will have *unknown* tokens in it, i.e. tokens that are not within our vocabulary. However, we won't have unknown tags as we are dealing with a finite set of possible tags. TorchText `Fields` initialize a default unknown token, `<unk>`, which we remove by setting `unk_token = None`.

`PTB_TAGS` does the same as `UD_TAGS`, but handles the PTB tags instead.

**UD_TAGS (Universal Dependencies tag set)** reference table:

| **UD tag** | **Full name**                | **Examples**              |
|------------|------------------------------|---------------------------|
| `ADJ`      | Adjective                    | "happy", "big"           |
| `ADP`      | Adposition                   | "in", "about"            |
| `ADV`      | Adverb                       | "very", "slowly"         |
| `AUX`      | Auxiliary verb               | "will", "should"         |
| `CCONJ`    | Coordinating conjunction     | "and", "or"              |
| `DET`      | Determiner                   | "this", "those"          |
| `INTJ`     | Interjection                 | "oh", "wow"              |
| `NOUN`     | Noun                         | "apple", "student"       |
| `NUM`      | Numeral                      | "three", "100"           |
| `PART`     | Particle                     | "'s" (possessive), "to"  |
| `PRON`     | Pronoun                      | "I", "they"              |
| `PROPN`    | Proper noun                  | "London", "John"         |
| `PUNCT`    | Punctuation                  | ",", "."                 |
| `SCONJ`    | Subordinating conjunction    | "if", "because"          |
| `SYM`      | Symbol                       | "$", "%"                 |
| `VERB`     | Verb                         | "eat", "study"           |
| `X`        | Other                        | Unclassified words       |

In [None]:
TEXT = data.Field(lower = True)
UD_TAGS = data.Field(unk_token = None)
PTB_TAGS = data.Field(unk_token = None)

We then define `fields`, which handles passing our fields to the dataset.

Note that order matters. If you only wanted to load the PTB tags, your `fields` would be:

```
fields = (("text", TEXT), (None, None), ("ptbtags", PTB_TAGS))
```

Here, `None` tells TorchText not to load those tags.

In [None]:
fields = (("text", TEXT), ("udtags", UD_TAGS), ("ptbtags", PTB_TAGS))

Next, we load the UDPOS dataset using our defined fields.

In [None]:
train_data, valid_data, test_data = datasets.UDPOS.splits(fields)

We can check how many examples are in each section of the dataset by checking their length.

In [None]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Let's print out an example:

In [None]:
print(vars(train_data.examples[0]))

We can also view the text and tags separately:

In [None]:
print(vars(train_data.examples[0])['text'])

In [None]:
print(vars(train_data.examples[0])['udtags'])

In [None]:
print(vars(train_data.examples[0])['ptbtags'])

Next, we'll build the vocabulary - a mapping of tokens to integers. 

We want some unknown tokens within our dataset in order to replicate how this model would be used in real life, so we set `min_freq` to 2, which means only tokens that appear at least twice in the training set will be added to the vocabulary and the rest will be replaced by `<unk>` tokens.

We also load the [GloVe](https://nlp.stanford.edu/projects/glove/) pre-trained token embeddings, specifically the 100-dimensional embeddings that have been trained on 6 billion tokens. Using pre-trained embeddings usually leads to improved performance - although admittedly the dataset used in this tutorial is too small to take advantage of the pre-trained embeddings.

`unk_init` is used to initialize the token embeddings which are not in the pre-trained embedding vocabulary. By default this sets those embeddings to zeros, however it is better to not have them all initialized to the same value, so we initialize them from a Normal/Gaussian distribution.

These pre-trained vectors are now loaded into our vocabulary and we will initialize our model with these values later.

In [None]:
MIN_FREQ = 2

TEXT.build_vocab(train_data, 
                 min_freq = MIN_FREQ,
                 vectors = "glove.6B.100d",
                 unk_init = torch.Tensor.normal_)


UD_TAGS.build_vocab(train_data)
PTB_TAGS.build_vocab(train_data)
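
To make the effect of `min_freq` concrete, here is a minimal plain-Python sketch of frequency-filtered vocabulary building. It does not use TorchText; the `build_vocab` helper, the toy corpus, and the tie-breaking order are assumptions made purely for illustration.

```python
from collections import Counter

def build_vocab(sentences, min_freq=2, specials=("<unk>", "<pad>")):
    """Map every token seen at least `min_freq` times to an integer id."""
    counts = Counter(token for sentence in sentences for token in sentence)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically so the result is deterministic.
    for token, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_freq:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos

# Hypothetical toy corpus: "dog", "a" and "ran" each appear only once.
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
stoi, itos = build_vocab(toy_corpus, min_freq=2)

print(itos)           # tokens that made it into the vocabulary
print("dog" in stoi)  # rare tokens are left out and would map to <unk>
```

Tokens that fall below the threshold never receive their own index, which is why the model sees them as `<unk>` at both training and inference time.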

We can check how many tokens and tags are in our vocabulary by getting their length:

In [None]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in UD_TAG vocabulary: {len(UD_TAGS.vocab)}")
print(f"Unique tokens in PTB_TAG vocabulary: {len(PTB_TAGS.vocab)}")

Exploring the vocabulary, we can check the most common tokens within our texts:

In [None]:
print(TEXT.vocab.freqs.most_common(20))

We can see the vocabularies for both of our tags:

In [None]:
print(UD_TAGS.vocab.itos)

In [None]:
print(PTB_TAGS.vocab.itos)

We can also see how many of each tag are in our vocabulary:

In [None]:
print(UD_TAGS.vocab.freqs.most_common())

In [None]:
print(PTB_TAGS.vocab.freqs.most_common())

We can also view how common each of the tags is within the training set:

In [None]:
def tag_percentage(tag_counts):
    
    total_count = sum([count for tag, count in tag_counts])
    
    tag_counts_percentages = [(tag, count, count/total_count) for tag, count in tag_counts]
        
    return tag_counts_percentages

In [None]:
print("Tag\t\tCount\t\tPercentage\n")

for tag, count, percent in tag_percentage(UD_TAGS.vocab.freqs.most_common()):
    print(f"{tag}\t\t{count}\t\t{percent*100:4.1f}%")

In [None]:
print("Tag\t\tCount\t\tPercentage\n")

for tag, count, percent in tag_percentage(PTB_TAGS.vocab.freqs.most_common()):
    print(f"{tag}\t\t{count}\t\t{percent*100:4.1f}%")

The final part of data preparation is handling the iterator. 

This will be iterated over to return batches of data to process. Here, we set the batch size and the `device` - which is used to place the batches of tensors on our GPU, if we have one. 

In [None]:
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)
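
`BucketIterator` tries to batch examples of similar length together so that less padding is needed. The sketch below illustrates that idea in plain Python under simplifying assumptions: the helper names are hypothetical, and the real iterator also shuffles and works on numericalized tensors.

```python
def pad_batch(batch, pad_token="<pad>"):
    """Pad every sentence in the batch to the length of the longest one."""
    longest = max(len(sentence) for sentence in batch)
    return [sentence + [pad_token] * (longest - len(sentence)) for sentence in batch]

def make_batches(sentences, batch_size, sort_by_length):
    """Split sentences into padded batches, optionally sorting by length first."""
    ordered = sorted(sentences, key=len) if sort_by_length else list(sentences)
    return [pad_batch(ordered[i:i + batch_size])
            for i in range(0, len(ordered), batch_size)]

def count_pads(batches, pad_token="<pad>"):
    """Count how many padding tokens were added across all batches."""
    return sum(token == pad_token
               for batch in batches for sentence in batch for token in sentence)

# Hypothetical sentences of length 1, 5, 2 and 4.
sentences = [["hi"],
             ["the", "cat", "sat", "on", "mats"],
             ["go", "home"],
             ["we", "like", "green", "tea"]]

plain = make_batches(sentences, batch_size=2, sort_by_length=False)
bucketed = make_batches(sentences, batch_size=2, sort_by_length=True)

print(count_pads(plain), count_pads(bucketed))
```

Grouping by length reduces the number of `<pad>` positions the model has to process, which matters more as the spread of sentence lengths grows.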

## Building the Model

### **CustomRNN**

In [None]:
# Simple RNN model for POS tagging
class CustomRNNPOSTagger(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, pad_idx):
        super().__init__()
        
        # Embedding layer to convert word indices to dense vectors
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=pad_idx)
        
        # Custom RNN cell implementation
        self.rnn_cell = nn.Linear(embedding_dim + hidden_dim, hidden_dim)
        self.hidden_dim = hidden_dim
        
        # Output layer to predict POS tags
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):
        # text = [sent len, batch size]
        
        # Convert words to embeddings
        embedded = self.embedding(text)
        # embedded = [sent len, batch size, emb dim]
        
        # Initialize hidden state
        batch_size = text.shape[1]
        hidden = torch.zeros(batch_size, self.hidden_dim).to(text.device)
        
        # List to store outputs at each time step
        outputs = []
        
        # Process sequence one token at a time
        for t in range(embedded.shape[0]):
            # Get current token embeddings
            x_t = embedded[t]  # [batch size, emb dim]
            
            # Concatenate with previous hidden state
            combined = torch.cat((x_t, hidden), dim=1)  # [batch size, emb dim + hid dim]
            
            # Apply RNN cell
            hidden = torch.tanh(self.rnn_cell(combined))  # [batch size, hid dim]
            
            # Store output
            outputs.append(hidden)
        
        # Stack outputs
        outputs = torch.stack(outputs)  # [sent len, batch size, hid dim]
        
        # Pass through linear layer to get predictions
        predictions = self.fc(outputs)  # [sent len, batch size, output dim]
        
        return predictions

### **CustomBiRNN**

In [None]:
# Bidirectional RNN model for POS tagging
class CustomBiRNNPOSTagger(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, pad_idx):
        super().__init__()
        
        # Embedding layer
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=pad_idx)
        
        # Forward and backward RNN cells
        self.forward_rnn_cell = nn.Linear(embedding_dim + hidden_dim, hidden_dim)
        self.backward_rnn_cell = nn.Linear(embedding_dim + hidden_dim, hidden_dim)
        self.hidden_dim = hidden_dim
        
        # Output layer (note: input is 2*hidden_dim due to bidirectionality)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
    def forward(self, text):
        # text = [sent len, batch size]
        
        # Convert words to embeddings
        embedded = self.embedding(text)
        # embedded = [sent len, batch size, emb dim]
        
        batch_size = text.shape[1]
        seq_length = text.shape[0]
        
        # Initialize hidden states for forward and backward passes
        forward_hidden = torch.zeros(batch_size, self.hidden_dim).to(text.device)
        backward_hidden = torch.zeros(batch_size, self.hidden_dim).to(text.device)
        
        # Lists to store outputs
        forward_outputs = []
        backward_outputs = []
        
        # Forward pass
        for t in range(seq_length):
            x_t = embedded[t]
            combined = torch.cat((x_t, forward_hidden), dim=1)
            forward_hidden = torch.tanh(self.forward_rnn_cell(combined))
            forward_outputs.append(forward_hidden)
        
        # Backward pass
        for t in range(seq_length-1, -1, -1):
            x_t = embedded[t]
            combined = torch.cat((x_t, backward_hidden), dim=1)
            backward_hidden = torch.tanh(self.backward_rnn_cell(combined))
            backward_outputs.insert(0, backward_hidden)
        
        # Stack outputs
        forward_outputs = torch.stack(forward_outputs)  # [sent len, batch size, hid dim]
        backward_outputs = torch.stack(backward_outputs)  # [sent len, batch size, hid dim]
        
        # Concatenate forward and backward outputs
        outputs = torch.cat((forward_outputs, backward_outputs), dim=2)  # [sent len, batch size, hid dim * 2]
        
        # Pass through linear layer to get predictions
        predictions = self.fc(outputs)  # [sent len, batch size, output dim]
        
        return predictions

### **BiRNN**

In [None]:
# RNN model for POS tagging using PyTorch built-in RNN
class RNNPOSTagger(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim, 
                 n_layers,
                 bidirectional,
                 dropout,
                 pad_idx):
        super().__init__()
        
        # Embedding layer to convert word indices to dense vectors
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=pad_idx)
        
        # Built-in RNN layer with added parameters
        self.rnn = nn.RNN(embedding_dim, 
                          hidden_dim, 
                          num_layers=n_layers, 
                          bidirectional=bidirectional,
                          dropout=dropout if n_layers > 1 else 0)
        
        # Output layer to predict POS tags
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        # Dropout layer
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        # text = [sent len, batch size]
        
        # Convert words to embeddings and apply dropout
        embedded = self.dropout(self.embedding(text))
        # embedded = [sent len, batch size, emb dim]
        
        # Pass through RNN
        outputs, hidden = self.rnn(embedded)
        # outputs = [sent len, batch size, hid dim * n directions]
        # hidden = [n layers * n directions, batch size, hid dim]
        
        # Apply dropout to outputs
        outputs = self.dropout(outputs)
        
        # Pass through linear layer to get predictions
        predictions = self.fc(outputs)
        # predictions = [sent len, batch size, output dim]
        
        return predictions

### **BiLSTM**

Next up, we define our model - a multi-layer bi-directional LSTM. The image below shows a simplified version of the model with only one LSTM layer and omitting the LSTM's cell state for clarity.

![](../figs/pos-bidirectional-lstm.png)

The model takes in a sequence of tokens, $X = \{x_1, x_2,...,x_T\}$, passes them through an embedding layer, $e$, to get the token embeddings, $e(X) = \{e(x_1), e(x_2), ..., e(x_T)\}$.

These embeddings are processed - one per time-step - by the forward and backward LSTMs. The forward LSTM processes the sequence from left-to-right, whilst the backward LSTM processes the sequence right-to-left, i.e. the first input to the forward LSTM is $x_1$ and the first input to the backward LSTM is $x_T$. 

The LSTMs also take in the hidden, $h$, and cell, $c$, states from the previous time-step:

$$h^{\rightarrow}_t = \text{LSTM}^{\rightarrow}(e(x^{\rightarrow}_t), h^{\rightarrow}_{t-1}, c^{\rightarrow}_{t-1})$$
$$h^{\leftarrow}_t=\text{LSTM}^{\leftarrow}(e(x^{\leftarrow}_t), h^{\leftarrow}_{t-1}, c^{\leftarrow}_{t-1})$$

In a multi-layer LSTM, the hidden states produced by one layer at each time-step are used as the inputs to the next layer.

The initial hidden and cell states, $h_0$ and $c_0$, for each direction and layer are initialized to a tensor full of zeros.

We then concatenate both the forward and backward hidden states from the final layer of the LSTM, $H = \{h_1, h_2, ... h_T\}$, where $h_1 = [h^{\rightarrow}_1;h^{\leftarrow}_T]$, $h_2 = [h^{\rightarrow}_2;h^{\leftarrow}_{T-1}]$, etc. and pass them through a linear layer, $f$, which is used to make the prediction of which tag applies to this token, $\hat{y}_t = f(h_t)$.
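
The indexing in this concatenation can be confusing, so here is a small symbolic sketch in plain Python. Instead of real hidden vectors, each "state" is simply the tuple of tokens that direction has consumed so far; `run_direction` is a hypothetical helper, not part of the model.

```python
def run_direction(tokens, reverse=False):
    """For each position, record which tokens a one-directional RNN has seen so far."""
    order = range(len(tokens) - 1, -1, -1) if reverse else range(len(tokens))
    seen = []
    states = [None] * len(tokens)
    for t in order:
        seen = seen + [tokens[t]]
        states[t] = tuple(seen)  # stored at the position of the token just read
    return states

tokens = ["the", "cat", "sat"]
forward = run_direction(tokens)
backward = run_direction(tokens, reverse=True)

# Pair the two directions position by position, as the model's concatenation does.
combined = list(zip(forward, backward))

for token, (fwd, bwd) in zip(tokens, combined):
    print(f"{token}: forward saw {fwd}, backward saw {bwd}")
```

At the first position the backward direction has already read the whole sentence, which is why $h_1$ pairs $h^{\rightarrow}_1$ with $h^{\leftarrow}_T$: the backward state there is the one reached after $T$ backward steps.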

When training the model, we will compare our predicted tags, $\hat{Y}$ against the actual tags, $Y$, to calculate a loss, the gradients w.r.t. that loss, and then update our parameters.

We implement the model detailed above in the `BiLSTMPOSTagger` class.

`nn.Embedding` is an embedding layer and the input dimension should be the size of the input (text) vocabulary. We tell it what the index of the padding token is so it does not update the padding token's embedding entry.

`nn.LSTM` is the LSTM. We apply dropout as regularization between the layers, if we are using more than one.

`nn.Linear` defines the linear layer to make predictions using the LSTM outputs. We double the size of the input if we are using a bi-directional LSTM. The output dimensions should be the size of the tag vocabulary.

We also define a dropout layer with `nn.Dropout`, which we use in the `forward` method to apply dropout to the embeddings and the outputs of the final layer of the LSTM.

In [None]:
class BiLSTMPOSTagger(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim, 
                 n_layers, 
                 bidirectional, 
                 dropout, 
                 pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        
        self.lstm = nn.LSTM(embedding_dim, 
                            hidden_dim, 
                            num_layers = n_layers, 
                            bidirectional = bidirectional,
                            dropout = dropout if n_layers > 1 else 0)
        
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):

        #text = [sent len, batch size]
        
        #pass text through embedding layer
        embedded = self.dropout(self.embedding(text))
        
        #embedded = [sent len, batch size, emb dim]
        
        #pass embeddings into LSTM
        outputs, (hidden, cell) = self.lstm(embedded)
        
        #outputs holds the backward and forward hidden states in the final layer
        #hidden and cell are the backward and forward hidden and cell states at the final time-step
        
        #output = [sent len, batch size, hid dim * n directions]
        #hidden/cell = [n layers * n directions, batch size, hid dim]
        
        #we use our outputs to make a prediction of what the tag should be
        predictions = self.fc(self.dropout(outputs))
        
        #predictions = [sent len, batch size, output dim]
        
        return predictions

## Training the Model

Next, we instantiate the model. We need to ensure the embedding dimension matches that of the GloVe embeddings we loaded earlier.

The rest of the hyperparameters have been chosen as sensible defaults, though there may be a combination that performs better on this model and dataset.

The input and output dimensions are taken directly from the lengths of the respective vocabularies. The padding index is obtained using the vocabulary and the `Field` of the text.

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = len(UD_TAGS.vocab)
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

# model = BiLSTMPOSTagger(INPUT_DIM, 
#                         EMBEDDING_DIM, 
#                         HIDDEN_DIM, 
#                         OUTPUT_DIM, 
#                         N_LAYERS, 
#                         BIDIRECTIONAL, 
#                         DROPOUT, 
#                         PAD_IDX)
model = RNNPOSTagger(INPUT_DIM, 
                    EMBEDDING_DIM, 
                    HIDDEN_DIM, 
                    OUTPUT_DIM, 
                    N_LAYERS, 
                    BIDIRECTIONAL, 
                    DROPOUT, 
                    PAD_IDX)
# model = CustomRNNPOSTagger(INPUT_DIM,
#                      EMBEDDING_DIM,
#                      HIDDEN_DIM,
#                      OUTPUT_DIM,
#                      PAD_IDX)
# model = CustomBiRNNPOSTagger(INPUT_DIM,
#                      EMBEDDING_DIM,
#                      HIDDEN_DIM,
#                      OUTPUT_DIM,
#                      PAD_IDX)

We initialize the weights from a simple Normal distribution. Again, there may be a better initialization scheme for this model and dataset.

In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean = 0, std = 0.1)
        
model.apply(init_weights)

Next, a small function to tell us how many parameters are in our model. Useful for comparing different models.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
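
As a sanity check on the reported count, the size of the recurrent part can be worked out by hand. The sketch below assumes PyTorch's parameter layout for `nn.RNN` and `nn.LSTM` (an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors per layer and direction, with four gates for the LSTM); `recurrent_params` is a hypothetical helper, and the embedding and linear layers are not included.

```python
def recurrent_params(input_dim, hidden_dim, n_layers, bidirectional, gates=1):
    """Count the parameters in a stacked RNN (gates=1) or LSTM (gates=4)."""
    n_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers after the first receive the concatenated outputs of all directions.
        layer_input = input_dim if layer == 0 else hidden_dim * n_directions
        # W_ih + W_hh + b_ih + b_hh for a single direction.
        per_direction = gates * hidden_dim * (layer_input + hidden_dim + 2)
        total += per_direction * n_directions
    return total

# Hyperparameters used in this notebook: 100-d embeddings, 128-d hidden state,
# 2 layers, bidirectional.
rnn_count = recurrent_params(100, 128, n_layers=2, bidirectional=True, gates=1)
lstm_count = recurrent_params(100, 128, n_layers=2, bidirectional=True, gates=4)

print(f"RNN:  {rnn_count:,}")
print(f"LSTM: {lstm_count:,}")
```

Under these assumptions the LSTM has four times as many recurrent parameters as the plain RNN, which is worth keeping in mind when comparing the models' accuracy.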

We'll now initialize our model's embedding layer with the pre-trained embedding values we loaded earlier.

This is done by getting them from the vocab's `.vectors` attribute and then performing a `.copy_` to overwrite the embedding layer's current weights.

In [None]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

In [None]:
model.embedding.weight.data.copy_(pretrained_embeddings)

It's common to initialize the embedding of the pad token to all zeros. This, along with setting the `padding_idx` in the model's embedding layer, means that the embedding should always output a tensor full of zeros when a pad token is input.

In [None]:
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

We then define our optimizer, used to update our parameters w.r.t. their gradients. We use Adam with the default learning rate.

In [None]:
optimizer = optim.Adam(model.parameters())

Next, we define our loss function, cross-entropy loss.

Even though we have no `<unk>` tokens within our tag vocab, we still have `<pad>` tokens. This is because all sentences within a batch need to be the same size. However, we don't want to calculate the loss when the target is a `<pad>` token as we aren't training our model to recognize padding tokens.

We handle this by setting the `ignore_index` in our loss function to the index of the padding token in our tag vocabulary.

In [None]:
TAG_PAD_IDX = UD_TAGS.vocab.stoi[UD_TAGS.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)
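
To show what ignoring an index means in practice, here is a plain-Python sketch of cross-entropy averaged only over non-ignored targets. It is a conceptual illustration rather than PyTorch's implementation, and the logits and targets are made-up values.

```python
import math

def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over targets that are not `ignore_index`."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded positions contribute nothing to the loss
        log_normalizer = math.log(sum(math.exp(value) for value in row))
        losses.append(log_normalizer - row[target])
    return sum(losses) / len(losses)

# Three positions, three classes; class 0 plays the role of <pad>.
logits = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [9.0, 0.0, 0.0]]
targets = [1, 2, 0]  # the last position is padding

loss = cross_entropy(logits, targets, ignore_index=0)
print(round(loss, 4))
```

The padded position is skipped entirely, so whatever the model outputs there has no effect on the loss, and the average is taken over the two real tokens only.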

We then place our model and loss function on our GPU, if we have one.

In [None]:
model = model.to(device)
criterion = criterion.to(device)

We will be using the loss value between our predicted and actual tags to train the network, but ideally we'd like a more interpretable way to see how well our model is doing - accuracy.

The issue is that we don't want to calculate accuracy over the `<pad>` tokens as we aren't interested in predicting them.

The function below only calculates accuracy over non-padded tokens. `non_pad_elements` is a tensor containing the indices of the non-pad tokens within an input batch. We then compare the predictions of those elements with the labels to get a count of how many predictions were correct. We then divide this by the number of non-pad elements to get our accuracy value over the batch.

In [None]:
def categorical_accuracy(preds, y, tag_pad_idx):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    max_preds = preds.argmax(dim = 1, keepdim = True) # get the index of the highest-scoring tag
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
    # cast to float so the division is a true division regardless of PyTorch version
    return correct.sum().float() / y[non_pad_elements].shape[0]
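
A small worked example may help check the logic. The following plain-Python sketch mirrors the same calculation on lists rather than tensors; `masked_accuracy` and the scores are hypothetical and only meant to illustrate the masking.

```python
def masked_accuracy(scores, tags, tag_pad_idx):
    """Fraction of non-pad positions whose highest-scoring tag matches the label."""
    correct = 0
    total = 0
    for row, tag in zip(scores, tags):
        if tag == tag_pad_idx:
            continue  # skip padded positions entirely
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over tags
        correct += int(predicted == tag)
        total += 1
    return correct / total

# Four positions, three tags; tag 0 plays the role of <pad>.
scores = [[0.1, 0.8, 0.1],
          [0.2, 0.1, 0.7],
          [0.9, 0.05, 0.05],
          [0.1, 0.2, 0.7]]
tags = [1, 1, 0, 2]

accuracy = masked_accuracy(scores, tags, tag_pad_idx=0)
print(f"{accuracy:.3f}")
```

Two of the three real tokens are tagged correctly. Had the padded position been counted, the figure would have been inflated by a "correct" pad prediction.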

Next is the function that handles training our model.

We first set the model to `train` mode to turn on dropout/batch-norm/etc. (if used). Then we iterate over our iterator, which returns a batch of examples. 

For each batch: 
- we zero the gradients over the parameters from the last gradient calculation
- insert the batch of text into the model to get predictions
- flatten the predictions and tags, as `nn.CrossEntropyLoss` expects the class dimension to come second and our predictions are `[sent len, batch size, output dim]`
- calculate the loss and accuracy between the predicted tags and actual tags
- call `backward` to calculate the gradients of the parameters w.r.t. the loss
- take an optimizer `step` to update the parameters
- add to the running total of loss and accuracy

In [None]:
def train(model, iterator, optimizer, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        text = batch.text
        tags = batch.udtags
        
        optimizer.zero_grad()
        
        #text = [sent len, batch size]
        
        predictions = model(text)
        
        #predictions = [sent len, batch size, output dim]
        #tags = [sent len, batch size]
        
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)
        
        #predictions = [sent len * batch size, output dim]
        #tags = [sent len * batch size]
        
        loss = criterion(predictions, tags)
                
        acc = categorical_accuracy(predictions, tags, tag_pad_idx)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
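
The reshaping step relies on predictions and tags being flattened in the same order. The sketch below uses nested Python lists in place of tensors to show that collapsing the first two dimensions row by row keeps each prediction aligned with its tag; the helper name and placeholder values are assumptions for illustration.

```python
def flatten_first_two_dims(nested):
    """[sent len][batch size] -> [sent len * batch size], in row-major order."""
    return [item for time_step in nested for item in time_step]

sent_len, batch_size = 3, 2

# Placeholders that remember which (time-step, batch element) they came from.
predictions = [[("scores", t, b) for b in range(batch_size)] for t in range(sent_len)]
tags = [[("tag", t, b) for b in range(batch_size)] for t in range(sent_len)]

flat_predictions = flatten_first_two_dims(predictions)
flat_tags = flatten_first_two_dims(tags)

for row, (pred, tag) in enumerate(zip(flat_predictions, flat_tags)):
    print(row, pred, tag)
```

Row `t * batch_size + b` of the flattened predictions always corresponds to the same token as row `t * batch_size + b` of the flattened tags, so the loss compares like with like.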

The `evaluate` function is similar to the `train` function, except with changes made so we don't update the model's parameters.

`model.eval()` is used to put the model in evaluation mode, so dropout/batch-norm/etc. are turned off. 

The iteration loop is also wrapped in `torch.no_grad` to ensure we don't calculate any gradients. We also don't need to call `optimizer.zero_grad()` and `optimizer.step()`.

In [None]:
def evaluate(model, iterator, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text = batch.text
            tags = batch.udtags
            
            predictions = model(text)
            
            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)
            
            loss = criterion(predictions, tags)
            
            acc = categorical_accuracy(predictions, tags, tag_pad_idx)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Next, we have a small function that tells us how long an epoch takes.

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, we train our model!

After each epoch we check if our model has achieved the best validation loss so far. If it has then we save the parameters of this model and we will use these "best" parameters to calculate performance over our test set.

In [None]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion, TAG_PAD_IDX)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, TAG_PAD_IDX)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
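
The checkpointing logic can be seen in isolation with a toy loop. In this sketch the validation losses are made-up numbers and a dictionary stands in for the model's parameters; taking a copy plays the role that `torch.save` plays above.

```python
import copy

# Made-up validation losses for five epochs: the best one is at the third epoch.
valid_losses = [0.90, 0.62, 0.55, 0.58, 0.61]

params = {"w": 0.0}  # stand-in for the model's parameters
best_valid_loss = float("inf")
best_params = None

for epoch, valid_loss in enumerate(valid_losses):
    params["w"] += 1.0  # stand-in for one epoch of training, which mutates the parameters
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        # Snapshot the parameters, since later epochs keep changing them in place.
        best_params = copy.deepcopy(params)

print(best_valid_loss, best_params, params)
```

The snapshot holds the parameters from the epoch with the lowest validation loss, not the final ones, which is why the saved parameters are reloaded before evaluating on the test set.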

We then load our "best" parameters and evaluate performance on the test set.

In [None]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion, TAG_PAD_IDX)

print(f'Test Loss: {test_loss:.3f} |  Test Acc: {test_acc*100:.2f}%')

## Inference

88% accuracy looks pretty good, but let's see our model tag some actual sentences.

We define a `tag_sentence` function that will:
- put the model into evaluation mode
- tokenize the sentence with spaCy if it is a string rather than a list of tokens
- lowercase the tokens if the `Field` did
- numericalize the tokens using the vocabulary
- find out which tokens are not in the vocabulary, i.e. are `<unk>` tokens
- convert the numericalized tokens into a tensor and add a batch dimension
- feed the tensor into the model
- get the predictions over the sentence
- convert the predictions into readable tags

As well as returning the tokens and tags, it also returns which tokens were `<unk>` tokens.

In [None]:
def tag_sentence(model, device, sentence, text_field, tag_field):
    
    model.eval()
    
    if isinstance(sentence, str):
        nlp = spacy.load('en_core_web_sm')
        tokens = [token.text for token in nlp(sentence)]
    else:
        tokens = [token for token in sentence]

    if text_field.lower:
        tokens = [t.lower() for t in tokens]
        
    numericalized_tokens = [text_field.vocab.stoi[t] for t in tokens]

    unk_idx = text_field.vocab.stoi[text_field.unk_token]
    
    unks = [t for t, n in zip(tokens, numericalized_tokens) if n == unk_idx]
    
    token_tensor = torch.LongTensor(numericalized_tokens)
    
    token_tensor = token_tensor.unsqueeze(-1).to(device)
         
    predictions = model(token_tensor)
    
    top_predictions = predictions.argmax(-1)
    
    predicted_tags = [tag_field.vocab.itos[t.item()] for t in top_predictions]
    
    return tokens, predicted_tags, unks
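
The numericalization and unknown-token steps can be tried without a trained model. The sketch below uses an ordinary dictionary as a hypothetical vocabulary and falls back to the `<unk>` index for missing tokens; the `numericalize` helper is an assumption for illustration, not a TorchText function.

```python
def numericalize(tokens, stoi, unk_token="<unk>", lower=True):
    """Convert tokens to ids and report which tokens fell back to <unk>."""
    if lower:
        tokens = [token.lower() for token in tokens]
    unk_idx = stoi[unk_token]
    ids = [stoi.get(token, unk_idx) for token in tokens]  # unseen tokens map to <unk>
    unks = [token for token, index in zip(tokens, ids) if index == unk_idx]
    return ids, unks

# Hypothetical five-entry vocabulary.
toy_stoi = {"<unk>": 0, "<pad>": 1, "the": 2, "queen": 3, "will": 4}

ids, unks = numericalize(["The", "Queen", "will", "abdicate"], toy_stoi)
print(ids)
print(unks)
```

Lowercasing before the lookup matters here: without it, "The" and "Queen" would also be treated as unknown, because the vocabulary was built from lowercased text.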

We'll get an already tokenized example from the training set and test our model's performance.

In [None]:
example_index = 1

sentence = vars(train_data.examples[example_index])['text']
actual_tags = vars(train_data.examples[example_index])['udtags']

print(sentence)

We can then use our `tag_sentence` function to get the tags. Notice how the tokens referring to the subject of the sentence, the "respected cleric", are both `<unk>` tokens!

In [None]:
tokens, pred_tags, unks = tag_sentence(model, 
                                       device, 
                                       sentence, 
                                       TEXT, 
                                       UD_TAGS)

print(unks)

We can then check how well it did. Surprisingly, it got every token correct, including the two that were unknown tokens!

In [None]:
print("Pred. Tag\tActual Tag\tCorrect?\tToken\n")

for token, pred_tag, actual_tag in zip(tokens, pred_tags, actual_tags):
    correct = '✔' if pred_tag == actual_tag else '✘'
    print(f"{pred_tag}\t\t{actual_tag}\t\t{correct}\t\t{token}")

Let's now make up our own sentence and see how well the model does.

Our example sentence below has every token within the model's vocabulary.

In [None]:
sentence = 'The Queen will deliver a speech about the conflict in North Korea at 1pm tomorrow.'

tokens, tags, unks = tag_sentence(model, 
                                  device, 
                                  sentence, 
                                  TEXT, 
                                  UD_TAGS)

print(unks)

Looking at the sentence it seems like it gave sensible tags to every token!

In [None]:
print("Pred. Tag\tToken\n")

for token, tag in zip(tokens, tags):
    print(f"{tag}\t\t{token}")