<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preparing-data-for-use-as-NN-input" data-toc-modified-id="Preparing-data-for-use-as-NN-input-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparing data for use as NN input</a></span></li><li><span><a href="#Letting-the-NN-parameterize-words" data-toc-modified-id="Letting-the-NN-parameterize-words-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Letting the NN parameterize words</a></span></li><li><span><a href="#Adding-an-LSTM-layer" data-toc-modified-id="Adding-an-LSTM-layer-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Adding an LSTM layer</a></span></li><li><span><a href="#Classifiying-the-LSTM-output" data-toc-modified-id="Classifiying-the-LSTM-output-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Classifying the LSTM output</a></span></li><li><span><a href="#Creating-training-and-validation-datasets" data-toc-modified-id="Creating-training-and-validation-datasets-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Creating training and validation datasets</a></span></li><li><span><a href="#Creating-the-Parts-of-Speech-LSTM-model" data-toc-modified-id="Creating-the-Parts-of-Speech-LSTM-model-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Creating the Parts of Speech LSTM model</a></span></li><li><span><a href="#Training" data-toc-modified-id="Training-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Examining-results" data-toc-modified-id="Examining-results-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Examining results</a></span></li><li><span><a href="#Using-the-model-for-inference" data-toc-modified-id="Using-the-model-for-inference-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Using the model for inference</a></span></li></ul></div>

# Predicting parts of speech with an LSTM

Let's preview the end result. We want to take a sentence and output the part of speech for each word in that sentence. Something like this:

**Code**

```python
new_sentence = "I is a teeth"

...

predictions = model(processed_sentence)

...
```

**Output**

```text
I     => Noun
is    => Verb
a     => Determiner
teeth => Noun
```

In [1]:
def ps(s):
    """Process String: convert a string into a list of characters, dropping spaces."""
    line = s.strip().replace(" ", "")
    return list(line)

In [2]:
from pathlib import Path
import re

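# Read questions and answers from file
dataset_filename = Path("../train_data/arithmetic__mixed.txt")

# questions = [ ["1", "+", "3"], ... ]  (one token list per question)
questions = []

# answers = [ 4, ... ]  (one evaluated answer per question)
answers = []

with open(dataset_filename) as dataset_file:
    # Grab a subset of the file: the first 100 question/answer line pairs
    for i in range(100):
        line_q = dataset_file.readline().strip()
        line_a = dataset_file.readline().strip()

        # Split on operators, parentheses, and whitespace, keeping the operators.
        # Note: inside [...] the sequence +-/ is a range, so "," and "." are split out too.
        questions.append([word.strip() for word in re.split(r'([+-/*()]|\s+)', line_q) if word.strip()])
        # eval evaluates the answer line as a Python expression; only use it on trusted files
        answers.append(eval(line_a))

# Pair each question with its answer
dataset = list(zip(questions, answers))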
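
To see what the split pattern in the cell above produces, here is a minimal sketch that wraps the same regular expression in a helper. The name `tokenize` is hypothetical and not part of this notebook. It keeps multi-digit numbers such as `15` together, whereas `ps` splits a string into single characters.

```python
import re

# Same pattern as the loading cell. Inside the character class, `+-/` is a
# range covering the characters + , - . / so periods and commas are split too.
TOKEN_PATTERN = re.compile(r'([+-/*()]|\s+)')


def tokenize(text):
    """Hypothetical helper: split text into operator, number, and word tokens."""
    return [tok.strip() for tok in TOKEN_PATTERN.split(text) if tok.strip()]


print(tokenize("15 + (7 + -17)/12"))
print(tokenize("3.5"))
```

The first call yields ten tokens, while `ps` applied to the same string yields thirteen characters, so the two tokenizations only agree when every number is a single digit.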

In [5]:
import torch

from fastprogress.fastprogress import progress_bar, master_bar

from random import shuffle

## Preparing data for use as NN input

We can't pass a list of plain-text words and tags to a NN. We need to convert them to a numeric format.

We'll start by creating a unique index for each word.

In [6]:
word_to_index = {}
total_words = 0

for question, _ in dataset:

    total_words += len(question)

    for word in question:
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)

In [7]:
print("       Vocabulary Indices")
print("-------------------------------")

for word in sorted(word_to_index):
    print(f"{word:>14} => {word_to_index[word]:>2}")

print("\nTotal number of words:", total_words)
print("Number of unique words:", len(word_to_index))

       Vocabulary Indices
-------------------------------
             ( =>  2
             ) =>  6
             * => 25
             + =>  1
             - =>  4
             . => 18
             / =>  7
             0 => 48
             1 =>  8
            10 =>  9
           100 => 46
          1008 => 95
          1017 => 130
           102 => 96
           104 => 68
           105 => 75
          108? => 136
            11 => 83
           114 => 56
           118 => 121
            12 => 14
           120 => 116
           122 => 60
           126 => 135
          1271 => 27
            13 => 90
           130 => 118
           133 => 82
           135 => 142
          1368 => 109
            14 => 11
           140 => 98
           142 => 58
           145 => 139
            15 =>  0
            16 => 20
           160 => 79
         16095 => 141
           164 => 89
           168 => 132
            17 =>  5
           174 => 23
            18 => 40
           180 => 43
       

## Letting the NN parameterize words

Once we have a unique identifier for each word, it is useful to start our NN with an [embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding) layer. This layer converts an index into a vector of values.

You can think of each value as indicating something about the word. For example, maybe the first value indicates how much a word conveys happiness vs sadness. Of course, the NN can learn any attributes; it is not limited to things like happy/sad, masculine/feminine, etc.

**Creating an embedding layer**. An embedding layer is created by telling it the size of the vocabulary (the number of words) and an embedding dimension (how many values to use to represent a word).

**Embedding layer input and output**. An embedding layer takes a tensor of indices and returns a matrix with one row (the embedding vector) per index.

In [8]:
def convert_to_index_tensor(words, mapping):
    indices = [mapping[w] for w in words]
    return torch.tensor(indices, dtype=torch.long)

In [9]:
vocab_size = len(word_to_index)
embed_dim = 6  # Hyperparameter
embed_layer = torch.nn.Embedding(vocab_size, embed_dim)

In [10]:
# Split an example expression into characters and look up the embedding of each one
indices = convert_to_index_tensor(ps("15 + (7 + -17)/12"), word_to_index)
embed_output = embed_layer(indices)
indices.shape, embed_output.shape, embed_output

(torch.Size([13]),
 torch.Size([13, 6]),
 tensor([[-1.0585,  0.0547,  2.2342, -0.7027,  0.3794,  0.3668],
         [-1.6456,  0.4101, -0.9782, -0.1616, -2.1837, -0.7466],
         [ 0.4347, -1.2462,  0.2102,  0.0891,  0.1622, -0.5003],
         [-0.1166,  0.9855,  0.1602,  1.2682, -0.8762,  2.1339],
         [-1.7625,  1.9052, -1.6493, -0.4016,  0.7758,  0.0818],
         [ 0.4347, -1.2462,  0.2102,  0.0891,  0.1622, -0.5003],
         [ 0.1691, -0.6900, -0.1043, -0.4392,  0.8320, -0.0137],
         [-1.0585,  0.0547,  2.2342, -0.7027,  0.3794,  0.3668],
         [-1.7625,  1.9052, -1.6493, -0.4016,  0.7758,  0.0818],
         [ 1.4024, -0.2317, -0.8802,  2.1191,  0.1747,  1.4870],
         [ 0.1328,  0.1640, -1.2090, -0.4015, -2.1778,  0.9543],
         [-1.0585,  0.0547,  2.2342, -0.7027,  0.3794,  0.3668],
         [-0.5316, -0.5064,  0.4487,  0.1517,  0.6270,  1.6644]],
        grad_fn=<EmbeddingBackward0>))
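
The repeated rows in the output above come from repeated tokens: an embedding layer is a lookup table, so the same index always returns the same row. The following is a small sketch of that behavior with toy sizes; the names `toy_embed` and `toy_indices` and the seed are illustrative assumptions, not part of this notebook.

```python
import torch

torch.manual_seed(0)  # Arbitrary seed so the sketch is repeatable

# Toy sizes chosen for illustration only
toy_embed = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)

# Index 2 appears twice, so rows 0 and 2 of the output are identical
toy_indices = torch.tensor([2, 0, 2, 4])
toy_output = toy_embed(toy_indices)

print(toy_output.shape)  # (sequence length, embedding dimension)

# The lookup is equivalent to indexing the weight matrix directly
print(torch.equal(toy_output, toy_embed.weight[toy_indices]))
```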

## Adding an LSTM layer

The [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM) layer is in charge of processing embeddings such that the network can output the correct classification. Since this is a recurrent layer, it will take into account past words when it creates an output for the current word.

**Creating an LSTM layer**. To create an LSTM you need to tell it the size of its input (the size of an embedding) and the size of its internal hidden and cell states.

**LSTM layer input and output**. An LSTM takes a sequence of embeddings (and optionally an initial hidden and cell state) and outputs a value for each word as well as the final hidden and cell states.

If you read the linked LSTM documentation you will see that, by default, it requires input in this format: `(seq_len, batch, input_size)`.

As you can see above, our embedding layer outputs something that is `(seq_len, input_size)`. So, we need to add a dimension in the middle.

In [11]:
hidden_dim = 10  # Hyperparameter
num_layers = 5  # Hyperparameter
lstm_layer = torch.nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers)

In [12]:
# The LSTM layer expects the input to be in the shape (L, N, E)
#   L is the length of the sequence
#   N is the batch size (we'll stick with 1 here)
#   E is the size of the embedding
lstm_output, _ = lstm_layer(embed_output.unsqueeze(1))
lstm_output.shape

torch.Size([13, 1, 10])
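
The cell above discards the second return value of the LSTM. As a rough sketch of what that value contains, the example below uses toy sizes and random stand-in embeddings (all `toy_*` names are assumptions made for illustration) and prints the shapes of the output, the final hidden state, and the final cell state.

```python
import torch

seq_len, toy_embed_dim, toy_hidden_dim, toy_layers = 4, 6, 10, 2

toy_lstm = torch.nn.LSTM(toy_embed_dim, toy_hidden_dim, num_layers=toy_layers)

# Stand-in for an embedding layer's output: (seq_len, embed_dim)
toy_embeddings = torch.randn(seq_len, toy_embed_dim)

# unsqueeze(1) inserts the batch dimension: (seq_len, 1, embed_dim)
toy_out, (h_n, c_n) = toy_lstm(toy_embeddings.unsqueeze(1))

print(toy_out.shape)  # (seq_len, batch, hidden_dim)
print(h_n.shape)      # (num_layers, batch, hidden_dim)
print(c_n.shape)      # (num_layers, batch, hidden_dim)

# The last time step of the output equals the top layer's final hidden state
print(torch.allclose(toy_out[-1], h_n[-1]))
```

The output tensor holds the top layer's hidden state at every time step, which is why a per-word classifier can be applied to it directly.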

## Classifying the LSTM output

We can now add a fully connected, [linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear) layer to our NN to learn the correct part of speech (classification).

**Creating a linear layer**. We create a linear layer by specifying the number of input features and the number of output features (the number of neurons in the layer).

**Linear layer input and output**. The layer operates on the last dimension of its input, which must have size `in_features`. All other dimensions pass through unchanged, and the last dimension of the output has size `out_features` (one value per neuron).

In [14]:
linear_layer = torch.nn.Linear(hidden_dim, 1)

In [15]:
linear_output = linear_layer(lstm_output)
linear_output.shape, linear_output

(torch.Size([13, 1, 1]),
 tensor([[[-0.1629]],
 
         [[-0.1743]],
 
         [[-0.1828]],
 
         [[-0.1887]],
 
         [[-0.1928]],
 
         [[-0.1954]],
 
         [[-0.1970]],
 
         [[-0.1979]],
 
         [[-0.1986]],
 
         [[-0.1990]],
 
         [[-0.1992]],
 
         [[-0.1993]],
 
         [[-0.1994]]], grad_fn=<AddBackward0>))
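
To make the shape behavior of the linear layer concrete, here is a minimal sketch with a random stand-in for the LSTM output. The `toy_*` names are illustrative assumptions. It also checks that the layer computes `x @ W^T + b` on the last dimension.

```python
import torch

toy_linear = torch.nn.Linear(in_features=10, out_features=1)

# Stand-in for LSTM output: (seq_len, batch, hidden_dim)
toy_lstm_out = torch.randn(13, 1, 10)

toy_scores = toy_linear(toy_lstm_out)
print(toy_scores.shape)  # Only the last dimension changes: 10 -> 1

# Equivalent computation written out by hand: x @ W^T + b
manual = toy_lstm_out @ toy_linear.weight.T + toy_linear.bias
print(torch.allclose(toy_scores, manual, atol=1e-6))
```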

# Training an LSTM model

In [16]:
# Hyperparameters
valid_percent = 0.2  # Training/validation split

embed_dim = 7  # Size of word embedding
hidden_dim = 8  # Size of LSTM internal state
num_layers = 5  # Number of LSTM layers

learning_rate = 0.1
num_epochs = 2

## Creating training and validation datasets

In [17]:
N = len(dataset)
vocab_size = len(word_to_index)  # Number of unique input words

# Shuffle the data so that we can split the dataset randomly
shuffle(dataset)

split_point = int(N * valid_percent)
valid_dataset = dataset[:split_point]
train_dataset = dataset[split_point:]

len(valid_dataset), len(train_dataset)

(20, 80)
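
Because `shuffle` draws from the global random state, the split above differs from run to run. If a repeatable split is wanted, one option is sketched below; `split_dataset` and its `seed` parameter are hypothetical names introduced here, and the toy examples are made up for illustration.

```python
import random


def split_dataset(examples, valid_fraction, seed=0):
    """Hypothetical helper: shuffle a copy of examples and split it in two."""
    shuffled = list(examples)               # Leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # Private RNG so the split is repeatable
    split_point = int(len(shuffled) * valid_fraction)
    return shuffled[:split_point], shuffled[split_point:]


toy_examples = [(["1", "+", str(i)], 1 + i) for i in range(10)]
toy_valid, toy_train = split_dataset(toy_examples, valid_fraction=0.2)
print(len(toy_valid), len(toy_train))
```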

## Creating the Parts of Speech LSTM model

In [18]:
class POS_LSTM(torch.nn.Module):
    """Part of Speech LSTM model."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, embed_dim)
        self.lstm = torch.nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers)
        self.linear = torch.nn.Linear(hidden_dim, 1)

    def forward(self, X):
        X = self.embed(X)
        X, _ = self.lstm(X.unsqueeze(1))
        return self.linear(X)

## Training

In [19]:
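def compute_accuracy(dataset):
    """A helper function for computing accuracy on the given dataset.

    Relies on the globals `model`, `word_to_index`, and `tag_list` (a list
    mapping each predicted index to its tag), and on a model that outputs
    one score per tag for every word.
    """
    total_words = 0
    total_correct = 0

    model.eval()

    with torch.no_grad():
        for sentence, tags in dataset:
            sentence_indices = convert_to_index_tensor(sentence, word_to_index)
            # Drop only the batch dimension so one-word sentences keep shape (1, num_tags)
            tag_scores = model(sentence_indices).squeeze(1)
            predictions = tag_scores.argmax(dim=1)
            total_words += len(sentence)
            total_correct += sum(t == tag_list[p] for t, p in zip(tags, predictions))

    return total_correct / total_words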
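
The core of the accuracy computation is taking the `argmax` over the tag dimension and comparing the result with the true tags. The sketch below works through that step on hand-written numbers; the scores and targets are hypothetical values chosen for illustration, and it compares tag indices directly rather than tag names.

```python
import torch

# Hypothetical scores for a 3-word sentence and 3 possible tags
toy_tag_scores = torch.tensor([[0.1, 2.0, 0.3],
                               [1.5, 0.2, 0.1],
                               [0.0, 0.1, 3.0]])
toy_targets = torch.tensor([1, 0, 1])  # Hypothetical correct tag indices

# argmax over dim=1 picks the highest-scoring tag for each word
toy_predictions = toy_tag_scores.argmax(dim=1)

num_correct = (toy_predictions == toy_targets).sum().item()
toy_accuracy = num_correct / len(toy_targets)
print(toy_predictions.tolist(), toy_accuracy)
```

Here the first two words are tagged correctly and the third is not, so the accuracy is two out of three.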

In [22]:
model = POS_LSTM(vocab_size, embed_dim, hidden_dim, num_layers)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

mb = master_bar(range(num_epochs))

# accuracy = compute_accuracy(valid_dataset)
# print(f"Validation accuracy before training : {accuracy * 100:.2f}%")

for epoch in mb:

    # Shuffle the data for each epoch (stochastic gradient descent)
    shuffle(train_dataset)

    model.train()

    for sentence, tags in progress_bar(train_dataset, parent=mb):
        model.zero_grad()
        
        sentence = convert_to_index_tensor(sentence, word_to_index)
#         tags = convert_to_index_tensor(tags, tag_to_index)

        tag_scores = model(sentence)

        break  # Stop after a single forward pass while the loss and optimizer steps below are disabled
#         loss = criterion(tag_scores.squeeze(), tags)

#         loss.backward()
#         optimizer.step()

# accuracy = compute_accuracy(valid_dataset)
# print(f"Validation accuracy after training : {accuracy * 100:.2f}%")
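
The loss and optimizer calls in the loop above are commented out. As a rough illustration of what one complete optimization step could look like for a tagger that outputs one score per tag, here is a self-contained sketch. Everything in it is an assumption made for the example: the `ToyTagger` class, the three-word vocabulary, the tag set, and the layer sizes are hypothetical and are not this notebook's model or data.

```python
import torch

torch.manual_seed(0)  # Arbitrary seed so the sketch is repeatable

# Hypothetical toy vocabulary and tag set; not the notebook's dataset
toy_word_to_index = {"the": 0, "dog": 1, "ran": 2}
toy_tag_to_index = {"D": 0, "N": 1, "V": 2}


class ToyTagger(torch.nn.Module):
    """Embedding -> LSTM -> linear layer with one output per tag."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, embed_dim)
        self.lstm = torch.nn.LSTM(embed_dim, hidden_dim)
        self.linear = torch.nn.Linear(hidden_dim, num_tags)

    def forward(self, X):
        X = self.embed(X)
        X, _ = self.lstm(X.unsqueeze(1))
        return self.linear(X)


toy_model = ToyTagger(len(toy_word_to_index), 4, 5, len(toy_tag_to_index))
toy_criterion = torch.nn.CrossEntropyLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

toy_sentence = torch.tensor([0, 1, 2])  # "the dog ran"
toy_tags = torch.tensor([0, 1, 2])      # D N V

toy_model.train()
toy_optimizer.zero_grad()

toy_scores = toy_model(toy_sentence)    # (seq_len, 1, num_tags)

# CrossEntropyLoss expects (N, C) scores and (N,) class indices, so drop
# the batch dimension of size 1
toy_loss = toy_criterion(toy_scores.squeeze(1), toy_tags)

toy_loss.backward()
toy_optimizer.step()

print(toy_scores.shape, toy_loss.item())
```

Using `squeeze(1)` rather than a bare `squeeze()` keeps the scores two-dimensional even for a one-word sentence, which is the shape `CrossEntropyLoss` expects here.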

## Examining results

Here we look at all words that are misclassified by the model.

In [None]:
print("\nMis-predictions after training on entire dataset")
header = "Word".center(14) + " | True Tag | Prediction"
print(header)
print("-" * len(header))

with torch.no_grad():
    for sentence, tags in dataset:
        sentence_indices = convert_to_index_tensor(sentence, word_to_index)
        tag_scores = model(sentence_indices)
        predictions = tag_scores.squeeze(1).argmax(dim=1)
        for word, tag, pred in zip(sentence, tags, predictions):
            if tag != tag_list[pred]:
                print(f"{word:>14} |     {tag}    |    {tag_list[pred]}")

## Using the model for inference

In [None]:
new_sentence = "3 + 3"

# Split the sentence into tokens (ps splits on characters and drops spaces)
sentence = ps(new_sentence)

# Check that each word is in our vocabulary
for word in sentence:
    assert word in word_to_index

# Convert input to a tensor
sentence = convert_to_index_tensor(sentence, word_to_index)

# Compute prediction (no gradients are needed for inference)
model.eval()
with torch.no_grad():
    predictions = model(sentence)
predictions = predictions.squeeze(1).argmax(dim=1)

# Print results
for word, tag in zip(ps(new_sentence), predictions):
    print(word, "=>", tag_list[tag.item()])

Things to try:

- compare with fully connected network
- compare with CNN
- compare with transformer