# Sequence Models
This tutorial follows the [Sequence Models and Long Short-Term Memory Networks](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html) tutorial on the PyTorch website. The original author is Robert Guthrie.

In this tutorial, you will learn about LSTM neural networks and see an example of how they can be used to recognize parts of speech.

In [35]:
import os
import jdc
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from IPython.display import Image

torch.manual_seed(1)

<torch._C.Generator at 0x10d1e4a10>

In [34]:
repo_dirpath = os.path.dirname(os.getcwd())  # Parent of the current working directory
image_dirpath = os.path.join(repo_dirpath, 'images')

/Users/dave/DataScience/Projects/GitHub/PythonWorkshop/intro-to-nlp-with-pytorch/images


In [36]:
Image(filename=os.path.join(image_dirpath, 'lstm_flow.png'))


## Introduction
### What is an LSTM?
LSTM stands for Long Short-Term Memory. It is a type of recurrent neural network that can learn from sequences of information and make predictions based on what it has learned. The LSTM cell has a state, which is updated at each step as the network reads a sequence. It is this state that allows the network to remember.

Here is a basic diagram of an LSTM:

[Figure: basic diagram of an LSTM]

### LSTM Components
An LSTM cell consists of three gates. Each gate makes decisions on what information to pass on. The gates are:
* Forget Gate
* Input Gate
* Output Gate

#### Forget Gate
This gate decides what information is not important and removes it from the current LSTM state.

#### Input Gate
The input gate decides what information to store in the current LSTM state.

#### Output Gate
Finally, the LSTM decides what information to output. This is done through the output gate.
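
The three gates described above can be written out by hand as a single LSTM step. This is a minimal sketch, assuming the weight layout used by PyTorch's `nn.LSTMCell` (input, forget, candidate, and output blocks stacked in that order); the function name `lstm_step` is hypothetical and is not part of the original tutorial. The result is compared against `nn.LSTMCell` as a sanity check.

```python
import torch
import torch.nn as nn


def lstm_step(x, h_prev, c_prev, cell):
    """One LSTM time step computed by hand from the weights of an nn.LSTMCell."""
    # All four gate pre-activations come from one pair of matrix products.
    gates = (x @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)
    i = torch.sigmoid(i)  # Input gate: what to store in the state
    f = torch.sigmoid(f)  # Forget gate: what to remove from the old state
    g = torch.tanh(g)     # Candidate values that may be stored
    o = torch.sigmoid(o)  # Output gate: what to output
    c = f * c_prev + i * g  # Updated cell state
    h = o * torch.tanh(c)   # Updated hidden state (the output)
    return h, c


cell = nn.LSTMCell(input_size=3, hidden_size=3)
x = torch.randn(1, 3)
h0 = torch.zeros(1, 3)
c0 = torch.zeros(1, 3)

with torch.no_grad():
    h_manual, c_manual = lstm_step(x, h0, c0, cell)
    h_ref, c_ref = cell(x, (h0, c0))

print('hidden states match: {}'.format(torch.allclose(h_manual, h_ref, atol=1e-5)))
print('cell states match: {}'.format(torch.allclose(c_manual, c_ref, atol=1e-5)))
```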

### Very Simple Example 1
Before approaching the main example for this tutorial, let's look at a very brief example of how to create an LSTM in PyTorch and pass information through it. One way to do this is to feed the sequence to the LSTM one element at a time using a **`for`** loop:

In [2]:
lstm = nn.LSTM(3, 3)  # Input dimension is 3, output dimension is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # Make a sequence of length 5

# Initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
for i in inputs:
    # Step through the sequence one element at a time.
    # After each time step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

In [3]:
print('out: {}'.format(out))
print('hidden: {}'.format(hidden))

out: tensor([[[-0.3600,  0.0893,  0.0215]]])
hidden: (tensor([[[-0.3600,  0.0893,  0.0215]]]), tensor([[[-1.1298,  0.4467,  0.0254]]]))


### Very Simple Example 2
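Instead of stepping through the sequence with a **`for`** loop, we can use PyTorch's **`cat`** function to concatenate the inputs into a single tensor and pass the entire sequence through the LSTM in one call. In this case, `out` contains the hidden state at every time step, while `hidden` contains only the final hidden and cell states.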

In [4]:
lstm = nn.LSTM(3, 3)  # Input dimension is 3, output dimension is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # Make a sequence of length 5

inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # Re-initialize the hidden state
out, hidden = lstm(inputs, hidden)

In [5]:
print('out: {}'.format(out))
print('hidden: {}'.format(hidden))

out: tensor([[[-0.2696,  0.2599, -0.0758]],

        [[-0.4923,  0.1408, -0.0738]],

        [[-0.4523,  0.1241, -0.1461]],

        [[-0.3057,  0.1198, -0.0571]],

        [[-0.1077,  0.0289, -0.0487]]])
hidden: (tensor([[[-0.1077,  0.0289, -0.0487]]]), tensor([[[-0.1439,  0.1426, -0.2563]]]))
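
The two approaches should give the same result when they start from the same hidden state and see the same inputs. The following is a small sketch that checks this; it uses a zero initial state and its own random seed, both of which are assumptions made for the comparison rather than choices from the original examples.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(3, 3)
inputs = [torch.randn(1, 3) for _ in range(5)]
h0 = (torch.zeros(1, 1, 3), torch.zeros(1, 1, 3))

with torch.no_grad():
    # Approach 1: step through the sequence one element at a time.
    hidden = h0
    step_outputs = []
    for x in inputs:
        out, hidden = lstm(x.view(1, 1, -1), hidden)
        step_outputs.append(out)
    out_loop = torch.cat(step_outputs)

    # Approach 2: pass the whole sequence in a single call.
    seq = torch.cat(inputs).view(len(inputs), 1, -1)
    out_seq, hidden_seq = lstm(seq, h0)

print('output shape: {}'.format(tuple(out_seq.shape)))
print('same outputs: {}'.format(torch.allclose(out_loop, out_seq, atol=1e-5)))
# The last row of the output is the final hidden state.
print('last output is final hidden state: {}'.format(
    torch.allclose(out_seq[-1], hidden_seq[0][0], atol=1e-5)))
```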


## Example: An LSTM for Part-of-Speech Tagging
In this example, a model will be created that can predict the part-of-speech for each word in a sentence. TODO: Show rolled out LSTM, labeling each word and part of speech.

### Prepare the data
Let's begin by creating a dataset for training. The dataset is a list of tuples, one per sentence. Each tuple contains two lists: the first is a sentence split into words, and the second contains the part-of-speech tag for each word in the sentence. Because the sentences are split on whitespace, the final word keeps its period (for example, `apple.`).

In [6]:
training_data = [
    ("The dog ate the apple.".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book.".split(), ["NN", "V", "DET", "NN"])
]
training_sentences = [sentence for sentence, _ in training_data]

Since PyTorch models operate on numbers, we need to map strings, such as the words in the training set, to integers. The following lines of code create a dictionary with this mapping:

In [7]:
word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print('word_to_ix: {}'.format(word_to_ix))

word_to_ix: {'that': 7, 'The': 0, 'Everybody': 5, 'read': 6, 'apple.': 4, 'the': 3, 'ate': 2, 'book.': 8, 'dog': 1}


We also need to map the parts-of-speech tags to integers:

In [8]:
# Tags to integers
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

A third dictionary is used to map the integers back to parts-of-speech.

In [9]:
# Integers to tags
ix_to_tag = {0: "DET", 1: "NN", 2: "V"}
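
As an alternative to writing the reverse mapping by hand, it could be derived from `tag_to_ix` so the two dictionaries cannot drift apart. This is only a sketch of that option, not the approach used in the original tutorial.

```python
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# Invert the mapping: each integer index points back to its tag.
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

print('ix_to_tag: {}'.format(ix_to_tag))  # ix_to_tag: {0: 'DET', 1: 'NN', 2: 'V'}
```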

### Set Hyperparameters

In [10]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 6
LEARNING_RATE = 0.1
NUM_EPOCHS = 300

### Create the model
We will define the model by creating a Python class. This class inherits from PyTorch's `nn.Module` class, which allows us to easily use the neural network layers defined in PyTorch.

The `LSTMTagger` class takes four values: the embedding dimension, the hidden dimension, the vocabulary size, and the size of the tag set.

In [11]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # LSTM: Inputs are embeddings, outputs are hidden states
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # Linear layer maps hidden space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

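We will now create a function to initialize the hidden states. The `%%add_to` cell magic from the `jdc` package imported above attaches the function to the `LSTMTagger` class as a method.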

In [12]:
%%add_to LSTMTagger
def init_hidden(self):
    # Initialize the hidden state. The axes correspond to 
    # (num_layers, minibatch_size, hidden_dim)
    return (torch.zeros(1, 1, self.hidden_dim),
            torch.zeros(1, 1, self.hidden_dim))

Now we define a function to make a forward pass through the recurrent LSTM network. It returns the predicted tag scores for an input sentence.

In [13]:
%%add_to LSTMTagger
def forward(self, sentence):
    embeds = self.word_embeddings(sentence)
    lstm_out, self.hidden = self.lstm(
        embeds.view(len(sentence), 1, -1), self.hidden)
    tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
    tag_scores = F.log_softmax(tag_space, dim=1)
    return tag_scores
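
To make the forward pass easier to follow, the sketch below runs the same sequence of layers outside the class and prints the tensor shape after each step. It assumes a vocabulary of 9 words and 3 tags, matching the dictionaries above, and lets `nn.LSTM` start from its default zero state instead of passing `self.hidden`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM, TAGSET_SIZE = 9, 6, 6, 3

word_embeddings = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM)
hidden2tag = nn.Linear(HIDDEN_DIM, TAGSET_SIZE)

# A five-word sentence, already converted to word indices.
sentence = torch.tensor([0, 1, 2, 3, 4], dtype=torch.long)

with torch.no_grad():
    # One embedding vector per word.
    embeds = word_embeddings(sentence)
    # The LSTM expects (sequence_length, batch_size, embedding_dim).
    lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))
    # Drop the batch axis and map each hidden state to tag space.
    tag_space = hidden2tag(lstm_out.view(len(sentence), -1))
    # Convert raw scores to log-probabilities over the tags.
    tag_scores = F.log_softmax(tag_space, dim=1)

for name, tensor in [('embeds', embeds), ('lstm_out', lstm_out),
                     ('tag_space', tag_space), ('tag_scores', tag_scores)]:
    print('{}: {}'.format(name, tuple(tensor.shape)))
```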

### Helper Function
This helper function will be used to map either words or tags to integers, using the previously defined dictionaries (**`word_to_ix`**, **`tag_to_ix`**).

In [14]:
def prepare_sequence(seq, to_ix):
    """
    Convert words or tags to integers and return a PyTorch tensor.
    :param seq: Sequence of words or tags.
    :type seq: list
    :param to_ix: Dictionary mapping words or tags to integers.
    :return: A PyTorch tensor of indices.
    :rtype: Tensor
    """
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)
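
A short usage sketch of the helper, repeated here so the snippet runs on its own, converting the tags of the first training sentence:

```python
import torch


def prepare_sequence(seq, to_ix):
    """Convert a list of words or tags to a tensor of integer indices."""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
targets = prepare_sequence(["DET", "NN", "V", "DET", "NN"], tag_to_ix)
print(targets)  # tensor([0, 1, 2, 0, 1])
```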

### Train the model

Create the PyTorch LSTM model using the hyperparameters defined above.

In [15]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))

Define the loss function. In this case, we will use the negative log likelihood loss, which is commonly used in classification problems and expects the log-probabilities that `log_softmax` produces. TODO: Diagram of loss function.

In [16]:
loss_function = nn.NLLLoss()
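
To illustrate what this loss computes, the sketch below compares `nn.NLLLoss` with a hand-written version: the average of the negative log-probabilities that the model assigned to the correct tags. The logits and targets are made-up values chosen for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores for two words and three tags.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3]])
targets = torch.tensor([0, 1])  # Correct tag index for each word

log_probs = F.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, targets)

# By hand: pick the log-probability of the correct tag in each row,
# negate it, and average over the words.
manual = -log_probs[torch.arange(len(targets)), targets].mean()

# nn.CrossEntropyLoss combines log_softmax and NLLLoss in one step.
cross_entropy = nn.CrossEntropyLoss()(logits, targets)

print('NLLLoss: {:.4f}'.format(nll.item()))
print('manual: {:.4f}'.format(manual.item()))
print('CrossEntropyLoss on raw scores: {:.4f}'.format(cross_entropy.item()))
```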

We will train the model using stochastic gradient descent.

In [17]:
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)
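
For intuition, a single step of plain stochastic gradient descent moves each parameter against its gradient, scaled by the learning rate. The sketch below checks this on a made-up two-element parameter and a simple quadratic loss; neither is part of the tagging model.

```python
import torch
import torch.optim as optim

LEARNING_RATE = 0.1

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=LEARNING_RATE)

loss = (w ** 2).sum()  # The gradient of this loss is 2 * w
loss.backward()

# Update rule written by hand: w - lr * grad
expected = w.detach() - LEARNING_RATE * w.grad

optimizer.step()
print('after step: {}'.format(w.detach()))
print('matches manual update: {}'.format(torch.allclose(w.detach(), expected)))
```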

Let's run the model before any training has been done and store the scores in a `list`. We will then compare these scores with the scores after training.

In [18]:
store_initial_scores = []
store_initial_predictions = []
with torch.no_grad():
    for sentence in training_sentences:
        inputs = prepare_sequence(sentence, word_to_ix)
        tag_scores = model(inputs)
        max_values, max_indices = torch.max(tag_scores, 1)
        initial_prediction = [ix_to_tag[x] for x in max_indices.numpy()]
        store_initial_predictions.append(initial_prediction)
        store_initial_scores.append(tag_scores)

Now, we will train the model.

In [19]:
for epoch in range(NUM_EPOCHS):
    for sentence, tags in training_data:
        # Reset the gradients to zero before each instance
        model.zero_grad()

        # Re-initialize the hidden state of the LSTM before each instance
        model.hidden = model.init_hidden()
        
        # Turn inputs into tensors of word indices
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        
        # Run forward pass
        tag_scores = model(sentence_in)
        
        # Compute the loss, gradients, and update the parameters
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
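
One way to confirm that training is making progress is to record the average loss per epoch. The sketch below does this in a self-contained form. It assumes a compact stand-in class, `TinyTagger`, and a helper, `to_tensor`, both hypothetical names; unlike `LSTMTagger`, the stand-in does not store the hidden state and lets `nn.LSTM` start from its default zero state on every call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

training_data = [
    ("The dog ate the apple.".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book.".split(), ["NN", "V", "DET", "NN"]),
]
word_to_ix = {}
for sentence, _ in training_data:
    for word in sentence:
        word_to_ix.setdefault(word, len(word_to_ix))
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}


class TinyTagger(nn.Module):
    """Hypothetical compact tagger used only for this sketch."""

    def __init__(self, vocab_size, tagset_size, embedding_dim=6, hidden_dim=6):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        return F.log_softmax(tag_space, dim=1)


def to_tensor(seq, to_ix):
    """Map a list of words or tags to a tensor of indices."""
    return torch.tensor([to_ix[s] for s in seq], dtype=torch.long)


model = TinyTagger(len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(300):
    total = 0.0
    for sentence, tags in training_data:
        model.zero_grad()
        tag_scores = model(to_tensor(sentence, word_to_ix))
        loss = loss_function(tag_scores, to_tensor(tags, tag_to_ix))
        loss.backward()
        optimizer.step()
        total += loss.item()
    # Average loss over the sentences seen in this epoch.
    epoch_losses.append(total / len(training_data))

print('first epoch loss: {:.4f}'.format(epoch_losses[0]))
print('last epoch loss: {:.4f}'.format(epoch_losses[-1]))
```

If the loss in the last epoch is not clearly lower than in the first, the learning rate or the number of epochs is the first place to look.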

Our model has now finished training. Let's print out some statistics to show how well the model training performed. TODO: Understand what numbers for scores mean and create diagram.

In [20]:
# Print out the scores after training the model
store_initial_scores.reverse()
store_initial_predictions.reverse()
with torch.no_grad():
    for sentence in training_sentences:
        inputs = prepare_sequence(sentence, word_to_ix)
        tag_scores = model(inputs)
        max_values, max_indices = torch.max(tag_scores, 1)
        predictions = [ix_to_tag[x] for x in max_indices.numpy()]
        
        print('Before training:')
        print(' - initial scores: {}'.format(store_initial_scores.pop()))
        print(' - sentence: {}'.format(' '.join(sentence)))
        print(' - prediction: {}'.format(store_initial_predictions.pop()))
        print('After training:')
        print(' - final scores: {}'.format(tag_scores))
        print(' - sentence: {}'.format(' '.join(sentence)))
        print(' - prediction: {}'.format(predictions))
        print('')

Before training:
 - initial scores: tensor([[-1.1737, -1.1350, -0.9960],
        [-1.1711, -1.2007, -0.9442],
        [-1.2051, -1.1736, -0.9389],
        [-1.1577, -1.1514, -0.9954],
        [-1.1466, -1.1302, -1.0236]])
 - sentence: The dog ate the apple.
 - prediction: ['V', 'V', 'V', 'V', 'V']
After training:
 - final scores: tensor([[-0.3653, -1.2979, -3.4145],
        [-2.6583, -0.1817, -2.3428],
        [-2.8464, -3.7458, -0.0852],
        [-0.1884, -3.2050, -2.0312],
        [-4.5199, -0.0146, -5.6265]])
 - sentence: The dog ate the apple.
 - prediction: ['DET', 'NN', 'V', 'DET', 'NN']

Before training:
 - initial scores: tensor([[-1.1579, -1.1226, -1.0204],
        [-1.1322, -1.0778, -1.0867],
        [-1.0748, -1.0915, -1.1304],
        [-1.1071, -1.0693, -1.1202]])
 - sentence: Everybody read that book.
 - prediction: ['V', 'NN', 'DET', 'NN']
After training:
 - final scores: tensor([[-5.1919, -0.0250, -3.9576],
        [-2.1640, -2.9561, -0.1826],
        [-0.0306, -4.4853

### What do the scores mean?
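The scores are used to predict the part of speech of each word in a sentence. For every word, the model outputs one score per possible tag, and the set of possible tags is the same for every word passed through the model. Because the last step of the model is `log_softmax`, each score is a log-probability: it is never positive, and the closer it is to zero, the more likely the model considers that tag. For example: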

In [24]:
print('Let us take the sentence: {}'.format(' '.join(training_sentences[0])))
print('For the word "{}" the list of possible parts-of-speech are: {}'.format(training_sentences[0][0], [x for x in ix_to_tag.values()]))

Let us take the sentence: The dog ate the apple.
For the word "The" the list of possible parts-of-speech are: ['DET', 'NN', 'V']


The model assigns a score to each part of speech in the list, and the prediction is the part of speech with the highest score. For the word "The", a correct prediction means that `DET` has the highest score.
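
Since the scores come from `log_softmax`, exponentiating them gives probabilities that sum to approximately one. The sketch below does this for the word "The", using the after-training scores for that word copied from the printed output earlier in this tutorial.

```python
import torch

ix_to_tag = {0: "DET", 1: "NN", 2: "V"}

# After-training scores for the word "The" (first row of the final scores).
scores_the = torch.tensor([-0.3653, -1.2979, -3.4145])

probs = torch.exp(scores_the)  # Log-probabilities to probabilities
predicted = ix_to_tag[int(torch.argmax(scores_the))]

for ix, p in enumerate(probs.tolist()):
    print('{}: {:.3f}'.format(ix_to_tag[ix], p))
print('sum of probabilities: {:.3f}'.format(probs.sum().item()))
print('prediction: {}'.format(predicted))
```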