# Sequence Models
Tutorial link: [Sequence Models and Long Short-Term Memory Networks](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html)

Original author: Robert Guthrie.

In this tutorial, you will learn about LSTM neural networks and see an example of how they can be used to recognize parts of speech.

In [1]:
import math
import os
import jdc
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from IPython.display import Image

torch.manual_seed(1)

<torch._C.Generator at 0x11165afd0>

## Introduction
### RNNs
* Difficulty learning widely separated relationships in long sequences.
### What is an LSTM?
* Long Short-Term Memory.
* Able to learn long-term dependencies in a sequence.
* Uses a cell state and gates to remember relationships.
    
**Example:** _The humidity is very high. Today it is going to_ _**rain**_. A regular RNN might not do a good job predicting the word **rain**.

### Rolled Out LSTM Network

<img src="https://raw.githubusercontent.com/PythonWorkshop/intro-to-nlp-with-pytorch/master/images/lstm_flow.png" width=50%>

### LSTM Cell

<img src="https://raw.githubusercontent.com/PythonWorkshop/intro-to-nlp-with-pytorch/master/images/lstm_inner_workings.png" width=50%>

### LSTM Components
An LSTM uses three gates:
* Forget Gate
* Input Gate
* Output Gate

#### Forget Gate
* Decide what information to remove from cell state.
* Sigmoid layer.

**Example Continued:** 
* Network receives as input low humidity.
* Network adjusts the rain likelihood for today to low. 
* The forget gate then removes the current state of rain likelihood.

<img src="../images/LSTM3-focus-f.png" width=80%>
<div class="row" style="font-size: 10px">
    <div class="col-md-12">
        <p><a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a>, colah's blog, August 27, 2015
    </div>
</div>
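
As a sketch in the notation of the cited colah's blog post, the forget gate can be written as the equation below. The weight matrix $W_f$ and bias $b_f$ are symbols borrowed from that post and are assumptions here, since this tutorial does not define them; $h_{t-1}$ is the previous hidden state and $x_t$ is the current input.

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$

Because $\sigma$ is a sigmoid, every entry of $f_t$ lies between 0 and 1: values near 0 remove the corresponding entry of the cell state and values near 1 keep it.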

#### Input Gate
* Decide what information to store in cell state.
* Decide what to update.
    * Sigmoid layer. 
* Create candidate values for cell state.
    * Tanh layer.
    
**Example Continued:**
* Replace the previous rain likelihood with the new likelihood for dry weather.

<img src="../images/LSTM3-focus-i.png" width=80%>
<div class="row" style="font-size: 10px">
    <div class="col-md-12">
        <p><a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a>, colah's blog, August 27, 2015
    </div>
</div>
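
Using the same borrowed notation (the weights $W_i$, $W_C$ and biases $b_i$, $b_C$ are assumed symbols, not names defined in this tutorial), the two parts of the input gate can be sketched as:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$

Here $i_t$ decides which entries to update, and $\tilde{C}_t$ holds the candidate values, each between -1 and 1.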

#### Update Cell State
* Remove forgotten information.
    * Multiply $f_t$ by old cell state.

* Add new candidate values scaled by their importance.
    * Add $i_t\ast\tilde{C}_t$

<img src="../images/LSTM3-focus-C.png" width=80%>
<div class="row" style="font-size: 10px">
    <div class="col-md-12">
        <p><a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a>, colah's blog, August 27, 2015
    </div>
</div>
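
Putting the two bullet points above together, the cell state update can be written as a single equation, where $C_{t-1}$ denotes the old cell state and $\ast$ is element-wise multiplication:

$$C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t$$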

#### Output Gate
* Return filtered version of the cell state.

**Example Continued:**
* Keep track of humidity magnitude to help the network decide whether to predict a large storm or just light rain.

<img src="../images/LSTM3-focus-o.png" width=80%>
<div class="row" style="font-size: 10px">
    <div class="col-md-12">
        <p><a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a>, colah's blog, August 27, 2015
    </div>
</div>
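
In the same notation, the output gate is commonly written as $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ with $h_t = o_t \ast \tanh(C_t)$ (again, $W_o$ and $b_o$ are assumed symbols). The sketch below is not part of the original tutorial: it computes one LSTM step by hand from the gate equations and compares the result with PyTorch's `nn.LSTMCell`. It assumes the gates are stored in the order input, forget, candidate, output, which is how the PyTorch documentation describes the stacked weights. The sizes and random inputs are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 3, 4  # arbitrary sizes chosen for illustration
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)      # current input
h_prev = torch.randn(1, hidden_size)  # previous hidden state
c_prev = torch.randn(1, hidden_size)  # previous cell state

with torch.no_grad():
    # All four gate pre-activations are computed in one matrix product.
    gates = (x_t @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i_t, f_t, g_t, o_t = gates.chunk(4, dim=1)

    i_t = torch.sigmoid(i_t)  # input gate
    f_t = torch.sigmoid(f_t)  # forget gate
    g_t = torch.tanh(g_t)     # candidate values
    o_t = torch.sigmoid(o_t)  # output gate

    c_t = f_t * c_prev + i_t * g_t  # update the cell state
    h_t = o_t * torch.tanh(c_t)     # filtered version of the cell state

    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

manual_matches_builtin = (torch.allclose(h_t, h_ref, atol=1e-5)
                          and torch.allclose(c_t, c_ref, atol=1e-5))
print(manual_matches_builtin)
```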

### Very Simple Example 1
* Create an LSTM in PyTorch using a **`for`** loop.

In [2]:
# Define LSTM architecture
sequence_len = 5  # The length of the sequence
input_size = 1  # Number of input features per time step
hidden_size = 1  # Number of features in the hidden state
batch_size = 1  # The batch size
output_size = hidden_size  
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size)

# Create fake inputs for the LSTM
inputs = torch.randn(sequence_len, batch_size, input_size)

# Initialize the hidden state and cell states.
hidden_0 = torch.randn(1, batch_size, hidden_size)
cell_0 = torch.randn(1, batch_size, hidden_size)

# Step through the LSTM as it takes in the input sequence
for i, in_value in enumerate(inputs):
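    # Feed the sequence one element at a time.
    # Note: every call receives the same initial state (hidden_0, cell_0), so
    # each step is computed independently of the others; hidden_out is the
    # (hidden state, cell state) tuple after that single step.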
    out, hidden_out = lstm(in_value.view(1, 1, -1), (hidden_0, cell_0))
    print('x_{}: {}'.format(i+1, out))
    print('h_{}: {}'.format(i+1, hidden_out))
    print('')

x_1: tensor([[[0.1286]]], grad_fn=<StackBackward>)
h_1: (tensor([[[0.1286]]], grad_fn=<StackBackward>), tensor([[[0.5273]]], grad_fn=<StackBackward>))

x_2: tensor([[[0.1395]]], grad_fn=<StackBackward>)
h_2: (tensor([[[0.1395]]], grad_fn=<StackBackward>), tensor([[[0.4831]]], grad_fn=<StackBackward>))

x_3: tensor([[[0.1322]]], grad_fn=<StackBackward>)
h_3: (tensor([[[0.1322]]], grad_fn=<StackBackward>), tensor([[[0.5156]]], grad_fn=<StackBackward>))

x_4: tensor([[[0.1449]]], grad_fn=<StackBackward>)
h_4: (tensor([[[0.1449]]], grad_fn=<StackBackward>), tensor([[[0.4218]]], grad_fn=<StackBackward>))

x_5: tensor([[[0.1443]]], grad_fn=<StackBackward>)
h_5: (tensor([[[0.1443]]], grad_fn=<StackBackward>), tensor([[[0.4398]]], grad_fn=<StackBackward>))



Here, **`x`** is the LSTM output and **`h`** is the tuple of hidden and cell states returned at each step in the sequence.
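
The loop above passes the same initial state `(hidden_0, cell_0)` to every call, so no information flows from one step to the next. A minimal sketch of the stateful variant, in which the state returned by each step is fed into the following one, is shown below. The variable names `state`, `step_outputs`, and `stepwise` are introduced here for illustration. The final check compares the step-by-step result with a single call on the whole sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
lstm = nn.LSTM(input_size=1, hidden_size=1)
inputs = torch.randn(5, 1, 1)  # (sequence_len, batch_size, input_size)

initial_state = (torch.randn(1, 1, 1), torch.randn(1, 1, 1))
state = initial_state

step_outputs = []
with torch.no_grad():
    for in_value in inputs:
        # Feed the state from the previous step back into the LSTM.
        out, state = lstm(in_value.view(1, 1, -1), state)
        step_outputs.append(out)
    stepwise = torch.cat(step_outputs)  # shape (5, 1, 1)

    # Processing the whole sequence in one call should give the same result.
    full_out, full_state = lstm(inputs, initial_state)

outputs_match = torch.allclose(stepwise, full_out, atol=1e-6)
print(outputs_match)
```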

### Very Simple Example 2
* Create an LSTM in PyTorch using **`cat`**.

In [3]:
# Define LSTM architecture
sequence_len = 5  # The length of the sequence
input_size = 1  # Number of input features per time step
hidden_size = 1  # Number of features in the hidden state
batch_size = 1  # The batch size
output_size = hidden_size  
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size)

# Create the inputs for the LSTM
inputs = [torch.randn(batch_size, input_size) for _ in range(sequence_len)]

# Concatenate the inputs so that they are a tensor
inputs = torch.cat(inputs).view(len(inputs), 1, -1)

# Initialize the hidden state and cell states.
hidden_0 = torch.randn(1, batch_size, hidden_size)
cell_0 = torch.randn(1, batch_size, hidden_size)

out, hidden = lstm(inputs, (hidden_0, cell_0))  # out = hidden state at every time step, hidden = (last hidden state, last cell state)

In [4]:
print('out: {}'.format(out))
print('last hidden and cell states: {}'.format(hidden))

out: tensor([[[-0.1616]],

        [[-0.1378]],

        [[ 0.0745]],

        [[-0.2183]],

        [[-0.2227]]], grad_fn=<StackBackward>)
last hidden and cell states: (tensor([[[-0.2227]]], grad_fn=<StackBackward>), tensor([[[-0.9171]]], grad_fn=<StackBackward>))
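
In the output above, the last entry of `out` has the same value as the final hidden state. The sketch below checks that relationship and prints the tensor shapes for a single-layer LSTM. It omits the initial state, in which case `nn.LSTM` starts from zeros; this differs from the random initial states used in the example and is an assumption made to keep the sketch short.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
sequence_len, batch_size, input_size, hidden_size = 5, 1, 1, 1
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size)
inputs = torch.randn(sequence_len, batch_size, input_size)

with torch.no_grad():
    # The initial state defaults to zeros when it is not provided.
    out, (h_n, c_n) = lstm(inputs)

print(out.shape)  # (sequence_len, batch_size, hidden_size)
print(h_n.shape)  # (num_layers, batch_size, hidden_size)
print(c_n.shape)  # (num_layers, batch_size, hidden_size)

# For a single-layer LSTM, the last time step of out is the final hidden state.
last_step_is_h_n = torch.allclose(out[-1], h_n[0])
print(last_step_is_h_n)
```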


## Example: An LSTM for Part-of-Speech Tagging
* Predict the part of speech of each word in a sentence.

### Prepare the data:
* Training data is a list of (sentence, tags) pairs.
    * The first list in each pair is a sentence, split into words.
    * The second list holds the part-of-speech tag for each word in the sentence.

In [5]:
training_data = [
    ("The dog ate the apple.".split(), ["Determiner", "Noun", "Verb", "Determiner", "Noun"]),
    ("Everybody read that book.".split(), ["Noun", "Verb", "Determiner", "Noun"])
]
training_sentences = [sentence for sentence, _ in training_data]

Use a dictionary to map each word to an integer:

In [6]:
word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print('word_to_ix: {}'.format(word_to_ix))

word_to_ix: {'read': 6, 'The': 0, 'that': 7, 'ate': 2, 'book.': 8, 'apple.': 4, 'dog': 1, 'Everybody': 5, 'the': 3}


Map the parts-of-speech tags to integers:

In [7]:
# Tags to integers
tag_to_ix = {"Determiner": 0, "Noun": 1, "Verb": 2}

Map the integers back to parts-of-speech.

In [8]:
# Integers to tags
ix_to_tag = {0: "Determiner", 1: "Noun", 2: "Verb"}
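
As an alternative sketch, the reverse mapping can be derived from **`tag_to_ix`** instead of being written by hand, which keeps the two dictionaries in sync if tags are added later. This is an illustrative variation, not the approach used in the cells above.

```python
tag_to_ix = {"Determiner": 0, "Noun": 1, "Verb": 2}

# Invert the mapping: each integer index points back to its tag.
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}
print(ix_to_tag)
```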

### Set Hyperparameters

In [9]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 6
LEARNING_RATE = 0.1
NUM_EPOCHS = 300

### Create the model
* **`LSTMTagger`** class.
    * Inherits **`nn.Module`** from PyTorch.
    * Inputs:
        * Embedding dimension.
        * Number of hidden dimensions.
        * Vocabulary size.
        * Tag set size.

In [10]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # LSTM: Inputs are embeddings, outputs are hidden states
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # Linear layer maps hidden space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

Create a function to initialize the hidden states.

In [11]:
%%add_to LSTMTagger
def init_hidden(self):
    """
    Initialize the hidden state. The axes correspond to (num_layers, minibatch_size, hidden_dim).
    """
    return (torch.zeros(1, 1, self.hidden_dim),
            torch.zeros(1, 1, self.hidden_dim))

Define a function to make a forward pass through the recurrent LSTM network. It returns the predicted tag scores for an input sentence.

In [12]:
%%add_to LSTMTagger
def forward(self, sentence):
    """
    Make a forward pass through the LSTM.
    
    :param sentence: The input sentence, encoded as a tensor of word indices.
    :type sentence: Tensor
    :return: A Tensor of log-probabilities over the tags, one row per word.
    :rtype: Tensor
    """
    embeds = self.word_embeddings(sentence)
    lstm_out, self.hidden = self.lstm(
        embeds.view(len(sentence), 1, -1), self.hidden)
    tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
    tag_scores = F.log_softmax(tag_space, dim=1)
    return tag_scores
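
To make the reshaping in `forward` easier to follow, the sketch below traces the tensor shapes through the same sequence of layers, built as standalone modules instead of through `LSTMTagger`. The sizes match the values used in this tutorial (9 words, 6 dimensions, 3 tags). The initial LSTM state is omitted and defaults to zeros, and the `shapes` dictionary is introduced only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)
vocab_size, embedding_dim, hidden_dim, tagset_size = 9, 6, 6, 3
sentence = torch.tensor([0, 1, 2, 3, 4], dtype=torch.long)  # five word indices

word_embeddings = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

with torch.no_grad():
    embeds = word_embeddings(sentence)            # one embedding per word
    lstm_in = embeds.view(len(sentence), 1, -1)   # add a batch axis of size 1
    lstm_out, _ = lstm(lstm_in)                   # one hidden state per word
    tag_space = hidden2tag(lstm_out.view(len(sentence), -1))
    tag_scores = F.log_softmax(tag_space, dim=1)  # log-probabilities per word

shapes = {
    "embeds": tuple(embeds.shape),
    "lstm_in": tuple(lstm_in.shape),
    "lstm_out": tuple(lstm_out.shape),
    "tag_scores": tuple(tag_scores.shape),
}
print(shapes)

# Exponentiating the log-probabilities gives rows that sum to 1.
row_sums = tag_scores.exp().sum(dim=1)
print(row_sums)
```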

### Helper Function
Map either words or tags to integers, using the previously defined dictionaries (**`word_to_ix`**, **`tag_to_ix`**).

In [13]:
def prepare_sequence(seq, to_ix):
    """
    Convert words or tags to integers and return a PyTorch tensor.
    :param seq: Sequence of words or tags.
    :type seq: list
    :param to_ix: Dictionary mapping words or tags to integers.
    :return: A PyTorch tensor of indices.
    :rtype: Tensor
    """
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)
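
A short usage sketch follows. It repeats the helper and the relevant dictionary entries so that it runs on its own, and encodes the first training sentence and its tags.

```python
import torch

def prepare_sequence(seq, to_ix):
    """Convert words or tags to integers and return a tensor of indices."""
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# Subset of the word_to_ix dictionary built earlier in this tutorial.
word_to_ix = {"The": 0, "dog": 1, "ate": 2, "the": 3, "apple.": 4}
tag_to_ix = {"Determiner": 0, "Noun": 1, "Verb": 2}

sentence_in = prepare_sequence("The dog ate the apple.".split(), word_to_ix)
targets = prepare_sequence(["Determiner", "Noun", "Verb", "Determiner", "Noun"], tag_to_ix)

print(sentence_in)
print(targets)
```

Note that a word missing from the dictionary raises a `KeyError`, since the helper does no handling of unknown words.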

### Train the model:

Create the LSTM PyTorch model using the hyperparameters defined above.

In [14]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))

Define the loss function. In this case, we will be using a negative log likelihood function, which is useful in classification problems.

In [15]:
loss_function = nn.NLLLoss()

#### Negative Log Likelihood
We can illustrate negative log likelihood in the following diagram:

<img src="https://raw.githubusercontent.com/PythonWorkshop/intro-to-nlp-with-pytorch/master/images/nll_loss.png" width=50%>
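
As a small worked sketch, the value returned by `nn.NLLLoss` can be reproduced by hand: pick the log-probability of the correct class in each row, negate it, and average over the rows. The raw scores and targets below are made-up numbers chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores for two words and three possible tags.
scores = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3]])
log_probs = F.log_softmax(scores, dim=1)
targets = torch.tensor([0, 1])  # index of the correct tag for each word

builtin_loss = nn.NLLLoss()(log_probs, targets)

# By hand: negative log-probability of the correct tag, averaged over words.
manual_loss = -log_probs[torch.arange(len(targets)), targets].mean()

print(builtin_loss.item(), manual_loss.item())
```

The loss is small when the model assigns high probability to the correct tags and grows as that probability shrinks.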

We will train the model using stochastic gradient descent.

In [16]:
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)
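
For intuition, one step of plain stochastic gradient descent (no momentum or weight decay, as configured above) subtracts the learning rate times the gradient from each parameter. The sketch below checks this on a toy parameter `w`, which is a hypothetical stand-in for the model parameters.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)  # toy parameter
optimizer = optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()
loss.backward()  # the gradient of sum(w^2) is 2 * w

# Expected value after one update: w - lr * grad
expected = w.detach() - 0.1 * w.grad

optimizer.step()
print(w.detach(), expected)
```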

Let's run the model before any training has been done and store the scores to a **`list`**. We will then compare these scores with the scores after training.

In [17]:
store_initial_probabilities = []
store_initial_predictions = []
with torch.no_grad():
    for sentence in training_sentences:
        inputs = prepare_sequence(sentence, word_to_ix)
        tag_scores = model(inputs)
        tag_probabilities = tag_scores.exp()
        max_values, max_indices = torch.max(tag_probabilities, 1)
        initial_prediction = [ix_to_tag[x] for x in max_indices.numpy()]
        store_initial_predictions.append(initial_prediction)
        store_initial_probabilities.append(tag_probabilities)

Now, we will train the model.

In [18]:
for epoch in range(NUM_EPOCHS):
    for sentence, tags in training_data:
        # Clear the gradients accumulated from the previous instance
        model.zero_grad()
        
        # Reset the hidden state of the LSTM so no history carries over from the previous instance
        model.hidden = model.init_hidden()
        
        # Turn inputs into tensors of word indices
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        
        # Run forward pass
        tag_scores = model(sentence_in)
        
        # Compute the loss
        loss = loss_function(tag_scores, targets)
        
        # Perform backward pass
        loss.backward()
        
        # Update model parameters
        optimizer.step()

Our model has now finished training. Let's print out some statistics to show how well the model training performed.

In [19]:
# Print out the scores after training the model
store_initial_probabilities.reverse()
store_initial_predictions.reverse()
with torch.no_grad():
    for sentence in training_sentences:
        inputs = prepare_sequence(sentence, word_to_ix)
        tag_scores = model(inputs)
        tag_probabilities = tag_scores.exp()
        max_values, max_indices = torch.max(tag_probabilities, 1)
        predictions = [ix_to_tag[x] for x in max_indices.numpy()]
        
        print('Before training:')
        print(' - initial probabilities: {}'.format(store_initial_probabilities.pop()))
        print(' - sentence: {}'.format(' '.join(sentence)))
        print(' - prediction: {}'.format(store_initial_predictions.pop()))
        print('After training:')
        print(' - final probabilities: {}'.format(tag_probabilities))
        print(' - sentence: {}'.format(' '.join(sentence)))
        print(' - prediction: {}'.format(predictions))
        print('')

Before training:
 - initial probabilities: tensor([[0.4666, 0.2908, 0.2426],
        [0.4649, 0.2811, 0.2540],
        [0.4950, 0.2623, 0.2427],
        [0.4749, 0.2906, 0.2345],
        [0.4839, 0.2764, 0.2398]])
 - sentence: The dog ate the apple.
 - prediction: ['Determiner', 'Determiner', 'Determiner', 'Determiner', 'Determiner']
After training:
 - final probabilities: tensor([[0.6031, 0.2518, 0.1451],
        [0.0267, 0.9718, 0.0014],
        [0.0592, 0.0161, 0.9246],
        [0.9479, 0.0264, 0.0257],
        [0.0104, 0.9842, 0.0054]])
 - sentence: The dog ate the apple.
 - prediction: ['Determiner', 'Noun', 'Verb', 'Determiner', 'Noun']

Before training:
 - initial probabilities: tensor([[0.4878, 0.2757, 0.2366],
        [0.5013, 0.2558, 0.2429],
        [0.4658, 0.2717, 0.2625],
        [0.4935, 0.2628, 0.2437]])
 - sentence: Everybody read that book.
 - prediction: ['Determiner', 'Determiner', 'Determiner', 'Determiner']
After training:
 - final probabilities: tensor([[0.0037

### What do the scores mean?
* The scores are used to predict the parts-of-speech for each word in a sentence. 
* A corresponding list of possible parts-of-speech is assigned to each word. 
* This list is the same for all data passed through the model. 

For example:

In [20]:
print('Let us take the sentence: {}'.format(' '.join(training_sentences[0])))
print('For the word "{}" the list of possible parts-of-speech are: {}'.format(training_sentences[0][0], [x for x in ix_to_tag.values()]))

Let us take the sentence: The dog ate the apple.
For the word "The" the list of possible parts-of-speech are: ['Determiner', 'Noun', 'Verb']


* A score is given to each part-of-speech in the list.
* The prediction is the part-of-speech with the highest score.
* For the word "The", the correct prediction is `Determiner`, so after training it should receive the highest score.
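
The steps above can be sketched in a few lines. The probabilities below are copied from the "After training" output for the first sentence; `argmax` is used here as a shorthand for the `torch.max` call in the earlier cells.

```python
import torch

ix_to_tag = {0: "Determiner", 1: "Noun", 2: "Verb"}

# Final probabilities for "The dog ate the apple." (one row per word).
tag_probabilities = torch.tensor([
    [0.6031, 0.2518, 0.1451],
    [0.0267, 0.9718, 0.0014],
    [0.0592, 0.0161, 0.9246],
    [0.9479, 0.0264, 0.0257],
    [0.0104, 0.9842, 0.0054],
])

# For each word, take the index of the highest score and map it to a tag.
best_indices = tag_probabilities.argmax(dim=1)
predictions = [ix_to_tag[ix] for ix in best_indices.tolist()]
print(predictions)  # ['Determiner', 'Noun', 'Verb', 'Determiner', 'Noun']
```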

## Save Model
Save the PyTorch model to disk. This model will be used in the deployment tutorial.

In [21]:
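models_path = os.path.join(os.getcwd(), 'models', 'model.pt')
os.makedirs(os.path.dirname(models_path), exist_ok=True)  # torch.save does not create missing directories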

In [22]:
torch.save(model.state_dict(), models_path)

## Load Model

In [23]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))

In [24]:
model.load_state_dict(torch.load(models_path))

In [25]:
model.eval()

LSTMTagger(
  (word_embeddings): Embedding(9, 6)
  (lstm): LSTM(6, 6)
  (hidden2tag): Linear(in_features=6, out_features=3, bias=True)
)
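
The save and load steps above follow the usual `state_dict` pattern: only the parameters are written to disk, so the model class must be instantiated again before loading. The sketch below shows the same round trip on a small `nn.Linear` layer, a hypothetical stand-in for the tagger, using a temporary directory so that nothing is left on disk.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
original = nn.Linear(6, 3)  # hypothetical stand-in for the trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pt")
    torch.save(original.state_dict(), path)

    # Re-create the architecture, then load the saved parameters into it.
    restored = nn.Linear(6, 3)
    restored.load_state_dict(torch.load(path))

restored.eval()  # switch to evaluation mode before making predictions

x = torch.randn(2, 6)
with torch.no_grad():
    same_output = torch.equal(original(x), restored(x))
print(same_output)
```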

Run a training example through the loaded model and make a prediction.

In [26]:
inputs = prepare_sequence(training_sentences[0], word_to_ix)
tag_scores = model(inputs)
tag_probabilities = tag_scores.exp()
max_values, max_indices = torch.max(tag_probabilities, 1)
predictions = [ix_to_tag[x] for x in max_indices.numpy()]
print('sentence: {}'.format(' '.join(training_sentences[0])))
print('parts-of-speech: {}'.format(predictions))

sentence: The dog ate the apple.
parts-of-speech: ['Determiner', 'Noun', 'Verb', 'Determiner', 'Noun']


### References:

1. [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)