# Machine Learning with PyTorch

## Tasks with Networks

<font size="+1">A simple feature classifier</font>
<a href="NetworkExamples_0.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">An image classifier</font>
<a href="NetworkExamples_1.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">A regression prediction</font>
<a href="NetworkExamples_2.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Clustering with PyTorch</font>
<a href="NetworkExamples_3.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Generative Adversarial Networks (GAN)</font> 
<a href="NetworkExamples_4.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1"><u><b>Part of Speech Tagger</b></u></font>
<a href="NetworkExamples_5.ipynb"><img src="img/open-notebook.png" align="right"/></a>

## Part of Speech Tagger

In this lesson we create an LSTM-based recurrent neural network to identify parts of speech.  This example is taken very closely from the official [PyTorch tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#example-an-lstm-for-part-of-speech-tagging) by Robert Guthrie.  Some minor changes are made to the code and new commentary is introduced.

The example we look at here is reworked and enhanced in an official AllenNLP tutorial that is also presented in the next chapter.  Comparing the styles of these two notebooks is a useful exercise.  The example here uses a very small toy training set; however, the same network *could* be trained against a robust corpus and produce reasonably good classification.

### Remembering state in a network

As was mentioned briefly, the class of network layers called Recurrent Neural Networks is able to remember state within a layer. In the case of an LSTM (Long Short-Term Memory) layer, there are extra tensors defining the *hidden state* along with the direct state of the weights for that layer.  The hidden state is able to remember information from arbitrary earlier states.

In particular, an LSTM has three additional activation functions for the hidden state.  One activation function (also called a "gate") is the *input gate*, the next is the *forget gate*, and the last is the *output gate*.  Each of these gates is parameterized differently, and they each take the same inputs as would a fully-connected (linear) layer.  The key difference is that *some but not all* of the current hidden state will contribute to the input gate.

Whether a particular neuron is involved in feedback is based simply on the activation of its output gate.  In addition, depending on the parameters of the forget gate and the inputs, that hidden neuron may become free to be retrained on new information.  An illustration at the [A.I. Wiki](https://skymind.ai/wiki/lstm#long) provides a good overview.

![LSTM gating](img/gers_lstm.png)
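
As a rough sketch of the gating described above, the block below computes a single LSTM time step by hand and compares it with PyTorch's `nn.LSTMCell`.  The sizes and the helper name `manual_lstm_step` are illustrative assumptions and are not part of this lesson's model.  The sketch follows the common formulation in which, alongside the three gates, a candidate value is written into a separate cell state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only; the tagger below uses its own dimensions
cell = nn.LSTMCell(input_size=3, hidden_size=4)
x = torch.randn(1, 3)   # one input vector (batch of 1)
h = torch.zeros(1, 4)   # previous hidden state
c = torch.zeros(1, 4)   # previous cell state

def manual_lstm_step(cell, x, h, c):
    # All four blocks share the same inputs: the current x and the previous h
    gates = (x @ cell.weight_ih.T + cell.bias_ih
             + h @ cell.weight_hh.T + cell.bias_hh)
    # PyTorch stores the blocks in the order: input, forget, candidate, output
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_next = f * c + i * g            # forget part of the old state, admit new
    h_next = o * torch.tanh(c_next)   # output gate decides what is exposed
    return h_next, c_next

with torch.no_grad():
    h_manual, c_manual = manual_lstm_step(cell, x, h, c)
    h_ref, c_ref = cell(x, (h, c))

print(torch.allclose(h_manual, h_ref, atol=1e-6))
print(torch.allclose(c_manual, c_ref, atol=1e-6))
```

The forget gate scales the old cell state and the input gate scales the new candidate, which is how a unit can either hold its contents or be overwritten.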

### Defining the network

The network we create here is fairly similar to some others.  It is not very deep, as these go, but the LSTM provides a kind of "depth" within its one layer by retaining memory in hidden state.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

We do just a bit of bookkeeping up front.  We need some training data, which here we simply define as a list of sentences paired with their parts of speech.  In a more fleshed-out model, the training examples would be much more numerous, would probably live in external files or databases, and would use a standard tagging format.  Here each example is a pair of parallel lists: the words and their part-of-speech tags.

In [None]:
training_data = [
    ("the dog ate the apple".split(), ["DET", "NOUN", "VERB", "DET", "NOUN"]),
    ("we often ate pie".split(), ["PRON", "PART", "VERB", "NOUN"]),
    ("everybody read that book".split(), ["NOUN", "VERB", "DET", "NOUN"]),
    ("do not dog me".split(), ["VERB", "PART", "VERB", "PRON"]),
    ("do not dog the dog".split(), ["VERB", "PART", "VERB", "DET", "NOUN"]),
    ("we dog everybody".split(), ["PRON", "VERB", "NOUN"])
]

word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
            
print(word_to_ix)

We need a function to encode words as integers and generate tensors based on those.  The same function also encodes the tags.

In [None]:
def prepare_sequence(seq, to_ix):
    # Map each token (a word or a tag) to its integer index
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)
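
A quick usage sketch may help show what the helper returns.  The mapping `toy_word_to_ix` is a hypothetical four-word vocabulary introduced only for this illustration, and the helper is repeated (with a neutral parameter name) so the block runs on its own.

```python
import torch

def prepare_sequence(seq, to_ix):
    # Same logic as the helper above: look up each token's integer index
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# Hypothetical vocabulary, used only in this example
toy_word_to_ix = {"the": 0, "dog": 1, "ate": 2, "apple": 3}

encoded = prepare_sequence("the dog ate the apple".split(), toy_word_to_ix)
print(encoded)        # tensor([0, 1, 2, 0, 3])
print(encoded.dtype)  # torch.int64

# A word that is missing from the mapping raises KeyError
try:
    prepare_sequence(["the", "cat"], toy_word_to_ix)
except KeyError as err:
    print("unknown word:", err)
```

Because the lookup is a plain dictionary access, a word never seen in training cannot be encoded; the test sentences later in this lesson avoid that by reusing known words.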

We also need to give numeric values to the target tags.  In a more robust arrangement, we would probably generate these from the collection of tags actually seen in training data.  For these five parts-of-speech, we simply hard code them.  The order is not significant.

In [None]:
tag_to_ix = {"DET": 0, "NOUN": 1, "VERB": 2, "PRON": 3, "PART": 4}
ix_to_tag = {v:k for (k, v) in tag_to_ix.items()}
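
For comparison, here is a minimal sketch of deriving the tag indices from the training data instead of hard coding them.  The helper name `build_tag_index` and the `derived_` variable names are assumptions made for this illustration, and only two of the training sentences are repeated to keep the block short.

```python
# Two of the training sentences, repeated so the block runs on its own
training_data = [
    ("the dog ate the apple".split(), ["DET", "NOUN", "VERB", "DET", "NOUN"]),
    ("we often ate pie".split(), ["PRON", "PART", "VERB", "NOUN"]),
]

def build_tag_index(data):
    # Assign the next free integer to each tag the first time it is seen
    tag_index = {}
    for _, tags in data:
        for tag in tags:
            tag_index.setdefault(tag, len(tag_index))
    return tag_index

derived_tag_to_ix = build_tag_index(training_data)
derived_ix_to_tag = {ix: tag for tag, ix in derived_tag_to_ix.items()}
print(derived_tag_to_ix)
```

With these two sentences the derived mapping happens to match the hard-coded one, since the tags first appear in the same order.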

### Defining the model

This is where the real work happens, but there is surprisingly little of it needed.  We simply initialize with our layers, and create a very simple forward function.  

We need to represent words in the vocabulary as vectors/tensors in a lower-dimensional space than, for example, a one-hot encoding of all the words in the vocabulary.  Each word is mapped to one vector.  Moreover, the embedding learns to give words that are used in similar ways comparatively similar vectors, thereby capturing their similarity.

In this particular toy example, the original tensor space does not have very many dimensions since the vocabulary is small.  But we reduce it further, both for tractability and because, with a more robust training set, distances between the embedded vectors could reveal similarities between words.

An embedding layer is learned jointly with a neural network model.
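
To make the relationship to one-hot encoding concrete, the sketch below looks up a few word indices in an `nn.Embedding` and checks that the result equals multiplying one-hot rows by the embedding's weight matrix.  The sizes (14 words, 6 dimensions) mirror this toy example, and the weights are untrained random values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 14, 6
embedding = nn.Embedding(vocab_size, embedding_dim)

# Indices for "the dog ate the apple" under the vocabulary built earlier
word_ids = torch.tensor([0, 1, 2, 0, 3], dtype=torch.long)

vectors = embedding(word_ids)
print(vectors.shape)  # torch.Size([5, 6])

# The lookup is equivalent to a one-hot encoding times the weight matrix
one_hot = F.one_hot(word_ids, num_classes=vocab_size).float()
same_vectors = one_hot @ embedding.weight
print(torch.allclose(vectors, same_vectors))
```

Both occurrences of "the" map to the same row of the weight matrix, and that row is updated by backpropagation like any other parameter.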

In [None]:
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
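
The reshaping in `forward` is easier to follow with the tensor shapes written out.  This is a sketch using standalone, untrained layers with the same sizes as the toy model; the variable names outside the class are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, vocab_size, embedding_dim, hidden_dim, tagset_size = 5, 14, 6, 6, 5
word_embeddings = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([0, 1, 2, 0, 3], dtype=torch.long)

with torch.no_grad():
    embeds = word_embeddings(sentence)                  # (seq_len, embedding_dim)
    # nn.LSTM expects (seq_len, batch, features); here the batch size is 1
    lstm_in = embeds.view(seq_len, 1, -1)               # (5, 1, 6)
    lstm_out, (h_n, c_n) = lstm(lstm_in)                # (5, 1, 6)
    tag_space = hidden2tag(lstm_out.view(seq_len, -1))  # (5, 5)
    tag_scores = F.log_softmax(tag_space, dim=1)        # one row per word

print(embeds.shape, lstm_in.shape, lstm_out.shape, tag_scores.shape)
# Each row of log-probabilities exponentiates to a distribution over tags
print(torch.exp(tag_scores).sum(dim=1))
```

The second value returned by the LSTM, which `forward` discards, holds the final hidden and cell states; for this single-layer, single-direction LSTM the final hidden state equals the last row of `lstm_out`.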

### Training the model

The training regime is similar to that used in other networks.  We do not worry about some of the fancier steps, like a decaying learning rate, for this toy example.  The code below simply goes through 300 epochs with no early exit or tweaking.

In [None]:
# These will usually be on the order of 32 or 64 dimensions for a
# real-world vocabulary and training set size
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print('Initial')
    print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
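
The loss used in the loop, `nn.NLLLoss`, expects log-probabilities, which is why the model ends with `log_softmax`.  The sketch below, using made-up random scores and targets, shows that this pairing gives the same value as picking out the log-probability of each correct tag by hand, and the same value as `nn.CrossEntropyLoss` applied to the raw scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up raw scores for 5 words over 5 tags, plus made-up target tags
raw_scores = torch.randn(5, 5)
targets = torch.tensor([0, 1, 2, 0, 1])

log_probs = F.log_softmax(raw_scores, dim=1)
nll = nn.NLLLoss()(log_probs, targets)

# By hand: average the negative log-probability of each correct tag
manual = -log_probs[torch.arange(5), targets].mean()

# CrossEntropyLoss combines log_softmax and NLLLoss in one step
ce = nn.CrossEntropyLoss()(raw_scores, targets)

print(nll.item(), manual.item(), ce.item())
```

Either pairing could be used for training; keeping `log_softmax` inside the model means its output can be read directly as log-probabilities at prediction time.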

### Making predictions

First let us tag some novel sentences.  These sentences, however, do not contain any novel words.  Significantly, the same word `dog` is used as both a noun and a verb in the training set, and in both ways within the same sentence in the test set.  The tagger correctly identifies the role of each word.

In [None]:
test_data = [
    "the dog read the book".split(),
    "we dog the dog".split()
]

with torch.no_grad():
    for sentence in test_data:
        inputs = prepare_sequence(sentence, word_to_ix)
        for word, scores in zip(sentence, model(inputs)):
            best = torch.argmax(scores).item()
            part = ix_to_tag[best]
            print(f"{word}[{part}]", end=' ')
        print('\n')

Let us also look at the log-softmax scores word-by-word in one sentence.  The "winner" for the part-of-speech for each word is rather strongly preferred, but in principle we could further rank "probabilities" based on the output scores.

In [None]:
import pandas as pd

log_softmax = model(prepare_sequence(test_data[0], word_to_ix))
sentence = pd.DataFrame(log_softmax.detach().numpy(), columns=tag_to_ix)
sentence.index = test_data[0]
sentence
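
One way to rank the alternatives is to exponentiate the log-softmax scores, which yields probabilities that sum to one for each word.  The sketch below uses a made-up row of scores for a single word, not actual output from the trained model.

```python
import torch

tags = ["DET", "NOUN", "VERB", "PRON", "PART"]

# Made-up raw scores for one word; a trained model would supply real ones
raw = torch.tensor([[0.1, 3.0, 1.5, -1.0, -2.0]])
log_probs = torch.log_softmax(raw, dim=1)

# exp() of a log-probability recovers the probability itself
probs = torch.exp(log_probs)

# Sort the tags from most to least probable
order = torch.argsort(probs[0], descending=True)
ranking = [(tags[i], round(probs[0, i].item(), 3)) for i in order.tolist()]
for tag, p in ranking:
    print(tag, p)
```

The ordering is identical whether we sort by log-probability or by probability, since `exp` is monotonic; the conversion only makes the margins easier to read.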

## Next Chapter

**Natural Language Processing**: This lesson looked at part-of-speech tagging with an LSTM.  In the next chapter we turn to the AllenNLP extension to PyTorch.

<a href="AllenNLP_0.ipynb"><img src="img/open-notebook.png" align="left"/></a>