# Machine Learning with PyTorch

## Natural Language Processing with AllenNLP

<font size="+1"><u><b>What is AllenNLP?</b></u></font>
<a href="AllenNLP_0.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">What is SpaCy?</font>
<a href="AllenNLP_1.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Hight Level Interfaces to NLP using PyTorch</font>
<a href="AllenNLP_2.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Sentiment Analysis</font>
<a href="AllenNLP_3.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">TBD</font> 
<a href="AllenNLP_4.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">TBD</font>
<a href="AllenNLP_5.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">TBD</font>
<a href="AllenNLP_6.ipynb"><img src="img/open-notebook.png" align="right"/></a>

As a minor matter, a number of functions in AllenNLP echo progress messages to STDERR in a way I find distracting for these lessons.  We can stash them in a log file instead.

In [1]:
from contextlib import redirect_stderr
log = open('stderr.log', 'w')

Also check for CUDA, which will make things run much faster.

In [2]:
import torch

if torch.cuda.is_available():
    device = 0
else:
    device = -1

### Credit

The material in this section is borrowed from Masato Hagiwara's [Training a Sentiment Analyzer using AllenNLP (in less than 100 lines of Python code)](http://www.realworldnlpbook.com/blog/training-sentiment-analyzer-using-allennlp.html).  I have made some minor changes to the code and provided my own commentary; I recommend all of his blog posts and other writing highly. I am very much looking forward to the release of his book _Real-World Natural Language Processing_, to be published in 2019 by Manning.

The dataset used in this example by Hagiwara, and hence by me, are provided by Stanford University's [Sentiment Analysis](https://nlp.stanford.edu/sentiment/index.html) research page.  This dataset was 
presented in the paper _Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank_ by Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts.  The dataset is provided with this repository for convenience.

### Sentiment tree

It is worthwhile to understand what the sentiment tree contains.  If we were only to assign sentiment values to single words, we would often miss the larger structure of overall sentence.  This of the famous saying popularly misattributed to Samual Johnson:

> Your manuscript is both good and original, but the part that is good is not original and the part that is original is not good

Every individual word in that has a positive or neutral sentiment, but overall it is a scalding criticism.  We can see whether our model identifies this example, but in general we want to look for larger phrases.

The Stanford dataset tags arbitrarily long phrases as well as individual words.  It uses 5-levels of sentiment, but the reader could be parameterized to use 3-level or 2-level by simplification.

In [3]:
import re
training = open('data/stanford/train.txt').readlines()
print("Num samples:", len(training))
samp = training[21].strip()
print("Example:    ", samp)
print("Unadorned:  ", 
      ' '.join(re.sub(r'[012345()]', '', samp).split()))

Num samples: 8544
Example:     (3 (2 ``) (3 (2 Frailty) (4 (2 '') (3 (4 (3 (2 has) (3 (2 been) (3 (4 (3 (3 written) (3 (2 so) (3 well))) (2 ,)) (2 (2 (2 that) (2 even)) (1 (2 (2 a) (2 simple)) (1 (2 ``) (0 Goddammit))))))) (2 !)) (2 '')))))
Unadorned:   `` Frailty '' has been written so well , that even a simple `` Goddammit ! ''


In [4]:
# AllenNLP comes with a reader for this format 
from allennlp.data.dataset_readers import stanford_sentiment_tree_bank
# The names for objects are rather long, use an abbrev for single use
SSTBDR = stanford_sentiment_tree_bank.StanfordSentimentTreeBankDatasetReader
reader = SSTBDR(granularity='5-class')

with redirect_stderr(log):
    train_dataset = reader.read('data/stanford/train.txt')
    dev_dataset = reader.read('data/stanford/dev.txt')

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


### Vocabulary

We also need to encode the vocabulary of the training set as integers.  The `Vocabulary` class provides a numerous features for exactly how this is accomplished.  For example, below we disregard any words that occur fewer than two times.

In [5]:
from allennlp.data.vocabulary import Vocabulary
vocab = Vocabulary.from_instances(train_dataset + dev_dataset,
                                  min_count={'tokens': 2})

100%|██████████| 9645/9645 [00:00<00:00, 56856.71it/s]


In [6]:
vocab

Vocabulary with namespaces:  tokens, Size: 9438 || labels, Size: 5 || Non Padded Namespaces: {'*labels', '*tags'}

In [7]:
for i in range(10):
    print(vocab.get_token_from_index(i), end=' | ')

@@PADDING@@ | @@UNKNOWN@@ | . | , | the | and | of | a | to | 's | 

### Embedding the vocabulary into tensors

We need to represent words in the vocabulary as vectors/tensors into a much less dimensional space than, for example, a one-hot encoding of all the words in the vocabulary.  Each word is mapped to one vector.  Moreover, in this embedding, the transform learns to give words that are used in similar ways comparatively similar vectors, thereby capturing their similarity.

An embedding layer is learned jointly with a neural network model 

In [8]:
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding

EMBEDDING_DIM = HIDDEN_DIM = 256

token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

# BasicTextFieldEmbedder for tokens, not for (unchanged) labels
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

### Sentiment Model

The model we create with AllenNLP is in many ways the same as we might with plain PyTorch.  But a number of things have been usefully abstracted for us as well.  This model is most useful using a recurrent neural network (such as LSTM) for its `encoder`, but it abstracts from the specific network type with the `Seq2VecEncoder` wrapper.

In [9]:
from typing import Dict   # AllenNLP makes wide use of type annotations
import torch
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder
from allennlp.nn.util import get_text_field_mask
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder
from allennlp.training.metrics import CategoricalAccuracy, F1Measure

class Classifier(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2VecEncoder,
                 vocab: Vocabulary,
                 positive_label: int = 4) -> None:
        super().__init__(vocab)
        
        # We need the embeddings to convert word IDs to their vector representations
        self.word_embeddings = word_embeddings

        # Seq2VecEncoder is a neural network abstraction that takes a sequence of something
        # (usually a sequence of embedded word vectors), processes it, and returns it as a single
        # vector. Oftentimes, this is an RNN-based architecture (e.g., LSTM or GRU), but
        # AllenNLP also supports CNNs and other simple architectures (for example,
        # just averaging over the input vectors).
        self.encoder = encoder

        # After converting a sequence of vectors to a single vector, we feed it into
        # a fully-connected linear layer to reduce the dimension to the total number of labels.
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))
        
        # Monitor the metrics - we use accuracy, as well as prec, rec, f1 for 4 (very positive)
        self.f1 = F1Measure(positive_label)        
        self.accuracy = CategoricalAccuracy()

        # We use the cross-entropy loss because this is a classification task.
        # Note that PyTorch's CrossEntropyLoss combines softmax and log likelihood loss,
        # which makes it unnecessary to add a separate softmax layer.
        self.loss_function = torch.nn.CrossEntropyLoss()

    # Instances are fed to forward after batching.
    # Fields are passed through arguments with the same name.
    def forward(self,
                tokens: Dict[str, torch.Tensor],
                label: torch.Tensor = None) -> torch.Tensor:
        # In deep NLP, when sequences of tensors in different lengths are batched together,
        # shorter sequences get padded with zeros to make them of equal length.
        # Masking is the process to ignore extra zeros added by padding
        mask = get_text_field_mask(tokens)

        # Forward pass
        embeddings = self.word_embeddings(tokens)
        encoder_out = self.encoder(embeddings, mask)
        logits = self.hidden2tag(encoder_out)

        # Returned output dictionary must contain a "loss" key for your model
        output = {"logits": logits}
        if label is not None:
            self.accuracy(logits, label)
            self.f1(logits, label)
            output["loss"] = self.loss_function(logits, label)
        return output    
    
    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        # Could add more reporting, e.g.:
        # precision, recall, f1 = self.f1.get_metric(reset)
        # return {'precision': precision, 'recall': recall, 'f1': f1}
        return {'accuracy': self.accuracy.get_metric(reset)}

In [10]:
from allennlp.modules.seq2vec_encoders import PytorchSeq2VecWrapper

lstm = PytorchSeq2VecWrapper(
    torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

model = Classifier(word_embeddings, lstm, vocab)
model = model.cuda(device)

In [11]:
import torch.optim as optim
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer

optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

iterator = BucketIterator(batch_size=32, sorting_keys=[("tokens", "num_tokens")])
iterator.index_with(vocab)

trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=dev_dataset,
                  patience=10,
                  num_epochs=40,
                  cuda_device=device)

trainer.train()

accuracy: 0.2708, loss: 1.5766 ||: 100%|██████████| 267/267 [00:02<00:00, 107.37it/s]
accuracy: 0.2552, loss: 1.5731 ||: 100%|██████████| 35/35 [00:00<00:00, 187.93it/s]
accuracy: 0.2757, loss: 1.5617 ||: 100%|██████████| 267/267 [00:01<00:00, 146.13it/s]
accuracy: 0.2552, loss: 1.5690 ||: 100%|██████████| 35/35 [00:00<00:00, 223.40it/s]
accuracy: 0.2899, loss: 1.5242 ||: 100%|██████████| 267/267 [00:02<00:00, 114.09it/s]
accuracy: 0.2916, loss: 1.5493 ||: 100%|██████████| 35/35 [00:00<00:00, 232.37it/s]
accuracy: 0.3897, loss: 1.3968 ||: 100%|██████████| 267/267 [00:02<00:00, 132.53it/s]
accuracy: 0.3415, loss: 1.5359 ||: 100%|██████████| 35/35 [00:00<00:00, 207.74it/s]
accuracy: 0.5080, loss: 1.1996 ||: 100%|██████████| 267/267 [00:02<00:00, 150.01it/s]
accuracy: 0.3660, loss: 1.5173 ||: 100%|██████████| 35/35 [00:00<00:00, 262.53it/s]
accuracy: 0.6071, loss: 0.9826 ||: 100%|██████████| 267/267 [00:01<00:00, 140.22it/s]
accuracy: 0.3715, loss: 1.5606 ||: 100%|██████████| 35/35 [00:00

{'best_epoch': 4,
 'peak_cpu_memory_MB': 2989.528,
 'peak_gpu_0_memory_MB': 900,
 'training_duration': '00:00:34',
 'training_start_epoch': 0,
 'training_epochs': 13,
 'epoch': 13,
 'training_accuracy': 0.9005149812734082,
 'training_loss': 0.28951838693480364,
 'training_cpu_memory_MB': 2989.528,
 'training_gpu_0_memory_MB': 900,
 'validation_accuracy': 0.3178928247048138,
 'validation_loss': 3.4437111752373832,
 'best_validation_accuracy': 0.36603088101725706,
 'best_validation_loss': 1.5173429216657366}

### Making predictions

It is straightfoward to make predictions once we have a trained model.  We need to wrap the model itself in an actual predictor, such as the one [provided by Dr. Hagiwara](https://github.com/mhagiwara/realworldnlp) named `SentenceClassifierPredictor`.  But making a prediction follows the somewhat more intuitive scikit-learn style of calling a `.predict()` method rather than the `pytorch.nn` style of calling the model itself. 

I find it interesting that the classification chosen is not always strongly univocal from the model, and in some cases two far apart options are of similar preference.  In the ideal case, one logit value would be strongly greater than all others, but that is not always the case. I.e. possibly slightly more training data or slightly different parameters might tip the scale between very different predictions.

In [12]:
import numpy as np
from src.predictors import SentenceClassifierPredictor

sentiments = {'0': "Very negative",
              '1': "Mildly negative",
              '2': "Neutral",
              '3': "Mildly positive",
              '4': "Very positive"}

phrases = ["This is the best movie ever!",
           "This is the worst movie ever!",
           "The part that is good is not original, the part that is original is not good.",
           "A day that will live in infamy.",
           "When one burns one's bridges, what a very nice fire it makes.",
           "You will contract a rare disease.",
           "The only people for me are the mad one.",
           "Never give an inch!",
           "This movie was actually neither that funny, nor super witty.",
          ]
        
for phrase in phrases:
    predictor = SentenceClassifierPredictor(model, dataset_reader=reader)
    logits = predictor.predict(phrase)['logits']
    ranking = np.argsort(logits)
    first = ranking[-1]
    second = ranking[-2]

    sentiment = model.vocab.get_token_from_index(first, 'labels')
    sentiment2 = model.vocab.get_token_from_index(second, 'labels')
    print(f'{logits[first]:5.1f} {sentiments[sentiment]:15s} | {phrase}',
          f'\n{logits[second]:5.1f} {sentiments[sentiment2]}\n')

  2.7 Very positive   | This is the best movie ever! 
  0.4 Very negative

  3.3 Very negative   | This is the worst movie ever! 
  0.1 Neutral

  5.8 Very positive   | The part that is good is not original, the part that is original is not good. 
  1.3 Mildly positive

  5.6 Mildly positive | A day that will live in infamy. 
  1.5 Neutral

  3.4 Neutral         | When one burns one's bridges, what a very nice fire it makes. 
  1.1 Mildly negative

  5.3 Very positive   | You will contract a rare disease. 
  1.0 Mildly positive

  4.4 Mildly negative | The only people for me are the mad one. 
  2.9 Neutral

  2.1 Neutral         | Never give an inch! 
  1.8 Mildly positive

  6.0 Very negative   | This movie was actually neither that funny, nor super witty. 
  3.9 Very positive



### Adjusting the network

We can experiment with other network details easily enough using the scaffolding we have already created.  For example, perhaps we speeculate that a gated recurrent unit (GRU) will work better for the recurrent layer than a multi-layer long short-term memory (LSTM).  Moreover, we also want to try using RMSprop rather than Adam for the optimizer.  Plus we decrease the `patience` to cause a faster step-down in the learning rate.

These particular changes are not particularly theory based (but they are not absurd either); the example here is simply to show how we can easily vary those design details.

In [13]:
# Different optimizer
optimizer = optim.RMSprop(model.parameters(), lr=1e-4, weight_decay=1e-5)

# Different RNN layer
gru = PytorchSeq2VecWrapper(
    torch.nn.GRU(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

model = Classifier(word_embeddings, gru, vocab).cuda(device)

# Different patience for LR decay
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=dev_dataset,
                  patience=5,
                  num_epochs=20,
                  cuda_device=device)

trainer.train()

accuracy: 0.2514, loss: 1.6065 ||: 100%|██████████| 267/267 [00:01<00:00, 152.49it/s]
accuracy: 0.2625, loss: 1.6050 ||: 100%|██████████| 35/35 [00:00<00:00, 261.42it/s]
accuracy: 0.2611, loss: 1.6009 ||: 100%|██████████| 267/267 [00:02<00:00, 120.87it/s]
accuracy: 0.2634, loss: 1.6014 ||: 100%|██████████| 35/35 [00:00<00:00, 268.09it/s]
accuracy: 0.2614, loss: 1.5961 ||: 100%|██████████| 267/267 [00:02<00:00, 131.71it/s]
accuracy: 0.2634, loss: 1.5979 ||: 100%|██████████| 35/35 [00:00<00:00, 265.93it/s]
accuracy: 0.2616, loss: 1.5918 ||: 100%|██████████| 267/267 [00:02<00:00, 114.17it/s]
accuracy: 0.2634, loss: 1.5947 ||: 100%|██████████| 35/35 [00:00<00:00, 231.36it/s]
accuracy: 0.2617, loss: 1.5878 ||: 100%|██████████| 267/267 [00:02<00:00, 126.02it/s]
accuracy: 0.2634, loss: 1.5914 ||: 100%|██████████| 35/35 [00:00<00:00, 259.88it/s]
accuracy: 0.2625, loss: 1.5841 ||: 100%|██████████| 267/267 [00:02<00:00, 121.07it/s]
accuracy: 0.2625, loss: 1.5889 ||: 100%|██████████| 35/35 [00:00

{'best_epoch': 19,
 'peak_cpu_memory_MB': 3019.44,
 'peak_gpu_0_memory_MB': 922,
 'training_duration': '00:00:49',
 'training_start_epoch': 0,
 'training_epochs': 19,
 'epoch': 19,
 'training_accuracy': 0.39279026217228463,
 'training_loss': 1.5534012362305145,
 'training_cpu_memory_MB': 3019.44,
 'training_gpu_0_memory_MB': 922,
 'validation_accuracy': 0.30154405086285196,
 'validation_loss': 1.569025080544608,
 'best_validation_accuracy': 0.30154405086285196,
 'best_validation_loss': 1.569025080544608}

## Next Lesson

**Enhancing an Image Classifier**: Libraries built on top of PyTorch offer very powerful tools for natural language processing.  Next we will return to image classification, and build on pretrained models as a demonstration of re-use neural networks.

<a href="ImageClassifier.ipynb"><img src="img/open-notebook.png" align="left"/></a>