# Machine Learning with PyTorch

## Natural Language Processing with AllenNLP

<font size="+1">What is AllenNLP?</font>
<a href="AllenNLP_0.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">What is SpaCy?</font>
<a href="AllenNLP_1.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">High Level Interfaces to NLP using PyTorch</font>
<a href="AllenNLP_2.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1">Sentiment Analysis</font>
<a href="AllenNLP_3.ipynb"><img src="img/open-notebook.png" align="right"/></a>

<font size="+1"><b><u>Part-of-Speech Tagging</u></b></font> 
<a href="AllenNLP_4.ipynb"><img src="img/open-notebook.png" align="right"/></a>

## Part-of-Speech Tagging

This lesson is largely based on an [official tutorial](https://github.com/allenai/allennlp/tree/master/tutorials/tagger) in the AllenNLP GitHub repository, by [Joel Grus](https://joelgrus.com/). A similar but somewhat different version is on the [AllenNLP website](https://allennlp.org/tutorials) (I'm unsure about the author(s) of that one).  I borrow a few ideas from both of these, and both of them are fairly closely modeled on the PyTorch [parts-of-speech tutorial](https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html#example-an-lstm-for-part-of-speech-tagging) that I present in an earlier chapter.

There is actually *more* scaffolding here than in the basic PyTorch example, but it also allows more configurability.  In the main, the model and setup are largely the same as we saw with Sentiment Analysis; the main difference is simply that the training data is tagged differently for this different purpose.

As a minor matter, a number of functions in AllenNLP echo progress messages to STDERR in a way I find distracting for these lessons. We can stash them in a log file instead.

In [None]:
from contextlib import redirect_stderr
log = open('stderr.log', 'w')
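To make the mechanism concrete, here is a minimal sketch of how `redirect_stderr` behaves. The `noisy_step()` function is a hypothetical stand-in for any library call that reports progress on STDERR, and an in-memory `StringIO` buffer stands in for the log file.

```python
import io
import sys
from contextlib import redirect_stderr

def noisy_step() -> int:
    """Hypothetical stand-in for a library call that reports progress on STDERR."""
    print("progress: 100%", file=sys.stderr)
    return 42

buffer = io.StringIO()        # in-memory stand-in for the log file
with redirect_stderr(buffer):
    result = noisy_step()     # the message lands in `buffer`, not in the notebook

captured = buffer.getvalue()
print(result)
print(repr(captured))
```

Only the text written to STDERR is diverted; the return value of the wrapped call is unaffected.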

Also check for CUDA, which will make things run much faster.

In [None]:
import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()
    device = 0
    print(torch.cuda.get_device_name(device))
    print(f"GPU memory used: {torch.cuda.memory_allocated(device):,}")
else:
    device = -1  # -1 tells AllenNLP to run on the CPU

### Dataset Reader

Although we only read a tiny toy example here, in concept we might use this reader to read in a real-world tagged corpus.

In [None]:
# AllenNLP typically uses type annotations widely
# The following are needed to satisfy the type annotations of `PosDatasetReader`
from typing import Iterator, List, Dict
from allennlp.data.token_indexers import TokenIndexer
from allennlp.data.tokenizers import Token
from allennlp.data import Instance

The `._read()` method is a required method for subclasses of `DatasetReader`.  It yields a sequence of these slightly funny `Instance` objects (based on tokens and tags). The `.text_to_instance()` method is just a helper with no special significance to the `DatasetReader` parent.

In [None]:
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.token_indexers import SingleIdTokenIndexer

class PosDatasetReader(DatasetReader):
    """DatasetReader for PoS tagging data
    
    One sentence per line, e.g.

        The###DET dog###NN ate###V the###DET apple###NN
    """
    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        super().__init__(lazy=False)
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"sentence": sentence_field}
        
        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
            fields["labels"] = label_field

        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:
            for line in f:
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags)

reader = PosDatasetReader()
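The parsing step inside `._read()` can be tried on its own, without any AllenNLP objects. This is a minimal sketch; the function name `parse_tagged_line` is hypothetical, and the guard for blank lines is my addition rather than something the reader above does.

```python
from typing import List, Tuple

def parse_tagged_line(line: str, sep: str = "###") -> Tuple[List[str], List[str]]:
    """Split 'The###DET dog###NN' into (['The', 'dog'], ['DET', 'NN'])."""
    pairs = [pair.split(sep) for pair in line.strip().split()]
    if not pairs:
        # A blank line has nothing to unzip; return empty lists instead
        return [], []
    # zip(*pairs) transposes [(word, tag), ...] into (words...), (tags...)
    words, tags = zip(*pairs)
    return list(words), list(tags)

words, tags = parse_tagged_line("The###DET dog###NN ate###V the###DET apple###NN\n")
print(words)
print(tags)
```

The `zip(*...)` idiom is the same one used in `._read()`: it turns a list of (word, tag) pairs into one sequence of words and one of tags.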

### Read in the training and validation datasets

In this case, the training data is two tagged sentences, and the validation data is one more sentence.  We could point to other URLs for non-toy data.

In [None]:
!echo Training
!curl https://raw.githubusercontent.com/allenai/allennlp/master/tutorials/tagger/training.txt
    
!echo "\nValidation"
!curl https://raw.githubusercontent.com/allenai/allennlp/master/tutorials/tagger/validation.txt

Much of what AllenNLP provides as utility functions is similar to what many other libraries offer.  For example, `cached_path()` simply downloads URLs but caches contents; there is nothing NLP or ML specific about it.
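The download-and-cache idea can be sketched in a few lines. This is not how `cached_path()` is implemented; `cached_fetch` and `fake_fetch` are hypothetical names, and the fetcher is passed in so the example runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List

def cached_fetch(url: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for `url`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cached file after a hash of the URL
    local = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not local.exists():
        local.write_bytes(fetch(url))
    return local

calls: List[str] = []

def fake_fetch(url: str) -> bytes:
    """Hypothetical downloader that records each call instead of using the network."""
    calls.append(url)
    return b"The###DET dog###NN\n"

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.org/training.txt", Path(tmp), fake_fetch)
    second = cached_fetch("https://example.org/training.txt", Path(tmp), fake_fetch)
    same_path = first == second
    content = second.read_bytes()

print(len(calls), same_path)
```

The second request is served from disk, so the fetcher runs only once.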

In [None]:
from allennlp.data.vocabulary import Vocabulary
from allennlp.common.file_utils import cached_path

with redirect_stderr(log):
    train_dataset = reader.read(cached_path(
        'https://raw.githubusercontent.com/allenai/allennlp'
        '/master/tutorials/tagger/training.txt'))
    
    validation_dataset = reader.read(cached_path(
        'https://raw.githubusercontent.com/allenai/allennlp'
        '/master/tutorials/tagger/validation.txt'))

    # Vocabulary maps {token -> id} (and reverse mapping).
    vocab = Vocabulary.from_instances(train_dataset + validation_dataset)
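Conceptually, a vocabulary is just a pair of lookup tables. The sketch below uses a hypothetical `TinyVocab` class to show the idea; the real `Vocabulary` does more, such as keeping separate namespaces (for example `'tokens'` and `'labels'`) and reserving entries for padding and unknown tokens.

```python
from typing import Dict, Iterable, List

class TinyVocab:
    """Hypothetical two-way token/id mapping (not AllenNLP's implementation)."""

    def __init__(self, sentences: Iterable[List[str]]) -> None:
        self.token_to_index: Dict[str, int] = {}
        self.index_to_token: Dict[int, str] = {}
        for sentence in sentences:
            for token in sentence:
                if token not in self.token_to_index:
                    index = len(self.token_to_index)  # next unused id
                    self.token_to_index[token] = index
                    self.index_to_token[index] = token

    def __len__(self) -> int:
        return len(self.token_to_index)

corpus = [["The", "dog", "ate", "the", "apple"],
          ["Everybody", "read", "that", "book"]]
tiny_vocab = TinyVocab(corpus)
ids = [tiny_vocab.token_to_index[token] for token in corpus[0]]
round_trip = [tiny_vocab.index_to_token[i] for i in ids]
print(len(tiny_vocab), ids, round_trip)
```

Note that "The" and "the" get different ids here, since nothing lowercases the tokens.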

### Define the model

In [None]:
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder
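# `get_text_field_mask` is used in `forward()` below, so import it in this cell
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits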
from allennlp.training.metrics import CategoricalAccuracy

class LstmTagger(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:
        # Pass the vocab to the base class constructor.
        super().__init__(vocab)
        
        # Store embedding and encoder
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        self.vocab = vocab
        
        # Examine encoder to find input dimension, vocab to find output dimension
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))
        
        # Categorical accuracy metric
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        # AllenNLP pads the shorter inputs so that batch has uniform shape,
        # ...use a mask to exclude the padding.
        # `get_text_field_mask()` returns a tensor of 0s and 1s for padded and unpadded
        mask = get_text_field_mask(sentence)
        
        # Convert sentence as a sequence of token ids into a sequence of embedded tensors.
        embeddings = self.word_embeddings(sentence)
        
        # Pass embedded tensors and mask to the LSTM
        encoder_out = self.encoder(embeddings, mask)

        # Pass each encoded output tensor to feedforward layer to produce logits 
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}

        # Calculate accuracy and loss only when labels are supplied.
        # NOTE: At prediction time no labels are passed, so this check
        #   is essential; computing the metric or loss without labels
        #   would crash.
        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)

        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}
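To see what the mask contributes to the loss, here is a simplified, pure-Python sketch of masked cross-entropy for a single padded sentence. The helper names are hypothetical, and this is not the implementation of `sequence_cross_entropy_with_logits`, which works on batched tensors and offers more options.

```python
import math
from typing import List

def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's tag scores."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def masked_cross_entropy(logits: List[List[float]],
                         labels: List[int],
                         mask: List[int]) -> float:
    """Mean negative log-likelihood over positions where mask == 1."""
    total = 0.0
    for position_logits, label, keep in zip(logits, labels, mask):
        if keep:  # padded positions contribute nothing
            total -= log_softmax(position_logits)[label]
    return total / max(sum(mask), 1)

# One sentence of 2 real tokens padded to length 3, with 3 possible tags
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [9.0, -9.0, 0.0]]
labels = [1, 2, 0]
mask = [1, 1, 0]
loss = masked_cross_entropy(logits, labels, mask)
print(round(loss, 4))  # uniform scores over 3 tags give ln(3), about 1.0986
```

The extreme scores at the padded third position have no effect on the result, which is the point of passing `mask` along with the logits and labels.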

### Instantiate the model

In [None]:
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.seq2seq_encoders import PytorchSeq2SeqWrapper
from allennlp.modules.token_embedders import Embedding

EMBEDDING_DIM = 6
HIDDEN_DIM = 6

token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))

model = LstmTagger(word_embeddings, lstm, vocab)
# Move the model to the GPU only if one was detected (device is -1 on CPU)
if device >= 0:
    model = model.cuda(device)

### Train the model

We need to choose an optimizer for our model, in this case SGD (stochastic gradient descent).  We create an iterator using the simple `BasicIterator` from AllenNLP.  There is a bit to set up here, but it is also very boilerplate, and very similar to the last lesson.

One thing we will utilize here is a learning rate scheduler.  There are many of these in PyTorch itself; AllenNLP wraps them slightly for its own APIs.  We did not use such a utility class in the training loops of prior lessons, but rather adjusted things more manually in our loops.  In this case, we pass all the work to the parameterized `Trainer` object.

In [None]:
import torch.optim as optim
from allennlp.data.iterators import BasicIterator
from allennlp.common.params import Params
from allennlp.training.learning_rate_schedulers import LearningRateScheduler
from allennlp.training.trainer import Trainer
from allennlp.nn.util import get_text_field_mask

optimizer = optim.SGD(model.parameters(), lr=0.1)
iterator = BasicIterator(batch_size=2)
iterator.index_with(vocab)

params = Params({"type": "reduce_on_plateau"})
scheduler = LearningRateScheduler.from_params(optimizer, params)

trainer = Trainer(model=model,
                  optimizer=optimizer,
                  learning_rate_scheduler=scheduler,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=1000,
                  cuda_device=device)
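The idea behind a "reduce on plateau" schedule is simple enough to sketch in plain Python. The `PlateauSchedule` class below is hypothetical and much simplified compared with the PyTorch scheduler that AllenNLP wraps; the `factor` and `patience` values are chosen only to keep the example short.

```python
from typing import List

class PlateauSchedule:
    """Hypothetical, simplified reduce-on-plateau rule for a lower-is-better metric."""

    def __init__(self, lr: float, factor: float = 0.1, patience: int = 2) -> None:
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> float:
        """Record one epoch's metric and return the (possibly reduced) learning rate."""
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # No improvement for more than `patience` epochs: shrink the rate
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr

schedule = PlateauSchedule(lr=0.1, factor=0.5, patience=2)
losses: List[float] = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]
history = [schedule.step(loss) for loss in losses]
print(history)  # [0.1, 0.1, 0.1, 0.1, 0.05, 0.05]
```

The scheduler's patience is a separate setting from the `patience=10` passed to `Trainer` above, which controls how long training continues without improvement before stopping early.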

In [None]:
import pickle
RETRAIN = False

In [None]:
if RETRAIN:
    # During training, the loss should go down and the accuracy up
    with redirect_stderr(log):
        summary = trainer.train()
        
    model.summary = summary
    with open('data/parts-of-speech-model.pkl', 'wb') as fh:
        pickle.dump(model, fh)
else:
    with open('data/parts-of-speech-model.pkl', 'rb') as fh:
        model = pickle.load(fh)
    
for k, v in model.summary.items():
    print(k.rjust(25), '|', v)

Training the model will display something like this, with progress bars advancing during training (almost all lines omitted here, but those shown are in order):

<pre style="background-color:#FFDDDD">
accuracy: 0.3333, loss: 1.1268 ||: 100%|██████████| 1/1 [00:00&lt;00:00, 406.70it/s]
accuracy: 0.4444, loss: 1.1165 ||: 100%|██████████| 1/1 [00:00&lt;00:00, 102.56it/s]
accuracy: 0.4444, loss: 1.1051 ||: 100%|██████████| 1/1 [00:00&lt;00:00, 347.99it/s]
accuracy: 0.6667, loss: 0.7536 ||: 100%|██████████| 1/1 [00:00&lt;00:00, 147.59it/s]
accuracy: 0.8889, loss: 0.6482 ||: 100%|██████████| 1/1 [00:00&lt;00:00, 102.44it/s]
accuracy: 0.8889, loss: 0.5342 ||: 100%|██████████| 1/1 [00:00&lt;00:00, 103.33it/s]
accuracy: 1.0000, loss: 0.0727 ||: 100%|██████████| 1/1 [00:00&lt;00:00, 118.90it/s]
accuracy: 1.0000, loss: 0.0316 ||: 100%|██████████| 1/1 [00:00&lt;00:00, 106.72it/s]
accuracy: 1.0000, loss: 0.0294 ||: 100%|██████████| 1/1 [00:00&lt;00:00, 90.12it/s]
</pre>

In [None]:
import numpy as np
from allennlp.predictors import SentenceTaggerPredictor

predictor = SentenceTaggerPredictor(model, dataset_reader=reader)
sentence = "Everybody ate the apple"
tag_logits = predictor.predict(sentence)['tag_logits']

tag_ids = np.argmax(tag_logits, axis=-1)

parts = [model.vocab.get_token_from_index(i, 'labels') for i in tag_ids]
print(" ".join(f"{word}###{part}" 
               for word, part in zip(sentence.split(), parts)))
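The decoding step in the cell above is just an argmax per word followed by a lookup. Here is a pure-Python sketch of the same logic; the logits and the `index_to_tag` mapping are made-up values for illustration, not output from the trained model.

```python
from typing import Dict, List

def decode_tags(tag_logits: List[List[float]], index_to_tag: Dict[int, str]) -> List[str]:
    """Pick the highest-scoring tag for each word."""
    tags = []
    for scores in tag_logits:
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tags.append(index_to_tag[best])
    return tags

# Hypothetical label vocabulary and logits (one row of scores per word)
index_to_tag = {0: "NN", 1: "DET", 2: "V"}
tag_logits = [[2.1, -0.3, 0.4],
              [0.2, 0.1, 3.0],
              [-1.0, 2.5, 0.0],
              [1.9, 0.3, -0.2]]

sentence = "Everybody ate the apple"
parts = decode_tags(tag_logits, index_to_tag)
tagged = " ".join(f"{word}###{part}" for word, part in zip(sentence.split(), parts))
print(tagged)  # Everybody###NN ate###V the###DET apple###NN
```

In the real cell, `np.argmax(tag_logits, axis=-1)` does the per-row argmax in one call, and `model.vocab.get_token_from_index(i, 'labels')` plays the role of `index_to_tag`.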
