# Trying out the allennlp tutorial

Link to the tutorial: https://allennlp.org/tutorials

In this notebook I explore the allennlp library, starting from their tutorial. The goal is to work out what each module does, both to understand the framework better and to prototype faster.

Underlined names are hyperlinked to the docs for easy access. You might also find it easier to simply run `??Module` in a new cell to view a module's source.

### External modules

In [77]:
from typing import Iterator, List, Dict

First up is [`typing`](https://docs.python.org/3/library/typing.html). It provides type hints, which annotate the expected input and output types of a function, similar to the types in a C++ function signature. Unlike in C++, Python does not enforce the hints at runtime. Some interesting ones are `Union` (either one type or another) and `Callable` (a function passed as an argument).
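As a quick illustration of `Union` and `Callable`, here is a minimal sketch. The function names `scale` and `apply_to_all` are made up for this example and are not part of allennlp or the tutorial.

```python
from typing import Callable, Dict, List, Union


def scale(value: Union[int, float], factor: float = 2.0) -> float:
    """Accept either an int or a float and always return a float."""
    return float(value) * factor


def apply_to_all(fn: Callable[[str], int], words: List[str]) -> Dict[str, int]:
    """Apply a function that maps str -> int to every word."""
    return {w: fn(w) for w in words}


# The hints are documentation only; Python does not check them at runtime.
lengths = apply_to_all(len, ["Hello", "society"])
print(scale(3))   # 6.0
print(lengths)    # {'Hello': 5, 'society': 7}
```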

In [78]:
import torch
import torch.optim as optim
import numpy as np

Nothing much, just regular pytorch and numpy 

### Set up example cases

We will first experiment with:
- the word 'Hello',
- the sentence 'We live in a society.',
- the sentences \['You are a bold one.', 'Perhaps the archives are incomplete.'\]

In [98]:
word = 'Hello'
sent = 'We live in a society.'
sents = ['You are a bold one.', 'Perhaps the archives are incomplete.']

### Start importing allennlp

#### Tokenizer

In [138]:
from allennlp.data.tokenizers import Token, WordTokenizer
from allennlp.data.tokenizers.token import show_token

[`Token`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html#allennlp.data.tokenizers.token.Token): a wrapper around a word that keeps track of information such as its lemma, its part-of-speech tag, etc.

[`WordTokenizer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html#word-tokenizer): Tokenizes a sentence and outputs a list of tokens. By default it uses spacy's implementation for tokenizing words.

[`show_token`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html#allennlp.data.tokenizers.token.show_token): a convenience function to print your tokens

In [99]:
word_token = Token(word)

This is what a single token looks like:

In [139]:
show_token(word_token)

'Hello (idx: None) (lemma: None) (pos: None) (tag: None) (dep: None) (ent_type: None) '

Note that only the text is filled in; the other attributes are None and get filled in by further processing.

We can now tokenize a whole sentence using the WordTokenizer.

In [127]:
sent_toks = WordTokenizer().tokenize(sent)

The tokenized sentence; the output is a list:

In [131]:
sent_toks

[We, live, in, a, society, .]

These are the printed tokens

In [145]:
[show_token(s) for s in sent_toks]

['We (idx: 0) (lemma: We) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'live (idx: 3) (lemma: live) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'in (idx: 8) (lemma: in) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'a (idx: 11) (lemma: a) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'society (idx: 13) (lemma: society) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 '. (idx: 20) (lemma: .) (pos: ) (tag: ) (dep: ) (ent_type: ) ']
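The `idx` values in this output look like character offsets into the original string. To check that reading, here is a rough sketch that uses a plain regex as a stand-in for the tokenizer. It is not spaCy's algorithm, and `tokenize_with_offsets` is a hypothetical helper written only for this check.

```python
import re
from typing import List, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[str, int]]:
    """Split into word runs and single punctuation marks, keeping start offsets."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]


pairs = tokenize_with_offsets("We live in a society.")
print(pairs)
# [('We', 0), ('live', 3), ('in', 8), ('a', 11), ('society', 13), ('.', 20)]
```

The offsets agree with the `idx` values printed by `show_token` for this sentence.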

We can also process multiple sentences at once. 

In [147]:
sents_toks = WordTokenizer().batch_tokenize(sents)

In [148]:
sents_toks

[[You, are, a, bold, one, .], [Perhaps, the, archives, are, incomplete, .]]

In [146]:
[show_token(s) for snt in sents_toks for s in snt]

['You (idx: 0) (lemma: You) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'are (idx: 4) (lemma: be) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'a (idx: 8) (lemma: a) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'bold (idx: 10) (lemma: bold) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'one (idx: 15) (lemma: one) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 '. (idx: 18) (lemma: .) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'Perhaps (idx: 0) (lemma: Perhaps) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'the (idx: 8) (lemma: the) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'archives (idx: 12) (lemma: archive) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'are (idx: 21) (lemma: be) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'incomplete (idx: 25) (lemma: incomplete) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 '. (idx: 35) (lemma: .) (pos: ) (tag: ) (dep: ) (ent_type: ) ']

#### TokenIndexer

In [101]:
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer

[`TokenIndexer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.token_indexers.html#allennlp.data.token_indexers.token_indexer.TokenIndexer): converts a token or a list of tokens to indices. Each index is the position of the token in some vocabulary, and these indices are what the model actually consumes.
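To make the idea concrete, here is a small pure-Python sketch of single-id indexing. It assumes ids are handed out in first-seen order after two reserved entries for padding and unknown words. `build_vocab` and `tokens_to_indices` are illustrative helpers, not the allennlp implementation.

```python
from typing import Dict, List

PADDING, UNKNOWN = "@@PADDING@@", "@@UNKNOWN@@"


def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign each new token the next free id (assumption: first-seen order)."""
    token_to_id = {PADDING: 0, UNKNOWN: 1}
    for sentence in sentences:
        for token in sentence:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def tokens_to_indices(tokens: List[str], token_to_id: Dict[str, int]) -> List[int]:
    """Map each token to its id, falling back to the unknown id."""
    return [token_to_id.get(t, token_to_id[UNKNOWN]) for t in tokens]


toy_vocab = build_vocab([["The", "dog", "ate", "the", "apple"]])
toy_ids = tokens_to_indices(["The", "cat", "ate"], toy_vocab)
print(toy_ids)  # [2, 1, 4]  ('cat' is not in the vocabulary)
```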

In [84]:
from allennlp.data import Field
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data import Instance

[`Field`](https://allenai.github.io/allennlp-docs/api/allennlp.data.fields.html#allennlp.data.fields.field.Field): some data to be fed into the pipeline of your model. Some important use cases are:
 - tokenized text: for example, the text "The movie" would be stored in a field as \['The', 'movie'\].
 - numerical ids for the tokenized text: if 'The' maps to 1 and 'movie' maps to 35 in a dictionary of words, the field could contain \[1, 35\].
 - Also note that fields in a batch can hold sentences of varying lengths. To pass them to your pipeline, they need to be appropriately padded. Field has an `as_tensor` method, which pads the data to the given lengths and converts it into a tensor, and `batch_tensors`, which merges a list of such tensors into one batch.
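The padding step can be sketched without any library. This is only an illustration of the idea, assuming 0 is the padding id. `pad_batch` is a hypothetical helper, not allennlp's `as_tensor` or `batch_tensors`.

```python
from typing import List, Tuple


def pad_batch(id_lists: List[List[int]],
              pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every id list to the longest length and build a 0/1 mask."""
    max_len = max(len(ids) for ids in id_lists)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in id_lists]
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in id_lists]
    return padded, mask


padded, mask = pad_batch([[2, 3, 4, 5, 6], [7, 8, 9, 10]])
print(padded)  # [[2, 3, 4, 5, 6], [7, 8, 9, 10, 0]]
print(mask)    # [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]
```

The mask is what lets a model ignore the padded positions later on.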

For most purposes you should be able to use one of the ready-made Fields, like `TextField` or `SequenceLabelField`.
 - [`TextField`](https://allenai.github.io/allennlp-docs/api/allennlp.data.fields.html#allennlp.data.fields.text_field.TextField): contains tokenized strings. Raw strings need to go through a [`tokenizer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html) before being passed to the field.
 - [`SequenceLabelField`](https://allenai.github.io/allennlp-docs/api/allennlp.data.fields.html#allennlp.data.fields.sequence_label_field.SequenceLabelField): assigns a label to each element of a sequence field, for example one tag per token of a `TextField`.

[`Instance`](https://allenai.github.io/allennlp-docs/api/allennlp.data.instance.html#allennlp.data.instance.Instance): simply a dictionary mapping keys to fields. This is useful when you want to handle several fields together, such as a sentence and its labels.
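A plain-dictionary sketch of that relationship: one key for the sentence, an optional key for the labels, and one label per token. `make_instance` is a hypothetical stand-in for the real classes, and rejecting mismatched lengths with a `ValueError` is a choice made for this sketch.

```python
from typing import Dict, List, Optional, Sequence


def make_instance(tokens: Sequence[str],
                  tags: Optional[Sequence[str]] = None) -> Dict[str, List[str]]:
    """Bundle a token sequence and (optionally) its per-token labels."""
    fields = {"sentence": list(tokens)}
    if tags is not None:
        if len(tags) != len(tokens):
            raise ValueError("need exactly one label per token")
        fields["labels"] = list(tags)
    return fields


toy_instance = make_instance(["The", "dog"], ["DET", "NN"])
print(toy_instance)  # {'sentence': ['The', 'dog'], 'labels': ['DET', 'NN']}
```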

In [89]:
from allennlp.data.dataset_readers import DatasetReader

In [7]:
from allennlp.common.file_utils import cached_path
from allennlp.data.vocabulary import Vocabulary
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer
from allennlp.predictors import SentenceTaggerPredictor

torch.manual_seed(1)

<torch._C.Generator at 0x7fa799ac42f0>

In [8]:
class PosDatasetReader(DatasetReader):
    """
    DatasetReader for PoS tagging data, one sentence per line, like

        The###DET dog###NN ate###V the###DET apple###NN
    """
    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        # lazy=False: read the whole dataset into memory as a list.
        super().__init__(lazy=False)
        # Default: map every token to a single integer id under the "tokens" key.
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"sentence": sentence_field}

        # Tags are optional so the same method also works at prediction time.
        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
            fields["labels"] = label_field

        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:
            for line in f:
                # "The###DET dog###NN" -> ("The", "dog") and ("DET", "NN")
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags)


class LstmTagger(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        # Project each encoder output vector to one score per label.
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        # The mask marks real tokens, as opposed to padding.
        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        encoder_out = self.encoder(embeddings, mask)
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}
        if labels is not None:
            # Padded positions are excluded from both the accuracy and the loss.
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)

        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}
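The parsing done in `_read` can be tried on its own, without allennlp. The sketch below isolates the `zip(*...)` trick on the example line from the docstring. `parse_tagged_line` is a hypothetical helper, and returning two empty lists for a blank line is an addition of this sketch (the reader above would raise on an empty line).

```python
from typing import List, Tuple


def parse_tagged_line(line: str, sep: str = "###") -> Tuple[List[str], List[str]]:
    """Split 'word###TAG word###TAG ...' into parallel word and tag lists."""
    pairs = [pair.split(sep) for pair in line.strip().split()]
    if not pairs:  # blank line: nothing to unzip
        return [], []
    words, tags = zip(*pairs)  # transpose [(w1, t1), (w2, t2)] -> (w1, w2), (t1, t2)
    return list(words), list(tags)


toy_words, toy_tags = parse_tagged_line("The###DET dog###NN ate###V the###DET apple###NN\n")
print(toy_words)  # ['The', 'dog', 'ate', 'the', 'apple']
print(toy_tags)   # ['DET', 'NN', 'V', 'DET', 'NN']
```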

In [9]:
reader = PosDatasetReader()

In [10]:
train_dataset = reader.read(cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/training.txt'))
validation_dataset = reader.read(cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/validation.txt'))

12/06/2018 11:56:06 - INFO - allennlp.common.file_utils -   https://raw.githubusercontent.com/allenai/allennlp/master/tutorials/tagger/training.txt not found in cache, downloading to /tmp/tmp6xcg92zt
93B [00:00, 35710.91B/s]             
12/06/2018 11:56:07 - INFO - allennlp.common.file_utils -   copying /tmp/tmp6xcg92zt to cache at /home/arka/.allennlp/cache/c3e1f451545a79cf7582dec24d072db6f5bb0d1ae24a924d03c9944516e16b60.47b1193282cbd926a1b602cc6d5a22324cfab24e669ca04f1ff4851a35c73393
12/06/2018 11:56:07 - INFO - allennlp.common.file_utils -   creating metadata file for /home/arka/.allennlp/cache/c3e1f451545a79cf7582dec24d072db6f5bb0d1ae24a924d03c9944516e16b60.47b1193282cbd926a1b602cc6d5a22324cfab24e669ca04f1ff4851a35c73393
12/06/2018 11:56:07 - INFO - allennlp.common.file_utils -   removing temp file /tmp/tmp6xcg92zt
2it [00:00, 2210.44it/s]
12/06/2018 11:56:07 - INFO - allennlp.common.file_utils -   https://raw.githubusercontent.com/allenai/allennlp/master/tutorials/tagger/validati

In [11]:
vocab = Vocabulary.from_instances(train_dataset + validation_dataset)

12/06/2018 11:56:13 - INFO - allennlp.data.vocabulary -   Fitting token dictionary from dataset.
100%|██████████| 4/4 [00:00<00:00, 16194.22it/s]

In [16]:
a1 = train_dataset[0]

In [19]:
a2 = a1['sentence']

In [35]:
vocab.get_token_to_index_vocabulary()

{'@@PADDING@@': 0,
 '@@UNKNOWN@@': 1,
 'The': 2,
 'dog': 3,
 'ate': 4,
 'the': 5,
 'apple': 6,
 'Everybody': 7,
 'read': 8,
 'that': 9,
 'book': 10}

In [39]:
a3 = a1['labels']

In [40]:
a3.labels

('DET', 'NN', 'V', 'DET', 'NN')

In [44]:
a4 = a2.tokens[0]

In [57]:
validation_dataset[0]['sentence'].tokens

[The, dog, read, the, apple]


In [58]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

In [59]:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

In [66]:
vocab.get_vocab_size('labels')

3

In [67]:
token_embedding

Embedding()

In [71]:
??BasicTextFieldEmbedder

In [69]:
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

In [70]:
word_embeddings

BasicTextFieldEmbedder(
  (token_embedder_tokens): Embedding()
)

In [75]:
word_embeddings._token_embedders.keys()

dict_keys(['tokens'])

In [76]:
??word_embeddings

In [None]:
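# Wrap a PyTorch LSTM so it takes (embeddings, mask) and returns one vector per token.
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model = LstmTagger(word_embeddings, lstm, vocab)
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Batch sentences of similar length together to keep padding small.
iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])
# The iterator needs the vocabulary to turn tokens into ids.
iterator.index_with(vocab)
# patience=10: stop early if the validation metric does not improve for 10 epochs.
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=1000)
trainer.train()
predictor = SentenceTaggerPredictor(model, dataset_reader=reader)
# tag_logits holds one row of label scores per token; take the best label in each row.
tag_logits = predictor.predict("The dog ate the apple")['tag_logits']
tag_ids = np.argmax(tag_logits, axis=-1)
print([model.vocab.get_token_from_index(i, 'labels') for i in tag_ids])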
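The last two lines are an argmax over each token's label scores followed by a lookup in the label vocabulary. Here is a library-free sketch of that decoding step. The logits and the index-to-label mapping below are made-up values for illustration, not output from the trained model.

```python
from typing import Dict, List


def decode_tags(tag_logits: List[List[float]],
                index_to_label: Dict[int, str]) -> List[str]:
    """For each token, pick the label whose score is highest."""
    tags = []
    for scores in tag_logits:
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tags.append(index_to_label[best])
    return tags


# Hypothetical scores for three tokens over three labels.
toy_logits = [[2.1, 0.3, -1.0], [0.2, 1.7, 0.1], [-0.5, 0.0, 3.2]]
toy_labels = {0: "DET", 1: "NN", 2: "V"}
print(decode_tags(toy_logits, toy_labels))  # ['DET', 'NN', 'V']
```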