# Trying out the allennlp tutorial

Link to the tutorial: https://allennlp.org/tutorials

In this notebook I explore the allennlp library, starting from their tutorial. The goal is to work out what each module does, both to understand the framework better and to prototype faster.

Underlined names are hyperlinked to the docs for easy access. You might also find it easier to simply run `??Module` in a new cell.

### External modules

In [1]:
from typing import Iterator, List, Dict

First up is [`typing`](https://docs.python.org/3/library/typing.html). It provides type hints, which annotate the expected input and output types of a function, somewhat like type declarations in C++. Unlike in C++, Python does not enforce them at runtime. Some interesting ones are `Union` (any one of several types) and `Callable` (a function passed as a value).
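To make `Union` and `Callable` concrete, here is a small sketch of my own. The names `Number`, `apply_to_all` and `double` are made up for illustration and have nothing to do with allennlp.

```python
from typing import Callable, List, Union

# Either an int or a float is acceptable wherever Number is used.
Number = Union[int, float]


def apply_to_all(values: List[Number], fn: Callable[[Number], Number]) -> List[Number]:
    """Apply `fn` to every element of `values` and return the results."""
    return [fn(v) for v in values]


def double(x: Number) -> Number:
    return x * 2


doubled = apply_to_all([1, 2.5, 3], double)
print(doubled)  # [2, 5.0, 6]
```

The hints only document intent here; a static checker such as mypy is what would actually flag a mismatch.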

In [2]:
import torch
import torch.optim as optim
import numpy as np

Nothing much here, just regular PyTorch and NumPy.

### Set up example cases

We will first experiment with:
- the word 'Hello',
- the sentence 'We live in a society.',
- the sentences \['You are a bold one.', 'Perhaps the archives are incomplete.'\]

In [3]:
word = 'Hello'
sent = 'We live in a society.'
sents = ['You are a bold one.', 'Perhaps the archives are incomplete.']

### Start importing allennlp

#### Tokenizer

In [4]:
from allennlp.data.tokenizers import Token, WordTokenizer
from allennlp.data.tokenizers.token import show_token

[`Token`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html#allennlp.data.tokenizers.token.Token): a wrapper around a word that keeps track of extra information such as its lemma or part-of-speech tag.

[`WordTokenizer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html#word-tokenizer): tokenizes a sentence and outputs a list of tokens. By default it uses spaCy to split the text into words.

[`show_token`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html#allennlp.data.tokenizers.token.show_token): a convenience function that formats a token and its attributes as a string.

In [5]:
word_token = Token(word)

This is what a single token looks like:

In [6]:
show_token(word_token)

'Hello (idx: None) (lemma: None) (pos: None) (tag: None) (dep: None) (ent_type: None) '

Note that only the text is filled in; the other attributes are None and only get populated by further processing.

We can now tokenize a whole sentence using the WordTokenizer.

In [7]:
sent_toks = WordTokenizer().tokenize(sent)

The tokenized sentence, the output being a list

In [8]:
sent_toks

[We, live, in, a, society, .]

These are the printed tokens

In [9]:
[show_token(s) for s in sent_toks]

['We (idx: 0) (lemma: We) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'live (idx: 3) (lemma: live) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'in (idx: 8) (lemma: in) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'a (idx: 11) (lemma: a) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'society (idx: 13) (lemma: society) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 '. (idx: 20) (lemma: .) (pos: ) (tag: ) (dep: ) (ent_type: ) ']
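The `idx` values in the output above line up with each token's starting character position in the original string. Here is a rough sketch that reproduces them with a regex instead of spaCy. The helper `tokens_with_offsets` is my own invention and is far cruder than a real tokenizer.

```python
import re
from typing import List, Tuple


def tokens_with_offsets(text: str) -> List[Tuple[str, int]]:
    """Split into runs of word characters or single punctuation marks, keeping start offsets."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]


offsets = tokens_with_offsets("We live in a society.")
print(offsets)
# [('We', 0), ('live', 3), ('in', 8), ('a', 11), ('society', 13), ('.', 20)]
```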

We can also process multiple sentences at once. 

In [10]:
sents_toks = WordTokenizer().batch_tokenize(sents)

In [11]:
sents_toks

[[You, are, a, bold, one, .], [Perhaps, the, archives, are, incomplete, .]]

In [12]:
[show_token(s) for snt in sents_toks for s in snt]

['You (idx: 0) (lemma: You) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'are (idx: 4) (lemma: be) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'a (idx: 8) (lemma: a) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'bold (idx: 10) (lemma: bold) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'one (idx: 15) (lemma: one) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 '. (idx: 18) (lemma: .) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'Perhaps (idx: 0) (lemma: Perhaps) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'the (idx: 8) (lemma: the) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'archives (idx: 12) (lemma: archive) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'are (idx: 21) (lemma: be) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 'incomplete (idx: 25) (lemma: incomplete) (pos: ) (tag: ) (dep: ) (ent_type: ) ',
 '. (idx: 35) (lemma: .) (pos: ) (tag: ) (dep: ) (ent_type: ) ']

#### TokenIndexer

In [13]:
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer

[`TokenIndexer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.token_indexers.html#allennlp.data.token_indexers.token_indexer.TokenIndexer): Converts a token or list of tokens to indices. These indices refer to the index of the token in some vocabulary to be used by the model.

[`SingleIdTokenIndexer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.token_indexers.html#single-id-token-indexer): represents each token as a single integer id.

Note that a token indexer needs a vocabulary to do the actual conversion, but we haven't created one yet.
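To see why the vocabulary matters, here is a plain-Python sketch of what indexing boils down to. `toy_vocab` and `tokens_to_ids` are hypothetical stand-ins, not allennlp code; I only borrow the `@@PADDING@@` and `@@UNKNOWN@@` names that appear in the Vocabulary section of this notebook.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real one is built from the data.
toy_vocab: Dict[str, int] = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1, "We": 2, "live": 3, "in": 4}


def tokens_to_ids(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Look up each token; fall back to the unknown-token id when it is missing."""
    unk = vocab["@@UNKNOWN@@"]
    return [vocab.get(tok, unk) for tok in tokens]


ids = tokens_to_ids(["We", "live", "in", "a", "society"], toy_vocab)
print(ids)  # [2, 3, 4, 1, 1]
```

Without a vocabulary there is nothing to look the tokens up in, which is why the indexer created below cannot produce ids yet.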

In [14]:
s = SingleIdTokenIndexer()

#### Fields and Instances

In [15]:
from allennlp.data import Field
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data import Instance

[`Field`](https://allenai.github.io/allennlp-docs/api/allennlp.data.fields.html#allennlp.data.fields.field.Field): is simply some data to be fed into the pipeline of your model. Some important use cases are:
 - tokenized text: for example, the text "The movie" would be stored in a field as \['The', 'movie'\].
 - numerical ids for the tokenized text: if 'The' maps to 1 and 'movie' maps to 35 in a dictionary of words, the field could contain \[1, 35\].
 - Fields in a batch can hold sequences of varying lengths. Before such fields go through your pipeline, they need to be padded appropriately. `Field` has an `as_tensor` method to convert the data into a (padded) tensor and a `batch_tensors` method to combine a list of such tensors into a batch.
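The padding step can be sketched in plain Python. This is only my illustration of the idea, with a made-up helper `pad_batch` and the assumption that id 0 is the padding id; it is not how allennlp implements it.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every sequence with `pad_id` up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


padded = pad_batch([[1, 35], [4, 2, 5, 6]])
print(padded)  # [[1, 35, 0, 0], [4, 2, 5, 6]]
```

Once every row has the same length, the batch can be turned into a single rectangular tensor.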

For most purposes you should be able to use one of the ready-made Fields like the `TextField` or `SequenceLabelField`.
 - [`TextField`](https://allenai.github.io/allennlp-docs/api/allennlp.data.fields.html#allennlp.data.fields.text_field.TextField): the field contains tokenized strings. Raw strings need to be passed through a [`tokenizer`](https://allenai.github.io/allennlp-docs/api/allennlp.data.tokenizers.html) before they are handed to the field.
 - [`SequenceLabelField`](https://allenai.github.io/allennlp-docs/api/allennlp.data.fields.html#allennlp.data.fields.sequence_label_field.SequenceLabelField): assigns a label to each element of a sequence field.

[`Instance`](https://allenai.github.io/allennlp-docs/api/allennlp.data.instance.html#allennlp.data.instance.Instance): is simply a dictionary with string keys and fields as values. One data point is one Instance.

First, let's create a TextField.

In [16]:
simple_text_field = TextField(sent_toks, {"tokens": SingleIdTokenIndexer()})

In [17]:
simple_text_field

<allennlp.data.fields.text_field.TextField at 0x7fe2bb542a20>

In [27]:
def get_instance_from_tokenized_sent(tok_sent: List[Token]) -> Instance:
    "Converts a tokenized sentence into an Instance holding a single TextField."
    sent_tok_text_field = TextField(tok_sent, {"tokens": SingleIdTokenIndexer()})
    fields = {'sentence': sent_tok_text_field}
    return Instance(fields)

In [28]:
def get_instances_from_tokenized_sents(tok_sents: List[List[Token]]) -> List[Instance]:
    "Converts a list of tokenized sentences into a list of Instances."
    return [get_instance_from_tokenized_sent(tok_sent) for tok_sent in tok_sents]

In [29]:
simple_instance = get_instance_from_tokenized_sent(sent_toks)

In [30]:
simple_instance.fields['sentence'].__dict__

{'tokens': [We, live, in, a, society, .],
 '_token_indexers': {'tokens': <allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer at 0x7fe3208363c8>},
 '_indexed_tokens': None,
 '_indexer_name_to_indexed_token': None}

In [31]:
few_instances = get_instances_from_tokenized_sents(sents_toks)

In [32]:
[f.__dict__ for f in few_instances]

[{'fields': {'sentence': <allennlp.data.fields.text_field.TextField at 0x7fe320917208>},
  'indexed': False},
 {'fields': {'sentence': <allennlp.data.fields.text_field.TextField at 0x7fe2bb52cef0>},
  'indexed': False}]

In [33]:
[f.fields['sentence'].__dict__ for f in few_instances]

[{'tokens': [You, are, a, bold, one, .],
  '_token_indexers': {'tokens': <allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer at 0x7fe320836a90>},
  '_indexed_tokens': None,
  '_indexer_name_to_indexed_token': None},
 {'tokens': [Perhaps, the, archives, are, incomplete, .],
  '_token_indexers': {'tokens': <allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer at 0x7fe2bb52c1d0>},
  '_indexed_tokens': None,
  '_indexer_name_to_indexed_token': None}]

#### Vocabulary

In [25]:
from allennlp.data.vocabulary import Vocabulary

[`Vocabulary`](https://allenai.github.io/allennlp-docs/api/allennlp.data.vocabulary.html): provides a mapping from strings to integer indices. It can be created with `from_files` or `from_instances`. It is quite useful, since building such a dictionary is fundamental to almost all NLP tasks.

Since we have `few_instances`, we are now ready to build a vocabulary.

In [34]:
vocab =  Vocabulary.from_instances(few_instances)

12/08/2018 18:23:17 - INFO - allennlp.data.vocabulary -   Fitting token dictionary from dataset.
100%|██████████| 2/2 [00:00<00:00, 9147.88it/s]


Now we can get the vocabulary mapping tokens to indices

In [39]:
vocab.get_token_to_index_vocabulary()

{'@@PADDING@@': 0,
 '@@UNKNOWN@@': 1,
 'are': 2,
 '.': 3,
 'You': 4,
 'a': 5,
 'bold': 6,
 'one': 7,
 'Perhaps': 8,
 'the': 9,
 'archives': 10,
 'incomplete': 11}
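The mapping above is consistent with a simple scheme: reserve ids 0 and 1 for padding and unknown tokens, then hand out ids by descending frequency. Here is a pure-Python sketch of that idea. `build_vocab` is my own toy helper, and I am not claiming this is how allennlp does it internally.

```python
from collections import Counter
from typing import Dict, List


def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Reserve ids 0 and 1 for padding/unknown, then add tokens by descending frequency."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {"@@PADDING@@": 0, "@@UNKNOWN@@": 1}
    # most_common() keeps first-seen order among equally frequent tokens (Python 3.7+).
    for token, _ in counts.most_common():
        vocab[token] = len(vocab)
    return vocab


toy = build_vocab([
    ["You", "are", "a", "bold", "one", "."],
    ["Perhaps", "the", "archives", "are", "incomplete", "."],
])
print(toy["are"], toy["."], toy["You"], toy["incomplete"])  # 2 3 4 11
```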

We can also get the reverse mapping, from index to token

In [40]:
vocab.get_index_to_token_vocabulary()

{0: '@@PADDING@@',
 1: '@@UNKNOWN@@',
 2: 'are',
 3: '.',
 4: 'You',
 5: 'a',
 6: 'bold',
 7: 'one',
 8: 'Perhaps',
 9: 'the',
 10: 'archives',
 11: 'incomplete'}

To look up a single token by its index, or a single index by its token

In [41]:
vocab.get_token_from_index(2)

'are'

In [42]:
vocab.get_token_index('are')

2

We can also get some statistics about the vocabulary created

In [38]:
vocab.print_statistics()

12/08/2018 18:25:37 - INFO - allennlp.data.vocabulary -   Printed vocabulary statistics are only for the part of the vocabulary generated from instances. If vocabulary is constructed by extending saved vocabulary with dataset instances, the directly loaded portion won't be considered here.




----Vocabulary Statistics----


Top 10 most frequent tokens in namespace 'tokens':
	Token: are		Frequency: 2
	Token: .		Frequency: 2
	Token: You		Frequency: 1
	Token: a		Frequency: 1
	Token: bold		Frequency: 1
	Token: one		Frequency: 1
	Token: Perhaps		Frequency: 1
	Token: the		Frequency: 1
	Token: archives		Frequency: 1
	Token: incomplete		Frequency: 1

Top 10 longest tokens in namespace 'tokens':
	Token: incomplete		length: 10	Frequency: 1
	Token: archives		length: 8	Frequency: 1
	Token: Perhaps		length: 7	Frequency: 1
	Token: bold		length: 4	Frequency: 1
	Token: are		length: 3	Frequency: 2
	Token: You		length: 3	Frequency: 1
	Token: one		length: 3	Frequency: 1
	Token: the		length: 3	Frequency: 1
	Token: .		length: 1	Frequency: 2
	Token: a		length: 1	Frequency: 1

Top 10 shortest tokens in namespace 'tokens':
	Token: a		length: 1	Frequency: 1
	Token: .		length: 1	Frequency: 2
	Token: the		length: 3	Frequency: 1
	Token: one		length: 3	Frequency: 1
	Token: You		length: 3	Frequency: 1

#### File Utils

[`cached_path`](https://allenai.github.io/allennlp-docs/api/allennlp.common.file_utils.html#allennlp.common.file_utils.cached_path): a convenience function that takes either a URL or a local path. Given a URL, it downloads the file to a local cache; given a local path, it checks that the file exists. Either way, it returns the local path.
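The URL-or-path behaviour can be sketched with the standard library alone. `simple_cached_path`, its cache directory and the hash-based file name are all assumptions of mine for illustration; allennlp's real function has its own cache location and naming.

```python
import hashlib
import os
import tempfile
from urllib.parse import urlparse
from urllib.request import urlretrieve


def simple_cached_path(url_or_path: str, cache_dir: str = None) -> str:
    """Return a local path for `url_or_path`, downloading it first if it is a URL."""
    if urlparse(url_or_path).scheme in ("http", "https"):
        cache_dir = cache_dir or os.path.join(tempfile.gettempdir(), "simple_cache")
        os.makedirs(cache_dir, exist_ok=True)
        # Name the cached file after a hash of the URL so repeated calls reuse it.
        name = hashlib.sha256(url_or_path.encode("utf-8")).hexdigest()
        target = os.path.join(cache_dir, name)
        if not os.path.exists(target):
            urlretrieve(url_or_path, target)
        return target
    if os.path.exists(url_or_path):
        return url_or_path
    raise FileNotFoundError(f"{url_or_path} is neither a URL nor an existing local path")


# Local paths are returned unchanged once they are known to exist.
handle, local_file = tempfile.mkstemp()
os.close(handle)
resolved = simple_cached_path(local_file)
print(resolved == local_file)  # True
os.remove(local_file)
```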

In [73]:
from allennlp.common.file_utils import cached_path

In [84]:
train_dataset_path = cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/training.txt')
validation_dataset_path = cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/validation.txt')

We can open the file and see the format

In [87]:
with open(train_dataset_path, 'r') as f:
    lines = f.readlines()

In [88]:
lines

['The###DET dog###NN ate###V the###DET apple###NN\n',
 'Everybody###NN read###V that###DET book###NN\n']
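Each line holds whitespace-separated `word###TAG` pairs. A small sketch of how one such line splits into words and tags (the helper name `parse_tagged_line` is mine):

```python
from typing import List, Tuple


def parse_tagged_line(line: str, sep: str = "###") -> Tuple[List[str], List[str]]:
    """Split a 'word###TAG word###TAG ...' line into parallel word and tag lists."""
    pairs = [pair.split(sep) for pair in line.strip().split()]
    words = [word for word, _ in pairs]
    tags = [tag for _, tag in pairs]
    return words, tags


words, tags = parse_tagged_line("The###DET dog###NN ate###V the###DET apple###NN\n")
print(words)  # ['The', 'dog', 'ate', 'the', 'apple']
print(tags)   # ['DET', 'NN', 'V', 'DET', 'NN']
```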

#### Dataset Readers

`DatasetReader` is the superclass for all dataset readers. It reads a file and converts its text into instances; a custom reader needs to implement both `_read` and `text_to_instance`. The `lazy` flag controls whether the whole dataset is loaded into memory at once (`lazy=False`) or instances are produced on demand.
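The difference between lazy and eager reading is the usual generator-versus-list distinction in Python. Here is a toy sketch with a made-up `read_lines_lazily` function, independent of allennlp:

```python
from typing import Iterator, List


def read_lines_lazily(lines: List[str]) -> Iterator[str]:
    """Yield one cleaned line at a time; nothing is processed until iteration starts."""
    for line in lines:
        yield line.strip()


raw = ["The###DET dog###NN\n", "Everybody###NN read###V\n"]

lazy = read_lines_lazily(raw)          # a generator: no work done yet
first = next(lazy)                     # processes only the first line
eager = list(read_lines_lazily(raw))   # materializes everything up front

print(first)  # The###DET dog###NN
print(eager)  # ['The###DET dog###NN', 'Everybody###NN read###V']
```

Eager loading is simpler and fine for a two-line dataset like this one; lazy loading starts to matter when the data does not fit in memory.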

In [35]:
from allennlp.data.dataset_readers import DatasetReader

Defining a dataset reader for POS tagging data. Note that the `_read` method returns an Iterator.

In [80]:
class PosDatasetReader(DatasetReader):
    """
    DatasetReader for PoS tagging data, one sentence per line, like

        The###DET dog###NN ate###V the###DET apple###NN
    """
    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        super().__init__(lazy=False)
        # Default to indexing each token as a single id under the "tokens" key.
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"sentence": sentence_field}

        # Tags are optional so the same method also works at prediction time.
        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
            fields["labels"] = label_field

        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:
            for line in f:
                # Each line holds whitespace-separated word###TAG pairs.
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags)

In [None]:
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer
from allennlp.predictors import SentenceTaggerPredictor

torch.manual_seed(1)

In [None]:

class LstmTagger(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        # Mask out padded positions so they do not affect the encoder, metric, or loss.
        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        encoder_out = self.encoder(embeddings, mask)
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}
        # Labels are only passed in when gold tags are available.
        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)

        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}

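In [None]:
reader = PosDatasetReader()
train_dataset = reader.read(train_dataset_path)
validation_dataset = reader.read(validation_dataset_path)

In [None]:
a1 = train_dataset[0]

In [None]:
a2 = a1['sentence']

In [None]:
a3 = a1['labels']

In [None]:
a3.labels

In [None]:
a4 = a2.tokens[0]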

In [None]:
vocab.get_token_to_index_vocabulary()

In [None]:
validation_dataset[0]['sentence'].tokens

In [None]:
vocab = Vocabulary.from_instances(train_dataset + validation_dataset)

In [None]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

In [None]:
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

In [None]:
vocab.get_vocab_size('labels')

In [None]:
token_embedding

In [None]:
??BasicTextFieldEmbedder

In [None]:
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})

In [None]:
word_embeddings

In [None]:
word_embeddings._token_embedders.keys()

In [None]:
??word_embeddings

In [None]:
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model = LstmTagger(word_embeddings, lstm, vocab)
optimizer = optim.SGD(model.parameters(), lr=0.1)
iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])
iterator.index_with(vocab)
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=1000)
trainer.train()
predictor = SentenceTaggerPredictor(model, dataset_reader=reader)
tag_logits = predictor.predict("The dog ate the apple")['tag_logits']
tag_ids = np.argmax(tag_logits, axis=-1)
print([model.vocab.get_token_from_index(i, 'labels') for i in tag_ids])