<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/mlp_conll03_tagger_hf_dset_and_trainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sequence labeling (token tagging) with MLP

This notebook builds upon the "classification with MLP" one, and shows how to implement an elementary sequence tagger.

# Setup

Before we start running our own Python code, install the required Python packages using [pip](https://en.wikipedia.org/wiki/Pip):

* [`transformers`](https://huggingface.co/docs/transformers/index) is a popular deep learning package built primarily on top of torch
* [`datasets`](https://huggingface.co/docs/datasets/) provides support for loading, creating, and manipulating datasets
* `evaluate` is a library of performance metrics (such as accuracy)

All three packages will be used in this course.

In [1]:
!pip3 install -q transformers datasets evaluate


(Above, the `!` at the start of the line tells the notebook to run the line as an operating system command rather than Python code, and the `-q` argument to `pip` runs the command in "quiet" mode, with less output.)

---

# Get and prepare data

*   Let us work with the venerable, if somewhat dated CoNLL'03 shared task data
*   These are English news articles, and have annotation for POS, syntactic chunks, and named entities (in the IOB format)
*   The dataset happens to be in the HF datasets collection, so we can grab it from there
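To make the IOB scheme concrete: a `B-X` tag opens an entity of type X, `I-X` continues it, and `O` marks tokens outside any entity. The sketch below groups a tag sequence into entity spans. The helper name `iob_to_spans` and the toy sentence are made up for illustration and are not part of the dataset or of any library; the sketch assumes every entity starts with a `B-` tag.

```python
def iob_to_spans(tokens, tags):
    """Group IOB tags into (entity_type, entity_text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one, if any
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # The current entity continues
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag, ignored in this sketch) closes the entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Toy example, not taken from the dataset
tokens = ["Peter", "Blackburn", "visited", "Brussels", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
```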



In [2]:
from pprint import pprint #pprint => pretty-print, I use it occasionally throughout the notebook
import datasets
dset=datasets.load_dataset("conll2003")
pprint(dset)


{'test': Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 3453
}),
 'train': Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14041
}),
 'validation': Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 3250
})}


In [3]:
pprint(dset["train"][12])

{'chunk_tags': [11, 12, 12, 12, 21, 11, 11, 12, 0],
 'id': '12',
 'ner_tags': [0, 5, 0, 5, 0, 1, 0, 0, 0],
 'pos_tags': [30, 22, 10, 22, 38, 22, 27, 21, 7],
 'tokens': ['Only',
            'France',
            'and',
            'Britain',
            'backed',
            'Fischler',
            "'s",
            'proposal',
            '.']}


* This dataset comes with tags pre-converted to ids, so if we want to get some idea of what these are, we need to grab the dictionaries from the dataset documentation, and the UPenn documentation (for the POS tags)

In [4]:
# From the documentation page and from here https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

POS2ID={'"': 0, "''": 1, '#': 2, '$': 3, '(': 4, ')': 5, ',': 6, '.': 7, ':': 8, '``': 9, 'CC': 10, 'CD': 11, 'DT': 12,
        'EX': 13, 'FW': 14, 'IN': 15, 'JJ': 16, 'JJR': 17, 'JJS': 18, 'LS': 19, 'MD': 20, 'NN': 21, 'NNP': 22, 'NNPS': 23,
        'NNS': 24, 'NN|SYM': 25, 'PDT': 26, 'POS': 27, 'PRP': 28, 'PRP$': 29, 'RB': 30, 'RBR': 31, 'RBS': 32, 'RP': 33,
        'SYM': 34, 'TO': 35, 'UH': 36, 'VB': 37, 'VBD': 38, 'VBG': 39, 'VBN': 40, 'VBP': 41, 'VBZ': 42, 'WDT': 43,
        'WP': 44, 'WP$': 45, 'WRB': 46}

NER2ID={'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}

ID2POS=dict((v,k) for k,v in POS2ID.items())
ID2NER=dict((v,k) for k,v in NER2ID.items())

POS2DESCRIPTION = {
    "CC": "Coordinating conjunction",
    "CD": "Cardinal number",
    "DT": "Determiner",
    "EX": "Existential there",
    "FW": "Foreign word",
    "IN": "Preposition or subordinating conjunction",
    "JJ": "Adjective",
    "JJR": "Adjective, comparative",
    "JJS": "Adjective, superlative",
    "LS": "List item marker",
    "MD": "Modal",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "NNPS": "Proper noun, plural",
    "PDT": "Predeterminer",
    "POS": "Possessive ending",
    "PRP": "Personal pronoun",
    "PRP$": "Possessive pronoun",
    "RB": "Adverb",
    "RBR": "Adverb, comparative",
    "RBS": "Adverb, superlative",
    "RP": "Particle",
    "SYM": "Symbol",
    "TO": "to",
    "UH": "Interjection",
    "VB": "Verb, base form",
    "VBD": "Verb, past tense",
    "VBG": "Verb, gerund or present participle",
    "VBN": "Verb, past participle",
    "VBP": "Verb, non-3rd person singular present",
    "VBZ": "Verb, 3rd person singular present",
    "WDT": "Wh-determiner",
    "WP": "Wh-pronoun",
    "WP$": "Possessive wh-pronoun",
    "WRB": "Wh-adverb"
}

In [5]:
# Let us see if the tags make any sense to us:
import tabulate

table=[]

ex=dset["train"][12]
for word,pos_id,ner_id in zip(ex["tokens"],ex["pos_tags"],ex["ner_tags"]):
    nertag=ID2NER[ner_id]
    postag=ID2POS[pos_id]
    table.append([word,nertag,postag,POS2DESCRIPTION.get(postag,postag)])

print(tabulate.tabulate(table,headers=["Word","NER","POS","POS definition"]))

Word      NER    POS    POS definition
--------  -----  -----  ------------------------
Only      O      RB     Adverb
France    B-LOC  NNP    Proper noun, singular
and       O      CC     Coordinating conjunction
Britain   B-LOC  NNP    Proper noun, singular
backed    O      VBD    Verb, past tense
Fischler  B-PER  NNP    Proper noun, singular
's        O      POS    Possessive ending
proposal  O      NN     Noun, singular or mass
.         O      .      .


In [6]:
## dset=dset.shuffle() #This is never a bad idea, datasets may have ordering to them, which is not what we want
# I will not shuffle right now to keep articles together, and also have some consistency across runs

# Extract features

* To keep things manageable, and to allow ourselves a chance to inspect the dataset and the features, let us be somewhat explicit and generate a new dataset, with features relevant to the task.

* Below is just a small number of features; these can be expanded as you see fit
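As one possible direction for expanding the feature set, here is a minimal sketch of a few extra token-level features. The helper name `extra_token_features` and the choice of features are assumptions made for illustration; they are not used elsewhere in this notebook. Like the other features, each one is a string without whitespace, so the features can later be joined with spaces.

```python
def extra_token_features(tok):
    """Return a few extra string features for a single token (illustrative only)."""
    feats = []
    if len(tok) > 3 and tok.isalpha():
        feats.append(f"suffix3_{tok[-3:]}")  # word endings such as -ing or -ion
    if any(ch.isdigit() for ch in tok):
        feats.append("has-digit")  # numbers, dates, scores
    if "-" in tok:
        feats.append("has-hyphen")
    if tok.isupper():
        feats.append("all-caps")
    return feats

for tok in ("backed", "1996-08-22", "BRUSSELS"):
    print(tok, extra_token_features(tok))
```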


In [7]:
def generate_features_for_pos_tagging(tokens):
    # Let's generate some elementary features for POS tagging
    all_feats=[] #list of N lists, where N is the number of tokens, and each list is a list of features as strings
    for i,tok in enumerate(tokens):
        feats=[] #features of this one example
        feats.append(f"token_{tok}") #the token itself is of course a decent feature
        
        window_size=3
        #left context
        left_window_start=max(0,i-window_size)
        for j in range(left_window_start,i):
            feats.append(f"left{-(i-j)}_token_{tokens[j]}") #tokens to the left are a feature
        #right context
        right_window_end=min(i+window_size+1,len(tokens))
        for j in range(i+1,right_window_end):
            try:
                tokens[j]
            except:
                raise ValueError((i,j,len(tokens),right_window_end))
            feats.append(f"right+{j-i}_token_{tokens[j]}") #tokens to the right are a feature
        #some other random features
        if tok[0].isupper():
            feats.append("first-letter-capitalized")
        all_feats.append(feats)
    return all_feats

pprint(generate_features_for_pos_tagging("Only France and Britain backed Fischer 's proposal .".split()))        

[['token_Only',
  'right+1_token_France',
  'right+2_token_and',
  'right+3_token_Britain',
  'first-letter-capitalized'],
 ['token_France',
  'left-1_token_Only',
  'right+1_token_and',
  'right+2_token_Britain',
  'right+3_token_backed',
  'first-letter-capitalized'],
 ['token_and',
  'left-2_token_Only',
  'left-1_token_France',
  'right+1_token_Britain',
  'right+2_token_backed',
  'right+3_token_Fischer'],
 ['token_Britain',
  'left-3_token_Only',
  'left-2_token_France',
  'left-1_token_and',
  'right+1_token_backed',
  'right+2_token_Fischer',
  "right+3_token_'s",
  'first-letter-capitalized'],
 ['token_backed',
  'left-3_token_France',
  'left-2_token_and',
  'left-1_token_Britain',
  'right+1_token_Fischer',
  "right+2_token_'s",
  'right+3_token_proposal'],
 ['token_Fischer',
  'left-3_token_and',
  'left-2_token_Britain',
  'left-1_token_backed',
  "right+1_token_'s",
  'right+2_token_proposal',
  'right+3_token_.',
  'first-letter-capitalized'],
 ["token_'s",
  'left-3_tok

In [8]:
dset_posner=datasets.DatasetDict()
for section in ("train","validation","test"):
    # In defiance of the "sentence" concept, and in line with present-day NLP
    # let us simply paste everything together
    section_dict={} #A dictionary which will hold the dataset for one section (on which we can then run from_dict)
    section_dict["token"]=[]
    section_dict["postag"]=[]
    section_dict["ner"]=[]
    section_dict["features_pos"]=[]
    for ex in dset[section]:
        section_dict["token"].extend(ex["tokens"])
        section_dict["postag"].extend(ex["pos_tags"])
        section_dict["ner"].extend(ex["ner_tags"])
    
    F=generate_features_for_pos_tagging(section_dict["token"]) #generate all features across all tokens
    for fs in F: #and now go one token at a time, the features are a list of strings, let's join them with space
        section_dict["features_pos"].append(" ".join(fs)) # so we get a string of features we can vectorize easily
    d=datasets.Dataset.from_dict(section_dict) #now make the dataset for this section
    dset_posner[section]=d #...and stick it to the main dataset we are building

pprint(dset_posner)
    

{'test': Dataset({
    features: ['token', 'postag', 'ner', 'features_pos'],
    num_rows: 46435
}),
 'train': Dataset({
    features: ['token', 'postag', 'ner', 'features_pos'],
    num_rows: 203621
}),
 'validation': Dataset({
    features: ['token', 'postag', 'ner', 'features_pos'],
    num_rows: 51362
})}


## Vectorize the data

*   **Vectorize** - build the feature vector
*   Since this is NLP, vectorize here means listing the non-zero elements of the feature vector, or in other words the indices of the rows in the embedding matrix
*   A traditional and well-tested way is to use sklearn's feature extraction package
*   CountVectorizer is most likely what we want in here, but for other NLP work the TfidfVectorizer is also very handy
*   Unlike in the text classification notebook, here we are **vectorizing the features**
*   But the process is the same, since our features are now whitespace separated strings, so we can run them through `CountVectorizer` as if they were texts like any other
*   This would also be useful if we, for example, end up having the dataset as files with features on the drive
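To illustrate what vectorizing means here, the following pure-Python sketch builds a feature vocabulary and lists the non-zero indices for one example. It is a simplified stand-in, not sklearn's implementation: the helper names `build_vocabulary` and `nonzero_indices` are hypothetical, and the real `CountVectorizer` additionally lowercases its input by default and can cap the vocabulary size with `max_features`.

```python
def build_vocabulary(feature_strings):
    """Map every distinct whitespace-separated feature to an integer index."""
    distinct = sorted({feat for text in feature_strings for feat in text.split()})
    return {feat: idx for idx, feat in enumerate(distinct)}

def nonzero_indices(feature_string, vocabulary):
    """List the vocabulary indices of the features present in one example."""
    present = {vocabulary[f] for f in feature_string.split() if f in vocabulary}
    return sorted(present)

# Toy "training data": one feature string per token
train = ["token_and left-1_token_France", "token_France right+1_token_and"]
vocab = build_vocabulary(train)
print(vocab)

# Features unseen at training time (token_unseen) are simply dropped
indices = nonzero_indices("token_and right+1_token_and token_unseen", vocab)
print(indices)
```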

In [9]:
import sklearn.feature_extraction

# Whitespace tokenization function
# we do not want CountVectorizer to tokenize inside our features!
def whitespace_tokenizer(text):
    return text.split() #split on whitespace

vectorizer=sklearn.feature_extraction.text.CountVectorizer(binary=True,max_features=30000,tokenizer=whitespace_tokenizer)

features=[ex["features_pos"] for ex in dset_posner["train"]] #get a list of all feature texts from the training data
vectorizer.fit(features) #"Trains" the vectorizer, i.e. builds its vocabulary




# Building the feature vectors

* This is super-easy with the vectorizer
* It produces a sparse matrix of the non-zero elements

In [10]:
def vectorize_example(ex):
    vectorized=vectorizer.transform([ex["features_pos"]]) # [...] because the vectorizer expects a list/iterable over inputs, not one input
    non_zero_features=vectorized.nonzero()[1] #.nonzero gives a pair of (rows,columns), we want the columns
    non_zero_features+=1 #feature index 0 will have a special meaning
                         # so let us not produce it by adding +1 to everything
    return {"input_ids":non_zero_features,"label":ex["postag"]} 

vectorized=vectorize_example(dset_posner["train"][12])
print(dset_posner["train"][12]) #watch out when reading this, this is now the dset_posner dataset, i.e. one example per word!
print(vectorized)


{'token': '1996-08-22', 'postag': 11, 'ner': 0, 'features_pos': 'token_1996-08-22 left-3_token_Peter left-2_token_Blackburn left-1_token_BRUSSELS right+1_token_The right+2_token_European right+3_token_Commission'}
{'input_ids': array([  903,  4996, 11618, 16643, 18483, 22502, 26016], dtype=int32), 'label': 11}


In [11]:
# We can map back to vocabulary and check that everything works
# vectorizer.vocabulary_ is a dictionary {key:word, value:idx}

idx2word=dict((i,w) for (w,i) in vectorizer.vocabulary_.items()) #inverse the vocab dictionary
words=[]
for idx in vectorized["input_ids"]:
    words.append(idx2word[idx-1]) ## It is easy to forget we moved all by +1
pprint(", ".join(words)) #This is now the bag-of-features representation of the token

('left-1_token_brussels, left-2_token_blackburn, left-3_token_peter, '
 'right+1_token_the, right+2_token_european, right+3_token_commission, '
 'token_1996-08-22')


# Tokenizing / vectorizing the whole dataset

* The datasets library allows us to efficiently map() a function across the whole dataset
* Can run in parallel

**Note**: confusingly, and unlike the Python `map` function, [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) returns a dataset that _keeps_ the existing values of each example. Here, the call adds the values returned by the function (`input_ids` and `label`) to each example while also keeping the original `token`, `postag`, `ner`, and `features_pos` values.


In [12]:
# Apply the vectorizer to the whole dataset using .map()
dset_tokenized = dset_posner.map(vectorize_example,num_proc=4)
pprint(dset_tokenized["train"][0])

Map (num_proc=4):   0%|          | 0/203621 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/51362 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/46435 [00:00<?, ? examples/s]

{'features_pos': 'token_EU right+1_token_rejects right+2_token_German '
                 'right+3_token_call first-letter-capitalized',
 'input_ids': [1, 18718, 22315, 27336],
 'label': 22,
 'ner': 3,
 'postag': 22,
 'token': 'EU'}


## Input encoding for MLP

* Our `input_ids` are an array containing the indices of the features found for the token
* These are the indices of the rows of the embedding matrix in the model
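Why is a list of indices enough? Summing the embedding rows selected by `input_ids` gives the same result as multiplying a binary feature vector with the embedding matrix, so the sparse index list is just a compact encoding of that vector. A minimal pure-Python check with a made-up toy matrix (all names and numbers below are illustrative):

```python
import math

# Toy embedding matrix: 4 features (rows), embedding dimension 2 (columns)
embedding = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]

def sum_rows(matrix, indices):
    """Sum the rows of `matrix` selected by `indices`."""
    dim = len(matrix[0])
    return [sum(matrix[i][d] for i in indices) for d in range(dim)]

def vector_times_matrix(vector, matrix):
    """Multiply a row vector with a matrix."""
    dim = len(matrix[0])
    return [sum(v * row[d] for v, row in zip(vector, matrix)) for d in range(dim)]

input_ids = [1, 3]            # sparse encoding: which features are present
binary_vector = [0, 1, 0, 1]  # dense encoding of the same example

by_lookup = sum_rows(embedding, input_ids)
by_product = vector_times_matrix(binary_vector, embedding)
print(by_lookup, by_product)
print(all(math.isclose(a, b) for a, b in zip(by_lookup, by_product)))
```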


# Batching and padding

* When working with neural networks, one rarely trains one example at a time
* Instead, processing always happens a batch at a time
* There are two important reasons for this:
  1. Training one example at a time is too slow (GPU parallelization cannot kick in across examples)
  2. The gradients are averaged across the whole batch and applied only once, so each update is less noisy than one computed from a single example, i.e. batching acts as a regularizer


# Padding

* In order to build a batch as a 2D array of (example, seq) (see below), we need to fit together examples of different length!
* Solution: pad the shorter examples with zeroes to the length of the longest example in the batch
* Make sure that zero is understood as padding value rather than a (hypothetical) feature with index 0
* This is best shown by example, it is in the end easier than it may sound
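Stripped of all tensor machinery, padding amounts to the following. This is a plain-list sketch under the assumption that the batch is non-empty; the name `pad_batch` is made up for illustration.

```python
PAD = 0  # index reserved for padding; real features start at 1

def pad_batch(sequences):
    """Right-pad every sequence with PAD to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [PAD] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[4, 9, 2], [7], [3, 5]])
print(batch)  # [[4, 9, 2], [7, 0, 0], [3, 5, 0]]
```

Every row now has the same length, so the batch can be turned into a single 2D array.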

# Collator

* This is simply a function which takes a list of examples and builds a training batch out of them
* Just as examples are dictionaries holding the data, batches are dictionaries too
* The only difference is that in a batch, all data tensors have one extra dimension, that's all there is to it

In [13]:
import torch

# The collator is defined here and explained below
def collator(list_of_examples):
    batch={"labels":torch.tensor(list(ex["label"] for ex in list_of_examples))} #this is easy, labels are made into a single tensor
    #the harder bit is now to pad the examples, as they are of different length
    tensors=[]
    max_len=max(len(example["input_ids"]) for example in list_of_examples) #this is the longest example in the batch
    max_len=max(1,max_len) #make sure the batch is at least one column wide, even if no example has any features
    #everything needs to be padded to fit in length the longest example
    #(so we can build a single tensor out of it)
    for example in list_of_examples:
        ids=torch.LongTensor(example["input_ids"]) #pick the input ids
        # pad(what,(from_left, from_right)) <- this is how we call the stock pad function
        padded=torch.nn.functional.pad(ids,(0,max_len-ids.shape[0])) #pad by max - current length, pads with zero by default
        tensors.append(padded) #accumulate the padded ids
    batch["input_ids"]=torch.vstack(tensors) #now that we have all of them the same length, a simple vstack() stacks them up
    #print(batch)
    return batch #...and that's all there is to it


#Build a batch from 2 examples, with padding
batch=collator([dset_tokenized["train"][2],dset_tokenized["train"][7]])
print("Shape of labels:",batch["labels"].shape)
print("Shape of input_ids:",batch["input_ids"].shape)
print("labels:",batch["labels"])
print("input_ids:",batch["input_ids"])

Shape of labels: torch.Size([2])
Shape of input_ids: torch.Size([2, 6])
labels: tensor([16, 21])
input_ids: tensor([[    1,  5694, 13810, 20966, 22229, 27598],
        [  889, 12542, 12892, 19884, 22165,     0]])


# Build the MLP model

* Now that all of our data is in shape, we can build the model
* That is luckily quite easy in this case

The model class in its simplest form has `__init__()` which instantiates the layers and `forward()` which implements the actual computation. For more information on these, please see the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).

In [14]:
import torch
import transformers
import collections

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be, for the most part the trained parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size #embedding matrix row count
        # Build and initialize embedding of vocab size +1 x hidden size (+1 because of the padding index 0!)
        self.embedding=torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0)
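        # Normally you would not initialize these yourself, but I have my reasons here ;)
        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001) #initialize the embeddings with small random values
        # Note! uniform_() overwrites the whole matrix, including row 0 which padding_idx had set to zeros,
        # so zero the padding row again; it then stays pure zeros because padding_idx blocks its gradient
        self.embedding.weight.data[0].zero_()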
        # This takes care of the lower half of the network, now the upper half
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        # Now we have the parameters of the model

        
    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    def forward(self,input_ids,labels=None): #the argument names match the keys of the batch built by the collator
        #1) sum up the embeddings of the items
        embedded=self.embedding(input_ids) #(batch,ids)->(batch,ids,embedding_dim)
        # Since the Embedding keeps the first row of the matrix pure zeros, we don't need to worry about the padding
        # so next we sum the embeddings across the word dimension
        # (batch,ids,embedding_dim) -> (batch,embedding_dim)
        embedded_summed=torch.sum(embedded,dim=1)
        
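        #2) apply non-linearity
        # (batch,embedding_dim) -> (batch,embedding_dim)
        # Note: `projected` is deliberately NOT fed to the output layer below, which keeps the model linear;
        # the feature analysis at the end of the notebook relies on this
        projected=torch.tanh(embedded_summed) #non-linearity is applied here in forward(), not when configuring the layer in __init__()

        #3) and now apply the upper, output layer of the network
        # (batch,embedding_dim) -> (batch, num_of_classes i.e. config.nlabels, 47 POS tags in our case)
        logits=self.output(embedded_summed)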

        # ...and that's all there is to it!

        #print("input_ids.shape",input_ids.shape)
        #print("embedded.shape",embedded.shape)
        #print("embedded_summed.shape",embedded_summed.shape)
        #print("projected.shape",projected.shape)
        #print("logits.shape",logits.shape)
        
        # We have labels, so we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is meant for classification, so let's use it
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)

# Configure the model:
#   these parameters are used in the model's __init__()
nlabels=len(POS2ID)
mlp_config=MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=20,nlabels=nlabels)


In [15]:
# And we can make a model
mlp=MLP(mlp_config)
fake_batch=collator([dset_tokenized["train"][0],dset_tokenized["train"][1]])
mlp(**fake_batch) #** expands input_ids and labels as parameters of the call

(tensor(3.7900, grad_fn=<NllLossBackward0>),
 tensor([[ 0.0737, -0.1518, -0.0736,  0.0187,  0.1346,  0.1872,  0.1138,  0.1776,
          -0.0591, -0.0453, -0.1307,  0.0126, -0.2072,  0.1067, -0.1423,  0.1032,
           0.1431,  0.1283, -0.1666, -0.0785, -0.2140, -0.0314,  0.1384,  0.1029,
           0.0087, -0.1134,  0.1243, -0.0080, -0.1145,  0.0413, -0.0728,  0.1759,
          -0.0770, -0.1362,  0.0064,  0.0958, -0.1023,  0.2181, -0.1621, -0.0436,
           0.1191,  0.0033,  0.0135,  0.0509,  0.0147, -0.0304,  0.2237],
         [ 0.0730, -0.1519, -0.0730,  0.0184,  0.1356,  0.1868,  0.1125,  0.1784,
          -0.0585, -0.0445, -0.1304,  0.0146, -0.2077,  0.1076, -0.1426,  0.1039,
           0.1429,  0.1285, -0.1669, -0.0783, -0.2144, -0.0306,  0.1383,  0.1033,
           0.0095, -0.1134,  0.1229, -0.0104, -0.1131,  0.0424, -0.0744,  0.1769,
          -0.0757, -0.1361,  0.0067,  0.0967, -0.1020,  0.2181, -0.1619, -0.0431,
           0.1191,  0.0045,  0.0119,  0.0508,  0.0161, -0.030

# Train the model

We will use the Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class for training

* Loads of arguments that control the training
* Configurable metrics to evaluate performance
* Data collator builds the batches
* Early stopping callback stops when eval loss no longer improves
* Model load/save
* Good foundation for later deep learning course
  

First, let's create a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.TrainingArguments) object to specify hyperparameters and various other settings for training. 

Printing this simple dataclass object will show not only the values we set, but also the defaults for all other arguments. Don't worry if you don't understand what all of these do! Many are not relevant to us here, and you can find the details in [`Trainer` documentation](https://huggingface.co/docs/transformers/main_classes/trainer) if you are interested.

In [16]:
# Set training arguments
# their names are mostly self-explanatory
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4, #learning rate of the gradient descent
    max_steps=20000,
    load_best_model_at_end=True,
    per_device_train_batch_size=128
)

pprint(trainer_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=500,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=False,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ign

Next, let's create a metric for evaluating performance during and after training. We can use the convenience function `evaluate.load`, the successor of [`load_metric`](https://huggingface.co/docs/datasets/about_metrics) from `datasets`, to load one of many pre-made metrics and wrap this for use by the trainer.

We can use the basic `accuracy` metric, defined as the proportion of correctly predicted labels out of all labels. This time, though, the labels are not evenly distributed across the classes, so keep in mind that a tagger which always predicts a frequent tag already reaches a non-trivial accuracy.
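To see why class imbalance matters when reading accuracy numbers, here is a small sketch that computes accuracy by hand and compares it with a majority-class baseline. The label values are invented toy data, and `simple_accuracy` is a hypothetical helper, not the `evaluate` implementation.

```python
from collections import Counter

def simple_accuracy(predictions, references):
    """Proportion of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy labels in which class 0 dominates
references = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]

# A "model" that always predicts the most frequent label
majority_label = Counter(references).most_common(1)[0][0]
baseline_predictions = [majority_label] * len(references)

baseline_accuracy = simple_accuracy(baseline_predictions, references)
print(baseline_accuracy)  # 0.7
```

Any trained model should be judged against such a baseline rather than against zero.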

In [27]:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)

We can then create the `Trainer` and train the model by invoking the [`Trainer.train`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.train) function.

In addition to the model, the settings passed in through the `TrainingArguments` object created above (`trainer_args`), the data, and the metric defined above, we create and pass the following to the `Trainer`:

* [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator): groups input into batches
* [`EarlyStoppingCallback`](https://huggingface.co/docs/transformers/main_classes/callback#transformers.EarlyStoppingCallback): stops training when performance stops improving
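The idea behind early stopping with patience can be sketched in a few lines. This is a simplified illustration of the logic, not the code of the `transformers` callback; the function name `first_stopping_point` and the loss values are made up.

```python
def first_stopping_point(eval_losses, patience):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0  # consecutive evaluations without improvement
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# Toy evaluation losses: the best value is reached at index 2
losses = [3.5, 3.0, 2.6, 2.7, 2.65, 2.8]
stop_at = first_stopping_point(losses, patience=3)
print(stop_at)  # 5
```

With a larger patience the same loss curve would not trigger a stop at all, and training would simply run until `max_steps`.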

In [18]:
dset_tokenized["train"][1]

{'token': 'rejects',
 'postag': 42,
 'ner': 0,
 'features_pos': 'token_rejects left-1_token_EU right+1_token_German right+2_token_call right+3_token_to',
 'input_ids': [1513, 14675, 17916, 25575],
 'label': 42}

In [19]:
# Make a new model  
mlp = MLP(mlp_config)


# The argument gives the patience, i.e. the number of evaluations
# for which the evaluation loss may fail to improve
# before training is stopped
early_stopping = transformers.EarlyStoppingCallback(5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dset_tokenized["train"],
    eval_dataset=dset_tokenized["validation"].select(range(1000)), #make a smaller subset to evaluate on
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

print(dset_tokenized["train"][135])
# FINALLY!
trainer.train()

{'token': 'proposal', 'postag': 21, 'ner': 0, 'features_pos': 'token_proposal left-3_token_He left-2_token_said left-1_token_a right+1_token_last right+2_token_month right+3_token_by', 'input_ids': [340, 7754, 10559, 15162, 19526, 22303, 28844], 'label': 21}




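| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 500  | 3.7478 | 3.581201 | 0.453 |
| 1000 | 3.3184 | 3.042244 | 0.451 |
| 1500 | 2.7917 | 2.557265 | 0.454 |
| 2000 | 2.3796 | 2.227563 | 0.496 |
| 2500 | 2.0818 | 1.981248 | 0.566 |
| 3000 | 1.8557 | 1.783272 | 0.618 |
| 3500 | 1.6596 | 1.618515 | 0.665 |
| 4000 | 1.5084 | 1.481169 | 0.696 |
| 4500 | 1.3789 | 1.36312 | 0.724 |
| 5000 | 1.2657 | 1.264437 | 0.749 |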

TrainOutput(global_step=20000, training_loss=1.0805927307128906, metrics={'train_runtime': 132.8159, 'train_samples_per_second': 19274.8, 'train_steps_per_second': 150.584, 'total_flos': 121266452160.0, 'train_loss': 1.0805927307128906, 'epoch': 12.57})

We can then evaluate the trained model on a given dataset (here our test subset) by calling [`Trainer.evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate):

In [20]:
eval_results = trainer.evaluate(dset_tokenized["test"])

print(eval_results)

{'eval_loss': 0.6376008987426758, 'eval_accuracy': 0.8535802735005922, 'eval_runtime': 8.3005, 'eval_samples_per_second': 5594.266, 'eval_steps_per_second': 699.359, 'epoch': 12.57}


# Save the model for later use

* You can save it with `trainer.save_model()`
* You can load it with `MLP.from_pretrained()`


In [28]:
trainer.save_model("mlp-postagger")

# What has the model learned?

* The embeddings should have some meaning to them
* Similar features should have similar embeddings

In [29]:
# Grab the embedding matrix out of the trained model
# and drop the first row (padding 0)
# then we can treat the embeddings as vectors
# and maybe compare them to each other
# ha ha this below took some googling
weights=mlp.embedding.weight.detach().cpu().numpy()
weights=weights[1:,:]

In [31]:
import sklearn.metrics

qry_idx=vectorizer.vocabulary_["token_in"]

#calculate the distance of the "token_in" embedding to all other embeddings
distance_to_qry=sklearn.metrics.pairwise.euclidean_distances(weights[qry_idx:qry_idx+1,:],weights)
nearest_neighbors=np.argsort(distance_to_qry) #indices of features nearest to "token_in"
for nearest in nearest_neighbors[0,:20]:
    print(idx2word[nearest])
# This works great!

token_in
token_for
token_with
token_on
token_by
token_from
token_of
token_after
token_at
token_as
token_under
token_into
token_against
token_than
token_over
token_if
token_since
token_between
token_before
token_about


* The embeddings indeed seem to reflect the task
* There is a meaning to them
* But now we have many classes, so we should take that into account too
* We can take the dot-product of the feature embeddings with the output layer weight of the class we care about
* When you think how the information propagates in the network, this will give us a single number reflecting each feature w.r.t. the selected label
* Technically speaking, it is the prediction (up to the bias term of the output layer) of an example which only has that one feature, with respect to that one class
* Here is how we can implement it (here we rely on the fact that the model is linear, since the output layer in the model's `forward()` is applied to the summed embeddings directly, bypassing the `tanh()` nonlinearity)
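The reasoning above can be checked on a toy linear model: because the output layer is applied to a sum of embeddings, the logit of a label decomposes into one score per feature plus the bias. All numbers and names in this sketch are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy model: 3 features with 2-dimensional embeddings, 2 labels
embeddings = [[0.5, -1.0], [2.0, 0.0], [-1.0, 1.5]]
output_weights = [[1.0, 2.0], [-1.0, 0.5]]  # one row per label
bias = [0.1, -0.2]

def logits(feature_ids):
    """Forward pass of the linear model: sum the embeddings, then apply the output layer."""
    summed = [sum(embeddings[f][d] for f in feature_ids) for d in range(2)]
    return [dot(w, summed) + b for w, b in zip(output_weights, bias)]

def feature_score(feature_id, label_id):
    """Contribution of one feature to the logit of one label."""
    return dot(embeddings[feature_id], output_weights[label_id])

example = [0, 2]  # an example with features 0 and 2
full = logits(example)
from_parts = [
    sum(feature_score(f, label_id) for f in example) + bias[label_id]
    for label_id in range(2)
]
print(full, from_parts)
print(all(math.isclose(a, b) for a, b in zip(full, from_parts)))
```

With a `tanh()` between the sum and the output layer this decomposition would no longer hold exactly, which is why the analysis relies on the model being linear.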

In [39]:
import numpy
embedding_weights=weights #shape (features,embedding-dim)
output_weights=mlp.output.weight.detach().cpu().numpy() #shape (num-labels,embedding-dim)
#We just matrix-multiply these together, since this gives us all the dot-products
weights_by_label=numpy.matmul(embedding_weights,output_weights.T)
weights_by_label.shape

(30000, 47)

In [58]:
def get_most_important_features_for_and_against(label):
    label_idx=POS2ID[label]
    feature_weights=weights_by_label[:,label_idx] #pick the column that interests us
    #The shape of feature_weights is (feature_vocab_size,) i.e. it is a vector
    features_weight_idx=numpy.argsort(-feature_weights) #sort in descending order, this will be vector of indices
    features_for=[idx2word[feature_idx] for feature_idx in features_weight_idx[:20]]
    features_against=[idx2word[feature_idx] for feature_idx in features_weight_idx[-20:][::-1]]
    return features_for, features_against

for label in ("DT","NN","VB"):
    dt_plus,dt_minus=get_most_important_features_for_and_against(label)
    print(f"{label}: {POS2DESCRIPTION[label]}")
    print(f"Most important features *for* label {label}:")
    pprint("   ".join(dt_plus))
    print()
    print(f"Most important features *against* label {label}:")
    pprint("   ".join(dt_minus))
    print("\n------\n")


DT: Determiner
Most important features *for* label DT:
('token_the   token_a   token_an   token_this   token_some   token_no   '
 'token_any   token_all   token_both   token_another   token_those   '
 'token_these   token_his   token_it   token_(   token_each   token_he   '
 'token_that   token_their   token_its')

Most important features *against* label DT:
('token_.   token_)   token_-   token_to   token_:   token_,   '
 'left-1_token_a   token_--   token_percent   token_year   token_government   '
 'token_been   left-1_token_an   left-1_token_the   token_soccer   '
 'token_have   left-2_token_(   token_are   token_time   left-1_token_his')

------

NN: Noun, singular or mass
Most important features *for* label NN:
('token_percent   token_year   token_government   token_soccer   token_week   '
 'token_time   left-1_token_a   token_police   token_.   token_state   '
 'token_company   right+1_token_of   token_market   token_spokesman   '
 'token_group   token_home   left-1_token_his   