# Homework 6. Sequence Tagging with BERT

Welcome to Homework 6! 

The homework contains several tasks. The number of points awarded for a correct solution is given in each task header. The maximum number of points for each homework is _six_.

The **grading** for each task is the following:
- correct answer - **full points**
- insufficient solution or solution resulting in the incorrect output - **half points**
- no answer or completely wrong solution - **no points**

Even if you don't know how to solve the task, we encourage you to write down your thoughts and progress and try to address the issues that stop you from completing the task.

When working on the written tasks, try to make your answers short and accurate. Most of the time, it is possible to answer the question in 1-3 sentences.

When writing code, make it readable. Choose appropriate names for your variables (`a = 'cat'` - not good, `word = 'cat'` - good). Avoid lines of code longer than 100 characters (79 characters is ideal). If needed, add comments to your code; however, good code should be easily readable without them :)

Finally, all your answers must be written by you alone. Copying them from other sources will be considered academic fraud. You can discuss the tasks with your classmates, but each solution must be individual.

<font color='red'>**Important!:**</font> **before sending your solution, run `Kernel -> Restart & Run All` to ensure that all your code works.**

## Task 1. Prepare the environment and download the data (1 point)

In [0]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.sampler import SubsetRandomSampler
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

from pathlib import Path
import time
import random
from collections import Counter

from typing import List, Dict

# Check if we are running on a CPU or GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

If you are using Google Colab, you need to install the `transformers` library first.

Another useful library to make the data loading cleaner and easier is [conllu](https://github.com/EmilStenstrom/conllu/). In particular, we are going to use the `parse()` function from this package. You can consult the documentation if you want to know more about this package.
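
To make the data format concrete, here is a minimal sketch of the CoNLL-U layout that `parse()` reads. It does not use the `conllu` package: the helper `read_form_and_upos` and the two-sentence sample are made up for illustration. The sketch assumes the standard ten tab-separated columns (word form in column 2, UPOS tag in column 4) and simply skips multiword-token and empty-node lines.

```python
# A tiny hand-written CoNLL-U sample (hypothetical data, not from a real treebank).
SAMPLE = """\
# text = Dogs bark.
1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_

# text = Hi!
1\tHi\thi\tINTJ\t_\t_\t0\troot\t_\t_
2\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_
"""


def read_form_and_upos(text):
    """Return a list of sentences, each a list of [form, upos] pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith('#'):
            # Comment lines carry sentence-level metadata.
            continue
        else:
            columns = line.split('\t')
            token_id = columns[0]
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
            if '-' in token_id or '.' in token_id:
                continue
            current.append([columns[1], columns[3]])
    if current:
        sentences.append(current)
    return sentences


sentences = read_form_and_upos(SAMPLE)
print(sentences)
```

The nested `[form, upos]` lists have the same shape as the data that the `load_doc` method later in this notebook builds from the output of `parse()`.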

In [0]:
!pip install transformers
!pip install conllu

In this Homework, we are going to use `AutoModelForTokenClassification`, `AutoConfig` and `AutoTokenizer` to process our data. Please read more about them on the official documentation page: https://huggingface.co/transformers/model_doc/auto.html.

In short, these classes will create a specific model class for us. We just need to specify the name for the model. For example, if we call `AutoTokenizer.from_pretrained('bert-base-multilingual-cased')` it is going to create a `BertTokenizer` for us. 

In [0]:
from transformers import AutoModelForTokenClassification, AdamW, AutoConfig
from transformers import AutoTokenizer, PreTrainedTokenizer
from transformers import get_linear_schedule_with_warmup

from conllu import parse

As in the previous Homework, we are going to use [Universal Dependencies](https://universaldependencies.org/) data. It has labelled corpora for morphological tagging and syntactic parsing for more than 70 languages. Choose your language on the official UD page, pick the treebank that you like and follow the GitHub link to it. Then, on GitHub, copy the link from the green "Clone or download" button and paste it into the cell below.

Also, replace the name of your treebank in the `!mv` command.

For example, if I choose the EDT treebank for Estonian from [here](https://universaldependencies.org/#estonian-treebanks), the GitHub link is going to be `https://github.com/UniversalDependencies/UD_Estonian-EDT.git` and the name of the treebank is `UD_Estonian-EDT`, which is the name of the repository.

Replace the `...` in `!git clone` with the GitHub link to the repository that you've chosen.

Replace the `...` in `!mv` with the name of the treebank that you've chosen.

In [0]:
!git clone ...
!mkdir data/
!mv .../ data/

Here, you will need to choose the model to suit your data. The list of all available models can be found here: https://huggingface.co/transformers/pretrained_models.html

The common advice is that a language-specific model, if one is available, is the best choice. Otherwise, check whether a `multilingual` model covers your language (https://github.com/google-research/bert/blob/master/multilingual.md) and go with it. Your model must also support token classification. You can see the list of available models here: https://huggingface.co/transformers/model_doc/auto.html#automodelfortokenclassification

In this Homework, you are free to choose any suitable model. To do that, paste the appropriate name into the `MODEL_NAME` variable.

Finally, don't forget to put the name of your UD treebank into the `DATA_PATH` variable.

__What model did you choose? Briefly explain your choice.__

<font color='red'>Your answer here</font>

In [0]:
PAD = '[PAD]'
PAD_ID = 0
UNK = '[UNK]'
UNK_ID = 1
CLS = '[CLS]'
CLS_ID = 2
SEP = '[SEP]'
SEP_ID = 3
VOCAB_PREFIX = [PAD, UNK, CLS, SEP]
# Setting label padding to -100 since this is a default value for ignore_index
# in Pytorch CrossEntropyLoss:
# (https://pytorch.org/docs/stable/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss)
LABEL_PAD_ID = -100
DATA_PATH = Path('data') / '...'

MODEL_NAME = '...'

batch_size = 16
random_seed = 42

Initialize your tokenizer with `AutoTokenizer.from_pretrained()`. Pay attention to the `do_lower_case` parameter. It should be set appropriately, depending on whether your model is `cased` or `uncased`.
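
As a small illustration of the `cased` / `uncased` distinction, the sketch below guesses the flag from the model name. The helper `guess_do_lower_case` is hypothetical and relies only on the common naming convention, so treat it as a heuristic and confirm the answer in the description of the model you picked.

```python
def guess_do_lower_case(model_name: str) -> bool:
    """Heuristic: models with 'uncased' in their name expect lower-cased input."""
    # Note that 'uncased' must be checked, since 'cased' is a substring of it.
    return 'uncased' in model_name.lower()


print(guess_do_lower_case('bert-base-uncased'))
print(guess_do_lower_case('bert-base-multilingual-cased'))
```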

You can read about other available parameters here: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer

In [0]:
tokenizer = AutoTokenizer.from_pretrained(...)

You can test whether the tokenizer is working here:

In [0]:
print(tokenizer.tokenize("This is the BERT tokenizer that we're going to use today."))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("This is the BERT tokenizer that we're going to use today.")))

Another useful thing to know is how to access the IDs of the special tokens (`[CLS]`, `[SEP]`, `[UNK]`):

In [0]:
print(f'Token: {tokenizer.cls_token}\tID: {tokenizer.cls_token_id}')
print(f'Token: {tokenizer.sep_token}\tID: {tokenizer.sep_token_id}')
print(f'Token: {tokenizer.unk_token}\tID: {tokenizer.unk_token_id}')

## Task 2. Load the data (4 points)

For loading the data, we can benefit from the `WordVocab` class from Homework 4 to create the vocabulary for the labels.

In [0]:
class BaseVocab:
    def __init__(self, data, idx=0, lower=False):
        self.data = data
        self.lower = lower
        self.idx = idx
        self.build_vocab()
        
    def normalize_unit(self, unit):
        if self.lower:
            return unit.lower()
        else:
            return unit
        
    def unit2id(self, unit):
        unit = self.normalize_unit(unit)
        if unit in self._unit2id:
            return self._unit2id[unit]
        else:
            return self._unit2id[UNK]
    
    def id2unit(self, id):
        return self._id2unit[id]
    
    def map(self, units):
        return [self.unit2id(unit) for unit in units]

    def unmap(self, ids):
        return [self.id2unit(idx) for idx in ids]
        
    def build_vocab(self):
        # Subclasses must override this method to fill _id2unit and _unit2id.
        raise NotImplementedError()
        
    def __len__(self):
        return len(self._unit2id)

In [0]:
class WordVocab(BaseVocab):
    def build_vocab(self):
        # Collect the unit at position `idx` of every word in every sentence.
        units = [w[self.idx] for sent in self.data for w in sent]
        if self.lower:
            units = [unit.lower() for unit in units]
        counter = Counter(units)

        # Special tokens come first, then the units by descending frequency.
        by_frequency = sorted(counter, key=lambda k: counter[k], reverse=True)
        self._id2unit = VOCAB_PREFIX + by_frequency
        self._unit2id = {w: i for i, w in enumerate(self._id2unit)}
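
To see what such a label vocabulary looks like, here is a self-contained sketch that repeats the same steps on a few made-up sentences: count the tags, sort them by frequency, and put the special tokens first. The toy data and the helper `tag_to_id` are assumptions for illustration only.

```python
from collections import Counter

VOCAB_PREFIX = ['[PAD]', '[UNK]', '[CLS]', '[SEP]']

# Toy data in the [form, upos] layout used by the dataset class (made up).
toy_data = [
    [['Dogs', 'NOUN'], ['bark', 'VERB'], ['.', 'PUNCT']],
    [['Cats', 'NOUN'], ['sleep', 'VERB'], ['.', 'PUNCT']],
    [['Dogs', 'NOUN'], ['and', 'CCONJ'], ['cats', 'NOUN'], ['.', 'PUNCT']],
]

# Count the tag (index 1) of every word in every sentence.
counter = Counter(word[1] for sent in toy_data for word in sent)

# Special tokens take ids 0-3, then tags follow by descending frequency.
id2unit = VOCAB_PREFIX + sorted(counter, key=lambda tag: counter[tag], reverse=True)
unit2id = {unit: idx for idx, unit in enumerate(id2unit)}


def tag_to_id(tag):
    """Map a tag to its id, falling back to the [UNK] id for unseen tags."""
    return unit2id.get(tag, unit2id['[UNK]'])


print(id2unit)
print(tag_to_id('NOUN'), tag_to_id('XYZ'))
```

Because the special tokens occupy the first four positions, the `[PAD]` label gets id `0` and the first real tag gets id `4`.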

You might have seen in Lab 6 that the data for BERT models must follow a set convention. In particular, each sequence must start with the `[CLS]` token and end with the `[SEP]` token. Even though we are not going to use the `[CLS]` token in this task, it is still required to be in your data.

Another important thing that we need to take into account is word tokenization. In the UD data, the sentences are already pre-tokenized. We could use the existing tokenization, but then we would have a lot of `[UNK]` tokens, especially for a language other than English. The BERT tokenizer instead splits unknown words into sub-words so that the model can still get useful information from them.

Let's see the following example:

If we tokenize the sentence "This is the BERT tokenizer that we're going to use today." with the tokenizer from the `bert-base-multilingual-cased` model, we get these tokens: `['This', 'is', 'the', 'BE', '##RT', 'tok', '##eni', '##zer', 'that', 'we', "'", 're', 'going', 'to', 'use', 'today', '.']`. The problem is that we now need to align the labels with the tokens, since, for example, the word `tokenizer` is split into three sub-words.

One way to overcome this is not to use the BERT tokenizer and just convert already pre-tokenized words to ids. However, then we get the following mapping:

Tokens: `['[CLS]', 'This', 'is', 'the', 'BERT', 'tokenizer', 'that', 'we', "'", 're', 'going', 'to', 'use', 'today', '.', '[SEP]']`

IDs: `[[CLS], 10747, 10124, 10105, [UNK], [UNK], 10189, 11951, 112, 11639, 19090, 10114, 11760, 18745, 119, [SEP]]`

You can see that we now have two `[UNK]` tokens since `BERT` and `tokenizer` were not in the vocabulary. We also lost five sub-words that could give the model useful features from the text. What is more, for languages other than English, the number of `[UNK]` tokens is going to be even higher.

A better way to handle this is to keep the BERT tokenization and use the real label for the first sub-word and the padding label for the following sub-words. For example, for the same sentence:

`['[CLS]', 'This', 'is', 'the', 'BE', '##RT', 'tok', '##eni', '##zer', 'that', 'we', "'", 're', 'going', 'to', 'use', 'today', '.', '[SEP]']`

We will have the following labels:

`['[CLS]', 'DET', 'VERB', 'DET', 'NOUN', '[PAD]', 'NOUN', '[PAD]', '[PAD]', 'DET', 'PRON', 'PUNCT', 'AUX', 'VERB', 'PART', 'VERB', 'NOUN', 'PUNCT', '[SEP]']`
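
The alignment rule described above can be sketched on plain strings. This is only an illustration: `TOY_SPLITS` and `toy_tokenize` are hypothetical stand-ins for a real tokenizer, with the sub-word splits copied from the example sentence, and `align_labels` is not the `preprocess` method you are asked to write.

```python
# Hypothetical sub-word splits, copied from the example sentence above.
# A real tokenizer would produce these; the dictionary only stands in for it.
TOY_SPLITS = {
    'BERT': ['BE', '##RT'],
    'tokenizer': ['tok', '##eni', '##zer'],
}


def toy_tokenize(word):
    """Return the sub-words of a word, or the word itself if it is not split."""
    return TOY_SPLITS.get(word, [word])


def align_labels(words, tags, pad_label='[PAD]'):
    """Give the real tag to the first sub-word and pad_label to the rest."""
    tokens, labels = ['[CLS]'], ['[CLS]']
    for word, tag in zip(words, tags):
        pieces = toy_tokenize(word)
        tokens += pieces
        labels += [tag] + [pad_label] * (len(pieces) - 1)
    tokens.append('[SEP]')
    labels.append('[SEP]')
    return tokens, labels


tokens, labels = align_labels(['the', 'BERT', 'tokenizer'], ['DET', 'NOUN', 'NOUN'])
for token, label in zip(tokens, labels):
    print(f'{token:<8} {label}')
```

The two lists always have the same length, which is what lets the model compare one prediction with one label at every position.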

The next step is to pad all the sequences to the max length. For the input IDs, we are going to pad them with `0` and for the labels with `-100`. We use `-100` for the labels since this is the default value for the `ignore_index` in `CrossEntropyLoss`. This means that the labels with the ID `-100` are not going to contribute to the loss.
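
To illustrate why `-100` labels do not contribute to the loss, here is a plain-Python sketch that mimics the effect of `ignore_index` with the default mean reduction. It is not PyTorch's implementation: the function `mean_cross_entropy` and the toy logits are assumptions made for this example.

```python
import math

LABEL_PAD_ID = -100


def mean_cross_entropy(logits, labels, ignore_index=LABEL_PAD_ID):
    """Average cross-entropy over positions whose label is not ignore_index."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            # Ignored positions are skipped entirely.
            continue
        log_normalizer = math.log(sum(math.exp(score) for score in scores))
        losses.append(log_normalizer - scores[label])
    return sum(losses) / len(losses)


# Two real positions followed by one padded position (toy values).
logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
labels = [0, 1, LABEL_PAD_ID]

loss_with_padding = mean_cross_entropy(logits, labels)
loss_without_padding = mean_cross_entropy(logits[:2], labels[:2])
print(loss_with_padding, loss_without_padding)
```

Both calls give the same value, so whatever the model predicts at a padded position has no effect on the loss.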

Finally, we need to create an attention mask which holds `1` in the place of meaningful tokens and `0` in the place of paddings. You can read more about this here: https://huggingface.co/transformers/glossary.html#attention-mask.
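
The padding and the attention mask can be sketched together on plain Python lists. The helper `pad_to_max_length` and the short ID lists are made up for illustration, and the sketch assumes that the sequence already fits into `max_length`.

```python
PAD_ID = 0           # padding value for the input ids
LABEL_PAD_ID = -100  # padding value for the labels


def pad_to_max_length(input_ids, label_ids, max_length):
    """Pad ids and labels to max_length and build the matching attention mask."""
    n_real = len(input_ids)
    n_pad = max_length - n_real
    if n_pad < 0:
        raise ValueError('The sequence is longer than max_length; trim it first.')
    padded_ids = input_ids + [PAD_ID] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    padded_labels = label_ids + [LABEL_PAD_ID] * n_pad
    return padded_ids, attention_mask, padded_labels


ids, mask, labels = pad_to_max_length([101, 146, 119, 102], [2, 16, 5, 3], max_length=6)
print(ids)
print(mask)
print(labels)
```

All three lists end up with the same length, and the mask is `1` exactly where the input holds a real token.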

Now you should be ready to create the dataset by completing the `preprocess` method below. Read the data word by word and check whether each word is split into sub-words; if it is, add the correct number of padding labels after the real label. The final output should look similar to this:

```
Token ids  : tensor([  101,   146, 10483, 64254, 42430, 10107, 50302, 15938,
                     17802, 20509, 10410, 60400, 26419, 10123, 10124, 25151, 
                     15636, 10123,   119,   102,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0,
                         0,     0,     0,     0,     0,     0,     0,     0])
Attention mask: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,   
                        1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  
                        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   
                        0, 0])
Label ids: tensor([   2,   16,    0,    8,    0,    0,    4,    0,    6,   
                      8,    0,    0,    0,    0,    4,    0,    0,    0,    
                      5,    3, -100, -100, -100, -100, -100, -100, -100, -100, 
                   -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
                   -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
                   -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
                   -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,  
                   -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
                   -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
                   -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
                   -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
                   -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
                   -100, -100, -100, -100, -100, -100, -100, -100, -100, -100])
```



_Some tips:_

- To get the ID of the tag, use `vocab['upos'].unit2id(upostag)`.
- You can extend lists with `+=`.

In [0]:
class TaggerDataset(Dataset):
    def __init__(self, data_path: str, tokenizer: PreTrainedTokenizer,
                 vocab: BaseVocab = None, max_length: int = 128,
                 label_pad_id: int = -100):
        self.pretrained_tokenizer = tokenizer
        self.max_length = max_length
        self.label_pad_id = label_pad_id
        data = self.load_doc(data_path)

        if vocab is None:
            self.vocab = self.init_vocab(data)
        else:
            self.vocab = vocab

        self.data = self.preprocess(data, self.vocab)

    def init_vocab(self, data: List) -> Dict[str, BaseVocab]:
        uposvocab = WordVocab(data, idx=1)
        vocab = {'upos': uposvocab}
        return vocab

    def preprocess(self, data: List, vocab: Dict[str, BaseVocab]) -> List[List[int]]:
        processed = []
        for sent in data:
            # Put the [CLS] token in the beginning of each sentence
            input_ids = [self.pretrained_tokenizer.convert_tokens_to_ids(CLS)]
            attention_mask = [1]
            # Put the [CLS] label to match the sequence
            tokenized_labels = [CLS_ID]
            for word in sent:
                form = word[0]
                upostag = word[1]
                token_ids = ...
                if len(token_ids) > 1:
                    ...
                else:
                    ...
            
            # Trim the sequences to max_length - 1 since we will add [SEP] token
            # in the end later
            if len(input_ids) + 1 > self.max_length:
                input_ids = ...
                attention_mask = ...
                tokenized_labels = ...

            # Adding [SEP] token to mark the end of sequence
            input_ids += [self.pretrained_tokenizer.convert_tokens_to_ids(SEP)]
            attention_mask += [1]
            tokenized_labels += [SEP_ID]

            # Padding the rest to the max_length
            input_ids += ...
            attention_mask += ...
            tokenized_labels += ...

            # Check that all the inputs have equal lengths
            assert len(input_ids) == self.max_length, f"Input length is {len(input_ids)} while max length is {self.max_length}"
            assert len(attention_mask) == self.max_length, f"Attention mask length is {len(attention_mask)} while max length is {self.max_length}"
            assert len(tokenized_labels) == self.max_length, f"Labels length is {len(tokenized_labels)} while max length is {self.max_length}"

            # Converting python lists to pytorch long tensors
            input_ids = torch.tensor(input_ids, dtype=torch.long)
            attention_mask = torch.tensor(attention_mask, dtype=torch.long)
            tokenized_labels = torch.tensor(tokenized_labels, dtype=torch.long)

            processed.append([input_ids, attention_mask, tokenized_labels])
        return processed
        
    def load_doc(self, data_path: str) -> List:
        # Use a context manager so that the file is closed after reading.
        with open(data_path, encoding='utf-8') as doc_file:
            doc_text = doc_file.read()
        data = [[[token['form'], token['upostag']] for token in sent] for sent in parse(doc_text)]
        return data
            
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx][0], self.data[idx][1], self.data[idx][2]

Don't forget to put the correct names for the train, dev, and test files.
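
UD treebank files usually follow the naming pattern `<language>_<treebank>-ud-train.conllu`, and likewise for `dev` and `test`. The sketch below looks the three files up by that pattern; the helper `find_split_files` and the file names in the demo are hypothetical, and some treebanks do not ship all three splits.

```python
import tempfile
from pathlib import Path


def find_split_files(treebank_dir):
    """Return the file name of each split, or None if the split is missing."""
    treebank_dir = Path(treebank_dir)
    found = {}
    for split in ('train', 'dev', 'test'):
        matches = sorted(treebank_dir.glob(f'*-ud-{split}.conllu'))
        found[split] = matches[0].name if matches else None
    return found


# Demo on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ('xx_toy-ud-train.conllu', 'xx_toy-ud-dev.conllu', 'README.md'):
        (Path(tmp_dir) / name).touch()
    splits = find_split_files(tmp_dir)

print(splits)
```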

In [0]:
train_path = DATA_PATH / '...'
dev_path = DATA_PATH / '...'
test_path = DATA_PATH / '...'

In [0]:
train_data = TaggerDataset(train_path, tokenizer)
vocab = train_data.vocab

dev_data = TaggerDataset(dev_path, tokenizer, vocab)
test_data = TaggerDataset(test_path, tokenizer, vocab)

In [0]:
print(f"Token ids (size = {train_data[0][0].size()}): {train_data[0][0]}")
print(f"Attention mask (size = {train_data[0][1].size()}): {train_data[0][1]}")
print(f"Label ids (size = {train_data[0][2].size()}): {train_data[0][2]}")

The last modification is the attention mask, which `preprocess` already stores next to each sequence. This vector tells the model which tokens are meaningful and which ones are used for padding: it holds `1` in the position of meaningful tokens and `0` in the position of paddings. The datasets can now be wrapped in data loaders.

In [0]:
train_loader = DataLoader(train_data, batch_size=batch_size)
validation_loader = DataLoader(dev_data, batch_size=batch_size)
test_loader = DataLoader(test_data, batch_size=batch_size)

## Task 3. Initialize and train the model (1 point)

Initialize the `AutoConfig` for your model. Pass the correct number of labels in the `num_labels` parameter. This will be the output size of the last linear layer of the model.

More information is available here: https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoConfig

_Hint_: You can access your label vocab with `vocab['upos']`.

In [0]:
config = AutoConfig.from_pretrained(...)

Initialize the `AutoModelForTokenClassification` for your model. Don't forget to use the `config`. 

More information is available here: https://huggingface.co/transformers/model_doc/auto.html#automodelfortokenclassification

In [0]:
model = AutoModelForTokenClassification.from_pretrained(...)

model.to(device)

You can see the model structure by running this cell.

In [0]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The {} model has {:} different named parameters.\n'.format(MODEL_NAME, len(params)))

print('==== Embedding Layer ====\n')

emb_params = [p for p in params if 'embeddings' in p[0]]
for p in emb_params:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

first_transformer_params = [p for p in params if '.0.' in p[0]]
for p in first_transformer_params:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

cls_params = [p for p in params if 'classifier' in p[0]]
for p in cls_params:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

Choose an appropriate learning rate and number of epochs. It is recommended to train for 2-4 epochs with a learning rate from `{2e-5, 3e-5, 5e-5}`.
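
The learning rate that you choose is only the starting value: the next cell passes the optimizer to `get_linear_schedule_with_warmup`, which changes the rate after every step. The sketch below shows how such a linear schedule scales the base rate, as I understand it. It is not the library code, and `linear_schedule_factor` is a hypothetical name.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear increase from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decrease from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 4  # a toy value; the notebook uses len(train_loader) * epochs
factors = [linear_schedule_factor(step, 0, total_steps)
           for step in range(total_steps + 1)]
for step, factor in enumerate(factors):
    print(step, base_lr * factor)
```

With `num_warmup_steps = 0`, as in this notebook, the rate starts at the chosen value and decays linearly to zero at the last training step.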

In [0]:
optimizer = AdamW(model.parameters(), lr = ..., eps = 1e-8)

epochs = ...
# Total number of training steps is number of batches * number of epochs.
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)

Pay attention to the modified accuracy.

In [0]:
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    # Trim the paddings
    pred_flat = pred_flat[(labels_flat != LABEL_PAD_ID) & (labels_flat != PAD_ID)]
    labels_flat = labels_flat[(labels_flat != LABEL_PAD_ID) & (labels_flat != PAD_ID)]
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
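
The filtering in `flat_accuracy` can be illustrated on plain lists. This sketch assumes the predictions have already been reduced to one label ID per position; `masked_accuracy` and the toy values are made up for the example.

```python
LABEL_PAD_ID = -100  # label of the padding positions
PAD_ID = 0           # id of the '[PAD]' label given to continuation sub-words


def masked_accuracy(predictions, labels):
    """Accuracy over positions that are neither padding nor continuation sub-words."""
    kept = [(pred, label) for pred, label in zip(predictions, labels)
            if label != LABEL_PAD_ID and label != PAD_ID]
    correct = sum(1 for pred, label in kept if pred == label)
    return correct / len(kept)


labels = [2, 16, 0, 8, 3, -100, -100]
predictions = [2, 16, 7, 4, 3, 9, 9]
accuracy = masked_accuracy(predictions, labels)
print(accuracy)
```

Only four positions are scored here, and three of them are correct. Note that the positions labelled `[CLS]` and `[SEP]` are not filtered out, so they count towards the accuracy as well.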

In [0]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, you can start the training. Make sure your loss is going down and accuracy is going up.

In [0]:
# Taken from this tutorial: https://github.com/aniruddhachoudhury/BERT-Tutorials/blob/master/Blog%202/BERT_Fine_Tuning_Sentence_Classification.ipynb
# The code was modified

random.seed(random_seed)
np.random.seed(random_seed)
torch.manual_seed(random_seed)
torch.cuda.manual_seed_all(random_seed)

loss_values = []

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_loss = 0

    # Put the model into training mode. Don't be misled--the call to
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_loader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed_mins, elapsed_secs = epoch_time(t0, time.time())
            
            # Report progress.
            print(f'  Batch {step:>5,}  of  {len(train_loader):>5,}.    Elapsed: {elapsed_mins:}m {elapsed_secs:}s.')

        # Unpack this training batch from our dataloader. 
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # This will return the loss (rather than the model output) because we
        # have provided the `labels`.
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/model_doc/bert.html#bertfortokenclassification
        outputs = model(b_input_ids, 
                    attention_mask=b_input_mask,
                    labels=b_labels)
        
        # The call to `model` always returns a tuple, so we need to pull the 
        # loss value out of the tuple.
        loss = outputs[0]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_loader)            
    
    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)

    print("")
    print(f"  Average training loss: {avg_train_loss:.2f}")
    print("  Training epoch took: {:}m {:}s".format(*epoch_time(t0, time.time())))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in validation_loader:
        
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # This will return the logits rather than the loss because we have
            # not provided labels.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/model_doc/bert.html#bertfortokenclassification
            outputs = model(b_input_ids, 
                            attention_mask=b_input_mask)
        
        # Get the "logits" output by the model. The "logits" are the output
        # values prior to applying an activation function like the softmax.
        logits = outputs[0]

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Calculate the accuracy for this batch of validation sentences.
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        nb_eval_steps += 1

    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format((eval_accuracy/nb_eval_steps) * 100))
    print("  Validation took: {:}m {:}s".format(*epoch_time(t0, time.time())))

print("")
print("Training complete!")

See how the model performs on the test set. I managed to get `97.52` for `UD_English-EWT` and `97.55` for `UD_Estonian-EDT`.

In [0]:
print("")
print("Running Testing...")

t0 = time.time()

# Put the model in evaluation mode--the dropout layers behave differently
# during evaluation.
model.eval()

# Tracking variables 
test_loss, test_accuracy = 0, 0
nb_test_steps, nb_test_examples = 0, 0

# Evaluate data for one epoch
for batch in test_loader:
    
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    
    # Telling the model not to compute or store gradients, saving memory and
    # speeding up testing
    with torch.no_grad():

        # Forward pass, calculate logit predictions.
        # This will return the logits rather than the loss because we have
        # not provided labels.
        # The documentation for this `model` function is here:
        # https://huggingface.co/transformers/model_doc/bert.html#bertfortokenclassification
        outputs = model(b_input_ids, 
                        attention_mask=b_input_mask)
    
    # Get the "logits" output by the model. The "logits" are the output
    # values prior to applying an activation function like the softmax.
    logits = outputs[0]

    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    
    # Calculate the accuracy for this batch of test sentences.
    tmp_test_accuracy = flat_accuracy(logits, label_ids)
    
    # Accumulate the total accuracy.
    test_accuracy += tmp_test_accuracy

    # Track the number of batches
    nb_test_steps += 1

# Report the final accuracy for this test run.
print("  Accuracy: {0:.2f}".format((test_accuracy/nb_test_steps) * 100))
print("  Testing took: {:}m {:}s".format(*epoch_time(t0, time.time())))

## Task 4. Save the model (0 points)

Study how to save and load your model to use it later.

In [0]:
import os

# Saving best practices: if you use the default names for the model, you can reload it using from_pretrained()

output_dir = './model_save/'

# Create output directory if needed
os.makedirs(output_dir, exist_ok=True)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Good practice: save your training arguments together with the trained model
# torch.save(args, os.path.join(output_dir, 'training_args.bin'))

In [0]:
# Load the fine-tuned model and the tokenizer that you have saved
model = AutoModelForTokenClassification.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

# Copy the model to the GPU.
model.to(device)