# Assignment 2

Assignment 2 is an extension on lab 5, where we use neural sequence models for NER. Assignment 2 asks you to put a linear chain CRF on top.

Questions will be stated at the end of this notebook, and there will also be a follow up oral assessment.

A.I. usage is allowed but must be **throughly documented**. Do not write down any parts of A.I. solution if you do not understand them. You will **lose all marks for that part** if you do so. You will be regarded as **plagirism** if you do not understand your solution and you do not document A.I. usage.

Disscussion with classmates are allowed and encouraged.

Submit this notebook **with all the running output saved**.
Name the file p02s\<id\>.ipynb

# Deadline
**Thursday 7-Nov-2024 23:55 p.m.**


## **Fine-tuning BERT for named-entity recognition**

In the lab part (adapted from [Transformers-Tutorials](https://github.com/NielsRogge/Transformers-Tutorials)) of this notebook, we are going to use **BertForTokenClassification** which is included in the [Transformers library](https://github.com/huggingface/transformers) by HuggingFace. This model has BERT as its base architecture, with a token classification head on top, allowing it to make predictions at the token level, rather than the sequence level. Named entity recognition is typically treated as a token classification problem, so that's what we are going to use it for.

This tutorial uses the idea of **transfer learning**, i.e. first pretraining a large neural network in an unsupervised way, and then fine-tuning that neural network on a task of interest. In this case, BERT is a neural network pretrained on 2 tasks: masked language modeling and next sentence prediction. Now, we are going to fine-tune this network on a NER dataset. Fine-tuning is supervised learning, so this means we will need a labeled dataset.

If you want to know more about BERT, I suggest the following resources:
* the original [paper](https://arxiv.org/abs/1810.04805)
* Jay Allamar's [blog post](http://jalammar.github.io/illustrated-bert/) as well as his [tutorial](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)
* Chris Mccormick's [Youtube channel](https://www.youtube.com/channel/UCoRX98PLOsaN8PtekB9kWrw)
* Abbishek Kumar Mishra's [Youtube channel](https://www.youtube.com/user/abhisheksvnit)

The following notebook largely follows the same structure as the tutorials by Abhishek Kumar Mishra. For his tutorials on the Transformers library, see his [Github repository](https://github.com/abhimishra91/transformers-tutorials).

NOTE: this notebook assumes basic knowledge about deep learning, BERT, and native PyTorch. If you want to learn more Python, deep learning and PyTorch, I highly recommend cs231n by Stanford University and the FastAI course by Jeremy Howard et al. Both are freely available on the web.  

Now, let's move on to the real stuff!

#### **Importing Python Libraries and preparing the environment**

This notebook assumes that you have the following libraries installed:
* pandas
* numpy
* sklearn
* pytorch
* transformers
* seqeval

As we are running this in Google Colab, the only libraries we need to additionally install are transformers and seqeval (GPU version):

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast, BertForTokenClassification
from seqeval.metrics import classification_report

As deep learning can be accellerated a lot using a GPU instead of a CPU, make sure you can run this notebook in a GPU runtime (which Google Colab provides for free! - check "Runtime" - "Change runtime type" - and set the hardware accelerator to "GPU").

We can set the default device to GPU using the following code (if it prints "cuda", it means the GPU has been recognized):

In [2]:
from torch import cuda

device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

cuda


#### **Downloading and preprocessing the data**
Named entity recognition (NER) uses a specific annotation scheme, which is defined (at least for European languages) at the *word* level. An annotation scheme that is widely used is called **[IOB-tagging](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)**, which stands for Inside-Outside-Beginning. Each tag indicates whether the corresponding word is *inside*, *outside* or at the *beginning* of a specific named entity. The reason this is used is because named entities usually comprise more than 1 word.

Let's have a look at an example. If you have a sentence like "Barack Obama was born in Hawaï", then the corresponding tags would be   [B-PERS, I-PERS, O, O, O, B-GEO]. B-PERS means that the word "Barack" is the beginning of a person, I-PERS means that the word "Obama" is inside a person, "O" means that the word "was" is outside a named entity, and so on. So one typically has as many tags as there are words in a sentence.

So if you want to train a deep learning model for NER, it requires that you have your data in this IOB format (or similar formats such as [BILOU](https://stackoverflow.com/questions/17116446/what-do-the-bilou-tags-mean-in-named-entity-recognition)). There exist many annotation tools which let you create these kind of annotations automatically (such as Spacy's [Prodigy](https://prodi.gy/), [Tagtog](https://docs.tagtog.net/) or [Doccano](https://github.com/doccano/doccano)). You can also use Spacy's [biluo_tags_from_offsets](https://spacy.io/api/goldparse#biluo_tags_from_offsets) function to convert annotations at the character level to IOB format.

Here, we will use a NER dataset from [Kaggle](https://www.kaggle.com/namanj27/ner-dataset) that is already in IOB format. One has to go to this web page, download the dataset, unzip it, and upload the csv file to this notebook. Let's print out the first few rows of this csv file:

In [3]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download('namanj27/ner-dataset')  # 36 POSs, 17 Tags
print(f'Path to dataset files: {path}')

Path to dataset files: C:\Users\Kevin\.cache\kagglehub\datasets\namanj27\ner-dataset\versions\2


In [4]:
data = pd.read_csv(path + '/ner_datasetreference.csv', encoding='unicode_escape')  # fix problem of sentence: 47592
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


Let's check how many sentences and words (and corresponding tags) there are in this dataset:

In [5]:
data.count()

Sentence #      47959
Word          1048565
POS           1048575
Tag           1048575
dtype: int64

As we can see, there are approximately 48,000 sentences in the dataset, comprising more than 1 million words and tags (quite huge!). This corresponds to approximately 20 words per sentence.

Let's have a look at the different NER tags, and their frequency:

In [6]:
print(f'Number of tags: {len(data.Tag.unique())}')
frequencies = data.Tag.value_counts()
frequencies

Number of tags: 17


Tag
O        887908
B-geo     37644
B-tim     20333
B-org     20143
I-per     17251
B-per     16990
I-org     16784
B-gpe     15870
I-geo      7414
I-tim      6528
B-art       402
B-eve       308
I-art       297
I-eve       253
B-nat       201
I-gpe       198
I-nat        51
Name: count, dtype: int64

There are 8 category tags, each with a "beginning" and "inside" variant, and the "outside" tag. It is not really clear what these tags mean - "geo" probably stands for geographical entity, "gpe" for geopolitical entity, and so on. They do not seem to correspond with what the publisher says on Kaggle. Some tags seem to be underrepresented. Let's print them by frequency (highest to lowest):

In [7]:
tags = {}
for tag, count in zip(frequencies.index, frequencies):
    if tag != 'O':
        if tag[2:5] not in tags.keys():
            tags[tag[2:5]] = count
        else:
            tags[tag[2:5]] += count
    continue

print(sorted(tags.items(), key=lambda x: x[1], reverse=True))

[('geo', 45058), ('org', 36927), ('per', 34241), ('tim', 26861), ('gpe', 16068), ('art', 699), ('eve', 561), ('nat', 252)]


Let's remove "art", "eve" and "nat" named entities, as performance on them will probably be not comparable to the other named entities.

In [8]:
entities_to_remove = ['B-art', 'I-art', 'B-eve', 'I-eve', 'B-nat', 'I-nat']
data = data[~data.Tag.isin(entities_to_remove)]
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


We create 2 dictionaries: one that maps individual tags to indices, and one that maps indices to their individual tags. This is necessary in order to create the labels (as computers work with numbers = indices, rather than words = tags) - see further in this notebook.

In [9]:
labels_to_ids = {k: v for v, k in enumerate(data.Tag.unique())}
ids_to_labels = {v: k for v, k in enumerate(data.Tag.unique())}
labels_to_ids

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-per': 3,
 'I-geo': 4,
 'B-org': 5,
 'I-org': 6,
 'B-tim': 7,
 'I-per': 8,
 'I-gpe': 9,
 'I-tim': 10}

In [10]:
ids_to_labels

{0: 'O',
 1: 'B-geo',
 2: 'B-gpe',
 3: 'B-per',
 4: 'I-geo',
 5: 'B-org',
 6: 'I-org',
 7: 'B-tim',
 8: 'I-per',
 9: 'I-gpe',
 10: 'I-tim'}

As we can see, there are now only 10 different NER tags.

Now, we have to ask ourself the question: what is a training example in the case of NER, which is provided in a single forward pass? A training example is typically a **sentence**, with corresponding IOB tags. Let's group the words and corresponding tags by sentence:

In [11]:
# pandas has a very handy "forward fill" function to fill missing values based on the last upper non-nan value
data = data.ffill()
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O


In [12]:
# Let's create a new column called "sentence" which groups the words by sentence
data['sentence'] = data[['Sentence #', 'Word', 'Tag']].groupby(['Sentence #'])['Word'].transform(lambda x: ' '.join(x))
# Let's also create a new column called "word_labels" which groups the tags by sentence
data['word_labels'] = data[['Sentence #', 'Word', 'Tag']].groupby(['Sentence #'])['Tag'].transform(lambda x: ','.join(x))
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag,sentence,word_labels
0,Sentence: 1,Thousands,NNS,O,Thousands of demonstrators have marched throug...,"O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
1,Sentence: 1,of,IN,O,Thousands of demonstrators have marched throug...,"O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
2,Sentence: 1,demonstrators,NNS,O,Thousands of demonstrators have marched throug...,"O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
3,Sentence: 1,have,VBP,O,Thousands of demonstrators have marched throug...,"O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
4,Sentence: 1,marched,VBN,O,Thousands of demonstrators have marched throug...,"O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."


In [13]:
# Let's only keep the "sentence" and "word_labels" columns, and drop duplicates:
data = data[['sentence', 'word_labels']].drop_duplicates().reset_index(drop=True)
data.head()

Unnamed: 0,sentence,word_labels
0,Thousands of demonstrators have marched throug...,"O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
1,Families of soldiers killed in the conflict jo...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,B-per,O,O,..."
2,They marched from the Houses of Parliament to ...,"O,O,O,O,O,O,O,O,O,O,O,B-geo,I-geo,O"
3,"Police put the number of marchers at 10,000 wh...","O,O,O,O,O,O,O,O,O,O,O,O,O,O,O"
4,The protest comes on the eve of the annual con...,"O,O,O,O,O,O,O,O,O,O,O,B-geo,O,O,B-org,I-org,O,..."


In [14]:
len(data)

47571

Let's verify that a random sentence and its corresponding tags are correct:

In [15]:
data.iloc[41].sentence

'Bedfordshire police said Tuesday that Omar Khayam was arrested in Bedford for breaching the conditions of his parole .'

In [16]:
data.iloc[41].word_labels

'B-gpe,O,O,B-tim,O,B-per,I-per,O,O,O,B-geo,O,O,O,O,O,O,O,O'

#### **Preparing the dataset and dataloader**

Now that our data is preprocessed, we can turn it into PyTorch tensors such that we can provide it to the model. Let's start by defining some key variables that will be used later on in the training/evaluation process:

In [17]:
MAX_LEN = 128  # the maximum number of tokens after a word segmentation
TRAIN_BATCH_SIZE = 32  # When the size is greater than 32, the crf model becomes abnormal
VALID_BATCH_SIZE = 32  # When the size is greater than 32, the crf model becomes abnormal
EPOCHS = 1  # number of complete iterations
LEARNING_RATE = 1e-5
MAX_GRAD_NORM = 10  # gradient clipping
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')  # 12-layer transformer encoder, 110M parameters, case insensitive.

In [18]:
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

A tricky part of NER with BERT is that BERT relies on **wordpiece tokenization**, rather than word tokenization. This means that we should also define the labels at the wordpiece-level, rather than the word-level!

For example, if you have word like "Washington" which is labeled as "b-gpe", but it gets tokenized to "Wash", "##ing", "##ton", then one approach could be to handle this by only train the model on the tag labels for the first word piece token of a word (i.e. only label "Wash" with "b-gpe"). This is what was done in the original BERT paper, see Github discussion [here](https://github.com/huggingface/transformers/issues/64#issuecomment-443703063).

Note that this is a **design decision**. You could also decide to propagate the original label of the word to all of its word pieces and let the model train on this. In that case, the model should be able to produce the correct labels for each individual wordpiece. This was done in [this NER tutorial with BERT](https://github.com/chambliss/Multilingual_NER/blob/master/python/utils/main_utils.py#L118). Another design decision could be to give the first wordpiece of each word the original word label, and then use the label “X” for all subsequent subwords of that word. All of them seem to lead to good performance.

Below, we define a regular PyTorch [dataset class](https://pytorch.org/docs/stable/data.html) (which transforms examples of a dataframe to PyTorch tensors). Here, each sentence gets tokenized, the special tokens that BERT expects are added, the tokens are padded or truncated based on the max length of the model, the attention mask is created and the labels are created based on the dictionary which we defined above. Word pieces that should be ignored have a label of -100 (which is the default `ignore_index` of PyTorch's [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)).

For more information about BERT's inputs, see [here](https://huggingface.co/transformers/glossary.html).








In [19]:
class BERTDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len) -> None:
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index) -> dict:
        # step 1: get the sentence and word labels
        sentence = self.data.sentence[index].strip().split()
        word_labels = self.data.word_labels[index].split(',')

        # step 2: use tokenizer to encode sentence (includes padding/truncation up to max length)
        # BertTokenizerFast provides a handy "return_offsets_mapping" functionality for individual tokens
        encoding = self.tokenizer(sentence,  # the input object to be encoded
                                  is_split_into_words=True,  # the sentence has been separated by word
                                  return_offsets_mapping=True,  # the position of each token is offset in the original sentence
                                  padding='max_length',  # fills the sentence to the specified maximum length, using [PAD]
                                  truncation=True,  # if the sentence length exceeds max_length, the excess is truncated
                                  max_length=self.max_len,  # sets the maximum length of the sentence
                                  )

        # step 3: create token labels only for first word pieces of each tokenized word
        labels = [labels_to_ids[label] for label in word_labels]
        # code based on https://huggingface.co/transformers/custom_datasets.html#tok-ner
        # create an empty array of -100 of length max_length
        # pytorch loss ignores label with -100,
        # see https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
        encoded_labels = np.ones(len(encoding['offset_mapping']), dtype=int) * -100

        # set only labels whose first offset position is 0 and the second is not 0
        i = 0
        for idx, mapping in enumerate(encoding['offset_mapping']):
            if mapping[0] == 0 and mapping[1] != 0:
                # overwrite label
                encoded_labels[idx] = labels[i]
                i += 1

        # step 4: turn everything into PyTorch tensors
        item = {key: torch.as_tensor(val) for key, val in encoding.items()}
        item['labels'] = torch.as_tensor(encoded_labels)

        return item

    def __len__(self) -> int:
        return self.len

Now, based on the class we defined above, we can create 2 datasets, one for training and one for testing. Let's use a 80/20 split:

In [20]:
train_size = 0.8
train_dataset = data.sample(frac=train_size, random_state=200)  # Random seed: 200
test_dataset = data.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print(f'FULL Dataset: {data.shape}')
print(f'TRAIN Dataset: {train_dataset.shape}')
print(f'TEST Dataset: {test_dataset.shape}')

training_set = BERTDataset(train_dataset, tokenizer, MAX_LEN)
testing_set = BERTDataset(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (47571, 2)
TRAIN Dataset: (38057, 2)
TEST Dataset: (9514, 2)


Let's have a look at the first training example:

In [21]:
print(training_set[0])
print(tokenizer.convert_ids_to_tokens(training_set[0]['input_ids']))

{'input_ids': tensor([  101, 23564, 21030,  2099,  4967,  2001,  9388,  1011,  6109,  2005,
         2634,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0, 

Let's verify that the input ids and corresponding targets are correct:

In [22]:
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]['input_ids']), training_set[0]['labels']):
    print(f'{token:10}: {label}')

[CLS]     : -100
za        : 3
##hee     : -100
##r       : -100
khan      : 8
was       : 0
mar       : 0
-         : -100
93        : -100
for       : 0
india     : 1
.         : 0
[SEP]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[PAD]     : -100
[

Now, let's define the corresponding PyTorch dataloaders:

In [23]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0,  # no additional child processes are used
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
               'shuffle': True,
               'num_workers': 0,  # no additional child processes are used
               }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

#### **Defining the model**

Here we define the model, BertForTokenClassification, and load it with the pretrained weights of "bert-base-uncased". The only thing we need to additionally specify is the number of labels (as this will determine the architecture of the classification head).

Note that only the base layers are initialized with the pretrained weights. The token classification head of top has just randomly initialized weights, which we will train, together with the pretrained weights, using our labelled dataset. This is also printed as a warning when you run the code cell below.

Then, we move the model to the GPU.

In [24]:
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=len(labels_to_ids))
model.to(device)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

#### **Training the model**

Before training the model, let's perform a sanity check, which I learned thanks to Andrej Karpathy's wonderful [cs231n course](http://cs231n.stanford.edu/) at Stanford (see also his [blog post about debugging neural networks](http://karpathy.github.io/2019/04/25/recipe/)). The initial loss of your model should be close to -ln(1/number of classes) = -ln(1/17) = 2.83.

Why? Because we are using cross entropy loss. The cross entropy loss is defined as -ln(probability score of the model for the correct class). In the beginning, the weights are random, so the probability distribution for all of the classes for a given token will be uniform, meaning that the probability for the correct class will be near 1/17. The loss for a given token will thus be -ln(1/17). As PyTorch's [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) (which is used by `BertForTokenClassification`) uses *mean reduction* by default, it will compute the mean loss for each of the tokens in the sequence for which a label is provided.

Let's verify this:



In [25]:
inputs = training_set[4]
input_ids = inputs['input_ids'].unsqueeze(0)
attention_mask = inputs['attention_mask'].unsqueeze(0)
labels = inputs['labels'].unsqueeze(0)
input_ids = input_ids.to(device).long()
attention_mask = attention_mask.to(device).long()
labels = labels.to(device).long()

outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
initial_loss = outputs[0]
initial_loss

  attn_output = torch.nn.functional.scaled_dot_product_attention(


tensor(2.5468, device='cuda:0', grad_fn=<NllLossBackward0>)

This looks good. Let's also verify that the logits of the neural network have a shape of (batch_size, sequence_length, num_labels):

In [26]:
tr_logits = outputs[1]
tr_logits.shape

torch.Size([1, 128, 11])

Next, we define the optimizer. Here, we are just going to use Adam with a default learning rate. One can also decide to use more advanced ones such as AdamW (Adam with weight decay fix), which is [included](https://huggingface.co/transformers/main_classes/optimizer_schedules.html) in the Transformers repository, and a learning rate scheduler, but we are not going to do that here.

In [27]:
outputs

TokenClassifierOutput(loss=tensor(2.5468, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[[-0.4162, -0.0554, -0.1300,  ..., -0.1524,  0.0176, -0.1124],
         [-0.5697, -0.1318, -0.2581,  ..., -0.1561, -0.0535, -0.2322],
         [-0.6135, -0.0626,  0.0281,  ..., -0.1814, -0.0653, -0.1241],
         ...,
         [-0.6126, -0.2419,  0.2971,  ..., -0.0345, -0.0652, -0.1496],
         [-0.6661, -0.6034,  0.3120,  ..., -0.0929, -0.4435, -0.1627],
         [-0.4958, -0.2259,  0.2611,  ..., -0.0214,  0.0119, -0.1632]]],
       device='cuda:0', grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [28]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

In [29]:
# Defining the training function on the 80% of the dataset for tuning the bert model
def train(training_loader) -> None:
    tr_loss, tr_accuracy = 0, 0  # training loss and training accuracy
    nb_tr_examples, nb_tr_steps = 0, 0  # the number of training samples and training steps
    tr_preds, tr_labels = [], []  # stores prediction labels and real labels
    # put model in training mode
    model.train()

    for idx, batch in enumerate(training_loader):
        
        ids = batch['input_ids'].to(device, dtype=torch.long)
        mask = batch['attention_mask'].to(device, dtype=torch.long)
        labels = batch['labels'].to(device, dtype=torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, labels=labels)  # forward propagation
        loss = outputs[0]  # extract loss value
        tr_logits = outputs[1]  # extract logits
        tr_loss += loss.item()  # add up the loss of the current batch

        nb_tr_steps += 1  # update the training step count
        nb_tr_examples += labels.size(0)  # Update the number of training samples

        if idx % 10 == 0:  # print training losses every 10 steps
            loss_step = tr_loss / nb_tr_steps
            print(f'Training loss per 10 training steps: {loss_step}')

        # compute training accuracy
        flattened_targets = labels.view(-1)  # shape (batch_size * seq_len)
        active_logits = tr_logits.view(-1, model.num_labels)  # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, dim=1)  # shape (batch_size * seq_len)

        # only compute accuracy at active labels
        active_accuracy = labels.view(-1) != -100  # shape (batch_size, seq_len)
        #active_labels = torch.where(active_accuracy, labels.view(-1), torch.tensor(-100).type_as(labels))

        labels = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)

        tr_labels.extend(labels)
        tr_preds.extend(predictions)

        tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy

        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=MAX_GRAD_NORM)

        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f'Training loss epoch: {epoch_loss}')
    print(f'Training accuracy epoch: {tr_accuracy}')

And let's train the model!

In [30]:
for epoch in range(EPOCHS):
    print(f'Training epoch: {epoch + 1}')
    train(training_loader)

Training epoch: 1
Training loss per 10 training steps: 2.699167490005493
Training loss per 10 training steps: 2.124142061580311
Training loss per 10 training steps: 1.6388195838247026
Training loss per 10 training steps: 1.3549579285806226
Training loss per 10 training steps: 1.1922385939737645
Training loss per 10 training steps: 1.0868194921343934
Training loss per 10 training steps: 1.0010376259928844
Training loss per 10 training steps: 0.9323209524154663
Training loss per 10 training steps: 0.8728653770170094
Training loss per 10 training steps: 0.8224589978600596
Training loss per 10 training steps: 0.7781606331320092
Training loss per 10 training steps: 0.7376847025510427
Training loss per 10 training steps: 0.7041786741619268
Training loss per 10 training steps: 0.6770353337735621
Training loss per 10 training steps: 0.6488492777372928
Training loss per 10 training steps: 0.6233522370951066
Training loss per 10 training steps: 0.6025709700510369
Training loss per 10 training st

#### **Evaluating the model**

Now that we've trained our model, we can evaluate its performance on the held-out test set (which is 20% of the data). Note that here, no gradient updates are performed, the model just outputs its logits.

In [31]:
def valid(model, testing_loader) -> tuple[list, list]:
    # put model in evaluation mode
    model.eval()

    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []

    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):

            ids = batch['input_ids'].to(device, dtype=torch.long)
            mask = batch['attention_mask'].to(device, dtype=torch.long)
            labels = batch['labels'].to(device, dtype=torch.long)

            outpus = model(input_ids=ids, attention_mask=mask, labels=labels)
            loss = outpus[0]
            eval_logits = outpus[1]
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += labels.size(0)

            if idx % 10 == 0:
                loss_step = eval_loss / nb_eval_steps
                print(f'Validation loss per 10 evaluation steps: {loss_step}')

            # compute evaluation accuracy
            flattened_targets = labels.view(-1)  # shape (batch_size * seq_len)
            active_logits = eval_logits.view(-1, model.num_labels)  # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, dim=1)  # shape (batch_size * seq_len)

            # only compute accuracy at active labels
            active_accuracy = labels.view(-1) != -100  # shape (batch_size, seq_len)

            labels = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)

            eval_labels.append(labels)
            eval_preds.append(predictions)

            tmp_eval_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy

    labels = [[ids_to_labels[id.item()] for id in labels] for labels in eval_labels]
    predictions = [[ids_to_labels[id.item()] for id in preds] for preds in eval_preds]

    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f'Validation Loss: {eval_loss}')
    print(f'Validation Accuracy: {eval_accuracy}')

    return labels, predictions

As we can see below, performance is quite good! Accuracy on the test test is > 93%.

In [32]:
labels, predictions = valid(model, testing_loader)

Validation loss per 10 evaluation steps: 0.20404550433158875
Validation loss per 10 evaluation steps: 0.1423103314909068
Validation loss per 10 evaluation steps: 0.1301786530585516
Validation loss per 10 evaluation steps: 0.12989722240355708
Validation loss per 10 evaluation steps: 0.1277327075964067
Validation loss per 10 evaluation steps: 0.12303807564518031
Validation loss per 10 evaluation steps: 0.12149319119873594
Validation loss per 10 evaluation steps: 0.12097730870607873
Validation loss per 10 evaluation steps: 0.1167062861315998
Validation loss per 10 evaluation steps: 0.11712176418238944
Validation loss per 10 evaluation steps: 0.1176307064176786
Validation loss per 10 evaluation steps: 0.11798283222827825
Validation loss per 10 evaluation steps: 0.11855340431913856
Validation loss per 10 evaluation steps: 0.11742919276802595
Validation loss per 10 evaluation steps: 0.11661913224779968
Validation loss per 10 evaluation steps: 0.11607683854584662
Validation loss per 10 evalua

However, the accuracy metric is misleading, as a lot of labels are "outside" (O), even after omitting predictions on the [PAD] tokens. What is important is looking at the precision, recall and f1-score of the individual tags. For this, we use the seqeval Python library:

In [33]:
print(classification_report(labels, predictions))

              precision    recall  f1-score   support

         geo       0.79      0.90      0.85      7378
         gpe       0.93      0.92      0.92      3021
         org       0.64      0.54      0.59      3964
         per       0.74      0.78      0.76      3367
         tim       0.87      0.84      0.85      4070

   micro avg       0.79      0.81      0.80     21800
   macro avg       0.80      0.80      0.79     21800
weighted avg       0.79      0.81      0.80     21800



Performance already seems quite good, but note that we've only trained for 1 epoch. An optimal approach would be to perform evaluation on a validation set while training to improve generalization.

#### **Inference**

The fun part is when we can quickly test the model on new, unseen sentences.
Here, we use the prediction of the **first word piece of every word** (which is how the model was trained).

*In other words, the code below does not take into account when predictions of different word pieces that belong to the same word do not match.*

In [34]:
sentence = '@HuggingFace is a company based in New York, but is also has employees working in Paris'

inputs = tokenizer(sentence.split(),
                   is_split_into_words=True,
                   return_offsets_mapping=True,
                   padding='max_length',
                   truncation=True,
                   max_length=MAX_LEN,
                   return_tensors='pt')

# move to gpu
ids = inputs['input_ids'].to(device)
mask = inputs['attention_mask'].to(device)
# forward pass
outputs = model(ids, attention_mask=mask)
logits = outputs[0]

active_logits = logits.view(-1, model.num_labels)  # shape (batch_size * seq_len, num_labels)
flattened_predictions = torch.argmax(active_logits, dim=1)  # shape (batch_size * seq_len) - predictions at the token level

tokens = tokenizer.convert_ids_to_tokens(ids.squeeze().tolist())
token_predictions = [ids_to_labels[i] for i in flattened_predictions.cpu().numpy()]
wp_preds = list(zip(tokens, token_predictions))  # list of tuples. Each tuple = (wordpiece, prediction)

print(tokens)
print(token_predictions)
prediction = []
for token_pred, mapping in zip(wp_preds, inputs['offset_mapping'].squeeze().tolist()):
    #only predictions on first word pieces are important
    if mapping[0] == 0 and mapping[1] != 0:
        prediction.append(token_pred[1])
    else:
        continue

print(sentence.split())
print(prediction)

['[CLS]', '@', 'hugging', '##face', 'is', 'a', 'company', 'based', 'in', 'new', 'york', ',', 'but', 'is', 'also', 'has', 'employees', 'working', 'in', 'paris', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[P

#### **Saving the model for future use**

Finally, let's save the vocabulary (.txt) file, model weights (.bin) and the model's configuration (.json) to a directory, so that both the tokenizer and model can be re-loaded using the `from_pretrained()` class method.


In [35]:
import os

directory = os.path.join(os.getcwd(), 'model')

if not os.path.exists(directory):
    os.makedirs(directory)

# save vocabulary of the tokenizer
tokenizer.save_vocabulary(directory)
# save the model weights and its configuration file
model.save_pretrained(directory)
print(f'All files saved, path: {directory}')
print('This lab part is completed')

All files saved, path: E:\Project\PyCharmProject\NaturalLanguageProcessing\model
This lab part is completed


## Assignment 2

We have 4 questions with 100 Marks in total


1.   full labels for wordpieces (30 Marks)
2.   retrain with new labels, report result and BIO violations (20 Marks)
3.   add transition violation scores and report result (20 Marks)
4.   train transition scores and report result (30 Marks)



#Q1

1.   Recall that there is a design decision regarding how to convert labels at token level.

You need to propagate the original label of the word to all of its word pieces and let the model train on this. For beginning tags, we should have the first worldpiece having B tag, and the remaining having I tag.

For example, if you have word like "Washington" which is labeled as "b-gpe", but it gets tokenized to "Wash", "##ing", "##ton", then you should have "b-gpe", "i-gpe", "i-gpe"

Implement this version of label conversion by creating a new dataset class.

30 Marks

In [36]:
class NewDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len) -> None:
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index) -> dict:
        # step 1: get the sentence and word labels
        sentence = self.data.sentence[index].strip().split()
        word_labels = self.data.word_labels[index].split(',')

        # step 2: use tokenizer to encode sentence (includes padding/truncation up to max length)
        # BertTokenizerFast provides a handy "return_offsets_mapping" functionality for individual tokens
        encoding = self.tokenizer(sentence,
                                  is_split_into_words=True,
                                  return_offsets_mapping=True,
                                  padding='max_length',
                                  truncation=True,
                                  max_length=self.max_len)

        # step 3: create token labels for all word pieces of each tokenized word
        # If the label is B tag, only the first wordpiece label is B, others are I
        labels = [labels_to_ids[label] for label in word_labels]

        # code based on https://huggingface.co/transformers/custom_datasets.html#tok-ner
        # create an empty array of 0 of length max_length
        # the reason is that torch-crf (getting into it later) cannot take tags outside of number of labels
        # it can accept mask, so we will be fine
        encoded_labels = np.ones(len(encoding['offset_mapping']), dtype=int) * labels_to_ids['O']

        # set labels according to offset_mapping
        i = 0
        pre_mapping = []
        for idx, mapping in enumerate(encoding['offset_mapping']):
            if mapping[0] == 0 and mapping[1] != 0:
                # overwrite label
                encoded_labels[idx] = labels[i]
                pre_mapping = mapping
                i += 1
            elif idx > 0 and pre_mapping[1] == mapping[0]:
                encoded_labels[idx] = labels_to_ids[ids_to_labels[labels[i - 1]].replace('B', 'I')]
                pre_mapping = mapping

        # step 4: turn everything into PyTorch tensors
        item = {key: torch.as_tensor(val) for key, val in encoding.items()}
        item['labels'] = torch.as_tensor(encoded_labels)

        return item

    def __len__(self) -> int:
        return self.len

In [37]:
# Testing Code Here
train_size = 0.8
train_dataset = data.sample(frac=train_size,random_state=200)
test_dataset = data.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print(f'FULL Dataset: {data.shape}')
print(f'TRAIN Dataset: {train_dataset.shape}')
print(f'TEST Dataset: {test_dataset.shape}')

training_set = NewDataset(train_dataset, tokenizer, MAX_LEN)
testing_set = NewDataset(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (47571, 2)
TRAIN Dataset: (38057, 2)
TEST Dataset: (9514, 2)


In [38]:
print(training_set[0])
print(tokenizer.convert_ids_to_tokens(training_set[0]['input_ids']))

{'input_ids': tensor([  101, 23564, 21030,  2099,  4967,  2001,  9388,  1011,  6109,  2005,
         2634,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0, 

In [39]:
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]['input_ids']), training_set[0]['labels']):
    print(f'{token:10}  {label}')

[CLS]       0
za          3
##hee       8
##r         8
khan        8
was         0
mar         0
-           0
93          0
for         0
india       1
.           0
[SEP]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD]       0
[PAD] 

In [40]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0,  # no additional child processes are used
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
               'shuffle': True,
               'num_workers': 0,  # no additional child processes are used
               }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

#Q2


1.   Train the model on the new labels, report the testing set performance with classification_report (10 Marks).   
2.   Gather the statistics of BIO rule violations. $\dfrac{\mathrm{\#Violations}}{\mathrm{\#Predicted Labels}}$ (10 Marks).

A violation happens when "I-tag" is not preceeded by "I-tag" or "B-tag".

20 Marks in total for Q2



In [41]:
# Active Accuracy can no longer be based on label != -100, we use attention_mask
# No need to fix anything here.
def new_valid(model, testing_loader) -> tuple[list, list]:
    # put model in evaluation mode
    model.eval()

    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []

    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):

            ids = batch['input_ids'].to(device, dtype=torch.long)
            mask = batch['attention_mask'].to(device, dtype=torch.long)
            labels = batch['labels'].to(device, dtype=torch.long)

            outpus = model(input_ids=ids, attention_mask=mask, labels=labels)
            loss = outpus[0]
            eval_logits = outpus[1]
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += labels.size(0)

            if idx % 10 == 0:
                loss_step = eval_loss / nb_eval_steps
                print(f'Validation loss per 10 evaluation steps: {loss_step}')

            # compute evaluation accuracy
            flattened_targets = labels.view(-1)  # shape (batch_size * seq_len)
            active_logits = eval_logits.view(-1, model.num_labels)  # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, dim=1)  # shape (batch_size * seq_len)

            # only compute accuracy at active labels
            active_accuracy = mask.view(-1) != 0  # shape (batch_size, seq_len)

            labels = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)

            eval_labels.append(labels)
            eval_preds.append(predictions)

            tmp_eval_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy

    labels = [[ids_to_labels[id.item()] for id  in labels] for labels in eval_labels]
    predictions = [[ids_to_labels[id.item()] for id in preds] for preds in  eval_preds]

    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f'Validation Loss: {eval_loss}')
    print(f'Validation Accuracy: {eval_accuracy}')

    return labels, predictions

In [42]:
# Retrain and produce classification report
def new_train(training_loader) -> None:
    tr_loss, tr_accuracy = 0, 0  # training loss and training accuracy
    nb_tr_examples, nb_tr_steps = 0, 0  # the number of training samples and training steps
    tr_preds, tr_labels = [], []  # stores prediction labels and real labels
    # put model in training mode
    model.train()

    for idx, batch in enumerate(training_loader):
        
        ids = batch['input_ids'].to(device, dtype=torch.long)
        mask = batch['attention_mask'].to(device, dtype=torch.long)
        labels = batch['labels'].to(device, dtype=torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, labels=labels)  # forward propagation
        loss = outputs[0]  # extract loss value
        tr_logits = outputs[1]  # extract logits
        tr_loss += loss.item()  # add up the loss of the current batch

        nb_tr_steps += 1  # update the training step count
        nb_tr_examples += labels.size(0)  # Update the number of training samples

        if idx % 10 == 0:  # print training losses every 10 steps
            loss_step = tr_loss / nb_tr_steps
            print(f'Training loss per 10 training steps: {loss_step}')

        # compute training accuracy
        flattened_targets = labels.view(-1)  # shape (batch_size * seq_len)
        active_logits = tr_logits.view(-1, model.num_labels)  # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, dim=1)  # shape (batch_size * seq_len)

        # only compute accuracy at active labels
        active_accuracy = mask.view(-1) != 0  # shape (batch_size, seq_len)
        #active_labels = torch.where(active_accuracy, labels.view(-1), torch.tensor(0).type_as(labels))

        labels = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)

        tr_labels.extend(labels)
        tr_preds.extend(predictions)

        tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy

        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=MAX_GRAD_NORM)

        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f'Training loss epoch: {epoch_loss}')
    print(f'Training accuracy epoch: {tr_accuracy}')

In [43]:
for epoch in range(EPOCHS):
    print(f'Training epoch: {epoch + 1}')
    new_train(training_loader)

Training epoch: 1
Training loss per 10 training steps: 0.09828614443540573
Training loss per 10 training steps: 0.05371994034133174
Training loss per 10 training steps: 0.04786150618678048
Training loss per 10 training steps: 0.04444840796772511
Training loss per 10 training steps: 0.041603345423936844
Training loss per 10 training steps: 0.03860844554854374
Training loss per 10 training steps: 0.03771338675965051
Training loss per 10 training steps: 0.03684821108382352
Training loss per 10 training steps: 0.03627474746310416
Training loss per 10 training steps: 0.03600027975063402
Training loss per 10 training steps: 0.0357143497094512
Training loss per 10 training steps: 0.035629360156284796
Training loss per 10 training steps: 0.03501573374325579
Training loss per 10 training steps: 0.034541068470660054
Training loss per 10 training steps: 0.034439313113161014
Training loss per 10 training steps: 0.03402462889076464
Training loss per 10 training steps: 0.033781583866347435
Training 

In [44]:
labels, predictions = new_valid(model, testing_loader)

Validation loss per 10 evaluation steps: 0.03927920013666153
Validation loss per 10 evaluation steps: 0.029144178229299458
Validation loss per 10 evaluation steps: 0.028100788105456603
Validation loss per 10 evaluation steps: 0.028559184452939416
Validation loss per 10 evaluation steps: 0.02783264140257748
Validation loss per 10 evaluation steps: 0.027324571128131126
Validation loss per 10 evaluation steps: 0.027550950570062537
Validation loss per 10 evaluation steps: 0.027060789029887865
Validation loss per 10 evaluation steps: 0.026652490961606854
Validation loss per 10 evaluation steps: 0.02687792640670643
Validation loss per 10 evaluation steps: 0.02685508618850519
Validation loss per 10 evaluation steps: 0.02696979305127988
Validation loss per 10 evaluation steps: 0.026934643553123493
Validation loss per 10 evaluation steps: 0.026578037807618388
Validation loss per 10 evaluation steps: 0.026693669894168562
Validation loss per 10 evaluation steps: 0.026739474274958205
Validation lo

In [45]:
print(classification_report(labels, predictions))

              precision    recall  f1-score   support

         geo       0.81      0.89      0.85      7378
         gpe       0.92      0.92      0.92      3021
         org       0.60      0.58      0.59      3964
         per       0.71      0.78      0.74      3367
         tim       0.85      0.85      0.85      4070

   micro avg       0.78      0.81      0.80     21800
   macro avg       0.78      0.80      0.79     21800
weighted avg       0.78      0.81      0.80     21800



In [46]:
# Gather the statistics of BIO rule violations. Is there violations in labels?
def bio_violations(predictions) -> tuple[int, int, int, int, int]:
    violations = 0
    violations_in_b = 0
    violations_in_i = 0
    predicted_labels = 0    
    word_token = []
    
    # Remove the O label and treat each successive B or I label as a sublist
    for batch_sentence in predictions:
        temp_token = []
        for token in batch_sentence:
            predicted_labels += 1
            if token != 'O':
                temp_token.append(token)
            else:
                if len(temp_token) > 0:
                    word_token.append(temp_token)
                    temp_token = []
        if len(temp_token) > 0:
            word_token.append(temp_token)

    # Split out all non-first B tags in all sublists
    for token in word_token:
        for i in range(1, len(token)):
            if token[i].startswith('B'):
                word_token.append(token[:i])
                word_token.append(token[i:])
                del token[:]
                break

    word_token = [token for token in word_token if token]  # Remove the empty sublist
    word_token_b = [token for token in word_token if token[0].startswith('B')]  # All sublists starting with the B label
    word_token_i = [token for token in word_token if token[0].startswith('I')]  # All sublists starting with the I label

    # Statistics on internal error labels
    for token in word_token_b:
        base = token[0][2:]
        for i in range(1, len(token)):
            if token[i][2:] != base:
                violations += 1
                violations_in_b += 1
                
    for token in word_token_i:
        base = token[0][2:]
        for i in range(1, len(token)):
            if token[i][2:] != base:
                violations += 1
                violations_in_i += 1

    # Add all the I tags that follow the O tags
    violations += len(word_token_i)

    return violations, predicted_labels, violations_in_b, violations_in_i, len(word_token_i)  # violations, # predictions != 0

In [47]:
violations, predicted_labels, violations_in_b, violations_in_i, start_i = bio_violations(predictions)

print(f'Numbers of Violations Labels: {violations}, Violations Rate: {violations / predicted_labels}')
print(f'Numbers of Violations Labels Inner B Tags: {violations_in_b}')
print(f'Numbers of Violations Labels Inner I Tags: {violations_in_i}')
print(f'Numbers of Violations Labels I Tags Follow O Tags: {start_i}')

Numbers of Violations Labels: 1159, Violations Rate: 0.0045940471613228
Numbers of Violations Labels Inner B Tags: 799
Numbers of Violations Labels Inner I Tags: 15
Numbers of Violations Labels I Tags Follow O Tags: 345


#[pytorch-crf](https://pytorch-crf.readthedocs.io/en/stable/) Tutorial

Before we start Q3 and Q4, we need a brief tutorial on a automatically differentiable pytorch based linear-chain CRF package.

In [48]:
# from torchcrf import CRF
from torchcrf import CRF
crf_model = CRF(len(labels_to_ids))
seq_length = 3  # maximum sequence length in a batch
batch_size = 2  # number of samples in the batch
emissions = torch.randn(seq_length, batch_size, len(labels_to_ids))
tags = torch.tensor([[0, 1], [2, 4], [3, 0]], dtype=torch.long)  # (seq_length, batch_size)
attention_mask = torch.tensor([[1, 1], [1, 1], [1, 0]], dtype=torch.bool)  # (seq_length, batch_size)

loss = crf_model(emissions, tags, mask=attention_mask)
predictions = crf_model.decode(emissions, mask=attention_mask)

print(f'Loss: {loss}')
print(f'Predictions: {predictions}')

Loss: -16.180574417114258
Predictions: [[1, 9, 5], [6, 2]]


In [49]:
#should produce only 0 tags
crf_model.transitions.data[:, :] = -1e6
crf_model.transitions.data[0, 0] = 0

loss = crf_model(emissions, tags, mask=attention_mask)
predictions = crf_model.decode(emissions, mask=attention_mask)

print(f'Loss: {loss}')
print(f'Predictions: {predictions}')

Loss: -3000001.0
Predictions: [[0, 0, 0], [0, 0]]


In [50]:
#should produce 0 tags, and ends with a 1
crf_model.transitions.data[:, :] = -1e6
crf_model.transitions.data[0, 0] = 0
crf_model.transitions.data[0, 1] = 1e6

loss = crf_model(emissions, tags, mask=attention_mask)
predictions = crf_model.decode(emissions, mask=attention_mask)

print(f'Loss: {loss}')
print(f'Predictions: {predictions}')

Loss: -4999999.0
Predictions: [[0, 0, 1], [0, 1]]



#Q3
Use pytorch-crf to add transition scores to rule out violations.

This can be achieved via manully setting the crf_model.transitions.data, then rewrite valid method.

Then re-evaluate with the same trained Bert model as in Q2

1.   Add transition scores to rule out violations and re-evaluate (10 Marks)
2.   Recompute new BIO violations. You should find no violation (10 Marks)


20 Marks in total for Q3


In [51]:
# Set CRF model and evaluate
# Initialize the CRF model
from torchcrf import CRF

crf_model = CRF(len(labels_to_ids), batch_first=True)
crf_model.to(device)

# Initialize the transition matrix with very small values to prevent disallowed transitions
crf_model.transitions.data[:, :] = -1e6

# Define the allowed transition rules
for label_from, idx_from in labels_to_ids.items():
    for label_to, idx_to in labels_to_ids.items():

        # Any label can transition to O label (O -> O, B-<type> -> O, I-<type> -> O)
        if label_to == 'O':
            crf_model.transitions.data[idx_from, idx_to] = 0

        # O label can transition to any B label (O -> B-<type>)
        elif label_from == 'O' and label_to.startswith('B'):
            crf_model.transitions.data[idx_from, idx_to] = 0

        # B label can transition to the corresponding I label (B-<type> -> I-<type>)
        elif label_from.startswith('B') and label_to.startswith('I') and label_from[2:] == label_to[2:]:
            crf_model.transitions.data[idx_from, idx_to] = 0

        # I label can continue transitioning to the same type of I label (I-<type> -> I-<type>)
        elif label_from.startswith('I') and label_to.startswith('I') and label_from[2:] == label_to[2:]:
            crf_model.transitions.data[idx_from, idx_to] = 0

        # B label can transition to a different B label (B-<type> -> B-<new_type>)
        elif label_from.startswith('B') and label_to.startswith('B') and label_from[2:] != label_to[2:]:
            crf_model.transitions.data[idx_from, idx_to] = 0

        # I label can transition to a different B label (I-<type> -> B-<new_type>)
        elif label_from.startswith('I') and label_to.startswith('B') and label_from[2:] != label_to[2:]:
            crf_model.transitions.data[idx_from, idx_to] = 0

print(crf_model.transitions)

Parameter containing:
tensor([[       0.,        0.,        0.,        0., -1000000.,        0.,
         -1000000.,        0., -1000000., -1000000., -1000000.],
        [       0., -1000000.,        0.,        0.,        0.,        0.,
         -1000000.,        0., -1000000., -1000000., -1000000.],
        [       0.,        0., -1000000.,        0., -1000000.,        0.,
         -1000000.,        0., -1000000.,        0., -1000000.],
        [       0.,        0.,        0., -1000000., -1000000.,        0.,
         -1000000.,        0.,        0., -1000000., -1000000.],
        [       0., -1000000.,        0.,        0.,        0.,        0.,
         -1000000.,        0., -1000000., -1000000., -1000000.],
        [       0.,        0.,        0.,        0., -1000000., -1000000.,
                0.,        0., -1000000., -1000000., -1000000.],
        [       0.,        0.,        0.,        0., -1000000., -1000000.,
                0.,        0., -1000000., -1000000., -1000000.]

In [52]:
#Rewrite the valid method, get predictions and loss from crf_model
def crf_valid(model, crf_model, testing_loader) -> tuple[list, list]:
    # put model in evaluation mode
    model.eval()
    crf_model.eval()

    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []

    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):

            ids = batch['input_ids'].to(device, dtype=torch.long)
            mask = batch['attention_mask'].to(device, dtype=torch.bool)
            labels = batch['labels'].to(device, dtype=torch.long)

            outputs = model(input_ids=ids, attention_mask=mask, labels=labels)
            emissions = outputs[1]
            
            loss = -crf_model(emissions, labels, mask=mask)  # outpus[0] should also come from crf_model
            predictions = crf_model.decode(emissions, mask=mask)  # outpus[1]  use crf_model to get predictions
            
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += labels.size(0)

            if idx % 10 == 0:
                loss_step = eval_loss / nb_eval_steps
                print(f'Validation loss per 10 evaluation steps: {loss_step}')

            # compute evaluation accuracy
            flattened_targets = labels.view(-1)  # shape (batch_size * seq_len)

            # only compute accuracy at active labels
            active_accuracy = mask.view(-1)  # shape (batch_size, seq_len)

            labels = torch.masked_select(flattened_targets, active_accuracy).tolist()
            predictions = [label for sentence in predictions for label in sentence]

            eval_labels.append(labels)
            eval_preds.append(predictions)

            tmp_eval_accuracy = accuracy_score(labels, predictions)
            eval_accuracy += tmp_eval_accuracy

    labels = [[ids_to_labels[id] for id in labels] for labels in eval_labels]
    predictions = [[ids_to_labels[id] for id in preds] for preds in eval_preds]

    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f'Validation Loss: {eval_loss}')
    print(f'Validation Accuracy: {eval_accuracy}')

    return labels, predictions

In [53]:
labels, predictions = crf_valid(model, crf_model, testing_loader)

Validation loss per 10 evaluation steps: 52.899871826171875
Validation loss per 10 evaluation steps: 77.3529545177113
Validation loss per 10 evaluation steps: 83.44442422049386
Validation loss per 10 evaluation steps: 81.93106940484816
Validation loss per 10 evaluation steps: 90.8837721289658
Validation loss per 10 evaluation steps: 89.3619322683297
Validation loss per 10 evaluation steps: 91.14046196859391
Validation loss per 10 evaluation steps: 90.25048081303986
Validation loss per 10 evaluation steps: 90.28720125739957
Validation loss per 10 evaluation steps: 90.46473195002629
Validation loss per 10 evaluation steps: 91.09126746300423
Validation loss per 10 evaluation steps: 90.39606871046462
Validation loss per 10 evaluation steps: 8354.567604033415
Validation loss per 10 evaluation steps: 7723.276865806289
Validation loss per 10 evaluation steps: 7181.110299103649
Validation loss per 10 evaluation steps: 6710.4257731406105
Validation loss per 10 evaluation steps: 6299.31110481593

In [54]:
print(classification_report(labels, predictions))

              precision    recall  f1-score   support

         geo       0.83      0.89      0.86      7378
         gpe       0.94      0.93      0.93      3021
         org       0.69      0.59      0.64      3964
         per       0.76      0.79      0.77      3367
         tim       0.88      0.85      0.86      4070

   micro avg       0.82      0.82      0.82     21800
   macro avg       0.82      0.81      0.81     21800
weighted avg       0.82      0.82      0.82     21800



In [55]:
violations, predicted_labels, violations_in_b, violations_in_i, start_i = bio_violations(predictions)

print(f'Numbers of Violations Labels: {violations}, Violations Rate: {violations / predicted_labels}')
print(f'Numbers of Violations Labels Inner B Tags: {violations_in_b}')
print(f'Numbers of Violations Labels Inner I Tags: {violations_in_i}')
print(f'Numbers of Violations Labels I Tags Follow O Tags: {start_i}')

Numbers of Violations Labels: 0, Violations Rate: 0.0
Numbers of Violations Labels Inner B Tags: 0
Numbers of Violations Labels Inner I Tags: 0
Numbers of Violations Labels I Tags Follow O Tags: 0


#Q4
Use pytorch-crf to jointly re-train the Bert model with transition score.


Then re-evaluate with the same trained Bert model as in Q2

1.   Train transition scores jointly with Bert model (20 Marks)
2.   Report Result and Recompute new BIO violations.  (10 Marks)


30 Marks in total for Q4



In [56]:
# The parameters of BERT and CRF models are collected separately
optimizer_parameters = [
    {'params': model.parameters(), 'lr': LEARNING_RATE},  # BERT model parameters
    {'params': crf_model.parameters(), 'lr': LEARNING_RATE}  # # CRF model parameters
]

optimizer = torch.optim.Adam(optimizer_parameters)

In [57]:
# Rewrite the train method here, and then re-evaluate
def crf_train(training_loader):
    tr_loss, tr_accuracy = 0, 0  # training loss and training accuracy
    nb_tr_examples, nb_tr_steps = 0, 0  # the number of training samples and training steps
    tr_preds, tr_labels = [], []  # stores prediction labels and real labels
    # put model in training mode
    model.train()
    crf_model.train()

    for idx, batch in enumerate(training_loader):
        
        ids = batch['input_ids'].to(device, dtype=torch.long)
        mask = batch['attention_mask'].to(device, dtype=torch.bool)
        labels = batch['labels'].to(device, dtype=torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, labels=labels)  # forward propagation
        emissions = outputs[1]
        
        loss = -crf_model(emissions, labels, mask=mask)  # outpus[0] should also come from crf_model
        predictions = crf_model.decode(emissions, mask=mask)  # outpus[1]  use crf_model to get predictions
        
        tr_loss += loss.item()  # add up the loss of the current batch

        nb_tr_steps += 1  # update the training step count
        nb_tr_examples += labels.size(0)  # Update the number of training samples

        if idx % 10 == 0:  # print training losses every 10 steps
            loss_step = tr_loss / nb_tr_steps
            print(f'Training loss per 10 training steps: {loss_step}')

        # compute evaluation accuracy
        flattened_targets = labels.view(-1)  # shape (batch_size * seq_len,)

        # only compute accuracy at active labels
        active_accuracy = mask.view(-1)  # shape (batch_size, seq_len)

        labels = torch.masked_select(flattened_targets, active_accuracy).tolist()
        predictions = [label for sentence in predictions for label in sentence]

        tr_labels.append(labels)
        tr_preds.append(predictions)

        tmp_tr_accuracy = accuracy_score(labels, predictions)
        tr_accuracy += tmp_tr_accuracy
        
        # gradient clipping
        all_model_parameters = list(model.parameters()) + list(crf_model.parameters())
        torch.nn.utils.clip_grad_norm_(all_model_parameters, max_norm=MAX_GRAD_NORM)
        
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f'Validation Loss: {epoch_loss}')
    print(f'Validation Accuracy: {tr_accuracy}')

In [58]:
for epoch in range(EPOCHS):
    print(f'Training epoch: {epoch + 1}')
    crf_train(training_loader)

Training epoch: 1
Training loss per 10 training steps: 109.29573059082031
Training loss per 10 training steps: 75.7021817294034
Training loss per 10 training steps: 73.704587663923
Training loss per 10 training steps: 67.7897561596286
Training loss per 10 training steps: 65.94489288330078
Training loss per 10 training steps: 65.20941393983131
Training loss per 10 training steps: 64.16057855574812
Training loss per 10 training steps: 62.4219983060595
Training loss per 10 training steps: 61.387170438413264
Training loss per 10 training steps: 61.04262454693134
Training loss per 10 training steps: 62.19312063538202
Training loss per 10 training steps: 61.6182490168391
Training loss per 10 training steps: 62.22922591532557
Training loss per 10 training steps: 62.757848928902895
Training loss per 10 training steps: 62.30357225566891
Training loss per 10 training steps: 61.98434594766983
Training loss per 10 training steps: 61.88572984304487
Training loss per 10 training steps: 61.7647560521

In [59]:
print(crf_model.transitions)

Parameter containing:
tensor([[ 1.5194e-03, -1.5520e-04, -1.8233e-03, -1.8587e-03, -1.0000e+06,
         -1.3294e-03, -1.0000e+06, -1.3070e-03, -1.0000e+06, -1.0000e+06,
         -1.0000e+06],
        [-1.3241e-03, -1.0000e+06, -3.7061e-05, -2.4885e-03,  2.5330e-04,
         -1.5552e-03, -1.0000e+06,  9.6984e-04, -1.0000e+06, -1.0000e+06,
         -1.0000e+06],
        [-2.3644e-03, -4.2800e-04, -1.0000e+06, -6.0014e-04, -1.0000e+06,
          1.6621e-03, -1.0000e+06, -2.4324e-03, -1.0000e+06,  2.6147e-04,
         -1.0000e+06],
        [-2.3600e-03, -1.6788e-03,  3.3200e-04, -1.0000e+06, -1.0000e+06,
          9.7256e-05, -1.0000e+06, -1.3248e-03, -3.4791e-04, -1.0000e+06,
         -1.0000e+06],
        [ 8.5010e-04, -1.0000e+06, -4.4360e-03, -1.6465e-03,  6.8245e-04,
         -2.8451e-03, -1.0000e+06, -1.1339e-03, -1.0000e+06, -1.0000e+06,
         -1.0000e+06],
        [-1.8269e-03, -4.2409e-03,  5.3631e-04, -2.5055e-04, -1.0000e+06,
         -1.0000e+06,  1.1785e-04, -1.2711e-03, -

In [60]:
labels, predictions = crf_valid(model, crf_model, testing_loader)

Validation loss per 10 evaluation steps: 36.41700744628906
Validation loss per 10 evaluation steps: 45.74637118252841
Validation loss per 10 evaluation steps: 46.61808231898716
Validation loss per 10 evaluation steps: 49.354852984028476
Validation loss per 10 evaluation steps: 47.89251029782179
Validation loss per 10 evaluation steps: 49.16838903988109
Validation loss per 10 evaluation steps: 49.02370058903929
Validation loss per 10 evaluation steps: 48.463373802077605
Validation loss per 10 evaluation steps: 49.36255918903115
Validation loss per 10 evaluation steps: 49.408356090168375
Validation loss per 10 evaluation steps: 49.24526173053402
Validation loss per 10 evaluation steps: 48.927218050570104
Validation loss per 10 evaluation steps: 49.76448394838444
Validation loss per 10 evaluation steps: 49.77584985922311
Validation loss per 10 evaluation steps: 50.0907302612954
Validation loss per 10 evaluation steps: 49.867099370388004
Validation loss per 10 evaluation steps: 50.23502227

In [61]:
print(classification_report(labels, predictions, zero_division=0))

              precision    recall  f1-score   support

         geo       0.83      0.90      0.86      7378
         gpe       0.95      0.93      0.94      3021
         org       0.79      0.58      0.67      3964
         per       0.78      0.79      0.79      3367
         tim       0.87      0.86      0.87      4070

   micro avg       0.84      0.82      0.83     21800
   macro avg       0.84      0.81      0.82     21800
weighted avg       0.84      0.82      0.83     21800



In [62]:
violations, predicted_labels, violations_in_b, violations_in_i, start_i = bio_violations(predictions)

print(f'Numbers of Violations Labels: {violations}, Violations Rate: {violations / predicted_labels}')
print(f'Numbers of Violations Labels Inner B Tags: {violations_in_b}')
print(f'Numbers of Violations Labels Inner I Tags: {violations_in_i}')
print(f'Numbers of Violations Labels I Tags Follow O Tags: {start_i}')

Numbers of Violations Labels: 0, Violations Rate: 0.0
Numbers of Violations Labels Inner B Tags: 0
Numbers of Violations Labels Inner I Tags: 0
Numbers of Violations Labels I Tags Follow O Tags: 0
