This file contains the answers to the first, second and third questions of the third NLP exercise. In the first question we were asked to run the code in the second cell on this dataset:

https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion

Before running the code, we had to install the transformers library, as it is not preinstalled in the Google Colab environment. The dataset was downloaded from the Kaggle repository and uploaded to Google Drive, and the drive was then mounted in the notebook.

In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.29.2


Question 3 

Explanation of the **align_label** function (Answer to question number 3):

The align_label function plays a crucial role in handling subword tokenization. As subword tokenizers split words into smaller pieces (subwords), the labels need to be realigned accordingly to match the number of tokens produced. The align_label function serves this purpose by ensuring that each token receives an appropriate label.

Concretely, the original label is assigned only to the first subword of each word. The remaining subwords, as well as the special tokens for which word_ids() returns None ([CLS], [SEP] and padding), receive the label -100. The rationale is that each word should be counted exactly once during training and evaluation. The value -100 is the default ignore_index of PyTorch's cross-entropy loss, so these positions are skipped in the loss calculation, and the evaluation code filters them out explicitly before computing the metrics. The model still receives these tokens as input; only their labels are ignored.
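To illustrate what "ignoring" a position means for the loss, here is a minimal pure-Python sketch of a cross-entropy that skips the label -100. It is not the code PyTorch runs internally; the function name `masked_cross_entropy` and the toy logits are assumptions made for this example only.

```python
import math

IGNORE_INDEX = -100  # same sentinel value that align_label writes into label_ids


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        # numerically stable log-sum-exp over the class scores
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / count if count else 0.0


# toy example: three token positions, two classes (hypothetical values)
logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
labels = [0, 1, -100]

loss_masked = masked_cross_entropy(logits, labels)
loss_first_two = masked_cross_entropy(logits[:2], labels[:2])
print(loss_masked, loss_first_two)
```

Both printed values are the same, because the third position is skipped regardless of what the model predicts there.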

This approach helps maintain a consistent and meaningful mapping between the original words and their corresponding labels while still accommodating the subword tokenization process.
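The alignment described above can be sketched without loading a tokenizer. The function below mirrors the logic of align_label on a hand-written `word_ids` list; the name `align_first_subword`, the example split into word pieces, and the label ids are hypothetical and do not come from the actual tokenizer or tagmap.

```python
def align_first_subword(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # special token such as [CLS], [SEP] or padding
            aligned.append(ignore_index)
        elif word_id != previous:
            # first subword of a new word keeps the word's label
            aligned.append(word_labels[word_id])
        else:
            # later subword of the same word is masked
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# hypothetical tokenization: [CLS] word0 word1-piece1 word1-piece2 [SEP]
word_ids = [None, 0, 1, 1, None]
word_labels = [1, 2]  # hypothetical label ids, one per original word

aligned = align_first_subword(word_ids, word_labels)
print(aligned)  # [-100, 1, 2, -100, -100]
```

The output has one entry per token, and exactly one unmasked entry per original word.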

In [4]:
# dependencies
import torch
import torch.optim as optim 
from torchtext.vocab import build_vocab_from_iterator
from transformers import BertForTokenClassification, BertTokenizerFast
from sklearn.metrics import accuracy_score, balanced_accuracy_score, classification_report
# import the notebook submodule explicitly rather than relying on `import tqdm` to load it
from tqdm.notebook import tqdm as tqdmn

# hyper-parameters
EPOCHS = 3
BATCH_SIZE = 8
LR = 1e-5

# the path of the data files
base_path = '/content/drive/MyDrive/nlpdataset/'

# use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# read the data files
def load_sentences(filepath):

    sentences = []
    tokens = []
    pos_tags = []
    chunk_tags = []
    ner_tags = []

    with open(filepath, 'r') as f:
        
        for line in f.readlines():
            
            if (line == ('-DOCSTART- -X- -X- O\n') or line == '\n'):
                if len(tokens) > 0:
                    sentences.append({'tokens': tokens, 'pos_tags': pos_tags, 'chunk_tags': chunk_tags, 'ner_tags': ner_tags})
                    tokens = []
                    pos_tags = []
                    chunk_tags = []
                    ner_tags = []
            else:
                l = line.split(' ')
                tokens.append(l[0])
                pos_tags.append(l[1])
                chunk_tags.append(l[2])
                ner_tags.append(l[3].strip('\n'))
    
    return sentences

print('loading data')
train_sentences = load_sentences(base_path + 'train.txt')
test_sentences = load_sentences(base_path + 'test.txt')
valid_sentences = load_sentences(base_path + 'valid.txt')

# build tagset and tag ids
tags = [sentence['ner_tags'] for sentence in train_sentences]
tagmap = build_vocab_from_iterator(tags)
tagset = set([item for sublist in tags for item in sublist])
print('Tagset size:',len(tagset))

# load BERT tokenizer
bert_version = 'bert-base-uncased'
tokenizer = BertTokenizerFast.from_pretrained(bert_version)

# map tokens and tags to token ids and label ids
def align_label(tokens, labels):

    word_ids = tokens.word_ids()
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)
        elif word_idx != previous_word_idx:
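            try:
                label_ids.append(tagmap[labels[word_idx]])
            except Exception:
                # unknown tag or missing label: exclude this position
                label_ids.append(-100)
        else:
            # later subword of the same word: exclude from loss and metrics
            label_ids.append(-100)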
        previous_word_idx = word_idx

    return label_ids

def encode(sentence):
    encodings = tokenizer(sentence['tokens'], truncation=True, padding='max_length', is_split_into_words=True)
    labels = align_label(encodings, sentence['ner_tags'])
    return { 'input_ids': torch.LongTensor(encodings.input_ids), 'attention_mask': torch.LongTensor(encodings.attention_mask), 'labels': torch.LongTensor(labels) }

print('encoding data')
train_dataset = [encode(sentence) for sentence in train_sentences]
valid_dataset = [encode(sentence) for sentence in valid_sentences]
test_dataset = [encode(sentence) for sentence in test_sentences]

# initialize the model including a classification layer with num_labels classes
print('initializing the model')
model = BertForTokenClassification.from_pretrained(bert_version, num_labels=len(tagset))
model.to(device)
optimizer = optim.AdamW(params=model.parameters(), lr=LR)

# prepare batches of data
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE)

# evaluate the performance of the model
def EvaluateModel(model, data_loader):
    model.eval()
    with torch.no_grad():
        Y_actual, Y_preds = [],[]
        for i, batch in enumerate(tqdmn(data_loader)):
            # move the batch tensors to the same device as the model
            batch = { k: v.to(device) for k, v in batch.items() }
            # send 'input_ids', 'attention_mask' and 'labels' to the model
            outputs = model(**batch)
            # iterate through the examples
            for idx, _ in enumerate(batch['labels']):
                # get the true values
                true_values_all = batch['labels'][idx]
                true_values = true_values_all[true_values_all != -100]
                # get the predicted values
                pred_values = torch.argmax(outputs[1], dim=2)[idx]
                pred_values = pred_values[true_values_all != -100]
                # update the lists of true answers and predictions
                Y_actual.append(true_values)
                Y_preds.append(pred_values)
        Y_actual = torch.cat(Y_actual)
        Y_preds = torch.cat(Y_preds)
    # Return list of actual labels, predicted labels 
    return Y_actual.detach().cpu().numpy(), Y_preds.detach().cpu().numpy()

# train the model
print('training the model')
for epoch in tqdmn(range(EPOCHS)):
    model.train()
    print('epoch',epoch+1)
    # iterate through each batch of the train data
    for i, batch in enumerate(tqdmn(train_loader)):
        # move the batch tensors to the same device as the model
        batch = { k: v.to(device) for k, v in batch.items() }
        # send 'input_ids', 'attention_mask' and 'labels' to the model
        outputs = model(**batch)
        loss = outputs[0]
        # set the gradients to zero
        optimizer.zero_grad()
        # propagate the loss backwards
        loss.backward()
        # update the model weights
        optimizer.step()
    # calculate performance on validation set
    Y_actual, Y_preds = EvaluateModel(model,valid_loader)
    print("\nValidation Accuracy : {:.3f}".format(accuracy_score(Y_actual, Y_preds)))
    print("\nValidation Macro-Accuracy : {:.3f}".format(balanced_accuracy_score(Y_actual, Y_preds)))

print('applying the model to the test set')
# apply the trained model to the test set
Y_actual, Y_preds = EvaluateModel(model,test_loader)

print("\nTest Accuracy : {:.3f}".format(accuracy_score(Y_actual, Y_preds)))
print("\nTest Macro-Accuracy : {:.3f}".format(balanced_accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds,labels = tagmap(tagmap.get_itos()), target_names = tagmap.get_itos(), zero_division = 0))



loading data
Tagset size: 9


encoding data
initializing the model



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

training the model


  0%|          | 0/3 [00:00<?, ?it/s]

epoch 1


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.985

Validation Macro-Accuracy : 0.909
epoch 2


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.987

Validation Macro-Accuracy : 0.916
epoch 3


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.988

Validation Macro-Accuracy : 0.933
applying the model to the test set


  0%|          | 0/432 [00:00<?, ?it/s]


Test Accuracy : 0.979

Test Macro-Accuracy : 0.901

Classification Report : 
              precision    recall  f1-score   support

           O       0.99      0.99      0.99     38323
       B-LOC       0.93      0.93      0.93      1668
       B-PER       0.99      0.95      0.97      1617
       B-ORG       0.89      0.90      0.89      1661
       I-PER       0.99      0.98      0.98      1156
       I-ORG       0.83      0.90      0.87       835
      B-MISC       0.81      0.83      0.82       702
       I-LOC       0.80      0.88      0.84       257
      I-MISC       0.64      0.75      0.69       216

    accuracy                           0.98     46435
   macro avg       0.87      0.90      0.89     46435
weighted avg       0.98      0.98      0.98     46435



On the test set the NER model reaches an accuracy of 0.979 and a macro-accuracy of 0.901. The accuracy is the proportion of correctly classified tokens, counted only at positions whose label is not -100, i.e. the first subword of each word.

Accuracy alone is dominated by the O class, which accounts for 38323 of the 46435 evaluated tokens. The macro-accuracy (scikit-learn's balanced_accuracy_score) is the average of the per-class recalls, so every class carries equal weight regardless of its frequency. Its value of 0.901 indicates that the model also handles the less frequent classes reasonably well, although the classification report shows a clear gap: the rarest classes are the weakest, with F1 scores of 0.69 for I-MISC (216 tokens), 0.84 for I-LOC (257 tokens) and 0.82 for B-MISC (702 tokens), compared with 0.97 or higher for O, B-PER and I-PER.
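The difference between the two metrics can be seen on a small toy example. The sketch below computes plain accuracy and the average per-class recall by hand; the helper names and the toy labels are assumptions made for illustration, not data from the notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Average of the per-class recalls, each class weighted equally."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


# toy data: a frequent class "O" and a rare class "B-MISC" (hypothetical)
y_true = ["O"] * 8 + ["B-MISC"] * 2
y_pred = ["O"] * 8 + ["O", "B-MISC"]

acc = accuracy(y_true, y_pred)
bal = macro_recall(y_true, y_pred)
print(acc, bal)  # 0.9 0.75
```

One error on the rare class barely moves the accuracy but halves that class's recall, which is why the macro-averaged figure is the lower of the two.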

In [None]:
import random

def find_failed_sentence(test_sentences, test_dataset):
    indices = list(range(len(test_sentences)))
    random.shuffle(indices)

    for idx in indices:
        sentence = test_sentences[idx]
        if len(sentence['tokens']) >= 10:
            encoded = test_dataset[idx]
            batch = {k: v.unsqueeze(0).to(device) for k, v in encoded.items()}
            with torch.no_grad():
                outputs = model(**batch)
            pred_values = torch.argmax(outputs[1], dim=2)[0]
            pred_values = pred_values[encoded['labels'] != -100].detach().cpu().numpy()
            true_values = encoded['labels'][encoded['labels'] != -100].detach().cpu().numpy()

            if not (pred_values == true_values).all():
                return sentence, pred_values, true_values

    return None, None, None

failed_sentence, preds, actual = find_failed_sentence(test_sentences, test_dataset)

if failed_sentence:
    itos = tagmap.get_itos()
    print("Failed Sentence Tokens:", failed_sentence['tokens'])
    print("\nActual Tags:", [itos[tag] for tag in actual])
    print("\nPredicted Tags:", [itos[tag] for tag in preds])
    print("\nToken-wise comparison:")
    for token, actual_tag, pred_tag in zip(failed_sentence['tokens'], actual, preds):
        print(f"Token: {token}, Actual: {itos[actual_tag]}, Predicted: {itos[pred_tag]}")
else:
    print("No failed sentence with at least 10 tokens found.")

Failed Sentence Tokens: ['Luxembourg', "'s", 'traditional', 'Christmas', 'market', ',', 'which', 'starts', 'on', 'Saturday', 'and', 'runs', 'to', 'December', '24', ',', 'has', 'taken', 'to', 'the', 'world', 'wide', 'web', 'as', 'a', 'way', 'of', 'publicising', 'its', 'activities', '.']

Actual Tags: ['B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Predicted Tags: ['B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

Token-wise comparison:
Token: Luxembourg, Actual: B-LOC, Predicted: B-LOC
Token: 's, Actual: O, Predicted: O
Token: traditional, Actual: O, Predicted: O
Token: Christmas, Actual: O, Predicted: O
Token: market, Actual: O, Predicted: O
Token: ,, Actual: O, Predicted: O
Token: which, Actual: O, Predicted: O
Token: starts, Actual: O, Predicted: O
Tok

Here are the tokens that were incorrectly tagged:

world - Actual: O, Predicted: B-MISC\
wide - Actual: O, Predicted: I-ORG\
web - Actual: O, Predicted: I-ORG

Comment:
The model incorrectly tagged the words "world", "wide" and "web", which are labelled 'O' in the test data, i.e. not part of any named entity. It appears to have treated the phrase "world wide web" as a name, but inconsistently: the first word was tagged as the start of a miscellaneous entity (B-MISC) and the following two as the inside of an organization (I-ORG). Possible causes are that the training data contains occurrences of "world wide web", or of its individual words, inside named entities, or that the model has not seen enough examples of the phrase in a non-entity context to generalize well.
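A prediction that mixes entity types inside one span, as in this sentence, can be flagged automatically. The sketch below assumes the tags follow the convention seen in the outputs above, where every entity starts with a B- tag and continues with I- tags of the same type; the function name `invalid_bio_positions` is hypothetical.

```python
def invalid_bio_positions(tags):
    """Return indices of I-X tags that do not continue a B-X or I-X of the same type."""
    bad = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                bad.append(i)
        previous = tag
    return bad


# the relevant part of the predicted sequence: "the world wide web as"
predicted = ["O", "B-MISC", "I-ORG", "I-ORG", "O"]
bad_positions = invalid_bio_positions(predicted)
print(bad_positions)  # [2]
```

Only the first I-ORG is reported, since the second one does continue an I-ORG. Such a check needs no gold labels, so it could also be applied to unlabelled text.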

In [6]:
sentence = "Mount Everest, part of the Himalayas, is the Earth's highest mountain above sea level."
tokens = tokenizer(sentence, truncation=True, is_split_into_words=False, return_offsets_mapping=True)

input_ids = torch.tensor(tokens['input_ids']).unsqueeze(0).to(device)
attention_mask = torch.tensor(tokens['attention_mask']).unsqueeze(0).to(device)

model.eval()
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    predicted_labels = torch.argmax(outputs[0], dim=2).squeeze().tolist()


predicted_labels = [tagmap.get_itos()[label_id] for label_id in predicted_labels]


# collect the text of each subword token; the slice [1:-1] drops [CLS] and [SEP]
words = []
for start, end in tokens['offset_mapping'][1:-1]:
    if start != end:
        words.append(sentence[start:end])

# drop the predictions for [CLS] and [SEP] as well, so that both lists stay aligned
predicted_labels = predicted_labels[1:-1]

correct_labels = ["B-LOC", "I-LOC", "O", "O", "O", "O", "B-LOC", "O", "O", "O", "B-LOC", "O", "O", "O", "O", "O", "O","O","O"]
indexnum = 1
for word, correct_label, predicted_label in zip(words, correct_labels, predicted_labels):
    print(f"{indexnum}: {word} - Correct: {correct_label}, Predicted: {predicted_label}")
    indexnum += 1

1: Mount - Correct: B-LOC, Predicted: B-LOC
2: Everest - Correct: I-LOC, Predicted: B-LOC
3: , - Correct: O, Predicted: O
4: part - Correct: O, Predicted: O
5: of - Correct: O, Predicted: O
6: the - Correct: O, Predicted: O
7: Himalayas - Correct: B-LOC, Predicted: B-LOC
8: , - Correct: O, Predicted: O
9: is - Correct: O, Predicted: O
10: the - Correct: O, Predicted: O
11: Earth - Correct: B-LOC, Predicted: O
12: ' - Correct: O, Predicted: O
13: s - Correct: O, Predicted: O
14: highest - Correct: O, Predicted: O
15: mountain - Correct: O, Predicted: O
16: above - Correct: O, Predicted: O
17: sea - Correct: O, Predicted: O
18: level - Correct: O, Predicted: O
19: . - Correct: O, Predicted: O


1) "Everest" was labeled as I-LOC (inside a location) in the ground truth, but the model predicted it as B-LOC (beginning of a location).

The model mistakenly identified it as the beginning of a location rather than recognizing it as a continuation of the previous location "Mount."

2) "Earth" was labeled as B-LOC (beginning of a location) in the ground truth, but the model predicted it as O (outside any entity).

In this case, the model failed to recognize "Earth" as a location entity. The misclassification could be due to the other meaning of the word "earth", namely "ground". Since the model is uncased (bert-base-uncased), the capital letter cannot help it distinguish the planet from the common noun.

Overall, the model correctly identified "Mount" and "Himalayas" as location entities but misclassified "Everest" and "Earth" in the given sequence.
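The practical effect of the "Everest" error becomes visible when the tags are grouped into entity spans. The sketch below is one simple way to do this grouping, not the method used by any particular evaluation library; the function name `extract_spans` is hypothetical, and a stray I- tag is treated as the start of a new span by assumption.

```python
def extract_spans(words, tags):
    """Group B-/I- tagged words into (text, entity_type) spans."""
    spans = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("I-") and current_type == tag[2:]:
            # continuation of the span that is currently open
            current_words.append(word)
            continue
        if current_words:
            # any other tag closes the open span
            spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if tag != "O":
            # B-X, or an I-X with nothing to continue, opens a new span
            current_words, current_type = [word], tag[2:]
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


words = ["Mount", "Everest", ",", "part"]
gold_spans = extract_spans(words, ["B-LOC", "I-LOC", "O", "O"])
pred_spans = extract_spans(words, ["B-LOC", "B-LOC", "O", "O"])
print(gold_spans)  # [('Mount Everest', 'LOC')]
print(pred_spans)  # [('Mount', 'LOC'), ('Everest', 'LOC')]
```

With the predicted tags, the single location "Mount Everest" is split into two separate locations, even though both tokens received the right entity type.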