# NLP Assignment #3
### by Prodromos Kampouridis

#### IMPORTANT NOTE
##### *Because the code is long, the answers to tasks 1-8 are also given in the Markdown cells below.*


##### *For more detailed information, please refer to the report entitled PRODROMOS KAMPOURIDIS REPORT*.

**Introduction**

In the first code cell of the notebook, we modify the given code of the ner-bert.py file so that some of its functionality can be reused across multiple questions. The changes are the following:

- The code for building the tagset and tag ids is moved into the build_tags function, which takes the tags_type to use as a parameter, so that the same function can be reused for NER, POS tagging and chunking.
- The encode function also gains a tags_type parameter, so that it can be used for the POS tagging and chunking tasks later on.
- An extra parameter, return_concat_results, is added to the EvaluateModel function. Setting it to False returns the predictions and labels per sentence (without concatenating them), which is needed to identify wrongly classified sentences in question 2.
- The training process is moved into the TrainModel function, which takes the model and the training dataloader as parameters, so that it can train different models (see questions 1, 4, 5, 6, 7, 8) on different datasets (see question 5). TrainModel reads the optimizer from the global optimizer variable, so a new optimizer is created before each call.
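
As a rough illustration of what build_tags produces, the sketch below builds a tag-to-id map in pure Python. It assumes that ids are assigned in descending order of tag frequency (ties broken alphabetically), which is our understanding of how torchtext's build_vocab_from_iterator orders its vocabulary. The helper name build_tag_ids and the toy sentences are hypothetical and not part of the notebook.

```python
from collections import Counter

def build_tag_ids(sentences, tags_type="ner_tags"):
    """Hypothetical stand-in for build_tags: returns (tag -> id map, tagset)."""
    counts = Counter(tag for sentence in sentences for tag in sentence[tags_type])
    # Assumption: the most frequent tag gets id 0; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda tag: (-counts[tag], tag))
    tag_to_id = {tag: idx for idx, tag in enumerate(ordered)}
    return tag_to_id, set(counts)

# Toy data in the same dict layout that load_sentences returns.
toy_sentences = [
    {"ner_tags": ["O", "B-PER", "O"], "pos_tags": ["DT", "NNP", "VBZ"]},
    {"ner_tags": ["B-LOC", "O"], "pos_tags": ["NNP", "VBZ"]},
]

# The same function serves several tasks by switching tags_type.
ner_ids, ner_tagset = build_tag_ids(toy_sentences, "ner_tags")
pos_ids, pos_tagset = build_tag_ids(toy_sentences, "pos_tags")
print(ner_ids)
print(pos_ids)
```

Passing the tag type as a parameter is what lets one code path serve NER, POS tagging and chunking without duplication.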

In [4]:
#
# Named-entity recognition using BERT
# Dataset: https://www.kaggle.com/datasets/alaakhaled/conll003-englishversion
#

# dependencies
import torch
import torch.optim as optim 
from torchtext.vocab import build_vocab_from_iterator
from transformers import BertForTokenClassification, BertTokenizerFast
from sklearn.metrics import accuracy_score, balanced_accuracy_score, classification_report
# import the notebook submodule explicitly; a bare `import tqdm` does not guarantee it is loaded
from tqdm.notebook import tqdm as tqdmn

# hyper-parameters
EPOCHS = 3
BATCH_SIZE = 8
LR = 1e-5

# the path of the data files
base_path = '/kaggle/input/conll003-englishversion/'

# use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# read the data files
def load_sentences(filepath):

    sentences = []
    tokens = []
    pos_tags = []
    chunk_tags = []
    ner_tags = []

    with open(filepath, 'r') as f:
        
        for line in f.readlines():
            
            if (line == ('-DOCSTART- -X- -X- O\n') or line == '\n'):
                if len(tokens) > 0:
                    sentences.append({'tokens': tokens, 'pos_tags': pos_tags, 'chunk_tags': chunk_tags, 'ner_tags': ner_tags})
                    tokens = []
                    pos_tags = []
                    chunk_tags = []
                    ner_tags = []
            else:
                l = line.split(' ')
                tokens.append(l[0])
                pos_tags.append(l[1])
                chunk_tags.append(l[2])
                ner_tags.append(l[3].strip('\n'))
    
    return sentences

print('loading data')
train_sentences = load_sentences(base_path + 'train.txt')
test_sentences = load_sentences(base_path + 'test.txt')
valid_sentences = load_sentences(base_path + 'valid.txt')


def build_tags(tags_type="ner_tags"):
    # build tagset and tag ids
    tags = [sentence[tags_type] for sentence in train_sentences]
    tagmap = build_vocab_from_iterator(tags)
    tagset = set([item for sublist in tags for item in sublist])
    print('Tagset size:',len(tagset))
    return tagmap, tagset

tagmap, tagset = build_tags("ner_tags")

# load BERT tokenizer
bert_version = 'bert-base-uncased'
tokenizer = BertTokenizerFast.from_pretrained(bert_version)

# map tokens and tags to token ids and label ids
def align_label(tokens, labels):

    word_ids = tokens.word_ids()
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)
        elif word_idx != previous_word_idx:
            try:
                label_ids.append(tagmap[labels[word_idx]])
            except Exception:
                # tag not found in tagmap: ignore this position
                label_ids.append(-100)
        else:
            label_ids.append(-100)
        previous_word_idx = word_idx

    return label_ids

def encode(sentence, tags_type="ner_tags"):
    encodings = tokenizer(sentence['tokens'], truncation=True, padding='max_length', is_split_into_words=True)
    labels = align_label(encodings, sentence[tags_type])
    return { 'input_ids': torch.LongTensor(encodings.input_ids), 'attention_mask': torch.LongTensor(encodings.attention_mask), 'labels': torch.LongTensor(labels) }

print('encoding data')
train_dataset = [encode(sentence, "ner_tags") for sentence in train_sentences]
valid_dataset = [encode(sentence, "ner_tags") for sentence in valid_sentences]
test_dataset = [encode(sentence, "ner_tags") for sentence in test_sentences]

# initialize the model including a classification layer with num_labels classes
print('initializing the model')
ner_model = BertForTokenClassification.from_pretrained(bert_version, num_labels=len(tagset))
ner_model.to(device)
optimizer = optim.AdamW(params=ner_model.parameters(), lr=LR)

# prepare batches of data
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE)

# evaluate the performance of the model
def EvaluateModel(model, data_loader, return_concat_results=True):
    model.eval()
    with torch.no_grad():
        Y_actual, Y_preds = [],[]
        for i, batch in enumerate(tqdmn(data_loader)):
            # move the batch tensors to the same device as the model
            batch = { k: v.to(device) for k, v in batch.items() }
            # send 'input_ids', 'attention_mask' and 'labels' to the model
            outputs = model(**batch)
            # iterate through the examples
            for idx, _ in enumerate(batch['labels']):
                # get the true values
                true_values_all = batch['labels'][idx]
                true_values = true_values_all[true_values_all != -100]
                # get the predicted values
                pred_values = torch.argmax(outputs[1], dim=2)[idx]
                pred_values = pred_values[true_values_all != -100]
                # update the lists of true answers and predictions
                Y_actual.append(true_values)
                Y_preds.append(pred_values)
    if return_concat_results:
        Y_actual = torch.cat(Y_actual)
        Y_preds = torch.cat(Y_preds)
        # Return list of actual labels, predicted labels 
        return Y_actual.detach().cpu().numpy(), Y_preds.detach().cpu().numpy()
    else:
        # Return list of actual labels, predicted labels per sentence.
        return [y.detach().cpu().numpy() for y in Y_actual], [y.detach().cpu().numpy() for y in Y_preds]
    

def TrainModel(model, train_loader):
    # train the model
    print('training the model')
    for epoch in tqdmn(range(EPOCHS)):
        model.train()
        print('epoch',epoch+1)
        # iterate through each batch of the train data
        for i, batch in enumerate(tqdmn(train_loader)):
            # move the batch tensors to the same device as the model
            batch = { k: v.to(device) for k, v in batch.items() }
            # send 'input_ids', 'attention_mask' and 'labels' to the model
            outputs = model(**batch)
            loss = outputs[0]
            # set the gradients to zero
            optimizer.zero_grad()
            # propagate the loss backwards
            loss.backward()
            # update the model weights
            optimizer.step()
        # calculate performance on validation set
        Y_actual, Y_preds = EvaluateModel(model,valid_loader)
        print("\nValidation Accuracy : {:.3f}".format(accuracy_score(Y_actual, Y_preds)))
        print("\nValidation Macro-Accuracy : {:.3f}".format(balanced_accuracy_score(Y_actual, Y_preds)))

    print('applying the model to the test set')
    # apply the trained model to the test set
    Y_actual, Y_preds = EvaluateModel(model,test_loader)

    print("\nTest Accuracy : {:.3f}".format(accuracy_score(Y_actual, Y_preds)))
    print("\nTest Macro-Accuracy : {:.3f}".format(balanced_accuracy_score(Y_actual, Y_preds)))
    print("\nClassification Report : ")
    print(classification_report(Y_actual, Y_preds,labels = tagmap(tagmap.get_itos()), target_names = tagmap.get_itos(), zero_division = 0))



loading data
Tagset size: 9


encoding data
initializing the model


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: 

## Answers

### 1.

In [2]:
TrainModel(ner_model, train_loader)

training the model
epoch 1

Validation Accuracy : 0.986

Validation Macro-Accuracy : 0.907
epoch 2

Validation Accuracy : 0.988

Validation Macro-Accuracy : 0.942
epoch 3

Validation Accuracy : 0.989

Validation Macro-Accuracy : 0.939
applying the model to the test set

Test Accuracy : 0.980

Test Macro-Accuracy : 0.908

Classification Report : 
              precision    recall  f1-score   support

           O       1.00      0.99      0.99     38323
       B-LOC       0.92      0.94      0.93      1668
       B-PER       0.97      0.97      0.97      1617
       B-ORG       0.91      0.89      0.90      1661
       I-PER       0.98      0.99      0.99      1156
       I-ORG       0.88      0.89      0.89       835
      B-MISC       0.84      0.82      0.83       702
       I-LOC       0.79      0.93      0.85       257
      I-MISC       0.62      0.75      0.68       216

    accuracy                           0.98     46435
   macro avg       0.88      0.91      0.89     46435
weighted avg       0.98      0.98      0.98     46435



The model's performance on the test set is:

Accuracy: **0.980**

Macro-average accuracy: **0.908**
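
The gap between the two numbers comes from how they are averaged. The "Macro-Accuracy" printed by TrainModel is scikit-learn's balanced_accuracy_score, i.e. the unweighted mean of the per-class recall, so rare tags weigh as much as the dominant O tag. The following pure-Python sketch illustrates the difference. The helper names and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (classes taken from y_true)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy labels: class 0 is frequent (like "O"), class 1 is rare.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

plain = accuracy(y_true, y_pred)        # 5 of 6 tokens are correct
macro = macro_accuracy(y_true, y_pred)  # mean of recall 1.0 and recall 0.5
print(plain, macro)
```

A single error on the rare class barely moves the plain accuracy but halves that class's recall, which is why the macro figure is the more sensitive of the two for tags such as I-MISC.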

### 2.

Furthermore, we implement the find_bad_example() function which, given the actual labels and the predictions for the tokens of each sentence, finds a sentence with at least 10 tokens of which at least 30% are wrongly classified.

In [4]:
import numpy as np
from tabulate import tabulate

def find_bad_example(Y_actual, Y_preds):
    # Searching the test set predictions for a sentence with at least 10 tokens, for which the model predicted at least 30% of the tokens wrong.
    for i, (preds, labels) in enumerate(zip(Y_preds, Y_actual)):
        num_of_misclassifications = sum(np.array(preds) != np.array(labels))
        if len(labels) >= 10 and num_of_misclassifications / len(labels) >= 0.3:
            tokens = test_sentences[i]["tokens"]
            preds = preds[labels != -100]
            predictions = [tagmap.get_itos()[p] for p in preds]
            labels = labels[labels != -100]
            labels = [tagmap.get_itos()[l] for l in labels]
            print(" ".join(tokens))
            print(tabulate([[t, p, l] for t, p, l in zip(tokens, predictions, labels)], headers=["TOKENS", "PREDICTIONS", "LABELS"], tablefmt="grid"))
            #print(tabulate([["TOKENS"] + tokens, ["PREDICTIONS"] + predictions, ["LABELS"] + labels], tablefmt="grid"))
            break

In [4]:
Y_actual, Y_preds = EvaluateModel(ner_model, test_loader, return_concat_results=False)
find_bad_example(Y_actual, Y_preds)

NCAA AMERICAN FOOTBALL-OHIO STATE 'S PACE FIRST REPEAT LOMBARDI AWARD WINNER .
+---------------+---------------+----------+
| TOKENS        | PREDICTIONS   | LABELS   |
| NCAA          | B-MISC        | B-ORG    |
+---------------+---------------+----------+
| AMERICAN      | B-MISC        | O        |
+---------------+---------------+----------+
| FOOTBALL-OHIO | I-MISC        | B-MISC   |
+---------------+---------------+----------+
| STATE         | I-ORG         | I-MISC   |
+---------------+---------------+----------+
| 'S            | O             | O        |
+---------------+---------------+----------+
| PACE          | O             | B-PER    |
+---------------+---------------+----------+
| FIRST         | O             | O        |
+---------------+---------------+----------+
| REPEAT        | O             | O        |
+---------------+---------------+----------+
| LOMBARDI      | B-MISC        | B-MISC   |
+---------------+---------------+----------+
| AWARD         | O  

The model misclassified 6 out of 12 tokens of the sentence "NCAA AMERICAN FOOTBALL-OHIO STATE 'S PACE FIRST REPEAT LOMBARDI AWARD WINNER ."

Wrongly classified tokens: "NCAA", "AMERICAN", "FOOTBALL-OHIO", "STATE", "PACE", "AWARD"

Correctly classified tokens: "'S", "FIRST", "REPEAT", "LOMBARDI", "WINNER", "."

The detailed comparison of predictions and labels appears in the table above.

To complete the second part of Question 2 (i.e. passing a sentence from a newspaper through the model), we create three new functions that can handle data without annotated labels:

- encode_predict: Encodes the input sentence without requiring or returning any labels. It also returns encodings.word_ids(), so that the subtoken predictions can later be mapped to the correct words/tokens.
- Predict: Performs a forward pass of the model to produce the predictions.
- predict_sentence: Encodes the input sentence and generates the predictions using the previous two functions. It then maps the subtoken predictions to the correct tokens and prints a table with the prediction for each token.
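
The mapping from subtoken predictions back to words can be sketched in isolation. The example below is a minimal pure-Python version of that step, assuming the same first-subtoken rule used during training. The function name first_subtoken_predictions and the hand-written word ids are hypothetical.

```python
def first_subtoken_predictions(word_ids, predictions):
    """Keep the prediction of the first subtoken of each word."""
    word_predictions = []
    seen_word_ids = set()
    for word_id, prediction in zip(word_ids, predictions):
        # Skip special tokens (None) and later subtokens of a word already seen.
        if word_id is None or word_id in seen_word_ids:
            continue
        word_predictions.append(prediction)
        seen_word_ids.add(word_id)
    return word_predictions

# Hand-written example: a special token, one word, three subtokens of a
# second word, a third word, and a closing special token.
word_ids = [None, 0, 1, 1, 1, 2, None]
subtoken_predictions = ["O", "O", "B-PER", "I-PER", "I-PER", "O", "O"]

words = first_subtoken_predictions(word_ids, subtoken_predictions)
print(words)
```

Only the first subtoken's prediction is kept for the second word, so the result has exactly one prediction per input word.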

In [1]:
def encode_predict(sentence):
    encodings = tokenizer(sentence['tokens'], truncation=True, padding='max_length', is_split_into_words=True)
    return {'input_ids': torch.LongTensor(encodings.input_ids), 'attention_mask': torch.LongTensor(encodings.attention_mask)}, encodings.word_ids()

def Predict(model, data_loader):
    model.eval()
    with torch.no_grad():
        Y_preds = []
        for i, batch in enumerate(tqdmn(data_loader)):
            # move the batch tensors to the same device as the model
            batch = { k: v.to(device) for k, v in batch.items() }
            # send 'input_ids' and 'attention_mask' to the model (there are no labels at prediction time)
            outputs = model(**batch)
            # iterate through the examples
            for idx, _ in enumerate(batch['input_ids']):
                pred_values = torch.argmax(outputs["logits"], dim=2)[idx]
                pred_values = pred_values[batch["attention_mask"][idx] == 1]
                Y_preds.append(pred_values.detach().cpu().numpy())
    # Return the list of predicted label ids per sentence (padding positions removed).
    return Y_preds

def predict_sentence(model, sentence):
    sentence = {"tokens": sentence.split()}
    enc_sentence, word_ids = encode_predict(sentence)
    pred_dataset = [enc_sentence]
    pred_loader = torch.utils.data.DataLoader(pred_dataset, batch_size=1)
    predictions = Predict(model, pred_loader)[0]

    tokens = sentence["tokens"]
    predictions = [tagmap.get_itos()[p] for p in predictions]
    word_predictions = []
    seen_word_ids = set()
    for i, p in enumerate(predictions):
        word_id = word_ids[i]
        if word_id in seen_word_ids or word_id is None:
            continue
        word_predictions.append(p)
        seen_word_ids.add(word_id)
    print(" ".join(sentence["tokens"]))
    print(tabulate([[t, p] for t, p in zip(tokens, word_predictions)], headers=["TOKENS", "PREDICTIONS"], tablefmt="grid"))


Next, we give the NER model the following sentence, taken from bbc.com, as input:

*Local governor Vyacheslav Gladkov said Russian forces were searching for "saboteurs", who he said had attacked Grayvoronsky district by the border.*

In [6]:
bbc_sentence = """Local governor Vyacheslav Gladkov said Russian forces were searching for "saboteurs", who he said had attacked Grayvoronsky district by the border."""
predict_sentence(ner_model, bbc_sentence)

Local governor Vyacheslav Gladkov said Russian forces were searching for "saboteurs", who he said had attacked Grayvoronsky district by the border.
+--------------+---------------+
| TOKENS       | PREDICTIONS   |
| Local        | O             |
+--------------+---------------+
| governor     | O             |
+--------------+---------------+
| Vyacheslav   | B-PER         |
+--------------+---------------+
| Gladkov      | I-PER         |
+--------------+---------------+
| said         | O             |
+--------------+---------------+
| Russian      | B-MISC        |
+--------------+---------------+
| forces       | O             |
+--------------+---------------+
| were         | O             |
+--------------+---------------+
| searching    | O             |
+--------------+---------------+
| for          | O             |
+--------------+---------------+
| "saboteurs", | O             |
+--------------+---------------+
| who          | O             |
+--------------+-----------

The model correctly classifies most of the tokens. Two possible misclassifications are: 1) forces: "I-MISC" (instead of "O"), 2) district: "I-LOC" (instead of "O").

### 3.

The purpose of the align_label function is to convert the labels of a sentence from their string representation (e.g. "B-ORG") to label ids (e.g. 3), producing a suitable target format for the model. It also marks the special tokens, and every subtoken of a word after the first one, with the special label id -100, so that exactly one label corresponds to each word and the loss and evaluation metrics are calculated correctly. In more detail:

First, tokens.word_ids() is called. This returns, for each token of the sentence, the id of the word that the token originates from. This mapping is needed because the BERT tokenizer can split a word into more than one subtoken, and we need to know which word of the original sentence each subtoken came from. Note that tokens.word_ids() sets the word id of the special tokens (e.g. [CLS]) to None.

The align_label function maps the None word ids to -100, meaning that the target label id for the special tokens of the sentence is -100.

Additionally, for each token that is not a special token, the function assigns the label id by looking up the string label of the token's word (labels[word_idx]) and then using tagmap to map that string label to its label id (tagmap[labels[word_idx]]). However, this only happens for the first subtoken of each word (in case a word was split into more than one subtoken). This is achieved by the elif clause, which checks whether the word id has changed since the previous iteration. For any subtoken that is not the first subtoken of a word, the function again sets the label id to -100.
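
The three cases described so far can be traced on a small hand-written example. The sketch below mirrors the logic of align_label but takes the word ids as a plain list and uses an ordinary dict instead of tagmap. The name align_label_sketch, the word ids and the tag ids are illustrative assumptions.

```python
IGNORE_ID = -100

def align_label_sketch(word_ids, labels, tag_to_id):
    """Assign one label id per word; every other position gets IGNORE_ID."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # special token
            label_ids.append(IGNORE_ID)
        elif word_idx != previous_word_idx:
            # first subtoken of a new word: use the word's label
            label_ids.append(tag_to_id.get(labels[word_idx], IGNORE_ID))
        else:
            # later subtoken of the same word
            label_ids.append(IGNORE_ID)
        previous_word_idx = word_idx
    return label_ids

# Three words; the second word is split into two subtokens.
word_ids = [None, 0, 1, 1, 2, None]
labels = ["B-ORG", "O", "B-PER"]
tag_to_id = {"O": 0, "B-PER": 2, "B-ORG": 3}

aligned = align_label_sketch(word_ids, labels, tag_to_id)
print(aligned)
```

The output has one entry per subtoken, but only three of them (one per word) carry a real label id.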

Finally, according to the Hugging Face documentation, the token classification loss function ignores the special label id -100. Consequently, all the special tokens and the subtokens that are not the first subtoken of a word are ignored during the calculation of the loss, along with the corresponding model predictions. In addition, the labels and predictions at the positions marked with -100 are filtered out in the EvaluateModel function, so that the evaluation metrics are calculated correctly.
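
To make the effect of the -100 label concrete, here is a minimal pure-Python sketch of a mean cross-entropy that skips ignored positions. In PyTorch, -100 is the default ignore_index of the cross-entropy loss. The function masked_cross_entropy is a hypothetical re-implementation for illustration only, and returning 0.0 when every position is ignored is a simplification of this sketch.

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special token or non-first subtoken: no contribution
        # numerically stable log-sum-exp of the row of logits
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count if count else 0.0

logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [9.0, -3.0, 4.0]]
labels = [0, 1, IGNORE_INDEX]

loss_with_ignored = masked_cross_entropy(logits, labels)
loss_without = masked_cross_entropy(logits[:2], labels[:2])
print(loss_with_ignored, loss_without)
```

Both calls give the same value: the third position has no influence on the loss, whatever the model predicts there.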

### 4.

In order to freeze BERT in our model, we iterate through the BERT parameters (freezed_bert_model.bert.parameters()) and set the requires_grad attribute of each parameter to False. We then create a new AdamW optimizer and set its params argument to the list of model parameters that have requires_grad == True. This way, the optimizer only updates the parameters that have requires_grad == True, i.e. the classifier layer's parameters, leaving the BERT parameters as they are (frozen).

In [7]:
freezed_bert_model = BertForTokenClassification.from_pretrained(bert_version, num_labels=len(tagset))
freezed_bert_model.to(device)

for param in freezed_bert_model.bert.parameters():
    param.requires_grad = False

optimizer = optim.AdamW(params=[param for param in freezed_bert_model.parameters() if param.requires_grad == True], lr=LR)

TrainModel(freezed_bert_model, train_loader)


training the model
epoch 1

Validation Accuracy : 0.832

Validation Macro-Accuracy : 0.116
epoch 2

Validation Accuracy : 0.836

Validation Macro-Accuracy : 0.130
epoch 3

Validation Accuracy : 0.856

Validation Macro-Accuracy : 0.198
applying the model to the test set

Test Accuracy : 0.853

Test Macro-Accuracy : 0.210

Classification Report : 
              precision    recall  f1-score   support

           O       0.86      1.00      0.92     38323
       B-LOC       0.77      0.36      0.49      1668
       B-PER       0.85      0.16      0.26      1617
       B-ORG       0.66      0.25      0.36      1661
       I-PER       0.90      0.13      0.22      1156
       I-ORG       0.00      0.00      0.00       835
      B-MISC       0.00      0.00      0.00       702
       I-LOC       0.00      0.00      0.00       257
      I-MISC       0.00      0.00      0.00       216

    accuracy                           0.85     46435
   macro avg       0.45      0.21      0.25     46435
weighted avg       0.81      0.85      0.81     46435



The model's performance on the test set is:

Accuracy: **0.853**

Macro-average accuracy: **0.210**

It is notable that this model performs far worse than the model in the first task. This is probably due to the very small learning rate used for both questions (1e-5): although it is suitable for optimizing the full BERT+classifier model, it slows down the optimization of the classification layer when BERT is frozen. A possible solution would be to increase the learning rate (e.g. to 1e-3) and/or increase the number of epochs.

Indeed, if we change the learning rate from 1e-5 to 1e-3, the model performs far better:

Accuracy: **0.965**

Macro-average accuracy: **0.813**
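
One way to build intuition for this effect is a toy experiment, which is not the author's experiment and says nothing about BERT itself. Adam-style optimizers move each parameter by roughly the learning rate per step, so a freshly initialized layer can only travel a limited distance in a fixed number of steps. The sketch below minimizes a one-dimensional quadratic with plain Adam (no weight decay, unlike AdamW) for the same number of steps as three epochs of 1756 batches. The function adam_minimize and the quadratic objective are assumptions made for illustration.

```python
import math

def adam_minimize(lr, steps, target=1.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize 0.5 * (w - target)**2 with plain Adam, starting from w = 0."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = w - target
        # exponential moving averages of the gradient and its square
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        # bias correction for the zero-initialized averages
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

STEPS = 1756 * 3  # batches per epoch times epochs, as in the training log

w_small = adam_minimize(lr=1e-5, steps=STEPS)
w_large = adam_minimize(lr=1e-3, steps=STEPS)
print(f"lr=1e-5: distance to optimum = {abs(w_small - 1.0):.3f}")
print(f"lr=1e-3: distance to optimum = {abs(w_large - 1.0):.3f}")
```

With the smaller learning rate the parameter covers only a small fraction of the distance to the optimum in the available steps, which is consistent with the explanation given above for the frozen model.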

In [5]:
LR = 1e-3
freezed_bert_model = BertForTokenClassification.from_pretrained(bert_version, num_labels=len(tagset))
freezed_bert_model.to(device)

for param in freezed_bert_model.bert.parameters():
    param.requires_grad = False

optimizer = optim.AdamW(params=[param for param in freezed_bert_model.parameters() if param.requires_grad == True], lr=LR)

TrainModel(freezed_bert_model, train_loader)

# Resetting the learning rate to 1e-5 for the rest of the experiments.
LR = 1e-5


training the model
epoch 1

Validation Accuracy : 0.969

Validation Macro-Accuracy : 0.776
epoch 2

Validation Accuracy : 0.971

Validation Macro-Accuracy : 0.811
epoch 3

Validation Accuracy : 0.971

Validation Macro-Accuracy : 0.808
applying the model to the test set

Test Accuracy : 0.965

Test Macro-Accuracy : 0.813

Classification Report : 
              precision    recall  f1-score   support

           O       0.99      0.99      0.99     38323
       B-LOC       0.83      0.93      0.88      1668
       B-PER       0.93      0.93      0.93      1617
       B-ORG       0.85      0.77      0.81      1661
       I-PER       0.94      0.96      0.95      1156
       I-ORG       0.83      0.63      0.72       835
      B-MISC       0.77      0.71      0.74       702
       I-LOC       0.74      0.74      0.74       257
      I-MISC       0.61      0.66      0.63       216

    accuracy                           0.96     46435
   macro avg       0.83      0.81      0.82     46435
weighted avg       0.96      0.96      0.96     46435



In [6]:
# Count model parameters
def count_parameters(model, freezed = False):
    return sum(p.numel() for p in model.parameters() if p.requires_grad != freezed)

print(f"Number of freezed (bert) parameters: {count_parameters(freezed_bert_model, freezed = True)}")
print(f"Number of trainable (classification layer) parameters: {count_parameters(freezed_bert_model, freezed = False)}")

Number of freezed (bert) parameters: 108891648
Number of trainable (classification layer) parameters: 6921


The number of frozen (BERT) parameters is: 108891648

The number of trainable (classification layer) parameters is: 6921
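
The trainable count can be checked by hand, assuming the classification head is a single linear layer that maps BERT's hidden size (768 for bert-base) to the 9 NER tags. The helper name linear_layer_parameters is hypothetical.

```python
HIDDEN_SIZE = 768  # hidden size of bert-base models
NUM_LABELS = 9     # NER tagset size printed by build_tags

def linear_layer_parameters(in_features, out_features):
    """Number of weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

classifier_parameters = linear_layer_parameters(HIDDEN_SIZE, NUM_LABELS)
print(classifier_parameters)  # 768 * 9 + 9 = 6921
```

This matches the number reported by count_parameters for the trainable parameters.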

### 5.

In order to train the model on the concatenation of the training and validation sets, we first concatenate train_dataset and valid_dataset. Both datasets are lists in which each element represents one training/validation example, so a simple list concatenation suffices: train_valid_dataset = train_dataset + valid_dataset

Furthermore, we create a new dataloader (train_valid_loader) over train_valid_dataset.

Lastly, we train the model by passing train_valid_loader as the train_loader argument of the TrainModel function.

It is important to note that the validation metrics after each epoch are still calculated on the validation set, which is now also part of the training data, and thus these metrics should be disregarded. Since the stopping criterion is fixed at 3 epochs, this does not cause a problem, because the validation metrics do not affect training in any way. It would, however, be a problem if we used early stopping, where the number of training epochs depends on the validation metrics.

In [9]:
ner_model_q5 = BertForTokenClassification.from_pretrained(bert_version, num_labels=len(tagset))
ner_model_q5.to(device)
optimizer = optim.AdamW(params=ner_model_q5.parameters(), lr=LR)

train_valid_dataset = train_dataset + valid_dataset
train_valid_loader = torch.utils.data.DataLoader(train_valid_dataset, batch_size=BATCH_SIZE, shuffle=True)
TrainModel(ner_model_q5, train_loader=train_valid_loader)


training the model
epoch 1

Validation Accuracy : 0.993

Validation Macro-Accuracy : 0.962
epoch 2

Validation Accuracy : 0.997

Validation Macro-Accuracy : 0.978
epoch 3

Validation Accuracy : 0.998

Validation Macro-Accuracy : 0.990
applying the model to the test set

Test Accuracy : 0.979

Test Macro-Accuracy : 0.911

Classification Report : 
              precision    recall  f1-score   support

           O       1.00      0.99      0.99     38323
       B-LOC       0.91      0.94      0.92      1668
       B-PER       0.99      0.95      0.97      1617
       B-ORG       0.88      0.90      0.89      1661
       I-PER       0.99      0.99      0.99      1156
       I-ORG       0.89      0.89      0.89       835
      B-MISC       0.80      0.84      0.82       702
       I-LOC       0.80      0.93      0.86       257
      I-MISC       0.61      0.78      0.68       216

    accuracy                           0.98     46435
   macro avg       0.87      0.91      0.89     46435
weighted avg       0.98      0.98      0.98     46435



The model's performance on the test set is:

Accuracy: **0.979**

Macro-average accuracy: **0.911**

Compared to the model of the first question, which was trained only on the original training set, the test accuracy is slightly lower, 0.979 (vs. 0.980), while the test macro-accuracy is slightly higher, 0.911 (vs. 0.908).
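To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. It assumes that the macro-accuracy printed above is the mean of the per-class recalls (this is what `balanced_accuracy_score`, imported in the first cell, computes, and it agrees with the macro-average recall of 0.91 in the classification report). The labels below are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every tag counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy, made-up labels: 8 "O" tokens and 2 "I-MISC" tokens.
y_true = ["O"] * 8 + ["I-MISC"] * 2
y_pred = ["O"] * 8 + ["I-MISC", "O"]

print(accuracy(y_true, y_pred))        # 0.9
print(macro_accuracy(y_true, y_pred))  # 0.75
```

A single error on a rare tag barely moves the plain accuracy but lowers the macro-accuracy noticeably, which is why the two numbers can move in opposite directions between experiments.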

### 6.

To train the model for the POS tagging task, we first need to create new tagmap and tagset variables from the pos_tags labels. We do this by calling the build_tags function defined in the first cell, which takes as input the type of tags for which to build the tagmap and the tagset (here, pos_tags).

We then create new train, valid and test datasets. Again, we achieve this by using the modified encode method that takes the tag type (i.e. pos_tags) as a parameter and constructs the labels tensor based on that.

Next, we create a new model with num_labels=len(tagset), where the tagset has now changed to contain the pos tagging labels. 

Lastly, we create data loaders on the new pos tagging datasets and we train the model using the TrainModel method.
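The first two steps above can be pictured with a minimal sketch. It is not the notebook's actual `build_tags` or `encode` (those take only the tag type and read the sentences from the notebook state). The names `build_tags_sketch` and `encode_labels_sketch`, the dictionary layout of a sentence, and the chunk tags in the toy example are all assumptions made for illustration.

```python
def build_tags_sketch(sentences, tags_type):
    """Collect the distinct tags of one annotation layer and number them."""
    tagset = []
    for sentence in sentences:
        for tag in sentence[tags_type]:
            if tag not in tagset:
                tagset.append(tag)
    # Map every tag to its index in the tagset.
    tagmap = {tag: idx for idx, tag in enumerate(tagset)}
    return tagmap, tagset

def encode_labels_sketch(sentence, tags_type, tagmap):
    """Turn the chosen tag layer of one sentence into integer label ids."""
    return [tagmap[tag] for tag in sentence[tags_type]]

# Hypothetical sentence layout; the chunk tags here are made up.
toy_sentences = [
    {
        "tokens": ["JAPAN", "WIN", "."],
        "pos_tags": ["NNP", "NNP", "."],
        "chunk_tags": ["B-NP", "I-NP", "O"],
    }
]

tagmap, tagset = build_tags_sketch(toy_sentences, "pos_tags")
print(tagset)  # ['NNP', '.']
print(encode_labels_sketch(toy_sentences[0], "pos_tags", tagmap))  # [0, 0, 1]
```

Because the tag layer is selected by a string, the same two helpers serve NER, POS tagging and chunking; only `len(tagset)`, and therefore `num_labels`, changes between tasks.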

In [2]:
tagmap, tagset = build_tags("pos_tags")

# initialize the model including a classification layer with num_labels classes
print('initializing the model')
pos_tagging_model = BertForTokenClassification.from_pretrained(bert_version, num_labels=len(tagset))
pos_tagging_model.to(device)
optimizer = optim.AdamW(params=pos_tagging_model.parameters(), lr=LR)

print('encoding data')
train_dataset = [encode(sentence, "pos_tags") for sentence in train_sentences]
valid_dataset = [encode(sentence, "pos_tags") for sentence in valid_sentences]
test_dataset = [encode(sentence, "pos_tags") for sentence in test_sentences]

# prepare batches of data
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE)

TrainModel(pos_tagging_model, train_loader=train_loader)



Tagset size: 45
initializing the model


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: 

encoding data
training the model


  0%|          | 0/3 [00:00<?, ?it/s]

epoch 1


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.935

Validation Macro-Accuracy : 0.751
epoch 2


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.942

Validation Macro-Accuracy : 0.798
epoch 3


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.946

Validation Macro-Accuracy : 0.838
applying the model to the test set


  0%|          | 0/432 [00:00<?, ?it/s]


Test Accuracy : 0.941

Test Macro-Accuracy : 0.853

Classification Report : 
              precision    recall  f1-score   support

         NNP       0.90      0.93      0.92      8595
          NN       0.90      0.89      0.89      4931
          CD       0.97      0.99      0.98      5962
          IN       0.99      0.98      0.99      4018
          DT       1.00      0.99      0.99      2799
          JJ       0.86      0.82      0.84      2393
         NNS       0.93      0.93      0.93      2174
         VBD       0.95      0.94      0.95      1699
           .       1.00      1.00      1.00      1630
           ,       1.00      1.00      1.00      1637
          VB       0.94      0.89      0.92       933
         VBN       0.86      0.89      0.87       866
          RB       0.87      0.86      0.87       888
          CC       1.00      0.99      1.00       765
          TO       1.00      1.00      1.00       818
         PRP       1.00      0.96      0.98       605
   

The model's performance on the test set is:

Accuracy: **0.941**

Macro-average accuracy: **0.853**

In [5]:
Y_actual, Y_preds = EvaluateModel(pos_tagging_model, test_loader, return_concat_results=False)
find_bad_example(Y_actual, Y_preds)

  0%|          | 0/432 [00:00<?, ?it/s]

SOCCER - LATE GOALS GIVE JAPAN WIN OVER SYRIA .
+----------+---------------+----------+
| TOKENS   | PREDICTIONS   | LABELS   |
| SOCCER   | NN            | NN       |
+----------+---------------+----------+
| -        | :             | :        |
+----------+---------------+----------+
| LATE     | NNP           | JJ       |
+----------+---------------+----------+
| GOALS    | NNP           | NNS      |
+----------+---------------+----------+
| GIVE     | NNP           | VBP      |
+----------+---------------+----------+
| JAPAN    | NNP           | NNP      |
+----------+---------------+----------+
| WIN      | NNP           | NNP      |
+----------+---------------+----------+
| OVER     | IN            | IN       |
+----------+---------------+----------+
| SYRIA    | NNP           | NNP      |
+----------+---------------+----------+
| .        | .             | .        |
+----------+---------------+----------+


The model misclassified 3 out of 10 tokens of the sentence "SOCCER - LATE GOALS GIVE JAPAN WIN OVER SYRIA ."

Wrongly classified tokens: "LATE", "GOALS", "GIVE"

Correctly classified tokens: "SOCCER", "-", "JAPAN", "WIN", "OVER", "SYRIA", "."

The detailed comparison of predictions vs. labels appears in the table above.
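The counts above can be reproduced from the table with a few lines of plain Python. This is only an illustrative sketch, not the notebook's `find_bad_example`; the helper name `misclassified` is hypothetical, and the three lists are copied from the table.

```python
# Tokens, predictions and gold labels copied from the table above.
tokens = ["SOCCER", "-", "LATE", "GOALS", "GIVE", "JAPAN", "WIN", "OVER", "SYRIA", "."]
preds = ["NN", ":", "NNP", "NNP", "NNP", "NNP", "NNP", "IN", "NNP", "."]
labels = ["NN", ":", "JJ", "NNS", "VBP", "NNP", "NNP", "IN", "NNP", "."]

def misclassified(tokens, preds, labels):
    """Return (token, predicted, gold) for every position where they differ."""
    return [(tok, p, g) for tok, p, g in zip(tokens, preds, labels) if p != g]

errors = misclassified(tokens, preds, labels)
print(f"{len(errors)} of {len(tokens)} tokens misclassified")
for tok, p, g in errors:
    print(f"{tok}: predicted {p}, expected {g}")
```

All three errors are predicted as NNP, which is consistent with the all-caps headline style making ordinary words look like proper nouns.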

Next, we give the following sentence, taken from bbc.com, as input to the POS tagging model:

*Local governor Vyacheslav Gladkov said Russian forces were searching for "saboteurs", who he said had attacked Grayvoronsky district by the border.*


In [8]:
bbc_sentence = """Local governor Vyacheslav Gladkov said Russian forces were searching for "saboteurs", who he said had attacked Grayvoronsky district by the border."""
predict_sentence(pos_tagging_model, bbc_sentence)

  0%|          | 0/1 [00:00<?, ?it/s]

Local governor Vyacheslav Gladkov said Russian forces were searching for "saboteurs", who he said had attacked Grayvoronsky district by the border.
+--------------+---------------+
| TOKENS       | PREDICTIONS   |
| Local        | JJ            |
+--------------+---------------+
| governor     | NN            |
+--------------+---------------+
| Vyacheslav   | NNP           |
+--------------+---------------+
| Gladkov      | NNP           |
+--------------+---------------+
| said         | VBD           |
+--------------+---------------+
| Russian      | JJ            |
+--------------+---------------+
| forces       | NNS           |
+--------------+---------------+
| were         | VBD           |
+--------------+---------------+
| searching    | VBG           |
+--------------+---------------+
| for          | IN            |
+--------------+---------------+
| "saboteurs", | "             |
+--------------+---------------+
| who          | WP            |
+--------------+-----------

The model classifies all the tokens correctly, except for the token "saboteurs", which is classified as " (quote) instead of NNS (noun, plural).

It is important to note that although BERT's tokenizer is able to split difficult tokens like '"saboteurs",' and still allow correct predictions, in this example the model's prediction appears to be driven mostly by the first character '"'. A likely explanation is that the model was trained on cleaner data, in which punctuation marks are separate tokens. Splitting '"saboteurs",' into 4 separate tokens ['"', 'saboteurs', '"', ','] before prediction would probably yield better predictions, although the sentence was picked arbitrarily for educational purposes. The same applies to the token 'border.'.
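One way to obtain the finer tokenization suggested above is a small regular-expression pre-tokenizer. This is a minimal sketch, not the tokenization used to build the dataset: the helper name `split_punctuation` is hypothetical, and the simple pattern would also split abbreviations, decimals and hyphenated words, which a real tokenizer would handle more carefully.

```python
import re

def split_punctuation(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(split_punctuation('"saboteurs",'))  # ['"', 'saboteurs', '"', ',']
print(split_punctuation("border."))       # ['border', '.']

# Re-join with spaces so that a whitespace-based splitter sees one token each.
bbc_fragment = 'searching for "saboteurs", who he said'
print(" ".join(split_punctuation(bbc_fragment)))
# searching for " saboteurs " , who he said
```

Assuming `predict_sentence` splits its input on whitespace, as the table above suggests, the re-joined string could be passed to it unchanged.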

### 7.

To train the model for the text chunking task, we first need to create new tagmap and tagset variables from the chunk_tags labels. We do this by calling the build_tags function defined in the first cell, which takes as input the type of tags for which to build the tagmap and the tagset (here, chunk_tags).

Next, we create new train, valid and test datasets. Again, we achieve this by using the modified encode method that takes the tag type (i.e. chunk_tags) as a parameter and constructs the labels tensor based on that.

Then, we create a new model with num_labels=len(tagset), where the tagset has now changed to contain the chunking labels. 

Lastly, we create data loaders on the new chunking datasets and we train the model using the TrainModel method.

In [9]:
tagmap, tagset = build_tags("chunk_tags")

# initialize the model including a classification layer with num_labels classes
print('initializing the model')
chunking_model = BertForTokenClassification.from_pretrained(bert_version, num_labels=len(tagset))
chunking_model.to(device)
optimizer = optim.AdamW(params=chunking_model.parameters(), lr=LR)

print('encoding data')
train_dataset = [encode(sentence, "chunk_tags") for sentence in train_sentences]
valid_dataset = [encode(sentence, "chunk_tags") for sentence in valid_sentences]
test_dataset = [encode(sentence, "chunk_tags") for sentence in test_sentences]

# prepare batches of data
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE)

TrainModel(chunking_model, train_loader=train_loader)

Tagset size: 20
initializing the model


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: 

encoding data
training the model


  0%|          | 0/3 [00:00<?, ?it/s]

epoch 1


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.948

Validation Macro-Accuracy : 0.514
epoch 2


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.954

Validation Macro-Accuracy : 0.582
epoch 3


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.955

Validation Macro-Accuracy : 0.634
applying the model to the test set


  0%|          | 0/432 [00:00<?, ?it/s]


Test Accuracy : 0.951

Test Macro-Accuracy : 0.586

Classification Report : 
              precision    recall  f1-score   support

        I-NP       0.95      0.97      0.96     16177
        B-NP       0.96      0.95      0.95     12985
           O       0.98      0.98      0.98      6210
        B-PP       0.97      0.98      0.97      3979
        B-VP       0.93      0.93      0.93      3767
        I-VP       0.94      0.93      0.94      1913
      B-ADVP       0.75      0.77      0.76       559
      B-SBAR       0.91      0.82      0.86       296
      B-ADJP       0.80      0.54      0.65       276
       B-PRT       0.78      0.64      0.70       110
      I-ADJP       0.64      0.55      0.59        55
      I-ADVP       0.67      0.30      0.42        33
        I-PP       0.58      0.47      0.52        15
      B-INTJ       0.00      0.00      0.00        13
     I-CONJP       0.20      0.14      0.17         7
       B-LST       1.00      0.21      0.34        29
   

The model's performance on the test set is:

Accuracy: **0.951**

Macro-average accuracy: **0.586**

In [10]:
Y_actual, Y_preds = EvaluateModel(chunking_model, test_loader, return_concat_results=False)
find_bad_example(Y_actual, Y_preds)

  0%|          | 0/432 [00:00<?, ?it/s]

Aly Ashour 7 , 56 penalty , Mohamed Ouda 24 , 73
+----------+---------------+----------+
| TOKENS   | PREDICTIONS   | LABELS   |
| Aly      | B-NP          | B-ADVP   |
+----------+---------------+----------+
| Ashour   | I-NP          | B-NP     |
+----------+---------------+----------+
| 7        | I-NP          | I-NP     |
+----------+---------------+----------+
| ,        | I-NP          | O        |
+----------+---------------+----------+
| 56       | I-NP          | B-NP     |
+----------+---------------+----------+
| penalty  | I-NP          | I-NP     |
+----------+---------------+----------+
| ,        | O             | O        |
+----------+---------------+----------+
| Mohamed  | B-NP          | B-NP     |
+----------+---------------+----------+
| Ouda     | I-NP          | I-NP     |
+----------+---------------+----------+
| 24       | I-NP          | I-NP     |
+----------+---------------+----------+
| ,        | I-NP          | I-NP     |
+----------+---------------+---

The model misclassified 4 out of 12 tokens of the sentence "Aly Ashour 7 , 56 penalty , Mohamed Ouda 24 , 73"

Wrongly classified tokens: "Aly", "Ashour", ",", "56"

Correctly classified tokens: "7", "penalty", ",", "Mohamed", "Ouda", "24", ",", "73"

The detailed comparison of predictions vs. labels appears in the table above.

Next, we give the following sentence, taken from bbc.com, as input to the text chunking model:

*Local governor Vyacheslav Gladkov said Russian forces were searching for "saboteurs", who he said had attacked Grayvoronsky district by the border.*

In [11]:
bbc_sentence = """Local governor Vyacheslav Gladkov said Russian forces were searching for "saboteurs", who he said had attacked Grayvoronsky district by the border."""
predict_sentence(chunking_model, bbc_sentence)

  0%|          | 0/1 [00:00<?, ?it/s]

Local governor Vyacheslav Gladkov said Russian forces were searching for "saboteurs", who he said had attacked Grayvoronsky district by the border.
+--------------+---------------+
| TOKENS       | PREDICTIONS   |
| Local        | B-NP          |
+--------------+---------------+
| governor     | I-NP          |
+--------------+---------------+
| Vyacheslav   | I-NP          |
+--------------+---------------+
| Gladkov      | I-NP          |
+--------------+---------------+
| said         | B-VP          |
+--------------+---------------+
| Russian      | B-NP          |
+--------------+---------------+
| forces       | I-NP          |
+--------------+---------------+
| were         | B-VP          |
+--------------+---------------+
| searching    | I-VP          |
+--------------+---------------+
| for          | B-PP          |
+--------------+---------------+
| "saboteurs", | O             |
+--------------+---------------+
| who          | B-NP          |
+--------------+-----------

The model classifies all the tokens correctly, except for the token "saboteurs", which is classified as "O" instead of B-NP. As in the POS tagging case, the model would probably classify this token better if the punctuation were split off beforehand.
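To read the B-/I-/O tags as actual phrases, the tagged tokens can be grouped into chunks. The sketch below is an illustration only (the helper `bio_to_chunks` is hypothetical and not part of the notebook); the tokens and tags are the visible rows of the prediction table above.

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (chunk_type, phrase) pairs."""
    chunks = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, chunk_type = tag.partition("-")
        # A chunk starts at every B- tag, or at an I- tag of a different type.
        starts_new = prefix == "B" or (prefix == "I" and chunk_type != current_type)
        if tag == "O" or starts_new:
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            if not current_tokens:
                current_type = chunk_type
            current_tokens.append(token)
    if current_tokens:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks

# Visible rows of the prediction table above.
tokens = ["Local", "governor", "Vyacheslav", "Gladkov", "said", "Russian",
          "forces", "were", "searching", "for", '"saboteurs",', "who"]
tags = ["B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "B-NP",
        "I-NP", "B-VP", "I-VP", "B-PP", "O", "B-NP"]

for chunk_type, phrase in bio_to_chunks(tokens, tags):
    print(chunk_type, "|", phrase)
```

With these predictions the token '"saboteurs",' falls outside every chunk, which is the error described above.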

### 8.

To change the model from BERT to roberta-base, we first have to import the model class (RobertaForTokenClassification) and its tokenizer (RobertaTokenizerFast). Next, we initialize the tokenizer and the model, setting the RoBERTa version to "roberta-base", as follows:

roberta_version = 'roberta-base'

tokenizer = RobertaTokenizerFast.from_pretrained(roberta_version, add_prefix_space=True)

model = RobertaForTokenClassification.from_pretrained(roberta_version, num_labels=len(tagset))

The rest of the code remains the same.


In [12]:
from transformers import RobertaForTokenClassification, RobertaTokenizerFast

tagmap, tagset = build_tags("ner_tags")

# load roberta tokenizer
roberta_version = 'roberta-base'
tokenizer = RobertaTokenizerFast.from_pretrained(roberta_version, add_prefix_space=True)

# initialize the model including a classification layer with num_labels classes
print('initializing the model')
model = RobertaForTokenClassification.from_pretrained(roberta_version, num_labels=len(tagset))
model.to(device)
optimizer = optim.AdamW(params=model.parameters(), lr=LR)

print('encoding data')
train_dataset = [encode(sentence, "ner_tags") for sentence in train_sentences]
valid_dataset = [encode(sentence, "ner_tags") for sentence in valid_sentences]
test_dataset = [encode(sentence, "ner_tags") for sentence in test_sentences]

# prepare batches of data
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE)

TrainModel(model, train_loader)

Tagset size: 9


Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

initializing the model


Downloading model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForTokenClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions

encoding data
training the model


  0%|          | 0/3 [00:00<?, ?it/s]

epoch 1


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.990

Validation Macro-Accuracy : 0.946
epoch 2


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.992

Validation Macro-Accuracy : 0.957
epoch 3


  0%|          | 0/1756 [00:00<?, ?it/s]

  0%|          | 0/407 [00:00<?, ?it/s]


Validation Accuracy : 0.992

Validation Macro-Accuracy : 0.959
applying the model to the test set


  0%|          | 0/432 [00:00<?, ?it/s]


Test Accuracy : 0.983

Test Macro-Accuracy : 0.926

Classification Report : 
              precision    recall  f1-score   support

           O       1.00      0.99      1.00     38323
       B-LOC       0.95      0.93      0.94      1668
       B-PER       0.98      0.96      0.97      1617
       B-ORG       0.89      0.93      0.91      1661
       I-PER       0.99      0.99      0.99      1156
       I-ORG       0.87      0.95      0.91       835
      B-MISC       0.79      0.86      0.82       702
       I-LOC       0.88      0.92      0.90       257
      I-MISC       0.61      0.79      0.69       216

    accuracy                           0.98     46435
   macro avg       0.88      0.93      0.90     46435
weighted avg       0.98      0.98      0.98     46435



The model's performance on the test set is:

Accuracy: **0.983**

Macro-average accuracy: **0.926**


Compared to the bert-base-uncased model of the first question, the roberta-base model performs better both in accuracy, 0.983 (vs. 0.980), and in macro-accuracy, 0.926 (vs. 0.908).