Question 1: Finetuning BERT for Named Entity Recognition
For this question, you'll be asked to finetune BERT for the task of Named Entity Recognition(NER) and to report out performance results. You have the following resources:

Class_7_code_supplement_BERT_based_finetuning_for_sentiment.ipynb This notebook contains basic code for finetuning BERT for sentiment analysis but you'll need to modify this code or replace it entirely for this task. bert-base-uncased should perform well on NER. Feel free to draw on examples that you find online for this part of the task.
Data: The CONLL-2003 dataset has been made available in the same folder as Assignment 2 in the file conll_2003_ner.zip Use this data to train and test your finetuned BERT
A description of the data format from the original paper. Feel free to draw on any other info that find online.
For full credit please submit a notebook assignment_2_question_1_{your_name}.ipynb with {your_name} replaced with your name. The notebook should contain the following:

All of code that you used to train and test your finetuned-BERT
You'll be evaluating model performance on the test sets eng.testa and eng.testb found in the data folder conll_2003_ner.zip. Use the span-based F1 evaluation metric on each test set and report out the scores.
An answer to the question "How does the performance of your system compare to the state-of-the-art for this dataset?"

In [None]:
pip install torch transformers datasets seqeval


Collecting datasets
  Downloading datasets-3.0.2-py3-none-any.whl.metadata (20 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.6/43.6 kB[0m [31m856.9 kB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.2-py3-none-any.whl (472 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m472.7/472.7 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [3

**Imported Necessary Libraries**

In [None]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForTokenClassification
from transformers import AdamW
from sklearn.metrics import classification_report
import numpy as np


**Model and Dataset**

In [None]:
train_file_path = '/content/eng.testa'
test_a_file_path = '/content/eng.testb'
test_b_file_path = '/content/eng.train'

# Function to load data
def load_data(file_path):
    sentences, labels = [], []
    with open(file_path, 'r') as f:
        sentence, label = [], []
        for line in f:
            if line.strip():
                word, tag = line.split()[:2]
                sentence.append(word)
                label.append(tag)
            else:
                if sentence:
                    sentences.append(sentence)
                    labels.append(label)
                    sentence, label = [], []
        if sentence:
            sentences.append(sentence)
            labels.append(label)
    return sentences, labels

train_sentences, train_labels = load_data(train_file_path)
test_a_sentences, test_a_labels = load_data(test_a_file_path)
test_b_sentences, test_b_labels = load_data(test_b_file_path)

In [None]:
class NERDataset(Dataset):
    def __init__(self, sentences, labels, tokenizer, max_len=128):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.label_map = {label: idx for idx, label in enumerate(set(tag for label in labels for tag in label))}

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        sentence = self.sentences[idx]
        label = self.labels[idx]

        encoding = self.tokenizer(
            sentence,
            is_split_into_words=True,
            padding='max_length',
            truncation=True,
            max_length=self.max_len,
            return_tensors='pt'
        )

        label_ids = [-100] * self.max_len
        for i, tag in enumerate(label):
            if i < self.max_len:
                label_ids[i] = self.label_map[tag]

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label_ids, dtype=torch.long)
        }

# tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Datasets
train_dataset = NERDataset(train_sentences, train_labels, tokenizer)
test_a_dataset = NERDataset(test_a_sentences, test_a_labels, tokenizer)
test_b_dataset = NERDataset(test_b_sentences, test_b_labels, tokenizer)

# Dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_a_dataloader = DataLoader(test_a_dataset, batch_size=16, shuffle=False)
test_b_dataloader = DataLoader(test_b_dataset, batch_size=16, shuffle=False)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]



In [None]:
model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=len(train_dataset.label_map))
model = model.to('cuda' if torch.cuda.is_available() else 'cpu')


model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
optimizer = AdamW(model.parameters(), lr=5e-5)

for epoch in range(1):
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to('cuda' if torch.cuda.is_available() else 'cpu')
        attention_mask = batch['attention_mask'].to('cuda' if torch.cuda.is_available() else 'cpu')
        labels = batch['labels'].to('cuda' if torch.cuda.is_available() else 'cpu')

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch + 1}, Loss: {loss.item()}')




Epoch 1, Loss: 1.7145782709121704


**Make Prediction**

In [None]:
def evaluate(dataloader):
    model.eval()
    predictions, true_labels = [], []

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to('cuda' if torch.cuda.is_available() else 'cpu')
            attention_mask = batch['attention_mask'].to('cuda' if torch.cuda.is_available() else 'cpu')
            labels = batch['labels'].to('cuda' if torch.cuda.is_available() else 'cpu')

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            predictions.append(torch.argmax(logits, dim=2).cpu().numpy())
            true_labels.append(labels.cpu().numpy())

    return predictions, true_labels

# Test A
test_a_predictions, test_a_true_labels = evaluate(test_a_dataloader)
# Test B
test_b_predictions, test_b_true_labels = evaluate(test_b_dataloader)


In [None]:
def flatten_labels(predictions, true_labels):
    flat_pred = []
    flat_true = []
    for pred, true in zip(predictions, true_labels):
        for p, t in zip(pred, true):
            if isinstance(t, np.ndarray):
                t = t.tolist()
            if -100 not in t:
                for label in t:
                    if label != -100:
                        flat_true.append(label)
                        flat_pred.append(p)
            else:

                if t != -100:
                    flat_true.append(t)
                    flat_pred.append(p)
    return flat_pred, flat_true


In [None]:
print("Flat Test A True Labels:", flat_test_a_true)
print("Flat Test A Predicted Labels:", flat_test_a_pred)

print("Length of Flat Test A True Labels:", len(flat_test_a_true))
print("Length of Flat Test A Predicted Labels:", len(flat_test_a_pred))


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
       43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
       43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
       43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
       43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
       43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
       43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43,
       43, 43, 43, 43, 43, 43, 43, 43, 43]), array([ 8, 14, 36, 36, 36,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
        2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, 14,  2,  2,  2,  2,
        2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
        2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
        2,  2,  2,  2,  2,  2,  2,  2,  2, 36, 14, 14, 36, 14, 36, 36, 36,
        2,  2,  2,  2,  2,  2,  2,  2,  2,  2, 36,  2,  2,  2,  2

In [None]:
# Checking original lengths before flattening
print("Length of original test true labels:", len(test_a_true_labels))
print("Length of original test predicted labels:", len(test_a_predictions))


Length of original test true labels: 231
Length of original test predicted labels: 231


**Generate Classification Report**

In [None]:
def flatten_labels(predictions, true_labels):
    flat_pred = []
    flat_true = []

    for pred, true in zip(predictions, true_labels):
        for p, t in zip(pred, true):

            if isinstance(p, (np.ndarray, torch.Tensor)):
                p = p.item() if p.size == 1 else p.tolist()

            if isinstance(t, (np.ndarray, torch.Tensor)):
                t = t.item() if t.size == 1 else t.tolist()

            if isinstance(t, list):
                flat_true.extend(t)
            else:
                flat_true.append(t)

            if isinstance(p, list):
                flat_pred.extend(p)
            else:
                flat_pred.append(p)

    return flat_pred, flat_true

# flattened lists for Test A
flat_test_a_pred, flat_test_a_true = flatten_labels(test_a_predictions, test_a_true_labels)

# Unique values in both lists
print("Unique True Labels:", set(flat_test_a_true))
print("Unique Predicted Labels:", set(flat_test_a_pred))

# lengths
print("Flattened Test A True Labels Length:", len(flat_test_a_true))
print("Flattened Test A Predicted Labels Length:", len(flat_test_a_pred))

# Classification report
if len(flat_test_a_true) == len(flat_test_a_pred):
    print("Test A Classification Report:")
    print(classification_report(flat_test_a_true, flat_test_a_pred))
else:
    print("Warning: The lengths of true and predicted labels do not match.")
    print("True Labels Length:", len(flat_test_a_true))
    print("Predicted Labels Length:", len(flat_test_a_pred))


Unique True Labels: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, -100, 28, 29, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 30}
Unique Predicted Labels: {0, 2, 3, 4, 5, 8, 9, 10, 14, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28, 31, 32, 36, 39, 42, 43, 45}
Flattened Test A True Labels Length: 471552
Flattened Test A Predicted Labels Length: 471552
Test A Classification Report:


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


              precision    recall  f1-score   support

        -100       0.00      0.00      0.00    424886
           0       0.11      0.63      0.19      1699
           1       0.00      0.00      0.00        23
           2       0.02      0.66      0.04      1630
           3       0.12      0.49      0.19      4018
           4       0.06      0.36      0.11      2393
           5       0.08      0.37      0.14       933
           6       0.00      0.00      0.00        14
           7       0.00      0.00      0.00        34
           8       0.00      0.10      0.00       106
           9       0.00      0.00      0.00      4931
          10       0.01      0.06      0.01       866
          11       0.00      0.00      0.00      1637
          12       0.00      0.00      0.00       108
          13       0.00      0.00      0.00        92
          14       0.04      0.64      0.07       599
          15       0.00      0.00      0.00       160
          16       0.00    

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


**Conclusion**

Comparison of NER System Performance to State-of-the-Art
My NER system, fine-tuned on the CONLL-2003 dataset using the BERT architecture, yielded results that highlight substantial challenges in achieving effective classification across different entity categories. The evaluation was conducted on the eng.testa and eng.testb test sets, and the key metrics derived from the classification report include:

Overall Accuracy: 0.03,
Macro Average F1 Score: 0.05,
Weighted Average F1 Score: 0.01,
The detailed results reveal that the system struggles particularly with precision and recall for most classes. For instance, many labels exhibited either zero precision or zero recall, indicating that the model failed to predict any instances of those categories correctly. The warning messages during evaluation also underscore that some classes were not predicted at all, leading to undefined metrics.

In contrast, state-of-the-art NER systems on the CONLL-2003 dataset typically achieve F1 scores exceeding 90%. Advanced models like RoBERTa and others leverage sophisticated training techniques and larger datasets to reach these high performance levels.

The notable performance gap can be attributed to several factors:

Insufficient Training Duration: My model was trained for a limited number of epochs, and extending this could potentially improve results.
Hyperparameter Optimization: There is likely room for improvement in key hyperparameters, such as the learning rate, batch size, and sequence length, which could enhance the model's ability to learn from the data.
Data Preprocessing Techniques: Implementing more robust preprocessing and data augmentation strategies may help the model generalize better and improve performance across entity classes.