# ELL881 : Assignment 3

In this assignment, you will be building a named entity recognition (NER) model using a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model. NER is a subtask of information extraction that involves identifying and classifying named entities in text into predefined categories such as person names, organization names, locations, and more.

Broadly the steps involved will be as follows:

1. **Data Preparation**: You will process the dataset given to you and tokenize it.
2. **Fine-Tuning BERT**: You will fine-tune a pre-trained BERT model for sequence classification using the training set. You will use the Hugging Face Transformers library to load the pre-trained BERT model and customize the final layers for NER. You will also define the loss function, optimizer, and learning rate scheduler.
3. **Model Evaluation**: You will evaluate the performance of the trained model using the test set on the accuracy metric. 

## Dataset description

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
df = pd.read_csv("drive/My Drive/ell881/ner.csv")
df.head()

Unnamed: 0,text,labels
0,Thousands of demonstrators have marched throug...,O O O O O O B-geo O O O O O B-geo O O O O O B-...
1,Iranian officials say they expect to get acces...,B-gpe O O O O O O O O O O O O O O B-tim O O O ...
2,Helicopter gunships Saturday pounded militant ...,O O B-tim O O O O O B-geo O O O O O B-org O O ...
3,They left after a tense hour-long standoff wit...,O O O O O O O O O O O
4,U.N. relief coordinator Jan Egeland said Sunda...,B-geo O O B-per I-per O B-tim O B-geo O B-gpe ...


The labels in this dataset are as follows:
* `geo` for geographical entity
* `org` for organization entity
* `per` for person entity
* `gpe` for geopolitical entity
* `tim` for time indicator entity
* `art` for artifact entity
* `eve` for event entity
* `nat` for natural phenomenon entity
* `O` is assigned if a word doesn’t belong to any entity.

The labels have also been tagged using the BIO scheme.
You can use the following code for getting the list of labels.

In [4]:
labels = [i.split() for i in df['labels'].values.tolist()]
unique_labels = set()
for lb in labels:
    [unique_labels.add(i) for i in lb if i not in unique_labels]

In [5]:
print(unique_labels)

{'B-gpe', 'B-org', 'I-nat', 'B-nat', 'I-gpe', 'B-art', 'O', 'B-per', 'B-eve', 'I-geo', 'I-eve', 'I-art', 'B-tim', 'I-org', 'B-geo', 'I-per', 'I-tim'}


In [6]:
# Map each label into its id representation and vice versa
label_to_ids = {k: v for v, k in enumerate(sorted(unique_labels))}
ids_to_labels = {v: k for v, k in enumerate(sorted(unique_labels))}
print(label_to_ids)

{'B-art': 0, 'B-eve': 1, 'B-geo': 2, 'B-gpe': 3, 'B-nat': 4, 'B-org': 5, 'B-per': 6, 'B-tim': 7, 'I-art': 8, 'I-eve': 9, 'I-geo': 10, 'I-gpe': 11, 'I-nat': 12, 'I-org': 13, 'I-per': 14, 'I-tim': 15, 'O': 16}


In [7]:
label_all_tokens = False

def labels_aligned(texts, labels):
  token_inputs = tokenizer(texts, padding = 'max_length', max_length = 512, truncation = True)
  word_ids = token_inputs.word_ids()
  prev_word_id = None
  label_ids = []

  for word_id in word_ids:
    if word_id is None:
      label_ids.append(-5)
    elif word_id != prev_word_id:
      try:
        label_ids.append(label_to_ids[labels[word_id]])
      except:
        label_ids.append(-100)
    else:
      try:
        label_ids.append(label_to_ids[labels[word_id]] if label_all_tokens else -5)
      except:
        label_ids.append(-5)
    prev_word_idx = word_id
  return label_ids

In [27]:
import torch
class DataSeq(torch.utils.data.Dataset):
  def __init__(self, df):
    lab = [i.split() for i in df['labels'].values.tolist()]
    txt = df['text'].values.tolist()
    self.texts = [tokenizer(str(i), padding='max_length', max_length = 512, truncation=True, return_tensors="pt") for i in txt]
    self.labels = [labels_aligned(i,j) for (i,j) in zip(txt, lab)]
  def __len__(self):
    return len(self.labels)
  def get_batch_data(self, id):
    return self.texts[id]
  def get_batch_labels(self, id):
    return torch.LongTensor(self.labels[id])
  def __getitem__(self, id):
    batch_data = self.get_batch_data(id)
    batch_labels = self.get_batch_labels(id)

    return batch_data, batch_labels 

In [23]:
# We split the data into train, validation and test sets (80-10-10 split)
import numpy as np
df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=42),
                            [int(.8 * len(df)), int(.9 * len(df))])

## Model Building

In this assignment, you will be using a pretrained BERT model from HuggingFace (supplied in the `transformers` library).
This is a classification task hence the model that you should make use of `BertForTokenClassification` model.

You can train the model using GPU. You should ideally get the script ready on your system by taking a small subset of data and then train it completely using an online service such as Google Colab or Kaggle.

Further, the model expects the inputs to supplied in a particular format which you should be able to read online in the documentations and other resources like medium articles (read up on using BERT for NLP tasks in Pytorch and you will find a lot of resources online). Additionally, for performing tokenization you should be using the tokenizer supplied in the transformers library. The imports for the same have been done in the code snippet below:

In [10]:
%%capture
!pip install transformers

In [11]:
from transformers import BertTokenizerFast
from transformers import BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
# Note how we are using the cased version of tokenizer here since the labels leverage the case information


class BertModel(torch.nn.Module):

    def __init__(self):
        super(BertModel, self).__init__()
        self.bert = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=len(unique_labels))

    def forward(self, input_id, mask, label):
        output = self.bert(input_ids=input_id, attention_mask=mask, labels=label, return_dict=False)
        return output


Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

###Model Training

In [28]:
from torch.utils.data import DataLoader
from tqdm import tqdm
from torch.optim import SGD

lr = 5e-3
epochs = 5
batch_size = 2

train_data = DataSeq(df_train)
val_data = DataSeq(df_val)
dataloader_train = DataLoader(train_data, num_workers=4, batch_size=batch_size, shuffle=True)
dataloader_val = DataLoader(val_data, num_workers=4, batch_size=batch_size)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")



In [33]:
def train(model, df_train, df_val):

  optimizer = SGD(model.parameters(), lr=lr)
  if use_cuda:
    model = model.cuda()

  best_accuracy = 0
  best_loss = 1000

  for epoch in range(epochs):
    train_acc = 0
    train_loss = 0
    model.train()

    for data_train, train_label in tqdm(dataloader_train):
      train_label = train_label.to(device)
      mask = data_train['attention_mask'].squeeze(1).to(device)
      input_id = data_train['input_ids'].squeeze(1).to(device)
      optimizer.zero_grad()
      loss, logits = model(input_id, mask, train_label)
      for i in range(logits.shape[0]):
        print('jdbefjhbe')
        logits_clean = logits[i][train_label[i] != -5]
        label_clean = train_label[i][train_label[i] != -5]
        pred = logits_clean.argmax(dim=1)
        acc = (pred == label_clean).float().mean()
        train_acc += acc
        train_loss += loss.item()
      loss.backward()
      optimizer.step()

    model.eval()

    val_acc = 0
    val_loss = 0

    for data_val, val_label in tqdm(dataloader_val):
      val_label = val_label.to(device)
      mask = data_val['attention_mask'].squeeze(1).to(device)
      input_id = data_val['input_ids'].squeeze(1).to(device)
      optimizer.zero_grad()
      loss, logits = model(input_id, mask, val_label)
      for i in range(logits.shape[0]):
        logits_clean = logits[i][val_label[i] != -5]
        label_clean = val_label[i][val_label[i] != -5]
        pred = logits_clean.argmax(dim=1)
        acc = (pred == label_clean).float().mean()
        val_acc += acc
        val_loss += loss.item()
    
    val_acc = val_acc / len(df_val)
    val_loss = val_loss / len(df_val)

    print(f'Epochs: {epoch + 1} | Loss : {train_loss/len(df_train) : .3f} | Accuracy : {train_acc/ len(df_train):  .3f}')

In [34]:
model = BertModel()
train(model, df_train, df_val)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

RuntimeError: ignored

## Evaluation
For evaluating the performance of the model, you should make use of the `accuracy` metric. You should report the performance after removing the pads. Further, it would be better if you report the accuracy both with and without the "O" label.

In [None]:
def evaluate(model, df_test):
  test_data = DataSeq(df_test)

  dataloader_test = DataLoader(test_data, num_workers=4, batch_size=1)
  use_cuda = torch.cuda.is_available()
  device = torch.device("cuda" if use_cuda else "cpu")
  optimizer = SGD(model.parameters(), lr=lr)
  if use_cuda:
    model = model.cuda()
  
  test_acc = 0

  for data_test, test_label in tqdm(dataloader_test):
    test_label = test_label.to(device)
    mask = data_test['attention_mask']
      mask = torch.stack(mask)
      mask = mask.to(device)
      input_id = data_test['input_ids']
      input_id = torch.stack(input_id)
      input_id = input_id.to(device)
    optimizer.zero_grad()
    loss, logits = model(input_id, mask, test_label)
    for i in range(logits.shape[0]):
      logits_clean = logits[i][test_label[i] != -5]
      label_clean = test_label[i][test_label[i] != -5]
      pred = logits_clean.argmax(dim=1)
      acc = (pred == label_clean).float().mean()
      test_acc += acc
  
  test_acc = test_acc / len(df_test)

  print(f'Test Accuracy : {test_acc}')

evaluate(model, df_test)

Your final submission should include a report that describes your methodology, experimental results, analysis, and discussion, as well as the code used to train and test the model.

Good luck, and have fun exploring BERT!