### 1. Load Datasets
Loading the train, test and validation sets into pandas DataFrames

In [5]:
import pandas as pd

df_train = pd.read_json('data/raw/train.json', lines=True)
df_test = pd.read_json('data/raw/test.json', lines=True)
df_validation = pd.read_json('data/raw/validation.json', lines=True)

### 2. Exploratory Data Analysis (Sayeed, Jui)
#### Analyzing training dataset

In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1112 entries, 0 to 1111
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   quality_checked  1112 non-null   object
 1   text             1112 non-null   object
 2   task             1112 non-null   object
 3   meta             1112 non-null   object
 4   doc_id           1112 non-null   object
 5   dataset_type     1112 non-null   object
 6   annotator_id     1112 non-null   object
 7   entity_mentions  1112 non-null   object
dtypes: object(8)
memory usage: 69.6+ KB


Observed no null values

In [7]:
df_train.head()

Unnamed: 0,quality_checked,text,task,meta,doc_id,dataset_type,annotator_id,entity_mentions
0,[],PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194,train,annotator1,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
1,[],PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194,train,annotator2,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
2,[],PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194,train,annotator8,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
3,[],PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194,train,annotator11,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
4,[],PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'D. Stępniak', 'articles': [91, ...",001-84741,train,annotator1,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."


Observation: We do not need dataset_type because the test, train and validation files are already separate. We also do not need the quality_checked and annotator_id columns.

Dropping the columns we do not need.

In [8]:
# drop() returns a new DataFrame; df_train itself keeps all its columns
df_train.drop(columns=['quality_checked', 'dataset_type', 'annotator_id'])

Unnamed: 0,text,task,meta,doc_id,entity_mentions
0,PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
1,PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
2,PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
3,PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
4,PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'D. Stępniak', 'articles': [91, ...",001-84741,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
...,...,...,...,...,...
1107,PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Helmut Ludescher', 'articles': ...",001-60002,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
1108,PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'J. Peter', 'articles': [91, 34,...",001-146353,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
1109,PROCEDURE\n\nThe case was referred to the Cour...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Christopher Ian Scott', 'articl...",001-58010,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."
1110,PROCEDURE\n\nThe case originated in an applica...,Task: Annotate the document to anonymise the f...,"{'applicant': 'Henryk Kreuz', 'articles': [91,...",001-61921,"[{'confidential_status': 'NOT_CONFIDENTIAL', '..."


Analyzing task column

In [9]:
df_train['task'].unique()

array(['Task: Annotate the document to anonymise the following person: Henrik Hasslund',
       'Task: Annotate the document to anonymise the following person: D. Stępniak',
       'Task: Annotate the document to anonymise the following person: Nusret Amutgan',
       ...,
       'Task: Annotate the document to anonymise the following person: J. Peter',
       'Task: Annotate the document to anonymise the following person: Christopher Ian Scott',
       'Task: Annotate the document to anonymise the following person: Yiannis Kyriakou'],
      dtype=object)

Observation: we don't need the task column

Checking how many unique values the text and doc_id columns contain, and making sure the two counts match.

In [10]:
len(df_train['text'].unique())

1014

In [11]:
len(df_train['doc_id'].unique())

1014

Observation: We have 1014 unique documents but the dataset has 1112 entries, so some documents appear more than once. The head() output above already shows doc_id 001-90194 annotated by four different annotators.
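One way to confirm where the extra rows come from is to count rows per doc_id. A minimal sketch on toy rows copied from the head() output above; the variable names are illustrative and the real check would run on df_train:

```python
from collections import Counter

# Toy (doc_id, annotator_id) pairs taken from the df_train.head() output
rows = [
    ("001-90194", "annotator1"),
    ("001-90194", "annotator2"),
    ("001-90194", "annotator8"),
    ("001-90194", "annotator11"),
    ("001-84741", "annotator1"),
]

# Count how many annotation rows each document has
rows_per_doc = Counter(doc_id for doc_id, _ in rows)

# Documents with more than one row were annotated by several annotators
repeated_docs = {doc: n for doc, n in rows_per_doc.items() if n > 1}

print(f"{len(rows)} rows, {len(rows_per_doc)} unique documents")
print("Documents with more than one row:", repeated_docs)
```

On the DataFrame itself, `df_train['doc_id'].value_counts()` gives the same per-document counts.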

In [12]:
df_train['meta'][0]

{'applicant': 'Henrik Hasslund',
 'articles': [91, 34, 54, 34, 93],
 'countries': 'DNK',
 'legal_branch': 'CHAMBER',
 'year': 2008}

Observation: We can set this column aside for later evaluation. It might help us find out whether our model struggles with region-specific names, shows a bias, etc.

In [13]:
df_train_meta = df_train[['text', 'meta', 'doc_id']].copy()
df_train_meta.head()

Unnamed: 0,text,meta,doc_id
0,PROCEDURE\n\nThe case originated in an applica...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194
1,PROCEDURE\n\nThe case originated in an applica...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194
2,PROCEDURE\n\nThe case originated in an applica...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194
3,PROCEDURE\n\nThe case originated in an applica...,"{'applicant': 'Henrik Hasslund', 'articles': [...",001-90194
4,PROCEDURE\n\nThe case originated in an applica...,"{'applicant': 'D. Stępniak', 'articles': [91, ...",001-84741


In [14]:
import os

# create the output directory first; to_csv does not create missing directories
os.makedirs('data/processed', exist_ok=True)
df_train_meta.to_csv('data/processed/train_meta.csv')

In [None]:
df_test_meta = df_test[['text', 'meta', 'doc_id']].copy()
df_test_meta.to_csv('data/processed/test_meta.csv')

df_validation_meta = df_validation[['text', 'meta', 'doc_id']].copy()
df_validation_meta.to_csv('data/processed/validation_meta.csv')

Exploring the entity_mentions column

In [None]:
df_train['entity_mentions'][1]

In [None]:
import json
df_train_exploded = df_train.explode('entity_mentions')
entities_flat = pd.json_normalize(df_train_exploded['entity_mentions'])
df_train_entities = pd.concat([df_train_exploded[['doc_id']].reset_index(drop=True), entities_flat.reset_index(drop=True)], axis=1)

df_train_entities.head()

Observation: We have start_offset, end_offset and entity_type. We need to extract this data to create the tokens and tags for fine-tuning the DistilBERT model.

Checking entity counts for the filtered set where identifier_type is not NO_MASK

In [None]:
print("Entity Type Counts:")
entity_type_stats = df_train_entities[df_train_entities['identifier_type'] != 'NO_MASK']['entity_type'].value_counts()
entity_type_stats.plot(kind='bar', title='Distribution of Entity Types')

The data is imbalanced: datetime entities occur far more often than quantity or code entities.

In [None]:
# Check masking requirements
mask_stats = df_train_entities['identifier_type'].value_counts()
mask_stats.plot(kind='bar', title='Distribution of Masking Needs')

In [None]:
entities_per_doc = df_train_entities.groupby('doc_id').size().sort_values(ascending=False)
entities_per_doc.plot(kind='hist', bins=30, title='Entities per Document Distribution', xlabel='Number of Entities', ylabel='Number of Documents')

In [None]:
print(f"Average entities per document: {entities_per_doc.mean():.2f}")

### 3. Data pre-processing (Jui)
Converting offsets to lists

In [None]:
def convert_offsets_to_lists(row):
    text = row['text']
    entities = row['entity_mentions']

    # create character-level map
    char_tags = ["O"] * len(text)

    for ent in entities:
        # Filter 'NO_MASK' entities
        if ent.get('identifier_type') == 'NO_MASK':
            continue

        start, end = ent['start_offset'], ent['end_offset']
        label = ent['entity_type']

        # fill character-level map
        if start < len(text) and end <= len(text):
            char_tags[start] = f"B-{label}" # beginning of entity
            for i in range(start+1, end):
                char_tags[i] = f"I-{label}" # inside entity

    # convert character map to word - tag
    tokens = text.split()
    ner_tags = []

    cursor = 0
    for token in tokens:
        # advance cursor to the start of word (skipping spaces)
        while cursor < len(text) and text[cursor].isspace():
            cursor += 1

        # tag of the word is the tag of its first character
        if cursor < len(text):
            ner_tags.append(char_tags[cursor])
            cursor += len(token)
        else:
            ner_tags.append("O")

    return {"tokens": tokens, "ner_tags": ner_tags}
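To make the conversion concrete, here is a compact stand-alone sketch of the same idea on a toy sentence. The sentence, the spans and the helper name `char_offsets_to_word_tags` are made up for illustration; it skips the NO_MASK filter and bounds checks of the function above.

```python
def char_offsets_to_word_tags(text, entities):
    """Tag each whitespace-separated word with the BIO tag of its first character."""
    # character-level map: B- on the first character of a span, I- on the rest
    char_tags = ["O"] * len(text)
    for start, end, label in entities:
        char_tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            char_tags[i] = f"I-{label}"

    tokens, tags = [], []
    cursor = 0
    for token in text.split():
        # locate the start of this word in the original text
        cursor = text.index(token, cursor)
        tokens.append(token)
        tags.append(char_tags[cursor])
        cursor += len(token)
    return tokens, tags


# Toy sentence with three hypothetical entity spans: (start, end, type)
text = "Mr John Smith moved to Oslo in 2008"
entities = [(3, 13, "PERSON"), (23, 27, "LOC"), (31, 35, "DATETIME")]

tokens, tags = char_offsets_to_word_tags(text, entities)
for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")
```

Because a word takes the tag of its first character, the second word of a multi-word entity gets an I- tag, while an entity that begins in the middle of a word would be missed.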

In [None]:
from datasets import Dataset

# converting Pandas to Hugging Face Dataset
hf_train = Dataset.from_pandas(df_train)
train_processed = hf_train.map(convert_offsets_to_lists)

# Quick Check:
print(train_processed[0]['tokens'])
print(train_processed[0]['ner_tags'])

Creating label mappings from the train set to pass to the model during training. The same mappings are used when tokenizing the test and validation sets.

In [None]:
# extracting unique tags from training data
unique_tags = set(tag for row in train_processed for tag in row['ner_tags'])
label_list = sorted(list(unique_tags)) # e.g., ['B-LOC', 'B-PER', 'I-PER', 'O']

# creating maps
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

print(f"Number of labels: {len(label_list)}")
print(label2id)

Using AutoTokenizer to handle sub-words and align the tags with them. Defining a tokenization function that can be used with various versions of BERT.

In [None]:
def tokenize_and_align(examples, tokenizer):
  '''Takes the tokenizer object as input and operates on a batch of rows from a Hugging Face dataset object (it is called with batched=True)'''
  # split words into sub-words
  tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

  labels = []
  for i, label in enumerate(examples["ner_tags"]):
      word_ids = tokenized_inputs.word_ids(batch_index=i) #supported only for fast tokenizers
      previous_word_idx = None
      label_ids = []
      for word_idx in word_ids:
          if word_idx is None:
              # special tokens like [CLS] get -100 (ignored)
              label_ids.append(-100)
          elif word_idx != previous_word_idx:
              # first piece of a word gets the real label ID
              label_ids.append(label2id[label[word_idx]]) #using map created from training set
          else:
              # subsequent pieces (e.g., "##lor") get -100 (ignored)
              label_ids.append(-100)
          previous_word_idx = word_idx
      labels.append(label_ids)

  tokenized_inputs["labels"] = labels
  return tokenized_inputs
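The alignment rule can be checked without loading a tokenizer. The sketch below applies the same rule to a hand-written `word_ids` list; the sub-word split, the labels and the `toy_label2id` map are hypothetical and only stand in for real tokenizer output.

```python
def align_labels(word_ids, word_labels, label2id):
    """Label the first sub-word of each word; mask everything else with -100."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # special tokens and continuation pieces are ignored by the loss
            label_ids.append(-100)
        else:
            label_ids.append(label2id[word_labels[word_idx]])
        previous_word_idx = word_idx
    return label_ids


# Hypothetical tokenizer output for the words ["Henrik", "Hasslund", "applied"]
sub_tokens = ["[CLS]", "hen", "##rik", "hass", "##lund", "applied", "[SEP]"]
word_ids = [None, 0, 0, 1, 1, 2, None]
word_labels = ["B-PERSON", "I-PERSON", "O"]
toy_label2id = {"B-PERSON": 0, "I-PERSON": 1, "O": 2}

aligned = align_labels(word_ids, word_labels, toy_label2id)
for piece, label_id in zip(sub_tokens, aligned):
    print(f"{piece:10s} {label_id}")
```

Only one position per word carries a real label, so the loss and the metrics count each word once regardless of how many pieces it is split into.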

Writing a helper function to preprocess any of the splits (train, test or validation)

In [None]:
def preprocess_data(df, tokenizer):
  '''Takes a pandas dataframe and a tokenizer object and returns a tokenized huggingface dataset object ready to be passed into BERT for training
  '''
  # converting Pandas to Hugging Face Dataset
  hf = Dataset.from_pandas(df)
  processed = hf.map(convert_offsets_to_lists)
  #tokenize
  tokenized = processed.map(tokenize_and_align, batched=True, fn_kwargs={"tokenizer": tokenizer})

  return tokenized

Importing AutoTokenizer to use various BERT tokenizers

In [None]:
from transformers import AutoTokenizer

### 4. Bi-LSTM (Jui)

Using DistilBERT tokenizer to keep the comparison fair

In [None]:
tokenizer_distilbert = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
train_distilbert = preprocess_data(df_train ,tokenizer_distilbert)

In [None]:
test_distilbert = preprocess_data(df_test, tokenizer_distilbert)
validation_distilbert = preprocess_data(df_validation, tokenizer_distilbert)

Importing PyTorch

In [None]:
import torch
import torch.nn as nn

In [None]:
#setup GPU availability
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

Defining the Bi-LSTM model for NER

In [None]:
class bilstm_nre(nn.Module):
  def __init__(self, vocab_size, num_labels, embed_dim=128, hidden_dim=256, weight_tensor=None, dropout_rate=None):
      super().__init__()

      #setting up a custom embedding layer
      #setting padding_idx=0 stops the embedding layer from learning a vector for padding
      self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

      #bi-lstm layer
      self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

      #dropout
      if dropout_rate is not None:
        self.dropout = nn.Dropout(dropout_rate)

      #output layer (2*hidden_dim for bi-lstm)
      self.classifier = nn.Linear(hidden_dim * 2, num_labels)

      #set loss function (ignore_index=-100 because we set masking to -100 in previous function)
      if weight_tensor is not None:
        self.loss_fct = nn.CrossEntropyLoss(ignore_index=-100, weight=weight_tensor)
      else:
        self.loss_fct = nn.CrossEntropyLoss(ignore_index=-100)

  #forward pass
  def forward(self, input_ids, labels=None):
      #embed
      embeds = self.embedding(input_ids)

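      #lstm forward
      lstm_out, _ = self.lstm(embeds)

      #apply dropout if it was configured in __init__ (it was defined there but never used)
      if hasattr(self, 'dropout'):
          lstm_out = self.dropout(lstm_out)

      logits = self.classifier(lstm_out)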

      #loss calculation
      loss = None
      if labels is not None:
          # Flatten the tensors so we can check every token at once
          # logits shape: (batch * seq_len, num_labels)
          # labels shape: (batch * seq_len)
          loss = self.loss_fct(logits.view(-1, logits.shape[-1]), labels.view(-1))

      return loss, logits

Batch pre-processing function to prepare the dataloaders

In [None]:
def collate_fn(batch):
    # convert batch to tensors
    input_ids = [torch.tensor(item['input_ids']) for item in batch]
    labels = [torch.tensor(item['labels']) for item in batch]

    # padding inputs with 0 (the [PAD] token id), labels with -100
    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=0)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=-100)

    return input_ids.to(device), labels.to(device)
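As a plain-Python sketch of what the padding step does to one batch: every sequence is extended to the length of the longest one, with 0 for inputs and -100 for labels. The helper `pad_batch` and the token ids are hypothetical, used only to show the shapes.

```python
def pad_batch(sequences, padding_value):
    """Right-pad every list to the length of the longest list in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]


# Two toy examples of different lengths (ids and labels are made up)
toy_input_ids = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
toy_labels = [[-100, 2, -100], [-100, 2, 2, 2, -100]]

padded_ids = pad_batch(toy_input_ids, padding_value=0)
padded_labels = pad_batch(toy_labels, padding_value=-100)

print(padded_ids)
print(padded_labels)
```

Padding labels with -100 matters because the loss is built with ignore_index=-100, so padded positions contribute nothing to training or to the metrics.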

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_distilbert, batch_size=16, shuffle=True, collate_fn=collate_fn)

In [None]:
# evaluation loaders do not need shuffling
test_loader = DataLoader(test_distilbert, batch_size=16, shuffle=False, collate_fn=collate_fn)
validation_loader = DataLoader(validation_distilbert, batch_size=16, shuffle=False, collate_fn=collate_fn)

Tracking model performance at every epoch

In [None]:
#using huggingface evaluate library for NER evaluation
!pip install evaluate seqeval

In [None]:
import matplotlib.pyplot as plt
import evaluate
import numpy as np

Function to evaluate after every epoch

In [None]:
seqeval = evaluate.load("seqeval")

def evaluate_epoch(model, dataloader, label_list):
    model.eval() # Set to evaluation mode

    all_preds = []
    all_labels = []
    total_val_loss = 0

    with torch.no_grad():
        for batch_ids, batch_labels in dataloader:
            batch_ids = batch_ids.to(device)
            batch_labels = batch_labels.to(device)

            # Forward pass
            loss, logits = model(batch_ids, batch_labels)

            # test loss
            total_val_loss += loss.item()

            # Get predictions (argmax)
            preds = torch.argmax(logits, dim=-1).cpu().numpy()
            labels = batch_labels.cpu().numpy()

            all_preds.extend(preds)
            all_labels.extend(labels)

    #calculate validation loss
    avg_val_loss = total_val_loss / len(dataloader)

    # convert IDs back to Tags (removing -100)
    decoded_preds = [
        [label_list[p] for (p, l) in zip(pred, label) if l != -100]
        for pred, label in zip(all_preds, all_labels)
    ]
    decoded_labels = [
        [label_list[l] for (p, l) in zip(pred, label) if l != -100]
        for pred, label in zip(all_preds, all_labels)
    ]

    # compute metrics using seqeval (Strict Entity-Level scoring)
    results = seqeval.compute(predictions=decoded_preds, references=decoded_labels)

    return {
        "val_loss": avg_val_loss,
        "accuracy": results["overall_accuracy"],
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"]
    }

Function for training loop with evaluation metric after each epoch

In [None]:
def train_eval_lstm(model, optimizer, n_epoches = 5):
  print("Starting Bi-LSTM for NER training...")

  history = {
      "train_loss": [],
      "val_loss": [],
      "accuracy": [],
      "precision": [],
      "recall": [],
      "f1": []
  }

  for epoch in range(n_epoches):
      # train
      model.train()
      total_loss = 0

      for batch_ids, batch_labels in train_loader:
          optimizer.zero_grad()
          loss, logits = model(batch_ids, batch_labels)
          loss.backward()
          optimizer.step()
          total_loss += loss.item()

      avg_train_loss = total_loss / len(train_loader)

      # validation on test set to calculate metrics
      metrics = evaluate_epoch(model, test_loader, label_list)

      # tracking history of metrics
      history["train_loss"].append(avg_train_loss)
      history["val_loss"].append(metrics["val_loss"])
      history["accuracy"].append(metrics["accuracy"])
      history["precision"].append(metrics["precision"])
      history["recall"].append(metrics["recall"])
      history["f1"].append(metrics["f1"])

      print(f"Epoch {epoch+1}/{n_epoches} | "
            f"Train Loss: {avg_train_loss:.4f} | "
            f"Test Loss: {metrics['val_loss']:.4f} | "
            f"Test Recall: {metrics['recall']:.4f} | "
            f"Test Precision: {metrics['precision']:.4f} | "
            f"Test F1: {metrics['f1']:.4f} | "
            f"Test Accuracy: {metrics['accuracy']:.4f}")

  print("Training complete!")
  return history

Function for plotting the evaluation metrics against the epoch

In [None]:
def plot_training_metrics(history):
    epochs_range = range(1, len(history['train_loss']) + 1)

    plt.figure(figsize=(18, 6))

    # training vs testing loss
    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, history['train_loss'], 'r-o', label='Training Loss')
    plt.plot(epochs_range, history['val_loss'], 'b-o', label='Testing Loss')
    plt.title('Training Loss vs Testing Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.grid(True)

    # R vs P vs A vs F1
    plt.subplot(1, 2, 2)
    # metrics come from test_loader, so label them as test metrics; one distinct color per curve
    plt.plot(epochs_range, history['recall'], 'b-o', label='Test Recall')
    plt.plot(epochs_range, history['f1'], 'g-o', label='Test F1 Score')
    plt.plot(epochs_range, history['accuracy'], 'kx--', label='Test Accuracy')
    plt.plot(epochs_range, history['precision'], 'm-o', label='Test Precision')
    plt.title('Test R vs P vs A vs F1')
    plt.xlabel('Epochs')
    plt.ylabel('Score')
    plt.ylim(0, 1.0) # y-axis to 0-100%
    plt.legend(loc='lower right')
    plt.grid(True)

    plt.tight_layout()
    plt.show()

Training and evaluating LSTM model v1

In [None]:
# vocab_size is 30522 for DistilBERT
# num_labels is len(label_list)
lstm_nre_v1 = bilstm_nre(30522, len(label_list), embed_dim=128, hidden_dim=256).to(device)
optimizer_v1 = torch.optim.Adam(lstm_nre_v1.parameters(), lr=1e-4)

In [None]:
history_v1 = train_eval_lstm(lstm_nre_v1, optimizer_v1, n_epoches = 10)

In [None]:
# plot evaluation metrics
plot_training_metrics(history_v1)

Observation: The model does fine until epoch 9. After epoch 10, slight overfitting is seen. Also, a high accuracy of 94.45% combined with a low recall and precision of 51.53% shows that the model has learnt that predicting the non-entity label 'O' most of the time is a safe bet.
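The gap between accuracy and the entity-level scores can be reproduced on a toy sequence. The sketch below is a simplified stand-in for strict entity-level scoring, not the seqeval implementation: it assumes well-formed BIO tags and ignores orphan I- tags. The tag sequences are made up.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not continues:
            entities.add((start, i, ent_type))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities


# 20 toy tokens, mostly 'O', with three gold entities
gold = ["O"] * 20
gold[2:4] = ["B-PERSON", "I-PERSON"]
gold[10] = "B-DATETIME"
gold[15:17] = ["B-LOC", "I-LOC"]

# Prediction: cuts the PERSON span short and misses the LOC span entirely
pred = list(gold)
pred[3] = "O"
pred[15:17] = ["O", "O"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
gold_ents, pred_ents = extract_entities(gold), extract_entities(pred)
true_pos = len(gold_ents & pred_ents)
precision = true_pos / len(pred_ents)
recall = true_pos / len(gold_ents)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Token accuracy stays high because most tokens are 'O', while a span that is only partly predicted counts as a full miss at entity level.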

Implementing a weighted loss strategy to tell the model that missing an entity is worse than getting a non-entity wrong.

To calculate the weights, a tag that occurs more often, like the non-entity tag 'O', needs a lower weight. Using the sklearn class_weight utility to compute this.

In [None]:
from sklearn.utils.class_weight import compute_class_weight
#list of all tags in training set
all_classes = [label
              for row in train_distilbert['labels']
                for label in row
                  if label != -100
]
unique_classes = np.unique(all_classes)

#balanced mode adjusts the weights inversely proportional to the frequencies of the classes
weights = compute_class_weight(class_weight='balanced', classes = unique_classes, y = all_classes)

#convert class weights to pytorch tensor
class_weights = torch.tensor(weights, dtype=torch.float).to(device)

print("Calculated Class Weights:")
for class_id, weight in zip(unique_classes, class_weights):
    # look up the label name by class id rather than by position
    label_name = id2label[int(class_id)]
    print(f"{label_name}: {weight:.4f}")
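For reference, the 'balanced' mode follows the formula n_samples / (n_classes * count(class)). A pure-Python sketch of that formula on made-up label counts; the helper name and the toy distribution are illustrative only:

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight(c) = n_samples / (n_classes * count(c))"""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Toy tag distribution dominated by the non-entity tag 'O'
toy_tags = ["O"] * 90 + ["B-PERSON"] * 6 + ["I-PERSON"] * 4
toy_weights = balanced_class_weights(toy_tags)

for tag, weight in toy_weights.items():
    print(f"{tag}: {weight:.4f}")
```

The frequent 'O' tag ends up with a weight below 1 and the rare tags with weights well above 1, which is what pushes the model toward predicting entities more often.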

In [None]:
#trying similar architecture with weighted loss approach
lstm_nre_v2 = bilstm_nre(30522, len(label_list), embed_dim=128, hidden_dim=256, weight_tensor=class_weights).to(device)
optimizer_v2 = torch.optim.Adam(lstm_nre_v2.parameters(), lr=1e-4)

In [None]:
history_v2 = train_eval_lstm(lstm_nre_v2, optimizer_v2, n_epoches = 10)

In [None]:
# plot evaluation metrics
plot_training_metrics(history_v2)

Observed that the model starts to overfit at epoch 4. Recall has improved to 52.61%, but precision has dropped drastically to 9.46%, showing that only 9.46% of the redacted words are actually sensitive! Accuracy has also dropped to 57.34%.

What effect might it have if capacity of the network is increased?
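To get a feel for how much capacity each hidden_dim adds, the LSTM parameter count can be worked out by hand. This is a rough sketch assuming PyTorch's default single-layer nn.LSTM layout (four gates, each with input weights, recurrent weights and two bias vectors per direction); the helper name is illustrative.

```python
def bilstm_param_count(embed_dim, hidden_dim):
    """Trainable parameters in a single-layer bidirectional LSTM."""
    # per direction: 4 gates x (input weights + recurrent weights) + 2 bias vectors of size 4*hidden
    per_direction = 4 * hidden_dim * (embed_dim + hidden_dim) + 2 * 4 * hidden_dim
    return 2 * per_direction


# embedding table shared by all variants: vocab_size x embed_dim
embedding_params = 30522 * 128
print(f"embedding parameters: {embedding_params}")

# hidden sizes tried in this section
lstm_sizes = {h: bilstm_param_count(128, h) for h in (256, 288, 416, 480)}
for hidden_dim, n_params in lstm_sizes.items():
    print(f"hidden_dim={hidden_dim}: {n_params} LSTM parameters")
```

The recurrent weights grow with the square of hidden_dim, so moving from 256 to 480 roughly triples the LSTM's parameter count while the embedding table stays the same size.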

In [None]:
#increasing dimension of hidden layer without using weighted loss
lstm_nre_v3 = bilstm_nre(30522, len(label_list), embed_dim=128, hidden_dim=288).to(device)
optimizer_v3 = torch.optim.Adam(lstm_nre_v3.parameters(), lr=1e-4)

In [None]:
history_v3 = train_eval_lstm(lstm_nre_v3, optimizer_v3, n_epoches = 10)

In [None]:
# plot evaluation metrics
plot_training_metrics(history_v3)

Observation: Accuracy remained very high at about 94.52%, and recall rose slightly to 53.22%.

Trying to increase capacity further.

In [None]:
#increasing dimension of hidden layer without using weighted loss
lstm_nre_v4 = bilstm_nre(30522, len(label_list), embed_dim=128, hidden_dim=416).to(device)
optimizer_v4 = torch.optim.Adam(lstm_nre_v4.parameters(), lr=1e-4)

In [None]:
history_v4 = train_eval_lstm(lstm_nre_v4, optimizer_v4, n_epoches = 10)

In [None]:
# plot evaluation metrics
plot_training_metrics(history_v4)

Observation: Increasing capacity significantly enables the model to learn well. The model starts to show a slight overfit at epoch 7. While accuracy is still high at 94.93%, test recall has increased to 60.56% and test F1 to 66.99%.

Exploring if extending the training process increases model performance further.

In [None]:
history_v4_1 = train_eval_lstm(lstm_nre_v4, optimizer_v4, n_epoches = 20)

In [None]:
# plot evaluation metrics
plot_training_metrics(history_v4_1)

Observation: Continuing training for 20 more epochs led to a higher recall of 70.22% and F1 of 74.73% at epoch 17, with accuracy of 94.85%. However, the model is overfitting.

Trying an approach where the capacity is increased further but training runs for fewer epochs.

In [None]:
#increasing dimension of hidden layer more without using weighted loss
lstm_nre_v5 = bilstm_nre(30522, len(label_list), embed_dim=128, hidden_dim=480).to(device)
optimizer_v5 = torch.optim.Adam(lstm_nre_v5.parameters(), lr=1e-4)

In [None]:
history_v5 = train_eval_lstm(lstm_nre_v5, optimizer_v5, n_epoches = 30)

In [None]:
# plot evaluation metrics
plot_training_metrics(history_v5)

Observation: After epoch 10 the model learns slowly but reaches 71.24% recall and a 73.69% F1 score at epoch 27; recall drops after that. Some overfitting is seen.
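Since the metrics peak and then drop, it can help to read the best epoch straight from the history dictionary instead of from the plot. A small sketch, assuming a history dict shaped like the one returned by train_eval_lstm; the numbers below are made up and the helper `best_epoch` is hypothetical.

```python
def best_epoch(history, metric="f1"):
    """Return (1-based epoch, value) where the chosen metric is highest."""
    values = history[metric]
    best_idx = max(range(len(values)), key=values.__getitem__)
    return best_idx + 1, values[best_idx]


# Hypothetical history: F1 peaks at epoch 4 while the evaluation loss starts rising
toy_history = {
    "f1": [0.40, 0.55, 0.61, 0.66, 0.64, 0.63],
    "val_loss": [0.90, 0.70, 0.62, 0.60, 0.63, 0.67],
}

epoch, value = best_epoch(toy_history, metric="f1")
print(f"Best F1 of {value:.2f} at epoch {epoch}")
```

Picking the epoch from validation_loader rather than test_loader would keep the test set untouched for the final comparison.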

Adding a dropout layer to reduce overfitting.

In [None]:
#adding dropout layer to reduce overfitting
lstm_nre_v6 = bilstm_nre(30522, len(label_list), embed_dim=128, hidden_dim=480, dropout_rate=0.2).to(device)
optimizer_v6 = torch.optim.Adam(lstm_nre_v6.parameters(), lr=1e-4)

In [None]:
history_v6 = train_eval_lstm(lstm_nre_v6, optimizer_v6, n_epoches = 27)

In [None]:
# plot evaluation metrics
plot_training_metrics(history_v6)

Observation: Overfitting is slightly reduced at epoch 25 with R: 70.11%, P: 76.53%, F1: 73.18% and A: 95.59%

### 5. LegalBERT Finetuning (Liza)

This section sets up the preprocessing and evaluation pipeline for fine-tuning the Legal-BERT model on the NER task. A custom tokenize_and_align_labels function maps the original character-level entity annotations to BERT's sub-word tokens using the BIO tagging scheme. The evaluate_model_performance function standardizes testing by computing accuracy, F1-score, precision and recall (plus per-type F1) while ignoring special tokens, so that different model versions are assessed on the same basis.

In [16]:

# Installing libs just in case
!pip install seqeval evaluate transformers datasets

import numpy as np
import pandas as pd
import evaluate
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification

# Using Legal-BERT since it performs better on domain-specific texts than vanilla BERT
MODEL_CHECKPOINT = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
metric = evaluate.load("seqeval")

# Define NER labels
label_list = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

# Mapping dataset entity types to our simplified schema
type_mapper = {"PERSON": "PER", "ORGANIZATION": "ORG", "LOCATION": "LOC", "ORG": "ORG", "LOC": "LOC", "PER": "PER"}

def tokenize_and_align_labels(examples):
    """
    Standard BERT preprocessing.
    Main challenge: Aligning character-level offsets from the JSON
    to the sub-word tokens generated by the tokenizer.
    """
    tokenized_inputs = tokenizer(
        examples["text"], truncation=True, max_length=512,
        return_offsets_mapping=True, padding="max_length"
    )
    labels = []

    for i, doc_offsets in enumerate(tokenized_inputs["offset_mapping"]):
        # Handle potential None values in source data
        doc_mentions = examples["entity_mentions"][i] if examples["entity_mentions"][i] is not None else []

        doc_labels = [0] * len(doc_offsets)
        for idx, (start, end) in enumerate(doc_offsets):
            if start == end: # Skip special tokens like [CLS]
                doc_labels[idx] = -100
                continue

            # Check if token falls inside a known entity span
            for mention in doc_mentions:
                if start >= mention['start_offset'] and end <= mention['end_offset']:
                    raw_type = mention['entity_type']
                    short_type = type_mapper.get(raw_type, "ORG") # Defaulting to ORG if type is unclear

                    # BIO Tagging logic
                    prefix = "B-" if start == mention['start_offset'] else "I-"
                    label_name = f"{prefix}{short_type}"
                    doc_labels[idx] = label2id.get(label_name, 0)
                    break
        labels.append(doc_labels)

    tokenized_inputs["labels"] = labels
    tokenized_inputs.pop("offset_mapping") # Not needed for training
    return tokenized_inputs

# --- INTEGRATION PART ---
# Converting the Pandas DFs loaded in Section 1 to Hugging Face Datasets
print("Converting Pandas DataFrames to HF Datasets...")
raw_datasets = DatasetDict({
    "train": Dataset.from_pandas(df_train),
    "validation": Dataset.from_pandas(df_validation),
    "test": Dataset.from_pandas(df_test)
})

# Applying tokenization to all splits
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names
)
print("Preprocessing done. Ready for training.")

Converting Pandas DataFrames to HF Datasets...


Map:   0%|          | 0/1112 [00:00<?, ? examples/s]

Map:   0%|          | 0/541 [00:00<?, ? examples/s]

Map:   0%|          | 0/555 [00:00<?, ? examples/s]

Preprocessing done. Ready for training.


In [18]:

model = AutoModelForTokenClassification.from_pretrained(
    MODEL_CHECKPOINT,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

# Hyperparameters
args = TrainingArguments(
    "legal-bert-ner",
    eval_strategy="epoch",
    learning_rate=2e-5,          # Typical LR for BERT fine-tuning
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,          # 3 epochs is usually enough for NER
    weight_decay=0.01,
    save_strategy="no",          # Not saving checkpoints to save disk space
    logging_steps=50
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer)
)

print("Starting training loop...")
trainer.train()

BertForTokenClassification LOAD REPORT from: nlpaueb/legal-bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.bias                       | UNEXPECTED | 
bert.pooler.dense.bias                     | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.decoder.bias               | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
bert.pooler.dense.weight                   | UNEXPECTED | 
cls.predictions.decoder.weight             | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored w

Starting training loop...


Epoch,Training Loss,Validation Loss
1,0.510124,0.131106
2,0.131997,0.104129
3,0.090614,0.102124


TrainOutput(global_step=210, training_loss=0.20304985897881644, metrics={'train_runtime': 363.2006, 'train_samples_per_second': 9.185, 'train_steps_per_second': 0.578, 'total_flos': 871725384769536.0, 'train_loss': 0.20304985897881644, 'epoch': 3.0})

In [19]:
def evaluate_model_performance(trainer, eval_dataset, model_name="MyModel"):
    print(f"--- Evaluating {model_name} ---")
    predictions, labels, _ = trainer.predict(eval_dataset)
    predictions = np.argmax(predictions, axis=2)

    # Filter out special tokens (-100) before calculating metrics
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)

    return {
        "Model": model_name,
        "Accuracy": results['overall_accuracy'],
        "F1 Score": results['overall_f1'],
        "Precision": results['overall_precision'],
        "Recall": results['overall_recall'],
        "PER_F1": results['PER']['f1'] if 'PER' in results else 0,
        "ORG_F1": results['ORG']['f1'] if 'ORG' in results else 0,
        "LOC_F1": results['LOC']['f1'] if 'LOC' in results else 0
    }

# Running evaluation on the held-out test set
results_list = []
metrics_legal_bert = evaluate_model_performance(trainer, tokenized_datasets["test"], model_name="Legal-BERT")
results_list.append(metrics_legal_bert)

# Displaying the leaderboard
leaderboard_df = pd.DataFrame(results_list)
print("\n--- FINAL LEADERBOARD ---")
display(leaderboard_df.round(4))

--- Evaluating Legal-BERT ---



--- FINAL LEADERBOARD ---


Unnamed: 0,Model,Accuracy,F1 Score,Precision,Recall,PER_F1,ORG_F1,LOC_F1
0,Legal-BERT,0.9652,0.8119,0.7926,0.8321,0.9486,0.7828,0.8033


In [22]:
import torch
from transformers import pipeline

# Setting up a quick pipeline for testing (using the model currently in memory)
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple", # Automatically merges B- and I- tags
    device=0 if torch.cuda.is_available() else -1
)

def redact_text(text):
    entities = nlp(text)
    # Sorting entities in reverse order is crucial!
    # Otherwise, replacing text at the beginning shifts indices for subsequent entities.
    entities = sorted(entities, key=lambda x: x['start'], reverse=True)

    redacted = text
    for entity in entities:
        # Masking only our target sensitive groups
        if entity['entity_group'] in ["PER", "ORG", "LOC"]:
            start, end = entity['start'], entity['end']
            redacted = redacted[:start] + "[REDACTED]" + redacted[end:]
    return redacted

# Sanity check
sample_text = "Mr. Alexander Petrov signed the contract with Apple Inc. in Toronto."
print(f"Original: {sample_text}")
print(f"Redacted: {redact_text(sample_text)}")

Original: Mr. Alexander Petrov signed the contract with Apple Inc. in Toronto.
Redacted: [REDACTED] signed the contract with [REDACTED] in [REDACTED].
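The reason for replacing from the end of the text backwards can be shown without the model. In this sketch the entity spans are computed by hand to stand in for pipeline output, and both helper functions are illustrative.

```python
MASK = "[REDACTED]"


def redact_reverse(text, spans):
    """Replace spans from last to first so earlier offsets stay valid."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + MASK + text[end:]
    return text


def redact_forward(text, spans):
    """Naive left-to-right version: offsets go stale after the first replacement."""
    for start, end in sorted(spans):
        text = text[:start] + MASK + text[end:]
    return text


sample = "Alexander Petrov met Apple Inc. in Toronto."
# Hypothetical entity spans, located by string search instead of a model
spans = [
    (sample.index(s), sample.index(s) + len(s))
    for s in ("Alexander Petrov", "Apple Inc.", "Toronto")
]

correct = redact_reverse(sample, spans)
broken = redact_forward(sample, spans)
print("reverse order:", correct)
print("forward order:", broken)
```

The forward version corrupts the text as soon as a replacement changes the string length, because the remaining offsets still refer to the original string.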


### 7. DistilBERT Finetuning (Mit)

Using tokenized datasets: train_distilbert, test_distilbert, validation_distilbert

### 8. Pre-trained DistilBERT - Benchmarking (Sayed)

Using tokenized datasets: train_distilbert, test_distilbert, validation_distilbert

### 9. RoBERTa, BERT-NER, ALBERT
These are some more models that can be used. Feel free to pick any of them and start working.