# Fine-tuning DistilBERT for Named Entity Recognition (NER)

This notebook demonstrates how to fine-tune a DistilBERT model for named entity recognition using the CoNLL-2003 dataset. The process includes data preparation, model configuration, training, and evaluation.

## Setup

Install required libraries and import dependencies for the project.

In [None]:
!pip install transformers
!pip install datasets
!pip install accelerate -U
!pip install evaluate

import re
import torch
import pandas as pd

import numpy as np
import datasets
import transformers
import evaluate
import matplotlib.pyplot as plt

## Data Preparation

Define the label set for CoNLL-2003 named entities. The dataset contains four entity types:
- PER: Person names
- ORG: Organizations
- LOC: Locations
- MISC: Miscellaneous entities

Each entity type has a Beginning (B-) tag and an Inside (I-) tag, and 'O' marks non-entity tokens, giving nine labels in total.

In [3]:
label2id = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
id2label = {value: key for key, value in label2id.items()}

num_labels = len(label2id)
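
To make the tagging scheme concrete, the sketch below tags one sentence and round-trips the tags through `label2id` and `id2label`. The sentence and its tags are made up for illustration and are not taken from the dataset.

```python
# Label maps, repeated here so the snippet runs on its own.
label2id = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
            'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
id2label = {value: key for key, value in label2id.items()}

# Hypothetical example sentence with hand-assigned BIO tags.
tokens = ["Angela", "Merkel", "visited", "the", "United", "Nations", "in", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC", "I-LOC", "O"]

# Encode tags as integer ids (the form stored in `ner_tags`) and decode them back.
tag_ids = [label2id[tag] for tag in tags]
decoded = [id2label[i] for i in tag_ids]

for token, tag, tag_id in zip(tokens, tags, tag_ids):
    print(f"{token:<8} {tag:<6} {tag_id}")
```

A multi-word entity such as "United Nations" starts with a B- tag and continues with I- tags, which is how adjacent entities of the same type stay distinguishable.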

## Model Definition

Initialize the DistilBERT model for token classification and freeze the base model layers. We only want to fine-tune the classification head for this task, keeping the language model's core knowledge intact.

In [9]:
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

# Passing the label maps stores readable tag names in the model config.
model = DistilBertForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_labels, id2label=id2label, label2id=label2id
)
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")


for param in model.distilbert.parameters():
    param.requires_grad = False

print("Model and tokenizer loaded, and base model parameters frozen.")

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model and tokenizer loaded, and base model parameters frozen.


## Dataset Loading

Load the CoNLL-2003 NER dataset, which contains annotated sentences with named entity tags.

In [6]:
from datasets import load_dataset
dataset = load_dataset("eriktks/conll2003")

print(dataset)

The repository for eriktks/conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/eriktks/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


## Dataset Processing and DataLoaders

Create a custom dataset class to handle token-level NER tagging. This involves:
1. Tokenizing texts with the DistilBERT tokenizer
2. Aligning labels with tokenized words (handling special tokens and subwords)
3. Setting up data loaders for batch processing
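
The alignment step can be sketched in isolation. The helper below mirrors the rule used in this notebook: the first sub-token of each word carries the word's label, while special tokens, padding, and later sub-tokens get `-100`. The function name `align_labels` and the hand-written `word_ids` list are assumptions for illustration; in the notebook the word ids come from the fast tokenizer's `word_ids()` method.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label only the first sub-token of each word; mask everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special token, padding, or a continuation sub-token.
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous_word_idx = word_idx
    return aligned

# Hypothetical tokenization of three words: word 0 splits into three pieces,
# words 1 and 2 stay whole. None marks [CLS], [SEP], and padding positions.
word_ids = [None, 0, 0, 0, 1, 2, None, None, None]
word_labels = [1, 0, 5]  # B-PER, O, B-LOC

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, -100, 0, 5, -100, -100, -100]
```

Labelling only the first sub-token means each word contributes exactly one prediction to the loss and to the accuracy, regardless of how many pieces the tokenizer splits it into.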

In [None]:
from torch.utils.data import Dataset, DataLoader

class NERDataset(Dataset):
    def __init__(self, data, tokenizer=tokenizer, max_length=128):
        """
        Initialize the dataset with data and tokenizer.
        Args:
            data (Dataset): Dataset split (train/validation/test) from Hugging Face `datasets`.
            tokenizer (Tokenizer): Tokenizer to process the text data.
            max_length (int): Maximum sequence length for tokenization.
        """
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        """
        Return the number of samples in the dataset.
        """
        return len(self.data)

    def __getitem__(self, idx):
        """
        Process and return a single data sample.
        Args:
            idx (int): Index of the data sample.
        Returns:
            Dict: Tokenized inputs and aligned labels.
        """
        text = self.data[idx]["tokens"]
        labels = self.data[idx]["ner_tags"]

        tokenized_inputs = self.tokenizer(
            text,
            is_split_into_words=True,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )

        word_ids = tokenized_inputs.word_ids()
        aligned_labels = []
        previous_word_idx = None

        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                aligned_labels.append(-100)
            else:
                aligned_labels.append(labels[word_idx])
            previous_word_idx = word_idx

        tokenized_inputs["labels"] = torch.tensor(aligned_labels, dtype=torch.long)

        return {
            "input_ids": tokenized_inputs["input_ids"].squeeze(0),
            "attention_mask": tokenized_inputs["attention_mask"].squeeze(0),
            "labels": tokenized_inputs["labels"]
        }

def create_dataloader(dataset_split, tokenizer, max_length=128, batch_size=16):
    """
    Create a DataLoader for the NER dataset.
    Args:
        dataset_split (Dataset): Dataset split (train/validation/test).
        tokenizer (Tokenizer): Tokenizer for tokenizing the dataset.
        max_length (int): Maximum sequence length.
        batch_size (int): Batch size for DataLoader.
    Returns:
        DataLoader: Torch DataLoader for batching.
    """
    ner_dataset = NERDataset(data=dataset_split, tokenizer=tokenizer, max_length=max_length)
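    # Note: shuffle=True applies to every split built with this helper. Shuffling only matters
    # for training; it does not change the accuracy computed on validation or test data.
    return DataLoader(ner_dataset, batch_size=batch_size, shuffle=True)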

train_dataloader = create_dataloader(dataset["train"], tokenizer)
val_dataloader = create_dataloader(dataset["validation"], tokenizer)

for batch in train_dataloader:
    print(batch)
    break


{'input_ids': tensor([[  101, 12627,  2727,  ...,     0,     0,     0],
        [  101,  2348,  8275,  ...,     0,     0,     0],
        [  101,  1996, 18178,  ...,     0,     0,     0],
        ...,
        [  101,  1017,  1011,  ...,     0,     0,     0],
        [  101,  2273,  1005,  ...,     0,     0,     0],
        [  101, 13848,  3806,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[-100,    5,    0,  ..., -100, -100, -100],
        [-100,    0,    0,  ..., -100, -100, -100],
        [-100,    0,    7,  ..., -100, -100, -100],
        ...,
        [-100,    0,    0,  ..., -100, -100, -100],
        [-100,    0,    0,  ..., -100, -100, -100],
        [-100,    0,    0,  ..., -100, -100, -100]])}


## Custom Training Loop

Define a training function that handles both training and validation. This function includes:
- Gradient updates with AdamW optimizer
- Progress tracking with tqdm
- Loss calculation for token classification
- Model checkpointing to save the best model
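
The loss calculation relies on skipping every position labelled `-100`. The pure-Python sketch below shows what that masking amounts to: cross-entropy is averaged over labelled positions only. The function name `masked_cross_entropy` and the toy logits are assumptions for illustration; the notebook itself filters the tensors and then calls PyTorch's `CrossEntropyLoss`, whose default `ignore_index` is also `-100`.

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index.

    Assumes at least one position is labelled.
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens, padding, and continuation sub-tokens
        # Log-sum-exp with max subtraction for numerical stability.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[label])
    return sum(losses) / len(losses)

# Toy example with three classes and four token positions.
logits = [
    [0.0, 0.0, 0.0],  # masked position
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],  # masked position
    [0.0, 3.0, 0.0],
]
labels = [-100, 0, -100, 1]

loss = masked_cross_entropy(logits, labels)
print(f"masked loss: {loss:.4f}")
```

Because masked positions are skipped entirely, changing the logits at those positions leaves the loss unchanged, so padding never influences the gradients.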

In [11]:
import torch
from torch.nn import CrossEntropyLoss
from tqdm import tqdm
from torch.optim import AdamW

# Training function
def train_model(model, train_dataloader, val_dataloader, num_epochs=3, learning_rate=5e-5, device="cuda"):
    """
    Train the model and evaluate on the validation dataset at each epoch.
    Args:
        model (nn.Module): DistilBERT model for token classification.
        train_dataloader (DataLoader): DataLoader for the training set.
        val_dataloader (DataLoader): DataLoader for the validation set.
        num_epochs (int): Number of training epochs.
        learning_rate (float): Learning rate for the optimizer.
        device (str): Device to train the model ('cuda' or 'cpu').
    """
    model.to(device)

    optimizer = AdamW(model.parameters(), lr=learning_rate)
    criterion = CrossEntropyLoss()

    best_accuracy = 0.0
    best_model_state = None

    for epoch in range(num_epochs):
        print(f"Epoch {epoch + 1}/{num_epochs}")

        model.train()
        train_loss = 0.0
        for batch in tqdm(train_dataloader, desc="Training"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            optimizer.zero_grad()

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            active_loss = labels != -100
            active_logits = logits.view(-1, logits.size(-1))[active_loss.view(-1)]
            active_labels = labels.view(-1)[active_loss.view(-1)]
            loss = criterion(active_logits, active_labels)
            train_loss += loss.item()

            loss.backward()
            optimizer.step()

        avg_train_loss = train_loss / len(train_dataloader)
        print(f"Training loss: {avg_train_loss:.4f}")

        model.eval()
        val_loss = 0.0
        correct = 0
        total = 0

        with torch.no_grad():
            for batch in tqdm(val_dataloader, desc="Validation"):
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                labels = batch["labels"].to(device)

                outputs = model(input_ids, attention_mask=attention_mask)
                logits = outputs.logits

                active_loss = labels != -100
                active_logits = logits.view(-1, logits.size(-1))[active_loss.view(-1)]
                active_labels = labels.view(-1)[active_loss.view(-1)]
                loss = criterion(active_logits, active_labels)
                val_loss += loss.item()

                predictions = torch.argmax(active_logits, dim=-1)
                correct += (predictions == active_labels).sum().item()
                total += active_labels.size(0)

        avg_val_loss = val_loss / len(val_dataloader)
        accuracy = correct / total
        print(f"Validation loss: {avg_val_loss:.4f}, Accuracy: {accuracy:.4f}")

        if accuracy > best_accuracy:
            best_accuracy = accuracy
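            # state_dict() returns references to the live tensors, so clone them;
            # otherwise later epochs would silently overwrite the "best" weights.
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}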

    print(f"Best validation accuracy: {best_accuracy:.4f}")

    if best_model_state is not None:
        torch.save(best_model_state, "best_ner_model.pth")
        print("Best model saved as 'best_ner_model.pth'")

## Model Training

Train the model using the custom training function with specified hyperparameters.

In [12]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

num_epochs = 3
learning_rate = 5e-5
batch_size = 16
max_length = 128

train_dataloader = create_dataloader(dataset["train"], tokenizer, max_length=max_length, batch_size=batch_size)
val_dataloader = create_dataloader(dataset["validation"], tokenizer, max_length=max_length, batch_size=batch_size)

train_model(
    model=model,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    num_epochs=num_epochs,
    learning_rate=learning_rate,
    device=device
)

Using device: cuda
Epoch 1/3


Training: 100%|██████████| 878/878 [00:50<00:00, 17.46it/s]


Training loss: 0.8973


Validation: 100%|██████████| 204/204 [00:11<00:00, 18.18it/s]


Validation loss: 0.5803, Accuracy: 0.8350
Epoch 2/3


Training: 100%|██████████| 878/878 [00:49<00:00, 17.57it/s]


Training loss: 0.5038


Validation: 100%|██████████| 204/204 [00:11<00:00, 18.01it/s]


Validation loss: 0.4126, Accuracy: 0.8831
Epoch 3/3


Training: 100%|██████████| 878/878 [00:50<00:00, 17.26it/s]


Training loss: 0.3792


Validation: 100%|██████████| 204/204 [00:11<00:00, 17.55it/s]


Validation loss: 0.3145, Accuracy: 0.9189
Best validation accuracy: 0.9189
Best model saved as 'best_ner_model.pth'


## Model Evaluation

Evaluate the best saved model on the test split. The reported metric is token-level accuracy, computed over the first sub-token of each word.

In [13]:
def evaluate_model(model, test_dataloader, device="cuda"):
    """
    Evaluate the model on the test dataset and report accuracy.
    Args:
        model (nn.Module): The trained DistilBERT model.
        test_dataloader (DataLoader): DataLoader for the test set.
        device (str): Device to evaluate the model ('cuda' or 'cpu').
    """
    # map_location lets a checkpoint saved on GPU load on a CPU-only machine.
    model.load_state_dict(torch.load("best_ner_model.pth", map_location=device))
    model.to(device)
    model.eval()

    correct = 0
    total = 0
    with torch.no_grad():
        for batch in tqdm(test_dataloader, desc="Testing"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            active_loss = labels != -100
            active_logits = logits.view(-1, logits.size(-1))[active_loss.view(-1)]
            active_labels = labels.view(-1)[active_loss.view(-1)]
            predictions = torch.argmax(active_logits, dim=-1)

            correct += (predictions == active_labels).sum().item()
            total += active_labels.size(0)

    accuracy = correct / total
    print(f"Test Accuracy: {accuracy:.4f}")

test_dataloader = create_dataloader(dataset["test"], tokenizer, max_length=128, batch_size=16)

evaluate_model(model, test_dataloader, device=device)

Testing: 100%|██████████| 216/216 [00:12<00:00, 17.11it/s]

Test Accuracy: 0.9173
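
Token accuracy can look high on NER because most tokens are tagged 'O'. NER results are usually reported as entity-level precision, recall, and F1, where a predicted entity counts only if both its span and its type match exactly. The sketch below is one way to compute that in pure Python; the helper names `extract_entities` and `entity_f1` and the toy sequences are assumptions for illustration, and this is not the metric the notebook computes. In practice a library such as `seqeval` is commonly used for this.

```python
def extract_entities(tags):
    """Return a set of (entity_type, start, end) spans from one BIO tag sequence."""
    entities = set()
    start, current_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel 'O' flushes the last span
        continues_span = tag.startswith("I-") and tag[2:] == current_type
        if continues_span:
            continue
        if current_type is not None:
            entities.add((current_type, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            # Assumed convention: an I- tag with no matching open span starts a new one.
            start, current_type = i, tag[2:]
        else:
            start, current_type = None, None
    return entities

def entity_f1(true_sequences, pred_sequences):
    """Exact-match entity precision, recall, and F1 over a list of sentences."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_sequences, pred_sequences):
        true_entities = extract_entities(true_tags)
        pred_entities = extract_entities(pred_tags)
        tp += len(true_entities & pred_entities)
        n_pred += len(pred_entities)
        n_true += len(true_entities)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: the prediction cuts the person entity short by one token.
true_tags = [["B-PER", "I-PER", "O", "O", "B-LOC"]]
pred_tags = [["B-PER", "O", "O", "O", "B-LOC"]]

token_accuracy = sum(t == p for t, p in zip(true_tags[0], pred_tags[0])) / len(true_tags[0])
precision, recall, f1 = entity_f1(true_tags, pred_tags)
print(f"token accuracy: {token_accuracy:.2f}, entity F1: {f1:.2f}")
```

In this toy case a single wrong token costs one of five tokens in accuracy but one of two entities in F1, which is why the two metrics can diverge.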





## Hugging Face Trainer Integration

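Train the model with Hugging Face's Trainer API for comparison with the custom training loop. The Trainer provides built-in features such as gradient accumulation, mixed precision, and checkpoint management. Note that the cell below passes the same `model` object that was already trained above, with the base layers still frozen, so this run continues from those weights rather than from the pretrained checkpoint. The accuracies are therefore not a like-for-like comparison with the custom loop.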

In [15]:
from transformers import Trainer, TrainingArguments
import evaluate

metric = evaluate.load("accuracy")
def compute_metrics(pred):
    """
    Compute accuracy for evaluation.
    Args:
        pred (EvalPrediction): Hugging Face evaluation prediction object.
    Returns:
        Dict: Accuracy metric.
    """
    labels = pred.label_ids
    predictions = pred.predictions.argmax(axis=-1)

    true_predictions = [
        pred for preds, labs in zip(predictions, labels)
        for pred, lab in zip(preds, labs) if lab != -100
    ]
    true_labels = [
        lab for labs in labels
        for lab in labs if lab != -100
    ]

    accuracy = metric.compute(predictions=true_predictions, references=true_labels)
    return {"accuracy": accuracy["accuracy"]}

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=NERDataset(dataset["train"], tokenizer),
    eval_dataset=NERDataset(dataset["validation"], tokenizer),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

trainer.save_model("best_huggingface_model")
print("Best model saved as 'best_huggingface_model'")

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.2396 | 0.234526 | 0.944744 |
| 2 | 0.2081 | 0.218406 | 0.948523 |
| 3 | 0.2101 | 0.213867 | 0.949399 |


Best model saved as 'best_huggingface_model'


## Results and Model Availability

The trained model and results have been published to Hugging Face for public access and further experimentation:

- Model: https://huggingface.co/aren-golazizian/distilbert-ner-finetuned-conll2003
- Results: https://huggingface.co/aren-golazizian/results