<a href="https://colab.research.google.com/github/IyadSultan/educational/blob/main/train_a_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a Transformer for Medical Problem NER in Clinical Text

Named Entity Recognition (NER) in healthcare can automatically highlight medical problems (like diseases or symptoms) in clinical notes, helping clinicians quickly identify key conditions. In this hands-on tutorial, we will fine-tune a Transformer model to recognize medical problems in text using Hugging Face 🤗 Transformers. We’ll use a domain-specific BERT model (BioClinicalBERT) and an open medical NER dataset, and walk through the entire process from data loading to evaluation.

What you will learn:
1. Setting up the Colab environment and installing necessary libraries.
2. Loading an open-access medical NER dataset from the Hugging Face Hub.
3. Preprocessing clinical text data and aligning entity labels with subword tokens.
4. Choosing a suitable pre-trained Transformer model (e.g. BioClinicalBERT) for clinical text.
5. Fine-tuning the model for NER using the Hugging Face Trainer API.
6. Evaluating model performance (precision, recall, F1-score) before and after fine-tuning.
7. Testing the model on example clinical text to see qualitative improvements in NER.

Prerequisites: Basic Python coding and familiarity with general concepts of machine learning. We will explain NLP and Transformer concepts in simple terms, so healthcare professionals with beginner coding experience should be able to follow along.

# 1. Environment Setup
First, we need to set up our environment. This tutorial is designed for Google Colab – make sure you've selected a GPU runtime for faster training (go to Runtime > Change runtime type > Hardware accelerator > GPU). We’ll install Hugging Face’s Transformers, Datasets, and evaluation libraries.

In [13]:
!pip install transformers datasets evaluate seqeval



*Explanation:* The transformers library provides the model and training tools, datasets lets us easily load the dataset, evaluate loads evaluation metrics, and seqeval (used through evaluate) computes NER metrics such as precision, recall, and F1. After running this, you should see the libraries installing.

In [14]:
# Let’s also verify that a GPU is available:
import torch
print("GPU available:", torch.cuda.is_available())


GPU available: True


If the output says GPU available: True, we’re ready to go (Colab Pro’s GPU will help speed up training). If it’s False, double-check the runtime settings.

# 2. Loading the Dataset
We will use an open-access dataset from the Hugging Face Hub. One good option is the NCBI Disease corpus, which contains biomedical text (PubMed abstracts) annotated with disease names. This **English-language dataset** is open-access and suitable for our task of extracting medical problems. Let's load the dataset using 🤗 Datasets:

In [15]:
from datasets import load_dataset

# Load the NCBI Disease dataset (with train/val/test splits)
raw_datasets = load_dataset("ncbi_disease")
print(raw_datasets)


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 5433
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 924
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 941
    })
})


This should download the dataset and show the splits (train/validation/test). The NCBI Disease corpus has 793 abstracts with disease annotations, already split into 5,433 training sentences, 924 validation sentences, and 941 test sentences. Besides an "id", each data sample in this dataset is a sentence with two fields:
- "tokens": the list of word tokens in the sentence.
- "ner_tags": the list of numeric labels, one per token (0 = outside any entity, 1 = beginning of a disease entity, 2 = inside a disease entity).

Let’s inspect an example from the training set:

In [16]:
# Peek at one training example
example = raw_datasets["train"][0]
print(example["tokens"])
print(example["ner_tags"])


['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']
[0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]


In this example, the tokens ['adenomatous', 'polyposis', 'coli', 'tumour'] have tags [1, 2, 2, 2], meaning "adenomatous" is tagged as B-disease (beginning of a disease name) and the next three tokens are I-disease (continuation of the disease name), while all other tokens are 0 (not an entity). The disease mentioned here is “adenomatous polyposis coli tumour.”
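
To make the tag scheme concrete, the following minimal sketch turns a token list and its numeric tags back into entity strings. The helper `extract_entities` is hypothetical (it is not part of the dataset or of any library) and assumes the tag ids described above: 1 for the beginning of a disease mention and 2 for its continuation.

```python
# Hypothetical helper (not part of the dataset or the Transformers library):
# collect the text of each entity marked by B-/I- style numeric tags.
def extract_entities(tokens, tags, begin_tag=1, inside_tag=2):
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == begin_tag:
            # A B- tag always starts a new entity, closing any open one
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == inside_tag and current:
            # An I- tag extends the entity that is currently open
            current.append(token)
        else:
            # An O tag (or a stray I- tag with no open entity) closes the span
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tokens = ['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the',
          'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']
tags = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]
entities = extract_entities(tokens, tags)
print(entities)  # ['adenomatous polyposis coli tumour']
```

The same decoding step is what we will eventually apply to the model's predicted tags to highlight disease mentions in a sentence.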

**Understanding the task**: Our goal is to train a model that takes a sequence of tokens (a sentence) and outputs the correct tag (O, B-Disease, I-Disease) for each token, thereby identifying spans of text that are medical problems (diseases, in this dataset).

# 3. Data Preprocessing and Tokenization
Before training, we need to preprocess the data for our Transformer model. Transformers like BERT cannot take raw text strings directly for NER; we must:
- Tokenize the text with the model’s tokenizer.
- Align the labels to the tokenizer’s output tokens.

*Why align labels?*

Models like BERT use subword tokenization. A single word (e.g., "diabetes") might be broken into multiple subtokens (e.g., "dia", "##betes"). Our dataset labels are at the word level, so we need to create a label for each subtoken as well. Typically, we give the same label to all subtokens of a word, or label only the first subtoken and mark the rest as “ignore” in the loss calculation. We will use the Hugging Face AutoTokenizer to handle tokenization. Let’s choose our model’s tokenizer (we will use BioClinicalBERT in the next section, which has a fast tokenizer available):

In [17]:
from transformers import AutoTokenizer

model_checkpoint = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
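
To make the subword splitting described above concrete, here is a toy greedy longest-match-first splitter in the style of WordPiece. It is only a sketch: the vocabulary `TOY_VOCAB` is invented for illustration, the helper `wordpiece` is hypothetical, and the real BioClinicalBERT tokenizer has a much larger vocabulary and may split these words differently.

```python
# Toy vocabulary, invented for illustration; the real BioClinicalBERT
# vocabulary is far larger and may split these words differently.
TOY_VOCAB = {"dia", "##betes", "the", "patient", "has", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

words = ["the", "patient", "has", "diabetes"]
subtokens, word_ids = [], []
for idx, word in enumerate(words):
    for piece in wordpiece(word):
        subtokens.append(piece)
        word_ids.append(idx)  # remember which word each piece came from

print(subtokens)  # ['the', 'patient', 'has', 'dia', '##betes']
print(word_ids)   # [0, 1, 2, 3, 3]
```

The `word_ids` list is the key piece: it records which original word each subtoken came from, which is the information needed to carry word-level labels over to subtokens.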


Now we define a function to tokenize the inputs and align the NER tags. We'll use is_split_into_words=True because our dataset already provides a list of tokens for each example (so we don't want the tokenizer to re-split the sentence into words, but rather tokenize each given token).

**For each tokenized example:**
- We get a mapping of token indices to original word indices using tokenized_inputs.word_ids(batch_index).
- We give the first subtoken of each word that word's original label and use -100 for the remaining subtokens. The value -100 is a special label that tells the loss function to ignore those positions (so we don't double-count a multi-subtoken word).

In [18]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], is_split_into_words=True, truncation=True)
    all_labels = examples["ner_tags"]
    aligned_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their word index in the original example
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                # Special token like [CLS] or [SEP]
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # Start of a new word, use that word's label
                label_ids.append(labels[word_idx])
            else:
                # Continuation of the same word, mark for ignoring
                label_ids.append(-100)
            previous_word_idx = word_idx
        aligned_labels.append(label_ids)
    # Include the aligned labels in the tokenized input
    tokenized_inputs["labels"] = aligned_labels
    return tokenized_inputs

# Apply the tokenization and alignment to the entire dataset
tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)



*Explanation:* We iterate through each example’s word indices from the tokenizer. When word_idx is None, it corresponds to special tokens (like [CLS] start or [SEP] end tokens) – we assign -100 so that they are ignored in training. When we see a new word (word index not equal to the previous one), we take that word’s original label (labels[word_idx]). If the tokenizer output is still on the same word as the previous token (meaning the word was split into multiple subtokens), we also assign -100 to those subtokens. This way, only the first subtoken of a word carries the label, and the loss will be calculated only once per original word.

**Note: An alternative would be to label subsequent subtokens as "I-" (inside) tags of the same entity. In our simple approach, we ignore them for loss calculation to avoid overweighting long words. This is a common technique to handle subword tokenization in NER.**
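
For comparison, the alternative mentioned in the note can be sketched as follows. This is an illustration only, not the approach used in the rest of this tutorial: the helper `align_all_subtokens` and the mapping `B_TO_I` are hypothetical, the `word_ids` list is written by hand rather than produced by the tokenizer, and the tag ids assume the scheme used here (1 = B-Disease, 2 = I-Disease).

```python
# Alternative alignment strategy (a sketch): every subtoken receives a label.
# Continuation pieces of a B- word become I-, so an entity still has one B- tag.
B_TO_I = {1: 2}  # assumes the tutorial's scheme: 1 = B-Disease, 2 = I-Disease

def align_all_subtokens(word_ids, word_labels):
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)  # special tokens stay ignored
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece keeps the word's label
        else:
            label = word_labels[word_id]
            aligned.append(B_TO_I.get(label, label))  # later pieces: B- becomes I-
        previous = word_id
    return aligned

# Hand-written word_ids for "[CLS] ad ##eno ##mat ##ous p ##oly ##po ##sis [SEP]"
word_ids = [None, 0, 0, 0, 0, 1, 1, 1, 1, None]
word_labels = [1, 2]  # adenomatous = B-Disease, polyposis = I-Disease
aligned = align_all_subtokens(word_ids, word_labels)
print(aligned)  # [-100, 1, 2, 2, 2, 2, 2, 2, 2, -100]
```

With this strategy a long word contributes several positions to the loss, which is exactly the overweighting that the first-subtoken approach avoids.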

Now that we have tokenized_datasets, each entry has not only "input_ids" and "attention_mask" (from tokenization) but also an aligned "labels" sequence for training. We can verify the tokenization on our earlier example to see how the labels align:

In [19]:
# Check tokenization and label alignment on the example
tokens = tokenizer(example["tokens"], is_split_into_words=True)
print("Original tokens:", example["tokens"])
print("Original labels:", example["ner_tags"])
print("Subword tokens:", tokens.tokens())
print("Aligned labels:", tokenize_and_align_labels({"tokens": [example["tokens"]], "ner_tags": [example["ner_tags"]]})["labels"][0])


Original tokens: ['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']
Original labels: [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0]
Subword tokens: ['[CLS]', 'identification', 'of', 'a', '##p', '##c', '##2', ',', 'a', 'ho', '##mo', '##logue', 'of', 'the', 'ad', '##eno', '##mat', '##ous', 'p', '##oly', '##po', '##sis', 'co', '##li', 't', '##umour', 'suppress', '##or', '.', '[SEP]']
Aligned labels: [-100, 0, 0, 0, -100, -100, -100, 0, 0, 0, -100, -100, 0, 0, 1, -100, -100, -100, 2, -100, -100, -100, 2, -100, 2, -100, 0, -100, 0, -100]


This shows the original tokens and their labels, the subword tokens (including the special [CLS] and [SEP] tokens), and the aligned label for each subword token. You should see -100 for the special tokens and for every subtoken after the first in a split word. For instance, "polyposis" is split into ['p', '##oly', '##po', '##sis']: the first subtoken gets label 2 (I-disease) and the other three get -100.

# 4. Choosing a Pretrained Model (BioClinicalBERT)
Selecting a good pretrained model as a starting point is crucial. We need a model that understands clinical language. BioClinicalBERT is an excellent choice: it’s a BERT-based model that was pretrained on biomedical literature and clinical notes (MIMIC-III ICU records), giving it a strong grasp of medical terminology and context. This domain-specific pretraining should help it recognize disease names and other medical entities better than a generic BERT.

Other options could be BioBERT (pretrained on biomedical articles) or even a smaller model like DistilBERT fine-tuned on a medical NER dataset. However, BioClinicalBERT is well-suited for clinical text, and since we have a GPU, its size (base BERT, ~110M parameters) is manageable.

Let's load the model with a classification head for token classification (NER). We will use AutoModelForTokenClassification which adds a token-level classification layer on top of the transformer. We need to specify the number of labels our model will predict:

- Label 0: "O" (no entity)
- Label 1: "B-Disease" (beginning of a disease mention)
- Label 2: "I-Disease" (inside a disease mention)

We'll also pass **id2label and label2id** mappings for better clarity (so the model knows which label index corresponds to "B-Disease", etc., which will be useful for inference).

In [20]:
from transformers import AutoModelForTokenClassification

num_labels = 3  # O, B-Disease, I-Disease
id2label = {0: "O", 1: "B-Disease", 2: "I-Disease"}
label2id = {"O": 0, "B-Disease": 1, "I-Disease": 2}

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)


When this runs, it will download the BioClinicalBERT weights. You might see a warning that some weights are not used (those are for the original masked language modeling head) and some new weights are initialized (the classification head). That’s expected because we are repurposing the base model for NER.

**Note: BioClinicalBERT was initialized from BioBERT and then further trained on MIMIC-III clinical notes. It’s a domain-specific model well suited to clinical text. If you were dealing with general text, you might use a generic BERT base model, but for our medical task, leveraging domain knowledge should give us a boost.**
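
The two label mappings passed to the model above must be exact inverses of each other. One way to guarantee that, sketched here with an illustrative variable name `label_list`, is to derive both from a single ordered list whose positions match the numeric tags in the dataset.

```python
# Build both mappings from one ordered list so they cannot drift apart.
label_list = ["O", "B-Disease", "I-Disease"]  # index = label id used in ner_tags

id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)  # {0: 'O', 1: 'B-Disease', 2: 'I-Disease'}
print(label2id)  # {'O': 0, 'B-Disease': 1, 'I-Disease': 2}
```

These are the same dictionaries as the hand-written ones, so they can be passed to `from_pretrained` in the same way.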

# 5. Fine-Tuning the Model with Trainer API
Now we have our prepared dataset (tokenized_datasets) and our model. The next step is to fine-tune the model on the training data. We'll use Hugging Face’s Trainer API, which simplifies the training loop and handles things like gradient accumulation, evaluation, and more.

**Set up training arguments**: We need to specify how we train, e.g., number of epochs, batch size, learning rate, etc. For NER with a base BERT model:
- A few epochs (2-4) are usually sufficient for a dataset of this size.
- A learning rate around 2e-5 to 5e-5 works well for fine-tuning BERT.
- We’ll use the validation set to evaluate after each epoch.

We also define a compute_metrics function to compute precision, recall, and F1 using the seqeval metric. We will focus on the “Disease” entity class performance and overall metrics.

In [21]:
import numpy as np
import evaluate

# Load the seqeval metric
seqeval = evaluate.load("seqeval")

# Define compute_metrics to use during evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    true_labels = []
    true_predictions = []
    for pred, lab in zip(predictions, labels):
        # Keep only positions with a real label; -100 marks special tokens,
        # non-first subword pieces, and padding
        true_labels.append([id2label[l] for l in lab if l != -100])
        true_predictions.append([id2label[p] for p, l in zip(pred, lab) if l != -100])
    # Compute metrics using seqeval
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    # We will return overall precision, recall, F1, and accuracy
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }


*A quick explanation:* We take the logits (model outputs before softmax) and labels from the evaluation run. We argmax the logits to get the predicted label indices. We then drop every position whose label is -100 (special tokens, non-first subword pieces, and padding) from both the labels and the predictions, so that each remaining prediction is compared with the label at the same position. Simply truncating the predictions to the number of kept labels would misalign them, because the ignored positions are scattered throughout the sequence rather than sitting at the end. We map the numeric labels back to their string names (using id2label) and feed the lists to seqeval.compute(). Seqeval will calculate precision, recall, F1, etc., treating the labels as sequences of entity tags. We return the overall metrics. (Seqeval also provides class-specific metrics, but “overall” here reflects the single entity type we care about, Disease.)
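
To show what "treating the labels as sequences of entity tags" means in practice, here is a simplified sketch of entity-level scoring. It is written in the spirit of seqeval but is not seqeval's actual implementation: the helpers `entity_spans` and `entity_prf` are hypothetical, they assume a single entity type, and they ignore stray I- tags that do not follow a B- tag.

```python
# A simplified sketch of entity-level scoring in the spirit of seqeval
# (not seqeval's actual implementation): a predicted entity is correct only
# if its start and end match a reference entity exactly.
def entity_spans(tags):
    """Return the set of (start, end) spans described by B-/I- tag strings."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def entity_prf(reference, predicted):
    ref, pred = entity_spans(reference), entity_spans(predicted)
    correct = len(ref & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

reference = ["O", "B-Disease", "I-Disease", "O", "B-Disease"]
predicted = ["O", "B-Disease", "O",         "O", "B-Disease"]
precision, recall, f1 = entity_prf(reference, predicted)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

Only one of the five tags differs, so token-level accuracy is 4/5, yet one of the two reference entities is missed because its boundary is wrong. This is why entity-level F1 is usually a stricter and more informative number for NER than accuracy.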

In [22]:
# Now, we set up the TrainingArguments and the Trainer:

from transformers import TrainingArguments, Trainer, DataCollatorForTokenClassification


batch_size = 16
logging_steps = len(tokenized_datasets["train"]) // batch_size  # log once per epoch (approx)

# Pads input_ids, attention_mask, AND labels to the longest sequence in each batch.
# Labels are padded with -100, so the padding is ignored by the loss.
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="med_ner_model",
    save_strategy="no",
    # Evaluate at the end of every epoch. Recent transformers releases call this
    # argument eval_strategy; older releases call it evaluation_strategy.
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=logging_steps,
    log_level="error",
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,  # without this, variable-length labels cannot be batched
    compute_metrics=compute_metrics
)




Key points in these settings:
- The model is evaluated on the validation set at the end of each epoch.
- We chose 3 epochs. You can adjust this; often 2-4 is enough for convergence on a small dataset, and more can cause overfitting.
- per_device_train_batch_size=16 is reasonable for a base model on a GPU with enough memory (Colab Pro GPUs can handle this). If you get out-of-memory (OOM) errors, reduce the batch size.
- We disable saving checkpoints (save_strategy="no") and pushing to hub for simplicity, but in a real scenario, you might want to save the fine-tuned model.
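
As a quick sanity check on these settings, the arithmetic below works out how many optimizer steps the run takes. It is a sketch that assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept; the variable names are illustrative.

```python
import math

train_examples = 5433   # training sentences reported by the dataset printout
batch_size = 16
num_epochs = 3

# The tutorial's logging interval uses floor division
logging_steps = train_examples // batch_size

# The final partial batch still counts as an optimizer step in each epoch
# (assuming a single GPU and no gradient accumulation)
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(logging_steps, steps_per_epoch, total_steps)  # 339 340 1020
```

Because the logging interval (339) is one step shorter than an epoch (340), the logged training loss appears roughly, not exactly, once per epoch.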

Before training, let's optionally evaluate the model on the validation set to get a baseline. (Since the classification head is randomly initialized, we expect it to perform very poorly, likely predicting no entities correctly.)

In [23]:
# Evaluate the model *before* training (baseline performance)
baseline_metrics = trainer.evaluate()
print("Baseline performance:", baseline_metrics)


If this cell fails with a `ValueError` beginning "Unable to create tensor" that points at the `labels` feature, the label lists in a batch have different lengths and were not padded. Passing `data_collator=DataCollatorForTokenClassification(tokenizer)` to the Trainer pads the labels with -100 along with the inputs and resolves the error.
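
Batches must form rectangular tensors, so every field, including `labels`, has to be padded to a common length. The sketch below illustrates the padding idea behind a token-classification data collator. It is a simplification and not the library's code: the helper `pad_batch` is hypothetical, the token ids are made up, and the padding id of 0 is an assumption.

```python
# Sketch of the padding idea behind a token-classification data collator
# (a simplification, not the library's code).
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def pad_batch(features, pad_token_id=0):
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad)
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * pad)
    return batch

# Two made-up examples of different lengths (token ids are arbitrary)
features = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 0, 1, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 2, -100]},
]
batch = pad_batch(features)
print(batch["labels"])  # [[-100, 0, 1, -100], [-100, 2, -100, -100]]
```

Padding the labels with -100 rather than 0 matters: a 0 would be read as the "O" tag and the padded positions would wrongly contribute to the loss.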