# Biomedical Named Entity Recognition with BioBERT

**Ready-to-run Jupyter/Colab notebook**: train, evaluate, and run inference for biomedical NER using `dmis-lab/biobert-base-cased-v1.1` (Hugging Face).

**Notebook contents**
1. Install & setup
2. Load dataset (BC5CDR example via `datasets`)
3. Preprocessing & token-label alignment (BIO)
4. Model setup (`AutoModelForTokenClassification`)
5. Training with `Trainer`
6. Evaluation
7. Inference helper & demo

**Notes**
- This notebook expects an environment with internet access (to download models/datasets). For Colab, select a GPU runtime.
- If you're behind a firewall, download datasets and models manually and adjust paths.
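
For the firewalled case, one way to "adjust paths" is to prefer a local directory when it exists and fall back to the Hub identifier otherwise. This is a minimal sketch: `resolve_source` and the `./local/...` directories are hypothetical names, not part of the notebook. `HF_HUB_OFFLINE` and `HF_DATASETS_OFFLINE` are environment variables read by the Hugging Face libraries, and they are best set before `transformers` and `datasets` are imported. How a local dataset copy is loaded depends on how it was saved, so the dataset path may need a different loading call.

```python
import os
from pathlib import Path


def resolve_source(local_dir: str, hub_id: str) -> str:
    """Return local_dir if it exists on disk, otherwise fall back to hub_id."""
    return local_dir if Path(local_dir).is_dir() else hub_id


# Hypothetical local copies; point these at wherever the files were placed.
MODEL_SOURCE = resolve_source("./local/biobert-base-cased-v1.1",
                              "dmis-lab/biobert-base-cased-v1.1")
DATASET_SOURCE = resolve_source("./local/bc5cdr", "tner/bc5cdr")

# Only request offline mode when a local copy was actually found.
if Path(MODEL_SOURCE).is_dir():
    os.environ["HF_HUB_OFFLINE"] = "1"
if Path(DATASET_SOURCE).is_dir():
    os.environ["HF_DATASETS_OFFLINE"] = "1"

print("model source:  ", MODEL_SOURCE)
print("dataset source:", DATASET_SOURCE)
```

The resolved values could then stand in for `MODEL_NAME` and `DATASET_NAME` in the configuration cell.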

In [23]:
!pip install "transformers>=4.44" "datasets>=2.21" "seqeval" "torch"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
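
The warning above names its own remedy. A minimal sketch of it, assuming the cell runs before any tokenizer is used in the session:

```python
import os

# Set the variable named in the warning, unless the user already chose a value.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
print("TOKENIZERS_PARALLELISM =", os.environ["TOKENIZERS_PARALLELISM"])
```

Using `setdefault` keeps any value that was exported in the shell, so the cell does not override a deliberate setting.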




## Imports

In [24]:
import os
from typing import List, Dict, Any
import transformers, datasets
import numpy as np
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

from seqeval.metrics import precision_score, recall_score, f1_score, classification_report


In [25]:
# Check versions
print('transformers', transformers.__version__)
print('datasets', datasets.__version__)

transformers 4.57.1
datasets 3.6.0


## Configuration

In [26]:
MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"
DATASET_NAME = "tner/bc5cdr"  # pre-split, tokenized BC5CDR with tags

label_list = ["O", "B-Chemical", "B-Disease", "I-Disease", "I-Chemical"]
id2label = {i: l for i, l in enumerate(label_list)}
label2id = {l: i for i, l in enumerate(label_list)}

## Load dataset and tokenizer

In [27]:
# Load dataset (BC5CDR for chemicals/diseases) via HuggingFace datasets
print("Loading dataset:", DATASET_NAME)
tner_dataset = load_dataset(DATASET_NAME)

print("Splits:", tner_dataset)

Loading dataset: tner/bc5cdr
Splits: DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5228
    })
    validation: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5330
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 5865
    })
})


In [28]:
# Inspect an example
example = tner_dataset['train'][0]
print('keys:', example.keys())
if 'tokens' in example:
    print('tokens sample:', example['tokens'][:40])
if 'tags' in example:
    print('tags sample:', example['tags'][:40])

keys: dict_keys(['tokens', 'tags'])
tokens sample: ['Naloxone', 'reverses', 'the', 'antihypertensive', 'effect', 'of', 'clonidine', '.']
tags sample: [1, 0, 0, 0, 0, 0, 1, 0]
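
The integer tags are easier to read once they are mapped through `id2label`. The sketch below repeats the label list from the configuration cell and the sample printed above so that it runs on its own:

```python
# Label order copied from the configuration cell.
label_list = ["O", "B-Chemical", "B-Disease", "I-Disease", "I-Chemical"]
id2label = dict(enumerate(label_list))

# Sample copied from the cell output above.
tokens = ['Naloxone', 'reverses', 'the', 'antihypertensive',
          'effect', 'of', 'clonidine', '.']
tags = [1, 0, 0, 0, 0, 0, 1, 0]

decoded = [id2label[t] for t in tags]
for token, label in zip(tokens, decoded):
    print(f"{token:<18}{label}")
```

Both drug names decode to `B-Chemical` and every other token to `O`.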


In [29]:
# Load the tokenizer used for preprocessing (tokenization and BIO label alignment)
print("Loading tokenizer:", MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Could not locate the tokenizer configuration file, will try to use the model config instead.


Loading tokenizer: dmis-lab/biobert-base-cased-v1.1


loading configuration file config.json from cache at /Users/hadarpur/.cache/huggingface/hub/models--dmis-lab--biobert-base-cased-v1.1/snapshots/924f12e0c3db7f156a765ad53fb6b11e7afedbc8/config.json
Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file vocab.txt from cache at /Users/hadarpur/.cache/huggingface/hub/models--dmis-lab--biobert-base-cased-v1.1/snapshots/924f12e0c3db7f156a765ad53fb6b11e7afedbc8/vocab.txt
loading file tokenizer.json from cache at None
loading file added_tokens.json from cac

## Tokenization + label alignment

In [30]:
# Utility to align labels for tokenized inputs
def tokenize_and_align_labels(examples):
    """
    Tokenize the list of token sequences and align the BIO labels
    to the resulting wordpieces.

    examples["tokens"]: List[List[str]]
    examples["tags"]:   List[List[int]]  (indices into label_list)
    """
        
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,  # because we already have word tokens
        truncation=True,
        padding=False,
    )
    
    all_labels = examples['tags']
    new_labels = []

    for i, labels in enumerate(all_labels):
        # word_ids maps each subtoken position to its originating word index
        word_ids = tokenized.word_ids(batch_index=i)

        previous_word_id = None
        label_ids = []

        for word_id in word_ids:
            if word_id is None:
                # Special tokens ([CLS], [SEP]) get -100 so the loss ignores them
                label_ids.append(-100)
            else:
                original_label_id = labels[word_id]

                if word_id != previous_word_id:
                    # First subtoken of the word: use original label
                    label_ids.append(original_label_id)
                else:
                    # Subsequent subtokens of the same word:
                    # convert B-* to I-* to respect BIO scheme
                    if original_label_id == label2id["B-Chemical"]:
                        label_ids.append(label2id["I-Chemical"])
                    elif original_label_id == label2id["B-Disease"]:
                        label_ids.append(label2id["I-Disease"])
                    else:
                        # For I-* or O, keep same
                        label_ids.append(original_label_id)

                previous_word_id = word_id

        new_labels.append(label_ids)

    tokenized["labels"] = new_labels
    return tokenized
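
To see the alignment rule without downloading a tokenizer, the same logic can be applied to a hand-written `word_ids` list. This is a sketch: `align_labels` is a hypothetical standalone helper, and the `word_ids` below are invented for illustration rather than produced by the BioBERT tokenizer.

```python
from typing import List, Optional

label_list = ["O", "B-Chemical", "B-Disease", "I-Disease", "I-Chemical"]
label2id = {label: i for i, label in enumerate(label_list)}

# B-* ids mapped to the matching I-* ids, used for continuation wordpieces.
B_TO_I = {
    label2id["B-Chemical"]: label2id["I-Chemical"],
    label2id["B-Disease"]: label2id["I-Disease"],
}


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to wordpiece-level label ids."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)  # special token, ignored by the loss
            continue
        label = word_labels[word_id]
        if word_id != previous_word_id:
            aligned.append(label)  # first wordpiece keeps the word's label
        else:
            aligned.append(B_TO_I.get(label, label))  # later pieces: B-* becomes I-*
        previous_word_id = word_id
    return aligned


# Invented example: word 0 split into three pieces, word 1 into one piece.
word_ids = [None, 0, 0, 0, 1, None]
word_labels = [label2id["B-Chemical"], label2id["O"]]
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, 4, 4, 0, -100]
```

Only the first piece of the chemical keeps `B-Chemical` (1); the two continuation pieces become `I-Chemical` (4).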

In [31]:
print("Tokenizing and aligning labels...")
remove_columns = tner_dataset["train"].column_names  # ["tokens", "tags"]
tokenized_datasets = tner_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=remove_columns,
)

tokenized_datasets

Tokenizing and aligning labels...


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5228
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5330
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5865
    })
})

## Data collator

In [32]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
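
Because tokenization above used `padding=False`, padding happens per batch in the collator. The sketch below shows the idea in plain Python: `pad_batch` is a hypothetical helper, the input ids are made up, and the real collator also returns tensors. It pads inputs with `pad_token_id` (0 in the model config) and labels with -100 so padded positions do not contribute to the loss.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]],
              pad_token_id: int = 0,
              label_pad_id: int = -100) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence in the batch to the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Made-up ids; only the lengths matter for the illustration.
batch = pad_batch([
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 1, 4, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 0, -100]},
])
for key, rows in batch.items():
    print(key, rows)
```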

## Load BioBERT token-classification model

In [33]:
print("Loading model:", MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

loading configuration file config.json from cache at /Users/hadarpur/.cache/huggingface/hub/models--dmis-lab--biobert-base-cased-v1.1/snapshots/924f12e0c3db7f156a765ad53fb6b11e7afedbc8/config.json
Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-Chemical",
    "2": "B-Disease",
    "3": "I-Disease",
    "4": "I-Chemical"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-Chemical": 1,
    "B-Disease": 2,
    "I-Chemical": 4,
    "I-Disease": 3,
    "O": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



Loading model: dmis-lab/biobert-base-cased-v1.1


loading weights file pytorch_model.bin from cache at /Users/hadarpur/.cache/huggingface/hub/models--dmis-lab--biobert-base-cased-v1.1/snapshots/924f12e0c3db7f156a765ad53fb6b11e7afedbc8/pytorch_model.bin
Attempting to create safetensors variant
Some weights of the model checkpoint at dmis-lab/biobert-base-cased-v1.1 were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
-

## Metrics

In [34]:
def compute_metrics(p):
    """
    p is an EvalPrediction with:
    - p.predictions: np.ndarray of logits, shape (num_examples, seq_len, num_labels)
    - p.label_ids:   np.ndarray of label ids, shape (num_examples, seq_len); -100 marks ignored positions
    """
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels: List[List[str]] = []
    true_predictions: List[List[str]] = []

    for pred_seq, label_seq in zip(predictions, labels):
        # filter out positions where label == -100
        valid_indices = label_seq != -100
        pred_seq = pred_seq[valid_indices]
        label_seq = label_seq[valid_indices]

        true_labels.append([id2label[int(l)] for l in label_seq])
        true_predictions.append([id2label[int(p_i)] for p_i in pred_seq])

    precision = precision_score(true_labels, true_predictions)
    recall = recall_score(true_labels, true_predictions)
    f1 = f1_score(true_labels, true_predictions)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
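
`seqeval` scores whole entities rather than individual tokens: a predicted entity counts only if its type and boundaries both match the gold entity. The sketch below illustrates that idea in plain Python. `extract_spans` and `span_prf` are hypothetical helpers written for this illustration, and they may differ from `seqeval` on edge cases.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Collect (type, start, end_exclusive) spans from one BIO tag sequence."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes an open span
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            spans.add((current, start, i))
            start, current = None, None
        # A span opens on B-*, or on an I-* that has nothing to continue.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, etype
    return spans


def span_prf(gold_seqs: List[List[str]], pred_seqs: List[List[str]]):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-Chemical", "O", "O", "B-Disease", "I-Disease"]]
pred = [["B-Chemical", "O", "O", "B-Disease", "O"]]
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

In the example the disease is predicted one token too short, so it counts as both a false positive and a false negative even though four of the five token labels are correct.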

## Training configuration

In [35]:
output_dir = "./biobert_bc5cdr_ner"

training_args = TrainingArguments(
    output_dir=output_dir,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


## Trainer

In [36]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,  # deprecated in recent transformers releases in favor of processing_class=tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



## Training + evaluation

In [37]:
def train_and_evaluate():
    # Train
    trainer.train()

    # Evaluate on validation and test
    print("Validation metrics:")
    val_metrics = trainer.evaluate(tokenized_datasets["validation"])
    print(val_metrics)

    print("Test metrics:")
    test_metrics = trainer.evaluate(tokenized_datasets["test"])
    print(test_metrics)

    # Optional: detailed report on test set
    print("Detailed seqeval report on test set:")
    predictions, labels, _ = trainer.predict(tokenized_datasets["test"])
    predictions = np.argmax(predictions, axis=2)

    true_labels = []
    true_predictions = []

    for pred_seq, label_seq in zip(predictions, labels):
        valid_indices = label_seq != -100
        pred_seq = pred_seq[valid_indices]
        label_seq = label_seq[valid_indices]

        true_labels.append([id2label[int(l)] for l in label_seq])
        true_predictions.append([id2label[int(p_i)] for p_i in pred_seq])

    print(classification_report(true_labels, true_predictions))

In [None]:
train_and_evaluate()

***** Running training *****
  Num examples = 5,228
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3,270
  Number of trainable parameters = 107,723,525
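
The step count in the log follows from the dataset size and the training arguments. As a quick check, assuming a single device and no dropped last batch:

```python
import math

num_examples = 5228   # train split size
batch_size = 8        # per_device_train_batch_size
num_epochs = 5        # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 654 3270

# Size of the new classification head: one linear layer from hidden_size to num_labels.
hidden_size, num_labels = 768, 5
head_params = hidden_size * num_labels + num_labels  # weights + biases
print(head_params)  # 3845
```

The 3,845 head parameters are the only weights trained from scratch; the rest are initialized from the BioBERT checkpoint.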


Epoch,Training Loss,Validation Loss


## Inference helper

In [None]:
def ner_inference(text: str, max_length: int = 256):
    """
    Run NER on a new biomedical sentence/abstract.
    Returns entities with type and char spans.
    """
    model.eval()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Tokenize as raw text (not pre-split)
    encoded = tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        outputs = model(**encoded)
        logits = outputs.logits  # (1, seq_len, num_labels)
        pred_ids = torch.argmax(logits, dim=-1).cpu().numpy()[0]

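    # Convert ids back to wordpiece strings so special tokens can be detected
    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
    # Reconstruct entity spans in a simple way: group consecutive non-"O" tokens
    # of the same type (a B-* tag directly after an entity of that type does not
    # start a new span, so adjacent same-type entities are merged)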
    entities = []
    current_entity = None

    # skip [CLS] (0) and stop at [SEP]
    for i, (token, label_id) in enumerate(zip(tokens, pred_ids)):
        if token in [tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token]:
            if current_entity is not None:
                entities.append(current_entity)
                current_entity = None
            if token == tokenizer.sep_token:
                break
            continue

        label = id2label[int(label_id)]
        if label == "O":
            if current_entity is not None:
                entities.append(current_entity)
                current_entity = None
            continue

        # label is B-* or I-*
        label_type = label.split("-", 1)[1]

        # Approximate char span via tokenizer offsets
        offsets = encoded.token_to_chars(i)
        if offsets is None:
            # This can happen rarely; we just skip char span
            start_char, end_char = None, None
        else:
            start_char, end_char = offsets.start, offsets.end

        if current_entity is None:
            current_entity = {
                "type": label_type,
                "text": text[start_char:end_char] if start_char is not None else token,
                "start": start_char,
                "end": end_char,
            }
        else:
            # Same type? continue span
            if current_entity["type"] == label_type:
                if start_char is not None and end_char is not None:
                    # extend span
                    current_entity["end"] = end_char
                    current_entity["text"] = text[current_entity["start"]:current_entity["end"]]
            else:
                # different type, close previous and start new
                entities.append(current_entity)
                current_entity = {
                    "type": label_type,
                    "text": text[start_char:end_char] if start_char is not None else token,
                    "start": start_char,
                    "end": end_char,
                }

    if current_entity is not None:
        entities.append(current_entity)

    return entities
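
The grouping step can be tried without a trained model by feeding it hand-written tags and character offsets. In the sketch below, `group_entities` is a hypothetical standalone version of the loop above, and the tags are written by hand for illustration rather than predicted by the model.

```python
from typing import Dict, List, Tuple


def group_entities(text: str, tagged: List[Tuple[str, int, int]]) -> List[Dict]:
    """Merge consecutive non-"O" tokens of the same type into character spans.

    tagged holds one (label, start_char, end_char) triple per token.
    """
    spans = []       # closed spans as [type, start, end]
    current = None   # span currently being extended
    for label, start, end in tagged:
        if label == "O":
            if current is not None:
                spans.append(current)
                current = None
            continue
        etype = label.split("-", 1)[1]
        if current is not None and current[0] == etype:
            current[2] = end  # same type: extend the open span
        else:
            if current is not None:
                spans.append(current)
            current = [etype, start, end]
    if current is not None:
        spans.append(current)
    return [{"type": t, "text": text[s:e], "start": s, "end": e} for t, s, e in spans]


text = "Paracetamol can cause liver toxicity in high doses."
# Hand-written word-level tags with character offsets into `text`.
tagged = [
    ("B-Chemical", 0, 11), ("O", 12, 15), ("O", 16, 21),
    ("B-Disease", 22, 27), ("I-Disease", 28, 36),
    ("O", 37, 39), ("O", 40, 44), ("O", 45, 50), ("O", 50, 51),
]
entities = group_entities(text, tagged)
for entity in entities:
    print(entity)
```

As in the notebook function, a `B-*` tag that directly follows an entity of the same type extends that entity instead of opening a new one.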


In [None]:
# Example inference after training:
example = "Paracetamol can cause liver toxicity in high doses."
ents = ner_inference(example)
print("\nExample inference:")
print("Text:", example)
for e in ents:
    print(e)


Example inference:
Text: Paracetamol can cause liver toxicity in high doses.
{'type': 'Chemical', 'text': 'Paracetamol', 'start': 0, 'end': 11}
{'type': 'Disease', 'text': 'liver toxicity', 'start': 22, 'end': 36}
