# Task 1 – Medical Named Entity Recognition (NER)

In this notebook, we fine-tune encoder-only transformer models for a medical Named Entity Recognition (NER) task on Swedish text. 

The dataset consists of Swedish medical text from the 1177 Vårdguiden subset of the `swedish_medical_ner` dataset. The goal is to identify and classify three types of medical entities: disorders, pharmaceutical drugs, and body structures.


In [1]:
import numpy as np
import re
from datasets import load_dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer
)
from seqeval.metrics import precision_score, recall_score, f1_score

import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Querying the device name raises an error on CPU-only machines
    print("GPU:", torch.cuda.get_device_name(0))


CUDA available: True
GPU: NVIDIA GeForce RTX 4080 SUPER


## Dataset Loading

The dataset is loaded from the Hugging Face Hub with the Datasets library. Only the 1177 Vårdguiden subset is used, as specified in the assignment instructions. Each example contains a sentence annotated with entity spans (character offsets) and entity types.

Before training, the dataset is inspected and split into training, validation, and test sets.


## Data Splitting

The dataset is split following the prescribed setup:
- 80% of the data is used for training and validation
- 20% is held out as a test set and used only for final evaluation
- The training–validation portion is further split into 90% training data and 10% validation data

This ensures that the test set remains fully isolated during model development.
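As a quick sanity check on these proportions, the sketch below computes the resulting split sizes for a dataset of 927 examples, which is the size printed by the loading cell. It assumes that the held-out partition is rounded up at each split; `split_sizes` is a hypothetical helper written for illustration, not part of the Datasets API.

```python
import math

def split_sizes(n_examples, test_fraction=0.2, val_fraction=0.1):
    """Return (train, validation, test) sizes for the two-stage split.

    Assumption: the held-out partition of each split is rounded up.
    """
    n_test = math.ceil(n_examples * test_fraction)
    n_trainval = n_examples - n_test
    n_val = math.ceil(n_trainval * val_fraction)
    n_train = n_trainval - n_val
    return n_train, n_val, n_test

train_n, val_n, test_n = split_sizes(927)
print(train_n, val_n, test_n)
print(f"train share of the full data: {train_n / 927:.1%}")
```

Under that rounding assumption, 927 examples yield 666 training, 75 validation, and 186 test examples, i.e. roughly 72%, 8%, and 20% of the full data.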


In [2]:
ds = load_dataset("community-datasets/swedish_medical_ner", '1177')

full_ds = ds["train"]  # use all examples
print(len(full_ds))

split_80_20 = full_ds.train_test_split(
    test_size=0.2,
    seed=67
)

trainval_ds = split_80_20["train"]  # 80%
test_ds     = split_80_20["test"]   # 20% (final test set)

split_90_10 = trainval_ds.train_test_split(
    test_size=0.1,
    seed=67
)

train_ds = split_90_10["train"]  # 72%
val_ds   = split_90_10["test"]   # 8%

final_ds = DatasetDict({
    "train": train_ds,
    "validation": val_ds,
    "test": test_ds
})

print(final_ds.keys())



927
dict_keys(['train', 'validation', 'test'])


## Tagging Schemes

To train a token-level NER model, entity span annotations must be converted into token-level labels. Two standard tagging schemes are used in this notebook:

- **BIO (Beginning–Inside–Outside)**: Marks the beginning of an entity, continuation inside an entity, and non-entity tokens.
- **BIOES (Beginning–Inside–Outside–End–Single)**: Extends BIO by explicitly marking the end of multi-token entities and single-token entities.

Both tagging schemes are applied to the same underlying data in order to compare their impact on model performance.
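To make the difference between the two schemes concrete, here is a minimal sketch that converts a BIO tag sequence into BIOES. The helper `bio_to_bioes` and the example sentence tags are illustrative assumptions; the notebook itself derives both schemes directly from the entity spans rather than converting one into the other.

```python
def bio_to_bioes(tags):
    """Convert a well-formed BIO tag sequence to BIOES."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, entity_type = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- of the same type
        continues = next_tag == f"I-{entity_type}"
        if prefix == "B":
            converted.append(tag if continues else f"S-{entity_type}")
        else:  # prefix == "I"
            converted.append(tag if continues else f"E-{entity_type}")
    return converted

bio = ["O", "B-DISORDER", "I-DISORDER", "I-DISORDER", "O", "B-DRUG"]
bioes = bio_to_bioes(bio)
print(bioes)
```

The three-token disorder becomes `B-`, `I-`, `E-`, while the single-token drug becomes `S-DRUG`. BIOES therefore uses four tags per entity type instead of two, which gives the classifier more explicit boundary information at the cost of a larger label set.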


In [3]:
TYPE_MAP = {
    0: "DISORDER",
    1: "DRUG",
    2: "BODY"
}

In [4]:
WORD_RE = re.compile(r"\S+")

def words_with_offsets(text):
    words, offsets = [], []
    for match in WORD_RE.finditer(text):
        words.append(match.group())
        offsets.append((match.start(), match.end()))
    return words, offsets


def normalize_entities(entities):
    """
    Convert dataset-style entity dict-of-lists into list-of-dicts.
    """
    starts = entities["start"]
    ends = entities["end"]
    types = entities["type"]

    normalized = []
    for s, e, t in zip(starts, ends, types):
        normalized.append({
            "start": int(s),
            "end": int(e),
            "type": int(t)
        })
    return normalized

## Tokenization and Label Alignment

The pretrained transformer models use subword tokenization, which means that a single word can be split into multiple subword tokens. To handle this, entity labels are aligned with the tokenized output.

Only the first subword token of each word is assigned a label, while subsequent subword tokens and special tokens are assigned an ignore label. This ensures that the loss is computed correctly during training.
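The first-subword rule can be illustrated without loading a tokenizer. The sketch below uses a hand-written `word_ids` list in the format returned by fast tokenizers (`None` for special tokens, otherwise the index of the source word); the list, the labels, and the helper name `align_first_subword` are hypothetical and chosen only for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def align_first_subword(word_ids, word_labels, label2id):
    """Label the first subword of each word; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token or continuation subword: excluded from the loss
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(label2id[word_labels[word_id]])
        previous = word_id
    return aligned

# Hypothetical tokenization: [CLS] w0 w1 ##a ##b w2 [SEP]
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = ["O", "B-DISORDER", "O"]
label2id = {"O": 0, "B-DISORDER": 1}

aligned = align_first_subword(word_ids, word_labels, label2id)
print(aligned)
```

Here the second word is split into three subwords, and only the first of them carries the `B-DISORDER` label. As a result, each word contributes exactly one position to both the loss and the evaluation metrics, regardless of how many pieces it is split into.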


In [5]:
def spans_to_word_labels(text, entities, scheme="BIO"):
    words, offsets = words_with_offsets(text)
    labels = ["O"] * len(words)

    entities = normalize_entities(entities)

    for ent in entities:
        start, end = ent["start"], ent["end"]
        ent_type = TYPE_MAP[ent["type"]]

        covered = [
            i for i, (ws, we) in enumerate(offsets)
            if not (we <= start or ws >= end)
        ]

        if not covered:
            continue

        if scheme == "BIO":
            labels[covered[0]] = f"B-{ent_type}"
            for i in covered[1:]:
                labels[i] = f"I-{ent_type}"

        elif scheme == "BIOES":
            if len(covered) == 1:
                labels[covered[0]] = f"S-{ent_type}"
            else:
                labels[covered[0]] = f"B-{ent_type}"
                for i in covered[1:-1]:
                    labels[i] = f"I-{ent_type}"
                labels[covered[-1]] = f"E-{ent_type}"

    return words, labels


In [6]:
def build_label_list(scheme):
    entity_types = ["DISORDER", "DRUG", "BODY"]
    if scheme == "BIO":
        return ["O"] + [f"{p}-{t}" for t in entity_types for p in ["B", "I"]]
    else:  # BIOES
        return ["O"] + [f"{p}-{t}" for t in entity_types for p in ["B", "I", "E", "S"]]


In [7]:
def tokenize_and_align_labels(words, word_labels, tokenizer, label2id):
    enc = tokenizer(
        words,
        is_split_into_words=True,
        truncation=True,
        max_length=256
    )

    word_ids = enc.word_ids()
    labels = []
    prev_word_id = None

    for word_id in word_ids:
        if word_id is None:
            labels.append(-100)
        elif word_id != prev_word_id:
            labels.append(label2id[word_labels[word_id]])
        else:
            labels.append(-100)
        prev_word_id = word_id

    enc["labels"] = labels
    return enc


In [8]:
def preprocess_dataset(dataset_dict, tokenizer, scheme):
    label_list = build_label_list(scheme)
    label2id = {l: i for i, l in enumerate(label_list)}

    def preprocess(example):
        words, labels = spans_to_word_labels(
            example["sentence"],
            example["entities"],
            scheme=scheme
        )
        return tokenize_and_align_labels(words, labels, tokenizer, label2id)

    encoded = DatasetDict()
    for split in dataset_dict.keys():
        encoded[split] = dataset_dict[split].map(preprocess)

    return encoded, label_list, label2id

In [9]:
def compute_metrics(eval_pred, id2label):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    true_preds, true_labels = [], []

    for pred, lab in zip(predictions, labels):
        p_seq, l_seq = [], []
        for p, l in zip(pred, lab):
            if l == -100:
                continue
            p_seq.append(id2label[p])
            l_seq.append(id2label[l])
        true_preds.append(p_seq)
        true_labels.append(l_seq)

    return {
        "precision": precision_score(true_labels, true_preds),
        "recall": recall_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds),
    }


## Training Setup

Models are fine-tuned with the Hugging Face Trainer API. Apart from the model, the tagging scheme, and (in the optional extension) the learning rate, all hyperparameters are held fixed across experiments.

GPU acceleration and mixed-precision training are used when available to improve training efficiency. Model performance is monitored on the validation set, and the best model is selected based on F1-score.



In [12]:
def train_ner_model(model_name, scheme, learning_rates=(2e-5,)):  # tuple avoids a mutable default argument
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    encoded_ds, label_list, label2id = preprocess_dataset(
        final_ds, tokenizer, scheme
    )

    id2label = {i: l for l, i in label2id.items()}

    lr_results = {}

    for lr in learning_rates:
        print(f"\nTraining {model_name} | {scheme} | lr={lr}")

        model = AutoModelForTokenClassification.from_pretrained(
            model_name,
            num_labels=len(label_list),
            id2label=id2label,
            label2id=label2id
        )

        args = TrainingArguments(
            output_dir=f"./outputs/ner_{model_name.replace('/', '_')}_{scheme}_lr{lr}",
            eval_strategy="epoch",
            save_strategy="epoch",
            learning_rate=lr,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            num_train_epochs=5,
            weight_decay=0.01,
            load_best_model_at_end=True,
            metric_for_best_model="f1",
            logging_steps=50,
            fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
            report_to="none"
        )

        trainer = Trainer(
            model=model,
            args=args,
            train_dataset=encoded_ds["train"],
            eval_dataset=encoded_ds["validation"],
            tokenizer=tokenizer,
            data_collator=DataCollatorForTokenClassification(tokenizer),
            compute_metrics=lambda p: compute_metrics(p, id2label)
        )

        trainer.train()
        test_metrics = trainer.evaluate(encoded_ds["test"])

        lr_results[lr] = test_metrics

    return lr_results




## Model Selection

Two encoder-only transformer models are evaluated in this notebook:
- A Swedish-specific BERT model trained on Swedish text
- A multilingual BERT model trained on text from multiple languages, including Swedish

These models are selected to compare language-specific pretraining with multilingual pretraining in a domain-specific, low-resource setting.


## Optional Extension: Learning Rate Comparison

As an optional extension, the effect of different learning rates on NER performance is evaluated. Learning rates of 1e-5, 2e-5, and 5e-5 are tested across model and tagging scheme configurations.

This experiment helps analyze how optimization settings influence convergence and final performance for a token-level sequence labeling task.
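The resulting experiment grid can be enumerated as follows. This is a minimal sketch that only lists the configurations and performs no training; the model identifiers and values mirror the ones used in this notebook, while the variable names `grid_models`, `grid_schemes`, `grid_lrs`, and `grid` are chosen here for illustration.

```python
from itertools import product

grid_models = [
    "KB/bert-base-swedish-cased",
    "google-bert/bert-base-multilingual-cased",
]
grid_schemes = ["BIO", "BIOES"]
grid_lrs = [1e-5, 2e-5, 5e-5]

# Cartesian product: every model with every scheme and learning rate
grid = list(product(grid_models, grid_schemes, grid_lrs))

print(f"{len(grid)} training runs")
for model_name, scheme, lr in grid:
    print(f"{model_name} | {scheme} | lr={lr}")
```

With two models, two tagging schemes, and three learning rates, the full comparison requires 12 fine-tuning runs.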


In [13]:
models = [
    "KB/bert-base-swedish-cased",
    "google-bert/bert-base-multilingual-cased"
]

schemes = ["BIO", "BIOES"]

learning_rates = [1e-5, 2e-5, 5e-5]

results = {}

for model_name in models:
    for scheme in schemes:
        print(f"\n=== Training {model_name} | Scheme: {scheme} ===")

        lr_results = train_ner_model(
            model_name=model_name,
            scheme=scheme,
            learning_rates=learning_rates
        )

        # Store results with explicit keys
        for lr, metrics in lr_results.items():
            results[(model_name, scheme, lr)] = metrics


=== Training KB/bert-base-swedish-cased | Scheme: BIO ===



Some weights of BertForTokenClassification were not initialized from the model checkpoint at KB/bert-base-swedish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training KB/bert-base-swedish-cased | BIO | lr=1e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.103624 | 0.833333 | 0.925926 | 0.877193 |
| 2 | 0.681900 | 0.012841 | 0.987805 | 1.0 | 0.993865 |
| 3 | 0.066900 | 0.003806 | 1.0 | 1.0 | 1.0 |
| 4 | 0.037700 | 0.002448 | 1.0 | 1.0 | 1.0 |
| 5 | 0.015700 | 0.002106 | 1.0 | 1.0 | 1.0 |





Training KB/bert-base-swedish-cased | BIO | lr=2e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.016062 | 0.940476 | 0.975309 | 0.957576 |
| 2 | 0.543500 | 0.003522 | 1.0 | 1.0 | 1.0 |
| 3 | 0.030300 | 0.001086 | 1.0 | 1.0 | 1.0 |
| 4 | 0.021200 | 0.000734 | 1.0 | 1.0 | 1.0 |
| 5 | 0.004000 | 0.000699 | 1.0 | 1.0 | 1.0 |





Training KB/bert-base-swedish-cased | BIO | lr=5e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.026354 | 0.916667 | 0.950617 | 0.933333 |
| 2 | 0.344400 | 0.023088 | 0.987805 | 1.0 | 0.993865 |
| 3 | 0.007000 | 0.002382 | 0.987805 | 1.0 | 0.993865 |
| 4 | 0.010400 | 0.000216 | 1.0 | 1.0 | 1.0 |
| 5 | 0.002500 | 0.000204 | 1.0 | 1.0 | 1.0 |



=== Training KB/bert-base-swedish-cased | Scheme: BIOES ===


Map:   0%|          | 0/666 [00:00<?, ? examples/s]

Map:   0%|          | 0/75 [00:00<?, ? examples/s]

Map:   0%|          | 0/186 [00:00<?, ? examples/s]




Training KB/bert-base-swedish-cased | BIOES | lr=1e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.205239 | 0.590476 | 0.765432 | 0.666667 |
| 2 | 1.048800 | 0.029451 | 0.94186 | 1.0 | 0.97006 |
| 3 | 0.111300 | 0.009235 | 1.0 | 1.0 | 1.0 |
| 4 | 0.053600 | 0.005224 | 1.0 | 1.0 | 1.0 |
| 5 | 0.023500 | 0.004425 | 1.0 | 1.0 | 1.0 |





Training KB/bert-base-swedish-cased | BIOES | lr=2e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.03093 | 0.918605 | 0.975309 | 0.946108 |
| 2 | 0.704800 | 0.005036 | 1.0 | 1.0 | 1.0 |
| 3 | 0.037700 | 0.001747 | 1.0 | 1.0 | 1.0 |
| 4 | 0.017600 | 0.001168 | 1.0 | 1.0 | 1.0 |
| 5 | 0.004600 | 0.000991 | 1.0 | 1.0 | 1.0 |





Training KB/bert-base-swedish-cased | BIOES | lr=5e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.037248 | 0.941176 | 0.987654 | 0.963855 |
| 2 | 0.440800 | 0.023933 | 0.987805 | 1.0 | 0.993865 |
| 3 | 0.006400 | 0.000417 | 1.0 | 1.0 | 1.0 |
| 4 | 0.005600 | 0.00083 | 1.0 | 1.0 | 1.0 |
| 5 | 0.001300 | 0.001439 | 1.0 | 1.0 | 1.0 |



=== Training google-bert/bert-base-multilingual-cased | Scheme: BIO ===



Some weights of BertForTokenClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training google-bert/bert-base-multilingual-cased | BIO | lr=1e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.13472 | 0.927711 | 0.950617 | 0.939024 |
| 2 | 0.635300 | 0.022325 | 0.963415 | 0.975309 | 0.969325 |
| 3 | 0.076300 | 0.008056 | 0.940476 | 0.975309 | 0.957576 |
| 4 | 0.048200 | 0.005911 | 0.963855 | 0.987654 | 0.97561 |
| 5 | 0.023800 | 0.005599 | 0.987805 | 1.0 | 0.993865 |





Training google-bert/bert-base-multilingual-cased | BIO | lr=2e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.030081 | 1.0 | 1.0 | 1.0 |
| 2 | 0.432400 | 0.005663 | 0.951807 | 0.975309 | 0.963415 |
| 3 | 0.033900 | 0.001927 | 1.0 | 1.0 | 1.0 |
| 4 | 0.013400 | 0.001136 | 1.0 | 1.0 | 1.0 |
| 5 | 0.004300 | 0.001062 | 1.0 | 1.0 | 1.0 |





Training google-bert/bert-base-multilingual-cased | BIO | lr=5e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.015753 | 0.952381 | 0.987654 | 0.969697 |
| 2 | 0.285000 | 0.008531 | 0.97561 | 0.987654 | 0.981595 |
| 3 | 0.015100 | 0.000505 | 1.0 | 1.0 | 1.0 |
| 4 | 0.005900 | 0.000386 | 1.0 | 1.0 | 1.0 |
| 5 | 0.001500 | 0.000355 | 1.0 | 1.0 | 1.0 |



=== Training google-bert/bert-base-multilingual-cased | Scheme: BIOES ===





Training google-bert/bert-base-multilingual-cased | BIOES | lr=1e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.244654 | 0.598039 | 0.753086 | 0.666667 |
| 2 | 0.870800 | 0.047609 | 0.939024 | 0.950617 | 0.944785 |
| 3 | 0.136400 | 0.016838 | 0.929412 | 0.975309 | 0.951807 |
| 4 | 0.075200 | 0.011802 | 0.963855 | 0.987654 | 0.97561 |
| 5 | 0.040200 | 0.010089 | 1.0 | 1.0 | 1.0 |





Training google-bert/bert-base-multilingual-cased | BIOES | lr=2e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.076176 | 0.917647 | 0.962963 | 0.939759 |
| 2 | 0.604500 | 0.011392 | 0.928571 | 0.962963 | 0.945455 |
| 3 | 0.052700 | 0.00458 | 1.0 | 1.0 | 1.0 |
| 4 | 0.022200 | 0.001899 | 1.0 | 1.0 | 1.0 |
| 5 | 0.009700 | 0.001656 | 1.0 | 1.0 | 1.0 |





Training google-bert/bert-base-multilingual-cased | BIOES | lr=5e-05


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.030166 | 0.987805 | 1.0 | 0.993865 |
| 2 | 0.385500 | 0.013051 | 0.906977 | 0.962963 | 0.934132 |
| 3 | 0.022200 | 0.000969 | 1.0 | 1.0 | 1.0 |
| 4 | 0.007200 | 0.000673 | 1.0 | 1.0 | 1.0 |
| 5 | 0.002000 | 0.000584 | 1.0 | 1.0 | 1.0 |


In [None]:
for (model, scheme, lr), metrics in results.items():
    print(f"\nModel: {model} | Scheme: {scheme} | Learning rate: {lr}")
    for k, v in metrics.items():
        if isinstance(v, float):
            print(f"{k}: {v:.4f}")
        else:
            print(f"{k}: {v}")
        


## Evaluation

Model performance is evaluated on the held-out test set using entity-level precision, recall, and F1-score, computed with seqeval. Each trained configuration is evaluated on the test set once, after training is complete.

Results are collected and stored for later analysis and inclusion in the final report.
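Entity-level scoring means that a prediction counts as correct only if both the span boundaries and the entity type match the gold annotation exactly. The sketch below is a simplified, dependency-free illustration of that idea for BIO tags. It is not seqeval's implementation: the helpers `extract_spans` and `entity_prf` are hypothetical, and stray `I-` tags without a preceding `B-` are simply ignored here.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO sequence."""
    spans = set()
    start, entity_type = None, None
    # A trailing "O" sentinel closes any entity still open at the end
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and entity_type == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, entity_type))
            start, entity_type = None, None
        if tag.startswith("B-"):
            start, entity_type = i, tag[2:]
    return spans

def entity_prf(gold_tags, pred_tags):
    """Strict-match precision, recall, and F1 over entity spans."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = ["B-DRUG", "I-DRUG", "O", "B-BODY"]
pred = ["B-DRUG", "O", "O", "B-BODY"]  # the DRUG span is cut short
precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, f1)
```

In this example three of the four tokens are tagged correctly, yet only one of the two predicted entities matches a gold entity exactly, so precision, recall, and F1 are all 0.5. This is why entity-level scores are a stricter measure than token-level accuracy.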


In [None]:
import os
import pandas as pd

rows = []

for (model, scheme, lr), metrics in results.items():
    rows.append({
        "Model": model,
        "Tagging Scheme": scheme,
        "Learning Rate": lr,
        "Precision": metrics["eval_precision"],
        "Recall": metrics["eval_recall"],
        "F1": metrics["eval_f1"]
    })

ner_results_df = pd.DataFrame(rows)
# sort_values returns a new DataFrame, so the result must be assigned
ner_results_df = ner_results_df.sort_values(["Model", "Tagging Scheme", "Learning Rate"])

# Create the output directory if it does not exist yet
os.makedirs("results", exist_ok=True)
ner_results_df.to_csv("results/ner_results_with_lr.csv", index=False)