# Named Entity Recognition (NER) and Entity Linking

Authors:  
Sviatlana Matsiuk  

## Step 1: Load data

In [94]:
import os

data_folder = "data/"
files = ["1177_annotated_sentences.txt", "LT_annotated_60.txt", "Wiki_annotated_60.txt"]

dataset = []
for file in files:
    with open(os.path.join(data_folder, file), "r", encoding="utf-8") as f:
        lines = f.readlines()
        dataset.extend(lines)

## Step 2. Dataset Selection and Preprocessing
The dataset includes three files:
- 1177_annotated_sentences.txt (from "1177 Vårdguiden")
- LT_annotated_60.txt (from Läkartidningen)
- Wiki_annotated_60.txt (from Swedish Wikipedia)

### **Preprocessing Steps**
1. **Load and tokenize** text using **Kb-BERT tokenizer**.
2. **Align labels** with tokenized words based on **BIO tagging**:
   - **(Disorders/Diseases)** → `B-DIS`, `I-DIS`
   - **[Pharmaceutical Drugs]** → `B-DRUG`, `I-DRUG`
   - **{Anatomical Structures}** → `B-ANAT`, `I-ANAT`

In [2]:
from transformers import AutoTokenizer

# Load Swedish Kb-BERT tokenizer
model_name = "KB/bert-base-swedish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [3]:
from datasets import Dataset
import json

label2id = {
    "O": 0,         # Outside any entity
    "B-DIS": 1,     # Beginning of Disorder/Finding `()`
    "I-DIS": 2,     # Inside Disorder/Finding
    "B-DRUG": 3,    # Beginning of Pharmaceutical Drug `[]`
    "I-DRUG": 4,    # Inside Pharmaceutical Drug
    "B-ANAT": 5,    # Beginning of Body Structure `{}`
    "I-ANAT": 6     # Inside Body Structure
}
id2label = {v: k for k, v in label2id.items()}

import re

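def load_ner_data(file_path):
    """Parse bracket-annotated lines into word lists and BIO label-ID lists.

    `(...)` marks a disorder, `[...]` a drug and `{...}` an anatomical structure;
    every other whitespace-separated token is labeled "O".
    """
    sentences, labels = [], []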

    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue

            words = []
            label_seq = []
            i = 0

            while i < len(line):
                if line[i] in "([{" and i + 1 < len(line):
                    entity_type = (
                        "B-DIS" if line[i] == "(" else
                        "B-DRUG" if line[i] == "[" else
                        "B-ANAT" if line[i] == "{" else "O"
                    )
                    i += 1
                    entity_words = []
                    
                    while i < len(line) and line[i] not in ")]}":
                        entity_words.append(line[i])
                        i += 1

                    entity_text = "".join(entity_words).strip()
                    entity_tokens = entity_text.split()

                    if entity_tokens:
                        words.append(entity_tokens[0])
                        label_seq.append(label2id[entity_type])
                        
                        for subword in entity_tokens[1:]:
                            words.append(subword)
                            i_tag = entity_type.replace("B-", "I-")
                            label_seq.append(label2id[i_tag])
                
                else:
                    word_match = re.match(r"\S+", line[i:])
                    if word_match:
                        word = word_match.group(0)
                        words.append(word)
                        label_seq.append(label2id["O"])  # "O" label
                        i += len(word) - 1

                i += 1

            if words:
                sentences.append(words)
                labels.append(label_seq)

    return sentences, labels

In [4]:
# Load files
all_sentences, all_labels = [], []
for file in files:
    sentences, labels = load_ner_data(os.path.join(data_folder, file))
    all_sentences.extend(sentences)
    all_labels.extend(labels)

In [5]:
# Sanity check
for idx in range(5):
    print(f"Sentence {idx}: {all_sentences[idx]}")
    print(f"Labels {idx}: {all_labels[idx]}")
    print("-" * 50)


Sentence 0: ['Memantin', 'Ebixa', 'ger', 'sällan', 'några', 'biverkningar.']
Labels 0: [0, 1, 0, 0, 0, 0]
--------------------------------------------------
Sentence 1: ['Det', 'är', 'också', 'lättare', 'att', 'dosera', 'flytande', 'medicin', 'än', 'att', 'dela', 'på', 'tabletter.']
Labels 1: [0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0]
--------------------------------------------------
Sentence 2: ['Förstoppning', 'är', 'ett', 'vanligt', 'problem', 'hos', 'äldre.']
Labels 2: [1, 0, 0, 0, 0, 0, 0]
--------------------------------------------------
Sentence 3: ['Medicinen', 'kan', 'också', 'göra', 'att', 'man', 'blöder', 'lättare', 'eftersom', 'den', 'påverkar', 'blodets', 'förmåga', 'att', 'levra', 'sig.']
Labels 3: [3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0]
--------------------------------------------------
Sentence 4: ['Barn', 'har', 'större', 'möjligheter', 'att', 'samarbeta', 'om', 'de', 'i', 'förväg', 'får', 'veta', 'vad', 'som', 'ska', 'hända.']
Labels 4: [0, 0, 0, 0, 0, 0, 0, 0

In [6]:
# Convert to Hugging Face dataset
ner_dataset = Dataset.from_dict({
    "tokens": all_sentences,
    "labels": all_labels
})

The dataset was then converted into **Hugging Face’s Dataset format**, making it ready for fine-tuning.
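The integer labels printed in the sanity check are easier to verify once they are mapped back to tag strings. The snippet below is a minimal sketch of that check; `show_tags` is a hypothetical helper (not part of the notebook), and the label map is copied from the preprocessing cell.

```python
# Label map copied from the preprocessing cell.
id2label = {0: "O", 1: "B-DIS", 2: "I-DIS", 3: "B-DRUG",
            4: "I-DRUG", 5: "B-ANAT", 6: "I-ANAT"}


def show_tags(tokens, label_ids):
    """Pair each word with its BIO tag string (hypothetical helper)."""
    if len(tokens) != len(label_ids):
        raise ValueError("tokens and labels must have the same length")
    return [(tok, id2label[i]) for tok, i in zip(tokens, label_ids)]


# Sentence 2 from the sanity-check output above.
tokens = ["Förstoppning", "är", "ett", "vanligt", "problem", "hos", "äldre."]
label_ids = [1, 0, 0, 0, 0, 0, 0]

tagged = show_tags(tokens, label_ids)
print(tagged[:2])
```

Note that the last word keeps its trailing period (`äldre.`) because the loader splits unannotated text on whitespace only.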

---

## **Step 3. Model Selection: Why Kb-BERT?**
We compared pretrained language models for **Swedish NER** and considered two approaches:
1. **Fine-tune a Swedish language model** (preferred)
2. **Translate data to English and use BioBERT** (not chosen due to translation errors)

| Model | Language | Pretraining Domain | Pros | Cons |
|--------|----------|--------------------|------|------|
| **Kb-BERT** *(KB/bert-base-swedish-cased)* | Swedish | Books, news, government publications, Wikipedia, internet forums | - Trained on **Swedish**<br>- Preserves **Swedish syntax** | - No **biomedical training** |
| **BioBERT** *(biobert-base-cased)* | English | Biomedical texts | - Trained on **medical** data | - Requires **translation** (risk of errors) |
| **mBERT** *(bert-base-multilingual-cased)* | 100+ languages | General multilingual text | - Supports **Swedish** | - Weaker **NER performance** than monolingual models |

### **Why Kb-BERT?**
- **Trained on Swedish** → No translation required
- **Monolingual model** → Outperforms mBERT for Swedish
- **Good for fine-tuning** → Can be adapted for medical text

Using **BioBERT** would introduce translation **errors** and **lose domain-specific Swedish terminology**. Therefore, **Kb-BERT was chosen as the best model for fine-tuning**.

---

Next, we will proceed with **fine-tuning Kb-BERT on the dataset** to improve its biomedical NER capabilities.

---

## Step 4: Fine-Tuning Kb-BERT for Swedish Biomedical NER

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],          
        truncation=True,
        padding="max_length",
        max_length=128,
        is_split_into_words=True
    )

    all_labels = []
    for batch_index, label_seq in enumerate(examples["labels"]):
        word_ids = tokenized_inputs.word_ids(batch_index=batch_index)
        
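        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens ([CLS], [SEP], [PAD]) are ignored by the loss
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First subword piece of a word takes the word's label
                label_ids.append(label_seq[word_idx])
            else:
                # Later pieces copy the same label, so a B- tag is repeated on every piece
                label_ids.append(label_seq[word_idx])
            previous_word_idx = word_idx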

        all_labels.append(label_ids)

    tokenized_inputs["labels"] = all_labels
    return tokenized_inputs

In [9]:
# Apply processing
tokenized_dataset = ner_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    batch_size=16
)

Map:   0%|          | 0/795399 [00:00<?, ? examples/s]

To prepare the data for token classification, we:
1. **Tokenized sentences** using the Kb-BERT tokenizer.
2. **Aligned entity labels** to the tokenized words via `word_ids()`.
3. **Handled subword tokenization** by giving every subword piece the BIO label of the word it belongs to; special and padding tokens get `-100` so the loss ignores them.
4. **Converted the dataset into a format compatible with Hugging Face Transformers**.

The dataset is now ready for training.
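In the alignment function above, continuation pieces copy the word's label, so a word tagged `B-DIS` that is split into several pieces carries `B-DIS` on each piece. Two common alternatives are to mask continuation pieces with `-100` or to switch them to the matching `I-` tag. The sketch below illustrates both on a hand-written `word_ids` list, so no tokenizer download is needed; `align_labels` and the example split are assumptions for illustration, not the method used for the results reported here.

```python
label2id = {"O": 0, "B-DIS": 1, "I-DIS": 2, "B-DRUG": 3,
            "I-DRUG": 4, "B-ANAT": 5, "I-ANAT": 6}
id2label = {v: k for k, v in label2id.items()}


def align_labels(word_ids, word_labels, mask_continuations=False):
    """Align word-level label IDs to subword pieces (hypothetical helper).

    word_ids: one entry per piece; None marks special or padding tokens.
    """
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)                    # ignored by the loss
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])   # first piece of a word
        elif mask_continuations:
            aligned.append(-100)                    # option 1: mask later pieces
        else:
            tag = id2label[word_labels[word_idx]]   # option 2: B-X becomes I-X
            if tag.startswith("B-"):
                tag = "I-" + tag[2:]
            aligned.append(label2id[tag])
        previous = word_idx
    return aligned


# Assumed split: [CLS], word 0 in two pieces, word 1 in one piece, [SEP].
word_ids = [None, 0, 0, 1, None]
word_labels = [label2id["B-DIS"], label2id["O"]]

as_inside = align_labels(word_ids, word_labels)
as_masked = align_labels(word_ids, word_labels, mask_continuations=True)
print(as_inside, as_masked)
```

Either variant keeps one `B-` tag per entity, which matters for span-based metrics that treat every `B-` tag as the start of a new entity.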


## Step 5: Model Training and Evaluation

### **5.1 Training Configuration**
- **Model:** `KB/bert-base-swedish-cased`
- **Batch Size:** 8
- **Learning Rate:** `5e-5`
- **Epochs:** 3
- **Optimizer:** AdamW
- **Evaluation Strategy:** Per epoch

In [10]:
# Split the data
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
val_dataset   = split_dataset["test"]

In [11]:
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification
import evaluate

model = AutoModelForTokenClassification.from_pretrained(
    "KB/bert-base-swedish-cased",
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id
)

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

seqeval = evaluate.load("seqeval")

Some weights of BertForTokenClassification were not initialized from the model checkpoint at KB/bert-base-swedish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    true_labels = [
        [id2label[l] for l in label if l != -100]
        for label in labels
    ]
    pred_labels = [
        [id2label[p] for (p, l) in zip(pred, label) if l != -100]
        for pred, label in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=pred_labels, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall":    results["overall_recall"],
        "f1":        results["overall_f1"],
        "accuracy":  results["overall_accuracy"],
    }

In [13]:
from transformers import TrainingArguments, Trainer

# Training arguments and Trainer
training_args = TrainingArguments(
    output_dir="./kb-bert-ner",
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [14]:
# Train
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0892,0.054914,0.965154,0.953774,0.95943,0.986021
2,0.0424,0.040552,0.974397,0.969932,0.972159,0.990229
3,0.0214,0.036804,0.980213,0.975167,0.977684,0.992105


TrainOutput(global_step=238620, training_loss=0.050989647893191846, metrics={'train_runtime': 37889.0791, 'train_samples_per_second': 50.383, 'train_steps_per_second': 6.298, 'total_flos': 1.247067052857838e+17, 'train_loss': 0.050989647893191846, 'epoch': 3.0})
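The `global_step` in the training output can be cross-checked against the dataset size and batch size. The arithmetic below is a small sketch under three assumptions: a single training device, no gradient accumulation, and a validation split rounded up to a whole example.

```python
import math

n_examples = 795_399   # total shown in the Map progress bar
test_size = 0.2        # from train_test_split
batch_size = 8         # per_device_train_batch_size
epochs = 3             # num_train_epochs

n_val = math.ceil(test_size * n_examples)          # assumed rounding of the split
n_train = n_examples - n_val
steps_per_epoch = math.ceil(n_train / batch_size)  # last batch may be partial
total_steps = steps_per_epoch * epochs

print(n_train, n_val, steps_per_epoch, total_steps)
```

Under these assumptions the total is 238,620 optimizer steps, which agrees with `global_step=238620` reported by `trainer.train()`.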

In [15]:
# Evaluate
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.036803822964429855, 'eval_precision': 0.9802129034275054, 'eval_recall': 0.9751673051303451, 'eval_f1': 0.9776835945315973, 'eval_accuracy': 0.9921051601198996, 'eval_runtime': 691.4805, 'eval_samples_per_second': 230.057, 'eval_steps_per_second': 28.757, 'epoch': 3.0}


In [16]:
# Save final model
trainer.save_model("./kb-bert-ner-final")

### **5.2 Training Performance**
Training ran for **3 epochs**, and the **training loss** decreased with each epoch:

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 Score | Accuracy |
|-------|--------------|----------------|------------|--------|----------|----------|
| **1** | 0.0892       | 0.0549         | 0.9652     | 0.9538 | 0.9594   | 0.9860   |
| **2** | 0.0424       | 0.0406         | 0.9744     | 0.9699 | 0.9722   | 0.9902   |
| **3** | 0.0214       | 0.0368         | 0.9802     | 0.9752 | 0.9777   | 0.9921   |

#### **Observations**
- **Training loss decreased every epoch** (0.0892 → 0.0214), indicating that the model kept fitting the training data.
- **Validation loss also decreased** (0.0549 → 0.0368), so there is no sign of overfitting within three epochs.
- **F1 Score improved from 0.9594 to 0.9777**, showing stronger entity recognition.
- **Final token-level accuracy reached 99.21%**. Since most tokens in NER data are typically `O`, the F1 score is the more informative figure.
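To illustrate why token-level accuracy and entity-level F1 can diverge, the sketch below scores a made-up ten-token sequence both ways. It is a simplified stand-in for what `seqeval` computes: spans must match exactly in type and boundaries, and a stray `I-` tag without a preceding `B-` is ignored here. The tag sequences are invented for the example.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO tag list."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((etype, start, i))        # close the open span
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]           # open a new span
    return spans


def entity_f1(gold, pred):
    """Exact-match entity F1 for one sequence."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


gold = ["O", "B-DIS", "I-DIS", "O", "O", "O", "O", "O", "B-ANAT", "O"]
pred = ["O", "B-DIS", "O",     "O", "O", "O", "O", "O", "B-ANAT", "O"]

token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
f1 = entity_f1(gold, pred)
print(token_accuracy, f1)
```

One wrong token out of ten leaves token accuracy at 0.9, but the truncated disorder span no longer matches, so entity-level F1 drops to 0.5.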


## Step 6: Entity Linking to Medical Ontologies
After extracting named entities from Swedish medical texts, the next step is to **link these entities to standardized medical ontologies** such as **ICD-10, ICF, KSI, and others**. This ensures that recognized terms are mapped to official medical classifications, making them useful for clinical applications.


### **6.1 Loading Medical Ontologies**
We utilized classification files from **Socialstyrelsen** (Swedish National Board of Health and Welfare), including:
- **ICD-10-SE** (International Classification of Diseases)
- **ICF** (International Classification of Functioning, Disability, and Health)
- **KSI** (Klassifikation av socialtjänstens insatser och aktiviteter, the classification of social services interventions and activities)
- **KVA-Kirurgiska** (Surgical procedures)
- **KVA-Medicinska** (Medical procedures)
- **U-Koder** (ICD-10 U codes introduced at short notice, *kort varsel*)

Each ontology file was **parsed and cleaned**, keeping only the **code (Kod) and description (Titel)** for entity matching.


In [79]:
import pandas as pd

ontology_files = {
    "icd-10-se.tsv": "ICD-10",
    "icf.tsv": "ICF",
    "ksi.tsv": "KSI",
    "kva-kirurgiska-atgarder-kka.tsv": "KVA-Kirurgiska",
    "kva-medicinska-atgarder-kma.tsv": "KVA-Medicinska",
    "u-koder-kort-varsel.tsv": "U-Koder"
}

data_path = "data/classification"

ontology_data = {}

for file, name in ontology_files.items():
    file_path = os.path.join(data_path, file)
    try:
        df = pd.read_csv(file_path, sep="\t", encoding="utf-8", dtype=str)
        ontology_data[name] = df
    except Exception as e:
        print(f"Error loading {file}: {e}")

for name, df in ontology_data.items():
    if "Kod" in df.columns and "Titel" in df.columns:
        ontology_data[name] = df[["Kod", "Titel"]].dropna()

In [83]:
from transformers import pipeline

# Load tokenizer and model
model_path = "./kb-bert-ner-final"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

# Create NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


### **6.2 Named Entity Recognition (NER) on Sample Texts**

In [84]:
# Sample Swedish medical text
sample_texts = [
    "Patienten har diabetes och högt blodtryck.",
    "Han har en historia av hjärtinfarkt.",
    "Astma har förvärrats och cancer har upptäckts."
]

# Run NER model
extracted_entities = []
for text in sample_texts:
    ner_results = ner_pipeline(text)
    for entity in ner_results:
        extracted_entities.append(entity['word'])

print("Extracted Entities:", extracted_entities)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Extracted Entities: ['hjärtinfarkt', 'Astma']


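The model extracted "hjärtinfarkt" and "Astma" but missed "diabetes", "högt blodtryck", and "cancer", which also appear in the sample texts. On these three sentences it therefore found 2 of the 5 disorder mentions.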

### 6.3 Entity Matching to Ontologies

In [85]:
import re

def link_entities_to_ontology(entities, ontologies):
    matched_results = []
    
    for entity in entities:
        best_match = None
        best_code = None
        best_source = None
        
        for name, df in ontologies.items():
            # Escape the entity so characters such as "(" or "+" are matched literally
            pattern = re.escape(entity)
            matches = df[df["Titel"].str.contains(pattern, case=False, na=False, regex=True)]
            
            if not matches.empty:
                best_match = matches.iloc[0]["Titel"]
                best_code = matches.iloc[0]["Kod"]
                best_source = name
                break  # Stop at the first found match
        
        matched_results.append({
            "Extracted Entity": entity,
            "Matched Concept": best_match if best_match else "No match found",
            "Ontology Code": best_code if best_code else "N/A",
            "Ontology Source": best_source if best_source else "N/A"
        })
    
    return pd.DataFrame(matched_results)


In [87]:
# Perform entity linking
linked_entities_df = link_entities_to_ontology(extracted_entities, ontology_data)

# Display results
linked_entities_df

  Extracted Entity    Matched Concept Ontology Code Ontology Source
0     hjärtinfarkt  Akut hjärtinfarkt           I21          ICD-10
1            Astma              Astma           J45          ICD-10


The lookup runs a case-insensitive regular-expression search over the `Titel` column of each ontology, in the order the files were loaded, and returns the **first** title that contains the entity rather than a scored best match.
Both extracted entities were matched to ICD-10 codes.
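Because the lookup stops at the first title containing the term, its result depends on row order. The sketch below reproduces that behavior with plain Python on a toy table. The `X00` codes and their titles are made up for illustration and do not come from the Socialstyrelsen files; `first_match` is a hypothetical stand-in for the pandas-based function above.

```python
def first_match(entity, ontologies):
    """Return the first (source, code, title) whose title contains the entity."""
    needle = entity.lower()
    for source, rows in ontologies.items():   # dicts preserve insertion order
        for code, title in rows:
            if needle in title.lower():
                return source, code, title
    return None


# Toy data: the narrower row happens to come first.
toy_ontologies = {
    "TOY": [
        ("X00.1", "Astma hos barn"),   # made-up subcategory
        ("X00", "Astma"),              # made-up category with the exact title
    ]
}

hit = first_match("astma", toy_ontologies)
miss = first_match("migren", toy_ontologies)
print(hit, miss)
```

Here the narrower `X00.1` row wins even though `X00` matches the query exactly, which is the same pattern as several of the mismatches in the error analysis.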

## Step 7: Entity Linking Metrics and Explainability

After performing **Named Entity Recognition (NER) and linking extracted entities to medical ontologies**, we evaluate the accuracy of entity linking. This includes:
- **Calculating precision, recall, F1-score, and accuracy**.
- **Identifying mismatches and errors** in ontology linking.
- **Exploring possible causes of errors and improvements**.

In [91]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

expanded_test_entities = [
    "hjärtinfarkt", "Astma", "diabetes", "stroke", "cancer", "högt blodtryck",
    "lunginflammation", "anemi", "migren", "benskörhet", "reumatoid artrit",
    "förmaksflimmer", "njursvikt", "multipel skleros", "sköldkörtelrubbning",
    "psoriasis", "parkinson", "demens", "leukemi", "akut bronkit"
]

expanded_ground_truth = {
    "hjärtinfarkt": "I21", "Astma": "J45", "diabetes": "E10", "stroke": "I63",
    "cancer": "C80", "högt blodtryck": "I10", "lunginflammation": "J18", "anemi": "D50",
    "migren": "G43", "benskörhet": "M81", "reumatoid artrit": "M06", "förmaksflimmer": "I48",
    "njursvikt": "N18", "multipel skleros": "G35", "sköldkörtelrubbning": "E03", "psoriasis": "L40",
    "parkinson": "G20", "demens": "F03", "leukemi": "C91", "akut bronkit": "J20"
}

expanded_linked_entities_df = link_entities_to_ontology(expanded_test_entities, ontology_data)
y_true_expanded, y_pred_expanded = [], []
for entity, true_code in expanded_ground_truth.items():
    if entity in expanded_linked_entities_df["Extracted Entity"].values:
        pred_code = expanded_linked_entities_df[expanded_linked_entities_df["Extracted Entity"] == entity]["Ontology Code"].values[0]
        y_true_expanded.append(true_code)
        y_pred_expanded.append(pred_code)

expanded_accuracy = accuracy_score(y_true_expanded, y_pred_expanded)
expanded_precision, expanded_recall, expanded_f1, _ = precision_recall_fscore_support(
    y_true_expanded, y_pred_expanded, average="micro"
)

expanded_accuracy_results = {
    "Accuracy": expanded_accuracy,
    "Precision": expanded_precision,
    "Recall": expanded_recall,
    "F1-score": expanded_f1
}

# Display results
df_results = pd.DataFrame([expanded_accuracy_results])
print(df_results)

   Accuracy  Precision  Recall  F1-score
0       0.4        0.4     0.4       0.4


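- **40% accuracy** means that 8 of the 20 test terms received the expected code, which leaves considerable **room for improvement** in entity linking.
- Precision, recall, and F1 are identical to accuracy because each term gets exactly one predicted code and the scores are micro-averaged.
- The lookup is **successful with direct title matches** but **fails when a term appears in several titles or is worded differently** in the ontology.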
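The identical values in the results table follow from micro-averaging in a single-label setting: every wrong prediction counts once as a false positive (for the predicted code) and once as a false negative (for the true code). The sketch below recomputes the figures without scikit-learn. The pairs are taken from the ground-truth dictionary and the mismatch table in this section; terms absent from the mismatch table are assumed to have been predicted correctly.

```python
# (true code, predicted code) in the order of expanded_ground_truth.
pairs = [
    ("I21", "I21"), ("J45", "J45"), ("E10", "E10"), ("I63", "P91.0D"),
    ("C80", "C22.1"), ("I10", "I10"), ("J18", "J09-J18"), ("D50", "D46.0"),
    ("G43", "N/A"), ("M81", "N/A"), ("M06", "M05"), ("I48", "I48"),
    ("N18", "I12.0"), ("G35", "G35"), ("E03", "N/A"), ("L40", "L40"),
    ("G20", "F02.3"), ("F03", "F00"), ("C91", "C83.0A"), ("J20", "J20"),
]

tp = sum(t == p for t, p in pairs)   # correct predictions
fp = sum(t != p for t, p in pairs)   # each error is a false positive for the predicted code
fn = fp                              # ...and a false negative for the true code

accuracy = tp / len(pairs)
micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
print(tp, fp, accuracy, micro_precision, micro_recall)
```

With 8 correct and 12 incorrect predictions, all three values come out at 0.4, so the four columns in the table carry the same information.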

### 7.2 Error Analysis: Incorrect and Missing Mappings

In [93]:
# Identify Errors
mismatches = []

for entity, true_code, pred_code in zip(expanded_ground_truth.keys(), y_true_expanded, y_pred_expanded):
    if true_code != pred_code:
        mismatches.append({
            "Entity": entity,
            "True Code": true_code,
            "Predicted Code": pred_code
        })

# Display or save mismatches
if mismatches:
    mismatch_df = pd.DataFrame(mismatches)
    print("Entity Linking Mismatches:")
    print(mismatch_df)
    mismatch_df.to_csv("entity_linking_mismatches.csv", index=False)
else:
    print("No mismatches found – all predictions are correct!")

Entity Linking Mismatches:
                 Entity True Code Predicted Code
0                stroke       I63         P91.0D
1                cancer       C80          C22.1
2      lunginflammation       J18        J09-J18
3                 anemi       D50          D46.0
4                migren       G43            N/A
5            benskörhet       M81            N/A
6      reumatoid artrit       M06            M05
7             njursvikt       N18          I12.0
8   sköldkörtelrubbning       E03            N/A
9             parkinson       G20          F02.3
10               demens       F03            F00
11              leukemi       C91         C83.0A


We identified **12 mismatches** between **predicted ontology codes** and **ground truth labels**, which is consistent with 8 correct links out of 20 (40%).

### 7.3 Potential Causes of Errors
1. **Ambiguous ICD-10 titles**
   - A term can occur in **several titles** (e.g., `"stroke"` → expected `I63`, returned `P91.0D`).
   - The linker returns the **first title that contains the term**, which may be related but incorrect.

2. **Synonym variations**
   - `"parkinson"` was mapped to `"F02.3"` (dementia due to Parkinson’s) instead of `"G20"` (Parkinson’s disease).
   - `"demens"` was mapped to `"F00"` (Alzheimer’s-related dementia) instead of `"F03"` (unspecified dementia).

3. **Issues with Partial Matches**
   - `"lunginflammation"` (pneumonia) was mapped to `"J09-J18"` (entire range of influenza and pneumonia).
   - `"anemi"` was incorrectly mapped to `"D46.0"` (refractory anemia) instead of `"D50"` (iron deficiency anemia).

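4. **Missing Terms in Ontology**
   - `"migren"`, `"benskörhet"`, and `"sköldkörtelrubbning"` had **no match found**.
   - No title in the loaded files contains these exact strings. This does not necessarily mean the concepts are absent: `"migren"` is a misspelling of *migrän*, and `"benskörhet"` is the lay term for *osteoporos*, so the ontology probably uses different wording than the query.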
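Several of the errors above are partly an artifact of exact string comparison: the ground truth uses three-character categories, while the linker can return subcategories (`C22.1`) or whole blocks (`J09-J18`). The sketch below shows one possible lenient comparison that reduces codes to their category and accepts a block if it contains the true category. `category` and `code_matches` are hypothetical helpers, and whether a block-level hit should count as correct is a judgment call.

```python
import re


def category(code):
    """Return the three-character category, e.g. 'C22.1' -> 'C22', else None."""
    m = re.match(r"[A-Z]\d{2}", code)
    return m.group(0) if m else None


def code_matches(true_code, pred_code):
    """Lenient comparison at category level; ranges match if they contain the category."""
    cat = category(true_code)
    if "-" in pred_code:
        lo, hi = (category(part) for part in pred_code.split("-", 1))
        if None in (lo, hi, cat):
            return False
        return lo[0] == cat[0] == hi[0] and int(lo[1:]) <= int(cat[1:]) <= int(hi[1:])
    return cat is not None and category(pred_code) == cat


# (true code, predicted code) rows from the mismatch table above.
mismatches = [
    ("I63", "P91.0D"), ("C80", "C22.1"), ("J18", "J09-J18"), ("D50", "D46.0"),
    ("G43", "N/A"), ("M81", "N/A"), ("M06", "M05"), ("N18", "I12.0"),
    ("E03", "N/A"), ("G20", "F02.3"), ("F03", "F00"), ("C91", "C83.0A"),
]

recovered = [pair for pair in mismatches if code_matches(*pair)]
print(recovered)
```

Only the `lunginflammation` row is recovered this way, so the remaining errors stem from the first-match lookup itself rather than from code granularity.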

### 7.4 Explainability: Improving Entity Linking
To improve **ontology matching**, future work should include:
- **Fuzzy matching** (handling minor spelling variations).
- **Confidence thresholds** (assigning certainty scores to matches).
- **ML-based classification** (training a model for better ICD-10 prediction).
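As a rough illustration of the first two ideas, the sketch below combines fuzzy matching with a confidence threshold using only the standard library. The toy titles are assumptions for the example (in particular the `Migrän` title for `G43`), not rows read from the ontology files, and the threshold of 0.8 is arbitrary and would need tuning.

```python
from difflib import SequenceMatcher

# Toy (code, title) pairs; illustrative assumptions, not loaded from the TSV files.
toy_titles = [("G43", "Migrän"), ("J45", "Astma"), ("I21", "Akut hjärtinfarkt")]


def fuzzy_link(entity, titles, threshold=0.8):
    """Return (code, title, score) for the most similar title, or None below the threshold."""
    best_code, best_title, best_score = None, None, 0.0
    for code, title in titles:
        score = SequenceMatcher(None, entity.lower(), title.lower()).ratio()
        if score > best_score:
            best_code, best_title, best_score = code, title, score
    if best_score < threshold:
        return None   # no candidate is similar enough to trust
    return best_code, best_title, round(best_score, 3)


spelling_variant = fuzzy_link("migren", toy_titles)
lay_term = fuzzy_link("benskörhet", toy_titles)
print(spelling_variant, lay_term)
```

Character-level similarity recovers the spelling variant `migren`, but it cannot help with lay terms such as `benskörhet`, which would need a synonym list or a learned mapping.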

## Conclusion
- **Named Entity Recognition (NER) was successfully performed using Kb-BERT.**
- **Entities were linked to medical ontologies, achieving partial success.**
- **The accuracy of entity linking (40%) suggests improvements are needed.**
- **Ambiguous and missing mappings require better handling strategies.**