# Assignment 1: Named Entity Recognition and Entity Linking Group Project

Authors:  
Alberto de los Ríos

## Imports

In [1]:
import pandas as pd
import os
import torch
import numpy as np



In [None]:
# It is recommended to start with general import statements
#from utility_functions import *

## Load data

This section should load the raw dataset for the task.  
Remember to use relative paths to load any files in the notebook.

In [2]:
url_1177 = "hf://datasets/community-datasets/swedish_medical_ner/1177/train-00000-of-00001.parquet"
url_lt = "hf://datasets/community-datasets/swedish_medical_ner/lt/train-00000-of-00001.parquet"
url_wiki = "hf://datasets/community-datasets/swedish_medical_ner/wiki/train-00000-of-00001.parquet"

df_1177 = pd.read_parquet(url_1177)
#df_lt = pd.read_parquet(url_lt)
df_wiki = pd.read_parquet(url_wiki)

os.makedirs("raw_data", exist_ok=True)
df_1177.to_parquet("raw_data/1177_train.parquet", engine="pyarrow")
#df_lt.to_parquet("raw_data/lt_train.parquet", engine="pyarrow")
#df_wiki.to_parquet("raw_data/wiki_train.parquet", engine="pyarrow")



## Task 1: LLMs for NER Survey

This survey will be evaluated based on the following metrics and dimensions:

- Language Capability (Swedish proficiency)
- Biomedical Knowledge (Domain relevance)
- Computational Efficiency (Training/inference costs)
- Performance Metrics (NER-specific scores)
- Ontology Compatibility (Linking to ICD/ICF/LOINC)
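To make the "NER-specific scores" dimension concrete, here is a minimal sketch of span-level precision, recall, and F1. It assumes exact-match scoring, where a predicted entity counts as correct only if its start, end, and type all agree with a gold entity. This is one common convention and not necessarily the one this assignment requires. The function name `span_prf` and the toy spans are hypothetical.

```python
from typing import Dict, Set, Tuple

# An entity is identified by (char_start, char_end, type_id).
Span = Tuple[int, int, int]


def span_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over entity spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example (made-up spans): one exact hit, one boundary error, one spurious span.
gold = {(9, 18, 1), (30, 42, 0)}
pred = {(9, 18, 1), (30, 40, 0), (50, 55, 2)}
scores = span_prf(gold, pred)
print(scores)
```

Under exact matching, the boundary error `(30, 40, 0)` counts as both a false positive and a false negative, which is why span-level scores are usually lower than token-level accuracy.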

TODO: add a brief history of BERT, which was created by researchers at Google.

For this first task we will only use the 1177 Vårdguiden dataset since it is the lightest with only 927 sentences.

# LLM in Swedish

Advantages:
1. Authenticity - Maintains original clinical nuance and terminology
2. No translation errors - Avoids introducing errors from machine translation
3. Proper noun handling - Swedish patient/place names and untranslatable terms remain intact
4. Future applicability - Model will work natively with Swedish EHR systems
5. Matching Swedish ontologies - Direct alignment with Swedish medical coding systems

Limitations:
1. Limited models - Fewer Swedish biomedical LLMs are available:
   - KB/bert-base-swedish-cased: [Swedish BERT models for NER](https://huggingface.co/KB/bert-base-swedish-cased)
   - [RoBERTa large](https://huggingface.co/AI-Sweden-Models/roberta-large-1160k)
   - Swedish GPT models with limited NER capacity: [AI Sweden Model Hub](https://huggingface.co/AI-Sweden-Models)
2. Smaller datasets - The Swedish medical NER dataset has only ~6,000 annotated entries
3. Debugging difficulty - Hard to verify annotations/errors without Swedish knowledge
4. Resource scarcity - Few Swedish stopword lists, tokenizers, etc.


# LLM in English



Advantages:
1. Model availability - Access to powerful English biomedical models:
   - [BioBERT](https://github.com/naver/biobert-pretrained?tab=readme-ov-file)
   - [ClinicalBERT](https://huggingface.co/medicalai/ClinicalBERT)
   - [BiomedBERT (previously known as PubMedBERT)](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)
2. Larger datasets - Can augment with English medical NER datasets (~20+ available)
3. Easier debugging - You can understand the text for error analysis
4. More tutorials - Abundant English NLP examples
5. Ontology linking - English ontologies (ICD-10 English) have more community support

Limitations:
1. Translation errors - Clinical terms are often mistranslated
2. Back-translation complexity - English predictions must be mapped back to the Swedish text
3. Loss of context - Swedish compound words get split unnaturally
4. Ontology mismatch - Swedish medical codes don't align perfectly with English ones
5. Added pipeline complexity - Requires a translation component

- [BioBERT vs PubMedBERT](https://medium.com/@EleventhHourEnthusiast/model-comparison-biobert-vs-pubmedbert-8c2d78178d10)
- [Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing](https://dl.acm.org/doi/10.1145/3458754)


# Multilingual LLM

- [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased), 2018
- [XLM-RoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-base), 2019


Advantages:
1. Cross-Lingual Knowledge Transfer: Leverages patterns from high-resource languages (e.g., German medical terms help Swedish). Fine-tune on Swedish data but benefit from pretraining on multilingual medical corpora.
2. Handling Code-Switching: No need for manual language detection.
3. Robust Tokenization: SentencePiece (used in XLM-R) handles Swedish compounds better than WordPiece (used in mBERT).
4. Future-Proofing: One model can support other languages (e.g., adding Norwegian EHRs later).



In [3]:
from collections import Counter

print("GPU available:", torch.cuda.is_available())

def analyse_df(df):
    """Print basic statistics about one subset of the Swedish medical NER dataset."""
    # Bare expressions are not displayed inside a function, so print explicitly
    print(df.head(10))
    print(df.columns)
    print(f"Total number of sentences: {len(df)}")

    # Count sentences with at least one entity
    has_entity = df["entities"].apply(lambda ent: len(ent["start"]) > 0)
    print(f"Number of sentences with at least one entity: {has_entity.sum()}")

    # Count how often each entity type occurs
    type_counter = Counter()
    for entity in df["entities"]:
        if "type" in entity:
            types = entity["type"]
            if isinstance(types, np.ndarray) and types.size > 0:
                type_counter.update(types.tolist())

    # Print results
    print("Entity counts from dataset:")
    print(f"Disorder/Finding (type 0): {type_counter[0]}")
    print(f"Pharmaceutical Drug (type 1): {type_counter[1]}")
    print(f"Body Structure (type 2): {type_counter[2]}")

#analyse_df(df_1177)
#analyse_df(df_lt)
#analyse_df(df_wiki)



GPU available: True


These results do not match the dataset summary, which states that there are 2740 annotations, of which:

- 1574 are disorders and findings (type 0)
- 546 are pharmaceutical drugs (type 1)
- 620 are body structures (type 2)
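The comparison against the dataset summary can be made explicit with a small helper that reports the per-type difference. The following is a minimal sketch, not part of the original analysis. The names `EXPECTED`, `count_entity_types`, and `compare_counts` are hypothetical, and the toy rows only mimic the shape of the `entities` column.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# Per-type counts quoted from the dataset summary (type id -> annotations).
EXPECTED: Dict[int, int] = {0: 1574, 1: 546, 2: 620}


def count_entity_types(entities: Iterable[Mapping]) -> Counter:
    """Count entity type ids over rows shaped like {'type': [...], ...}."""
    counter: Counter = Counter()
    for entity in entities:
        # Works for plain lists and NumPy arrays alike.
        counter.update(int(t) for t in entity.get("type", []))
    return counter


def compare_counts(observed: Mapping[int, int],
                   expected: Mapping[int, int] = EXPECTED) -> Dict[int, int]:
    """Return observed minus expected for every expected type id."""
    return {t: observed.get(t, 0) - n for t, n in expected.items()}


# Toy rows (made up) standing in for df["entities"].
toy_rows = [{"type": [0, 2]}, {"type": []}, {"type": [1]}]
observed = count_entity_types(toy_rows)
diff = compare_counts(observed)
print(diff)
```

On the real data the call would presumably be `compare_counts(count_entity_types(df_1177["entities"]))`. A negative value means that fewer annotations of that type were found than the summary reports.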

---



In [50]:
from transformers import AutoTokenizer

# Load the tokenizer (multilingual, supports Swedish)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def process_dataset(df, num_examples=10, max_length=128):
    """Tokenize the first sentences and print each token with its aligned label."""
    for i in range(min(num_examples, len(df))):
        row = df.iloc[i]
        sentence = row["sentence"]
        entities = row["entities"]  # must have 'start', 'end', 'type'

        # Tokenize with character offsets so entity spans can be mapped to tokens
        encoding = tokenizer(
            sentence,
            return_offsets_mapping=True,
            truncation=True,
            padding="max_length",
            max_length=max_length
        )

        # -100 marks positions that the loss function ignores
        labels = [-100] * len(encoding["input_ids"])

        # Align labels using the offset mapping
        for start, end, label in zip(entities["start"], entities["end"], entities["type"]):
            for idx, (token_start, token_end) in enumerate(encoding["offset_mapping"]):
                # Special and padding tokens have empty (0, 0) offsets; skip them
                if token_start == token_end:
                    continue
                if token_start >= start and token_end <= end:
                    labels[idx] = int(label)

        # Print tokens and labels
        print(f"\nSentence {i+1}:")
        print("Tokens and Labels:")
        tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
        for token, label in zip(tokens, labels):
            print(f"{token:15} -> {label}")

        # Remove offset mapping after debugging
        #encoding.pop("offset_mapping")
        #encoding["labels"] = labels

process_dataset(df_1177, num_examples=10)



Sentence 1:
Tokens and Labels:
<s>             -> -100
▁Mem            -> -100
ant             -> -100
in              -> -100
▁(              -> 0
▁Eb             -> 0
ixa             -> 0
▁)              -> 0
▁ger            -> -100
▁sällan         -> -100
▁några          -> -100
▁bi             -> -100
verk            -> -100
ningar          -> -100
.               -> -100
</s>            -> -100
<pad>           -> -100
... (remaining <pad> rows omitted; the captured output is truncated)

The dataset is already tokenized and provides the entity span instances (character start and end offsets) together with their label types.

What still needs to be done:
1. Tokenize each sentence with a pretrained fast tokenizer, such as `BertTokenizerFast`, which supports aligning character offsets to token indices.

2. Call `tokenizer(..., return_offsets_mapping=True)` to map character positions (such as `start` and `end`) to token indices.

3. Create one `labels` list per sentence (same length as the number of tokens), initialized to -100 (the value that the loss function ignores, so those positions do not contribute to training), and then assign the label IDs (0, 1, 2) to the tokens that overlap with entity spans.
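The three steps above can be sketched without downloading a tokenizer by working directly on an offset mapping. This is a minimal illustration under stated assumptions: `align_labels` is a hypothetical helper, the offsets are hand-written to imitate what a fast tokenizer returns with `return_offsets_mapping=True`, and the example sentence and its annotation are made up.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(
    offsets: Sequence[Tuple[int, int]],
    spans: Sequence[Tuple[int, int, int]],
) -> List[int]:
    """Return one label per token.

    offsets: (char_start, char_end) for every token.
    spans:   (char_start, char_end, label_id) for every annotated entity.
    """
    labels = [IGNORE_INDEX] * len(offsets)
    for idx, (tok_start, tok_end) in enumerate(offsets):
        # Special and padding tokens have an empty (0, 0) offset: skip them.
        if tok_start == tok_end:
            continue
        for ent_start, ent_end, label_id in spans:
            # Overlap test: token and entity share at least one character.
            if tok_start < ent_end and tok_end > ent_start:
                labels[idx] = int(label_id)
                break
    return labels


# Hypothetical tokens for "Ont i magen": <s> "Ont" "i" "mag" "en" </s>
offsets = [(0, 0), (0, 3), (4, 5), (6, 9), (9, 11), (0, 0)]
spans = [(6, 11, 2)]  # "magen" annotated as type 2 (made-up annotation)
labels = align_labels(offsets, spans)
print(labels)  # [-100, -100, -100, 2, 2, -100]
```

Skipping empty offsets matters because an entity that starts at character 0 would otherwise also match special tokens and padding. Note that this sketch uses an overlap test, as described in step 3, which is slightly more permissive than requiring a token to lie fully inside the span.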

In [44]:
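from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "xlm-roberta-base"
NUM_LABELS = 3  # entity types 0, 1 and 2

# Load the tokenizer and a token-classification head with one output per entity type
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=NUM_LABELS)

# Wrap the DataFrame as a Hugging Face Dataset
dataset = Dataset.from_pandas(df_1177)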





## Task 2: This is the title of task 2

This section should contain the solution of task 2.

## Results and Discussion

This section should contain:
- Results.
- Summary of best model performance:
    - Name of best model file as saved in /models.
    - Relevant scores such as: accuracy, precision, recall, F1-score, etc.
- Key discussion points.

In [None]:
# Always use comments in the code to document specific steps