# Assignment 1: Named Entity Recognition and Entity Linking Group Project

Authors:  
Alberto de los Ríos

## Imports

In [1]:
import pandas as pd
import os
import torch
import numpy as np



In [None]:
# It is recommended to start with general import statements
#from utility_functions import *

## Load data

This section should load the raw dataset for the task.  
Remember to use relative paths to load any files in the notebook.

In [2]:
url_1177 = "hf://datasets/community-datasets/swedish_medical_ner/1177/train-00000-of-00001.parquet"
url_lt = "hf://datasets/community-datasets/swedish_medical_ner/lt/train-00000-of-00001.parquet"
url_wiki = "hf://datasets/community-datasets/swedish_medical_ner/wiki/train-00000-of-00001.parquet"

df_1177 = pd.read_parquet(url_1177)
#df_lt = pd.read_parquet(url_lt)
df_wiki = pd.read_parquet(url_wiki)

os.makedirs("raw_data", exist_ok=True)
df_1177.to_parquet("raw_data/1177_train.parquet", engine="pyarrow")
#df_lt.to_parquet("raw_data/lt_train.parquet", engine="pyarrow")
#df_wiki.to_parquet("raw_data/wiki_train.parquet", engine="pyarrow")



## Task 1: LLMs for NER Survey

This survey will be evaluated along the following dimensions:

- Language Capability (Swedish proficiency)
- Biomedical Knowledge (Domain relevance)
- Computational Efficiency (Training/inference costs)
- Performance Metrics (NER-specific scores)
- Ontology Compatibility (Linking to ICD/ICF/LOINC)
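
For the "Performance Metrics" dimension, NER systems are usually compared with entity-level precision, recall and F1 rather than token accuracy. The following is a minimal sketch of that computation, assuming predictions and gold labels are available as BIO tag sequences. The function names `bio_to_spans` and `entity_prf` are hypothetical helpers written for this illustration, not part of any library, and stray `I-` tags without a preceding `B-` are simply ignored here.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, type) spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        if tag.startswith("I-") and current == tag[2:]:
            continue  # still inside the current entity
        if current is not None:
            spans.add((start, i, current))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans


def entity_prf(gold: List[List[str]], pred: List[List[str]]):
    """Micro-averaged entity-level precision, recall and F1 (exact span match)."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-DIS", "I-DIS", "O", "B-DRUG"]]
pred = [["B-DIS", "I-DIS", "O", "O"]]
print(entity_prf(gold, pred))
```

In this toy case one of the two gold entities is found and nothing spurious is predicted, so precision is 1.0 and recall is 0.5.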

TODO: Add a brief history of BERT, which was created by researchers at Google.

For this first task we will only use the 1177 Vårdguiden dataset since it is the lightest with only 927 sentences.

# LLM in Swedish

Advantages:
1. Authenticity - Maintains original clinical nuance and terminology
2. No translation errors - Avoids introducing errors from machine translation
3. Proper noun handling - Swedish patient/place names and untranslatable terms remain intact
4. Future applicability - Model will work natively with Swedish EHR systems
5. Matching Swedish ontologies - Direct alignment with Swedish medical coding systems

Limitations:
1. Limited models - Fewer Swedish biomedical LLMs available:
   - KB/bert-base-swedish-cased: [Swedish BERT models for NER](https://huggingface.co/KB/bert-base-swedish-cased)
   - [RoBERTa large](https://huggingface.co/AI-Sweden-Models/roberta-large-1160k)
   - Swedish GPT models with limited NER capacity: [AI Sweden Model Hub](https://huggingface.co/AI-Sweden-Models)
2. Smaller datasets - The Swedish medical NER dataset has only ~6,000 annotated entries
3. Debugging difficulty - Hard to verify annotations/errors without Swedish knowledge
4. Resource scarcity - Few Swedish stopword lists, tokenizers, etc.


# LLM in English



Advantages:
1. Model availability - Access to powerful English biomedical models:
   - [BioBERT](https://github.com/naver/biobert-pretrained?tab=readme-ov-file)
   - [ClinicalBERT](https://huggingface.co/medicalai/ClinicalBERT)
   - [BiomedBERT, previously known as PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)
2. Larger datasets - Can augment with English medical NER datasets (~20+ available)
3. Easier debugging - You can understand the text for error analysis
4. More tutorials - Abundant English NLP examples
5. Ontology linking - English ontologies (ICD-10 English) have more community support

Limitations:
1. Translation errors - Clinical terms are often mistranslated
2. Back-translation complexity - English predictions must be mapped back to the Swedish text
3. Loss of context - Swedish compound words get split unnaturally
4. Ontology mismatch - Swedish medical codes don't align perfectly with English ones
5. Added pipeline complexity - Requires a translation component

- [BioBERT vs PubMedBERT](https://medium.com/@EleventhHourEnthusiast/model-comparison-biobert-vs-pubmedbert-8c2d78178d10)
- [Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing](https://dl.acm.org/doi/10.1145/3458754)


# Multilingual LLM

- [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased), 2018
- [XLM-RoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-base), 2019


Advantages:
1. Cross-Lingual Knowledge Transfer: Leverages patterns from high-resource languages (e.g., German medical terms help Swedish). The model is fine-tuned on Swedish data while benefiting from multilingual pretraining. Note that mBERT and XLM-RoBERTa were pretrained on general-domain text (Wikipedia and CommonCrawl, respectively), not on medical corpora.
2. Handling Code-Switching: No need for manual language detection.
3. Robust Tokenization: SentencePiece (used in XLM-R) may handle Swedish compounds better than WordPiece (used in mBERT).
4. Future-Proofing: One model can support other languages (e.g., adding Norwegian EHRs later).



In [32]:
#df_1177.head()

# Analyse raw data

In [3]:
print("GPU available:", torch.cuda.is_available())

from collections import Counter

def analyse_df(df):
    """Print sentence and entity-type statistics for one dataset subset."""
    print(f"Total number of sentences: {len(df)}")

    # Count sentences with at least one entity
    has_entity = df["entities"].apply(lambda ent: len(ent["start"]) > 0)
    print(f"Number of sentences with at least one entity: {has_entity.sum()}")

    # Count how often each entity type occurs across all sentences
    type_counter = Counter()
    for entity in df["entities"]:
        if "type" in entity:
            # np.asarray accepts both NumPy arrays and plain lists
            type_counter.update(np.asarray(entity["type"]).tolist())

    # Print results
    print("Entity counts from dataset:")
    print(f"Disorder/Finding (type 0): {type_counter[0]}")
    print(f"Pharmaceutical Drug (type 1): {type_counter[1]}")
    print(f"Body Structure (type 2): {type_counter[2]}")

#analyse_df(df_1177)
#analyse_df(df_lt)
#analyse_df(df_wiki)



GPU available: True


These results do not match the dataset summary, which states that there are 2,740 annotations, of which:

- 1574 are disorders and findings (type 0)
- 546 are pharmaceutical drugs (type 1)
- 620 are body structures (type 2)
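
One way to investigate the mismatch is to add up the per-type counts over several subsets and compare the totals with the summary. The sketch below does this on toy stand-in data. Whether the summary figures refer to one subset or to all three combined is an assumption that has not been verified here, and `count_entity_types` and `combined_counts` are hypothetical helper names.

```python
from collections import Counter

TYPE_NAMES = {0: "Disorder/Finding", 1: "Pharmaceutical Drug", 2: "Body Structure"}


def count_entity_types(entity_column):
    """Count entity types in one column of entity dicts (each with a 'type' sequence)."""
    counter = Counter()
    for entity in entity_column:
        # int() normalizes NumPy integers to plain Python ints
        counter.update(int(t) for t in entity["type"])
    return counter


def combined_counts(*entity_columns):
    """Sum the per-type counts over any number of subsets."""
    total = Counter()
    for column in entity_columns:
        total += count_entity_types(column)
    return total


# Toy stand-ins for df["entities"] of two subsets (hypothetical values)
subset_a = [{"type": [0, 1]}, {"type": []}]
subset_b = [{"type": [2, 0]}]

totals = combined_counts(subset_a, subset_b)
for type_id, name in TYPE_NAMES.items():
    print(f"{name} (type {type_id}): {totals[type_id]}")
print("Total annotations:", sum(totals.values()))
```

With the real data, the call would look like `combined_counts(df_1177["entities"], df_wiki["entities"])`, plus `df_lt["entities"]` once that subset is loaded.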

---



# Max number of Tokens

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

max_tokens = 0

for sent in df_1177["sentence"]:
    tokens = tokenizer(sent)["input_ids"]
    max_tokens = max(max_tokens, len(tokens))

print("Maximum number of tokens:", max_tokens)

Maximum number of tokens: 72
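
Since the padded length has to cover every subset that will be encoded, the same check can be wrapped in a small helper and run on each dataframe. This is only a sketch: `choose_max_length` and its `margin` parameter are hypothetical, and a whitespace split stands in for the real tokenizer so that the example runs without downloading a model.

```python
def choose_max_length(sentences, tokenize, margin=8):
    """Return (longest sequence length, longest + margin) for a collection of sentences."""
    longest = max((len(tokenize(s)) for s in sentences), default=0)
    return longest, longest + margin


# Toy usage with a whitespace tokenizer as a stand-in
toy_sentences = ["a b c", "a b c d e"]
longest, padded = choose_max_length(toy_sentences, tokenize=str.split)
print("Longest:", longest, "Padded length:", padded)
```

With the tokenizer loaded above, the call would be `choose_max_length(df_1177["sentence"], lambda s: tokenizer(s)["input_ids"])`. For a maximum of 72 tokens and a margin of 8 this gives a padded length of 80.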


# Encoding and alignment of tokens and labels

In [37]:
from transformers import AutoTokenizer

# Load the tokenizer (multilingual, supports Swedish)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# BIO label mapping
label2id = {
    "O": 0,
    "B-DIS": 1,
    "I-DIS": 2,
    "B-DRUG": 3,
    "I-DRUG": 4,
    "B-BODY": 5,
    "I-BODY": 6
}

# Map integer entity types to BIO label prefixes
type_to_bio_prefix = {
    0: "DIS",
    1: "DRUG",
    2: "BODY"
}

def convert_to_bio_labels(encoding, entity_spans):
    """Convert character-level entity spans to BIO-formatted token-level label IDs."""

    labels = [label2id["O"]] * len(encoding["input_ids"])

    for (ent_start, ent_end, ent_type) in entity_spans:
        bio_label = type_to_bio_prefix[ent_type]
        first_token = True

        for idx, (tok_start, tok_end) in enumerate(encoding["offset_mapping"]):
            if tok_start == tok_end == 0:
                continue  # Skip special tokens

            if tok_end > ent_start and tok_start < ent_end:  # Token overlaps entity
                if first_token:
                    labels[idx] = label2id[f"B-{bio_label}"]
                    first_token = False
                else:
                    labels[idx] = label2id[f"I-{bio_label}"]

    return labels

def process_dataset(df, num_examples):
    processed = []

    for i in range(len(df)):
        row = df.iloc[i]
        sentence = row["sentence"]
        entities = row["entities"]  # expects keys: 'start', 'end', 'type'

        encoding = tokenizer(
            sentence,
            return_offsets_mapping=True,
            truncation=True,
            padding="max_length",
            max_length=80 #max_number of tokens for df_1177=72 80 to be safe. check the other df
        )

        # Prepare entity spans for BIO conversion
        entity_spans = list(zip(entities["start"], entities["end"], entities["type"]))
        bio_labels = convert_to_bio_labels(encoding, entity_spans)
        encoding["labels"] = bio_labels

        # We no longer need offset_mapping
        encoding.pop("offset_mapping")

        if i < num_examples:
            # Print token/label pairs for a quick visual check
            id2label = {idx: name for name, idx in label2id.items()}
            print(f"\nSentence {i+1}:")
            tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
            for token, label_id in zip(tokens, encoding["labels"]):
                print(f"{token:15} -> {id2label[label_id]}")

        processed.append(encoding)

    return processed


encodings_df_1177 = process_dataset(df_1177, 3)
os.makedirs("processed_data", exist_ok=True)  # torch.save fails if the directory does not exist
torch.save(encodings_df_1177, "processed_data/encodings_df_1177.pt")  # Save the processed data




Sentence 1:
<s>             -> O
▁Mem            -> O
ant             -> O
in              -> O
▁(              -> B-DIS
▁Eb             -> I-DIS
ixa             -> I-DIS
▁)              -> I-DIS
▁ger            -> O
▁sällan         -> O
▁några          -> O
▁bi             -> O
verk            -> O
ningar          -> O
.               -> O
</s>            -> O
<pad>           -> O
<pad>           -> O
<pad>           -> O
[... remaining <pad> lines omitted; the output is truncated here ...]

The dataset is already tokenized and provides the character spans and the label types of the annotated entities.


What needs to be done:
1. Tokenize the sentence using a pretrained fast tokenizer, such as BertTokenizerFast, which supports alignment of character offsets to token indices.

2. Use tokenizer(..., return_offsets_mapping=True) to map character positions (such as 'start' and 'end') to token indices.

3. Create a labels list per sentence (same length as the number of tokens) initialized to -100 (the value that PyTorch's cross-entropy loss ignores by default), and then assign the label IDs from label2id to the tokens that overlap with entity spans. Note that convert_to_bio_labels above initializes every position to O instead, so special and padding tokens are labeled O, as the printed output shows.
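
The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's final implementation: the offsets are written by hand instead of coming from a real tokenizer, `align_labels` is a hypothetical helper, and special or padding tokens are assumed to have empty offsets such as `(0, 0)`.

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

label2id = {"O": 0, "B-DIS": 1, "I-DIS": 2, "B-DRUG": 3,
            "I-DRUG": 4, "B-BODY": 5, "I-BODY": 6}


def align_labels(offsets, entity_spans, label2id):
    """Map character-level (start, end, prefix) spans to token-level BIO label IDs."""
    # Tokens with an empty offset (special tokens, padding) are ignored in the loss
    labels = [IGNORE_INDEX if start == end else label2id["O"] for start, end in offsets]
    for ent_start, ent_end, prefix in entity_spans:
        first = True
        for idx, (tok_start, tok_end) in enumerate(offsets):
            if tok_start == tok_end:
                continue
            if tok_end > ent_start and tok_start < ent_end:  # token overlaps the entity
                labels[idx] = label2id[("B-" if first else "I-") + prefix]
                first = False
    return labels


# Hand-written offsets for "Memantin ( Ebixa ) ger" (hypothetical tokenization)
offsets = [(0, 0), (0, 8), (9, 10), (11, 16), (17, 18), (19, 22), (0, 0), (0, 0)]
labels = align_labels(offsets, [(9, 18, "DIS")], label2id)
print(labels)
```

Compared with initializing everything to `O`, this keeps `<s>`, `</s>` and `<pad>` positions out of the loss, so the large number of padding tokens does not dominate training.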


Note: the entity spans also include the bracket characters ( { [ that mark each class, so we need to decide whether or not to count them as part of the entity. They are, however, indicative of the label type.
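
If the brackets should be excluded, one option is to shrink each span before the labels are assigned. The helper below is a hypothetical sketch. The span `(9, 18)` is assumed from the printed labels above, where the parentheses around "Ebixa" are tagged as part of the entity.

```python
BRACKETS_AND_SPACE = "()[]{} "


def trim_span(sentence, start, end, strip_chars=BRACKETS_AND_SPACE):
    """Shrink a character span so it no longer starts or ends with brackets or spaces."""
    while start < end and sentence[start] in strip_chars:
        start += 1
    while end > start and sentence[end - 1] in strip_chars:
        end -= 1
    return start, end


sentence = "Memantin ( Ebixa ) ger sällan några biverkningar."
start, end = trim_span(sentence, 9, 18)
print(start, end, repr(sentence[start:end]))
```

The bracket type could still be read from the original span before trimming, so no information about the label type is lost.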

Questions:
1. Why do all sentences need to be the same length (via padding)?
Transformers process input as batches of fixed-size sequences, because:

- Matrix computations (done on GPU) require tensors of uniform shape.

- Efficient batching improves throughput and training stability.

2. Explanation of common fields in encoding:
- input_ids: The token IDs for your sentence (numbers representing tokens).
- token_type_ids: Used for distinguishing sentence segments (mostly for tasks like question answering). Might be all zeros if not applicable.
- attention_mask: Indicates which tokens are real (1) and which are padding (0).
- offset_mapping: For each token, the start and end character positions in the original sentence (useful for aligning labels).

3. Explanation of the label conversion:
- entities["type"] contains flat integers such as 0, 1, 2.
- type_to_bio_prefix maps those integers to entity type names (DIS, DRUG, BODY).
- convert_to_bio_labels() uses this mapping to apply B- and I- prefixes to the correct tokens.
- It then converts the resulting BIO labels to integer IDs using label2id.
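
To make the padding and `attention_mask` points concrete, here is a small pure-Python sketch of what the tokenizer does when `padding` is enabled. The function `pad_batch` and the token IDs are hypothetical, and in practice the tokenizer handles this itself.

```python
def pad_batch(sequences, pad_id, max_length=None):
    """Pad (or truncate) ID sequences to one length and build the attention masks."""
    if max_length is None:
        max_length = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = list(seq)[:max_length]
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Two toy sequences of different length (made-up token IDs)
ids, mask = pad_batch([[0, 11, 12, 2], [0, 21, 2]], pad_id=1)
print(ids)
print(mask)
```

After padding, both rows have the same length and can be stacked into a single tensor, while the mask tells the model which positions to attend to.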


TODO: Add an explanation of BIO labels and of the fact that many models only accept this type of label.


# Create custom PyTorch Dataset

Why create a Dataset class if you already have encodings?
- Integration with PyTorch DataLoader:
The Dataset class provides a standardized way for PyTorch to access your data. The DataLoader uses the Dataset to efficiently load data in batches, shuffle it, and handle multi-threaded loading.

- Batching & Shuffling:
When training a model, you typically don’t want to feed data one example at a time. You want mini-batches (e.g., 16 or 32 samples per batch). The DataLoader uses the Dataset to create batches, and you don’t have to manually slice your encodings.

- Lazy Access / Memory Efficiency:
The Dataset class lets PyTorch load one sample at a time on demand, rather than keeping everything as tensors in memory at once (especially useful with large datasets).

- Transforms / Augmentations:
If you want to apply any transformations (like token masking, noise, data augmentation) on the fly, the Dataset class is the place to do it — at retrieval time.

- Uniform Interface:
The Dataset abstracts the data representation so your training loop just works with the Dataset/DataLoader API without worrying about how data is stored.

- [Custom Named Entity Recognition with BERT](https://towardsdatascience.com/custom-named-entity-recognition-with-bert-cf1fd4510804/)

In [38]:
def collate_encodings(processed):
    encodings = {"input_ids": [], "attention_mask": []}
    labels = []
    for item in processed:
        encodings["input_ids"].append(item["input_ids"])
        encodings["attention_mask"].append(item["attention_mask"])
        labels.append(item["labels"])
    return encodings, labels

encodings, labels = collate_encodings(encodings_df_1177)

In [39]:
import torch
from torch.utils.data import Dataset

class NERDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # For each key (input_ids, attention_mask), convert the sequence at idx to tensor
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Also convert labels for that idx to tensor
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Split and create the Dataset

In [40]:

from sklearn.model_selection import train_test_split

# Separate inputs and labels
all_input_ids = [enc['input_ids'] for enc in encodings_df_1177]
all_attention_mask = [enc['attention_mask'] for enc in encodings_df_1177]
all_labels = [enc['labels'] for enc in encodings_df_1177]

# Pack inputs into one dictionary for convenience
all_encodings = {
    'input_ids': all_input_ids,
    'attention_mask': all_attention_mask
}

# Split

train_input_ids, val_input_ids, train_attention_mask, val_attention_mask, train_labels, val_labels = train_test_split(
    all_input_ids, all_attention_mask, all_labels, test_size=0.1, random_state=42)

# Re-create encoding dicts for train and val
train_encodings = {
    'input_ids': train_input_ids,
    'attention_mask': train_attention_mask
}

val_encodings = {
    'input_ids': val_input_ids,
    'attention_mask': val_attention_mask
}

# Now create datasets
train_dataset = NERDataset(train_encodings, train_labels)
val_dataset = NERDataset(val_encodings, val_labels)
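
The datasets can then be wrapped in a `DataLoader` to obtain mini-batches. The sketch below uses a tiny stand-in dataset with made-up values so that it runs on its own. `ToyNERDataset` mirrors the structure of `NERDataset`, and the batch size of 2 is an arbitrary choice for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyNERDataset(Dataset):
    """Stand-in with the same structure as NERDataset, filled with toy values."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


toy_encodings = {
    "input_ids": [[0, 5, 2, 1], [0, 6, 7, 2], [0, 8, 2, 1]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]],
}
toy_labels = [[0, 1, 0, 0], [0, 3, 4, 0], [0, 0, 0, 0]]

loader = DataLoader(ToyNERDataset(toy_encodings, toy_labels), batch_size=2, shuffle=False)
batches = list(loader)
for batch in batches:
    # The default collate function stacks each field into a (batch, seq_len) tensor
    print({key: tuple(value.shape) for key, value in batch.items()})
```

For the real data, `DataLoader(train_dataset, batch_size=16, shuffle=True)` would be a typical starting point, with shuffling disabled for `val_dataset`.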





## Task 2: This is the title of task 2

This section should contain the solution of task 2.

## Results and Discussion

This section should contain:
- Results.
- Summary of best model performance:
    - Name of best model file as saved in /models.
    - Relevant scores such as: accuracy, precision, recall, F1-score, etc.
- Key discussion points.

In [None]:
# Always use comments in the code to document specific steps