# Fine-tuning a BERT model for an NER task

This notebook fine-tunes a BERT model for named entity recognition (NER). To keep the code easy to read, we use the wrapper classes provided by Hugging Face.

## Imports and environment setup

In [None]:
!pip install wandb -q
!pip install seqeval -q


In [None]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.53.1-py3-none-any.whl.metadata (40 kB)
Downloading transformers-4.53.1-py3-none-any.whl (10.8 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.53.0
    Uninstalling transformers-4.53.0:
      Successfully uninstalled transformers-4.53.0
Successfully installed transformers-4.53.1


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from datasets import DatasetDict, Dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from seqeval.metrics import precision_score, recall_score, f1_score

import ast


In [None]:
import wandb
from google.colab import userdata

api_key = userdata.get('WANDB_API_KEY')
wandb.login(key=api_key)

import os
os.environ["WANDB_PROJECT"] = "<BERT-SanRaffaele>"

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: ceccadaniele00 (SanRaffaele) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Constants

In [None]:
DATA_PATH='/content/drive/MyDrive/SanRaffaele/Data/Dataset NER/clean_NER_LLAMA70B.csv'

In [None]:
OUTPUT_MODEL_PATH='/content/drive/MyDrive/SanRaffaele/Model'

## Data
`Dataset` (from the Hugging Face `datasets` library) is a wrapper that converts a pandas DataFrame into a Hugging Face-compatible dataset object.

It supports quick mapping, batch tokenization, train/test splitting, saving, etc.

Manual alternative: you would need to handle data splits and batching yourself using torch.utils.data.Dataset.

In [None]:
df=pd.read_csv(DATA_PATH)

Cast the label feature from string to list.

In [None]:
df['label'] = df['label'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
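
As a minimal sketch of what this cast does, assuming the CSV stores each label list as the string form of a Python list (the sample value below is made up for illustration):

```python
import ast

# Hypothetical cell value, as it would come back from pd.read_csv.
raw = "['O', 'B-TARGET', 'I-TARGET']"

# literal_eval parses Python literals only; it does not run arbitrary code.
parsed = ast.literal_eval(raw)

print(type(parsed).__name__, parsed)
```

The `isinstance(x, str)` guard in the cell above simply leaves already-parsed values untouched.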


We verify that the number of words equals the number of labels.

In [None]:
counter = 0
for idx, row in df.iterrows():
    if len(row['label']) != len(row['frase'].split()):
        counter += 1
        print("Sentence:", row['frase'])
        print("Labels:", row['label'])
        print("Sentence length (words):", len(row['frase'].split()))
        print("Label length:", len(row['label']))
print(f"Mismatches: {counter}")


Mismatches: 0


Split and convert to Hugging Face Dataset

In [None]:
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "eval": Dataset.from_pandas(test_df)
})

In [None]:
label_to_id = {'O': 0, 'B-TARGET': 1, 'I-TARGET': 2}

id_to_label = {v: k for k, v in label_to_id.items()}

def convert_labels(example):
    example['label'] = [label_to_id[label] for label in example['label']]
    return example

dataset = dataset.map(convert_labels)


In [None]:
print(dataset['train'].shape)
print(dataset['eval'].shape)

(12189, 3)
(1355, 3)


## Tokenizer and model

`AutoTokenizer` is a wrapper that automatically downloads the correct tokenizer for a given model (BERT, RoBERTa, DistilBERT, etc.).

It handles tokenization, padding, truncation, and mapping from text to IDs.

Manual alternative: you would need to manage subword tokenization, vocabularies, and padding yourself.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



We use this function to preprocess each example in the dataset for token classification tasks like Named Entity Recognition (NER). It tokenizes the input sentence while preserving word boundaries and aligns the word-level labels with the tokenizer’s output. Because tokenizers can split words into multiple sub-tokens, the function assigns the label only to the first token of each word and marks the rest with -100 to ignore them during training. It also assigns -100 to special tokens such as [CLS] and [SEP] to ensure they are not considered during loss calculation. This ensures that the model correctly learns from the labeled tokens and ignores padding, sub-token fragments, and special tokens. The output is a dictionary with tokenized inputs and aligned labels, ready for model training.

In [None]:
def tokenize_and_align_labels(example):
    words = example["frase"].split()
    tokenized_inputs = tokenizer(
        words,
        truncation=True,
        is_split_into_words=True,
        padding="max_length",  # pad every example up to max_length
        max_length=128,        # max_length chosen to fit the model and the data
        return_tensors=None    # keep the output as plain Python lists
    )

    labels = []
    word_ids = tokenized_inputs.word_ids()
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:  # special tokens ([CLS], [SEP], [PAD])
            labels.append(-100)
        elif word_idx != previous_word_idx:
            labels.append(example["label"][word_idx])
        else:
            labels.append(-100)
        previous_word_idx = word_idx

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
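
To make the alignment rule concrete without loading a tokenizer, here is a small pure-Python sketch of the same loop. The `word_ids` list and the label ids are hand-written assumptions that mimic what `tokenized_inputs.word_ids()` could return for a three-word input; they are not taken from the dataset.

```python
# Hypothetical word_ids for a 3-word input split into sub-tokens:
# None marks special tokens such as [CLS], [SEP] and [PAD].
word_ids = [None, 0, 0, 1, 1, 2, 2, 2, 2, None, None]
word_labels = [0, 1, 2]  # made-up word-level label ids

IGNORE_INDEX = -100  # positions carrying this value are skipped by the loss

aligned = []
previous = None
for word_idx in word_ids:
    if word_idx is None or word_idx == previous:
        # special token, or a non-first sub-token of the current word
        aligned.append(IGNORE_INDEX)
    else:
        # first sub-token of a new word keeps the word's label
        aligned.append(word_labels[word_idx])
    previous = word_idx

print(aligned)
```

Only one position per word ends up with a real label, so the number of labelled positions equals the number of words.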


In [None]:
# Take one example from the dataset
example = {
    "frase": dataset["train"][1]['frase'],
    "label": dataset["train"][1]['label']  # word-level labels
}

tokenized_example = tokenize_and_align_labels(example)

print("Tokens:", tokenizer.convert_ids_to_tokens(tokenized_example["input_ids"]))
print("Labels:", tokenized_example["labels"])


Tokens: ['[CLS]', 'normal', '##e', 'press', '##ione', 'p', '##olm', '##ona', '##re', '.', 'ass', '##en', '##za', 'di', 'versa', '##mento', 'per', '##ica', '##rdi', '##co', '.', 'rest', '##anti', 're', '##pert', '##i', 'in', '##var', '##ia', '##ti', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', ...

In [None]:
tokenized_train = dataset["train"].map(tokenize_and_align_labels, batched=False)
tokenized_val = dataset["eval"].map(tokenize_and_align_labels, batched=False)


In [None]:
print(tokenized_train[0].keys())  # must contain 'labels', not just 'label'
print(tokenized_train[0]['labels'])  # must be a flat list of ints, not a list of lists


dict_keys(['frase', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'])
[-100, 1, -100, -100, 2, 2, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]


In [None]:
tokenized_train = tokenized_train.remove_columns("label")
tokenized_val = tokenized_val.remove_columns("label")

## Data collator
`DataCollatorForTokenClassification` is a wrapper class that dynamically builds batches (padding, labels, attention masks, etc.) during training.

It is specific to the token classification task.

Manual alternative: manage this within a custom Dataset class.
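
As a rough illustration of what batch-level padding involves, the sketch below pads a list of examples to the longest one in the batch. `pad_batch` is a hypothetical helper written for this example, not part of the Hugging Face API; the real collator also returns tensors rather than lists. The pad id of 0 and the token ids are assumptions.

```python
IGNORE_INDEX = -100  # label value that the loss function skips
PAD_ID = 0           # assumed id of the [PAD] token

def pad_batch(features):
    """Pad every example in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [PAD_ID] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # padded positions get IGNORE_INDEX so they never contribute to the loss
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * n_pad)
    return batch

# Two made-up examples of different lengths.
features = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 1, 2, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 0, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```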


In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer)


## Model

`AutoModelForTokenClassification` loads a pre-trained model (e.g., BERT) and automatically adapts it for a token classification task such as Named Entity Recognition.

It adds a final linear layer with `num_labels` outputs on top of the encoder.

Manual alternative: you would need to build the model head, define the loss function, and write the training step yourself.
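
To show what "define the loss function" would involve, here is a pure-Python sketch of a token-level cross-entropy that skips positions labelled -100. `token_cross_entropy` is a hypothetical helper for illustration, and the logits are invented numbers; the library computes the equivalent on tensors.

```python
import math

IGNORE_INDEX = -100

def token_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # special tokens, padding, non-first sub-tokens
        # numerically stable log-sum-exp over the class scores
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0

# One sequence with 3 token positions and 3 classes (O, B-TARGET, I-TARGET).
logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0], [0.3, 0.2, 3.0]]
labels = [-100, 1, 2]
loss = token_cross_entropy(logits, labels)
print(round(loss, 4))
```

The first position has no effect on the result, however wrong its scores are, because its label is -100.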

In [None]:
# Model
model = AutoModelForTokenClassification.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",
                                                        num_labels=len(label_to_id))


Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

## Training
The core of the Hugging Face training interface.
It encapsulates:

- Training loop
- Evaluation
- Logging
- Saving
- Scheduler
- Mixed precision (with fp16=True)
- Callbacks

Manual alternative: write the optimizer, loss, loops, scheduler, etc., from scratch.
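
As one small piece of what a hand-written loop would need, the sketch below computes the number of optimizer steps and a linear learning-rate schedule. It assumes a single device, no gradient accumulation, and linear decay with no warmup; `linear_lr` is a hypothetical helper, and the numeric values are the ones used in this notebook.

```python
import math

# Values from this notebook: train set size, batch size, epochs, peak LR.
n_train, batch_size, epochs, peak_lr = 12189, 32, 20, 5e-5

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

def linear_lr(step, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

print(steps_per_epoch, total_steps)
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```

Under these assumptions the step count comes out at 7620, which matches the `global_step` reported by `trainer.train()` in this notebook.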



In [None]:
args = TrainingArguments(
    output_dir="./biobert-ner-custom",
    eval_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="wandb"
)

Metrics used to evaluate the model output: entity-level precision, recall, and F1, computed with seqeval.

In [None]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    true_labels = [[id_to_label[l] for l in label if l != -100] for label in labels]
    true_preds = [[id_to_label[p] for (p, l) in zip(pred, label) if l != -100] for pred, label in zip(preds, labels)]


    precision = precision_score(true_labels, true_preds)
    recall = recall_score(true_labels, true_preds)
    f1 = f1_score(true_labels, true_preds)

    return {"precision": precision, "recall": recall, "f1": f1}
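
seqeval scores whole entities rather than individual tokens: a predicted span counts as correct only if its boundaries match a gold span exactly. The sketch below is a simplified, single-entity-type version of that idea written for illustration; `extract_spans` and `entity_f1` are hypothetical helpers, and seqeval's own chunking rules handle more edge cases (for example, an `I-` tag with no preceding `B-`).

```python
def extract_spans(tags):
    """Return the set of (start, end) spans that begin with a B- tag."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def entity_f1(true_seqs, pred_seqs):
    """Entity-level precision, recall and F1 with exact span matching."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold, pred = extract_spans(true_tags), extract_spans(pred_tags)
        tp += len(gold & pred)
        n_true += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up example: two gold entities, one of them predicted.
y_true = [["O", "B-TARGET", "I-TARGET", "O", "B-TARGET"]]
y_pred = [["O", "B-TARGET", "I-TARGET", "O", "O"]]
precision, recall, f1 = entity_f1(y_true, y_pred)
print(precision, recall, round(f1, 4))
```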

In [None]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,  # `tokenizer=` is deprecated in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [None]:
# Train
trainer.train()




| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | No log | 0.346989 | 0.667355 | 0.646431 | 0.656727 |
| 2 | 0.359200 | 0.362873 | 0.661589 | 0.666444 | 0.664008 |
| 3 | 0.246600 | 0.352235 | 0.65893 | 0.681788 | 0.670164 |
| 4 | 0.190500 | 0.364114 | 0.654662 | 0.679119 | 0.666667 |
| 5 | 0.190500 | 0.37692 | 0.673286 | 0.701134 | 0.686928 |
| 6 | 0.148900 | 0.418445 | 0.66773 | 0.698466 | 0.682752 |
| 7 | 0.123300 | 0.490812 | 0.687666 | 0.691795 | 0.689724 |
| 8 | 0.103300 | 0.503196 | 0.668156 | 0.698466 | 0.682975 |
| 9 | 0.103300 | 0.524611 | 0.666877 | 0.705137 | 0.685473 |
| 10 | 0.085100 | 0.593118 | 0.671111 | 0.705137 | 0.687703 |


TrainOutput(global_step=7620, training_loss=0.11344699227590886, metrics={'train_runtime': 5137.8498, 'train_samples_per_second': 47.448, 'train_steps_per_second': 1.483, 'total_flos': 1.592487481379328e+16, 'train_loss': 0.11344699227590886, 'epoch': 20.0})

## Test

In [None]:
# Evaluate on the evaluation split
results = trainer.evaluate()
print(results)

{'eval_loss': 0.7280011177062988, 'eval_precision': 0.6883365200764818, 'eval_recall': 0.7204803202134756, 'eval_f1': 0.7040417209908735, 'eval_runtime': 9.659, 'eval_samples_per_second': 140.284, 'eval_steps_per_second': 4.452, 'epoch': 20.0}


## Save model

In [None]:
model.save_pretrained(os.path.join(OUTPUT_MODEL_PATH, "bioclinicalbert-ner-final"))

In [None]:
tokenizer.save_pretrained(os.path.join(OUTPUT_MODEL_PATH, "bioclinicalbert-ner-final"))

('/content/drive/MyDrive/SanRaffaele/Model/bioclinicalbert-ner-final/tokenizer_config.json',
 '/content/drive/MyDrive/SanRaffaele/Model/bioclinicalbert-ner-final/special_tokens_map.json',
 '/content/drive/MyDrive/SanRaffaele/Model/bioclinicalbert-ner-final/vocab.txt',
 '/content/drive/MyDrive/SanRaffaele/Model/bioclinicalbert-ner-final/added_tokens.json',
 '/content/drive/MyDrive/SanRaffaele/Model/bioclinicalbert-ner-final/tokenizer.json')