# 03 - Model Training

In this notebook, we fine-tune the multilingual `xlm-roberta-base` model for token-level classification (an NER-like task) on our preprocessed PII dataset.

In [11]:
%pip install evaluate

import json
import numpy as np
import torch
import evaluate
from pathlib import Path
from datasets import load_from_disk
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer, DataCollatorForTokenClassification)


Note: you may need to restart the kernel to use updated packages.


## 1. Load tokenized datasets and labels

In this step, we load the dataset that was tokenized and label-aligned in Notebook 2 (`02_preprocessing.ipynb`).
We also reload the `label2id` mapping that we saved as JSON and rebuild `id2label` from it.
This ensures our model can map between numeric IDs and string labels.


In [4]:
# Load tokenized dataset from disk
ds = load_from_disk("data/hf_tokenized")

# Load label metadata
meta = json.loads(Path("data/labels.json").read_text())
label2id = meta["label2id"]
id2label = {int(v): k for k, v in label2id.items()}

print("Dataset splits:", ds)
print("Number of labels:", len(label2id))
print("Sample label mapping:", list(label2id.items())[:10])

Dataset splits: DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 331106
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 82931
    })
})
Number of labels: 41
Sample label mapping: [('B-AGE', 0), ('B-BUILDINGNUM', 1), ('B-CITY', 2), ('B-CREDITCARDNUMBER', 3), ('B-DATE', 4), ('B-DRIVERLICENSENUM', 5), ('B-EMAIL', 6), ('B-GENDER', 7), ('B-GIVENNAME', 8), ('B-IDCARDNUM', 9)]
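The cell above rebuilds `id2label` from `label2id` rather than reading it from the file. One likely reason is that JSON object keys are always strings, so an ID-keyed dictionary would come back with keys like `"0"` instead of `0`. The sketch below illustrates this with a hypothetical three-label mapping (the real file holds 41 labels, and the label names used here are placeholders).

```python
import json

# Hypothetical three-label mapping, standing in for the 41 labels in data/labels.json
label2id = {"B-AGE": 0, "B-CITY": 1, "O": 2}

# Round-trip through JSON, as happens when the label metadata is saved and reloaded
restored = json.loads(json.dumps({"label2id": label2id}))["label2id"]

# Invert the mapping, casting IDs to int so lookups with predicted class indices work
id2label = {int(idx): name for name, idx in restored.items()}

# For comparison: an ID-keyed dict written to JSON comes back with string keys
id2label_from_json = json.loads(json.dumps(id2label))

print(id2label)
print(list(id2label_from_json))
```

Checking that `len(id2label) == len(label2id)` after the inversion is a cheap way to catch duplicate IDs in the saved mapping.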


## 2. Load tokenizer and initialize model

We load the same tokenizer (`xlm-roberta-base`) that we used during preprocessing. Then we initialize a fresh token classification model, specifying the number of labels and the label ↔ id mappings.

In [5]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "xlm-roberta-base"

# Reload tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)

# Load metadata (labels and mappings)
with open("data/labels.json", "r") as f:
    meta = json.load(f)

label2id = meta["label2id"]
id2label = {int(v): k for k, v in label2id.items()}

# Initialize model for token classification
print("Initializing model...")
model = AutoModelForTokenClassification.from_pretrained(
    MODEL,
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id
)

print("Model and tokenizer loaded.")


Loading tokenizer...




Initializing model...


model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model and tokenizer loaded.


## 3. Load preprocessed dataset and prepare data collator

We reload the dataset saved in the preprocessing step (tokenized, with aligned labels) and create the data collator that pads each batch dynamically.

In [8]:
from transformers import DataCollatorForTokenClassification

ds = load_from_disk("data/hf_tokenized")

# Load meta information about labels (id2label and label2id)
meta = json.loads(Path("data/labels.json").read_text())
label2id = meta["label2id"]
id2label = {int(v): k for k, v in label2id.items()}

# Load tokenizer again (same as model)
tok = AutoTokenizer.from_pretrained("xlm-roberta-base", use_fast=True)

# Data collator handles dynamic padding and batches
collator = DataCollatorForTokenClassification(tok)

print("Dataset:", ds)
print("Number of labels:", len(label2id))

Dataset: DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 331106
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 82931
    })
})
Number of labels: 41


This dataset contains two splits: `train` (331,106 samples) and `validation` (82,931 samples). Each sample includes `input_ids`, `attention_mask`, and `labels`. There are 41 unique label classes for token classification.
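To make "dynamic padding" concrete, here is a pure-Python sketch of what a token classification collator does to one batch. It is an illustration, not the library's implementation. The helper `pad_batch` and the token IDs are hypothetical; the sketch assumes right-side padding, a padding token ID of 1 (the value I understand `xlm-roberta-base` to use), and `-100` as the label padding value.

```python
def pad_batch(features, pad_id=1, label_pad_id=-100):
    """Right-pad every feature dict to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # Padded label positions get -100 so the loss and the metrics skip them
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two hypothetical tokenized samples of different lengths
features = [
    {"input_ids": [0, 581, 2], "attention_mask": [1, 1, 1], "labels": [-100, 40, -100]},
    {"input_ids": [0, 581, 7, 9, 2], "attention_mask": [1] * 5, "labels": [-100, 8, 40, 40, -100]},
]

batch = pad_batch(features)
print(batch["input_ids"][0])
print(batch["labels"][0])
```

Because padding is computed per batch, short batches stay short instead of being padded to a global maximum length.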

## 4. Load model for token classification

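We load the XLM-R model with a classification head for token-level tasks. The number of labels comes from the label mapping we loaded in Section 3. This re-initializes the model created in Section 2, so the classification head again starts from fresh random weights.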

In [9]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    MODEL,
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id
)

print("Model loaded with", len(label2id), "labels.")

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded with 41 labels.


## 5. Setup data collator and evaluation metric

The data collator takes care of batching inputs together and properly padding them. For token classification, Hugging Face provides a dedicated collator.

We also load the `seqeval` metric, a standard choice for NER-like tasks.

It computes precision, recall, and F1 over entity spans rather than over individual tokens.
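The difference between span-level and token-level scoring can be sketched in plain Python. This is a simplified illustration, not seqeval's implementation: the hypothetical helper `extract_spans` ignores an `I-` tag that has no preceding `B-` tag, which may differ from how seqeval treats such tags. The tag sequences are made up for the example.

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues_span = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues_span:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


gold = ["B-GIVENNAME", "I-GIVENNAME", "O", "B-CITY"]
pred = ["B-GIVENNAME", "O", "O", "B-CITY"]

gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
true_positives = len(gold_spans & pred_spans)

precision = true_positives / len(pred_spans)
recall = true_positives / len(gold_spans)
f1 = 2 * precision * recall / (precision + recall)
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(precision, recall, f1, token_accuracy)
```

Here three of four tokens are correct, yet only one of the two predicted spans matches a gold span exactly, so span-level F1 is 0.5 while token accuracy is 0.75.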

In [13]:
collator = DataCollatorForTokenClassification(tok)

%pip install seqeval

metric = evaluate.load("seqeval")

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  DEPRECATION: Building 'seqeval' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'seqeval'. Discussion can be found at https://github.com/pypa/pip/issues/6334
  Building wheel for seqeval (setup.py) ...

## 6. Define `compute_metrics` function

We define a `compute_metrics` function that evaluates model performance with seqeval, the standard metric for sequence labeling and NER tasks.

In [14]:
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=-1)  # take highest scoring label per token
    labels = p.label_ids

    true_predictions, true_labels = [], []
    for pred_seq, label_seq in zip(preds, labels):
        pred_labels, gold_labels = [], []
        for p_i, l_i in zip(pred_seq, label_seq):
            if l_i == -100:  # skip positions masked out of the loss (e.g. padding, special tokens)
                continue
            pred_labels.append(id2label[p_i])
            gold_labels.append(id2label[l_i])
        true_predictions.append(pred_labels)
        true_labels.append(gold_labels)

    # Compute precision, recall, F1
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"]
    }
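As a quick sanity check of the filtering logic, the sketch below runs the same argmax-then-mask steps on a tiny hand-made batch. The three-label `id2label` mapping and the logits are hypothetical and only meant to show which positions survive the `-100` filter.

```python
import numpy as np

# Hypothetical mapping; the notebook's real mapping has 41 labels
id2label = {0: "B-CITY", 1: "I-CITY", 2: "O"}

# Logits with shape (batch=1, seq_len=4, num_labels=3)
logits = np.array([[[0.1, 0.2, 3.0],
                    [2.5, 0.3, 0.1],
                    [0.2, 1.9, 0.4],
                    [0.0, 0.0, 5.0]]])

# First and last positions are masked with -100 and must not be scored
labels = np.array([[-100, 0, 1, -100]])

preds = np.argmax(logits, axis=-1)  # highest-scoring label index per token

pred_tags = [
    [id2label[int(p)] for p, l in zip(pred_seq, label_seq) if l != -100]
    for pred_seq, label_seq in zip(preds, labels)
]
gold_tags = [
    [id2label[int(l)] for l in label_seq if l != -100]
    for label_seq in labels
]

print(pred_tags)
print(gold_tags)
```

Only the two unmasked positions reach the metric, even though the model also produced predictions for the masked ones.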


## 7. Training arguments

Here we define the hyperparameters and training setup for fine-tuning XLM-R.


In [18]:
SEED = 42

args = TrainingArguments(
    output_dir="runs/xlmr-baseline",           # where checkpoints/logs are saved
    learning_rate=3e-5,                        # optimizer learning rate
    per_device_train_batch_size=16,            # batch size for training
    per_device_eval_batch_size=16,             # batch size for evaluation
    num_train_epochs=3,                        # number of epochs to train
    evaluation_strategy="epoch",               # when to run eval (newer transformers releases name this eval_strategy)
    save_strategy="epoch",                     # when to save checkpoints
    logging_steps=50,                          # how often to log
    seed=SEED,                                 # random seed for reproducibility
    report_to="none"                           # disables reporting to external services (e.g., wandb)
)
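These settings determine how many optimizer steps the run takes. The arithmetic below is a small sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; the row count comes from the dataset summary printed earlier.

```python
import math

train_rows = 331_106   # size of the train split reported above
batch_size = 16        # per_device_train_batch_size
epochs = 3             # num_train_epochs
logging_steps = 50

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_events)
```

Under these assumptions the total comes to 62,085 steps, which matches the length of the progress bar shown when training starts.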


## 8. Define Trainer

The Hugging Face `Trainer` class brings everything together:
- The **model** (`xlm-roberta-base` with a token classification head).
- The **training arguments** (batch size, learning rate, logging).
- The **datasets** (`train` and `validation`).
- The **data collator** (ensures batches are padded correctly).
- The **metrics function** (computes precision, recall, F1 using seqeval).

This allows us to start training with a single `.train()` call.


In [None]:
# Build Hugging Face Trainer for token classification
trainer = Trainer(
    model=model,                      # The XLM-R model with token classification head
    args=args,                        # Training arguments (batch size, epochs, etc.)
    train_dataset=ds["train"],        # Training split of the dataset
    eval_dataset=ds["validation"],    # Validation split for evaluation
    tokenizer=tok,                    # Tokenizer, saved alongside the model checkpoints
    data_collator=collator,           # Data collator for dynamic padding
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics (precision, recall, F1)
)


## 9. Inspect and Run Training

The `Trainer` object built in the previous step encapsulates our entire training pipeline.  
Before launching training, we print the object to confirm that it was created.  
Then, calling `trainer.train()` starts the fine-tuning process.


In [20]:
# Inspect Trainer setup
print(trainer)

# Start training
trainer.train()


<transformers.trainer.Trainer object at 0x1524b9c90>


  0%|          | 0/62085 [00:00<?, ?it/s]

  incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask


RuntimeError: MPS backend out of memory (MPS allocated: 4.41 GB, other allocations: 2.18 GB, max allowed: 6.80 GB). Tried to allocate 732.43 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
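The run above stopped because the MPS backend ran out of memory. One common way to lower peak memory is to shrink the per-device batch and compensate with gradient accumulation, which keeps the effective batch size unchanged. The values below are illustrative assumptions, not settings verified to fit within the 6.80 GB limit reported in the error. The dictionary name `low_memory_overrides` is hypothetical.

```python
# Illustrative overrides for TrainingArguments (values are assumptions, not tested on this machine)
low_memory_overrides = {
    "per_device_train_batch_size": 4,   # smaller batches reduce activation memory
    "per_device_eval_batch_size": 4,
    "gradient_accumulation_steps": 4,   # accumulate gradients over 4 batches per optimizer step
}

# Effective batch size per optimizer step stays at the original 16
effective_train_batch = (
    low_memory_overrides["per_device_train_batch_size"]
    * low_memory_overrides["gradient_accumulation_steps"]
)

print(effective_train_batch)
```

These keys could be passed as keyword arguments when constructing `TrainingArguments` in Section 7, replacing the two batch size settings there. Shorter maximum sequence lengths during preprocessing would also reduce memory, at the cost of truncating long inputs.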

## 10. Evaluate and Save the Model

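Once training completes successfully (the run recorded above stopped with an MPS out-of-memory error, and the cell below has no recorded output), we: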
1. Run evaluation on the validation set to get metrics (precision, recall, F1).
2. Save the fine-tuned model for later use.


In [None]:
# Evaluate on validation set
results = trainer.evaluate()
print("Evaluation results:", results)

# Save fine-tuned model and tokenizer
trainer.save_model("model_xlmr_openpii")
tok.save_pretrained("model_xlmr_openpii")

print("Model and tokenizer saved to: model_xlmr_openpii")
