# Training a BERT Model for PII Detection

This notebook walks through the complete process of training a multilingual BERT model to automatically identify personally identifiable information (PII) in text. We'll use a large dataset of 400,000 examples to teach the model to recognize things like names, emails, phone numbers, addresses, and other sensitive information.

## Setting Up the Environment and Loading Data

First, we need to install the required libraries and load our dataset. We're installing the Hugging Face transformers library (which gives us access to BERT), along with tools for evaluation and data handling.

The dataset we're using comes from ai4privacy and contains 400,000 text examples where PII has already been labeled. Each example comes with annotations marking exactly where names, emails, and other sensitive info appear in the text. We'll use these labels to teach our model what PII looks like.

We're using "bert-base-multilingual-cased" as our starting point. This is a pre-trained model that already understands multiple languages, which we'll fine-tune specifically for detecting PII.

In [None]:
!pip install -U transformers accelerate evaluate seqeval datasets

from datasets import load_dataset, DatasetDict
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, TrainingArguments,
                          Trainer)
import numpy as np
import evaluate
import re
from collections import defaultdict

ds = load_dataset("ai4privacy/pii-masking-400k")  # has train and validation
train_ds = ds["train"]
val_ds   = ds["validation"]

model_name = "bert-base-multilingual-cased"
tokenizer  = AutoTokenizer.from_pretrained(model_name)






## Checking the Transformers Version

A quick check of which version of the transformers library is installed. Argument names and available options differ between versions, which is why the training cell further down includes a fallback for older releases.

In [None]:
import transformers
print(transformers.__version__)

4.57.1


## Exploring the Dataset Structure

Let's take a look at what columns our dataset has. This helps us understand what information is available and how the data is organized. We need to know the column names so we can access the pre-tokenized text and labels correctly.

In [None]:
print(train_ds.column_names)

['source_text', 'locale', 'language', 'split', 'privacy_mask', 'uid', 'masked_text', 'mbert_tokens', 'mbert_token_classes']


## Setting Up Column Names

Here we're just defining which columns in our dataset contain the tokens (small chunks of text) and their corresponding labels. This makes the code easier to read and modify later if the dataset structure changes.

In [None]:
TOK_COL = "mbert_tokens"
LAB_COL = "mbert_token_classes"

## Building Label Mappings and Preparing Data

This is where we prepare the data for training. We're doing several important things here:

1. **Creating label mappings**: We collect all the unique labels from our dataset (like "B-EMAIL", "I-EMAIL", "O", etc.) and assign each one a number. The model works with numbers, not text labels, so we need these mappings to convert back and forth.

2. **Understanding tokens and labels**: Think of tokens as individual words or word pieces. For example, "kevin@email.com" might be split into several tokens. Each token needs a label telling the model what type of information it is (or "O" if it's not PII).

3. **Encoding the data**: The `encode_batch` function converts our tokens into the input IDs BERT expects and aligns the labels so that every resulting sub-token carries the label of the token it came from. Special tokens such as `[CLS]` and `[SEP]` get a label of -100, which tells the loss function to ignore them during training. Padding is added later by the data collator and is ignored the same way.

4. **Processing the dataset**: We apply this encoding to all our training and validation examples, transforming them into a format ready for model training.
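The label mapping and alignment steps above can be sketched without loading the tokenizer. The tags and the `word_ids` list below are made up for illustration, and `align_labels` is a hypothetical helper name. In the real cell, `word_ids` comes from `enc.word_ids(batch_index=i)`, where `None` marks a special token.

```python
# Toy illustration of steps 1 and 3; all values here are hypothetical.
IGNORE_INDEX = -100  # label value that the loss function skips

# Step 1: collect the unique tags, keep "O" first, and number them.
tag_rows = [["O", "O", "B-EMAIL", "I-EMAIL"], ["B-GIVENNAME", "O"]]
label_set = {tag for row in tag_rows for tag in row}
label_list = ["O"] + sorted(tag for tag in label_set if tag != "O")
label2id = {lab: i for i, lab in enumerate(label_list)}
id2label = {i: lab for lab, i in label2id.items()}


def align_labels(word_ids, word_tags):
    """Give every sub-token the id of its source token's tag; None -> -100."""
    return [
        IGNORE_INDEX if w_id is None else label2id[word_tags[w_id]]
        for w_id in word_ids
    ]


# Pretend the tokenizer produced [CLS] w0 w1 w2 w2 w3 [SEP]:
# token 2 was split into two pieces, and the special tokens map to None.
word_ids = [None, 0, 1, 2, 2, 3, None]
aligned = align_labels(word_ids, tag_rows[0])

print(label2id)
print(aligned)
```

Note that, as in the notebook's `encode_batch`, a token that is split into several pieces repeats its tag on every piece, so the `B-EMAIL` id appears twice in this toy output.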

In [None]:
# Build label list directly from the dataset's BIO labels
label_set = set()
for ex in train_ds[LAB_COL]:
    label_set.update(ex)
# Ensure "O" is first
label_list = ["O"] + sorted([lab for lab in label_set if lab != "O"])
label2id = {lab: i for i, lab in enumerate(label_list)}
id2label = {i: lab for lab, i in label2id.items()}
# LAB_COL is a column where each row is a list of BIO tags, e.g. ["O", "B-EMAIL", "I-EMAIL", ...], one tag per token.
# A token is a small chunk of text that the model sees as one unit, e.g. ["my", "name", "is", "Kevin"];
# each token is mapped to an integer id, e.g. [1010, 2450, 2001, 12345].
# What matters is that labels stay aligned to those tokens, one label per token:
# "my" -> O, "name" -> O, "is" -> O, "Kevin" -> B-GIVENNAME
print("Labels:", label_list)

def encode_batch(examples):
    batch_tokens = examples[TOK_COL]
    batch_labels = examples[LAB_COL]

    # Tokenize *tokens*, not raw text
    enc = tokenizer(
        batch_tokens,
        is_split_into_words=True,
        truncation=True,
        max_length=256,
        padding=False,
    )

    all_labels = []

    # We need to get word_ids per example
    # enc.word_ids(batch_index=i) returns a list of word indices for that example's tokens
    for i in range(len(batch_tokens)):
        word_ids = enc.word_ids(batch_index=i)
        sent_labels = batch_labels[i]

        tok_labels = []
        for w_id in word_ids:
            if w_id is None:
                # Special tokens ([CLS], [SEP]) have no source token: ignore them in the loss
                tok_labels.append(-100)
            else:
                tok_labels.append(label2id[sent_labels[w_id]])
        all_labels.append(tok_labels)

    enc["labels"] = all_labels
    return enc

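# Optional downsampling: lower MAX_TRAIN / MAX_VAL for a quick run.
# With the defaults below, the full train and validation splits are used.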
MAX_TRAIN = len(train_ds)
MAX_VAL   = len(val_ds)
train_small = train_ds.select(range(min(MAX_TRAIN, len(train_ds))))
val_small   = val_ds.select(range(min(MAX_VAL, len(val_ds))))
encoded_train = train_small.map(
    encode_batch,
    batched=True,
    remove_columns=train_small.column_names,
)
encoded_val = val_small.map(
    encode_batch,
    batched=True,
    remove_columns=val_small.column_names,
)

Labels: ['O', 'B-ACCOUNTNUM', 'B-BUILDINGNUM', 'B-CITY', 'B-CREDITCARDNUMBER', 'B-DATEOFBIRTH', 'B-DRIVERLICENSENUM', 'B-EMAIL', 'B-GIVENNAME', 'B-IDCARDNUM', 'B-PASSWORD', 'B-SOCIALNUM', 'B-STREET', 'B-SURNAME', 'B-TAXNUM', 'B-TELEPHONENUM', 'B-USERNAME', 'B-ZIPCODE', 'I-ACCOUNTNUM', 'I-BUILDINGNUM', 'I-CITY', 'I-CREDITCARDNUMBER', 'I-DATEOFBIRTH', 'I-DRIVERLICENSENUM', 'I-EMAIL', 'I-GIVENNAME', 'I-IDCARDNUM', 'I-PASSWORD', 'I-SOCIALNUM', 'I-STREET', 'I-SURNAME', 'I-TAXNUM', 'I-TELEPHONENUM', 'I-USERNAME', 'I-ZIPCODE']


Map:   0%|          | 0/81379 [00:00<?, ? examples/s]

## Training the Model

Now comes the main event - actually training our model. Here's what's happening:

**Setting up the model**: We start with the pre-trained multilingual BERT and add a classification head on top. This head will learn to predict which label each token should have (is it a name? an email? just regular text?).

**Evaluation metrics**: We're using seqeval, which is designed for sequence labeling tasks like ours. It scores whole entities rather than individual tokens: precision (what fraction of the entities we predicted are correct), recall (what fraction of the actual PII entities we found), and F1 score (the harmonic mean of the two). It also reports token-level accuracy.
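To make the entity-level scoring concrete, here is a minimal pure-Python sketch. It is not seqeval's implementation: `extract_entities` and `entity_prf` are hypothetical helpers, the tag sequences are invented, and a stray `I-` tag is simply treated as the start of a new entity.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO tag sequence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if ent_type is not None and not continues:
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and ent_type is None):
            start, ent_type = i, tag[2:]
    return entities


def entity_prf(references, predictions):
    """Exact-match precision, recall, and F1 over all sentences."""
    true_ents, pred_ents = set(), set()
    for idx, (ref, pred) in enumerate(zip(references, predictions)):
        true_ents |= {(idx, *ent) for ent in extract_entities(ref)}
        pred_ents |= {(idx, *ent) for ent in extract_entities(pred)}
    correct = len(true_ents & pred_ents)
    precision = correct / len(pred_ents) if pred_ents else 0.0
    recall = correct / len(true_ents) if true_ents else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# The prediction misses the second half of the email address.
references = [["O", "B-EMAIL", "I-EMAIL", "O", "B-CITY"]]
predictions = [["O", "B-EMAIL", "O", "O", "B-CITY"]]

precision, recall, f1 = entity_prf(references, predictions)
token_accuracy = sum(
    r == p for r, p in zip(references[0], predictions[0])
) / len(references[0])

print(precision, recall, f1)
print(token_accuracy)
```

In this toy case four of the five tokens are right, yet only one of the two entities matches exactly. That gap is why token accuracy tends to sit well above entity-level F1 on data where most tokens are `O`.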

**Training configuration**: We set up parameters like:
- Learning rate (how fast the model updates)
- Batch size (how many examples to look at once)
- Number of epochs (how many times to go through the entire dataset)
- Mixed precision (FP16), enabled when a GPU is available to speed things up
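As a rough worked example of how batch size translates into steps, the sketch below counts the batches needed for one pass over the validation set. `steps_per_pass` is a hypothetical helper, and the calculation assumes a single device. The example count and the timing figures are the ones reported in this notebook's output.

```python
import math


def steps_per_pass(num_examples, batch_size, num_devices=1):
    """Batches needed to see every example once (the last batch may be partial)."""
    return math.ceil(num_examples / (batch_size * num_devices))


VAL_EXAMPLES = 81_379   # validation examples reported in this run's output
EVAL_BATCH_SIZE = 16    # per_device_eval_batch_size

eval_steps = steps_per_pass(VAL_EXAMPLES, EVAL_BATCH_SIZE)
print(eval_steps)

# Cross-check against the reported throughput:
# eval_runtime (seconds) * eval_steps_per_second should land close to eval_steps.
reported_steps = round(369.3374 * 13.773)
print(reported_steps)
```

The same arithmetic applies to training: one epoch takes the number of training examples divided by `per_device_train_batch_size`, rounded up, under the same single-device assumption.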

**The training loop**: The Trainer handles all the complex stuff - feeding data to the model, calculating loss, updating weights, and periodically checking performance on the validation set. It saves checkpoints along the way and keeps the best performing model.
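The idea behind "keeping the best model" can be sketched as a selection over per-epoch validation results. This is not the Trainer's internal code. `pick_best` is a hypothetical helper, and the two rows are the epoch 2 and epoch 3 figures reported by this run.

```python
# Per-epoch validation results as reported by this run (epochs 2 and 3 only).
history = [
    {"epoch": 2, "eval_loss": 0.038495, "eval_f1": 0.891244},
    {"epoch": 3, "eval_loss": 0.035619, "eval_f1": 0.910508},
]


def pick_best(history, metric="eval_f1", greater_is_better=True):
    """Return the entry with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda row: row[metric])


best = pick_best(history)
print(best["epoch"], best["eval_f1"])

# Selecting on loss instead flips the direction of "better".
best_by_loss = pick_best(history, metric="eval_loss", greater_is_better=False)
print(best_by_loss["epoch"])
```

Here both criteria agree, but they do not have to, which is why the training cell sets `metric_for_best_model` and `greater_is_better` explicitly.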

After training finishes, we evaluate on the validation set one more time to see our final performance numbers, then save the trained model so we can use it later.

In [None]:

from transformers import DataCollatorForTokenClassification, TrainingArguments, Trainer
import torch, numpy as np, evaluate
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)
seqeval = evaluate.load("seqeval")
def compute_metrics(p):
    preds  = np.argmax(p.predictions, axis=-1)
    labels = p.label_ids

    true_labels, true_preds = [], []
    for pred_row, lab_row in zip(preds, labels):
        cur_labels, cur_preds = [], []
        for p_id, l_id in zip(pred_row, lab_row):
            if l_id == -100:
                continue
            cur_labels.append(id2label[l_id])
            cur_preds.append(id2label[p_id])
        true_labels.append(cur_labels)
        true_preds.append(cur_preds)

    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": results.get("overall_precision", 0.0),
        "recall":    results.get("overall_recall", 0.0),
        "f1":        results.get("overall_f1", 0.0),
        "accuracy":  results.get("overall_accuracy", 0.0),
    }
# Try adjusting number of epochs to check if there is a difference
common_kwargs = dict(
    output_dir="pii-detector-bert-mcased-news",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=100,
    fp16=torch.cuda.is_available(),
)
try:
    args = TrainingArguments(
        eval_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=3,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        **common_kwargs,
    )
except TypeError:
    # Older transformers releases do not accept some of the arguments above
    print("Older transformers version detected, falling back to step-based saving")
    args = TrainingArguments(
        save_steps=10_000,  # infrequent saves
        # output_dir is already set in common_kwargs; passing it again here
        # would raise "got multiple values for keyword argument 'output_dir'"
        **common_kwargs,
    )
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_train,
    eval_dataset=encoded_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
# If resuming from last checkpoint
#last_ckpt = "/content/pii-detector-bert-mcased-news/checkpoint-40690"
#trainer.train(resume_from_checkpoint=last_ckpt)
trainer.train()
eval_res = trainer.evaluate()
print("Validation metrics:", eval_res)
trainer.save_model("pii-detector-bert-mcased-news/best")
tokenizer.save_pretrained("pii-detector-bert-mcased-news/best")


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


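| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|----|----------|
| 2 | 0.0435 | 0.038495 | 0.887051 | 0.895476 | 0.891244 | 0.98717 |
| 3 | 0.0181 | 0.035619 | 0.906124 | 0.914934 | 0.910508 | 0.98906 |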


Validation metrics: {'eval_loss': 0.035619135946035385, 'eval_precision': 0.9061237697783927, 'eval_recall': 0.9149343698946538, 'eval_f1': 0.9105077562187103, 'eval_accuracy': 0.9890603846904966, 'eval_runtime': 369.3374, 'eval_samples_per_second': 220.338, 'eval_steps_per_second': 13.773, 'epoch': 3.0}


('pii-detector-bert-mcased-news/best/tokenizer_config.json',
 'pii-detector-bert-mcased-news/best/special_tokens_map.json',
 'pii-detector-bert-mcased-news/best/vocab.txt',
 'pii-detector-bert-mcased-news/best/added_tokens.json',
 'pii-detector-bert-mcased-news/best/tokenizer.json')

## Getting Detailed Performance Metrics

After training, we want to see detailed results broken down by entity type. This cell runs the model on the entire validation set and calculates metrics for each type of PII separately.

Besides the overall numbers, seqeval reports precision, recall, and F1 for each entity type, so we can see how well the model finds emails versus names versus phone numbers and whether some types of PII are harder for it than others.

In [None]:
# Get raw predictions on val set
preds_output = trainer.predict(encoded_val)
preds = preds_output.predictions.argmax(-1)
labels = preds_output.label_ids

true_labels, true_preds = [], []
for pred_row, lab_row in zip(preds, labels):
    cur_labels, cur_preds = [], []
    for p_id, l_id in zip(pred_row, lab_row):
        if l_id == -100:
            continue
        cur_labels.append(id2label[l_id])
        cur_preds.append(id2label[p_id])
    true_labels.append(cur_labels)
    true_preds.append(cur_preds)

import evaluate
seqeval = evaluate.load("seqeval")
results = seqeval.compute(predictions=true_preds, references=true_labels)
print(results)  # will include per-entity metrics too


{'ACCOUNTNUM': {'precision': np.float64(0.8244444444444444), 'recall': np.float64(0.8381024096385542), 'f1': np.float64(0.8312173263629574), 'number': np.int64(3984)}, 'BUILDINGNUM': {'precision': np.float64(0.904), 'recall': np.float64(0.8781570913127948), 'f1': np.float64(0.8908911727439109), 'number': np.int64(3603)}, 'CITY': {'precision': np.float64(0.9247442766682903), 'recall': np.float64(0.9404334365325078), 'f1': np.float64(0.9325228710014122), 'number': np.int64(8075)}, 'CREDITCARDNUMBER': {'precision': np.float64(0.9096989966555183), 'recall': np.float64(0.9543859649122807), 'f1': np.float64(0.9315068493150684), 'number': np.int64(2565)}, 'DATEOFBIRTH': {'precision': np.float64(0.897063099738296), 'recall': np.float64(0.8220090594191314), 'f1': np.float64(0.8578976640711903), 'number': np.int64(3753)}, 'DRIVERLICENSENUM': {'precision': np.float64(0.9538834951456311), 'recall': np.float64(0.9504232164449818), 'f1': np.float64(0.9521502119927318), 'number': np.int64(2481)}, 'EM
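The printed dictionary is hard to scan, so one option is to flatten it into a small table sorted by F1. The sketch below uses four entries copied from the output above, rounded to four decimals. `per_entity_rows` is a hypothetical helper. The `overall_f1` entry stands in for the plain-number `overall_*` keys that seqeval returns alongside the per-entity dicts, with its value taken from the validation metrics reported earlier.

```python
# A few per-entity entries from the output above, rounded to four decimals.
results = {
    "ACCOUNTNUM": {"precision": 0.8244, "recall": 0.8381, "f1": 0.8312, "number": 3984},
    "CITY": {"precision": 0.9247, "recall": 0.9404, "f1": 0.9325, "number": 8075},
    "DATEOFBIRTH": {"precision": 0.8971, "recall": 0.8220, "f1": 0.8579, "number": 3753},
    "DRIVERLICENSENUM": {"precision": 0.9539, "recall": 0.9504, "f1": 0.9522, "number": 2481},
    "overall_f1": 0.9105,  # overall_* values are plain numbers, not dicts
}


def per_entity_rows(results):
    """Keep only per-entity entries (dict values) and sort by F1, worst first."""
    rows = [(name, m) for name, m in results.items() if isinstance(m, dict)]
    return sorted(rows, key=lambda item: item[1]["f1"])


rows = per_entity_rows(results)
print(f"{'entity':<18}{'P':>8}{'R':>8}{'F1':>8}{'n':>8}")
for name, m in rows:
    print(
        f"{name:<18}{m['precision']:>8.3f}{m['recall']:>8.3f}"
        f"{m['f1']:>8.3f}{int(m['number']):>8d}"
    )
```

Within this subset, `ACCOUNTNUM` and `DATEOFBIRTH` score lowest, and `DATEOFBIRTH` recall is noticeably below its precision, meaning the model misses dates of birth more often than it invents them.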

## Saving the Model (Optional)

This section is for when you're working in Google Colab and want to download your trained model to your local computer. It zips up the model files and triggers a download.

You only need to run this if you want a local zip copy of the model. Otherwise, the model is already saved in the output directory and can be loaded from there.
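For readers unfamiliar with `shutil.make_archive`, the sketch below shows where the archive ends up and what it contains. It runs against a temporary folder holding one placeholder file, so the paths and the `config.json` stand-in are assumptions for illustration, not the real model files.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for the saved model folder (placeholder file, not a real model).
    model_dir = os.path.join(workdir, "run", "best")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as fh:
        fh.write("{}")

    # base_name has no extension; make_archive appends ".zip" and returns the path.
    base_name = os.path.join(workdir, "run", "best")
    zip_path = shutil.make_archive(base_name, "zip", root_dir=model_dir)
    archive_name = os.path.basename(zip_path)
    archive_parent = os.path.basename(os.path.dirname(zip_path))

    # Entries inside the archive are stored relative to root_dir.
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.namelist()

print(archive_name, "in", archive_parent)
print(members)
```

Because the base name matches the folder path, the archive is written next to the folder rather than inside it, so the zip never ends up archiving itself.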

In [None]:
# Point to your saved folder
OUTPUT_DIR = "/content/pii-detector-bert-mcased-news/best"
# Zip it; make_archive appends ".zip", so the archive is written next to the folder as best.zip
import shutil
ZIP_BASENAME = "/content/pii-detector-bert-mcased-news/best"
zip_path = shutil.make_archive(ZIP_BASENAME, "zip", OUTPUT_DIR)
# Download to your desktop
from google.colab import files
files.download(zip_path)
