## 1. Setup: Libraries, Dataset Paths, and Base Model

In this first step, we set up everything we need to fine-tune our PII detector on the TAB (Text Anonymization Benchmark) ECHR dataset.

- We import:
  - **Hugging Face Transformers** (for the tokenizer, model, Trainer, etc.),
  - **Datasets** (for convenient dataset handling),
  - and some standard Python tools like `json`, `torch`, and `pathlib`.
- We then **clone the TAB benchmark repo** from GitHub so we can access the ECHR JSON files locally.
- We define:
  - `TRAIN_JSON` and `DEV_JSON` pointing to the train and dev splits,
  - `BASE_MODEL_ID` as our starting model (`Th3red/privacy_mBert_1`, which is already trained for PII detection),
  - `MAX_LENGTH` as the maximum sequence length for BERT,
  - and a `device` variable so we can use GPU if it's available.

This cell wires up the file paths, the base model, and the device so the rest of the notebook can focus on data processing and training.


In [None]:
import json
import torch
from pathlib import Path
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)
from datasets import Dataset
!git clone https://github.com/NorskRegnesentral/text-anonymization-benchmark.git
TAB_DIR = Path("text-anonymization-benchmark")
# Files
TRAIN_JSON = TAB_DIR / "echr_train.json"
DEV_JSON = TAB_DIR / "echr_dev.json"
# Base model
BASE_MODEL_ID = "Th3red/privacy_mBert_1"
# Max sequence length
MAX_LENGTH = 512
# Device
device = "cuda" if torch.cuda.is_available() else "cpu"
device

Cloning into 'text-anonymization-benchmark'...
remote: Enumerating objects: 90, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 90 (delta 26), reused 27 (delta 26), pack-reused 57 (from 1)
Receiving objects: 100% (90/90), 7.18 MiB | 14.52 MiB/s, done.
Resolving deltas: 100% (45/45), done.


'cuda'

## 2. Loading the TAB ECHR Documents

Here we load the **ECHR subset** of the TAB benchmark directly from the JSON files we just pointed to:

- `train_docs` contains all training documents.
- `dev_docs` contains all development (validation) documents.
- Each document is a JSON object with fields such as:
  - `text`: the raw document text,
  - `annotations`: human-labeled entity mentions,
  - `doc_id`, `dataset_type`, and some metadata.

We print:
- the number of training and dev documents, and  
- the keys of the first training document,

just to confirm that the data structure is what we expect before we start extracting PII spans.


In [None]:
with open(TRAIN_JSON, "r", encoding="utf-8") as f:
    train_docs = json.load(f)

with open(DEV_JSON, "r", encoding="utf-8") as f:
    dev_docs = json.load(f)

len(train_docs), len(dev_docs), train_docs[0].keys()

(1014,
 127,
 dict_keys(['annotations', 'quality_checked', 'text', 'task', 'meta', 'doc_id', 'dataset_type']))

## 3. Extracting Gold PII Spans from TAB Annotations

TAB uses a slightly nested JSON format for annotations, so this helper function standardizes it:

- Given a single `doc` from TAB, `get_gold_spans` returns a list of `(start_offset, end_offset)` pairs.
- We focus on **two types of identifiers**:
  - `DIRECT`: clear identifiers like names, emails, phone numbers, etc.
  - `QUASI`: quasi-identifiers that can still reveal identity when combined (e.g., locations, dates).

The function:
- Handles two cases:
  1. Entity mentions directly under `doc["entity_mentions"]`.
  2. The usual case where entity mentions are nested under `doc["annotations"][annotator_id]["entity_mentions"]`.
- Filters out any entity whose `identifier_type` is not `DIRECT` or `QUASI`.

These character offsets are our ground truth for where PII appears in each document.
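
As a minimal sketch of the filtering rule described above, the snippet below applies it to a toy document in the nested TAB layout. The document text, the offsets, the annotator key `annotator1`, and the helper name `toy_gold_spans` are all invented for illustration; `NO_MASK` stands in for any `identifier_type` other than `DIRECT` or `QUASI`.

```python
# Toy document in the nested TAB layout; every value here is invented for illustration.
toy_doc = {
    "text": "Mr John Smith was born in Oslo in 1980.",
    "annotations": {
        "annotator1": {
            "entity_mentions": [
                {"start_offset": 3, "end_offset": 13, "identifier_type": "DIRECT"},
                {"start_offset": 26, "end_offset": 30, "identifier_type": "QUASI"},
                {"start_offset": 34, "end_offset": 38, "identifier_type": "NO_MASK"},
            ]
        }
    },
}


def toy_gold_spans(doc):
    """Compact, hypothetical version of the DIRECT/QUASI filter."""
    return [
        (ent["start_offset"], ent["end_offset"])
        for ann in doc.get("annotations", {}).values()
        for ent in ann.get("entity_mentions", [])
        if ent.get("identifier_type") in ("DIRECT", "QUASI")
    ]


spans = toy_gold_spans(toy_doc)
# Slice the text with each span to see what would be masked
masked_text = [toy_doc["text"][s:e] for s, e in spans]
print(spans, masked_text)
```

Only the first two mentions survive the filter; the third is dropped because its type is neither `DIRECT` nor `QUASI`.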


In [None]:
def get_gold_spans(doc):
    """
    From a TAB document dict, return a list of (start_offset, end_offset)
    for all entity mentions that should be masked (identifier_type in {DIRECT, QUASI}).

    Handles the TAB JSON format where mentions live under doc["annotations"].
    """

    spans = []

    # If the doc already has entity_mentions at top level for some reason, handle that too.
    if "entity_mentions" in doc:
        for ent in doc["entity_mentions"]:
            if ent.get("identifier_type") in ("DIRECT", "QUASI"):
                spans.append((ent["start_offset"], ent["end_offset"]))
        return spans

    # Standard TAB format: doc["annotations"] is a dict of annotator_id -> annotation_object
    annotations = doc.get("annotations", {})

    # annotations is usually a dict like {"0": { "entity_mentions": [...] }, "1": {...}, ...}
    for ann in annotations.values():
        # Some versions might put entity mentions directly under the annotation object
        ents = ann.get("entity_mentions", [])
        for ent in ents:
            if ent.get("identifier_type") in ("DIRECT", "QUASI"):
                spans.append((ent["start_offset"], ent["end_offset"]))

    return spans



## 4. Tokenizer and Simplified PII Label Space

Next, we load the tokenizer that matches our base model, and define a **simplified label space** for TAB fine-tuning:

- We use `AutoTokenizer.from_pretrained(BASE_MODEL_ID)` to ensure the tokenizer is consistent with the model.
- For TAB, instead of having many different PII types, we collapse them into:
  - `O`    → non-PII token,
  - `B-PII` → beginning of a PII span,
  - `I-PII` → inside a PII span.

We define:
- `label2id` for converting tag strings to integer IDs, and  
- `id2label` for converting back from IDs to tags.

This gives us a **3-label token classification setup** that focuses on “is this token part of any PII?” rather than its specific subtype.
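
To make the mapping concrete, here is a small sketch that decodes a sequence of label IDs back into tags. The helper `decode_labels` and the example sequence are hypothetical; `-100` is the ignore index that the preprocessing step later assigns to special tokens.

```python
label2id = {"O": 0, "B-PII": 1, "I-PII": 2}
id2label = {v: k for k, v in label2id.items()}


def decode_labels(label_ids, ignore_index=-100):
    """Map integer label IDs back to tag strings, skipping ignored positions."""
    return [id2label[i] for i in label_ids if i != ignore_index]


# Invented sequence: special token, two O tokens, a three-token PII span, O, special token
example_ids = [-100, 0, 0, 1, 2, 2, 0, -100]
tags = decode_labels(example_ids)
print(tags)
```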


In [None]:
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)

# New label space for TAB fine-tuning
label2id = {"O": 0, "B-PII": 1, "I-PII": 2}
id2label = {v: k for k, v in label2id.items()}
NUM_LABELS = len(label2id)

label2id, id2label


({'O': 0, 'B-PII': 1, 'I-PII': 2}, {0: 'O', 1: 'B-PII', 2: 'I-PII'})

## 5. Turning Character Spans into Token-Level BIO Labels

This function is the core of our preprocessing: it converts raw text and gold spans into **token IDs + aligned labels** that BERT can train on.

Steps inside `encode_with_labels`:

1. **Tokenize the raw text** using the BERT tokenizer (truncating to at most `max_length` tokens, so text beyond that limit is dropped), asking for:
   - `input_ids` and `attention_mask`, and
   - `offset_mapping`, which tells us which character positions each token covers.
2. **Initialize labels**:
   - Start by marking every token as non-PII (0).
3. **Mark tokens that overlap any PII span**:
   - For each gold span `(start, end)`, we mark any overlapping token as “inside a PII region” (temporarily set to 1).
4. **Convert to BIO tags**:
   - We scan the labels once more:
     - Special tokens get label `-100` so they are ignored by the loss.
     - The **first token in a PII segment** becomes `B-PII`.
     - Any **following tokens in the same segment** become `I-PII`.
     - All other tokens become `O`.
5. We remove the `offset_mapping` and attach the `labels` back into the encoding dictionary.

The quick test at the bottom encodes one training document so we can verify:
- the sequence lengths match, and
- the labels look reasonable (a mix of `-100`, `0`, `1`, `2`).
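
The steps above can be sketched without a tokenizer by writing the offsets by hand. This is a single-pass variant of the same idea, not the notebook's function: the name `spans_to_bio`, the sentence, and the word-level offsets are assumptions made for illustration, and a real tokenizer would usually split words into smaller pieces.

```python
def spans_to_bio(offsets, spans):
    """Assign BIO label IDs to tokens from character offsets and gold spans.

    offsets: (start, end) character range per token; start == end marks a special token.
    spans:   (start, end) gold PII character ranges.
    Returns label IDs: -100 = ignored, 0 = O, 1 = B-PII, 2 = I-PII.
    """
    labels = []
    prev_in_pii = False
    for s, e in offsets:
        if s == e:
            # Special token: ignored by the loss, and it breaks any running PII segment
            labels.append(-100)
            prev_in_pii = False
            continue
        # A token is PII if its character range overlaps any gold span
        in_pii = any(s < ge and e > gs for gs, ge in spans)
        if in_pii:
            labels.append(2 if prev_in_pii else 1)
        else:
            labels.append(0)
        prev_in_pii = in_pii
    return labels


text = "Mr John Smith lives in Oslo."
# Hand-written offsets standing in for the tokenizer's offset_mapping
offsets = [(0, 0), (0, 2), (3, 7), (8, 13), (14, 19), (20, 22), (23, 27), (27, 28), (0, 0)]
spans = [(3, 13), (23, 27)]  # "John Smith" and "Oslo"
labels = spans_to_bio(offsets, spans)
print(labels)
```

Here "John" starts a segment (`B-PII`), "Smith" continues it (`I-PII`), and "Oslo" starts a new segment because the tokens between the two spans are `O`.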


In [None]:
def encode_with_labels(text, spans, tokenizer, max_length=512):
    enc = tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        return_offsets_mapping=True,
    )

    offsets = enc["offset_mapping"]
    # Initialize all labels as "O" (0)
    labels = [0] * len(offsets)   # we'll later turn PII tokens into 1 (B) or 2 (I)
    # Mark tokens that overlap any gold span as PII (temporary mark = 1)
    for (gs, ge) in spans:
        for i, (s, e) in enumerate(offsets):
            if s == e:  # special token or padding
                continue
            if e <= gs or s >= ge:
                continue
            # Token overlaps gold span
            labels[i] = 1
    # Now convert contiguous sequences of 1s into B-PII / I-PII
    for i in range(len(labels)):
        s, e = offsets[i]
        if s == e:
            # Ignore special tokens in loss
            labels[i] = -100
        elif labels[i] == 1:
            # If previous real token is not 1/2, this is B-PII, else I-PII
            if i == 0:
                labels[i] = label2id["B-PII"]
            else:
                prev_label = labels[i - 1]
                prev_s, prev_e = offsets[i - 1]
                if prev_s == prev_e or prev_label in (0, -100):
                    labels[i] = label2id["B-PII"]
                else:
                    labels[i] = label2id["I-PII"]
        elif labels[i] == 0:
            labels[i] = label2id["O"]
    # Drop offsets from the encoding; keep the token data and add the labels
    enc.pop("offset_mapping")
    enc["labels"] = labels
    return enc
# Quick test on a single train doc
test_doc = train_docs[0]
test_spans = get_gold_spans(test_doc)
encoded = encode_with_labels(test_doc["text"], test_spans, tokenizer, max_length=MAX_LENGTH)

len(encoded["input_ids"]), len(encoded["labels"]), encoded["labels"][:30]


(512,
 512,
 [-100,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  2,
  2,
  2,
  0,
  0,
  0,
  1,
  2,
  2,
  0,
  0,
  0,
  0,
  0])

## 6. Building Train and Dev Datasets for Fine-Tuning

Now we wrap everything into `build_tab_dataset`, which converts the raw TAB documents into a **Hugging Face `Dataset`** ready for training.

For each document:

1. We call `get_gold_spans(doc)` to obtain the PII character ranges we want to detect.
2. If there are no PII spans, we optionally skip the document (to focus training on examples that actually contain identifiers).
3. We call `encode_with_labels` to get:
   - `input_ids`,
   - `attention_mask`,
   - and aligned `labels` in the {`O`, `B-PII`, `I-PII`} scheme.
4. We collect these into lists and finally build a `Dataset` with the three fields.

We do this for:
- `tab_train`: training set,
- `tab_dev`: development (validation) set.

Printing `tab_train` and one example lets us confirm that:
- the dataset has the right number of rows, and
- each example has the expected fields and label shapes.
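
One further sanity check, sketched here as an optional extra, is the share of each label among non-ignored tokens, since PII tokens are typically a minority. The helper `label_distribution` and the toy label lists are hypothetical.

```python
from collections import Counter


def label_distribution(all_labels, ignore_index=-100):
    """Return the fraction of each label ID among tokens that count toward the loss."""
    counts = Counter(
        label for seq in all_labels for label in seq if label != ignore_index
    )
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


# Two invented label sequences in the {-100, 0, 1, 2} scheme
toy_labels = [
    [-100, 0, 0, 1, 2, 0, -100],
    [-100, 0, 1, 0, 0, 0, -100],
]
dist = label_distribution(toy_labels)
print(dist)
```

On the real data, the same helper could be called as `label_distribution(tab_train["labels"])`.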


In [None]:
def build_tab_dataset(docs, tokenizer, max_length=512):
    all_input_ids = []
    all_attention_masks = []
    all_labels = []

    for doc in docs:
        spans = get_gold_spans(doc)
        if not spans:
            # Skip docs with no identifiers (optional; you can keep them with all-O labels if you want)
            continue

        encoded = encode_with_labels(doc["text"], spans, tokenizer, max_length=max_length)
        all_input_ids.append(encoded["input_ids"])
        all_attention_masks.append(encoded["attention_mask"])
        all_labels.append(encoded["labels"])

    ds = Dataset.from_dict({
        "input_ids": all_input_ids,
        "attention_mask": all_attention_masks,
        "labels": all_labels,
    })

    return ds

tab_train = build_tab_dataset(train_docs, tokenizer, max_length=MAX_LENGTH)
tab_dev   = build_tab_dataset(dev_docs, tokenizer, max_length=MAX_LENGTH)

tab_train, tab_dev, tab_train[0]


(Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 1014
 }),
 Dataset({
     features: ['input_ids', 'attention_mask', 'labels'],
     num_rows: 127
 }),
 {'input_ids': [101,
   23837,
   49378,
   33809,
   98348,
   11259,
   10117,
   13474,
   60256,
   10106,
   10151,
   19800,
   113,
   10192,
   119,
   37257,
   98041,
   120,
   10719,
   114,
   11327,
   10105,
   14648,
   10108,
   25854,
   108850,
   18832,
   10169,
   10105,
   14100,
   10571,
   26295,
   11069,
   10108,
   10105,
   25318,
   10142,
   10105,
   36682,
   10108,
   15426,
   22305,
   10111,
   26762,
   50938,
   22326,
   10107,
   113,
   100,
   10105,
   25318,
   100,
   114,
   10155,
   169,
   29876,
   11844,
   117,
   12916,
   19965,
   45896,
   107992,
   11534,
   113,
   100,
   10105,
   72894,
   21307,
   10368,
   100,
   114,
   117,
   10135,
   10413,
   10735,
   10214,
   119,
   10117,
   72894,
   21307,
   10368,
   10134,
   18839,
  

## 7. Adapting the Base PII Model to the TAB Label Space

Our starting model, `privacy_mBert_1`, was originally trained with a **larger PII label set** (35 labels, judging by the classifier shape reported when the checkpoint is loaded). For the TAB ECHR task, we want to fine-tune it with just three labels: `O`, `B-PII`, and `I-PII`.

Here we:

1. Load the **base config** from the existing model.
2. Override:
   - `num_labels` to `3`,
   - `id2label` and `label2id` to match our new label mapping.
3. Load the model with `AutoModelForTokenClassification.from_pretrained`, passing the updated config and `ignore_mismatched_sizes=True`:
   - This tells Transformers to **reuse the encoder weights** (the BERT body) but **reinitialize the classifier head** to match the new 3-label output layer.
   
We then move the model to GPU if available and print out the label mappings and number of labels, to confirm the model is configured correctly for this new task.
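
The effect of `ignore_mismatched_sizes=True` can be pictured as a shape comparison between the checkpoint and the newly configured model. The sketch below is a plain-Python illustration of that idea, not the Transformers implementation: the helper `split_by_shape` is hypothetical, the single encoder entry is a representative parameter, and the classifier shapes assume a 35-label checkpoint with hidden size 768.

```python
def split_by_shape(checkpoint_shapes, model_shapes):
    """Return (reused, reinitialized) parameter names based on shape agreement."""
    reused, reinitialized = [], []
    for name, shape in model_shapes.items():
        if checkpoint_shapes.get(name) == shape:
            reused.append(name)  # same shape: weights can be copied from the checkpoint
        else:
            reinitialized.append(name)  # shape differs: start from fresh weights
    return reused, reinitialized


# Illustrative shapes: one encoder matrix plus the classification head
checkpoint_shapes = {
    "bert.encoder.layer.0.attention.self.query.weight": (768, 768),
    "classifier.weight": (35, 768),
    "classifier.bias": (35,),
}
model_shapes = {
    "bert.encoder.layer.0.attention.self.query.weight": (768, 768),
    "classifier.weight": (3, 768),
    "classifier.bias": (3,),
}
reused, reinitialized = split_by_shape(checkpoint_shapes, model_shapes)
print(reused, reinitialized)
```

Only the two classifier tensors differ in shape, so only the head starts from scratch while the encoder keeps what it learned.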

In [None]:
base_config = AutoConfig.from_pretrained(BASE_MODEL_ID)
base_config.num_labels = NUM_LABELS
base_config.id2label = id2label
base_config.label2id = label2id

model = AutoModelForTokenClassification.from_pretrained(
    BASE_MODEL_ID,
    config=base_config,
    ignore_mismatched_sizes=True, # new classifier head
)

model.to(device)
model.config.id2label, model.config.label2id, model.num_labels


Some weights of BertForTokenClassification were not initialized from the model checkpoint at Th3red/privacy_mBert_1 and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([35]) in the checkpoint and torch.Size([3]) in the model instantiated
- classifier.weight: found shape torch.Size([35, 768]) in the checkpoint and torch.Size([3, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


({0: 'O', 1: 'B-PII', 2: 'I-PII'}, {'O': 0, 'B-PII': 1, 'I-PII': 2}, 3)

## 8. Fine-Tuning on TAB with the Hugging Face Trainer

Here we set up and run the actual fine-tuning on the TAB ECHR dataset.

Key pieces:

- **`DataCollatorForTokenClassification`**:  
  Pads `input_ids` and `attention_mask` to the longest sequence in each batch and pads `labels` with `-100`, so padded positions are ignored by the loss.

- **TrainingArguments**:
  - `output_dir`: where checkpoints and logs are saved,
  - `eval_strategy="epoch"` and `save_strategy="epoch"`: evaluate and save at the end of each epoch,
  - `learning_rate=5e-5`: a common fine-tuning learning rate for BERT-style models,
  - `per_device_train_batch_size=4` and `per_device_eval_batch_size=4`: a batch size of 4 per device for both training and evaluation,
  - `num_train_epochs=4`,
  - `weight_decay=0.01` to regularize the model,
  - `load_best_model_at_end=True` with `metric_for_best_model="eval_loss"`: automatically reload the checkpoint with the lowest loss on the dev set once training ends.

- **`Trainer`**:
  - Ties together the model, data, and training configuration,
  - Handles the training loop, evaluation, checkpointing, and integration with Weights & Biases (W&B) for logging.

We then call `trainer.train()`, which runs the fine-tuning process and tracks:
- training loss,
- evaluation loss,
- runtime statistics,
- and W&B logs (if configured).
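
As a rough check on the step counts the Trainer reports, the arithmetic can be sketched as follows. It assumes a single device, no gradient accumulation, and a final partial batch that is kept; the helper name `training_steps` is hypothetical.

```python
import math


def training_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total optimizer steps) under the assumptions above."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


# 1014 training documents, batch size 4, 4 epochs as configured
steps_per_epoch, total_steps = training_steps(1014, 4, 4)
print(steps_per_epoch, total_steps)
```

With 1014 documents and a batch size of 4, one epoch is 254 steps, which matches the `checkpoint-254` directory saved after the first epoch. The run logged in this notebook reports `global_step=508` at epoch 2.0, which corresponds to two such epochs.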


In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

output_dir = "privacy_mBert_TAB_finetuned"

training_args = TrainingArguments(
    output_dir=output_dir,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=4,
    weight_decay=0.01,
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tab_train,
    eval_dataset=tab_dev,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument in recent Transformers releases
    data_collator=data_collator,
)

trainer.train()


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: hiroschmitz (hiroschmitz-denver) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.1034        | 0.150979        |
| 2     | 0.0724        | 0.156336        |


TrainOutput(global_step=508, training_loss=0.11063974028027902, metrics={'train_runtime': 282.9688, 'train_samples_per_second': 7.167, 'train_steps_per_second': 1.795, 'total_flos': 529914613542912.0, 'train_loss': 0.11063974028027902, 'epoch': 2.0})

## 9. Saving the Fine-Tuned Model and Final Evaluation

After training finishes, we:

1. **Save the model** to `output_dir`, including the new 3-label classification head.
2. **Save the tokenizer** alongside it, so we have a complete, reusable package.
3. Run `trainer.evaluate()` on the dev set to get the final evaluation metrics:
   - `eval_loss`,
   - evaluation runtime,
   - samples per second,
   - steps per second,
   - and the epoch at which these metrics were computed.

The evaluation loss summarizes how well the TAB-fine-tuned model fits the ECHR dev set. Because `load_best_model_at_end=True`, the model saved and evaluated here is the checkpoint with the lowest dev loss, not necessarily the one from the final epoch. The saved model can now be used in our anonymization pipeline or uploaded to the Hugging Face Hub for reuse.
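
The notebook reports only the evaluation loss. As a possible complement, the sketch below computes token-level precision, recall, and F1 for the binary question "is this token PII?". The helper `token_prf` and the toy label sequences are hypothetical, and in practice the predictions would come from the model's argmax over logits.

```python
def token_prf(gold, pred, ignore_index=-100):
    """Token-level precision, recall, and F1, treating B-PII and I-PII alike.

    Positions whose gold label equals ignore_index are skipped.
    """
    tp = fp = fn = 0
    for g_seq, p_seq in zip(gold, pred):
        for g, p in zip(g_seq, p_seq):
            if g == ignore_index:
                continue
            g_pii, p_pii = g != 0, p != 0
            tp += g_pii and p_pii
            fp += (not g_pii) and p_pii
            fn += g_pii and (not p_pii)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented gold labels and predictions for one short sequence
gold = [[-100, 0, 1, 2, 0, -100]]
pred = [[0, 0, 1, 0, 1, 0]]
precision, recall, f1 = token_prf(gold, pred)
print(precision, recall, f1)
```

For an anonymization task, recall on PII tokens is usually the number to watch, since a missed identifier is more costly than an over-masked word.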

In [None]:
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)


('privacy_mBert_TAB_finetuned/tokenizer_config.json',
 'privacy_mBert_TAB_finetuned/special_tokens_map.json',
 'privacy_mBert_TAB_finetuned/vocab.txt',
 'privacy_mBert_TAB_finetuned/added_tokens.json',
 'privacy_mBert_TAB_finetuned/tokenizer.json')

In [None]:
eval_results = trainer.evaluate()
eval_results


{'eval_loss': 0.15097862482070923,
 'eval_runtime': 4.0829,
 'eval_samples_per_second': 31.105,
 'eval_steps_per_second': 7.838,
 'epoch': 2.0}

In [None]:
!zip -r privacy_mBert_TAB_finetuned.zip privacy_mBert_TAB_finetuned
from google.colab import files
files.download("privacy_mBert_TAB_finetuned.zip")


  adding: privacy_mBert_TAB_finetuned/ (stored 0%)
  adding: privacy_mBert_TAB_finetuned/vocab.txt (deflated 45%)
  adding: privacy_mBert_TAB_finetuned/config.json (deflated 54%)
  adding: privacy_mBert_TAB_finetuned/special_tokens_map.json (deflated 80%)
  adding: privacy_mBert_TAB_finetuned/tokenizer.json (deflated 67%)
  adding: privacy_mBert_TAB_finetuned/runs/ (stored 0%)
  adding: privacy_mBert_TAB_finetuned/runs/Dec05_02-19-04_b74275c1d393/ (stored 0%)
  adding: privacy_mBert_TAB_finetuned/runs/Dec05_02-19-04_b74275c1d393/events.out.tfevents.1764901301.b74275c1d393.4149.0 (deflated 60%)
  adding: privacy_mBert_TAB_finetuned/runs/Dec05_02-19-04_b74275c1d393/events.out.tfevents.1764902569.b74275c1d393.4149.1 (deflated 25%)
  adding: privacy_mBert_TAB_finetuned/model.safetensors (deflated 7%)
  adding: privacy_mBert_TAB_finetuned/tokenizer_config.json (deflated 74%)
  adding: privacy_mBert_TAB_finetuned/checkpoint-254/ (stored 0%)
  adding: privacy_mBert_TAB_finetuned/checkpoint-25
