# 02 — Preprocessing: BIO Alignment & Tokenization

**Dataset:** `ai4privacy/open-pii-masking-500k-ai4privacy`  
**Languages:** English (en), German (de), Italian (it), French (fr)  
**Model/tokenizer:** `xlm-roberta-base`

**What this notebook does**
1) Load & filter dataset to en/de/it/fr  
2) Detect span field, normalize to (start, end, label)  
3) Build BIO tag set + mappings (`label2id`, `id2label`)  
4) Align char-level spans → token-level BIO using offsets  
5) Save tokenized dataset to `data/hf_tokenized` + `data/labels.json`


In [1]:
# Reproducible setup + paths
import json, random  # JSON for saving/loading, random for reproducibility
from pathlib import Path  # For file and directory paths
from typing import List, Tuple  # For type hints
import pandas as pd  # Data analysis and manipulation
from datasets import load_dataset, DatasetDict  # Hugging Face datasets
from transformers import AutoTokenizer  # Tokenizer for model

# Set random seed for reproducibility
SEED = 42
random.seed(SEED)

# Model and language settings
MODEL = "xlm-roberta-base"
LANGS = {"en", "de", "it", "fr"}
OUT = Path("data"); OUT.mkdir(parents=True, exist_ok=True)  # Output directory

print("Settings:", {"MODEL": MODEL, "LANGS": sorted(LANGS), "OUT": str(OUT)})


Settings: {'MODEL': 'xlm-roberta-base', 'LANGS': ['de', 'en', 'fr', 'it'], 'OUT': 'data'}


**Reproducibility Note:**
The setup cell above seeds Python's `random` module; the next cell seeds NumPy and PyTorch with the same value. Fixed seeds make DataLoader shuffling and other stochastic operations repeatable across runs, although some GPU operations can remain nondeterministic even with seeds set.

In [None]:
import numpy as np  # For numerical operations
import torch  # For deep learning
np.random.seed(SEED)  # Set numpy seed
torch.manual_seed(SEED)  # Set torch seed

### We use `xlm-roberta-base` because:

- **Multilingual capability:** It is pretrained on 100+ languages, including English, German, Italian, and French, so it can represent all our target languages well.
- **Tokenization robustness:** It uses SentencePiece subword tokenization, which handles different scripts, diacritics, and word forms without needing separate vocabularies.
- **Strong NER performance:** XLM-R models are competitive on multilingual NER benchmarks (like XTREME and CoNLL), often outperforming older multilingual BERT.
- **Good size/speed trade-off:** The base version (~270M parameters) balances accuracy and training speed, making it feasible to fine-tune on a single GPU.
- **Hugging Face ecosystem support:** It is fully integrated with 🤗 Transformers, which simplifies token classification training, evaluation, and deployment.

## 3. Load dataset

We pull the dataset and keep only the target languages for **train/validation**.


In [3]:
from datasets import load_dataset  # Import the function to load datasets
ds_raw = load_dataset(
    "ai4privacy/open-pii-masking-500k-ai4privacy",  # Dataset name on Hugging Face Hub
    cache_dir="hf-cache",                 # Use a local cache directory for downloads
    download_mode="force_redownload"      # Force a fresh download on every run; drop this argument to reuse the cache
)
ds_raw  # Show the loaded dataset

DatasetDict({
    train: Dataset({
        features: ['source_text', 'masked_text', 'privacy_mask', 'split', 'uid', 'language', 'region', 'script', 'mbert_tokens', 'mbert_token_classes'],
        num_rows: 464150
    })
    validation: Dataset({
        features: ['source_text', 'masked_text', 'privacy_mask', 'split', 'uid', 'language', 'region', 'script', 'mbert_tokens', 'mbert_token_classes'],
        num_rows: 116077
    })
})

## 4. Filter by language

We filter the dataset to include only samples in our target languages: English (`en`), German (`de`), Italian (`it`), and French (`fr`). This ensures that all subsequent processing and modeling steps focus exclusively on these four languages.
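
The filter predicate can be tried on plain Python data before running it on the full dataset. The sketch below uses a handful of hypothetical rows (not taken from the dataset) and a hypothetical helper named `keep`; it only illustrates the membership test and a per-language count.

```python
from collections import Counter

LANGS = {"en", "de", "it", "fr"}

# Hypothetical toy rows standing in for dataset examples.
rows = [
    {"language": "en", "source_text": "Call me at 555-0100."},
    {"language": "es", "source_text": "Llámame mañana."},
    {"language": "de", "source_text": "Ich wohne in Berlin."},
    {"language": "fr", "source_text": "Je m'appelle Marie."},
    {"language": "nl", "source_text": "Ik woon in Utrecht."},
]

def keep(row):
    """Same membership test as the lambda passed to `.filter(...)`."""
    return row["language"] in LANGS

kept = [r for r in rows if keep(r)]
counts = Counter(r["language"] for r in kept)
print(f"{len(rows)} rows -> {len(kept)} kept:", dict(counts))
```

Counting languages after filtering is a quick way to confirm that no unexpected language code slipped through.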

In [5]:
ds = DatasetDict({
    "train": ds_raw["train"].filter(lambda x: x["language"] in LANGS),  # Keep only target languages in train
    "validation": ds_raw["validation"].filter(lambda x: x["language"] in LANGS),  # ...and in validation
})
print(ds)  # Print dataset summary
print("Columns:", ds["train"].column_names)  # Show available columns
print("Example:", {k: type(ds['train'][0][k]).__name__ for k in ds["train"].column_names})  # Show types of fields

DatasetDict({
    train: Dataset({
        features: ['source_text', 'masked_text', 'privacy_mask', 'split', 'uid', 'language', 'region', 'script', 'mbert_tokens', 'mbert_token_classes'],
        num_rows: 331106
    })
    validation: Dataset({
        features: ['source_text', 'masked_text', 'privacy_mask', 'split', 'uid', 'language', 'region', 'script', 'mbert_tokens', 'mbert_token_classes'],
        num_rows: 82931
    })
})
Columns: ['source_text', 'masked_text', 'privacy_mask', 'split', 'uid', 'language', 'region', 'script', 'mbert_tokens', 'mbert_token_classes']
Example: {'source_text': 'str', 'masked_text': 'str', 'privacy_mask': 'list', 'split': 'str', 'uid': 'int', 'language': 'str', 'region': 'str', 'script': 'str', 'mbert_tokens': 'list', 'mbert_token_classes': 'list'}


### Dataset Summary

**Splits & Sizes**
| Split       | # Samples |
|-------------|-----------|
| Train       | 331,106   |
| Validation  | 82,931    |

**Columns (Features)**
| Column Name             | Type  | Description |
|-------------------------|-------|-------------|
| `source_text`           | str   | Original sentence/text |
| `masked_text`           | str   | Text with PII replaced/masked |
| `privacy_mask`          | list  | List of entity spans and labels |
| `split`                 | str   | Dataset split indicator (train/validation) |
| `uid`                   | int   | Unique identifier |
| `language`              | str   | Language code (e.g., en, de, it, fr) |
| `region`                | str   | Region code |
| `script`                | str   | Writing system |
| `mbert_tokens`          | list  | Tokens from mBERT tokenizer |
| `mbert_token_classes`   | list  | Token-level labels for mBERT |

**Purpose**
- Multilingual dataset containing text with **personally identifiable information (PII)**.
- Designed for **token classification** tasks such as NER (Named Entity Recognition) and PII masking.


## 5. Detect span field

The dataset stores entity spans in the `privacy_mask` field.  
To stay compatible with other possible schemas, we check a short list of candidate field names (`privacy_mask`, `spans`, `entities`, `span_labels`), use the first one present, and normalize its entries to a list of **(start, end, label)** triples.


In [11]:
CANDIDATE_SPAN_FIELDS = ["privacy_mask", "spans", "entities", "span_labels"]  # Possible span field names
SPAN_FIELD = next((c for c in CANDIDATE_SPAN_FIELDS if c in ds["train"].column_names), None)  # Find which exists
assert SPAN_FIELD, f"No span field found. Columns: {ds['train'].column_names}"
print("Using span field:", SPAN_FIELD)

def iter_spans(example) -> List[Tuple[int, int, str]]:
    """Return [(start, end, label), ...] regardless of the original schema."""
    items = example.get(SPAN_FIELD) or []  # Get the list of spans
    triples = []
    for it in items:
        if isinstance(it, dict):  # If span is a dict, extract fields by key
            triples.append((it["start"], it["end"], it["label"]))
        elif isinstance(it, (list, tuple)) and len(it) >= 3:  # If span is a tuple/list
            triples.append((it[0], it[1], it[2]))
    return triples

Using span field: privacy_mask
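
As a minimal sketch of the normalization step, the snippet below applies the same dict-or-sequence logic to a toy input. The function name `normalize_spans`, the sample sentence, and the extra `value` key are assumptions made for illustration; the sorting step is also an addition that `iter_spans` does not perform.

```python
from typing import Any, List, Optional, Tuple

Span = Tuple[int, int, str]

def normalize_spans(items: Optional[List[Any]]) -> List[Span]:
    """Convert dict- or sequence-shaped span records to sorted (start, end, label) triples."""
    triples: List[Span] = []
    for it in items or []:
        if isinstance(it, dict):
            triples.append((int(it["start"]), int(it["end"]), str(it["label"])))
        elif isinstance(it, (list, tuple)) and len(it) >= 3:
            triples.append((int(it[0]), int(it[1]), str(it[2])))
    return sorted(triples)

text = "Anna lives in Rome."
mixed = [
    {"start": 14, "end": 18, "label": "CITY", "value": "Rome"},  # dict-shaped record
    (0, 4, "GIVENNAME"),                                         # tuple-shaped record
]
spans = normalize_spans(mixed)
for s, e, lab in spans:
    print(lab, repr(text[s:e]))
```

Printing `text[s:e]` next to each label is a cheap check that the offsets are character positions with an exclusive end.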


## 6. Build label set

Collect all distinct entity labels across splits (en/de/it/fr only), then build BIO tags.


In [12]:
labels = set()  # Collect unique entity labels
for split in ["train", "validation"]:
    for ex in ds[split]:
        for s, e, lab in iter_spans(ex):
            labels.add(lab)  # Add each label found

labels = sorted(labels)  # Sort for consistency
print("Found labels:", labels)

# BIO mapping: B-*, I-* plus O
label2id = {f"B-{l}": i for i, l in enumerate(labels)}  # Beginning of entity
label2id.update({f"I-{l}": i + len(labels) for i, l in enumerate(labels)})  # Inside entity
O_ID = len(label2id)  # ID for 'O' (outside any entity)
label2id["O"] = O_ID
id2label = {v: k for k, v in label2id.items()}  # Reverse mapping

print("Num BIO classes (incl O):", len(label2id))
pd.Series(label2id).sort_values().head()  # Show a sample of the mapping

Found labels: ['AGE', 'BUILDINGNUM', 'CITY', 'CREDITCARDNUMBER', 'DATE', 'DRIVERLICENSENUM', 'EMAIL', 'GENDER', 'GIVENNAME', 'IDCARDNUM', 'PASSPORTNUM', 'SEX', 'SOCIALNUM', 'STREET', 'SURNAME', 'TAXNUM', 'TELEPHONENUM', 'TIME', 'TITLE', 'ZIPCODE']
Num BIO classes (incl O): 41


B-AGE                 0
B-BUILDINGNUM         1
B-CITY                2
B-CREDITCARDNUMBER    3
B-DATE                4
dtype: int64

The code in the previous cell collects all unique entity labels from the dataset's span annotations (such as AGE, EMAIL, TAXNUM) and constructs mappings for BIO tagging. For each entity label, it creates a `B-` (begin) and `I-` (inside) tag, plus a single `O` (outside) tag for non-entity tokens. These mappings (`label2id`, `id2label`) are used to convert entity spans into token-level classification targets for model training. The output displays the discovered labels and the resulting BIO tag-to-ID mapping.
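
To make the id layout concrete, here is the same construction applied to a toy set of three labels. The helper name `build_bio_maps` is hypothetical; the logic mirrors the cell above.

```python
def build_bio_maps(labels):
    """Build BIO tag <-> id mappings: all B- tags first, then all I- tags, then O."""
    labels = sorted(labels)
    label2id = {f"B-{l}": i for i, l in enumerate(labels)}
    label2id.update({f"I-{l}": i + len(labels) for i, l in enumerate(labels)})
    label2id["O"] = len(label2id)  # O takes the last id
    id2label = {v: k for k, v in label2id.items()}
    return label2id, id2label

label2id_toy, id2label_toy = build_bio_maps({"EMAIL", "CITY", "AGE"})
for tag, idx in sorted(label2id_toy.items(), key=lambda kv: kv[1]):
    print(idx, tag)
```

With N entity labels this layout yields 2N + 1 classes, and the `I-` id of a label is always its `B-` id plus N. For the 20 labels found above that gives 41 classes, with `O` at id 40.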

## 7. Tokenizer & alignment approach

- Use `xlm-roberta-base` fast tokenizer with `return_offsets_mapping=True`
- For each token span `(s, e)`:
  - If (s == e): **special token** → label `-100` (ignored by loss/metrics)
  - Else, look at char‑tags in `text[s:e]`:
    - Prefer a `B-` if present, else first non‑`O` tag (an `I-`)
  - If none found → `O`
- Truncate to `max_length=256` tokens.
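
The rule above can be traced on a toy input without loading a tokenizer. In this sketch the offsets are hand-written and hypothetical (they are not actual `xlm-roberta-base` output), the helper names are made up for illustration, and tags are kept as strings rather than ids for readability.

```python
IGNORE = -100  # label for special tokens, ignored by the loss

def char_tags_from_spans(text, spans):
    """Tag every character: B- on the first char of a span, I- on the rest, O elsewhere."""
    tags = ["O"] * len(text)
    for s, e, lab in spans:
        if 0 <= s < e <= len(text):
            tags[s] = f"B-{lab}"
            for i in range(s + 1, e):
                tags[i] = f"I-{lab}"
    return tags

def token_tags(offsets, char_tags):
    """Pick one tag per token: prefer a B- in its window, else the first non-O, else O."""
    out = []
    for s, e in offsets:
        if s == e:  # empty offset range -> special token
            out.append(IGNORE)
            continue
        window = char_tags[s:e]
        b_tag = next((t for t in window if t.startswith("B-")), None)
        non_o = next((t for t in window if t != "O"), None)
        out.append(b_tag or non_o or "O")
    return out

text = "Anna lives in Rome."
spans = [(0, 4, "GIVENNAME"), (14, 18, "CITY")]
# Hypothetical subword offsets: <s>, "An", "na", "lives", "in", "Ro", "me", ".", </s>
offsets = [(0, 0), (0, 2), (2, 4), (5, 10), (11, 13), (14, 16), (16, 18), (18, 19), (0, 0)]

tags = token_tags(offsets, char_tags_from_spans(text, spans))
for (s, e), tag in zip(offsets, tags):
    print(repr(text[s:e]), tag)
```

Only the first subword of an entity receives `B-`; later subwords of the same entity fall through to the first non-`O` character tag, which is an `I-`.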


In [None]:
tok = AutoTokenizer.from_pretrained(MODEL, use_fast=True)  # Load fast tokenizer

def align_to_bio(example, max_length=256):
    """Map char-level spans to token-level BIO ids with -100 for specials."""
    text = example["source_text"]
    spans = iter_spans(example)

    # Char-level tags initialized to O
    char_tags = ["O"] * len(text)
    for s, e, lab in spans:
        if 0 <= s < e <= len(text):
            char_tags[s] = f"B-{lab}"  # Mark beginning of entity
            for i in range(s + 1, e):
                char_tags[i] = f"I-{lab}"  # Mark inside entity

    # Tokenize with offsets
    enc = tok(text, return_offsets_mapping=True, truncation=True, max_length=max_length)
    labels_tok = []
    for (s, e) in enc["offset_mapping"]:
        if s == e:
            labels_tok.append(-100)  # special tokens -> ignored
            continue
        window = char_tags[s:e]
        if any(t != "O" for t in window):
            tag = next((t for t in window if t.startswith("B-")),
                       next((t for t in window if t != "O"), "O"))  # Prefer B- tag, else I-
        else:
            tag = "O"
        labels_tok.append(label2id.get(tag, O_ID))

    # Remove offsets; keep ids, mask, labels
    enc.pop("offset_mapping")
    enc["labels"] = labels_tok
    return enc

## 8. Apply alignment to splits

We map `align_to_bio` over train/validation and keep only model‑needed columns.


In [15]:
cols_keep = ["input_ids", "attention_mask", "labels"]  # Only keep model-required columns
ds_tok = DatasetDict({
    name: split.map(align_to_bio, remove_columns=split.column_names, desc=f"Align {name}")  # Map function to each split
               .with_format("torch", columns=cols_keep)  # Set format for PyTorch
    for name, split in ds.items()
})
print(ds_tok)  # Show the processed dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 331106
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 82931
    })
})


## 9. Save tokenized dataset + label mappings

Artifacts:
- `data/hf_tokenized/` (HF dataset on disk)
- `data/labels.json` (labels, label2id, id2label)
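
One detail to keep in mind when reading `labels.json` back: JSON object keys are always strings, so the integer keys of `id2label` come back as `"0"`, `"1"`, and so on. The sketch below shows the round trip on a toy mapping (not the real label set) and one way to restore integer keys.

```python
import json

# Toy mapping for illustration only.
label2id_toy = {"B-CITY": 0, "I-CITY": 1, "O": 2}
id2label_toy = {v: k for k, v in label2id_toy.items()}

payload = json.dumps({"label2id": label2id_toy, "id2label": id2label_toy})
loaded = json.loads(payload)

# The string -> int mapping survives unchanged ...
print(loaded["label2id"] == label2id_toy)
# ... but the int keys of id2label were serialized as strings.
print(list(loaded["id2label"].keys()))

# Convert the keys back to int before using the mapping for decoding.
id2label_restored = {int(k): v for k, v in loaded["id2label"].items()}
print(id2label_restored == id2label_toy)
```

This is also why a comparison of a reloaded `id2label` against the in-memory dict would fail without the key conversion, while `label2id` compares equal directly.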


In [16]:
# Save label mappings to JSON file
(OUT / "labels.json").write_text(json.dumps({
    "labels": labels,
    "label2id": label2id,
    "id2label": id2label
}, indent=2))
ds_tok.save_to_disk(str(OUT / "hf_tokenized"))  # Save tokenized dataset to disk
print("Saved:", OUT / "hf_tokenized", "and", OUT / "labels.json")

Saved: data/hf_tokenized and data/labels.json


**Versioning for Reproducibility**
It's good practice to record the versions of key libraries used for preprocessing and modeling. This ensures that results can be reproduced exactly in the future or by other collaborators.

In [None]:
import datasets, transformers  # Import libraries to check versions
print(f"datasets: {datasets.__version__}")  # Print datasets version
print(f"transformers: {transformers.__version__}")  # Print transformers version
print(f"torch: {torch.__version__}")  # Print torch version
print(f"pandas: {pd.__version__}")  # Print pandas version

## 10. Sanity checks

- Reload from disk  
- Print sizes & a small sample  
- Validate label ids / shapes
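
For the "validate label ids / shapes" item, a per-record check could look like the sketch below. The helper `check_record` is hypothetical and works on plain Python lists; for tensor-formatted records, call `.tolist()` on each field first.

```python
IGNORE = -100  # label used for special tokens

def check_record(record, num_classes):
    """Return True if all three fields have equal length and every label is valid."""
    n = len(record["input_ids"])
    if len(record["attention_mask"]) != n or len(record["labels"]) != n:
        return False
    return all(l == IGNORE or 0 <= l < num_classes for l in record["labels"])

# Toy records for illustration; 41 is the number of BIO classes built earlier.
good = {"input_ids": [0, 717, 9, 2], "attention_mask": [1, 1, 1, 1], "labels": [-100, 40, 4, -100]}
bad_id = {"input_ids": [0, 717, 9, 2], "attention_mask": [1, 1, 1, 1], "labels": [-100, 41, 4, -100]}
bad_len = {"input_ids": [0, 717, 9, 2], "attention_mask": [1, 1, 1, 1], "labels": [-100, 40, 4]}

print(check_record(good, 41), check_record(bad_id, 41), check_record(bad_len, 41))
```

Running such a check over a sample of records catches out-of-range ids and length mismatches before training starts.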


In [18]:
from datasets import load_from_disk  # For loading saved dataset
import json  # For reading label mappings
try:
    reloaded = load_from_disk(str(OUT / "hf_tokenized"))  # Reload tokenized dataset from disk
except Exception as e:
    print(f"Error loading tokenized dataset: {e}")
    raise
print(reloaded)  # Print dataset info
print("Example record shapes:", {k: v.shape if hasattr(v, "shape") else type(v) for k, v in reloaded["train"][0].items()})  # Shapes of one record (v[0] would index a single scalar)
print("Sample record:", reloaded["train"][0])  # Show a sample record
# Basic assertions
assert set(["input_ids","attention_mask","labels"]).issubset(reloaded["train"].features), "Missing features."  # Check required features
label2id_disk = json.loads((OUT/"labels.json").read_text())["label2id"]  # Load label2id from disk
assert label2id_disk == label2id, "label2id mapping mismatch between disk and memory."  # Check mapping matches
print("Sanity checks passed ✅")

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 331106
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 82931
    })
})
Example record shapes: {'input_ids': torch.Size([41]), 'attention_mask': torch.Size([41]), 'labels': torch.Size([41])}
Sample record: {'input_ids': tensor([     0,    717,      9,    246,   5303,    100,    201,    927,   8055,
         37719,     12,  23356,    678,  23243,     53,   1950,  16762,     99,
           209,  22950,     47,  45252,     70, 149849,   4516,  17164,    111,
           378,  77804,  24794,    294,  84780,  21130,  61019, 132350,  18044,
           454,   2592,    268,      5,      2]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), 'labels': tensor([-100,   40,   40,   40,   40,   40,    4,   24,   24,   24,   4

## Summary: What We Did in This Notebook

In this notebook, we prepared a multilingual dataset for training a model to recognize and mask personally identifiable information (PII) in text. Here’s what we did, step by step:

1. **Loaded the Dataset:** We started by loading a large dataset containing sentences in several languages, each with PII labeled.

2. **Filtered for Target Languages:** We kept only the samples in English, German, Italian, and French, focusing our work on these four languages.

3. **Detected and Normalized Entity Spans:** We identified where PII appears in each sentence and converted the information into a standard format: the start and end character positions of each entity, plus its label (such as GIVENNAME or EMAIL).

4. **Built BIO Tag Mappings:** We created a system to label each word (or token) in a sentence as either the Beginning (B-) or Inside (I-) of an entity, or Outside (O) any entity. This is a common approach for training models to find entities in text.

5. **Tokenized and Aligned Labels:** We used the `xlm-roberta-base` tokenizer to split sentences into subword tokens and used its character offsets to align our BIO labels to these tokens, with special tokens labeled `-100` so the loss ignores them.

6. **Saved the Processed Data:** We saved the tokenized dataset and the label mappings to disk, making them ready for model training.

7. **Checked Our Work:** We reloaded the saved data, checked its structure, and made sure everything matched up correctly.

8. **Recorded Library Versions:** We printed out the versions of the main libraries we used, so anyone can reproduce our results in the future.

By following these steps, we turned raw multilingual text data into a format that’s ready for training a machine learning model to automatically detect and mask PII. This process is essential for building safe, privacy-aware AI systems.