
# EDA: Multilingual PII Dataset (ai4privacy/open-pii-masking-500k-ai4privacy)

This notebook performs a **quick exploratory data analysis (EDA)** on the
Hugging Face dataset **`ai4privacy/open-pii-masking-500k-ai4privacy`**,
focusing on the languages **English (en)**, **German (de)**, **Italian (it)**,
and **French (fr)**.

**Goals**
- Verify we can load the dataset successfully.
- Inspect the size of the splits for the selected languages.
- Check label distribution (which PII types are most common?).
- Look at span-length statistics to understand typical entity sizes.

> Tip: Run cells top-to-bottom. If this is your first time using the dataset,
> the first cell that loads it will download and cache it (this can take a bit).



## 1. Imports & Settings

We import the core libraries for data handling and the Hugging Face `datasets`
loader. We also seed Python's `random` module so that any sampling steps are
reproducible.


In [None]:

# Cell 1: imports & settings
import json
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datasets import load_dataset
random.seed(42)



## 2. Load Dataset & Filter Languages

- We load the dataset with `load_dataset`.
- We filter to our target languages: **en/de/it/fr** using the dataset's
  `language` field.
- The code assumes the splits are named `train` and `validate`. Split names vary
  between datasets (`validation` is also common), so check `ds.keys()` if the
  lookup raises a `KeyError`.
- The last expression displays the sizes of the filtered splits as a quick
  sanity check.


In [None]:

# Cell 2: load subset (en/de/it/fr)
LANGS = {"en","de","it","fr"}
ds = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")
train = ds["train"].filter(lambda x: x["language"] in LANGS)
valid = ds["validate"].filter(lambda x: x["language"] in LANGS)
len(train), len(valid)
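
Because split names vary, a small lookup helper can make the cell above less
brittle. The sketch below is illustrative only: `pick_split` and its candidate
list are hypothetical names, not part of the `datasets` API, and a plain dict
stands in for the loaded dataset so the example runs offline.

```python
def pick_split(available, candidates=("validation", "validate", "valid", "val", "dev")):
    """Return the first candidate split name present in `available`.

    `available` can be any iterable of split names, e.g. the keys of a dict.
    """
    names = list(available)
    for name in candidates:
        if name in names:
            return name
    raise KeyError(f"none of {candidates} found; available splits: {names}")


# Toy stand-in for the loaded dataset: only the keys matter here.
toy_splits = {"train": [], "validation": []}
valid_name = pick_split(toy_splits)
print(valid_name)
```

With the real dataset, the same idea would read `valid = ds[pick_split(ds)]`,
assuming `ds` behaves like a dict keyed by split name.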



## 3. Language Counts (Train Split)

We count the number of examples per language (within our filtered training data).
This helps us see whether the dataset is balanced across **en/de/it/fr** or
dominated by a single language.


In [None]:

# Cell 3: language counts
pd.Series([ex["language"] for ex in train]).value_counts()
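
Raw counts are easier to judge for balance when shown next to each language's
share of the total. The following is a minimal sketch on a toy list of language
codes. `language_shares` is a hypothetical helper, and the toy values are made
up for illustration, not taken from the dataset.

```python
from collections import Counter


def language_shares(languages):
    """Map each language code to (count, fraction of all examples), most common first."""
    counts = Counter(languages)
    total = sum(counts.values())
    return {lang: (n, n / total) for lang, n in counts.most_common()}


# Toy input; with the real data this would be the `language` column of `train`.
toy_languages = ["en", "en", "de", "fr", "en", "it", "de", "en"]
shares = language_shares(toy_languages)
for lang, (n, frac) in shares.items():
    print(f"{lang}: {n} ({frac:.1%})")
```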



## 4. Label Frequency (Train Split)

Each example contains span annotations in `span_labels`, where each item is
`[start_char, end_char, label_name]`.

Here we flatten all labels in the training split and count their frequency to
get a sense of which PII entity types are most/least common.


In [None]:

# Cell 4: label freq (flatten span_labels)
from collections import Counter
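def label_counts(split):
    """Count how often each label occurs across all spans in `split`."""
    c = Counter()
    for ex in split:
        # `span_labels` may be None/empty; each span is [start, end, label]
        for s in ex["span_labels"] or []:
            c[s[2]] += 1
    return pd.Series(c).sort_values(ascending=False)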

label_counts(train)
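
The counting above assumes `span_labels` is already a list of
`[start_char, end_char, label_name]` items. Some datasets store such
annotations as JSON-encoded strings instead. Whether that applies here is an
assumption worth checking (for example with `train.features`). The sketch below
uses a hypothetical `parse_spans` helper that accepts either form, and the
label names in the toy data are illustrative only.

```python
import json
from collections import Counter


def parse_spans(raw):
    """Return spans as a list, whether `raw` is None, a list, or a JSON string."""
    if raw is None:
        return []
    if isinstance(raw, str):
        return json.loads(raw) if raw.strip() else []
    return list(raw)


def count_labels(examples):
    """Count label occurrences over an iterable of dict-like examples."""
    counts = Counter()
    for ex in examples:
        for _start, _end, label in parse_spans(ex.get("span_labels")):
            counts[label] += 1
    return counts


# Toy examples covering the three cases: list, JSON string, missing.
toy = [
    {"span_labels": [[0, 5, "GIVENNAME"], [10, 25, "EMAIL"]]},
    {"span_labels": '[[3, 9, "GIVENNAME"]]'},
    {"span_labels": None},
]
label_freq = count_labels(toy)
print(label_freq.most_common())
```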



## 5. Span Length Distribution (Characters)

We measure the **character length** of each annotated span (`end - start`)
over the first 5,000 training examples, for speed. `describe()` reports the
count, mean, standard deviation, min, quartiles (the 50% row is the median),
and max. This is useful to:
- anticipate how many tokens a typical entity will occupy after tokenization,
- choose window sizes and truncation limits for model training,
- inform decisions about augmentation or heuristics.


In [None]:

# Cell 5: span length distribution (chars)
lens = []
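# cap at len(train) so select() does not fail on a split with fewer than 5,000 rows
for ex in train.select(range(min(5000, len(train)))):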
    for s in ex["span_labels"] or []:
        lens.append(s[1]-s[0])
pd.Series(lens).describe()
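
A single pooled summary can hide large differences between entity types, since
short and long labels get mixed together. As a rough sketch, the same lengths
can be grouped per label. `span_lengths_by_label` is a hypothetical helper, and
the toy spans and label names are invented for illustration.

```python
from collections import defaultdict
from statistics import median


def span_lengths_by_label(examples):
    """Summarize character lengths (end - start) of spans, grouped by label."""
    lengths = defaultdict(list)
    for ex in examples:
        for start, end, label in ex.get("span_labels") or []:
            lengths[label].append(end - start)
    return {
        label: {"count": len(v), "min": min(v), "median": median(v), "max": max(v)}
        for label, v in lengths.items()
    }


# Toy examples; each span is [start_char, end_char, label_name].
toy = [
    {"span_labels": [[0, 5, "GIVENNAME"], [10, 25, "EMAIL"]]},
    {"span_labels": [[3, 9, "GIVENNAME"], [12, 20, "GIVENNAME"]]},
    {"span_labels": None},
]
summary = span_lengths_by_label(toy)
for label, stats in summary.items():
    print(label, stats)
```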



## Next Steps

- Plot distributions (e.g., histograms of span lengths) if needed.
- Inspect examples by label for quality checks.
- Proceed to **token-level label alignment** (BIO/BILOU) and baseline fine-tuning
  with a multilingual model (e.g., `xlm-roberta-base`).
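
To make the alignment step concrete, here is a minimal sketch of BIO tagging
from character spans. It assumes whitespace tokenization purely for
illustration. A real pipeline would use the subword tokenizer's character
offsets instead. `bio_tags`, the sample sentence, and the label names are all
hypothetical.

```python
import re


def bio_tags(text, spans):
    """Whitespace-tokenize `text` and assign BIO tags from [start, end, label] spans."""
    tokens, tags = [], []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span when their character ranges overlap.
            if m.start() < end and m.end() > start:
                # The token that covers the span's first character gets "B-".
                tag = ("B-" if m.start() <= start else "I-") + label
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags


text = "Contact Anna Maria at anna@example.org today"
spans = [[8, 18, "NAME"], [22, 38, "EMAIL"]]
tokens, tags = bio_tags(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The overlap test keeps a token inside an entity even when the span boundary
falls mid-token, which is a simplifying choice. Stricter schemes may drop or
flag such partial matches.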

When you're ready, we can add a **preprocessing script** and **training script**
to this repo.
