## EDA Workflow Overview
This notebook will guide you through:
- Loading and inspecting the dataset structure.
- Exploring label distributions and entity types.
- Visualizing key statistics (e.g., label frequencies, text lengths).
- Highlighting any data quality or preprocessing considerations.

In [None]:
# Print package versions for reproducibility
import sys
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
print(f"Python: {sys.version}")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"matplotlib: {matplotlib.__version__}")
print(f"seaborn: {sns.__version__}")

# Multilingual PII NER: Exploratory Data Analysis (EDA)
This notebook explores the dataset used for multilingual Personally Identifiable Information (PII) Named Entity Recognition (NER).
 
**Project Goal:** Build and evaluate models for detecting and masking PII entities in multilingual text.
 
**Dataset:**
- Contains annotated text samples with PII spans and labels.
- Multilingual, with various entity types (e.g., NAME, LOCATION, EMAIL, etc.).
 
**Notebook Objectives:**
1. Understand the structure and distribution of the data.
2. Visualize label frequencies and data characteristics.
3. Identify potential issues or preprocessing needs before model training.

# 01 — Exploratory Data Analysis (EDA) of OpenPII Dataset

This notebook provides an initial exploration of the ai4privacy/open-pii-masking-500k-ai4privacy dataset. We will examine the structure, sample records, label distribution, and language coverage to better understand the data before preprocessing and model training.

## 1. Setup and Load Data

We will install required packages and load the OpenPII dataset from the Hugging Face Hub.

In [2]:
# Install required packages if running in a fresh environment
%pip install datasets pandas matplotlib seaborn --quiet

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset

# Load the dataset from Hugging Face Hub
ds = load_dataset('ai4privacy/open-pii-masking-500k-ai4privacy')
ds

Note: you may need to restart the kernel to use updated packages.


DatasetDict({
    train: Dataset({
        features: ['source_text', 'masked_text', 'privacy_mask', 'split', 'uid', 'language', 'region', 'script', 'mbert_tokens', 'mbert_token_classes'],
        num_rows: 464150
    })
    validation: Dataset({
        features: ['source_text', 'masked_text', 'privacy_mask', 'split', 'uid', 'language', 'region', 'script', 'mbert_tokens', 'mbert_token_classes'],
        num_rows: 116077
    })
})

The `DatasetDict` object contains two splits: `train` and `validation`. Each split is a `Dataset` with 10 features (columns), such as `source_text`, `masked_text`, `privacy_mask`, and metadata like `language` and `region`. The `train` split has 464,150 examples, while the `validation` split has 116,077 examples. This structure allows for easy access and manipulation of the data for training and evaluating models.

## 2. Sample Records

Let's look at a few samples from the training set to understand the data format.

In [3]:
df = pd.DataFrame(ds['train'][:10])
df.head()

Unnamed: 0,source_text,masked_text,privacy_mask,split,uid,language,region,script,mbert_tokens,mbert_token_classes
0,20:10:26 Venanzius Höttermann Revés యొక్క వివా...,[TIME_1] [GIVENNAME_1] [SURNAME_1] యొక్క వివాహ...,"[{'label': 'TIME', 'start': 0, 'end': 8, 'valu...",train,5387382,te,IN,Telu,"[20, :, 10, :, 26, Ve, ##nan, ##ziu, ##s, H, #...","[B-TIME, I-TIME, I-TIME, I-TIME, I-TIME, B-GIV..."
1,"Branislavka: 'Sí, por favor. ¿Cuánta cera de s...","[GIVENNAME_1]: 'Sí, por favor. ¿Cuánta cera de...","[{'label': 'GIVENNAME', 'start': 0, 'end': 11,...",train,5401531,es,MX,Latn,"[Br, ##ani, ##slav, ##ka, :, ', S, ##í, ,, por...","[B-GIVENNAME, I-GIVENNAME, I-GIVENNAME, I-GIVE..."
2,To-do list for 4th August 1942: meet with Bran...,To-do list for [DATE_1]: meet with [GIVENNAME_...,"[{'label': 'DATE', 'start': 15, 'end': 30, 'va...",train,5387389,en,CA,Latn,"[To, -, do, list, for, 4th, August, 1942, :, m...","[O, O, O, O, O, B-DATE, I-DATE, I-DATE, O, O, ..."
3,Igorche Ramtin Eshekary will need to bring the...,[GIVENNAME_1] [SURNAME_1] will need to bring t...,"[{'label': 'GIVENNAME', 'start': 0, 'end': 14,...",train,5406386,en,GB,Latn,"[Igor, ##che, Ram, ##tin, Es, ##he, ##kar, ##y...","[B-GIVENNAME, I-GIVENNAME, I-GIVENNAME, I-GIVE..."
4,Shola Kenzi Zimeri used 0111-284596398 to sche...,[GIVENNAME_1] [SURNAME_1] used [TELEPHONENUM_1...,"[{'label': 'GIVENNAME', 'start': 0, 'end': 11,...",train,5402211,en,US,Latn,"[S, ##hola, Ken, ##zi, Zi, ##meri, used, 011, ...","[B-GIVENNAME, I-GIVENNAME, I-GIVENNAME, I-GIVE..."
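
The rows above suggest that `masked_text` is `source_text` with each `privacy_mask` span replaced by a `[LABEL_n]` placeholder. The following is a minimal sketch of that relationship, not the dataset's documented masking procedure. It assumes spans do not overlap and that `n` counts occurrences per label; both are inferences from the sample rows. The helper `apply_mask` and the example record are hypothetical.

```python
from collections import defaultdict


def apply_mask(text, spans):
    """Replace each span in `text` with a [LABEL_n] placeholder.

    Assumes non-overlapping spans and per-label numbering; both are
    guesses based on the sample rows, not documented behavior.
    """
    counters = defaultdict(int)
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        counters[span["label"]] += 1
        pieces.append(text[cursor:span["start"]])  # unmasked text before the span
        pieces.append(f"[{span['label']}_{counters[span['label']]}]")
        cursor = span["end"]
    pieces.append(text[cursor:])  # trailing unmasked text
    return "".join(pieces)


# Hypothetical record shaped like the rows above (not taken from the dataset).
example_text = "To-do list for 4th August 1942: meet with Ana"
example_spans = [
    {"label": "DATE", "start": 15, "end": 30},
    {"label": "GIVENNAME", "start": 42, "end": 45},
]
masked = apply_mask(example_text, example_spans)
print(masked)
```

Comparing the output of such a helper against `masked_text` on a sample of rows would be a cheap way to confirm that the character offsets and the masked strings agree.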



## 3. Language Counts (Train Split)

We count the number of examples per language in the training split. No filtering has been applied at this point, so the counts sum to the full 464,150 training examples.

In [None]:

# language counts
train = ds['train']
pd.Series([ex["language"] for ex in train]).value_counts()

en    120533
fr     89670
de     65899
es     62586
it     55004
hi     27025
te     22152
nl     21281
Name: count, dtype: int64
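
Raw counts are easier to compare as shares of the split. The sketch below copies the counts printed above into a plain dictionary (the names `language_counts` and `shares` are illustrative) and converts them to percentages.

```python
# Language counts copied from the train-split output above.
language_counts = {
    "en": 120533, "fr": 89670, "de": 65899, "es": 62586,
    "it": 55004, "hi": 27025, "te": 22152, "nl": 21281,
}

total = sum(language_counts.values())
shares = {lang: count / total for lang, count in language_counts.items()}

# Print languages from most to least frequent with their share of the split.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {share:6.2%}")
```

In the notebook itself, calling `value_counts(normalize=True)` on the same Series should give the same shares directly.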


## 4. Label Frequency (Train Split)

Each example contains span annotations in `privacy_mask`, where each item is a
dict with at least `label`, `start`, and `end` keys (character offsets into `source_text`).

Here we flatten all labels in the training split and count their frequency to
get a sense of which PII entity types are most/least common.


In [14]:

# Label frequency (flatten privacy_mask spans)
from collections import Counter

def label_counts(split):
    c = Counter()
    for ex in split:
        for s in ex["privacy_mask"] or []:
            c[s['label']] += 1
    # Convert Counter to pandas Series, sort by frequency (descending), and return
    return pd.Series(c).sort_values(ascending=False)

# Show label counts for the train split
label_counts(train)


GIVENNAME           347442
SURNAME             134026
CITY                 76605
TELEPHONENUM         73662
TIME                 64456
DATE                 54438
EMAIL                53994
STREET               49919
BUILDINGNUM          43703
IDCARDNUM            39126
TITLE                37690
AGE                  25914
ZIPCODE              18597
PASSPORTNUM          17699
TAXNUM               12402
SEX                  11404
CREDITCARDNUMBER     10317
SOCIALNUM            10020
GENDER                9609
DRIVERLICENSENUM      9533
dtype: int64
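
The counts above are heavily skewed toward name labels. As a rough way to quantify that, the sketch below copies the printed counts into a dictionary and computes two illustrative figures: the ratio between the most and least frequent label, and the share of all spans covered by the five most frequent labels. The variable names and the choice of `top_k = 5` are assumptions made for illustration.

```python
# Label counts copied from the train-split output above.
label_frequencies = {
    "GIVENNAME": 347442, "SURNAME": 134026, "CITY": 76605,
    "TELEPHONENUM": 73662, "TIME": 64456, "DATE": 54438,
    "EMAIL": 53994, "STREET": 49919, "BUILDINGNUM": 43703,
    "IDCARDNUM": 39126, "TITLE": 37690, "AGE": 25914,
    "ZIPCODE": 18597, "PASSPORTNUM": 17699, "TAXNUM": 12402,
    "SEX": 11404, "CREDITCARDNUMBER": 10317, "SOCIALNUM": 10020,
    "GENDER": 9609, "DRIVERLICENSENUM": 9533,
}

total_spans = sum(label_frequencies.values())
most_common = max(label_frequencies, key=label_frequencies.get)
least_common = min(label_frequencies, key=label_frequencies.get)

# Ratio between the most and least frequent label.
imbalance_ratio = label_frequencies[most_common] / label_frequencies[least_common]

# Share of all spans covered by the top_k most frequent labels.
top_k = 5
top_share = sum(sorted(label_frequencies.values(), reverse=True)[:top_k]) / total_spans

print(f"{most_common} vs {least_common}: {imbalance_ratio:.1f}x")
print(f"Top {top_k} labels cover {top_share:.1%} of spans")
```

A ratio of this size suggests that per-label metrics, rather than a single aggregate score, will be more informative when evaluating models on the rarer entity types.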


## 5. Span Length Distribution (Characters)

We look at **character-length** of each annotated span (i.e., `end - start`)
over a small subset (first 5,000 examples for speed). The summary statistics
(`describe()`) give us min/median/mean/max. This is useful to:
- anticipate typical token lengths after tokenization,
- consider window sizes and truncation for model training,
- inform decisions about augmentation or heuristics.


In [None]:

# Span length distribution (characters)
lens = []
for ex in train.select(range(5000)):
    for s in ex["privacy_mask"] or []:
        # Each span is a dict with character offsets into source_text
        lens.append(s["end"] - s["start"])
pd.Series(lens).describe()


count    12576.000000
mean         9.928833
std          5.546272
min          1.000000
25%          6.000000
50%          9.000000
75%         14.000000
max         56.000000
dtype: float64

The span length statistics show that most annotated spans are relatively short, with a median of 9 and a mean of about 10 characters. The majority (75%) are 14 characters or fewer, but some reach up to 56 characters. This suggests that most PII entities are concise, but models should be prepared to handle longer spans. These insights help inform tokenization strategies, model input window sizes, and potential data augmentation approaches to ensure robust handling of both short and long spans.
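
The aggregate statistics hide differences between entity types: an email address is likely much longer than an age, for example. A possible follow-up is to group span lengths by label before summarizing. The sketch below shows one way to do that. The helper `span_lengths_by_label` and the toy records are hypothetical and only mimic the shape of `privacy_mask`.

```python
from collections import defaultdict
from statistics import median


def span_lengths_by_label(examples):
    """Group character lengths (end - start) of privacy_mask spans by label."""
    lengths = defaultdict(list)
    for ex in examples:
        # Treat a missing or None privacy_mask as "no spans".
        for span in ex.get("privacy_mask") or []:
            lengths[span["label"]].append(span["end"] - span["start"])
    return dict(lengths)


# Hypothetical records shaped like the dataset rows (not real data).
toy_examples = [
    {"privacy_mask": [{"label": "TIME", "start": 0, "end": 8},
                      {"label": "GIVENNAME", "start": 9, "end": 18}]},
    {"privacy_mask": [{"label": "GIVENNAME", "start": 0, "end": 11}]},
    {"privacy_mask": None},
]

by_label = span_lengths_by_label(toy_examples)
medians = {label: median(values) for label, values in by_label.items()}
print(medians)
```

On the real data, the same function could be applied to `train.select(range(5000))` to keep the runtime comparable to the cell above.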


## Summary & Next Steps
- The dataset contains multilingual text annotated for various PII entity types.
- Language counts, label frequencies, and span-length statistics have been tabulated and explored.
- No major data quality issues were found; any further preprocessing needed for model training will be done in `02_preprocessing.ipynb`.
 
**Next Steps:**
- Preprocess and tokenize the data for model input.
- Train and evaluate NER models.
- Analyze model performance and iterate as needed.