# PII DETECTION: Applied Transformer-based Token Classification

### What is NER?
NER = Named Entity Recognition.
It is the NLP task of finding and classifying “entities” (important spans of text) in a sentence.

In my project, I'm doing PII-focused NER (PII = Personally Identifiable Information), which is a special kind of NER with many more entity types, for example: B-NAME_STUDENT, B-EMAIL, ...

Therefore, my model learns to tag every token in a sentence with one of these labels (BIO scheme).

### What is the BERT Transformer model?
BERT = Bidirectional Encoder Representations from Transformers.
It is a Transformer model published by Google in 2018 that had a major impact on NLP.

**Pre-training**

I used checkpoints from Hugging Face, so my model starts from the general language understanding learned during pre-training.

I tested several pre-trained checkpoints:

```
bert-base-uncased, distilbert/distilbert-base-uncased, dslim/bert-base-NER
```
**Fine-tuning**

I do the fine-tuning step on my PII dataset using AutoModelForTokenClassification + Trainer.

Fine-tuning teaches the model to detect PII entities (NAME_STUDENT, EMAIL, ...). I take a pre-trained checkpoint and train it for 15 epochs.

### Result

Based on the validation metrics from my 15-epoch fine-tuning runs, I observe:
- With model_checkpoint = "dslim/bert-base-NER", F1 grows moderately but plateaus around 0.83.
- With model_checkpoint = "bert-base-uncased", the model reaches a higher peak F1 (~0.91) and accuracy (~0.999).

All three of my models show the same issue: high validation F1 (0.80–0.91) and accuracy (0.997–0.999), but low performance at test time. This could be a sign of overfitting in NER/PII tasks.
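One thing worth keeping in mind when reading these numbers: token-level accuracy is dominated by the `O` class, so it can look excellent even when no entity is found. The sketch below is a toy illustration in plain Python, not the seqeval implementation; the helper names (`extract_entities`, `token_accuracy`, `entity_f1`) and the 1000-token document are made up for the example, and orphan `I-` tags are simply ignored.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def token_accuracy(gold, pred):
    """Fraction of tokens whose tag matches exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def entity_f1(gold, pred):
    """Entity-level F1: a prediction counts only if type and span both match."""
    gold_ents, pred_ents = extract_entities(gold), extract_entities(pred)
    true_positives = len(gold_ents & pred_ents)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_ents)
    recall = true_positives / len(gold_ents)
    return 2 * precision * recall / (precision + recall)


# Toy document: 1000 tokens, only 3 of them belong to PII entities.
gold = ["O"] * 1000
gold[10] = "B-NAME_STUDENT"
gold[11] = "I-NAME_STUDENT"
gold[500] = "B-EMAIL"

all_o = ["O"] * 1000  # a "model" that never predicts any PII

print(token_accuracy(gold, all_o))  # 0.997
print(entity_f1(gold, all_o))       # 0.0
```

For this reason F1 (and per-entity-type recall) is probably the more informative number to track here; accuracy alone says little about PII detection.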

### What should I do next?

The last time I talked with my teacher Iman, he told me not to focus on the results for now, but to go back over the NER pipeline, pre-training, and fine-tuning. However, I don't know where the problem is.


In [18]:
!pip install -q transformers datasets evaluate seqeval accelerate nltk

In [2]:
from google.colab import files
uploaded = files.upload()

Saving pii_data.csv to pii_data.csv


In [3]:
import json
import re
from collections import Counter

import nltk
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from evaluate import load
from nltk.corpus import stopwords
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)


nltk.download('stopwords')
stop = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
df = pd.read_csv("pii_data.csv")
df.head()


Unnamed: 0,0,1
0,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Richard', 'Chang'], 'EMAIL'..."
1,"In today's modern world, where technology has ...","{'NAME_STUDENT': [], 'EMAIL': ['tamaramorrison..."
2,Janice: A Student with a Unique Identity\n\nIn...,"{'NAME_STUDENT': ['Janice'], 'EMAIL': ['laura5..."
3,Christian is a student who goes by the usernam...,"{'NAME_STUDENT': ['Christian'], 'EMAIL': [], '..."
4,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Aaron Smith', 'Fischer', 'T..."


In [5]:
df.columns = ['Text', 'PII_types']
df.head()

Unnamed: 0,Text,PII_types
0,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Richard', 'Chang'], 'EMAIL'..."
1,"In today's modern world, where technology has ...","{'NAME_STUDENT': [], 'EMAIL': ['tamaramorrison..."
2,Janice: A Student with a Unique Identity\n\nIn...,"{'NAME_STUDENT': ['Janice'], 'EMAIL': ['laura5..."
3,Christian is a student who goes by the usernam...,"{'NAME_STUDENT': ['Christian'], 'EMAIL': [], '..."
4,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Aaron Smith', 'Fischer', 'T..."


In [6]:
# def clean_text(text):
#     text = str(text).lower()                    # lowercase all words
#     text = re.sub(r'[^a-z\s]', '', text)        # remove punctuation, numbers, special chars
#     words = [w for w in text.split() if w not in stop]  #remove stopwords
#     return " ".join(words)

# def clean_text_light(text):
#     return str(text).lower()

# df['clean_text'] = df['Text'].apply(clean_text)
# df['clean_text_light'] = df['Text'].apply(clean_text_light)

I removed the text-cleaning step because it caused a serious problem. PII recognition depends heavily on casing: when everything is lowercased, I remove signals the model relies on, such as capitalized names. The regex in `clean_text` also strips punctuation and digits, which would destroy emails, phone numbers, and ID numbers. This is a known NER problem; as far as I know, lowercasing decreases token-level F1 by about 6-12% on average.

First, we need to turn the PII string into a real dictionary.

In [7]:
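import ast

# The column stores Python dict literals. literal_eval parses them directly and,
# unlike swapping ' for " before json.loads, does not break on apostrophes inside values.
df['PII_types'] = df['PII_types'].apply(ast.literal_eval)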

In [8]:
type(df['PII_types'].iloc[0])

dict

In [9]:

PII_counts = Counter([pii_type for pii_dict in df['PII_types'] for pii_type, items in pii_dict.items() if items for _ in items])
pd.DataFrame.from_dict(PII_counts, orient='index', columns=['Count']).sort_values(by='Count', ascending=False)

Unnamed: 0,Count
NAME_STUDENT,2486
PHONE_NUM,2480
EMAIL,2459
URL_PERSONAL,2454
ID_NUM,2440
USERNAME,2438
STREET_ADDRESS,2378


In [10]:
df['PII_types'].iloc[0]

{'NAME_STUDENT': ['Richard', 'Chang'],
 'EMAIL': ['gwilliams@yahoo.com'],
 'USERNAME': ['brandy38'],
 'ID_NUM': ['GB41EJEY19489241157815'],
 'PHONE_NUM': ['(259)938-7784x08016'],
 'URL_PERSONAL': ['https://twitter.com/john51',
  'https://youtube.com/c/sallywalker'],
 'STREET_ADDRESS': ['711 Golden Overpass, West Andreaville, OH 44115']}

## Create BIO Tagged Text
The BIO / IOB format (short for inside, outside, beginning) is a common format for tagging tokens in chunking tasks in computational linguistics (e.g. named-entity recognition).

| Tag        | Meaning                                      | When to use it                                                                 |
|------------|----------------------------------------------|---------------------------------------------------------------------------------|
| **B-**     | **Begin** – This token **starts** a new entity | Always the **first** token of an entity<br> Use even if the entity is only 1 token long |
| **I-**     | **Inside** – This token is **inside** the same entity | Every token **after** the B- tag that still belongs to the same entity      |
| **O**      | **Outside** – This token is **not** part of any PII entity | All tokens that are not part of any named entity                                |
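
The table can be illustrated with a small sketch that turns entity spans into BIO tags. The function name `spans_to_bio`, the token list, and the email address are made up for illustration; spans are given as token indices with an exclusive end.

```python
def spans_to_bio(num_tokens, spans):
    """Build BIO tags from (start, end, entity_type) spans (end is exclusive)."""
    tags = ["O"] * num_tokens
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"        # remaining tokens of the same entity
    return tags


tokens = ["Contact", "Aaron", "Smith", "at", "aaron@example.com"]
spans = [(1, 3, "NAME_STUDENT"), (4, 5, "EMAIL")]

for token, tag in zip(tokens, spans_to_bio(len(tokens), spans)):
    print(f"{token:20s} {tag}")
```

Note that the single-token email still gets a `B-` tag, and the second token of the name gets `I-`.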

In [11]:
def create_bio_from_text_and_pii(row):
    text = row["Text"]
    pii_dict = row["PII_types"]

    tokens = text.split()
    labels = ["O"] * len(tokens)

    for pii_type, values in pii_dict.items():
        if not values:
            continue
        for pii_text in values:
            if not pii_text.strip():
                continue

            pii_words = pii_text.split()
            n = len(pii_words)

            for i in range(len(tokens) - n + 1):
                if tokens[i:i+n] == pii_words:
                    labels[i] = f"B-{pii_type}"
                    for j in range(1, n):
                        labels[i+j] = f"I-{pii_type}"
                    break

    return pd.Series([tokens, labels])

# I convert the doc-level PII dict into token-level BIO labels
df[['tokens', 'bio_labels']] = df.apply(create_bio_from_text_and_pii, axis=1)
print("Example:")
print("Text :", df['Text'].iloc[0][:200])
print("Tokens:", df['tokens'].iloc[0][:20])
print("Labels:", df['bio_labels'].iloc[0][:30])
print(len(df['bio_labels']))

Example:
Text : In today's modern world, where technology has become an integral part of our lives, it is essential for students like Richard Chang to navigate the digital landscape with ease. Richard, with his email
Tokens: ['In', "today's", 'modern', 'world,', 'where', 'technology', 'has', 'become', 'an', 'integral', 'part', 'of', 'our', 'lives,', 'it', 'is', 'essential', 'for', 'students', 'like']
Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NAME_STUDENT', 'B-NAME_STUDENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
2000
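
One thing to check in the cell above: the loop compares whole whitespace tokens and stops (`break`) after the first match. A token such as `Richard,` (with a trailing comma) never equals `Richard`, and later mentions of the same value stay `O`, so some real PII may be labeled as non-PII in the training data. Below is a possible alternative that matches on character offsets instead. It is only a sketch under my own assumptions: the function name `label_tokens_by_offsets` is hypothetical, longer values are matched first, and a token that overlaps a match is labeled as a whole (including attached punctuation).

```python
import re


def label_tokens_by_offsets(text, pii_dict):
    """Label whitespace tokens with BIO tags using character offsets."""
    # Whitespace tokens together with their character spans in the text.
    token_spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    labels = ["O"] * len(token_spans)

    # Longer values first, so "Aaron Smith" is not pre-empted by a separate "Aaron".
    candidates = sorted(
        ((value, pii_type)
         for pii_type, values in pii_dict.items()
         for value in values if value.strip()),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )

    for value, pii_type in candidates:
        # Match the value only when it is not glued to other word characters.
        pattern = r"(?<!\w)" + re.escape(value) + r"(?!\w)"
        for match in re.finditer(pattern, text):  # every occurrence, not just the first
            covered = [
                i for i, (_, start, end) in enumerate(token_spans)
                if start < match.end() and end > match.start()
            ]
            if any(labels[i] != "O" for i in covered):
                continue  # already claimed by a longer value
            for k, i in enumerate(covered):
                labels[i] = f"{'B' if k == 0 else 'I'}-{pii_type}"

    tokens = [token for token, _, _ in token_spans]
    return tokens, labels


text = ("Students like Richard Chang use email. "
        "Richard, with his email gwilliams@yahoo.com, writes often.")
pii = {"NAME_STUDENT": ["Richard", "Chang"], "EMAIL": ["gwilliams@yahoo.com"]}

for token, label in zip(*label_tokens_by_offsets(text, pii)):
    print(f"{token:22s} {label}")
```

Comparing the number of non-`O` labels produced by both approaches on a few rows would show whether the original matching is actually missing mentions in this dataset.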


In [12]:
# Test on new sentence
test_sentence = "Hi, my name is Minh and my email is minh123@gmail.com and phone 0908-123-456"

test_tokens = test_sentence.split()
test_labels = ["O"] * len(test_tokens)

# Use the same logic to mark
fake_pii = {
    "NAME_STUDENT": ["Minh"],
    "EMAIL": ["minh123@gmail.com"],
    "PHONE_NUM": ["0908-123-456"]
}

for pii_type, values in fake_pii.items():
    for val in values:
        words = val.split()
        for i in range(len(test_tokens) - len(words) + 1):
            if test_tokens[i:i+len(words)] == words:
                test_labels[i] = f"B-{pii_type}"
                for j in range(1, len(words)):
                    test_labels[i+j] = f"I-{pii_type}"

# Show result
for token, label in zip(test_tokens, test_labels):
    print(token, "→", label)

Hi, → O
my → O
name → O
is → O
Minh → B-NAME_STUDENT
and → O
my → O
email → O
is → O
minh123@gmail.com → B-EMAIL
and → O
phone → O
0908-123-456 → B-PHONE_NUM


## Prepare Dataset
Currently the PII types are plain strings (EMAIL, PHONE_NUM, etc.), so I need to collect all the BIO labels that occur in the data to define the label set.

In [13]:
# Collect every non-"O" label that actually occurs in the data
all_labels = set()
for labels in df["bio_labels"]:
    for label in labels:
        if label != "O":
            all_labels.add(label)

unique_labels = ["O"] + sorted(all_labels)
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}

print("Total labels:", len(unique_labels))
print(label2id)
print(all_labels)


Total labels: 11
{'O': 0, 'B-EMAIL': 1, 'B-ID_NUM': 2, 'B-NAME_STUDENT': 3, 'B-PHONE_NUM': 4, 'B-STREET_ADDRESS': 5, 'B-URL_PERSONAL': 6, 'B-USERNAME': 7, 'I-ID_NUM': 8, 'I-NAME_STUDENT': 9, 'I-STREET_ADDRESS': 10}
{'B-ID_NUM', 'B-PHONE_NUM', 'B-URL_PERSONAL', 'B-NAME_STUDENT', 'B-STREET_ADDRESS', 'I-STREET_ADDRESS', 'I-ID_NUM', 'B-EMAIL', 'I-NAME_STUDENT', 'B-USERNAME'}
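
Because the label set above is built from the labels that happen to occur, it contains only 11 entries: there is no `I-EMAIL`, `I-PHONE_NUM`, `I-URL_PERSONAL`, or `I-USERNAME`. That is consistent with whitespace tokens, but it means the model can never predict those tags (for example for a phone number written with spaces). A possible alternative, sketched here with hypothetical names (`build_label_maps`, `full_label2id`, `full_id2label`), is to build the label set from the entity types themselves:

```python
entity_types = [
    "EMAIL", "ID_NUM", "NAME_STUDENT", "PHONE_NUM",
    "STREET_ADDRESS", "URL_PERSONAL", "USERNAME",
]


def build_label_maps(entity_types):
    """Create label<->id maps with both B- and I- tags for every entity type."""
    labels = ["O"]
    for entity_type in sorted(entity_types):
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


full_label2id, full_id2label = build_label_maps(entity_types)
print(len(full_label2id))  # 15 = "O" + 7 types * 2 tags
```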


Convert the DataFrame to a Hugging Face Dataset, with the labels mapped to numeric IDs:

In [15]:
# Map each BIO label string to its numeric ID
df["ner_tags"] = df["bio_labels"].apply(lambda labels: [label2id[label] for label in labels])

# Keep only the columns the model pipeline needs
dataset = Dataset.from_pandas(df[['tokens', 'bio_labels', 'ner_tags']])

# Split into train/test
dataset = dataset.train_test_split(test_size=0.2, seed=42)

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['tokens', 'bio_labels', 'ner_tags'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['tokens', 'bio_labels', 'ner_tags'],
        num_rows: 400
    })
})
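
The split above produces only two parts, and later the `test` part is passed to the Trainer as `eval_dataset` and used to pick the best checkpoint. So the F1 reported during training is really a validation score, and there is no untouched test set from the same data. A minimal sketch of a three-way split on row indices follows; the function name `three_way_split` and the 80/10/10 fractions are my own assumptions.

```python
import random


def three_way_split(n_rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle row indices and cut them into train / validation / test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(2000)
print(len(train_idx), len(val_idx), len(test_idx))  # 1600 200 200
```

The index lists could then be applied to the unsplit dataset with `Dataset.select`, using the validation part for `eval_dataset` and keeping the test part for a single final evaluation. Also, several rows in `df.head()` start with the same sentence ("In today's modern world, ..."), so it may be worth checking whether near-duplicate essays end up on both sides of the split.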


In [16]:
df['bio_labels'].sample(15)

Unnamed: 0,bio_labels
878,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1606,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1583,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1994,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1353,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
759,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1345,"[O, O, O, O, O, O, O, O, O, O, O, O, B-NAME_ST..."
1112,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1453,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1253,"[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


## Load a Pre-trained Model and Tokenizer
This run uses the "distilbert/distilbert-base-uncased" checkpoint; the other two checkpoints are commented out below.

In [17]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "distilbert/distilbert-base-uncased"
# model_checkpoint = "dslim/bert-base-NER"
# model_checkpoint = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
num_labels = len(label2id)
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint, num_labels=num_labels, id2label=id2label, label2id=label2id
)


## Tokenize and Align Labels
Transformers use subword tokenization (e.g., "minh123@gmail.com" might split into "minh", "##123", "@", "gmail", ".", "com"), so the word-level labels must be aligned to subwords. Special tokens such as [CLS]/[SEP] get the label -100, which the loss function ignores. When a word is split into several subwords, the first subword keeps the word's label; the remaining subwords either get -100 (the strategy used here) or repeat the word's label. The tokenizer's `word_ids()` provides the subword-to-word mapping, and the alignment itself is done in the function below.
Define a tokenization function:

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        max_length=512,
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            if word_idx is None:                # [CLS], [SEP], padding tokens
                label_ids.append(-100)
            elif word_idx != previous_word_idx: # First token of a word
                label_ids.append(label[word_idx])
            else:                               # Subword tokens
                label_ids.append(-100)          # or label[word_idx]
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
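
To make the alignment logic easier to follow without loading a tokenizer, here is the same rule applied to a hand-written `word_ids` list. This is a sketch: the helper `align_labels` and its `label_all_subwords` flag are hypothetical, and the `word_ids` values are invented to mimic a 3-word input whose second word splits into three subwords.

```python
def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Map word-level label IDs onto subword positions."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:              # special tokens such as [CLS] / [SEP]
            aligned.append(-100)
        elif word_id != previous:        # first subword of a word keeps the label
            aligned.append(word_labels[word_id])
        else:                            # later subwords: ignore or repeat the label
            aligned.append(word_labels[word_id] if label_all_subwords else -100)
        previous = word_id
    return aligned


word_ids = [None, 0, 1, 1, 1, 2, None]   # [CLS], word 0, word 1 (3 subwords), word 2, [SEP]
word_labels = [0, 1, 0]                  # O, B-EMAIL, O with the label2id mapping above

print(align_labels(word_ids, word_labels))                           # [-100, 0, 1, -100, -100, 0, -100]
print(align_labels(word_ids, word_labels, label_all_subwords=True))  # [-100, 0, 1, 1, 1, 0, -100]
```

If labels are repeated on every subword, a `B-` tag ends up on several consecutive subwords; a common variant converts the repeated tag to the matching `I-` tag instead, which requires that tag to exist in the label set.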

## Set up training
Use the Trainer API for simplicity. It handles optimization, evaluation, checkpointing, etc.

In [None]:
metric = load("seqeval")

# 1. Use the official data collator
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# 2. Entity-level metrics via seqeval, skipping positions labeled -100
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[pred_id] for (pred_id, l) in zip(pred, label) if l != -100]
        for pred, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# 3. Apply the function and remove original columns
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names
)
training_args = TrainingArguments(
    output_dir="./pii_model",
    num_train_epochs=15,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=5e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    fp16=True,
    dataloader_num_workers=2,
    report_to="none",
    dataloader_pin_memory = False
)

# 4. Now create trainer with the collator
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

In [None]:
tokenized_dataset["train"]

In [None]:
tokenized_dataset["test"]

After training, use the model to predict on new sentences:

In [None]:
from transformers import pipeline

ner_pipeline = pipeline("ner", model=trainer.model, tokenizer=tokenizer, aggregation_strategy="simple")

test_sentence = "Hi, my name is Minh and my email is minh123@gmail.com and phone 0908-123-456"
results = ner_pipeline(test_sentence)

for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")

In [None]:
from transformers import pipeline

# Load NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=trainer.model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

test_sentences = [
    "My name is Nguyen Minh.",
    "You can contact me at thomas.lee@example.com.",
    "Paris is beautiful in the summer, but Paris Nguyen is my classmate.",
    "My phone number is +84 908 123 456.",
    "I was born on 25/02/2005.",
    "I live at 22 Nguyen Thi Minh Khai, District 1, Ho Chi Minh City.",
    "Call me at 091-822-4459 tomorrow.",
    "I'm Nguyen Thi Minh, born on 25/02/2005, living at 12 Tran Hung Dao, Hanoi.",
]

for text in test_sentences:
    print(f"\nTEXT: {text}")
    results = ner_pipeline(text)
    if not results:
        print("No PII detected.")
    else:
        for entity in results:
            print(f"  Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")


In [None]:
# trainer.save_model("/content/drive/MyDrive/pii_model_final")
# tokenizer.save_pretrained("/content/drive/MyDrive/pii_model_final")
# print("Saved to Google Drive!")

In [None]:
# !zip -r pii_model_final.zip /content/drive/MyDrive/pii_model_final

In [None]:
# from google.colab.files import download
# download('pii_model_final.zip')