# PII DETECTION: Applied Transformer-based Token Classification

### Why Transformer-based Token Classification (not Logistic Regression)

The shift from **document-level** to **token-level** PII detection fundamentally changes the modeling requirements:

- **Document-level classification** only requires knowing whether a PII type exists → simple patterns (e.g., presence of "@" → email) → classic models (Logistic Regression, TF-IDF) achieve perfect F1 = 1.00.
- **Token-level NER** requires identifying **exact word boundaries** and understanding **contextual meaning** (e.g., "Paris" as name vs city) → requires sequential modeling with attention.

Transformer models with `AutoModelForTokenClassification` are specifically designed for this:
- They process text **word-by-word** while seeing the full context.
- They output one label per token (B-NAME, I-EMAIL, O, etc.).
- Pre-trained on massive text → understand grammar, capitalization, patterns.

Traditional classifiers (Logistic Regression, Random Forest, SVM) cannot be applied directly because:
- They expect fixed-size input (one vector per document).
- They cannot model word order or subword relationships.
- Even with word embeddings, they lose sequential information.

Therefore, **fine-tuned Transformers are the standard and only effective approach** for token-level PII detection in modern NLP.

In [None]:
# !pip install -q transformers datasets evaluate seqeval accelerate nltk

In [5]:
from google.colab import files
uploaded = files.upload()

Saving pii_data.csv to pii_data.csv


In [67]:
import pandas as pd
import json
import nltk
from nltk.corpus import stopwords
import re
from collections import Counter
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
import torch
import numpy as np
import evaluate
from evaluate import load
import numpy as np


nltk.download('stopwords')
stop = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [31]:
df = pd.read_csv("pii_data.csv")
df.head()


Unnamed: 0,0,1
0,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Richard', 'Chang'], 'EMAIL'..."
1,"In today's modern world, where technology has ...","{'NAME_STUDENT': [], 'EMAIL': ['tamaramorrison..."
2,Janice: A Student with a Unique Identity\n\nIn...,"{'NAME_STUDENT': ['Janice'], 'EMAIL': ['laura5..."
3,Christian is a student who goes by the usernam...,"{'NAME_STUDENT': ['Christian'], 'EMAIL': [], '..."
4,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Aaron Smith', 'Fischer', 'T..."


In [32]:
df.columns = ['Text', 'PII_types']
df.head()

Unnamed: 0,Text,PII_types
0,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Richard', 'Chang'], 'EMAIL'..."
1,"In today's modern world, where technology has ...","{'NAME_STUDENT': [], 'EMAIL': ['tamaramorrison..."
2,Janice: A Student with a Unique Identity\n\nIn...,"{'NAME_STUDENT': ['Janice'], 'EMAIL': ['laura5..."
3,Christian is a student who goes by the usernam...,"{'NAME_STUDENT': ['Christian'], 'EMAIL': [], '..."
4,"In today's modern world, where technology has ...","{'NAME_STUDENT': ['Aaron Smith', 'Fischer', 'T..."


In [39]:
def clean_text(text):
    text = str(text).lower()                    # lowercase all words
    text = re.sub(r'[^a-z\s]', '', text)        # remove punctuation, numbers, special chars
    words = [w for w in text.split() if w not in stop]  #remove stopwords
    return " ".join(words)

def clean_text_light(text):
    return str(text).lower()

df['clean_text'] = df['Text'].apply(clean_text)
df['clean_text_light'] = df['Text'].apply(clean_text_light)

First, we need to turn the PII string into real dictionary .

In [43]:
df['PII_types'] = df['PII_types'].apply(lambda x: json.loads(x.replace("'", '"')))

In [44]:
type(df['PII_types'].iloc[0])

dict

In [23]:

PII_counts = Counter([pii_type for pii_dict in df['PII_types'] for pii_type, items in pii_dict.items() if items for _ in items])
pd.DataFrame.from_dict(PII_counts, orient='index', columns=['Count']).sort_values(by='Count', ascending=False)

Unnamed: 0,Count
NAME_STUDENT,2486
PHONE_NUM,2480
EMAIL,2459
URL_PERSONAL,2454
ID_NUM,2440
USERNAME,2438
STREET_ADDRESS,2378



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.



In [45]:
df['PII_types'].iloc[0]

{'NAME_STUDENT': ['Richard', 'Chang'],
 'EMAIL': ['gwilliams@yahoo.com'],
 'USERNAME': ['brandy38'],
 'ID_NUM': ['GB41EJEY19489241157815'],
 'PHONE_NUM': ['(259)938-7784x08016'],
 'URL_PERSONAL': ['https://twitter.com/john51',
  'https://youtube.com/c/sallywalker'],
 'STREET_ADDRESS': ['711 Golden Overpass, West Andreaville, OH 44115']}

## Create BIO Tagged Text
The BIO / IOB format (short for inside, outside, beginning) is a common tagging format for tagging tokens in a chunking task in computational linguistics (ex. named-entity recognition).

| Tag        | Meaning                                      | When to use it                                                                 |
|------------|----------------------------------------------|---------------------------------------------------------------------------------|
| **B-**     | **Begin** – This token **starts** a new entity | Always the **first** token of an entity<br> Use even if the entity is only 1 token long |
| **I-**     | **Inside** – This token is **inside** the same entity | Every token **after** the B- tag that still belongs to the same entity      |
| **O**      | **Outside** – This token is **not** part of any PII entity | All tokens that are not part of any named entity                                |

In [59]:
def create_bio_from_text_and_pii(row):
    text = row['Text'] if 'Text' in row else row['clean_text']
    pii_dict = row['PII_types']

    tokens = text.split()

    labels = ["0"] * len(tokens)

    for pii_type, values in pii_dict.items():
        if not values or values == []:
            continue
        for pii_text in values:
            if not pii_text.strip():
                continue
            pii_words = pii_text.split()
            n = len(pii_words)

            # Search for this exact phrase in tokens
            for i in range(len(tokens) - n + 1):
                if tokens[i:i+n] == pii_words:
                    # Mark BIO format
                    labels[i] = f"B-{pii_type} ({tokens[i]})"

                    for j in range(1, n):
                        labels[i+j] = f"I-{pii_type}"
                    break  # stop after first match (no overlapping in this data)

    return pd.Series([tokens, labels])
df[['tokens', 'bio_labels']] = df.apply(create_bio_from_text_and_pii, axis=1)
print("Example:")
print("Text :", df['Text'].iloc[0][:200])
print("Tokens:", df['tokens'].iloc[0][:20])
print("Labels:", df['bio_labels'].iloc[0][:30])
print(len(df['bio_labels']))

Example:
Text : In today's modern world, where technology has become an integral part of our lives, it is essential for students like Richard Chang to navigate the digital landscape with ease. Richard, with his email
Tokens: ['In', "today's", 'modern', 'world,', 'where', 'technology', 'has', 'become', 'an', 'integral', 'part', 'of', 'our', 'lives,', 'it', 'is', 'essential', 'for', 'students', 'like']
Labels: ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', 'B-NAME_STUDENT (Richard)', 'B-NAME_STUDENT (Chang)', '0', '0', '0', '0', '0', '0', '0', '0']
2000


In [47]:
# Test on any new sentence
test_sentence = "Hi, my name is Minh and my email is minh123@gmail.com and phone 0908-123-456"

test_tokens = test_sentence.split()
test_labels = ["O"] * len(test_tokens)

# Use the same logic to mark
fake_pii = {
    "NAME_STUDENT": ["Minh"],
    "EMAIL": ["minh123@gmail.com"],
    "PHONE_NUM": ["0908-123-456"]
}

for pii_type, values in fake_pii.items():
    for val in values:
        words = val.split()
        for i in range(len(test_tokens) - len(words) + 1):
            if test_tokens[i:i+len(words)] == words:
                test_labels[i] = f"B-{pii_type}"
                for j in range(1, len(words)):
                    test_labels[i+j] = f"I-{pii_type}"

# Show result
for token, label in zip(test_tokens, test_labels):
    print(token, "→", label)

Hi, → O
my → O
name → O
is → O
Minh → B-NAME_STUDENT
and → O
my → O
email → O
is → O
minh123@gmail.com → B-EMAIL
and → O
phone → O
0908-123-456 → B-PHONE_NUM


## Prepare Dataset
Currently PII_Types are formed text (EMAIL, PHONE_NUM, etc), I need to collect all PII types to define the label set.

In [48]:
from collections import defaultdict

all_labels = set()
for labels in df['bio_labels']:
    all_labels.update([label.split(' (')[0] for label in labels if label != "O"])

# Add "O" and sort for consistency
unique_labels = ["O"] + sorted(all_labels)
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}

print("Label mapping:", label2id)

Label mapping: {'O': 0, '0': 1, 'B-EMAIL': 2, 'B-ID_NUM': 3, 'B-NAME_STUDENT': 4, 'B-PHONE_NUM': 5, 'B-STREET_ADDRESS': 6, 'B-URL_PERSONAL': 7, 'B-USERNAME': 8, 'I-ID_NUM': 9, 'I-NAME_STUDENT': 10, 'I-STREET_ADDRESS': 11}


Convert your DataFrame to a Hugging Face Dataset. Ensure labels are numeric IDs:

In [49]:
import pandas as pd
from datasets import Dataset

def prepare_example(row):
    return {
        'tokens': row['tokens'],
        'ner_tags': [label2id.get(label.split(' (')[0], 0) for label in row['bio_labels']]  # Map to IDs, default to O
    }

prepared_df = df.apply(prepare_example, axis=1, result_type='expand')
dataset = Dataset.from_pandas(prepared_df)

# Split into train/test (e.g., 80/20)
dataset = dataset.train_test_split(test_size=0.2)

## Load a Pre-trained Model and Tokenizer
Using a model "distilbert-base-uncased"

In [50]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "distilbert/distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
num_labels = len(label2id)
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint, num_labels=num_labels, id2label=id2label, label2id=label2id
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Tokenize and Align Labels
Transformers use subword tokenization (e.g., "minh123@gmail.com" might split into "minh", "##123", "@", "gmail", ".", "com"), so you need to align your word-level labels to subwords. Ignore labels for special tokens like [CLS]/[SEP], and for subwords after the first, use the continuation label (e.g., if word is "I-EMAIL", subwords get -100 to ignore in loss, or continue as "I-EMAIL" depending on strategy—Hugging Face handles this).
Define a tokenization function:

In [66]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        max_length=512,
        # ← DO NOT use padding=True here!
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # This is magic
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            if word_idx is None:                # [CLS], [SEP], padding tokens
                label_ids.append(-100)
            elif word_idx != previous_word_idx: # First token of a word
                label_ids.append(label[word_idx])
            else:                               # Subword tokens
                label_ids.append(-100)          # or label[word_idx] if you want I- on subwords
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

## Set up training
Use the Trainer API for simplicity. It handles optimization, evaluation, etc

In [57]:
metric = load("seqeval")

# 1. Use the official data collator
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[p] for (p, l) in zip(pred, label) if l != -100]
        for pred, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# 3. Apply the function and remove original columns
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names
)
training_args = TrainingArguments(
    output_dir="./pii_model",
    num_train_epochs=15,
    per_device_train_batch_size=32,    # using GPU can set up bigger size
    per_device_eval_batch_size=32,
    learning_rate=5e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    fp16=True,
    dataloader_num_workers=2,
    report_to="none",
)

# 4. Now create trainer with the collator
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,        # This will pad labels to same length with -100
    compute_metrics=compute_metrics,
)

trainer.train()

Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0005,0.005887,0.818999,0.8501,0.83426,0.998597
2,0.0006,0.006885,0.832645,0.805463,0.818828,0.998445
3,0.001,0.007412,0.842672,0.781479,0.810923,0.998459
4,0.0007,0.008963,0.833102,0.801466,0.816978,0.99839
5,0.0007,0.006566,0.809907,0.871419,0.839538,0.998597
6,0.0006,0.0067,0.831959,0.874084,0.852502,0.998749
7,0.0002,0.006721,0.856174,0.840773,0.848403,0.998728
8,0.0002,0.00727,0.848993,0.842771,0.845871,0.99868
9,0.0002,0.005705,0.846104,0.875416,0.860511,0.998853
10,0.0002,0.00582,0.847798,0.872085,0.85977,0.998825




TrainOutput(global_step=750, training_loss=0.0003799459654837847, metrics={'train_runtime': 479.2955, 'train_samples_per_second': 50.073, 'train_steps_per_second': 1.565, 'total_flos': 3136241369088000.0, 'train_loss': 0.0003799459654837847, 'epoch': 15.0})

After training, use the model to predict on new sentences:

In [None]:
from transformers import pipeline

ner_pipeline = pipeline("ner", model=trainer.model, tokenizer=tokenizer, aggregation_strategy="simple")

test_sentence = "Hi, my name is Minh and my email is minh123@gmail.com and phone 0908-123-456"
results = ner_pipeline(test_sentence)

for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")

Device set to use cuda:0


Entity: hi, my name is minh and my email is, Label: 0, Score: 0.96
Entity: minh, Label: EMAIL, Score: 0.88
Entity: ##123 @, Label: 0, Score: 0.56
Entity: gma, Label: EMAIL, Score: 0.94
Entity: ##il. com and phone 0908 - 123 - 456, Label: 0, Score: 0.93


In [68]:
trainer.save_model("/content/drive/MyDrive/pii_model_final")
tokenizer.save_pretrained("/content/drive/MyDrive/pii_model_final")
print("Saved to Google Drive!")

Saved to Google Drive!


In [69]:
!zip -r pii_model_final.zip /content/drive/MyDrive/pii_model_final

  adding: content/drive/MyDrive/pii_model_final/ (stored 0%)
  adding: content/drive/MyDrive/pii_model_final/config.json (deflated 54%)
  adding: content/drive/MyDrive/pii_model_final/model.safetensors (deflated 8%)
  adding: content/drive/MyDrive/pii_model_final/tokenizer_config.json (deflated 75%)
  adding: content/drive/MyDrive/pii_model_final/special_tokens_map.json (deflated 42%)
  adding: content/drive/MyDrive/pii_model_final/vocab.txt (deflated 53%)
  adding: content/drive/MyDrive/pii_model_final/tokenizer.json (deflated 71%)
  adding: content/drive/MyDrive/pii_model_final/training_args.bin (deflated 53%)


In [70]:
from google.colab.files import download
download('pii_model_final.zip')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>