# PII DETECTION: Applied Transformer-based Token Classification

### What is NER?
NER stands for Named Entity Recognition.
It is the NLP task of finding and classifying “entities” (important spans of text) in a sentence.

In this project I do PII-focused NER (PII = Personally Identifiable Information), a special kind of NER whose entity types are kinds of personal data, for example NAME_STUDENT, EMAIL, and PHONE_NUM.

The model therefore learns to tag every token in a sentence with one label from the BIO scheme, such as B-NAME_STUDENT or B-EMAIL.

### What is the BERT Transformer model?
BERT stands for Bidirectional Encoder Representations from Transformers.
It is a Transformer encoder published by Google in 2018 that reads context in both directions and became a standard starting point for many NLP tasks.

**Pre-training**

I load pre-trained checkpoints from Hugging Face, so the model already has general language understanding before it sees my data.

I tested several pre-trained checkpoints:

```
bert-base-uncased, distilbert/distilbert-base-uncased, dslim/bert-base-NER
```
**Fine-tuning**

I fine-tune on my PII dataset using AutoModelForTokenClassification + Trainer.

Fine-tuning teaches the model to detect PII entities (NAME_STUDENT, EMAIL, ...). I take a pre-trained checkpoint and train it for 15 epochs.

### Result

Based on the validation metrics from my 15-epoch fine-tuning runs, I observe:
- With model_checkpoint = "dslim/bert-base-NER", F1 grows moderately but plateaus around 0.83.
- With model_checkpoint = "bert-base-uncased", the model reaches a higher peak F1 (~0.91) and accuracy (~0.999).

All three of my models show the same issue: high validation F1 (0.80–0.91) and accuracy (0.997–0.999), but poor performance on new test sentences. This could be a sign of overfitting.
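
A side note on the accuracy figures: most tokens in these essays are `O`, so token accuracy can be very high even when PII is missed. The toy sketch below uses made-up labels (the 95/5 split is an assumption, not a measurement from this dataset) to show a "model" that never predicts PII and still scores 0.95 accuracy.

```python
# Toy data (hypothetical): 95 non-PII tokens followed by 5 PII tokens.
gold = ["O"] * 95 + [
    "B-EMAIL", "B-NAME_STUDENT", "I-NAME_STUDENT", "B-PHONE_NUM", "B-USERNAME",
]
# A useless baseline that labels every token as non-PII.
pred = ["O"] * len(gold)

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
pii_total = sum(g != "O" for g in gold)
pii_found = sum(g != "O" and g == p for g, p in zip(gold, pred))
pii_recall = pii_found / pii_total

print(f"token accuracy:   {accuracy:.2f}")    # 0.95
print(f"PII token recall: {pii_recall:.2f}")  # 0.00
```

For this reason the entity-level F1 is probably the more informative number to watch.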

### What should I do next?

The last time I talked with my teacher Iman, he told me not to focus on the results for now and instead to go back over the NER pipeline (pre-training and fine-tuning). However, I still don't know where the problem is.


In [1]:
!pip install -q transformers datasets evaluate seqeval accelerate nltk



In [2]:
from google.colab import files
uploaded = files.upload()

Saving pii_data.csv to pii_data.csv


In [111]:
import pandas as pd
import json
import nltk
from nltk.corpus import stopwords
import re
from collections import Counter
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
)
import torch
import numpy as np
import evaluate
from evaluate import load


nltk.download('stopwords')
stop = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
df = pd.read_csv("pii_data.csv")
df.head()


|   | 0 | 1 |
|---|---|---|
| 0 | In today's modern world, where technology has ... | {'NAME_STUDENT': ['Richard', 'Chang'], 'EMAIL'... |
| 1 | In today's modern world, where technology has ... | {'NAME_STUDENT': [], 'EMAIL': ['tamaramorrison... |
| 2 | Janice: A Student with a Unique Identity\n\nIn... | {'NAME_STUDENT': ['Janice'], 'EMAIL': ['laura5... |
| 3 | Christian is a student who goes by the usernam... | {'NAME_STUDENT': ['Christian'], 'EMAIL': [], '... |
| 4 | In today's modern world, where technology has ... | {'NAME_STUDENT': ['Aaron Smith', 'Fischer', 'T... |


In [5]:
df.columns = ['Text', 'PII_types']
df.head()

|   | Text | PII_types |
|---|------|-----------|
| 0 | In today's modern world, where technology has ... | {'NAME_STUDENT': ['Richard', 'Chang'], 'EMAIL'... |
| 1 | In today's modern world, where technology has ... | {'NAME_STUDENT': [], 'EMAIL': ['tamaramorrison... |
| 2 | Janice: A Student with a Unique Identity\n\nIn... | {'NAME_STUDENT': ['Janice'], 'EMAIL': ['laura5... |
| 3 | Christian is a student who goes by the usernam... | {'NAME_STUDENT': ['Christian'], 'EMAIL': [], '... |
| 4 | In today's modern world, where technology has ... | {'NAME_STUDENT': ['Aaron Smith', 'Fischer', 'T... |


In [6]:
# def clean_text(text):
#     text = str(text).lower()                    # lowercase all words
#     text = re.sub(r'[^a-z\s]', '', text)        # remove punctuation, numbers, special chars
#     words = [w for w in text.split() if w not in stop]  #remove stopwords
#     return " ".join(words)

# def clean_text_light(text):
#     return str(text).lower()

# df['clean_text'] = df['Text'].apply(clean_text)
# df['clean_text_light'] = df['Text'].apply(clean_text_light)

I removed the text-cleaning step because it caused a serious problem. PII recognition depends heavily on casing, digits, and punctuation, so lowercasing everything and stripping non-letter characters removes most of the signals the model relies on. This is a known NER problem: lowercasing reportedly decreases token-level F1 by 6-12% on average.
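
To see concretely what the aggressive cleaning does to PII, the sketch below re-runs the logic of the commented-out `clean_text` on one sentence. The tiny `STOP` set is a stand-in for the NLTK stop-word list (an assumption to keep the example dependency-free), and the sentence is made up.

```python
import re

# Hypothetical, tiny stop-word list standing in for nltk's English list.
STOP = {"my", "is", "and"}

def clean_text(text):
    text = str(text).lower()                # lowercase all words
    text = re.sub(r"[^a-z\s]", "", text)    # drop punctuation, digits, symbols
    return " ".join(w for w in text.split() if w not in STOP)

raw = "My name is Minh and my email is minh123@gmail.com"
cleaned = clean_text(raw)

print(cleaned)                           # name minh email minhgmailcom
print("minh123@gmail.com" in cleaned)    # False
```

After cleaning, the email string from the PII dictionary no longer occurs in the text, so it cannot be labeled at all.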

First, we need to turn the PII string into a real dictionary.

In [7]:
import ast
df['PII_types'] = df['PII_types'].apply(ast.literal_eval)  # parses Python-literal dict strings, apostrophes included

In [8]:
type(df['PII_types'].iloc[0])

dict
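
A note on parsing these strings: replacing single quotes and calling `json.loads` breaks as soon as a value contains an apostrophe, whereas `ast.literal_eval` parses the Python-literal form directly. A minimal sketch on a hypothetical cell value (not taken from the dataset):

```python
import ast
import json

# Hypothetical cell value: Python's repr switches to double quotes
# when a string itself contains an apostrophe.
raw = "{'NAME_STUDENT': [\"Sarah O'Connor\"], 'EMAIL': []}"

try:
    json.loads(raw.replace("'", '"'))
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

parsed = ast.literal_eval(raw)

print(json_ok)                 # False
print(parsed["NAME_STUDENT"])  # ["Sarah O'Connor"]
```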

In [9]:

PII_counts = Counter([pii_type for pii_dict in df['PII_types'] for pii_type, items in pii_dict.items() if items for _ in items])
pd.DataFrame.from_dict(PII_counts, orient='index', columns=['Count']).sort_values(by='Count', ascending=False)

| PII type | Count |
|----------|-------|
| NAME_STUDENT | 2486 |
| PHONE_NUM | 2480 |
| EMAIL | 2459 |
| URL_PERSONAL | 2454 |
| ID_NUM | 2440 |
| USERNAME | 2438 |
| STREET_ADDRESS | 2378 |


In [10]:
df['PII_types'].iloc[0]

{'NAME_STUDENT': ['Richard', 'Chang'],
 'EMAIL': ['gwilliams@yahoo.com'],
 'USERNAME': ['brandy38'],
 'ID_NUM': ['GB41EJEY19489241157815'],
 'PHONE_NUM': ['(259)938-7784x08016'],
 'URL_PERSONAL': ['https://twitter.com/john51',
  'https://youtube.com/c/sallywalker'],
 'STREET_ADDRESS': ['711 Golden Overpass, West Andreaville, OH 44115']}

## Create BIO Tagged Text
The BIO / IOB format (short for inside, outside, beginning) is a common format for tagging tokens in chunking tasks in computational linguistics (e.g. named-entity recognition).

| Tag        | Meaning                                      | When to use it                                                                 |
|------------|----------------------------------------------|---------------------------------------------------------------------------------|
| **B-**     | **Begin** – This token **starts** a new entity | Always the **first** token of an entity<br> Use even if the entity is only 1 token long |
| **I-**     | **Inside** – This token is **inside** the same entity | Every token **after** the B- tag that still belongs to the same entity      |
| **O**      | **Outside** – This token is **not** part of any PII entity | All tokens that are not part of any named entity                                |
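
As an illustration of how the three tags fit together, here is a small sketch that reads a BIO-tagged sequence back into entities. The function name `bio_to_entities` and the example sentence are made up for this note and are not part of the pipeline below.

```python
def bio_to_entities(tokens, labels):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        # A B- tag always opens a new entity; so does an I- tag of another type.
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if label == "O" or starts_new:
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if label != "O":
            if starts_new:
                current_type = label[2:]
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["Contact", "Aaron", "Smith", "at", "aaron@example.com"]
labels = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"]
print(bio_to_entities(tokens, labels))
# [('NAME_STUDENT', 'Aaron Smith'), ('EMAIL', 'aaron@example.com')]
```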

In [11]:
# def create_bio_from_text_and_pii(row):
#     text = row["Text"]
#     pii_dict = row["PII_types"]

#     tokens = text.split()
#     labels = ["O"] * len(tokens)

#     for pii_type, values in pii_dict.items():
#         if not values:
#             continue
#         for pii_text in values:
#             if not pii_text.strip():
#                 continue

#             pii_words = pii_text.split()
#             n = len(pii_words)

#             for i in range(len(tokens) - n + 1):
#                 if tokens[i:i+n] == pii_words:
#                     labels[i] = f"B-{pii_type}"
#                     for j in range(1, n):
#                         labels[i+j] = f"I-{pii_type}"
#                     break

#     return pd.Series([tokens, labels])

# # I convert the doc-level PII dict into token-level BIO labels
# df[['tokens', 'bio_labels']] = df.apply(create_bio_from_text_and_pii, axis=1)
# print("Example:")
# print("Text :", df['Text'].iloc[0][:200])
# print("Tokens:", df['tokens'].iloc[0][:20])
# print("Labels:", df['bio_labels'].iloc[0][:30])
# print(len(df['bio_labels']))

Example:
Text : In today's modern world, where technology has become an integral part of our lives, it is essential for students like Richard Chang to navigate the digital landscape with ease. Richard, with his email
Tokens: ['In', "today's", 'modern', 'world,', 'where', 'technology', 'has', 'become', 'an', 'integral', 'part', 'of', 'our', 'lives,', 'it', 'is', 'essential', 'for', 'students', 'like']
Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NAME_STUDENT', 'B-NAME_STUDENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
2000


In [94]:
import re
import pandas as pd

TOKEN_REGEX = re.compile(r"\S+")
DATE_REGEX = re.compile(
    r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"
)

def create_bio_from_spans(row):
    text = row["Text"]
    pii_dict = row["PII_types"]

    # PII-safe tokenization (keeps punctuation attached)
    tokens = [m.group() for m in TOKEN_REGEX.finditer(text)]
    spans = [(m.start(), m.end()) for m in TOKEN_REGEX.finditer(text)]

    labels = ["O"] * len(tokens)

    for pii_type, values in pii_dict.items():
        for value in values:
            if not value.strip():
                continue

            for match in re.finditer(re.escape(value), text):
                span_start, span_end = match.span()
                first = True

                for i, (ts, te) in enumerate(spans):
                    # Overlap condition (CRITICAL)
                    if ts < span_end and te > span_start:
                        if labels[i] != "O":
                            continue
                        labels[i] = (
                            f"B-{pii_type}" if first else f"I-{pii_type}"
                        )
                        first = False

    # Label dates once per document (outside the PII loops), skipping tokens already tagged.
    for match in DATE_REGEX.finditer(text):
        span_start, span_end = match.span()
        first = True
        for i, (ts, te) in enumerate(spans):
            if ts < span_end and te > span_start:
                if labels[i] != "O":
                    continue
                labels[i] = "B-DATE" if first else "I-DATE"
                first = False

    return pd.Series([tokens, labels])
df[['tokens', 'bio_labels']] = df.apply(create_bio_from_spans, axis=1)
print("Example:")
print("Text :", df['Text'].iloc[0][:300])
print("Tokens:", df['tokens'].iloc[0][:60])
print("Labels:", df['bio_labels'].iloc[0][:60])
print(len(df['bio_labels']))

Example:
Text : In today's modern world, where technology has become an integral part of our lives, it is essential for students like Richard Chang to navigate the digital landscape with ease. Richard, with his email address gwilliams@yahoo.com and username brandy38, is well-equipped to embrace the opportunities th
Tokens: ['In', "today's", 'modern', 'world,', 'where', 'technology', 'has', 'become', 'an', 'integral', 'part', 'of', 'our', 'lives,', 'it', 'is', 'essential', 'for', 'students', 'like', 'Richard', 'Chang', 'to', 'navigate', 'the', 'digital', 'landscape', 'with', 'ease.', 'Richard,', 'with', 'his', 'email', 'address', 'gwilliams@yahoo.com', 'and', 'username', 'brandy38,', 'is', 'well-equipped', 'to', 'embrace', 'the', 'opportunities', 'that', 'the', 'online', 'realm', 'offers.', 'One', 'of', 'the', 'key', 'aspects', 'of', "Richard's", 'digital', 'presence', 'is', 'his']
Labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O
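
To make the overlap test in `create_bio_from_spans` easier to follow, here is a stripped-down sketch of the same idea on a single made-up sentence. The helper name `label_tokens` is hypothetical. A token is tagged whenever its character span overlaps the matched PII span, which also means attached punctuation (the comma in `Smith,`) ends up inside the entity.

```python
import re

def label_tokens(text, value, pii_type):
    """Tag whitespace tokens whose character span overlaps a match of `value`."""
    spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    labels = ["O"] * len(spans)
    for match in re.finditer(re.escape(value), text):
        start, end = match.span()
        first = True
        for i, (_, ts, te) in enumerate(spans):
            if ts < end and te > start:  # the two spans share at least one character
                labels[i] = f"B-{pii_type}" if first else f"I-{pii_type}"
                first = False
    return [(tok, lab) for (tok, _, _), lab in zip(spans, labels)]

pairs = label_tokens("Write to Aaron Smith, please.", "Aaron Smith", "NAME_STUDENT")
for token, label in pairs:
    print(token, label)
# Write O
# to O
# Aaron B-NAME_STUDENT
# Smith, I-NAME_STUDENT
# please. O
```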

In [95]:
# Test on new sentence
test_sentence = "Hi, my name is Minh and my email is minh123@gmail.com and phone 0908-123-456"

test_tokens = test_sentence.split()
test_labels = ["O"] * len(test_tokens)

# Use the same logic to mark
fake_pii = {
    "NAME_STUDENT": ["Minh"],
    "EMAIL": ["minh123@gmail.com"],
    "PHONE_NUM": ["0908-123-456"]
}

for pii_type, values in fake_pii.items():
    for val in values:
        words = val.split()
        for i in range(len(test_tokens) - len(words) + 1):
            if test_tokens[i:i+len(words)] == words:
                test_labels[i] = f"B-{pii_type}"
                for j in range(1, len(words)):
                    test_labels[i+j] = f"I-{pii_type}"

# Show result
for token, label in zip(test_tokens, test_labels):
    print(token, "→", label)

Hi, → O
my → O
name → O
is → O
Minh → B-NAME_STUDENT
and → O
my → O
email → O
is → O
minh123@gmail.com → B-EMAIL
and → O
phone → O
0908-123-456 → B-PHONE_NUM


One problem is that exact string matching can lead to memorization rather than real NER.

The model then learns:

“This string is EMAIL”

instead of:

“This pattern is EMAIL”

That would explain why:

Validation F1 ≈ 0.9

Real inference: often fails

I suspect this is the main reason my model still “cannot recognize PII”.
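
One way to test the memorization hypothesis is to measure how many distinct PII strings in the held-out split also occur in the training split. The sketch below is only an illustration: `entity_overlap` is a made-up helper, and the two small lists stand in for the `PII_types` dictionaries of the real train and test rows.

```python
def entity_overlap(train_pii, test_pii):
    """Fraction of distinct test PII strings that also occur in training."""
    train_values = {v for doc in train_pii for values in doc.values() for v in values}
    test_values = {v for doc in test_pii for values in doc.values() for v in values}
    if not test_values:
        return 0.0
    return len(test_values & train_values) / len(test_values)

# Hypothetical PII dicts in the same shape as df['PII_types'].
train_pii = [{"NAME_STUDENT": ["Richard", "Chang"], "EMAIL": ["a@example.com"]}]
test_pii = [{"NAME_STUDENT": ["Richard"], "EMAIL": ["b@example.com"]}]

print(entity_overlap(train_pii, test_pii))  # 0.5
```

A high overlap would suggest that the validation score partly rewards strings the model has already seen.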


## Prepare Dataset
Currently the PII types are plain strings (EMAIL, PHONE_NUM, etc.), so I need to collect all PII types to define the label set.

In [61]:
# from collections import defaultdict

# # Collect only the real entity types
# all_labels = set()
# for labels in df["bio_labels"]:
#     for label in labels:
#         if label != "O":
#             all_labels.add(label)

# unique_labels = ["O"] + sorted(all_labels)
# label2id = {label: idx for idx, label in enumerate(unique_labels)}
# id2label = {idx: label for label, idx in label2id.items()}

# print("Total labels:", len(unique_labels))
# print(label2id)
# print(all_labels)


In [96]:
ENTITY_TYPES = [
    "NAME_STUDENT",
    "EMAIL",
    "PHONE_NUM",
    "USERNAME",
    "URL_PERSONAL",
    "ID_NUM",
    "STREET_ADDRESS",
    "DATE"
]

labels = ["O"]
for ent in ENTITY_TYPES:
    labels.append(f"B-{ent}")
    labels.append(f"I-{ent}")

label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

print(len(labels))
print(label2id)


17
{'O': 0, 'B-NAME_STUDENT': 1, 'I-NAME_STUDENT': 2, 'B-EMAIL': 3, 'I-EMAIL': 4, 'B-PHONE_NUM': 5, 'I-PHONE_NUM': 6, 'B-USERNAME': 7, 'I-USERNAME': 8, 'B-URL_PERSONAL': 9, 'I-URL_PERSONAL': 10, 'B-ID_NUM': 11, 'I-ID_NUM': 12, 'B-STREET_ADDRESS': 13, 'I-STREET_ADDRESS': 14, 'B-DATE': 15, 'I-DATE': 16}
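
The alignment step further down converts `B-` to `I-` by adding 1 to the label ID, which works only because each `I-` label directly follows its `B-` label in this list. A small sketch of an explicit lookup that would not depend on that ordering (the two-type list is a shortened, hypothetical subset, and `b_to_i` is a name introduced here):

```python
ENTITY_TYPES = ["NAME_STUDENT", "EMAIL"]  # shortened, hypothetical subset

labels = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}

# Explicit B-X -> I-X lookup, independent of the order of `labels`.
b_to_i = {label2id[f"B-{ent}"]: label2id[f"I-{ent}"] for ent in ENTITY_TYPES}

print(labels)   # ['O', 'B-NAME_STUDENT', 'I-NAME_STUDENT', 'B-EMAIL', 'I-EMAIL']
print(b_to_i)   # {1: 2, 3: 4}
```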


Convert the DataFrame to a Hugging Face Dataset, making sure the labels are numeric IDs:

In [97]:
import pandas as pd
from datasets import Dataset

def preprocess_example(example):
    example["ner_tags"] = [label2id[label] for label in example["bio_labels"]]
    return example

# Map string BIO labels to integer IDs, then build the dataset from the needed columns only
df = df.apply(preprocess_example, axis=1)
dataset = Dataset.from_pandas(df[['tokens', 'bio_labels', 'ner_tags']])

# Split into train/test
dataset = dataset.train_test_split(test_size=0.2, seed=42)

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['tokens', 'bio_labels', 'ner_tags'],
        num_rows: 1600
    })
    test: Dataset({
        features: ['tokens', 'bio_labels', 'ner_tags'],
        num_rows: 400
    })
})


In [71]:
# df['bio_labels'].sample(50)

## Load a Pre-trained Model and Tokenizer
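Using the checkpoint "distilbert/distilbert-base-uncased". Note that this is an uncased checkpoint, so its tokenizer lowercases the input (visible in the token dump further down, e.g. `jesus`, `kaufman`).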

In [99]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "distilbert/distilbert-base-uncased"
# model_checkpoint = "dslim/bert-base-NER"
# model_checkpoint = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
num_labels = len(label2id)
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint, num_labels=num_labels, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Tokenize and Align Labels
Transformers use subword tokenization (e.g., "minh123@gmail.com" might split into "minh", "##123", "@", "gmail", ".", "com"), so the word-level labels must be aligned to subwords. Special tokens such as [CLS] and [SEP] get the label -100, which the loss ignores. For subwords after the first one in a word there are two common strategies: assign -100 so that only the first subword is scored, or propagate the entity label as I-XXX. The function below uses the second strategy, converting B-XXX to I-XXX on continuation subwords.
Define the tokenization function:

In [100]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        max_length=512,
    )

    all_labels = []

    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []

        for word_idx in word_ids:
            if word_idx is None:
                # CLS, SEP, PAD
                label_ids.append(-100)

            elif word_idx != previous_word_idx:
                # First subword of a token
                label_ids.append(word_labels[word_idx])

            else:
                # Continuation subword
                label = word_labels[word_idx]
                label_name = id2label[label]

                if label_name.startswith("B-"):
                    # Convert B-XXX → I-XXX
                    label_ids.append(label + 1)
                else:
                    label_ids.append(label)

            previous_word_idx = word_idx

        all_labels.append(label_ids)

    tokenized_inputs["labels"] = all_labels
    return tokenized_inputs
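
The two alignment strategies can be compared without a tokenizer. The sketch below is a simplified stand-in for the function above: `align_labels` is a made-up helper, and `word_ids` is written by hand to imitate what a fast tokenizer might return for `[CLS] is minh ##123 @ [SEP]`.

```python
def align_labels(word_ids, word_labels, b_to_i, propagate=True):
    """Expand word-level label IDs to subword level.

    propagate=True  -> continuation subwords get the I- label
    propagate=False -> continuation subwords get -100 (ignored by the loss)
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:                 # special token
            aligned.append(-100)
        elif word_id != previous:           # first subword of a word
            aligned.append(word_labels[word_id])
        elif propagate:                     # continuation subword, keep the entity
            label = word_labels[word_id]
            aligned.append(b_to_i.get(label, label))
        else:                               # continuation subword, ignore it
            aligned.append(-100)
        previous = word_id
    return aligned

# Label IDs follow the notebook's layout: O=0, B-EMAIL=3, I-EMAIL=4.
b_to_i = {3: 4}
word_ids = [None, 0, 1, 1, 1, None]   # hand-written, hypothetical
word_labels = [0, 3]                  # "is" -> O, "minh123@..." -> B-EMAIL

print(align_labels(word_ids, word_labels, b_to_i, propagate=True))
# [-100, 0, 3, 4, 4, -100]
print(align_labels(word_ids, word_labels, b_to_i, propagate=False))
# [-100, 0, 3, -100, -100, -100]
```

With propagation, long entities such as emails contribute many tokens to the loss and to the metrics, which may inflate scores relative to first-subword-only labeling.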


In [101]:
sample = tokenized_dataset["train"][0]

for tid, lid in zip(sample["input_ids"], sample["labels"]):
    print(tokenizer.convert_ids_to_tokens(tid), lid, id2label.get(lid, "IGN"))


[CLS] -100 IGN
in 0 O
today 0 O
' 0 O
s 0 O
modern 0 O
world 0 O
, 0 O
where 0 O
technology 0 O
has 0 O
become 0 O
an 0 O
integral 0 O
part 0 O
of 0 O
our 0 O
lives 0 O
, 0 O
it 0 O
is 0 O
not 0 O
surprising 0 O
to 0 O
see 0 O
students 0 O
like 0 O
jesus 1 B-NAME_STUDENT
kaufman 2 I-NAME_STUDENT
embracing 0 O
the 0 O
digital 0 O
age 0 O
. 0 O
with 0 O
an 0 O
email 0 O
address 0 O
like 0 O
daniel 3 B-EMAIL
##90 4 I-EMAIL
@ 4 I-EMAIL
yahoo 4 I-EMAIL
. 4 I-EMAIL
com 4 I-EMAIL
, 4 I-EMAIL
jesus 0 O
is 0 O
ready 0 O
to 0 O
connect 0 O
with 0 O
the 0 O
world 0 O
and 0 O
make 0 O
the 0 O
most 0 O
of 0 O
the 0 O
opportunities 0 O
that 0 O
come 0 O
his 0 O
way 0 O
. 0 O
when 0 O
it 0 O
comes 0 O
to 0 O
online 0 O
platforms 0 O
, 0 O
jesus 0 O
is 0 O
no 0 O
stranger 0 O
. 0 O
with 0 O
user 0 O
##name 0 O
##s 0 O
like 0 O
rs 7 B-USERNAME
##ala 8 I-USERNAME
##zar 8 I-USERNAME
and 0 O
courtney 7 B-USERNAME
##sh 8 I-USERNAME
##ep 8 I-USERNAME
##her 8 I-USERNAME
##d 8 I-USERNAME
, 8 I-USERNAME
he 0 O

## Set up training
Use the Trainer API for simplicity. It handles optimization, evaluation, etc.

In [102]:
metric = load("seqeval")

# Use the official data collator
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[p] for (p, l) in zip(pred, label) if l != -100]
        for pred, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# Apply the tokenization function and remove the original columns
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names
)
training_args = TrainingArguments(
    output_dir="./pii_model",
    num_train_epochs=8,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    fp16=True,
    dataloader_num_workers=2,
    report_to="none",
    dataloader_pin_memory = False
)

# Create the trainer with the collator
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

trainer.train()


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|----|----------|
| 1 | 0.3643 | 0.153517 | 0.361443 | 0.353894 | 0.357629 | 0.953068 |
| 2 | 0.0429 | 0.030815 | 0.87269 | 0.921921 | 0.89663 | 0.993452 |
| 3 | 0.0229 | 0.021592 | 0.923788 | 0.936951 | 0.930323 | 0.994819 |
| 4 | 0.0159 | 0.01823 | 0.936158 | 0.953152 | 0.944579 | 0.995586 |
| 5 | 0.012 | 0.017067 | 0.942135 | 0.959789 | 0.95088 | 0.995849 |
| 6 | 0.0105 | 0.017064 | 0.94504 | 0.963303 | 0.954084 | 0.995914 |
| 7 | 0.0086 | 0.017287 | 0.945179 | 0.962522 | 0.953772 | 0.996091 |
| 8 | 0.0081 | 0.016387 | 0.946721 | 0.960765 | 0.953691 | 0.996202 |


TrainOutput(global_step=800, training_loss=0.1673540211096406, metrics={'train_runtime': 321.1966, 'train_samples_per_second': 39.851, 'train_steps_per_second': 2.491, 'total_flos': 1672813255065600.0, 'train_loss': 0.1673540211096406, 'epoch': 8.0})
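
To read the run above together with `EarlyStoppingCallback(early_stopping_patience=2)`, here is a simplified sketch of patience-based early stopping applied to the F1 column. It is an approximation for illustration only (the helper `early_stop_epoch` is made up, and the real callback also supports an improvement threshold), not the library's implementation.

```python
def early_stop_epoch(scores, patience):
    """Return (stop_epoch, best_epoch), 1-indexed, for a higher-is-better metric."""
    best_score, best_epoch, bad_epochs = float("-inf"), 0, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(scores), best_epoch

# Validation F1 per epoch, copied from the training log above.
f1_per_epoch = [0.357629, 0.89663, 0.930323, 0.944579,
                0.95088, 0.954084, 0.953772, 0.953691]

print(early_stop_epoch(f1_per_epoch, patience=2))  # (8, 6)
```

Under this reading, F1 peaks at epoch 6 and the two following epochs do not improve on it, so with `load_best_model_at_end=True` the epoch-6 checkpoint should be the one that is kept.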

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [103]:
tokenized_dataset["train"]

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 1600
})

In [104]:
tokenized_dataset["test"]

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 400
})

After training, use the model to predict on new sentences:

In [105]:
from transformers import pipeline

ner_pipeline = pipeline("ner", model=trainer.model, tokenizer=tokenizer, aggregation_strategy="simple")

test_sentence = "Hi, my name is Minh and my email is minh123@gmail.com and phone 0908-123-456"
results = ner_pipeline(test_sentence)

for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")

Device set to use cuda:0


Entity: minh, Label: NAME_STUDENT, Score: 0.36
Entity: minh123 @ gmail. com, Label: EMAIL, Score: 0.95
Entity: phone, Label: ID_NUM, Score: 0.27
Entity: 0908 - 123 - 456, Label: PHONE_NUM, Score: 0.91


In [108]:
from transformers import pipeline

# Load NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=trainer.model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

test_sentences = [
    "My name is John Smith.",
    "You can contact me at john.smith@example.com.",
    "London is beautiful in the summer, but London Brown is my classmate.",
    "My phone number is +1 415 555 2671.",
    "I was born on 02/25/2005.",
    "I live at 123A Main Street, New York City.",
    "Call me at 415-822-4459 tomorrow.",
    "I'm Emily Johnson, born on 02/25/2005, living at 456 Oak Avenue, Los Angeles.",
    "My advisor is Michael Andrew Thompson.",
    "The student Sarah O'Connor submitted the form.",
    "Yesterday I met Jean-Luc Picard - a science student in class.",
    "Send the file to alice.bob+test@cs.stanford.edu.",
    "Her email is user_2024@mail.co.uk.",
    "Contact: support-team@company.io",
    "Call me at (212) 555-0198.",
    "My backup number is +44 20 7946 0958.",
    "Emergency contact: 0908 123 456.",

    "I live at 1600 Pennsylvania Avenue NW.",
    "Office location: 221B Baker Street, London.",
    "Ship to 742 Evergreen Terrace.",

    "Visit my site at https://johnsmith.dev",
    "My LinkedIn is linkedin.com/in/emilyjohnson",
    "Portfolio: www.alexchen.me",




]

for text in test_sentences:
    print(f"\nTEXT: {text}")
    results = ner_pipeline(text)
    if not results:
        print("No PII detected.")
    else:
        for entity in results:
            print(f"  Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")


Device set to use cuda:0



TEXT: My name is John Smith.
  Entity: john smith, Label: NAME_STUDENT, Score: 0.42

TEXT: You can contact me at john.smith@example.com.
  Entity: john. smith @ example. com., Label: EMAIL, Score: 0.78

TEXT: London is beautiful in the summer, but London Brown is my classmate.
  Entity: london, Label: NAME_STUDENT, Score: 0.61
  Entity: london brown, Label: NAME_STUDENT, Score: 0.71

TEXT: My phone number is +1 415 555 2671.
  Entity: + 1 415 555 2671., Label: PHONE_NUM, Score: 0.84

TEXT: I was born on 02/25/2005.
  Entity: /, Label: PHONE_NUM, Score: 0.22
  Entity: 25, Label: ID_NUM, Score: 0.18
  Entity: /, Label: PHONE_NUM, Score: 0.19
  Entity: 2005, Label: NAME_STUDENT, Score: 0.12

TEXT: I live at 123A Main Street, New York City.
  Entity: ##a main street,, Label: STREET_ADDRESS, Score: 0.34

TEXT: Call me at 415-822-4459 tomorrow.
  Entity: 415 - 822 - 4459, Label: PHONE_NUM, Score: 0.92

TEXT: I'm Emily Johnson, born on 02/25/2005, living at 456 Oak Avenue, Los Angeles.
  Ent
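
The date in "I was born on 02/25/2005." is never tagged as `DATE`. One thing worth checking (an assumption about the cause, not a confirmed diagnosis) is whether the training labels contain any `B-DATE` tags at all, since `DATE` is not in the PII dictionaries and comes only from `DATE_REGEX` matches in the essay text. A minimal sketch with made-up label lists; on the real data the input would be `df['bio_labels']`:

```python
from collections import Counter

def entity_type_counts(bio_label_lists):
    """Count how many entities of each type start (B- tags) in the labels."""
    counts = Counter()
    for labels in bio_label_lists:
        counts.update(label[2:] for label in labels if label.startswith("B-"))
    return counts

# Hypothetical label lists in the same shape as df['bio_labels'].
example_labels = [
    ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"],
    ["B-PHONE_NUM", "O", "O", "B-NAME_STUDENT"],
]
counts = entity_type_counts(example_labels)

print(counts["NAME_STUDENT"])  # 2
print(counts["DATE"])          # 0, i.e. no training signal for this label
```

If the real count for `DATE` is zero or very small, the model has had no chance to learn that label.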

In [110]:
from datetime import datetime
import os

# Create folder name with today's date
today = datetime.now().strftime("%Y-%m-%d")
save_dir = f"./ner_model_{today}"

os.makedirs(save_dir, exist_ok=True)

# Save model and tokenizer
trainer.model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

print(f"Model saved to: {save_dir}")


Model saved to: ./ner_model_2025-12-13


In [114]:
# from datetime import datetime
# import os

# today = datetime.now().strftime("%Y-%m-%d")

# save_dir = rf"C:\Users\nguye\OneDrive\Desktop\uni\Y2_Sem3\individual_project\PII_Detection_3\PII_Detection\notebooks\checkpoints\ner_model_{today}"

# os.makedirs(save_dir, exist_ok=True)

# trainer.model.save_pretrained(save_dir)
# tokenizer.save_pretrained(save_dir)

# print(f"Model saved to: {save_dir}")


In [109]:
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="./ner_model_2025-12-13",
    tokenizer="./ner_model_2025-12-13",
    aggregation_strategy="simple"
)

test_sentence = "Hi, my name is Minh and my email is minh123@gmail.com and phone 0908-123-456"
results = ner_pipeline(test_sentence)

for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")


Device set to use cuda:0


Entity: minh, Label: NAME_STUDENT, Score: 0.54
Entity: minh123 @ gmail. com, Label: EMAIL, Score: 0.93
Entity: 0908 - 123 - 456, Label: PHONE_NUM, Score: 0.81
