# PII NER: Beginner-Friendly Colab Notebook

This notebook implements the end-to-end PII NER assignment for noisy STT transcripts.

**What this notebook contains (Beginner-Friendly):**
- Folder setup
- Dataset generation (synthetic noisy STT)
- Training a DistilBERT token classifier (baseline)
- Prediction with conservative post-processing
- Simple evaluation and latency measurement
- Tips and next steps

**Assignment PDF (uploaded):**

- `/mnt/data/IIT Madras __ Assignment 2025 (1).pdf` (open it from the Colab Files pane).

---

**How to use:** Upload this notebook to Google Colab and run cells sequentially.

## 1) Create the project folders
Run this cell to create the `data`, `scripts`, `src`, and `out` directories used by the rest of the notebook.

In [None]:
!mkdir -p data scripts src out
!ls -la


total 32
drwxr-xr-x 1 root root 4096 Nov 23 13:39 .
drwxr-xr-x 1 root root 4096 Nov 23 13:38 ..
drwxr-xr-x 4 root root 4096 Nov 20 14:30 .config
drwxr-xr-x 2 root root 4096 Nov 23 13:39 data
drwxr-xr-x 2 root root 4096 Nov 23 13:39 out
drwxr-xr-x 1 root root 4096 Nov 20 14:30 sample_data
drwxr-xr-x 2 root root 4096 Nov 23 13:39 scripts
drwxr-xr-x 2 root root 4096 Nov 23 13:39 src


## 2) Install required packages
Run this cell to install `transformers`, `datasets`, and the other Python dependencies. Colab usually ships with PyTorch already, so the second command is typically a no-op.

In [None]:
!pip install -q transformers datasets seqeval tokenizers accelerate sacrebleu
!pip install -q torch


## 3) Write the dataset generator script
This script creates `data/train.jsonl` and `data/dev.jsonl` with noisy STT patterns.

In [None]:
%%writefile scripts/generate_stt_data.py
import json
import random
import uuid
from pathlib import Path

NAMES = ["john", "mary", "alice", "bob", "mohammed", "anvit", "ravi", "sneha", "arjun", "kiran"]
DOMAINS = ["gmail", "yahoo", "example", "outlook"]
CITIES = ["mumbai", "chennai", "delhi", "bangalore", "kolkata"]
FILLERS = ["uh", "okay", "you know", "like"]

MAP_DIGIT = {
    '0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
    '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'
}

def random_name():
    name = random.choice(NAMES)
    if random.random() < 0.25:
        sec = random.choice(NAMES)
        if sec != name:
            name = name + " " + sec
    return name

def make_email():
    local = random.choice(NAMES)
    if random.random() < 0.2:
        local = " ".join(list(local))
    local = local.replace(" ", " dot ")
    return f"{local} at {random.choice(DOMAINS)} dot com", "EMAIL"

def make_phone():
    digits = ''.join(random.choice('0123456789') for _ in range(10))
    if random.random() < 0.5:
        return " ".join(MAP_DIGIT[d] for d in digits), "PHONE"
    return digits, "PHONE"

def make_credit():
    digits = ''.join(random.choice('0123456789') for _ in range(16))
    return " ".join(MAP_DIGIT[d] for d in digits), "CREDIT_CARD"

def make_date():
    day = random.randint(1, 28)
    month = random.choice(["january","february","march","april","may","june",
                           "july","august","september","october","november","december"])
    year = random.choice(["two thousand twenty", "twenty twenty one"])
    return f"{day} of {month} {year}", "DATE"

def make_city():
    return random.choice(CITIES), "CITY"

ENTITY_MAKERS = [make_email, make_phone, make_credit, make_date, random_name, make_city]

def generate_example():
    base = ["i", "think", "my", "account", "is", "with"]
    if random.random() < 0.3:
        base.insert(0, random.choice(FILLERS))
    prefix = " ".join(base)

    maker = random.choice(ENTITY_MAKERS)
    if maker is random_name:
        # random_name() returns only the text, so the label is set here
        ent_text = maker()
        label = "PERSON_NAME"
    else:
        ent_text, label = maker()

    # The entity is appended after a single space, so its character offsets
    # follow directly from the prefix length (no substring search needed).
    start = len(prefix) + 1
    end = start + len(ent_text)
    text = prefix + " " + ent_text

    return {"id": str(uuid.uuid4())[:8], "text": text, "entities": [{"start": start, "end": end, "label": label}]}

def write_dataset():
    Path("data").mkdir(exist_ok=True)
    with open("data/train.jsonl", "w") as t:
        for _ in range(800):
            t.write(json.dumps(generate_example()) + "\n")
    with open("data/dev.jsonl", "w") as d:
        for _ in range(150):
            d.write(json.dumps(generate_example()) + "\n")

if __name__ == "__main__":
    random.seed(42)
    write_dataset()


Overwriting scripts/generate_stt_data.py


In [None]:
!python scripts/generate_stt_data.py
!head -n 5 data/train.jsonl


{"id": "3e4693b1", "text": "i think my account is with mohammed at yahoo dot com", "entities": [{"start": 27, "end": 52, "label": "EMAIL"}]}
{"id": "b4c67348", "text": "i think my account is with kolkata", "entities": [{"start": 27, "end": 34, "label": "CITY"}]}
{"id": "585675aa", "text": "like i think my account is with j dot o dot h dot n at yahoo dot com", "entities": [{"start": 32, "end": 68, "label": "EMAIL"}]}
{"id": "5978bfb9", "text": "i think my account is with a dot r dot j dot u dot n at outlook dot com", "entities": [{"start": 27, "end": 71, "label": "EMAIL"}]}
{"id": "4ff6683b", "text": "you know i think my account is with alice at example dot com", "entities": [{"start": 36, "end": 60, "label": "EMAIL"}]}
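
Before training, it can help to confirm that each `start`/`end` pair really slices out the entity text. The sketch below is a hypothetical helper (`check_spans` is not part of the assignment code) applied to two records copied from the sample output above.

```python
import json

def check_spans(records):
    """Return (id, label, span_text) per entity; raise if a span is empty or out of bounds."""
    found = []
    for rec in records:
        text = rec["text"]
        for ent in rec["entities"]:
            start, end = ent["start"], ent["end"]
            if not (0 <= start < end <= len(text)):
                raise ValueError(f"bad span in {rec['id']}: {start}-{end}")
            found.append((rec["id"], ent["label"], text[start:end]))
    return found

# Two records copied from the generator's sample output.
sample_lines = [
    '{"id": "3e4693b1", "text": "i think my account is with mohammed at yahoo dot com", '
    '"entities": [{"start": 27, "end": 52, "label": "EMAIL"}]}',
    '{"id": "b4c67348", "text": "i think my account is with kolkata", '
    '"entities": [{"start": 27, "end": 34, "label": "CITY"}]}',
]
records = [json.loads(line) for line in sample_lines]
checked = check_spans(records)
for rec_id, label, span_text in checked:
    print(rec_id, label, repr(span_text))
```

To check the full file, the same function could be run over every line of `data/train.jsonl`.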


In [None]:
%%writefile src/dataset.py
import json
from datasets import Dataset

LABEL_LIST = [
    "O",
    "B-CREDIT_CARD","I-CREDIT_CARD",
    "B-PHONE","I-PHONE",
    "B-EMAIL","I-EMAIL",
    "B-PERSON_NAME","I-PERSON_NAME",
    "B-DATE","I-DATE",
    "B-CITY","I-CITY",
]

label_to_id = {l:i for i,l in enumerate(LABEL_LIST)}

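def read_jsonl(path):
    # Use a context manager so the file handle is closed after reading
    with open(path, "r", encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]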

def char_labels_to_token_labels(tokenizer, text, entities):
    # offset_mapping gives each token's (start, end) character span in `text`
    enc = tokenizer(text, return_offsets_mapping=True, truncation=True)
    offsets = enc["offset_mapping"]
    labels = ["O"] * len(offsets)

    for ent in entities:
        s, e, lab = ent["start"], ent["end"], ent["label"]
        started = False
        for i, (tok_start, tok_end) in enumerate(offsets):
            # Skip tokens that do not overlap the entity's character span
            if tok_end <= s or tok_start >= e:
                continue
            if not started:
                # First overlapping token opens the entity
                labels[i] = "B-" + lab
                started = True
            else:
                labels[i] = "I-" + lab

    return enc, [label_to_id[l] for l in labels]

def build_hf_dataset(tokenizer, path):
    data = read_jsonl(path)
    inputs=[]; masks=[]; lbls=[]; offs=[]; texts=[]
    for item in data:
        enc, ids = char_labels_to_token_labels(tokenizer,item["text"],item["entities"])
        inputs.append(enc["input_ids"])
        masks.append(enc["attention_mask"])
        lbls.append(ids)
        offs.append(enc["offset_mapping"])
        texts.append(item["text"])

    return Dataset.from_dict({
        "input_ids":inputs,
        "attention_mask":masks,
        "labels":lbls,
        "offset_mapping":offs,
        "text":texts
    })


Writing src/dataset.py
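
The overlap rule in `char_labels_to_token_labels` can be tried without downloading a tokenizer. This is a minimal sketch that assumes a whitespace splitter (`whitespace_offsets`, a hypothetical stand-in for the fast tokenizer's `offset_mapping`). The real DistilBERT tokenizer yields sub-word pieces, but the tagging logic is the same.

```python
import re

def whitespace_offsets(text):
    """Stand-in tokenizer: one token per whitespace-separated word, as (start, end) offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def bio_tags(offsets, entities):
    """Assign B-/I- tags to every token that overlaps an entity's character span."""
    tags = ["O"] * len(offsets)
    for ent in entities:
        started = False
        for i, (tok_start, tok_end) in enumerate(offsets):
            # Skip tokens that do not overlap the entity span at all.
            if tok_end <= ent["start"] or tok_start >= ent["end"]:
                continue
            tags[i] = ("I-" if started else "B-") + ent["label"]
            started = True
    return tags

text = "i think my account is with ravi at outlook dot com"
entities = [{"start": 27, "end": 50, "label": "EMAIL"}]
offsets = whitespace_offsets(text)
tags = bio_tags(offsets, entities)
for (s, e), tag in zip(offsets, tags):
    print(f"{text[s:e]:<10}{tag}")
```

The first overlapping token receives `B-EMAIL` and the remaining ones `I-EMAIL`, so a spoken email spanning several words is still one entity.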


In [None]:
%%writefile src/train.py
import argparse
from transformers import AutoTokenizer,AutoModelForTokenClassification,TrainingArguments,Trainer
from src.dataset import build_hf_dataset, LABEL_LIST

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name",default="distilbert-base-uncased")
    parser.add_argument("--train",default="data/train.jsonl")
    parser.add_argument("--dev",default="data/dev.jsonl")
    parser.add_argument("--out_dir",default="out/distil_baseline")
    parser.add_argument("--epochs",type=int,default=2)
    parser.add_argument("--batch_size",type=int,default=8)
    args=parser.parse_args()

    tok = AutoTokenizer.from_pretrained(args.model_name, use_fast=True)
    train = build_hf_dataset(tok, args.train)
    dev = build_hf_dataset(tok, args.dev)

    model = AutoModelForTokenClassification.from_pretrained(
        args.model_name,
        num_labels=len(LABEL_LIST)
    )

    tr_args = TrainingArguments(
        output_dir=args.out_dir,
        evaluation_strategy="epoch",
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        num_train_epochs=args.epochs,
        save_strategy="epoch",
        logging_steps=50
    )

    trainer = Trainer(
        model=model,
        args=tr_args,
        train_dataset=train,
        eval_dataset=dev,
        tokenizer=tok,
    )

    trainer.train()
    trainer.save_model(args.out_dir)

if __name__=="__main__":
    main()


Writing src/train.py


In [None]:
%%writefile src/train.py
import sys
sys.path.append('/content')
sys.path.append('/content/src')

import argparse
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification
)
from src.dataset import build_hf_dataset, LABEL_LIST

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", default="distilbert-base-uncased")
    parser.add_argument("--train", default="data/train.jsonl")
    parser.add_argument("--dev", default="data/dev.jsonl")
    parser.add_argument("--out_dir", default="out/distil_baseline")
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--batch_size", type=int, default=8)
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_fast=True)
    train_ds = build_hf_dataset(tokenizer, args.train)
    dev_ds = build_hf_dataset(tokenizer, args.dev)

    model = AutoModelForTokenClassification.from_pretrained(
        args.model_name,
        num_labels=len(LABEL_LIST)
    )

    # FIX: pad input_ids and labels together within each batch
    data_collator = DataCollatorForTokenClassification(tokenizer)

    training_args = TrainingArguments(
        output_dir=args.out_dir,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        num_train_epochs=args.epochs,
        logging_steps=50,
        save_steps=500,
        eval_steps=500,
        logging_dir="logs",
        do_eval=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=dev_ds,
        tokenizer=tokenizer,
        data_collator=data_collator,  # needed so labels are padded along with the inputs
    )

    trainer.train()
    trainer.save_model(args.out_dir)

if __name__ == "__main__":
    main()


Overwriting src/train.py
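
The collator matters because examples in a batch have different token counts, and the label lists must be padded to the same length as the inputs. The sketch below illustrates the idea in plain Python. It is not the library's implementation, `pad_batch` is a hypothetical name, and the token ids are made up. It assumes a pad token id of 0 and pads labels with -100, the value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def pad_batch(features, pad_token_id=0):
    """Pad every example in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # Padded positions get IGNORE_INDEX so they do not contribute to the loss.
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * n_pad)
    return batch

# Illustrative token ids and label ids only.
features = [
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1], "labels": [0, 5, 0]},
    {"input_ids": [101, 7592, 2088, 999, 102], "attention_mask": [1] * 5, "labels": [0, 5, 6, 6, 0]},
]
batch = pad_batch(features)
print(batch["labels"])
```

Without a step like this, the variable-length `labels` lists cannot be stacked into one tensor, which is presumably the failure the `DataCollatorForTokenClassification` fix addresses.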


In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"


In [None]:
!python src/train.py --epochs 2


2025-11-23 14:01:15.031634: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1763906475.108488    5743 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1763906475.150423    5743 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1763906475.207019    5743 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1763906475.207096    5743 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1763906475.207105    5743 computation_placer.cc:177] computation placer alr

In [None]:
!ls -R out/distil_baseline


out/distil_baseline:
checkpoint-200	   special_tokens_map.json  training_args.bin
config.json	   tokenizer_config.json    vocab.txt
model.safetensors  tokenizer.json

out/distil_baseline/checkpoint-200:
config.json	   scheduler.pt		    trainer_state.json
model.safetensors  special_tokens_map.json  training_args.bin
optimizer.pt	   tokenizer_config.json    vocab.txt
rng_state.pth	   tokenizer.json


In [None]:
%%writefile src/predict_and_postprocess.py
import json, re, argparse, torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from src.dataset import LABEL_LIST

EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def normalize(text):
    return text.replace(" at ", "@").replace(" dot ", ".")

def bio_to_spans(tokens, labels, offsets, text):
    spans = []
    curr = None
    for i, lid in enumerate(labels):
        lab = LABEL_LIST[lid]
        if lab.startswith("B-"):
            if curr:
                spans.append(curr)
            curr = {"label": lab[2:], "start": offsets[i][0], "end": offsets[i][1]}
        elif lab.startswith("I-") and curr:
            curr["end"] = offsets[i][1]
        else:
            if curr:
                spans.append(curr)
                curr = None
    if curr:
        spans.append(curr)

    for s in spans:
        s["text"] = text[s["start"]:s["end"]]
    return spans

def is_valid(span):
    if span["label"] == "EMAIL":
        return EMAIL_RE.match(normalize(span["text"])) is not None
    return True

def run_prediction(model_dir="out/distil_baseline",
                   input_path="data/dev.jsonl",
                   output_path="out/dev_pred.json"):

    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForTokenClassification.from_pretrained(model_dir)
    model.eval()

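    # Read the input JSONL with a context manager so the file is closed
    with open(input_path, "r", encoding="utf8") as f:
        data = [json.loads(line) for line in f]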
    out = []

    for item in data:
        text = item["text"]
        enc = tok(text, return_offsets_mapping=True, return_tensors="pt")

        with torch.no_grad():
            pred = model(enc["input_ids"], attention_mask=enc["attention_mask"]).logits.argmax(-1)[0]

        spans = bio_to_spans(
            tok.convert_ids_to_tokens(enc["input_ids"][0]),
            pred.tolist(),
            enc["offset_mapping"][0].tolist(),
            text
        )

        final = [s for s in spans if is_valid(s)]
        for f in final:
            f["pii"] = True

        out.append({
            "id": item["id"],
            "text": text,
            "predicted_entities": final
        })

    with open(output_path, "w") as f:
        for x in out:
            f.write(json.dumps(x) + "\n")

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", default="out/distil_baseline")
    parser.add_argument("--input", default="data/dev.jsonl")
    parser.add_argument("--output", default="out/dev_pred.json")
    args = parser.parse_args()
    run_prediction(args.model_dir, args.input, args.output)

if __name__ == "__main__":
    main()


Overwriting src/predict_and_postprocess.py
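
The "conservative" part of the post-processing is the email check: a predicted `EMAIL` span is kept only if its spoken form normalizes to something that looks like an address. The sketch below reuses `normalize` and `EMAIL_RE` as defined in the script and applies them to a few candidate strings. The last two candidates are made-up negative examples.

```python
import re

EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def normalize(text):
    """Map spoken ' at ' and ' dot ' to '@' and '.'."""
    return text.replace(" at ", "@").replace(" dot ", ".")

candidates = [
    "ravi at outlook dot com",               # plain spoken email
    "j dot o dot h dot n at yahoo dot com",  # letter-by-letter local part
    "ravi at outlook",                       # no top-level domain
    "at outlook dot com",                    # missing local part
]
results = {c: bool(EMAIL_RE.match(normalize(c))) for c in candidates}
for cand, ok in results.items():
    print(f"{normalize(cand)!r:<28} valid={ok}")
```

Spans with other labels pass through `is_valid` unchanged, so this filter can only remove `EMAIL` predictions, never add any.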


In [None]:
!sed -n '1,200p' src/predict_and_postprocess.py



In [None]:
!rm -f /content/src/__pycache__/*.pyc
!rm -rf /content/src/__pycache__


In [None]:
import sys
sys.path.append("/content")
sys.path.append("/content/src")

import importlib
import src.predict_and_postprocess
importlib.reload(src.predict_and_postprocess)

from src.predict_and_postprocess import run_prediction


In [None]:
run_prediction()

!head -n 10 out/dev_pred.json


{"id": "ead8ce41", "text": "like i think my account is with 0064256908", "predicted_entities": [{"label": "PHONE", "start": 32, "end": 42, "text": "0064256908", "pii": true}]}
{"id": "33eb1887", "text": "i think my account is with 14 of july two thousand twenty", "predicted_entities": [{"label": "DATE", "start": 27, "end": 57, "text": "14 of july two thousand twenty", "pii": true}]}
{"id": "145ef083", "text": "i think my account is with ravi at outlook dot com", "predicted_entities": [{"label": "EMAIL", "start": 27, "end": 50, "text": "ravi at outlook dot com", "pii": true}]}
{"id": "72f26aa4", "text": "you know i think my account is with four six zero one five eight five one seven four three two eight zero six three", "predicted_entities": [{"label": "CREDIT_CARD", "start": 36, "end": 115, "text": "four six zero one five eight five one seven four three two eight zero six three", "pii": true}]}
{"id": "7c3a5eac", "text": "like i think my account is with anvit at outlook dot com", "pred
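
The introduction mentions latency measurement. One simple way to get per-utterance numbers is to time a prediction function over the dev texts. This is a minimal sketch: `measure_latency` and `fake_predict` are hypothetical names, and `fake_predict` stands in for a function that runs the tokenizer and model on one text.

```python
import statistics
import time

def measure_latency(fn, inputs, warmup=3):
    """Time fn(x) for each x in inputs; return p50 and p95 in milliseconds."""
    for x in inputs[:warmup]:
        fn(x)  # warm-up calls are not timed
    timings_ms = []
    for x in inputs:
        t0 = time.perf_counter()
        fn(x)
        timings_ms.append((time.perf_counter() - t0) * 1000.0)
    timings_ms.sort()
    p95_index = min(len(timings_ms) - 1, int(0.95 * len(timings_ms)))
    return {
        "n": len(timings_ms),
        "p50_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
    }

def fake_predict(text):
    # Placeholder for a real single-text prediction call.
    return text.split()

texts = ["i think my account is with ravi at outlook dot com"] * 50
stats = measure_latency(fake_predict, texts)
print(stats)
```

Reporting the median alongside a high percentile gives a fairer picture than the mean, since the first few calls are often slower.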

In [None]:
# Cell A: compute PII metrics (dev)
import json
from pathlib import Path

BASE = Path("/content")
gold_path = BASE/"data"/"dev.jsonl"
pred_path = BASE/"out"/"dev_pred.json"

def load_jsonl(p):
    if not p.exists():
        print("MISSING:", p)
        return None
    # Context manager closes the file once all lines are parsed
    with open(p, "r", encoding="utf8") as f:
        return [json.loads(line) for line in f]

gold = load_jsonl(gold_path)
pred = load_jsonl(pred_path)

if gold is None or pred is None:
    raise SystemExit("Gold or prediction file missing. Check paths and generate predictions first.")

gold_map = {g["id"]: g for g in gold}
pred_map = {p["id"]: p for p in pred}

labels_to_check = {"CREDIT_CARD","PHONE","EMAIL","PERSON_NAME","DATE"}
tp = fp = fn = 0
for gid, g in gold_map.items():
    gold_spans = {(e["start"], e["end"], e["label"]) for e in g.get("entities", []) if e["label"] in labels_to_check}
    p = pred_map.get(gid, {})
    pred_spans = {(e["start"], e["end"], e["label"]) for e in p.get("predicted_entities", []) if e["label"] in labels_to_check}
    for ps in pred_spans:
        if ps in gold_spans:
            tp += 1
        else:
            fp += 1
    for gs in gold_spans:
        if gs not in pred_spans:
            fn += 1

precision = tp / (tp+fp) if (tp+fp)>0 else 0.0
recall = tp / (tp+fn) if (tp+fn)>0 else 0.0
f1 = 2*precision*recall/(precision+recall) if (precision+recall)>0 else 0.0

metrics = {"TP":tp,"FP":fp,"FN":fn,"precision":round(precision,4),"recall":round(recall,4),"f1":round(f1,4)}
print("Metrics:", json.dumps(metrics, indent=2))

# Save metrics to file
outdir = Path("/mnt/data")
outdir.mkdir(exist_ok=True)
with open(outdir/"metrics.json","w",encoding="utf8") as f:
    json.dump(metrics, f, indent=2)
print("Saved metrics to /mnt/data/metrics.json")


Metrics: {
  "TP": 126,
  "FP": 0,
  "FN": 0,
  "precision": 1.0,
  "recall": 1.0,
  "f1": 1.0
}
Saved metrics to /mnt/data/metrics.json
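
The metric above is exact-match at the span level: a prediction counts as a true positive only when start, end, and label all match a gold span. A small worked example with made-up spans (`span_prf`, `toy_gold`, and `toy_pred` are hypothetical names) shows how a one-character boundary error costs both precision and recall.

```python
def span_prf(gold_spans, pred_spans):
    """Exact-match precision, recall, and F1 over sets of (start, end, label) tuples."""
    tp = len(gold_spans & pred_spans)
    fp = len(pred_spans - gold_spans)
    fn = len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall, "f1": f1}

toy_gold = {(27, 50, "EMAIL"), (60, 70, "PHONE")}
toy_pred = {
    (27, 50, "EMAIL"),  # exact match: true positive
    (60, 69, "PHONE"),  # end is off by one: false positive, and the gold PHONE is missed
    (80, 90, "DATE"),   # spurious prediction: false positive
}
toy_metrics = span_prf(toy_gold, toy_pred)
print(toy_metrics)
```

Note that `labels_to_check` in the cell above leaves out `CITY`, which is why the dev run counts 126 spans rather than one for each of the 150 dev examples.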


In [None]:
# Cell C: create submission zip
import zipfile, os
from pathlib import Path

BASE = Path("/content")
OUTZIP = Path("/mnt/data/PII_NER_submission.zip")
ASSIGNMENT_PDF = Path("/mnt/data/IIT Madras __ Assignment 2025 (1).pdf")

with zipfile.ZipFile(OUTZIP, "w", compression=zipfile.ZIP_DEFLATED) as z:
    # add src
    if (BASE/"src").exists():
        for root,_,files in os.walk(BASE/"src"):
            for fn in files:
                full = Path(root)/fn
                z.write(full, arcname=str(Path("src")/full.relative_to(BASE/"src")))
    # add scripts
    if (BASE/"scripts").exists():
        for root,_,files in os.walk(BASE/"scripts"):
            for fn in files:
                full = Path(root)/fn
                z.write(full, arcname=str(Path("scripts")/full.relative_to(BASE/"scripts")))
    # add prediction if exists
    pred = BASE/"out"/"dev_pred.json"
    if pred.exists():
        z.write(pred, arcname="out/dev_pred.json")
    # add README and metrics
    for fn in ["README.md","metrics.json"]:
        p = Path("/mnt/data")/fn
        if p.exists():
            z.write(p, arcname=fn)
    # add assignment PDF if exists
    if ASSIGNMENT_PDF.exists():
        z.write(ASSIGNMENT_PDF, arcname=ASSIGNMENT_PDF.name)

print("Created:", OUTZIP)
print("Size (bytes):", OUTZIP.stat().st_size)
print("List contents (first 30):")
with zipfile.ZipFile(OUTZIP, "r") as z:
    for info in z.infolist()[:30]:
        print("-", info.filename, info.file_size)


Created: /mnt/data/PII_NER_submission.zip
Size (bytes): 11761
List contents (first 30):
- src/dataset.py 1572
- src/predict_and_postprocess.py 2585
- src/train.py 1873
- src/__pycache__/predict_and_postprocess.cpython-312.pyc 4400
- scripts/generate_stt_data.py 2721
- out/dev_pred.json 32012
- README.md 1680
- metrics.json 87
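
The listing shows that a compiled `.pyc` file from `src/__pycache__` ended up in the archive. If that is not wanted, the walk can skip cache directories. The sketch below is one possible variant (`zip_sources` is a hypothetical helper, demonstrated on a temporary directory rather than `/content`).

```python
import tempfile
import zipfile
from pathlib import Path

def zip_sources(src_dir, zip_path, arc_prefix="src"):
    """Zip files under src_dir, skipping __pycache__ directories and .pyc files."""
    src_dir = Path(src_dir)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as z:
        for path in sorted(src_dir.rglob("*")):
            if not path.is_file():
                continue
            if "__pycache__" in path.parts or path.suffix == ".pyc":
                continue
            arcname = (Path(arc_prefix) / path.relative_to(src_dir)).as_posix()
            z.write(path, arcname=arcname)
    with zipfile.ZipFile(zip_path, "r") as z:
        return z.namelist()

# Demo on a throwaway directory that mimics src/ with a cache folder.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    (src / "__pycache__").mkdir(parents=True)
    (src / "dataset.py").write_text("LABEL_LIST = []\n")
    (src / "__pycache__" / "dataset.cpython-312.pyc").write_bytes(b"\x00")
    names = zip_sources(src, Path(tmp) / "demo.zip")
print(names)
```

The same filter could be applied to the `scripts` folder by calling the helper with a different `arc_prefix`.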


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!ls -lh /mnt/data


total 20K
-rw-r--r-- 1 root root   87 Nov 23 14:33 metrics.json
-rw-r--r-- 1 root root  12K Nov 23 14:34 PII_NER_submission.zip
-rw-r--r-- 1 root root 1.7K Nov 23 14:34 README.md


In [47]:
!cp /mnt/data/PII_NER_submission.zip /content/
!cp /mnt/data/metrics.json /content/
!cp /mnt/data/README.md /content/
!cp /content/out/dev_pred.json /content/
