## 1. Install Dependencies
Let's start by installing the necessary libraries from the Hugging Face ecosystem. We need `transformers` for the model, `datasets` to download the Nemotron-PII data, `evaluate` and `seqeval` for calculating our NER evaluation metrics, and `accelerate` to optimize the PyTorch training loop.

In [None]:
!pip install transformers datasets seqeval evaluate accelerate -q

## 2. Configuration & Data Loading
Here we define our base model (`sentence-transformers/all-MiniLM-L6-v2`) and the dataset we want to fine-tune on (`nvidia/Nemotron-PII`), and set a maximum sequence length of 512 tokens. We load both the train and test splits, concatenate them, and reshuffle the pooled data into a 90/10 train/validation split with a fixed seed. Note that this folds the dataset's original test split into the pool, so the validation set is the only held-out data.
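
The pool-then-split idea can be sketched in plain Python. This is only an illustration of the concept, not how the `datasets` library implements `train_test_split`; the helper name `split_90_10` and the integer records are hypothetical stand-ins.

```python
import random

def split_90_10(records, seed=42):
    """Shuffle a copy of the records with a fixed seed and cut off 10% for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * 0.1)
    return shuffled[:cut], shuffled[cut:]

# Stand-ins for the original train and test splits, pooled together
pooled = list(range(80)) + list(range(80, 100))
train, validation = split_90_10(pooled)

print(len(train), len(validation))
```

Fixing the seed keeps the split reproducible across notebook restarts, which matters when resuming training from a checkpoint.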

In [None]:
import numpy as np
from datasets import load_dataset, concatenate_datasets

from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer
)
import evaluate

# 1. Settings
MODEL_CHECKPOINT = "sentence-transformers/all-MiniLM-L6-v2"
DATASET_NAME = "nvidia/Nemotron-PII"
MAX_LENGTH = 512

# 2. Load Data
print("\n Downloading dataset...")
train_split = load_dataset(DATASET_NAME, split="train")
test_split = load_dataset(DATASET_NAME, split="test")
raw_datasets = concatenate_datasets([train_split, test_split])
raw_datasets = raw_datasets.train_test_split(test_size=0.1, seed=42)


print(f"Training on {len(raw_datasets['train'])} samples")

## 3. Preprocessing & Token Alignment
Token Classification (NER) requires labels to align exactly with the tokenized text. The `Nemotron-PII` dataset stores entity spans as stringified lists of dictionaries, so we first iterate through the training split and safely parse each entry with `ast.literal_eval` to collect all unique PII categories.
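
As a minimal sketch of the parsing step, the example below runs `ast.literal_eval` on a made-up record. The string contents, the label names, and the helper `parse_spans` are assumptions for illustration, not values taken from the dataset.

```python
import ast

# Hypothetical record, shaped like the stringified "spans" field described above
raw = "[{'start': 11, 'end': 19, 'label': 'first_name'}, {'start': 30, 'end': 45, 'label': 'email'}]"

def parse_spans(spans_string):
    """Return a list of span dicts, or an empty list if the string cannot be parsed."""
    try:
        spans = ast.literal_eval(spans_string)
    except (ValueError, SyntaxError):
        return []
    return spans if isinstance(spans, list) else []

spans = parse_spans(raw)
labels = sorted({span["label"] for span in spans})
print(labels)
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so a malformed or malicious string raises an error instead of executing code.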

Next, we map our character-level entity spans to the model's subword tokens using the standard **BIO (Begin, Inside, Outside)** tagging format. A token receives an entity tag only when its character range lies entirely inside a span. The `tokenize_and_align_labels` function labels special tokens (like `[CLS]` or `[SEP]`) and padding as `-100` so that they are ignored during loss calculation.
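
The alignment rule can be tried on a toy example without a tokenizer. The offsets below are written by hand to resemble what a fast tokenizer might return, and the text, span, and helper name `bio_tags` are hypothetical.

```python
text = "Email Jane Doe today"

# Hand-written character offsets; (0, 0) stands in for special tokens such as [CLS] and [SEP]
offsets = [(0, 0), (0, 5), (6, 10), (11, 14), (15, 20), (0, 0)]
spans = [{"start": 6, "end": 14, "label": "name"}]  # covers "Jane Doe"

def bio_tags(offsets, spans):
    """Assign a BIO tag to each token offset, or -100 for empty (special/padding) offsets."""
    tags = []
    for start, end in offsets:
        if start == end:
            tags.append(-100)
            continue
        tag = "O"
        for span in spans:
            # The token must lie fully inside the span to be tagged
            if start >= span["start"] and end <= span["end"]:
                prefix = "B" if start == span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tags.append(tag)
    return tags

tags = bio_tags(offsets, spans)
print(tags)
```

Here "Jane" opens the entity and gets `B-name`, "Doe" continues it with `I-name`, and the surrounding words stay `O`.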

In [None]:
import ast

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

# The dataset stores spans as strings, so we must parse them first.
unique_labels = set()

# Each entry is a string, so parse it before indexing into the span dictionaries
for spans_string in raw_datasets["train"]["spans"]:
    # Safely convert string representation "[{'start':...}]" to a Python list
    try:
        spans = ast.literal_eval(spans_string)
        for span in spans:
            unique_labels.add(span["label"])
    except (ValueError, SyntaxError):
        continue

label_list = sorted(unique_labels)
bio_label_list = ["O"]
for label in label_list:
    bio_label_list.append(f"B-{label}")
    bio_label_list.append(f"I-{label}")

id2label = {i: label for i, label in enumerate(bio_label_list)}
label2id = {label: i for i, label in enumerate(bio_label_list)}

print(f"Labels found: {label_list}")

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["text"],
        truncation=True,
        max_length=MAX_LENGTH,
        is_split_into_words=False,
        return_offsets_mapping=True,
        padding="max_length"
    )

    labels = []

    # Iterate over every document in the batch
    for i, offsets in enumerate(tokenized_inputs["offset_mapping"]):
        doc_labels = []

        # Parse the stringified spans for this specific example;
        # fall back to no spans if the string is malformed
        raw_spans = examples["spans"][i]
        try:
            spans = ast.literal_eval(raw_spans)
        except (ValueError, SyntaxError):
            spans = []

        # Sort spans by start index
        spans = sorted(spans, key=lambda x: x["start"])

        for idx, (start, end) in enumerate(offsets):
            # Special tokens (CLS, SEP, PAD)
            if start == end:
                doc_labels.append(-100)
                continue

            token_label = "O"

            # Check if token falls inside a span
            for span in spans:
                if start >= span["start"] and end <= span["end"]:
                    if start == span["start"]:
                        token_label = f"B-{span['label']}"
                    else:
                        token_label = f"I-{span['label']}"
                    break

            doc_labels.append(label2id[token_label])

        labels.append(doc_labels)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)

## 4. Initialize Model & Evaluation Metrics
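Now we load the pre-trained `all-MiniLM-L6-v2` encoder with a Token Classification head on top. The checkpoint ships without such a head, so the classification layer is newly initialized (expect a warning about uninitialized weights) and sized to our BIO label list. We also pass `ignore_mismatched_sizes=True` as a safeguard; it only has an effect when a checkpoint already contains a head of a different shape.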

We also define our `compute_metrics` function, which uses the `seqeval` library. At the end of each epoch it calculates entity-level Precision, Recall, and F1 Score plus token-level Accuracy, skipping every position labeled `-100` (special and padding tokens).
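
To make the `-100` filtering concrete, here is a small sketch on made-up arrays. The label list, the id sequences, and the helper `strip_ignored` are hypothetical; the real function applies the same idea to the model's argmax output.

```python
bio_labels = ["O", "B-name", "I-name"]  # hypothetical label list

# One sequence of predicted ids and its gold ids; -100 marks special/padding positions
predictions = [[0, 1, 2, 0, 0, 0]]
labels = [[-100, 1, 2, 0, -100, -100]]

def strip_ignored(pred_ids, label_ids, names):
    """Keep only positions whose gold label is not -100 and map ids to tag names."""
    pairs = [(p, l) for p, l in zip(pred_ids, label_ids) if l != -100]
    pred_tags = [names[p] for p, _ in pairs]
    true_tags = [names[l] for _, l in pairs]
    return pred_tags, true_tags

pred_tags, true_tags = strip_ignored(predictions[0], labels[0], bio_labels)
print(pred_tags, true_tags)
```

Filtering is keyed on the gold labels rather than the predictions, because the model still emits a prediction at every position, including padding.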

In [None]:
# Load the sentence transformer as a Token Classification model
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_CHECKPOINT,
    num_labels=len(bio_label_list),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True  # Only matters if the checkpoint already has a head of a different shape
)

metric = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=2)

    # Drop positions labeled -100 (special and padding tokens), then map ids to tag names
    true_predictions = [
        [bio_label_list[pred_id] for (pred_id, label_id) in zip(prediction, label) if label_id != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [bio_label_list[label_id] for (pred_id, label_id) in zip(prediction, label) if label_id != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

## 5. Training Setup & Execution
Because cloud environments (like Colab) can disconnect unexpectedly, we'll mount Google Drive and save our checkpoints directly there.

We define our `TrainingArguments` (a learning rate of 2e-5, train and eval batch sizes of 64 per device, and 4 epochs) and initialize the Hugging Face `Trainer`. The cell then checks your Drive folder for existing checkpoints and, if any are found, resumes from the latest one instead of starting over.
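
The `Trainer` writes each checkpoint into a folder named `checkpoint-<step>` inside `output_dir`. The sketch below, using only the standard library, shows one way to locate the most recent such folder; the helper name `latest_checkpoint` and the temporary directory layout are assumptions for illustration. The `transformers.trainer_utils` module provides a `get_last_checkpoint` helper that serves the same purpose.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir):
    """Return the path of the highest-step checkpoint folder, or None if there is none."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append((int(match.group(1)), name))
    if not steps:
        return None
    return os.path.join(output_dir, max(steps)[1])

# Simulate an output directory with two checkpoints and some unrelated entries
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-500", "checkpoint-1500", "runs"):
        os.mkdir(os.path.join(tmp, name))
    open(os.path.join(tmp, "checkpoint_notes.txt"), "w").close()
    found_name = os.path.basename(latest_checkpoint(tmp))

print(found_name)
```

Comparing step numbers as integers matters here: sorted as strings, `checkpoint-500` would rank above `checkpoint-1500`. The resulting path can also be passed explicitly as `resume_from_checkpoint`.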


In [None]:
import os
from google.colab import drive

# 1. Mount Google Drive to save checkpoints persistently
drive.mount('/content/drive')

# Define a path in your Google Drive
# Change 'pii-detector-checkpoints' to whatever folder name you prefer
drive_output_dir = "/content/drive/MyDrive/pii-detector-checkpoints"

# 2. Update Training Arguments
args = TrainingArguments(
    output_dir=drive_output_dir,  # Save DIRECTLY to Google Drive
    eval_strategy="epoch",
    save_strategy="epoch",        # Save a checkpoint at the end of every epoch
    save_total_limit=3,           # Keep at most 3 checkpoints to save Drive space (optional)
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=4,
    weight_decay=0.01,
    push_to_hub=False,
    report_to="none",
    load_best_model_at_end=True,  # At the very end, load the best epoch based on metrics
    metric_for_best_model="f1"    # Use F1 score to determine the "best" model
)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    processing_class=tokenizer,
    compute_metrics=compute_metrics
)

# 3. Logic to Resume Training
# The Trainer saves checkpoints as "checkpoint-<step>" folders, so look for those specifically
files_in_drive = os.listdir(drive_output_dir) if os.path.exists(drive_output_dir) else []
checkpoint_exists = any(
    f.startswith("checkpoint-") and os.path.isdir(os.path.join(drive_output_dir, f))
    for f in files_in_drive
)

if checkpoint_exists:
    print(f"Checkpoints found in {drive_output_dir}. Resuming training...")
    trainer.train(resume_from_checkpoint=True)
else:
    print("No checkpoints found. Starting fresh training...")
    trainer.train()

## 6. Save and Export the Final Model
Once training is complete, we save the final, best-performing model and its tokenizer directly to Google Drive for safekeeping.

Finally, the cell zips the completed model directory and downloads it to your local machine; skip or comment out those last lines if the copy on Drive is enough. From here, you can run inference locally, convert the model to ONNX, or upload the directory to a Hugging Face model repository.
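
A detail of `shutil.make_archive` that is easy to miss: the first argument is the archive name without an extension, and the function returns the full path with `.zip` appended. The sketch below demonstrates this on a throwaway directory; the folder and file names are hypothetical placeholders, not the real model artifacts.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a saved model directory with a single file
    model_dir = os.path.join(tmp, "model")
    os.mkdir(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")

    # base_name has no ".zip" suffix; make_archive appends it and returns the final path
    archive_path = shutil.make_archive(os.path.join(tmp, "model_export"), "zip", model_dir)
    archive_name = os.path.basename(archive_path)

    with zipfile.ZipFile(archive_path) as zf:
        members = zf.namelist()

print(archive_name, members)
```

Because the third argument is used as the archive root, the files sit at the top level of the ZIP rather than under their original absolute path.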

In [None]:
import shutil

# 1. Define final save path (Inside Drive for safety)
final_save_path = "/content/drive/MyDrive/pii_detector_final_model"

# 2. Save the final model artifacts
trainer.save_model(final_save_path)
tokenizer.save_pretrained(final_save_path)

print(f"Model successfully saved to: {final_save_path}")

# Optional: If you still want to download it to your local machine as a ZIP
shutil.make_archive("/content/pii_detector_model", 'zip', final_save_path)
from google.colab import files
files.download("/content/pii_detector_model.zip")