### Installing Necessary Libraries

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
from unsloth import FastLanguageModel  # import unsloth before transformers so its patches apply
import json
from collections import defaultdict

import numpy as np
import torch
import datasets
import transformers
from datasets import Dataset, DatasetDict
from transformers import Trainer, TrainingArguments

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!


### Loading the Base Model and the Tokenizer

In [None]:

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # pass a token when using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.3.19: Fast Mistral patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Loading the LoRA Configuration

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
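
As a rough cross-check of the adapter size, the number of trainable LoRA weights can be derived by hand: each adapted linear layer with shape `in x out` adds `r * (in + out)` parameters. The sketch below assumes the usual Mistral-7B projection shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers); those dimensions come from the public model config, not from this notebook. It also compares the rank-stabilized scaling factor `alpha / sqrt(r)`, which applies when `use_rslora = True`, with the classic `alpha / r`.

```python
import math

# Assumed Mistral-7B (in_features, out_features) for each LoRA target module.
MISTRAL_7B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_param_count(shapes, rank, num_layers):
    """Total LoRA parameters: A is (rank x in), B is (out x rank) per module."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


rank, alpha = 128, 32
total = lora_param_count(MISTRAL_7B_SHAPES, rank, NUM_LAYERS)
classic_scale = alpha / rank            # standard LoRA scaling
rslora_scale = alpha / math.sqrt(rank)  # rank-stabilized LoRA scaling

print(f"trainable LoRA parameters: {total:,}")
print(f"classic scale: {classic_scale}, rsLoRA scale: {rslora_scale:.3f}")
```

Under these assumptions the count comes to 335,544,320, and with `r = 128` the rank-stabilized scale is roughly 2.83 rather than 0.25, so the adapter update is weighted far more heavily than it would be with classic scaling.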


### Preparing the Dataset

In [None]:
# Function to load a JSONL file (one JSON object per line)
def load_jsonl(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        data = [json.loads(line) for line in f if line.strip()]

    # Print a sample entry for debugging
    print("\n SAMPLE ENTRY FOR PII DATASET")
    print(json.dumps(data[7], indent=4))  # Pretty-print the entry at index 7 (id 8)
    return data

# Load data
data = load_jsonl("/content/PII_audios_annotation.jsonl")


 SAMPLE ENTRY FOR PII DATASET
{
    "id": 8,
    "text": " hello, my name is lucas martinez. i'm reaching out regarding my recent tax filing. i received a notice about a discrepancy with my bank account number, which is 688388829776. please send any correspondence to 17101 east colfax avenue, aurora, colorado 80011. i would appreciate your assistance in resolving this matter.  additionally, i would like to request a detailed explanation of the discrepancy to better understand the issue.",
    "label": [
        [
            19,
            33,
            "NAME"
        ],
        [
            162,
            174,
            "BANK-ACCOUNT-NO"
        ],
        [
            210,
            258,
            "ADDRESS"
        ]
    ],
    "Comments": []
}


In [None]:
def convert_labels_to_dict_format(data):
    for item in data:
        new_labels = []
        for label in item["label"]:
            start, end, entity = label
            new_labels.append({
                "start": start,
                "end": end,
                "entity": entity
            })
        item["label"] = new_labels
    return data
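
Because every training target is later sliced out of the text with these `start`/`end` offsets, it can help to sanity-check the spans before building prompts. The helper below is a hypothetical addition (`check_spans` is not part of the original notebook); it assumes the dict-style labels produced above and end-exclusive offsets, as in the sample entry.

```python
def check_spans(record):
    """Return human-readable problems found in one record's label spans."""
    text = record["text"]
    problems = []
    for span in record["label"]:
        start, end, entity = span["start"], span["end"], span["entity"]
        # Offsets must describe a non-empty slice inside the text.
        if not (0 <= start < end <= len(text)):
            problems.append(f"{entity}: span [{start}, {end}) is out of range")
            continue
        snippet = text[start:end]
        # Padded spans usually point to an annotation that is off by one.
        if snippet != snippet.strip():
            problems.append(f"{entity}: span {snippet!r} has surrounding whitespace")
    return problems


sample = {
    "text": " hello, my name is lucas martinez.",
    "label": [
        {"start": 19, "end": 33, "entity": "NAME"},     # clean span
        {"start": 18, "end": 33, "entity": "NAME"},     # leading space
        {"start": 33, "end": 40, "entity": "ADDRESS"},  # runs past the text
    ],
}

problems = check_spans(sample)
for problem in problems:
    print(problem)
```

Running something like `[check_spans(item) for item in data]` after the conversion step would flag records whose annotations need a second look.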

In [None]:
from datasets import Dataset
# Convert list-style labels to dicts
data = convert_labels_to_dict_format(data)

dataset = Dataset.from_list(data)

In [None]:
dataset

Dataset({
    features: ['id', 'text', 'label', 'Comments'],
    num_rows: 1087
})

In [None]:
import json
print(json.dumps(data[0], indent=4))

{
    "id": 1,
    "text": " hello, my name is benjamin carter. i'm contacting you about an issue with my tax return from last year. there seems to be a problem with my bank account number for 873153717, and i believe my social security number 589-90-4308 is incorrect in your records.  i've already attempted to resolve this issue online, but didn't receive a response. additionally, this delay has caused me to miss the filing deadline, which could result in penalties. please verify the information and reach out to me at 416-557-3342. thank you for your help in resolving this matter quickly.",
    "label": [
        {
            "start": 19,
            "end": 34,
            "entity": "NAME"
        },
        {
            "start": 165,
            "end": 174,
            "entity": "BANK-ACCOUNT-NO"
        },
        {
            "start": 216,
            "end": 227,
            "entity": "SSN"
        },
        {
            "start": 497,
            "end": 509,
            "entit

### Prompt for PII Detection

In [None]:
pii_prompt = """### System Instruction:
You are a text extraction model tasked with identifying and extracting **personally identifiable information (PII)** from a given input text. Avoid generating any additional explanations or unrelated text. Only extract the PII entities and their labels.

### Instruction:
Extract the **personally identifiable information (PII)** from the input text. Each entity should be returned on a new line with its label.

The extracted PII entities should include, but are not limited to, the following types:
- NAME
- PHONE-NUMBER
- ADDRESS
- CREDIT-CARD-NO
- BANK-ACCOUNT-NO
- BANK-ROUTING-NO
- SSN

Do not generate or include any extra text. Return only the extracted PII entities with their labels in the following format:
<LABEL>: <ENTITY>

### Example 1:
Input: "Jane Smith has an account number 987654321 and her credit card number is 4111 1111 1111 1111."
Response:
NAME: Jane Smith
BANK-ACCOUNT-NO: 987654321
CREDIT-CARD-NO: 4111 1111 1111 1111

### Example 2:
Input: "Mary Johnson's SSN is 123-45-6789, and her phone number is (800) 555-9876."
Response:
NAME: Mary Johnson
SSN: 123-45-6789
PHONE-NUMBER: (800) 555-9876

### Text:
{}

### Entities:
{}"""


In [None]:
def formatting_prompts_func(examples):
    prompts = []
    EOS_TOKEN = tokenizer.eos_token

    for text, labels in zip(examples["text"], examples["label"]):
        entities = []
        for entity_dict in labels:
            start = entity_dict["start"]
            end = entity_dict["end"]
            entity_type = entity_dict["entity"]
            entity = text[start:end]
            entities.append(f"{entity_type}: {entity.strip()}")  # Label first

        entity_output = "\n".join(entities)
        prompt = pii_prompt.format(text, entity_output) + EOS_TOKEN
        prompts.append(prompt)

    return {"prompt": prompts}

In [None]:
dataset = dataset.map(formatting_prompts_func, batched=True)


Map:   0%|          | 0/1087 [00:00<?, ? examples/s]

In [None]:
# Drop the label and Comments columns, which are no longer needed
dataset = dataset.remove_columns(["label", "Comments"])


In [None]:
# Split into train and test sets (90% train, 10% test) with the Dataset's built-in method
split_dataset = dataset.train_test_split(test_size=0.1)

# Now you have:
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

In [None]:
from transformers import DataCollatorForLanguageModeling
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # Your dataset
    dataset_text_field="prompt",  # Your dataset field name
    max_seq_length=max_seq_length,
    dataset_num_proc=8,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    args=UnslothTrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,
        warmup_ratio=0.1,
        num_train_epochs=1,
        learning_rate=5e-5,
        embedding_learning_rate=5e-6,
        fp16=not is_bfloat16_supported(),  # Use fp16 only when bf16 is unavailable (e.g. Tesla T4)
        bf16=is_bfloat16_supported(),  # Prefer bf16 on Ampere or newer GPUs
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.00,
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Disable reporting to external loggers
    ),
)


Unsloth: Tokenizing ["prompt"] (num_proc=8):   0%|          | 0/978 [00:00<?, ? examples/s]

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 978 | Num Epochs = 1 | Total steps = 15
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 8 x 1) = 64
 "-____-"     Trainable parameters = 335,544,320/7,000,000,000 (4.79% trained)


Unsloth: Will smartly offload gradients to save VRAM!
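
The step count in the banner follows from the batch settings. The sketch below reproduces that arithmetic under one assumption: that the Trainer version used here floors the number of optimizer updates per epoch (micro-batches divided by `gradient_accumulation_steps`), which is consistent with the 15 steps reported above. `optimizer_steps` is an illustrative helper, not a library function.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_gpus=1):
    """Estimate optimizer updates for a run (assumes floored updates per epoch)."""
    # Micro-batches the dataloader yields per epoch; the last partial batch is kept.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # Assumption: leftover micro-batches do not count as an extra update.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


effective_batch = 8 * 8 * 1             # per-device batch x accumulation x GPUs
steps = optimizer_steps(978, 8, 8)      # 978 training examples after the 90/10 split
warmup_steps = math.ceil(0.1 * steps)   # warmup_ratio = 0.1

print(effective_batch, steps, warmup_steps)
```

With only about 15 updates and 2 of them spent on warmup, a single epoch gives the cosine schedule very little room, which is worth keeping in mind when reading the loss curve.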


Step,Training Loss
1,1.2984
2,1.2839
3,0.8816
4,0.5514
5,0.4048
6,0.3428
7,0.3665
8,0.366
9,0.3077
10,0.3093


### Inference

In [None]:
def extract_pii(example_text):
    try:
        FastLanguageModel.for_inference(model)

        prompt = pii_prompt.format(example_text, "") + '\n'

        inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=False,  # greedy decoding; temperature has no effect here, so it is not set
            use_cache=True,
            pad_token_id=tokenizer.pad_token_id
        )

        decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        result = decoded.split("### Entities:")[-1].strip()

        pii_entities = []
        for line in result.splitlines():
            if ": " in line:
                label, entity = line.split(": ", 1)
                pii_entities.append((label.strip(), entity.strip()))

        return pii_entities

    except Exception as e:
        print("Exception occurred:", e)
        return []
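
The parsing step can be tested without a GPU by feeding it a hand-written completion. The sketch below is a slightly stricter variant of the loop above, offered as an assumption about what is useful rather than as the notebook's method: it skips lines with an empty label or entity and drops exact duplicates. `parse_entities` and the sample string are illustrative only.

```python
def parse_entities(generated):
    """Parse 'LABEL: entity' lines that follow the final '### Entities:' marker."""
    section = generated.split("### Entities:")[-1]
    seen, pairs = set(), []
    for line in section.splitlines():
        label, sep, entity = line.partition(":")
        label, entity = label.strip(), entity.strip()
        # Skip lines without a colon, and lines missing a label or a value.
        if not (sep and label and entity):
            continue
        if (label, entity) not in seen:
            seen.add((label, entity))
            pairs.append((label, entity))
    return pairs


raw = (
    "### Text:\nHi, I'm Mia Thompson, call 727-814-3902.\n\n"
    "### Entities:\n\n"
    "NAME: Mia Thompson\n"
    "PHONE-NO: 727-814-3902\n"
    "NAME: Mia Thompson\n"
    "SSN:\n"
    "thanks for asking\n"
)

parsed = parse_entities(raw)
print(parsed)
```

Dropping empty values matters downstream, because an empty entity passed to a regex-based redactor matches at every position in the text.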

In [None]:
import re

def redact_text(example_text, pii_entities):
    redacted_text = example_text
    # Replace longer entities first so a shorter entity cannot break up a longer match
    for label, entity in sorted(pii_entities, key=lambda x: -len(x[1])):
        if not entity:
            continue  # an empty pattern would match at every position
        pattern = re.escape(entity)
        replacement = f"[{label}]"
        redacted_text = re.sub(pattern, replacement, redacted_text)
    return redacted_text
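
To see why the replacement order matters, consider an account number that also appears as the prefix of a card number. The following self-contained sketch (with a hypothetical `redact` helper and made-up values) runs the same replacement loop in both orders.

```python
import re


def redact(text, entities, longest_first=True):
    """Replace each entity with its [LABEL]; order is controlled by entity length."""
    ordered = sorted(entities, key=lambda pair: len(pair[1]), reverse=longest_first)
    for label, entity in ordered:
        if not entity:
            continue  # an empty pattern would match everywhere
        text = re.sub(re.escape(entity), f"[{label}]", text)
    return text


text = "Card 4111 1111 1111 1111, account 4111."
entities = [
    ("BANK-ACCOUNT-NO", "4111"),
    ("CREDIT-CARD-NO", "4111 1111 1111 1111"),
]

good = redact(text, entities, longest_first=True)
bad = redact(text, entities, longest_first=False)

print(good)
print(bad)
```

Shortest-first replacement rewrites the start of the card number as an account number, so the full card pattern no longer matches and most of its digits stay in the output.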


### For a Single Example

In [None]:
text = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Step 1: Extract PII
entities = extract_pii(text)

# Step 2: Print Extracted Entities
if entities:
    print("\n=== Extracted Entities ===")
    for label, entity in entities:
        print(f"{label}: {entity}")
else:
    print("\nNo PII entities were found.")

# Step 3: Redact PII from Text (Label-Based)
redacted = redact_text(text, entities)

# Step 4: Print Redacted Text
print("\n=== Redacted Text ===")
print(redacted)


=== Extracted Entities ===
NAME: Mia Thompson
BANK-ACCOUNT-NO: 4893172051
BANK-ROUTING-NO: 192847561
PHONE-NO: 727-814-3902

=== Redacted Text ===
Hi, I’m [NAME]. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number [BANK-ACCOUNT-NO] linked with routing number [BANK-ROUTING-NO]. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at [PHONE-NO] if more information is needed.


### For a Batch

In [None]:
examples = [
    "Hello, my name is Daniel Brooks. I’m reaching out regarding an issue with my mortgage payment. I recently made a payment using my bank account 3857194621, but I received a notification that the payment was declined due to an account mismatch. I believe there’s an error in how my details were processed. This delay is concerning as it might result in late fees. Please verify my details and reach out to me at 312-674-2251 as soon as possible. Thank you for your assistance.",

    "Hi, this is Olivia Carter. I’ve been experiencing a service disruption with my electricity at 217 Maple Street, San Diego, CA 92103. Despite making my monthly payment on time, my service was unexpectedly disconnected. I tried reaching customer support, but I haven’t been able to get through. This is causing significant inconvenience for my household. Could you please check the status of my payment and restore the service at the earliest?",

    "Hi, I’m Jonathan Reed. I recently noticed an unauthorized withdrawal from my checking account 2609-44753-6712-8953 that I did not authorize. I haven’t lost my card or shared my account details with anyone. I contacted my bank, but they advised me to reach out to you for further investigation. Could you please look into this issue and put a temporary hold on my account to prevent further transactions? Please call me at 845-372-1098 for updates. Thank you for your time.",

    "Hi, this is Laura Smith. I recently received a notification about a discrepancy in my Social Security details linked to my employment records. My Social Security Number 6521-857-3251 appears to be mismatched in your system, which could potentially affect my payroll and benefits. I’d appreciate it if someone could verify and correct this issue as soon as possible. Please mail the confirmation to 1295 West Grove Ave, Springfield, IL 62704. Thank you for your help.",

    "Hello, I’m Ethan Parker. I have a billing issue with my internet service. My service address is 5521 Walnut Drive, Austin, TX 7845. I recently paid my bill using my credit card 5123-4568-9875-2145, but it seems the payment wasn’t processed correctly, and I’ve been charged twice for the same billing cycle. Could you please review my account and refund the extra charge? You can contact me at 678-214-8895 for any further information.",

    "hi, i'm william harris. i'm having an issue with my auto insurance. i filed a claim but it hasn't been processed yet. my account number is 43786152 and my driver's license number is 705-70-0962. would you look into this? you can reach me at 541-273-9084.  and my address is 9844 parkview street, portland, oregon 97223. thank you for your help.",

    "hello, this is liam wright. i recently paid my rent, but it hasn't been reflected in my account. my account number is 3f90-283-7465 and the payment was $1,200. my driver's license number is y9349074. can you verify the payment status? my contact number is 310-539-743644663322445.  7291 and my address is 8317 oak avenue, los angeles, california 9001. thank you for your assistance.",


    "hello, this is jackson smith. i'm contacting you about an issue with my student loan account. my account number is 850-192-3821 and i recently made a payment of $1000 but it hasn't been applied to my balance. can you please investigate? my passport number is b98765432.  please contact me at 818-523-9074 or send mail to 1937 beverly bollywood, los angeles, ca 90057. thank you for your help.",


    "hi, i'm matthew hall. i'm writing regarding a recent charge on my bank account 158-039-6728 for $3.50 which i do not recognize. i believe my account might have been compromised. my address is 1421 oakwood drive, denver, colorado 80239. please investigate this matter and contact me at 303-763-0491. thank you for your attention to this",


    "hi, i am lucas clark. i recently made a payment of $1,500 on my loan, but the payment hasn't been reflected. my bank routing number is 894-73-46543 and my passport number is l12345678. can you check the status? my contact number is 404-438-2573.  and my address is 21349 atlas, georgia 30303. thank you."

]



In [None]:
# Process each example in batch
for idx, text in enumerate(examples, start=1):
    print(f"\n=== Processing Example {idx} ===")

    # Step 1: Extract PII
    entities = extract_pii(text)

    # Step 2: Print Extracted Entities
    if entities:
        print("Extracted Entities:")
        for label, entity in entities:
            print(f"{label}: {entity}")
    else:
        print("No PII entities found.")

    # Step 3: Redact Text
    redacted = redact_text(text, entities)

    # Step 4: Print Redacted Text
    print("Redacted Text:", redacted)


=== Processing Example 1 ===
Extracted Entities:
NAME: Daniel Brooks
BANK-ACCOUNT-NO: 3857194621
PHONE-NO: 312-674-2251
Redacted Text: Hello, my name is [NAME]. I’m reaching out regarding an issue with my mortgage payment. I recently made a payment using my bank account [BANK-ACCOUNT-NO], but I received a notification that the payment was declined due to an account mismatch. I believe there’s an error in how my details were processed. This delay is concerning as it might result in late fees. Please verify my details and reach out to me at [PHONE-NO] as soon as possible. Thank you for your assistance.

=== Processing Example 2 ===
Extracted Entities:
NAME: Olivia Carter
ADDRESS: 217 Maple Street, San Diego, CA 92103
Redacted Text: Hi, this is [NAME]. I’ve been experiencing a service disruption with my electricity at [ADDRESS]. Despite making my monthly payment on time, my service was unexpectedly disconnected. I tried reaching customer support, but I haven’t been able to get through. T

### Saving to GGUF

In [None]:
from huggingface_hub import login

login(token="your hugging face token")  # placeholder: use your own token and keep it out of shared notebooks

In [None]:
model.push_to_hub_gguf("AI-Enthusiast11/mistral-7b-4bit-PII-Entity-Extractor", tokenizer, quantization_method = "q4_k_m")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 4.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.24 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 44%|████▍     | 14/32 [00:01<00:01, 11.44it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [03:46<00:00,  7.08s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving AI-Enthusiast11/mistral-7b-4bit-pii-entity-extractor/pytorch_model-00001-of-00003.bin...
Unsloth: Saving AI-Enthusiast11/mistral-7b-4bit-pii-entity-extractor/pytorch_model-00002-of-00003.bin...
Unsloth: Saving AI-Enthusiast11/mistral-7b-4bit-pii-entity-extractor/pytorch_model-00003-of-00003.bin...
Done.


Unsloth: Converting mistral model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at AI-Enthusiast11/mistral-7b-4bit-pii-entity-extractor into f16 GGUF format.
The output location will be /content/AI-Enthusiast11/mistral-7b-4bit-pii-entity-extractor/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: mistral-7b-4bit-pii-entity-extractor
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading 

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/AI-Enthusiast11/mistral-7b-4bit-pii-entity-extractor
