## Zero-Shot PII Masking Using a Large Language Model (LLM)

### Install Libraries
This cell installs the necessary Python libraries:
- `google-generativeai`: For using the Gemini API.
- `scikit-learn`: For calculating evaluation metrics.
- `requests`: For making HTTP requests.
- `openai`: For OpenAI API (if needed).
- `gdown`: For downloading files from Google Drive.
- `datasets`: For loading and processing datasets.

In [1]:
!pip install google-generativeai scikit-learn requests openai gdown datasets

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.4.1-py3-none-any.whl (487 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m487.4/487.4 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading xx

### Import Libraries
This cell imports the required libraries and modules:
- `google.generativeai`: For interacting with the Gemini API.
- `time`: For adding delays between API calls.
- `gdown`: For downloading files from Google Drive.
- `sklearn.metrics`: For calculating precision, recall, F1-score, and accuracy.
- `datasets`: For loading datasets.

In [2]:
import google.generativeai as genai
import time
import gdown
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from datasets import load_dataset


### Configure Gemini API
This cell sets up the Gemini API using your API key. It initializes the `gemini-1.5-pro` model for generating responses.

In [16]:
genai.configure(api_key="AIzaSyCxrDjE6ODeQP100zQrFBmbPGEHOTTxwBc")
# new3 = AIzaSyDwdA9LqXeCMjwGlUlWjgLyfQo6FBXDM1s
# new4 = AIzaSyCxrDjE6ODeQP100zQrFBmbPGEHOTTxwBc
model = genai.GenerativeModel("gemini-1.5-pro")

### Define Functions
This cell defines the following functions:
- `create_prompt()`: Creates a prompt for the Gemini model to mask PII.
- `mask_pii()`: Sends the prompt to the Gemini API and returns the masked text.
- `evaluate()`: Compares the model's output with ground truth to calculate metrics like accuracy, precision, recall, F1-score, FPR, and FNR.

In [4]:
def create_prompt(text):
    return f"""
    Identify and mask all Personally Identifiable Information (PII) in the following text. Use the following format:
    - Names: [REDACTED_NAME]
    - Emails: [REDACTED_EMAIL]

    Text: "{text}"
    """


In [5]:
def mask_pii(text):
    prompt = create_prompt(text)
    response = model.generate_content(prompt)
    return response.text.strip()


In [6]:
def tokenize(text):
    return text.split()

In [7]:
def evaluate(ground_truth_tokens, llm_output_tokens):
    ground_truth_binary = [1 if tag != "O" else 0 for tag in ground_truth_tokens]
    predicted_binary = [1 if token.startswith("[REDACTED") else 0 for token in llm_output_tokens]

    TP = sum((gt == 1 and pred == 1) for gt, pred in zip(ground_truth_binary, predicted_binary))
    FP = sum((gt == 0 and pred == 1) for gt, pred in zip(ground_truth_binary, predicted_binary))
    FN = sum((gt == 1 and pred == 0) for gt, pred in zip(ground_truth_binary, predicted_binary))
    TN = sum((gt == 0 and pred == 0) for gt, pred in zip(ground_truth_binary, predicted_binary))

    accuracy = (TP + TN) / (TP + TN + FP + FN) if (TP + TN + FP + FN) != 0 else 0
    precision = TP / (TP + FP) if (TP + FP) != 0 else 0
    recall = TP / (TP + FN) if (TP + FN) != 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) != 0 else 0
    FPR = FP / (FP + TN) if (FP + TN) != 0 else 0
    FNR = FN / (FN + TP) if (FN + TP) != 0 else 0

    return accuracy, precision, recall, f1, FPR, FNR

### Download and Load Datasets
This cell downloads and loads two datasets:
1. **Independent Test Data**: A JSON file containing test data.
2. **Synthetic Test Data**: A CSV file with synthetic email addresses added.

In [8]:

# https://drive.google.com/file/d/1E2FjYFDGEeXTwpabkC0aYzZV8aOQqf_h/view?usp=sharing
independent_test_data_file_id = "1E2FjYFDGEeXTwpabkC0aYzZV8aOQqf_h"

gdown.download(f"https://drive.google.com/uc?id={independent_test_data_file_id}", "test_data.json", quiet=False)
DATA_FILES = {"test_data": "test_data.json"}
dataset = load_dataset("json", data_files=DATA_FILES)

Downloading...
From: https://drive.google.com/uc?id=1E2FjYFDGEeXTwpabkC0aYzZV8aOQqf_h
To: /content/test_data.json
100%|██████████| 4.19M/4.19M [00:00<00:00, 149MB/s]


Generating test_data split: 0 examples [00:00, ? examples/s]

In [9]:
test_data = dataset["test_data"]
print(type(test_data))
print(test_data[:2])

<class 'datasets.arrow_dataset.Dataset'>
{'lang': ['en', 'en'], 'ner_tags': [['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']], 'sequence': ['included future Rage Against the Machine and Audioslave drummer Brad Wilk .', 'The city voted 53.5 percent in favor of the marijuana legalization measure , which , as then-mayor John Hickenlooper pointed out , was without effect , because the city cannot usurp state law , which at that time treated marijuana possession in much the same way as a speeding ticket , with fines of up to $ 100 and no jail time .'], 'tokens': [['included', 'future', 'Rage', 'Against', 'the', 'Machine', 'and', 'Audioslave', 'drummer

In [10]:
print("size of dataset: ",len(test_data))

size of dataset:  3650


### Import Additional Libraries
This cell imports more libraries for advanced functionality:
- `random`: For generating random numbers (used in exponential backoff).
- `numpy`: For numerical calculations (e.g., averaging metrics).
- `pandas`: For data manipulation.
- `ast`: For safely evaluating strings as Python objects.
- `google.api_core.exceptions`: For handling API errors.
- `collections.defaultdict`: For caching API responses.

### Set Configuration
This cell defines configuration parameters:
- `batch_size`: Number of samples processed in each batch.
- `max_retries`: Maximum number of retries for API calls.
- `base_wait_time`: Initial delay between retries.
- `delay_seconds`: Delay between batches to avoid rate limits.
- `size_of_dataset`: Number of samples to process from the dataset.


### Initialize Metric Storage
This cell initializes lists to store evaluation metrics:
- Accuracy, Precision, Recall, F1-Score, FPR, and FNR.


### Main Processing Loop
This cell processes the dataset in batches:
1. Fetches a batch of data.
2. Checks the cache to avoid redundant API calls.
3. Sends the text to the Gemini API for masking.
4. Evaluates the model's performance using the `evaluate()` function.
5. Stores the metrics for each batch.

### Aggregate Metrics
This cell calculates the overall performance metrics by averaging the results from all batches.



### Doing evaluation on test dataset (without synthetic emails)

In [17]:
import time
import random
import numpy as np
from google.api_core.exceptions import TooManyRequests
from collections import defaultdict

batch_size = 1
max_retries = 2
base_wait_time = 5
delay_seconds = 30

all_accuracies = []
all_precisions = []
all_recalls = []
all_f1_scores = []
all_fprs = []
all_fnrs = []

size_of_dataset = 30 #change according to your need in range of [0, 3650]
test_data = dataset["test_data"].select(range(size_of_dataset))

cache = defaultdict(str)

def exponential_backoff(attempt):
    jitter = random.uniform(1, 3)
    return base_wait_time * (2 ** attempt) + jitter

for i in range(0, len(test_data), batch_size):
    batch = [test_data[j] for j in range(i, min(i + batch_size, len(test_data)))]

    batch_texts = [item["sequence"] for item in batch]
    cached_results = {text: cache[text] for text in batch_texts if text in cache}
    texts_to_query = [text for text in batch_texts if text not in cached_results]

    attempt = 0
    while attempt < max_retries:
        try:
            if texts_to_query:
                masked_texts = [mask_pii(text) for text in texts_to_query]
                for text, masked in zip(texts_to_query, masked_texts):
                    cache[text] = masked
            break
        except TooManyRequests as e:
            retry_after = int(e.response.headers.get("Retry-After", base_wait_time))
            wait_time = max(retry_after, exponential_backoff(attempt))
            print(f"Rate limit exceeded. Retrying in {wait_time:.2f} seconds...")
            time.sleep(wait_time)
            attempt += 1

    if attempt == max_retries:
        print("Max retries reached. Skipping batch.")
        continue

    final_masked_texts = [cache[text] for text in batch_texts]

    for text, masked_text, ner_tags in zip(batch_texts, final_masked_texts, [item["ner_tags"] for item in batch]):
        print(f"Original Text: {text}")
        print(f"Masked Text: {masked_text}")

        llm_output_tokens = tokenize(masked_text)

        accuracy, precision, recall, f1, FPR, FNR = evaluate(ner_tags, llm_output_tokens)

        all_accuracies.append(accuracy)
        all_precisions.append(precision)
        all_recalls.append(recall)
        all_f1_scores.append(f1)
        all_fprs.append(FPR)
        all_fnrs.append(FNR)

        print(f"Accuracy: {accuracy:.2f}")
        print(f"Precision: {precision:.2f}")
        print(f"Recall: {recall:.2f}")
        print(f"F1-Score: {f1:.2f}")
        print(f"False Positive Rate (FPR): {FPR:.2f}")
        print(f"False Negative Rate (FNR): {FNR:.2f}")
        print("-" * 50)

    time.sleep(delay_seconds)


overall_accuracy = np.mean(all_accuracies) if all_accuracies else 0
overall_precision = np.mean(all_precisions) if all_precisions else 0
overall_recall = np.mean(all_recalls) if all_recalls else 0
overall_f1 = np.mean(all_f1_scores) if all_f1_scores else 0
overall_FPR = np.mean(all_fprs) if all_fprs else 0
overall_FNR = np.mean(all_fnrs) if all_fnrs else 0

print("\n=== Overall Dataset Results ===")
print(f"Overall Accuracy: {overall_accuracy:.2f}")
print(f"Overall Precision: {overall_precision:.2f}")
print(f"Overall Recall: {overall_recall:.2f}")
print(f"Overall F1-Score: {overall_f1:.2f}")
print(f"Overall False Positive Rate (FPR): {overall_FPR:.2f}")
print(f"Overall False Negative Rate (FNR): {overall_FNR:.2f}")
print("==============================")



Original Text: included future Rage Against the Machine and Audioslave drummer Brad Wilk .
Masked Text: "included future Rage Against the Machine and Audioslave drummer [REDACTED_NAME] ."
Accuracy: 0.91
Precision: 1.00
Recall: 0.50
F1-Score: 0.67
False Positive Rate (FPR): 0.00
False Negative Rate (FNR): 0.50
--------------------------------------------------
Original Text: The city voted 53.5 percent in favor of the marijuana legalization measure , which , as then-mayor John Hickenlooper pointed out , was without effect , because the city cannot usurp state law , which at that time treated marijuana possession in much the same way as a speeding ticket , with fines of up to $ 100 and no jail time .
Masked Text: The city voted 53.5 percent in favor of the marijuana legalization measure, which, as then-mayor [REDACTED_NAME] pointed out, was without effect, because the city cannot usurp state law, which at that time treated marijuana possession in much the same way as a speeding ticket, w

### Doing evaluation on synthetic dataset  (dataset with append email)

In [11]:
# https://drive.google.com/file/d/1-1v9MghJ6XnGDdlKaD4h-se1ZNfk6hYV/view?usp=sharing
synthetic_independent_test_data_file_id = "1-1v9MghJ6XnGDdlKaD4h-se1ZNfk6hYV"

gdown.download(f"https://drive.google.com/uc?id={synthetic_independent_test_data_file_id}", "synthetic_test_data.csv", quiet=False)
DATA_FILES = {"synthetic_test_data": "synthetic_test_data.csv"}
synthetic_test_dataset = load_dataset("csv", data_files=DATA_FILES)

Downloading...
From: https://drive.google.com/uc?id=1-1v9MghJ6XnGDdlKaD4h-se1ZNfk6hYV
To: /content/synthetic_test_data.csv
100%|██████████| 18.2M/18.2M [00:00<00:00, 103MB/s] 


Generating synthetic_test_data split: 0 examples [00:00, ? examples/s]

In [14]:
import time
import random
import numpy as np
import pandas as pd
import ast
from google.api_core.exceptions import TooManyRequests
from collections import defaultdict

batch_size = 1
max_retries = 2
base_wait_time = 5
delay_seconds = 35
size_of_dataset = 30  # Adjust as needed (max depends on your dataset size)

test_data = synthetic_test_dataset["synthetic_test_data"].select(range(size_of_dataset))

all_accuracies = []
all_precisions = []
all_recalls = []
all_f1_scores = []
all_fprs = []
all_fnrs = []

cache = defaultdict(str)

def exponential_backoff(attempt):
    jitter = random.uniform(1, 3)
    return base_wait_time * (2 ** attempt) + jitter

for i in range(0, len(test_data), batch_size):
    # Use .select() to slice the dataset
    batch_indices = list(range(i, min(i + batch_size, len(test_data))))
    batch = test_data.select(batch_indices)

    batch_texts = [item["sequence"] for item in batch]
    cached_results = {text: cache[text] for text in batch_texts if text in cache}
    texts_to_query = [text for text in batch_texts if text not in cached_results]

    attempt = 0
    while attempt < max_retries:
        try:
            if texts_to_query:
                # Assuming mask_pii() is your PII masking function
                masked_texts = [mask_pii(text) for text in texts_to_query]
                for text, masked in zip(texts_to_query, masked_texts):
                    cache[text] = masked
            break
        except TooManyRequests as e:
            retry_after = int(e.response.headers.get("Retry-After", base_wait_time))
            wait_time = max(retry_after, exponential_backoff(attempt))
            print(f"Rate limit exceeded. Retrying in {wait_time:.2f} seconds...")
            time.sleep(wait_time)
            attempt += 1

    if attempt == max_retries:
        print("Max retries reached. Skipping batch.")
        continue

    final_masked_texts = [cache[text] for text in batch_texts]

    for text, masked_text, ner_tags in zip(batch_texts, final_masked_texts, [item["ner_tags"] for item in batch]):
        print(f"Original Text: {text}")
        print(f"Masked Text: {masked_text}")

        llm_output_tokens = tokenize(masked_text)

        accuracy, precision, recall, f1, FPR, FNR = evaluate(ner_tags, llm_output_tokens)

        all_accuracies.append(accuracy)
        all_precisions.append(precision)
        all_recalls.append(recall)
        all_f1_scores.append(f1)
        all_fprs.append(FPR)
        all_fnrs.append(FNR)

        print(f"Accuracy: {accuracy:.2f}")
        print(f"Precision: {precision:.2f}")
        print(f"Recall: {recall:.2f}")
        print(f"F1-Score: {f1:.2f}")
        print(f"False Positive Rate (FPR): {FPR:.2f}")
        print(f"False Negative Rate (FNR): {FNR:.2f}")
        print("-" * 50)

    time.sleep(delay_seconds)

overall_accuracy = np.mean(all_accuracies) if all_accuracies else 0
overall_precision = np.mean(all_precisions) if all_precisions else 0
overall_recall = np.mean(all_recalls) if all_recalls else 0
overall_f1 = np.mean(all_f1_scores) if all_f1_scores else 0
overall_FPR = np.mean(all_fprs) if all_fprs else 0
overall_FNR = np.mean(all_fnrs) if all_fnrs else 0

print("\n=== Overall Dataset Results ===")
print(f"Overall Accuracy: {overall_accuracy:.2f}")
print(f"Overall Precision: {overall_precision:.2f}")
print(f"Overall Recall: {overall_recall:.2f}")
print(f"Overall F1-Score: {overall_f1:.2f}")
print(f"Overall False Positive Rate (FPR): {overall_FPR:.2f}")
print(f"Overall False Negative Rate (FNR): {overall_FNR:.2f}")
print("==============================")

Original Text: Since then , only Terry Bradshaw in 147 games , Joe Montana in 139 games , and Tom Brady in 131 games have reached 100 wins more quickly . terry.bradshaw81@fastnu.edu.pk joe.montana96@outlook.com tom.brady11@outlook.com
Masked Text: Since then, only [REDACTED_NAME] in 147 games, [REDACTED_NAME] in 139 games, and [REDACTED_NAME] in 131 games have reached 100 wins more quickly. [REDACTED_EMAIL] [REDACTED_EMAIL] [REDACTED_EMAIL]
Accuracy: 0.24
Precision: 0.67
Recall: 0.19
F1-Score: 0.30
False Positive Rate (FPR): 0.50
False Negative Rate (FNR): 0.81
--------------------------------------------------
Original Text: He was portrayed by Anthony Perkins in the 1960 version of " Psycho " directed by Alfred Hitchcock and the " Psycho " franchise . anthony.perkins17@outlook.com alfred.hitchcock29@fastnu.edu.pk
Masked Text: He was portrayed by [REDACTED_NAME] in the 1960 version of " Psycho " directed by [REDACTED_NAME] and the " Psycho " franchise . [REDACTED_EMAIL] [REDACTED_EMAI