### Install Required Libraries
This cell installs all the necessary Python libraries for the notebook:
- `datasets`: For loading and processing datasets.
- `transformers`: For using pre-trained models and tokenizers.
- `accelerate`: Required by the `transformers` `Trainer` for PyTorch training.
- `seqeval`: For evaluating NER models.
- `gdown`: For downloading files from Google Drive.
- `pandas`, `scikit-learn`: For data manipulation and evaluation.
- `torch`: For PyTorch-based model training.
- `openai`: For interacting with OpenAI's API (if needed).

In [None]:
!pip install datasets
!pip install -U accelerate
!pip install -U transformers
!pip install seqeval
!pip install gdown
!pip install pandas scikit-learn
!pip install torch
!pip install openai

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.4.1-py3-none-any.whl (487 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
Downloading x

### Import Libraries
This cell imports all the required libraries and modules:
- `transformers`: For model training, tokenization, and evaluation.
- `datasets`: For loading and processing datasets.
- `ast`: For safely evaluating strings as Python objects.
- `gdown`: For downloading files from Google Drive.
- `pandas`: For data manipulation.
- `torch`: For PyTorch-based model training.
- `openai`: For interacting with OpenAI's API (if needed).
- `google.colab.drive`: For mounting Google Drive (imported in the mount cell that follows).

In [None]:
from transformers import (
    AutoModelForTokenClassification, AutoTokenizer, DataCollatorForTokenClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset
import ast
import gdown
import pandas as pd
import torch
import openai

### Mount Google Drive

Mount your Google Drive so that the datasets and the trained model can be saved to it.

Note: If you want to run the code locally, update the file paths accordingly for loading and saving datasets and models.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Download Dataset Files
This cell downloads the training, validation, and test datasets from Google Drive using `gdown`. The datasets are saved as CSV files:
- `train.csv`: Training dataset.
- `val.csv`: Validation dataset.
- `test.csv`: Test dataset.

### Load Tokenizer
This cell loads the tokenizer for the `bert-base-cased` model from Hugging Face's `transformers` library. The tokenizer is used to preprocess text data for the model.

### Load Dataset
This cell loads the training and validation datasets from the CSV files using the `datasets` library. The dataset is stored in a `DatasetDict` object.



In [None]:
# https://drive.google.com/file/d/14RDeg4gRMhAzxgb3oB8uJ5-JT2w24_Tp/view?usp=sharing
train_file_id = "14RDeg4gRMhAzxgb3oB8uJ5-JT2w24_Tp"
# https://drive.google.com/file/d/15BOK8cly_iY3ywGPrwaqmGBhAGDlwYqR/view?usp=sharing
val_file_id = "15BOK8cly_iY3ywGPrwaqmGBhAGDlwYqR"
# https://drive.google.com/file/d/1EUmyd3w0lVIG-4rECFMqPIL4tjGemtIL/view?usp=sharing
test_file_id = "1EUmyd3w0lVIG-4rECFMqPIL4tjGemtIL"

gdown.download(f"https://drive.google.com/uc?id={train_file_id}", "train.csv", quiet=False)
gdown.download(f"https://drive.google.com/uc?id={val_file_id}", "val.csv", quiet=False)
gdown.download(f"https://drive.google.com/uc?id={test_file_id}", "test.csv", quiet=False)

MODEL_NAME = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

DATA_FILES = {"train": "train.csv", "val": "val.csv"}
dataset = load_dataset("csv", data_files=DATA_FILES)


Downloading...
From: https://drive.google.com/uc?id=14RDeg4gRMhAzxgb3oB8uJ5-JT2w24_Tp
To: /content/train.csv
100%|██████████| 14.6M/14.6M [00:00<00:00, 41.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=15BOK8cly_iY3ywGPrwaqmGBhAGDlwYqR
To: /content/val.csv
100%|██████████| 1.82M/1.82M [00:00<00:00, 116MB/s]
Downloading...
From: https://drive.google.com/uc?id=1EUmyd3w0lVIG-4rECFMqPIL4tjGemtIL
To: /content/test.csv
100%|██████████| 1.83M/1.83M [00:00<00:00, 188MB/s]
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating val split: 0 examples [00:00, ? examples/s]

### Define Label Mappings
This cell defines the list of NER labels (`LABEL_LIST`) and creates mappings between labels and their corresponding IDs (`label2id` and `id2label`).

In [None]:
LABEL_LIST = ["O", "B-PER", "I-PER", "B-EMAIL", "I-EMAIL"]
label2id = {label: i for i, label in enumerate(LABEL_LIST)}
id2label = {i: label for label, i in label2id.items()}
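
As a quick sanity check, the two mappings can be round-tripped on a tag sequence. The sample tags below are made up for illustration; the unknown tag `B-ORG` falls back to `O`, mirroring the `label2id.get(tag, label2id["O"])` lookup used in the tokenization cell.

```python
LABEL_LIST = ["O", "B-PER", "I-PER", "B-EMAIL", "I-EMAIL"]
label2id = {label: i for i, label in enumerate(LABEL_LIST)}
id2label = {i: label for label, i in label2id.items()}

# Hypothetical tag sequence; "B-ORG" is deliberately not in LABEL_LIST.
sample_tags = ["B-PER", "I-PER", "O", "B-ORG", "B-EMAIL"]

# Encode tags to IDs, mapping unknown tags to "O".
encoded = [label2id.get(tag, label2id["O"]) for tag in sample_tags]
# Decode IDs back to tag strings.
decoded = [id2label[i] for i in encoded]

print(encoded)  # [1, 2, 0, 0, 3]
print(decoded)  # ['B-PER', 'I-PER', 'O', 'O', 'B-EMAIL']
```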


### Convert String Columns to Lists
This cell converts the `tokens` and `ner_tags` columns from strings to Python lists using `ast.literal_eval`. This is necessary because the CSV files store these columns as strings.

In [None]:
def convert_str_to_list(example):
    example["tokens"] = ast.literal_eval(example["tokens"])
    example["ner_tags"] = ast.literal_eval(example["ner_tags"])
    return example

dataset = dataset.map(convert_str_to_list)

Map:   0%|          | 0/22812 [00:00<?, ? examples/s]

Map:   0%|          | 0/2852 [00:00<?, ? examples/s]
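
To illustrate what the conversion does to a single row, here is a minimal sketch with a made-up row (the values are hypothetical, not taken from `train.csv`). It also shows why `ast.literal_eval` is preferred over `eval`: it only accepts Python literals and rejects arbitrary expressions.

```python
import ast

# Hypothetical CSV row: list-valued columns arrive as plain strings.
row = {
    "tokens": "['Alice', 'Johnson', 'emailed', 'alice@example.com']",
    "ner_tags": "['B-PER', 'I-PER', 'O', 'B-EMAIL']",
}

tokens = ast.literal_eval(row["tokens"])
ner_tags = ast.literal_eval(row["ner_tags"])

# Each token should have exactly one tag.
aligned = isinstance(tokens, list) and len(tokens) == len(ner_tags)

# literal_eval only parses literals, so a function call raises ValueError.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(tokens, ner_tags, aligned, rejected)
```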

### Tokenize and Align Labels
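This cell tokenizes the text data and aligns the NER labels with the tokenized input. Only the first sub-token of each word keeps the word's label; special tokens, padding, and continuation sub-tokens are set to `-100` so that they are excluded from the loss and from the metrics.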

### Tokenize Datasets
This cell applies the `tokenize_and_align_labels` function to the training and validation datasets. The tokenized datasets are stored in `tokenized_datasets`.

In [None]:
def tokenize_and_align_labels(batch):
    tokenized_inputs = tokenizer(
        batch["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding="max_length",
        max_length=128
    )

    all_labels = []
    for i, ner_tags in enumerate(batch["ner_tags"]):
        labels = [label2id.get(tag, label2id["O"]) for tag in ner_tags]
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        if word_ids is None:
            raise ValueError("word_ids is None.")

        previous_word_idx = None
        aligned_labels = []
        for word_idx in word_ids:
            if word_idx is None:
                aligned_labels.append(-100)
            elif word_idx != previous_word_idx:
                aligned_labels.append(labels[word_idx])
            else:
                aligned_labels.append(-100)
            previous_word_idx = word_idx

        aligned_labels = aligned_labels[:128]
        aligned_labels += [-100] * (128 - len(aligned_labels))

        all_labels.append(aligned_labels)

    tokenized_inputs["labels"] = all_labels
    return tokenized_inputs

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)


Map:   0%|          | 0/22812 [00:00<?, ? examples/s]

Map:   0%|          | 0/2852 [00:00<?, ? examples/s]
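
The alignment rule can be seen in isolation with a hand-written `word_ids` list. This is a minimal sketch: in the notebook the list comes from `tokenized_inputs.word_ids(...)`, whereas the values and the helper name `align_labels` below are hypothetical.

```python
IGNORE_INDEX = -100  # label value used in the notebook for positions to skip

def align_labels(word_ids, word_labels):
    """Label the first sub-token of each word; mask everything else."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:          # special tokens and padding
            aligned.append(IGNORE_INDEX)
        elif word_idx != previous:    # first sub-token of a word
            aligned.append(word_labels[word_idx])
        else:                         # continuation sub-token
            aligned.append(IGNORE_INDEX)
        previous = word_idx
    return aligned

# Hypothetical input: 3 words, the second split into 2 sub-tokens,
# wrapped in [CLS]/[SEP] and followed by two padding positions.
word_ids = [None, 0, 1, 1, 2, None, None, None]
word_labels = [1, 2, 0]  # B-PER, I-PER, O

result = align_labels(word_ids, word_labels)
print(result)  # [-100, 1, 2, -100, 0, -100, -100, -100]
```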

### Define Evaluation Metrics
This cell defines the `compute_metrics` function, which calculates `seqeval` accuracy, precision, recall, and F1-score, plus per-entity FPR and FNR estimates that are derived from the classification report and then averaged over entity types.

In [None]:
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report
import numpy as np

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [id2label[p] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    accuracy = accuracy_score(true_labels, true_predictions)
    f1 = f1_score(true_labels, true_predictions, zero_division=0)
    precision = precision_score(true_labels, true_predictions, zero_division=0)
    recall = recall_score(true_labels, true_predictions, zero_division=0)

    report = classification_report(true_labels, true_predictions, output_dict=True, zero_division=0)

    # Per-entity error rates, estimated from the report rather than counted directly.
    fpr = {}
    fnr = {}
    for entity_type, metrics in report.items():
        if entity_type not in ["micro avg", "macro avg", "weighted avg"]:
            tp = metrics["support"] * metrics["recall"]
            fn = metrics["support"] * (1 - metrics["recall"])
            # Approximation: scales by gold support, not by the number of predicted entities.
            fp = metrics["support"] * (1 - metrics["precision"])

            # True negatives are not available here, so this ratio is not the
            # textbook FPR = FP / (FP + TN).
            denominator_fpr = fp + tp + fn
            denominator_fnr = fn + tp

            if denominator_fpr > 0:
                fpr[entity_type] = fp / denominator_fpr
            else:
                fpr[entity_type] = 0

            if denominator_fnr > 0:
                fnr[entity_type] = fn / denominator_fnr
            else:
                fnr[entity_type] = 0

    valid_fpr = [v for v in fpr.values() if not np.isnan(v)]
    valid_fnr = [v for v in fnr.values() if not np.isnan(v)]

    # Unweighted mean over entity types (a macro average, despite the variable names).
    micro_fpr = sum(valid_fpr) / len(valid_fpr) if valid_fpr else 0
    micro_fnr = sum(valid_fnr) / len(valid_fnr) if valid_fnr else 0

    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "fpr": micro_fpr,
        "fnr": micro_fnr,
        "classification_report": report,
    }
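
The report-based estimates above can be cross-checked by counting entity spans directly. The sketch below is an alternative, not the notebook's method: it extracts `(type, start, end)` spans from BIO tags in plain Python, counts exact TP, FP, and FN per type, and returns NaN when a ratio is undefined (for example, for a type with no gold and no predicted entities). Because true negatives are not well defined for spans, it reports the false discovery rate FP / (FP + TP) rather than a textbook FPR. All helper names and the sample sequences are hypothetical, and orphan `I-` tags are ignored, which may differ from how `seqeval` treats them.

```python
import math

def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from one BIO sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside:
            if current is not None:
                spans.add((current, start, i))
            if tag.startswith("B-"):
                start, current = i, tag[2:]
            else:  # "O" or an orphan I- tag: no open span
                start, current = None, None
    return spans

def span_counts(gold_seqs, pred_seqs, entity_type):
    """Exact TP/FP/FN for one entity type over a list of sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g = {s for s in extract_spans(gold) if s[0] == entity_type}
        p = {s for s in extract_spans(pred) if s[0] == entity_type}
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    return tp, fp, fn

def safe_ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else math.nan

# Hypothetical gold and predicted tag sequences.
gold = [["B-PER", "I-PER", "O", "B-EMAIL"], ["O", "B-PER", "O"]]
pred = [["B-PER", "I-PER", "O", "O"], ["B-PER", "B-PER", "O"]]

for etype in ("PER", "EMAIL", "ORG"):
    tp, fp, fn = span_counts(gold, pred, etype)
    fnr = safe_ratio(fn, fn + tp)  # miss rate = 1 - recall
    fdr = safe_ratio(fp, fp + tp)  # false discovery rate = 1 - precision
    print(etype, tp, fp, fn, fnr, fdr)
```

In this toy data `ORG` never occurs, so both of its rates are NaN, while `EMAIL` has one missed entity and no predictions, giving an FNR of 1.0 and an undefined FDR.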


### Set Environment Variables
This cell sets environment variables to configure the Hugging Face Hub download timeout and disable Weights & Biases (W&B) logging.

### Load Pre-trained Model
This cell loads the pre-trained `bert-base-cased` model for token classification. The model is configured with the number of labels and label mappings.

### Define Training Arguments
This cell defines the training arguments for the `Trainer` class, including:
- Output directory.
- Evaluation and save strategies.
- Learning rate.
- Batch size.
- Number of epochs.
- Weight decay.

### Initialize Trainer
This cell initializes the `Trainer` class with the model, training arguments, datasets, tokenizer, data collator, and evaluation metrics.



In [None]:
import os
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "600"
os.environ["WANDB_DISABLED"] = "true"
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABEL_LIST),
    id2label=id2label,
    label2id=label2id
)

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/NoteBook/ner_model",  # absolute path, so checkpoints land on the mounted Drive
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["val"],
    processing_class=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=compute_metrics,
)

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
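
As a rough check on training length, the number of steps follows from the dataset sizes shown in the `Map` progress bars (22,812 training and 2,852 validation examples) and the batch size of 16. The sketch assumes a single device and no gradient accumulation, neither of which the notebook states explicitly.

```python
import math

train_examples = 22812  # from the Map output for the train split
val_examples = 2852     # from the Map output for the val split
batch_size = 16         # per_device_train_batch_size and per_device_eval_batch_size
epochs = 3              # num_train_epochs

# Assumption: one device and gradient_accumulation_steps = 1.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_train_steps = steps_per_epoch * epochs
eval_steps = math.ceil(val_examples / batch_size)

print(steps_per_epoch, total_train_steps, eval_steps)  # 1426 4278 179
```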


### Train or Download Model
This cell prompts the user to choose between training the model or downloading a pre-trained model. If the user chooses to train the model, it starts the training process and saves the model to Google Drive. If the user chooses to download the model, it downloads the pre-trained model from Google Drive.

In [None]:
choice = input("Do you want to (1) Train the model or (2) Download the pre-trained model? Enter 1 or 2: ").strip()

if choice == "1":
    print("Training the model...")
    trainer.train()
    print("Saving the model to Google Drive...")
    model.save_pretrained("/content/drive/MyDrive/NoteBook/ner_model")
    tokenizer.save_pretrained("/content/drive/MyDrive/NoteBook/ner_model")
    print("Model saved to Google Drive.")

    model = AutoModelForTokenClassification.from_pretrained("/content/drive/MyDrive/NoteBook/ner_model")
    tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/NoteBook/ner_model")
    print(" model and tokenizer loaded successfully from drive.")

elif choice == "2":
    print("Downloading the pre-trained model...")
    # https://drive.google.com/drive/folders/1-1gu-XgHZ9crDBkdg2TD8e4EvGbGl_OC?usp=sharing
    folder_id = "1-1gu-XgHZ9crDBkdg2TD8e4EvGbGl_OC"
    gdown.download_folder(id=folder_id, output="ner_model")
    print("Model downloaded to 'ner_model' folder.")

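    model = AutoModelForTokenClassification.from_pretrained("ner_model")
    tokenizer = AutoTokenizer.from_pretrained("ner_model")
    # Note: `trainer` still wraps the freshly initialized model created earlier,
    # so re-create the Trainer with this `model` before calling trainer.evaluate().
    print("Pre-trained model and tokenizer loaded successfully.")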
else:
    print("Invalid choice. Please enter 1 or 2.")

Do you want to (1) Train the model or (2) Download the pre-trained model? Enter 1 or 2: 1
Training the model...


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Fpr | Fnr |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.0055 | 0.003067 | 0.99912 | 0.99496 | 0.994084 | 0.995839 | 0.005829 | 0.004142 |
| 2 | 0.0016 | 0.002697 | 0.999343 | 0.996155 | 0.995966 | 0.996343 | 0.003984 | 0.003639 |
| 3 | 0.0006 | 0.002704 | 0.999343 | 0.996219 | 0.995592 | 0.996847 | 0.004342 | 0.003137 |


Saving the model to Google Drive...
Model saved to Google Drive.
 model and tokenizer loaded successfully from drive.


### Load Test Dataset
This cell loads the test dataset from the CSV file and applies the same preprocessing steps as the training and validation datasets.

In [None]:
DATA_FILES = {"test": "test.csv"}
test_dataset = load_dataset("csv", data_files=DATA_FILES)
test_dataset = test_dataset.map(convert_str_to_list)
tokenized_test_datasets = test_dataset.map(tokenize_and_align_labels, batched=True)

Generating test split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/2852 [00:00<?, ? examples/s]

Map:   0%|          | 0/2852 [00:00<?, ? examples/s]

### Evaluate Model on Test Dataset
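This cell evaluates the model on the test dataset and prints the evaluation results, including accuracy, precision, recall, F1-score, FPR, and FNR. Because a `DatasetDict` is passed to `trainer.evaluate`, each metric key carries the split name (for example, `eval_test_f1`).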

In [None]:
results = trainer.evaluate(tokenized_test_datasets)
print(results)

{'eval_test_loss': 0.0037880672607570887, 'eval_test_accuracy': 0.9991494914424858, 'eval_test_f1': 0.9963560002470508, 'eval_test_precision': 0.9961714215141411, 'eval_test_recall': 0.9965406473931308, 'eval_test_fpr': 0.0037663799786242405, 'eval_test_fnr': 0.003425495473452389, 'eval_test_runtime': 23.3699, 'eval_test_samples_per_second': 122.037, 'eval_test_steps_per_second': 7.659, 'epoch': 3.0}


### Initialize NER Pipeline
This cell initializes an NER pipeline using the trained model and tokenizer. The pipeline is used to make predictions on new text data.

In [None]:

from transformers import pipeline

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Subhan Rangila and Muhammad Sadiq are data scientists at Google in New York. Their emails are subhanRangila@gmail.com and muhammad.sadiq@yahoo.com."

predictions = ner_pipeline(text)
for pred in predictions:
    print(f"Entity: {pred['word']}, Label: {pred['entity_group']}, Score: {pred['score']:.4f}")


Device set to use cuda:0


Entity: Subhan Rangila, Label: PER, Score: 0.9286
Entity: Muhammad Sadiq, Label: PER, Score: 0.9999
Entity: subhanRangila @ gmail. com, Label: EMAIL, Score: 0.9888
Entity: muhammad, Label: EMAIL, Score: 0.9989
Entity: sadiq @ yahoo. com, Label: EMAIL, Score: 1.0000


### Redact PII with Pipeline
This cell defines a function to redact PII (e.g., names and emails) from text using the NER pipeline. It replaces PII entities with placeholders like `[NAME]` and `[EMAIL]`.

In [None]:
def redact_pii_with_pipeline(text, ner_pipeline):
    predictions = ner_pipeline(text)

    predictions = sorted(predictions, key=lambda x: x["start"])

    redacted_text = ""
    prev_end = 0
    for pred in predictions:
        redacted_text += text[prev_end:pred["start"]]

        if pred["entity_group"] == "PER":
            redacted_text += "[NAME]"
        elif pred["entity_group"] == "EMAIL":
            redacted_text += "[EMAIL]"
        else:
            redacted_text += pred["word"]

        prev_end = pred["end"]

    redacted_text += text[prev_end:]

    return redacted_text
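
The pipeline output shown earlier splits `muhammad.sadiq@yahoo.com` into two `EMAIL` spans, which would leave the separating dot visible after redaction. One possible mitigation, sketched below under stated assumptions, is to merge consecutive spans with the same label when the text between them contains no whitespace. The helper names and the hand-written predictions are hypothetical; only the `entity_group`, `start`, and `end` keys match the pipeline output used above. The heuristic would also merge two names written without a space (such as `Alice,Bob`), so it is a sketch rather than a complete fix.

```python
def merge_adjacent_spans(text, predictions):
    """Merge same-label spans separated only by non-whitespace characters."""
    merged = []
    for pred in sorted(predictions, key=lambda p: p["start"]):
        if merged:
            last = merged[-1]
            gap = text[last["end"]:pred["start"]]
            same_label = last["entity_group"] == pred["entity_group"]
            if same_label and not any(ch.isspace() for ch in gap):
                last["end"] = max(last["end"], pred["end"])
                continue
        merged.append(dict(pred))  # copy so the input list is not mutated
    return merged

def redact(text, spans, placeholders):
    """Replace each span with its placeholder, keeping the text in between."""
    out, prev_end = [], 0
    for span in spans:
        out.append(text[prev_end:span["start"]])
        original = text[span["start"]:span["end"]]
        out.append(placeholders.get(span["entity_group"], original))
        prev_end = span["end"]
    out.append(text[prev_end:])
    return "".join(out)

text = "Mail muhammad.sadiq@yahoo.com or Bob Dylan."
email = "muhammad.sadiq@yahoo.com"
e0 = text.index(email)
b0 = text.index("Bob")

# Hypothetical pipeline output with the email split at the dot.
predictions = [
    {"entity_group": "EMAIL", "start": e0, "end": e0 + len("muhammad")},
    {"entity_group": "EMAIL", "start": e0 + len("muhammad."), "end": e0 + len(email)},
    {"entity_group": "PER", "start": b0, "end": b0 + len("Bob Dylan")},
]

spans = merge_adjacent_spans(text, predictions)
redacted = redact(text, spans, {"PER": "[NAME]", "EMAIL": "[EMAIL]"})
print(redacted)  # Mail [EMAIL] or [NAME].
```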


### Test Redaction Function
This cell tests the redaction function on a sample text and prints the original and redacted text.

In [None]:
text = "Alice Johnson works at Microsoft. Bob Dylan is a researcher at OpenAI. Their contacts are alice.j@microsoft.com and bobdylan@openai.com."
redacted_text = redact_pii_with_pipeline(text, ner_pipeline)
print("Original Text:", text)
print("Redacted Text:", redacted_text)

Original Text: Alice Johnson works at Microsoft. Bob Dylan is a researcher at OpenAI. Their contacts are alice.j@microsoft.com and bobdylan@openai.com.
Redacted Text: [NAME] works at Microsoft. [NAME] is a researcher at OpenAI. Their contacts are alice.[EMAIL] and [EMAIL].


### Load Independent Test Dataset
This cell downloads and loads an independent test dataset from Google Drive. The dataset is stored in JSON format.

In [None]:
# https://drive.google.com/file/d/1E2FjYFDGEeXTwpabkC0aYzZV8aOQqf_h/view?usp=sharing
independent_test_data_file_id = "1E2FjYFDGEeXTwpabkC0aYzZV8aOQqf_h"

gdown.download(f"https://drive.google.com/uc?id={independent_test_data_file_id}", "test_data.json", quiet=False)


Downloading...
From: https://drive.google.com/uc?id=1E2FjYFDGEeXTwpabkC0aYzZV8aOQqf_h
To: /content/test_data.json
100%|██████████| 4.19M/4.19M [00:00<00:00, 194MB/s]


'test_data.json'

In [None]:
DATA_FILES = {"test_data": "test_data.json"}
test_dataset = load_dataset("json", data_files=DATA_FILES)
tokenized_test_datasets = test_dataset.map(tokenize_and_align_labels, batched=True)

### Evaluate Model on Independent Test Dataset
This cell evaluates the model on the independent test dataset and prints the evaluation results.
Note: the independent dataset contains no email entities, so the FPR and FNR values come out as `nan`, as the evaluation output shows.

In [None]:
results = trainer.evaluate(tokenized_test_datasets)
print(results)

{'eval_test_data_loss': 0.029309110715985298, 'eval_test_data_accuracy': 0.9928402326062473, 'eval_test_data_f1': 0.9389151655429203, 'eval_test_data_precision': 0.9868838586841548, 'eval_test_data_recall': 0.8953934740882917, 'eval_test_data_fpr': nan, 'eval_test_data_fnr': nan, 'eval_test_data_runtime': 30.2626, 'eval_test_data_samples_per_second': 120.611, 'eval_test_data_steps_per_second': 7.567, 'epoch': 3.0}


  fpr[entity_type] = fp / (fp + tp + fn)
  fnr[entity_type] = fn / (fn + tp)


### Load Synthetic Test Dataset
This cell downloads and loads a synthetic test dataset from Google Drive. The dataset is stored in CSV format.
Note: once synthetic emails are appended to the test dataset, the evaluation yields numeric FNR and FPR values.

In [None]:
# https://drive.google.com/file/d/1-1v9MghJ6XnGDdlKaD4h-se1ZNfk6hYV/view?usp=sharing
independent_synthetic_test_data_file_id = "1-1v9MghJ6XnGDdlKaD4h-se1ZNfk6hYV"

gdown.download(f"https://drive.google.com/uc?id={independent_synthetic_test_data_file_id}", "synthetic_test_data.csv", quiet=False)



Downloading...
From: https://drive.google.com/uc?id=1-1v9MghJ6XnGDdlKaD4h-se1ZNfk6hYV
To: /content/synthetic_test_data.csv
100%|██████████| 18.2M/18.2M [00:01<00:00, 11.5MB/s]


'synthetic_test_data.csv'

In [None]:
DATA_FILES = {"synthetic_test_data": "synthetic_test_data.csv"}
synthetic_test_dataset = load_dataset("csv", data_files=DATA_FILES)
synthetic_test_dataset = synthetic_test_dataset.map(convert_str_to_list)
tokenized_synthetic_test_dataset = synthetic_test_dataset.map(tokenize_and_align_labels, batched=True)

Generating synthetic_test_data split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/28516 [00:00<?, ? examples/s]

Map:   0%|          | 0/28516 [00:00<?, ? examples/s]

### Evaluate Model on Synthetic Test Dataset
This cell evaluates the model on the synthetic test dataset and prints the evaluation results.

In [None]:
results = trainer.evaluate(tokenized_synthetic_test_dataset)
print(results)

{'eval_synthetic_test_data_loss': 0.0007556203636340797, 'eval_synthetic_test_data_accuracy': 0.9998100851803136, 'eval_synthetic_test_data_f1': 0.9990298614892565, 'eval_synthetic_test_data_precision': 0.9989235737351991, 'eval_synthetic_test_data_recall': 0.9991361718642413, 'eval_synthetic_test_data_fpr': 0.0010661004431376582, 'eval_synthetic_test_data_fnr': 0.0008573174886663537, 'eval_synthetic_test_data_runtime': 239.3137, 'eval_synthetic_test_data_samples_per_second': 119.157, 'eval_synthetic_test_data_steps_per_second': 7.45, 'epoch': 3.0}
