
**Toxic Prompt Classifier**

=======================



* This notebook fine-tunes a pretrained transformer (DistilBERT) on a binary classification task: flagging each prompt as safe or unsafe.
* The resulting classifier can act as a prompt guardrail for GenAI assistants.
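
To make the guardrail idea concrete, here is a minimal sketch of how a safe/unsafe classifier could sit in front of an assistant. The names `guarded_reply`, `toy_classifier`, and `toy_assistant` are hypothetical stand-ins invented for this illustration; they are not part of the notebook or of any library.

```python
from typing import Callable

def guarded_reply(prompt: str,
                  classify: Callable[[str], str],
                  assistant: Callable[[str], str]) -> str:
    """Forward the prompt to the assistant only if the classifier calls it safe."""
    if classify(prompt) == "unsafe":
        return "Sorry, I can't help with that request."
    return assistant(prompt)

# Hypothetical stand-ins so the sketch runs without a trained model.
def toy_classifier(prompt: str) -> str:
    return "unsafe" if "stupid" in prompt.lower() else "safe"

def toy_assistant(prompt: str) -> str:
    return f"Assistant answer to: {prompt}"

print(guarded_reply("Can you help me reset my password?", toy_classifier, toy_assistant))
print(guarded_reply("Go to hell, stupid agent.", toy_classifier, toy_assistant))
```

In practice the `classify` argument would be a fine-tuned model wrapped in a function that returns "safe" or "unsafe".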



In [None]:
import torch
import numpy as np
import pandas as pd
from datasets import Dataset
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer, TrainingArguments,
    DataCollatorWithPadding
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support
import random
import os

In [None]:
# Disable Weights & Biases (W&B) logging, since this run is not tracked on the W&B dashboard.
os.environ["WANDB_DISABLED"] = "true"

In [None]:
#Setting seeds for reproducibility.
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

In [None]:
# Use the GPU through CUDA when it is available for faster computation; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

Using device: cuda
GPU: Tesla T4


**Dataset Preparation**

=======================

This notebook uses the Toxic Comment Classification Challenge dataset and collapses its six toxicity labels into a single binary label.

In [None]:
df_raw = pd.read_csv("train.csv")
toxicity_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
df_raw["label"] = df_raw[toxicity_cols].max(axis=1)  # Binary label: 1=unsafe, 0=safe
df = df_raw[["comment_text", "label"]].copy() # Keeping only the necessary columns.
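
The row-wise `max` over the six toxicity columns amounts to "unsafe if any flag is set". A plain-Python sketch of the same rule follows; the two rows are made up for illustration and are not taken from `train.csv`, and `to_binary_label` is a hypothetical helper name.

```python
toxicity_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def to_binary_label(row: dict) -> int:
    """Return 1 (unsafe) if any toxicity flag is set, else 0 (safe)."""
    return max(row[col] for col in toxicity_cols)

# Made-up rows for illustration only.
clean_row = dict.fromkeys(toxicity_cols, 0)
flagged_row = {**clean_row, "insult": 1, "obscene": 1}

print(to_binary_label(clean_row))
print(to_binary_label(flagged_row))
```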

In [None]:
# Compare the raw dataset with the reduced two-column version used for training.
df_raw.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,label
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,0


In [None]:
df.head()

Unnamed: 0,comment_text,label
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [None]:
print(df["label"].value_counts())  # Shows number of safe (0) vs unsafe (1) samples

label
0    143346
1     16225
Name: count, dtype: int64
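
These counts show a strong class imbalance. The short calculation below, using only the two counts printed above, shows why accuracy alone is a weak yardstick here: a classifier that always answers "safe" would already score roughly 90% accuracy.

```python
safe, unsafe = 143346, 16225  # counts printed by value_counts() above
total = safe + unsafe

unsafe_share = unsafe / total
majority_baseline_accuracy = safe / total  # accuracy of always predicting "safe"

print(f"unsafe share: {unsafe_share:.4f}")
print(f"always-safe accuracy: {majority_baseline_accuracy:.4f}")
```

This is one reason to track precision, recall, and F1 on the unsafe class rather than accuracy alone.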


In [None]:
# Split the dataset into training and validation sets, stratified by label.
train_df, val_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=seed)
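
To illustrate what `stratify` buys, here is a simplified pure-Python sketch of a stratified split. It is not scikit-learn's actual implementation; `stratified_split` is a hypothetical helper that splits each class separately so both subsets keep the same safe/unsafe ratio.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with every class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels: 90 safe, 10 unsafe (made up for illustration).
labels = [0] * 90 + [1] * 10
train_idx, val_idx = stratified_split(labels)
val_unsafe = sum(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), val_unsafe)
```

With an unstratified random split, a rare class can end up over- or under-represented in the validation set, which makes the validation metrics noisier.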

In [None]:
# Convert to Hugging Face Datasets, which the Trainer used later accepts directly.
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
val_dataset = Dataset.from_pandas(val_df.reset_index(drop=True))

**Pre-trained model**

=====================



*   Using DistilBERT (uncased)
*   Loaded from the Hugging Face Transformers library



In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Move the model to the selected device (CUDA here).
# Inputs must later be moved to the same device as the model.

model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


**Tokenization and padding**

============================



*   The tokenizer converts text into the numerical input IDs that the model understands.
*   Padding is deferred to the data collator, which pads each batch dynamically so that less GPU computation is wasted on pad tokens.



In [None]:
# Tokenize the comment text with truncation to the model's maximum sequence length; padding is applied later, per batch, by the data collator.
def tokenize(batch):
    return tokenizer(batch["comment_text"], truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True)
val_dataset = val_dataset.map(tokenize, batched=True)

Map:   0%|          | 0/127656 [00:00<?, ? examples/s]

Map:   0%|          | 0/31915 [00:00<?, ? examples/s]

In [None]:
#Using dynamic padding so that it saves GPU memory use and speeds up training
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
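
The saving from dynamic padding can be sketched in plain Python. The hypothetical `pad_batch` helper below only mimics the idea of padding to the longest sequence in a batch; it is not how `DataCollatorWithPadding` is implemented, and the token IDs are made up. The pad ID of 0 and the fixed length of 512 are taken from the model printout above (`padding_idx=0`, 512 position embeddings).

```python
def pad_batch(sequences, pad_id=0, length=None):
    """Right-pad token-id lists to `length`, or to the longest sequence in the batch."""
    target = length if length is not None else max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Made-up token ids for illustration.
batch = [[101, 7592, 102], [101, 2129, 2024, 2017, 102]]

dynamic_ids, dynamic_mask = pad_batch(batch)   # pad to the longest in the batch (5)
fixed_ids, _ = pad_batch(batch, length=512)    # pad everything to 512

dynamic_pad_tokens = sum(row.count(0) for row in dynamic_ids)
fixed_pad_tokens = sum(row.count(0) for row in fixed_ids)
print(dynamic_pad_tokens, fixed_pad_tokens)
```

Every pad token still passes through the model, so fewer pad tokens per batch means less wasted computation and memory.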

**Hyperparameters**

===============

These are the hyperparameters with the largest impact on training:

*   learning_rate: Sets the step size of each weight update, i.e. how fast the model learns.
*   per_device_train_batch_size: Number of examples per step on each device; balance it against GPU memory.
*   num_train_epochs: Number of passes over the training dataset.
*   weight_decay: Penalizes large weights to reduce overfitting.
*   warmup_ratio: Fraction of training steps over which the learning rate ramps up from zero, which stabilizes the start of training.
*   fp16: Enables mixed-precision training, which speeds up training and reduces GPU memory usage.





In [None]:
# Hyperparameters chosen to balance accuracy and training speed
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",
    fp16=True,
    warmup_ratio=0.1,
    report_to="none",  # non-deprecated way to turn off logging integrations such as W&B
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
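
As a rough illustration of what `warmup_ratio=0.1` does, the sketch below computes the learning rate under a linear warmup followed by a linear decay. It assumes the Trainer's default linear scheduler, a single device, and no gradient accumulation; `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
import math

def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear ramp-up during warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# 127,656 training examples at batch size 16, for one epoch.
total_steps = math.ceil(127656 / 16)

for step in (0, 399, 798, total_steps):
    print(step, linear_warmup_lr(step, total_steps))
```

Under these assumptions the rate starts at zero, reaches the configured 2e-5 after about 10% of the steps, and then decays back to zero by the end of the epoch.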


**Metrics**

===========

The model is evaluated on accuracy, plus precision, recall, and F1-score for the unsafe (positive) class.

In [None]:
# Metric function passed to the Trainer; it runs on the validation set at each evaluation.
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(p.label_ids, preds, average='binary')
    acc = accuracy_score(p.label_ids, preds)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}
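
For reference, the four metrics can be written out in plain Python. This is a sketch of the standard definitions for the positive (unsafe) class, which is what `average='binary'` reports; `binary_metrics` is a hypothetical helper and the labels and predictions are made up.

```python
def binary_metrics(labels, preds):
    """Accuracy plus precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels and predictions for illustration.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(labels, preds)
print(metrics)
```

Precision answers "of the prompts flagged unsafe, how many really were?", while recall answers "of the truly unsafe prompts, how many were caught?".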

**Using the Hugging Face Trainer API**

In [None]:
# Hugging Face Trainer
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)



In [None]:
#Training
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.0765,0.090335,0.968886,0.874834,0.809861,0.841095


TrainOutput(global_step=7979, training_loss=0.10975843647814497, metrics={'train_runtime': 1268.7619, 'train_samples_per_second': 100.615, 'train_steps_per_second': 6.289, 'total_flos': 1.1595043198017312e+16, 'train_loss': 0.10975843647814497, 'epoch': 1.0})
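
As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two columns:

```python
precision, recall = 0.874834, 0.809861  # validation values reported above

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))
```

The result agrees with the reported F1 of 0.841095 to within rounding. The gap between precision and recall indicates that the model misses unsafe prompts more often than it wrongly flags safe ones.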

**Inference example**

=================

A few sample prompts are run through the fine-tuned model.

In [None]:
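# Requirement: inference example.
# Returns the argmax label; applying softmax to the logits would also give a probability.
def classify_prompt(text):
    model.eval()  # disable dropout so predictions are deterministic
    # Tokenize and move the inputs to the same device as the model.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():  # gradients are not needed at inference time
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=1).item()
    return "unsafe" if pred == 1 else "safe"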

In [None]:
examples = [
    "I love your service, thank you!",
    "You're the worst support ever, I hate you.",
    "Can you help me reset my password?",
    "Go to hell, stupid agent."
]

for text in examples:
    print(f"{text} => {classify_prompt(text)}")

I love your service, thank you! => safe
You're the worst support ever, I hate you. => unsafe
Can you help me reset my password? => safe
Go to hell, stupid agent. => unsafe
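
The inference code mentions that softmax could be added. A minimal sketch of that idea in plain Python follows: convert the two logits to probabilities and apply a threshold to the unsafe probability instead of taking the argmax. The logits are made up, and `label_from_logits` and its `threshold` parameter are hypothetical names for this illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_from_logits(logits, threshold=0.5):
    """Return (label, p_unsafe), flagging the prompt when p_unsafe >= threshold."""
    p_unsafe = softmax(logits)[1]  # index 1 is the unsafe class
    return ("unsafe" if p_unsafe >= threshold else "safe"), p_unsafe

# Made-up logits in [safe, unsafe] order, for illustration only.
print(label_from_logits([2.0, -1.0]))
print(label_from_logits([0.2, 0.4]))
print(label_from_logits([0.2, 0.4], threshold=0.7))
```

With two classes, a threshold of 0.5 gives the same decision as argmax. Raising or lowering the threshold trades recall for precision, which may matter for a guardrail where missed unsafe prompts and false alarms have different costs.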


In [None]:
print("Thank you😁😁!")

Thank you😁😁!
