# **Importing Required Libraries**

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

import torch
from transformers import (
    DistilBertTokenizerFast,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

from datasets import Dataset
import wandb

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
df = pd.read_csv("/content/Full_data_for_classification_53745.csv")
df

Unnamed: 0,instruction,intent,OOD
0,I'd like to cancel my ticket for the game in t...,cancel_ticket,No
1,I have to cancel my ticket for the event in th...,cancel_ticket,No
2,I have to cancel my ticket for the show i need...,cancel_ticket,No
3,How could i cancel my tickets for the show in...,cancel_ticket,No
4,Wanna cancel my ticket for the show in this to...,cancel_ticket,No
...,...,...,...
53740,Morning! Hope you're feeling vibrant and full ...,greeting,Yes
53741,"Hey there, what's filling your heart with pure...",greeting,Yes
53742,"Hru, any profound thoughts or quiet reflection...",greeting,Yes
53743,Just dropping in to say hi and wish you an utt...,greeting,Yes


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53745 entries, 0 to 53744
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   instruction  53745 non-null  object
 1   intent       53745 non-null  object
 2   OOD          53745 non-null  object
dtypes: object(3)
memory usage: 1.2+ MB


In [3]:
# checking for NULL values
df.isna().sum()

Unnamed: 0,0
instruction,0
intent,0
OOD,0


In [5]:
# checking for duplicates
print(df['instruction'].duplicated().sum())

0


In [15]:
# Preprocessing
# Convert OOD column to binary labels: Yes -> 1, No -> 0
df['label'] = (df['OOD'] == 'Yes').astype(int)

In [16]:
# Split data into train and validation sets (85-15 split),
# stratified by intent so each intent keeps its share in both sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['instruction'].tolist(),
    df['label'].tolist(),
    test_size=0.15,
    random_state=42,
    stratify=df['intent']
)
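
As a quick sanity check on the split above, the sizes of the two sets can be worked out by hand. This is a minimal sketch that assumes the test share is rounded up and the remainder goes to training (my understanding of how scikit-learn treats a fractional `test_size`); `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # assumed: test share is rounded up
    n_train = n_samples - n_test               # the remainder goes to training
    return n_train, n_test

n_train, n_test = split_sizes(53745, 0.15)
print(n_train, n_test)
```

Under that assumption the result is 45,683 training rows and 8,062 validation rows, which agrees with the `num_rows` values the notebook prints for the two datasets further down.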


# **DistilBERT-base-uncased: A Distilled Version of BERT-base-uncased**

**`DistilBERT-base-uncased`** is a smaller, faster, and lighter version of Google’s **`BERT-base-uncased`**, created using **knowledge distillation**. It was developed by **Hugging Face** to retain most of BERT’s performance on NLP tasks while reducing computational overhead, which makes it a good fit for production deployments with latency or memory constraints.

---

## **Key Features of DistilBERT-base-uncased**

1. **Smaller Model Size**  
   - BERT-base has **110M parameters**; DistilBERT has **66M parameters** (~40% reduction).
   - Retains the same hidden size (768) but reduces transformer layers from 12 to 6.

2. **Retained Performance**  
   - Retains **~97% of BERT’s performance** on the GLUE benchmark.
   - Particularly strong on classification, NER, and QA tasks after fine-tuning.

3. **Knowledge Distillation**  
   - Trained as a “student” model to mimic the behavior of the “teacher” BERT.
   - Uses three loss components during training:
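     - **Soft target (distillation) loss**: Matches the teacher’s temperature-softened output distribution over the vocabulary.
     - **Masked language modeling loss**: Standard cross-entropy on the original tokens at masked positions.
     - **Cosine embedding loss**: Aligns the direction of the student’s hidden states with the teacher’s.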

4. **Faster Inference & Training**  
   - **60% faster inference** than BERT-base due to half the layers.
   - Reduced memory footprint enables deployment on CPUs or edge devices.

5. **Uncased Tokenization**  
   - Input text is lowercased before tokenization — suitable for case-insensitive tasks.
   - Uses WordPiece tokenizer (same as BERT), vocabulary size = 30,522.

6. **Open-Source & Apache 2.0 Licensed**  
   - Freely available via Hugging Face `transformers` library.
   - Can be fine-tuned and used in commercial applications under the terms of the Apache 2.0 license.

---

## **Architecture Differences (vs. BERT-base-uncased)**

| Feature             | BERT-base-uncased | DistilBERT-base-uncased |
|---------------------|-------------------|--------------------------|
| **Parameters**        | 110M              | 66M                      |
| **Transformer Layers** | 12                | 6                        |
| **Attention Heads**    | 12                | 12                       |
| **Hidden Size**       | 768               | 768                      |
| **Feedforward Dim**   | 3072              | 3072                     |
| **Max Sequence Length** | 512 tokens        | 512 tokens               |
| **Tokenizer**         | WordPiece (uncased) | WordPiece (uncased)      |
| **Positional Encoding** | Learned           | Learned                  |

- DistilBERT removes the **token-type embeddings** and the **pooler layer** (the dense layer BERT applies to the `[CLS]` token for next sentence prediction).
- No NSP (Next Sentence Prediction) task is used during pre-training, only masked language modeling.
- Layer initialization: the student is initialized from the teacher by taking one layer out of every two, then trained with the distillation objective.
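
The parameter counts in the table can be approximated directly from the listed dimensions. The sketch below counts weights and biases for a BERT-style encoder without any task head. It assumes BERT-base carries a 2-row token-type embedding table and a pooler while DistilBERT has neither; `encoder_params` is a hypothetical helper written for this illustration.

```python
def encoder_params(layers, hidden=768, ffn=3072, vocab=30522, max_pos=512,
                   token_types=0, pooler=False):
    """Count weights and biases of a BERT-style encoder (no task head)."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab * hidden            # word embeddings
                  + max_pos * hidden        # learned position embeddings
                  + token_types * hidden    # segment embeddings (assumed 0 for DistilBERT)
                  + layer_norm)
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler_params = hidden * hidden + hidden if pooler else 0
    return embeddings + layers * per_layer + pooler_params

bert = encoder_params(layers=12, token_types=2, pooler=True)
distilbert = encoder_params(layers=6)
print(f"BERT-base:  {bert:,}")
print(f"DistilBERT: {distilbert:,}")
print(f"Reduction:  {1 - distilbert / bert:.1%}")
```

Because the embedding matrix is shared in size between the two models, halving the layer count removes a bit less than half of the total parameters, which is why the reduction lands near 40% rather than 50%.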

---

## **Training Process**

1. **Pre-training Dataset**  
   - Trained on the same data as BERT: **BookCorpus + English Wikipedia** (about 3.3 billion words in total).

2. **Knowledge Distillation Setup**  
   - Teacher: BERT-base-uncased.
   - Student: 6-layer Transformer initialized from the teacher by taking one layer out of every two.
   - Trained with the **masked language modeling (MLM)** objective only, without NSP.

3. **Loss Functions Combined**  
   The total loss is a weighted sum:
   ```
   Loss = α * L_{ce} + β * L_{distill} + γ * L_{cos}
   ```
   - `L_{ce}`: Masked language modeling cross-entropy on the original tokens at masked positions.
   - `L_{distill}`: KL-divergence between the student and teacher softmax outputs (softened with a temperature).
   - `L_{cos}`: Cosine similarity loss between corresponding student and teacher hidden states.

   The original paper does not report the weight values; α, β, and γ are hyperparameters set in the training script.

4. **Optimization Details**  
   - Optimizer: **AdamW** (weight decay = 0.01).
   - Learning rate: 5e-4 with linear warmup and decay.
   - Batch size: large batches built with gradient accumulation (up to 4K examples per batch).
   - Trained for **~90 hours** on 8x 16GB V100 GPUs.
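
The weighted loss from step 3 can be sketched in plain Python to make each term concrete. This is an illustration only, not the training code: the logits, hidden vectors, cross-entropy value, temperature, and weights below are all made-up placeholders, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cosine_loss(u, v):
    """1 - cos(u, v): zero when the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def total_loss(l_ce, l_distill, l_cos, alpha, beta, gamma):
    """Weighted sum of the three terms."""
    return alpha * l_ce + beta * l_distill + gamma * l_cos

# Hypothetical toy values for one token position
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.8]
T = 2.0  # assumed temperature

l_distill = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
l_cos = cosine_loss([0.2, 0.9, -0.4], [0.1, 1.0, -0.3])
l_ce = 2.3  # placeholder cross-entropy value
print(total_loss(l_ce, l_distill, l_cos, alpha=1.0, beta=1.0, gamma=1.0))
```

The distillation term is zero only when the student reproduces the teacher's softened distribution exactly, and the cosine term is zero only when the hidden vectors are parallel, so both push the student toward the teacher independently of the cross-entropy term.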

---

## **Performance Comparison**

| Model                   | Params | Speed (Inference) | GLUE Avg Score (higher is better) | SQuAD v1.1 F1 |
|------------------------|--------|-------------------|-----------------------------------|---------------|
| BERT-base-uncased      | 110M   | 1.0x (baseline)   | 79.5                              | 88.5          |
| **DistilBERT-base-uncased** | **66M** | **~1.6x faster**  | **77.0**                          | **86.9**      |
| MobileBERT (Google)    | 25M    | ~2x faster        | ~77.7                             | ~89.5         |

- DistilBERT retains **~97% of BERT’s GLUE score** with 40% fewer parameters.
- On SQuAD, it loses only ~1.6 F1 points when distillation is also applied during fine-tuning (86.9 F1); without that extra step the paper reports 85.8 F1.
- Newer distilled models of similar size, such as TinyBERT, report comparable or better results, so these numbers are a baseline rather than the state of the art.

---

## **Use Cases**

1. **Text Classification**  
   - Sentiment analysis, spam detection, topic labeling — ideal due to speed and accuracy trade-off.

2. **Named Entity Recognition (NER)**  
   - Efficiently tags entities in real-time systems (e.g., customer support logs).

3. **Question Answering (QA)**  
   - Suitable for extractive QA on constrained devices (mobile, browser extensions).

4. **Semantic Search & Embeddings**  
   - Generate sentence embeddings (with mean-pooling or [CLS]) for retrieval tasks.

5. **Edge & Real-Time Applications**  
   - Runs efficiently on CPU-only servers or mobile apps where latency matters.

6. **Educational & Prototyping Use**  
   - Excellent for learning transformer internals or rapid experimentation without GPU dependency.

---

## **Limitations**

- **No Next Sentence Prediction (NSP)**: Not suitable out-of-the-box for tasks requiring inter-sentence reasoning (though rarely needed post-BERT).
- **Slightly Weaker on Long Dependencies**: Due to fewer layers, may underperform on very long or complex linguistic structures.
- **Not State-of-the-Art**: Newer distilled models (e.g., TinyBERT, MobileBERT, MiniLM) may offer better efficiency or performance.
- **Still Larger Than Ultra-Light Models**: For extreme edge cases, consider even smaller models like `bert-tiny` (4M params).

In [17]:
# Load DistilBERT model for binary classification
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
# Load DistilBERT tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


In [19]:
# Tokenize the text data for both training and validation
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=256)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=256)
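
To make the effect of `truncation=True, padding=True, max_length=256` concrete, here is a simplified pure-Python sketch of what happens to a batch of token-id lists. It is not the tokenizer's implementation: it ignores special tokens, uses made-up token ids, and assumes a padding id of 0; `pad_and_truncate` is a hypothetical helper.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one left."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_truncate([[11, 12, 13, 14], [21, 22], [31]], max_length=3)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding goes only as far as the longest sequence in the call, so `max_length` acts as an upper bound rather than a fixed width. Since the training and validation texts are tokenized in separate calls, their padded widths can differ.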

In [20]:
# Convert to HuggingFace Dataset format
train_dataset = Dataset.from_dict({
    'input_ids': train_encodings['input_ids'],
    'attention_mask': train_encodings['attention_mask'],
    'labels': train_labels   # Note: key must be 'labels'
})

train_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 45683
})

In [21]:
val_dataset = Dataset.from_dict({
    'input_ids': val_encodings['input_ids'],
    'attention_mask': val_encodings['attention_mask'],
    'labels': val_labels   # Note: key must be 'labels'
})

val_dataset

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 8062
})

In [22]:
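# Define the compute_metrics function
# Note: the keys already carry the "eval_" prefix, so trainer.evaluate() reports
# them unchanged, while trainer.predict() prepends "test_" (e.g. "test_eval_f1").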
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)

    accuracy = accuracy_score(labels, predictions)
    precision = precision_score(labels, predictions)
    recall = recall_score(labels, predictions)
    f1 = f1_score(labels, predictions)

    return {
        "eval_accuracy": accuracy,
        "eval_precision": precision,
        "eval_recall": recall,
        "eval_f1": f1
    }

In [23]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",
    save_total_limit=5,
    learning_rate=2e-5,
    lr_scheduler_type="linear"
)
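
The schedule implied by these arguments can be worked out by hand. The sketch below assumes a single device and no gradient accumulation, and it mirrors my understanding of a linear schedule with warmup (ramp up to the base rate, then decay linearly to zero); `total_steps` and `linear_lr` are hypothetical helpers, not `transformers` functions.

```python
import math

def total_steps(n_train: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(n_train / batch_size) * epochs

def linear_lr(step: int, base_lr: float, warmup_steps: int, n_steps: int) -> float:
    """Linear warmup to base_lr, then linear decay to zero at n_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (n_steps - step) / (n_steps - warmup_steps))

n_steps = total_steps(n_train=45683, batch_size=8, epochs=5)
print(n_steps)
for step in (0, 250, 500, n_steps // 2, n_steps):
    print(step, linear_lr(step, base_lr=2e-5, warmup_steps=500, n_steps=n_steps))
```

With 45,683 training rows and a batch size of 8 this gives 28,555 steps, the same value as `global_step` in the training output. The 500 warmup steps therefore cover under 2% of the run.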

In [24]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



In [25]:
import wandb

# Login (reads the WANDB_API_KEY environment variable or prompts for a key;
# never hard-code the API key in a notebook)
wandb.login()

# Initialize run
wandb.init(project="chatbot_classification_model", name="distilbert-binary-OOD")

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: zlibbrary4 (zlibbrary4-iit-hyderabad) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


In [None]:
# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.0,0.00575,0.99876,0.997791,0.999754,0.998771
2,0.0,0.00251,0.999628,0.999508,0.999754,0.999631
3,0.0,0.007475,0.99938,0.999754,0.999016,0.999385
4,0.0,0.003028,0.999752,0.999754,0.999754,0.999754
5,0.0,0.003207,0.999628,0.999508,0.999754,0.999631


TrainOutput(global_step=28555, training_loss=0.006935134352845627, metrics={'train_runtime': 1353.1022, 'train_samples_per_second': 168.808, 'train_steps_per_second': 21.103, 'total_flos': 2127483342000720.0, 'train_loss': 0.006935134352845627, 'epoch': 5.0})

In [None]:
# Evaluate the model
results = trainer.evaluate()
print("Evaluation results:", results)

Evaluation results: {'eval_accuracy': 0.9997519225998511, 'eval_precision': 0.999754058042302, 'eval_recall': 0.999754058042302, 'eval_f1': 0.999754058042302, 'eval_loss': 0.0030281455256044865, 'eval_runtime': 8.8141, 'eval_samples_per_second': 914.67, 'eval_steps_per_second': 57.181, 'epoch': 5.0}


In [None]:
# Make predictions on the validation set
predictions = trainer.predict(val_dataset)
predictions

PredictionOutput(predictions=array([[-8.234556,  8.849662],
       [-8.268248,  8.879897],
       [ 8.516267, -8.687723],
       ...,
       [ 8.509247, -8.692003],
       [-8.275925,  8.904369],
       [ 8.466045, -8.668133]], dtype=float32), label_ids=array([1, 1, 0, ..., 0, 1, 0]), metrics={'test_loss': 0.0030281455256044865, 'test_eval_accuracy': 0.9997519225998511, 'test_eval_precision': 0.999754058042302, 'test_eval_recall': 0.999754058042302, 'test_eval_f1': 0.999754058042302, 'test_runtime': 9.272, 'test_samples_per_second': 869.503, 'test_steps_per_second': 54.357})
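
The raw logits above can be turned into a probability for the "Yes" (OOD) class. For two classes, softmax reduces to a sigmoid of the logit difference. The sketch below applies that identity to two rows copied from the output; `p_ood` is a hypothetical helper name.

```python
import math

def p_ood(logit_no: float, logit_yes: float) -> float:
    """P(class 1) for a 2-class softmax, computed as sigmoid(logit_yes - logit_no)."""
    diff = logit_yes - logit_no
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)  # avoids overflow for large negative differences
    return e / (1.0 + e)

rows = [(-8.234556, 8.849662), (8.516267, -8.687723)]  # from the prediction output
for logit_no, logit_yes in rows:
    prob = p_ood(logit_no, logit_yes)
    print(f"P(OOD) = {prob:.8f} -> predicted class {int(prob >= 0.5)}")
```

Logit gaps of around 17 correspond to probabilities extremely close to 0 or 1, so on this validation set the model is not only accurate but also very confident.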

In [None]:
# Get predicted labels
pred_labels = torch.argmax(torch.tensor(predictions.predictions), dim=1)

# Print classification report
report = classification_report(val_labels, pred_labels.numpy(), target_names=["No", "Yes"])
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

          No       1.00      1.00      1.00      3996
         Yes       1.00      1.00      1.00      4066

    accuracy                           1.00      8062
   macro avg       1.00      1.00      1.00      8062
weighted avg       1.00      1.00      1.00      8062
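
Because the report rounds everything to 1.00, it hides how many validation rows were actually misclassified. The counts below are not printed by the notebook. They are inferred from the unrounded evaluation metrics and the class supports (3,996 "No", 4,066 "Yes"), and the sketch checks that they reproduce the reported values. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Inferred counts: one false positive and one false negative out of 8,062 rows
metrics = binary_metrics(tp=4065, fp=1, fn=1, tn=3995)
for name, value in metrics.items():
    print(f"{name}: {value:.12f}")
```

These counts match the reported accuracy (0.99975192...) and the identical precision, recall, and F1 (0.99975405...), which suggests just two errors on the validation set.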



In [None]:
# Save the fine-tuned model
save_directory = '/content/drive/MyDrive/Chatbot_Query_Classifier/chatbot_query_classifier_distilbert_model'

model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

('/content/drive/MyDrive/Chatbot_Query_Classifier/chatbot_query_classifier_distilbert_model/tokenizer_config.json',
 '/content/drive/MyDrive/Chatbot_Query_Classifier/chatbot_query_classifier_distilbert_model/special_tokens_map.json',
 '/content/drive/MyDrive/Chatbot_Query_Classifier/chatbot_query_classifier_distilbert_model/vocab.txt',
 '/content/drive/MyDrive/Chatbot_Query_Classifier/chatbot_query_classifier_distilbert_model/added_tokens.json',
 '/content/drive/MyDrive/Chatbot_Query_Classifier/chatbot_query_classifier_distilbert_model/tokenizer.json')


# **Inference**

In [1]:
from google.colab import drive
drive.mount('/content/mydrive')

Mounted at /content/mydrive


In [10]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 🔁 Load the model and tokenizer from your saved directory
model_path = '/content/mydrive/MyDrive/Chatbot_Query_Classifier/chatbot_query_classifier_distilbert_model'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# 🔒 Set model to evaluation mode
model.eval()

# ✅ Set device (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# ✅ Inference function
def predict_OOD(query):
    inputs = tokenizer(query, return_tensors="pt", truncation=True, padding=True, max_length=256)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    predicted_class_id = torch.argmax(logits, dim=1).item()

    label_map = {0: "No", 1: "Yes"}  # 0 = in-domain, 1 = out-of-domain
    prediction_label = label_map[predicted_class_id]

    return prediction_label

In [11]:
# 🗨️ Run inference exactly TWICE
print("\n🤖 Please enter 2 queries:")

for i in range(2):
    user_input = input(f"[{i+1}/2] Your query: ")

    # Optional: Allow early exit
    if user_input.lower() in ["exit", "quit"]:
        print("👋 Exiting early. Goodbye!")
        break

    prediction = predict_OOD(user_input)
    print(f"📌 Prediction (Is OOD?): {prediction}\n")

print("✅ Done. Thank you!")


🤖 Please enter 2 queries:
[1/2] Your query: How can i cancel my ticket?
📌 Prediction (Is OOD?): No

[2/2] Your query: what is quantum physics?
📌 Prediction (Is OOD?): Yes

✅ Done. Thank you!



# **Inference with Fallback responses**

In [1]:
from google.colab import drive
drive.mount('/content/mydrive')

Mounted at /content/mydrive


In [2]:
import torch
import random
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# List of fallback responses for OOD (Yes) predictions
responses = [
    "I’m sorry, but I am unable to assist with this request. If you need help regarding event tickets, I’d be happy to support you.",
    "Apologies, but I am not able to provide assistance on this matter. Please let me know if you require help with event tickets.",
    "Unfortunately, I cannot assist with this. However, I am here to help with any event ticket-related concerns you may have.",
    "Regrettably, I am unable to assist with this request. If there's anything I can do regarding event tickets, feel free to ask.",
    "I regret that I am unable to assist in this case. Please reach out if you need support related to event tickets.",
    "Apologies, but this falls outside the scope of my support. I’m here if you need any help with event ticket issues.",
    "I'm sorry, but I cannot assist with this particular topic. If you have questions about event tickets, I’d be glad to help.",
    "I regret that I’m unable to provide assistance here. Please let me know how I can support you with event ticket matters.",
    "Unfortunately, I am not equipped to assist with this. If you need help with event tickets, I am here for that.",
    "I apologize, but I cannot help with this request. However, I’d be happy to assist with anything related to event tickets.",
    "I’m sorry, but I’m unable to support this request. If it’s about event tickets, I’ll gladly help however I can.",
    "This matter falls outside the assistance I can offer. Please let me know if you need help with event ticket-related inquiries.",
    "Regrettably, this is not something I can assist with. I’m happy to help with any event ticket questions you may have.",
    "I’m unable to provide support for this issue. However, I can assist with concerns regarding event tickets.",
    "I apologize, but I cannot help with this matter. If your inquiry is related to event tickets, I’d be more than happy to assist.",
    "I regret that I am unable to offer help in this case. I am, however, available for any event ticket-related questions.",
    "Unfortunately, I’m not able to assist with this. Please let me know if there’s anything I can do regarding event tickets.",
    "I'm sorry, but I cannot assist with this topic. However, I’m here to help with any event ticket concerns you may have.",
    "Apologies, but this request falls outside of my support scope. If you need help with event tickets, I’m happy to assist.",
    "I’m afraid I can’t help with this matter. If there’s anything related to event tickets you need, feel free to reach out.",
    "This is beyond what I can assist with at the moment. Let me know if there’s anything I can do to help with event tickets.",
    "Sorry, I’m unable to provide support on this issue. However, I’d be glad to assist with event ticket-related topics.",
    "Apologies, but I can’t assist with this. Please let me know if you have any event ticket inquiries I can help with.",
    "I’m unable to help with this matter. However, if you need assistance with event tickets, I’m here for you.",
    "Unfortunately, I can’t support this request. I’d be happy to assist with anything related to event tickets instead.",
    "I’m sorry, but I can’t help with this. If your concern is related to event tickets, I’ll do my best to assist.",
    "Apologies, but this issue is outside of my capabilities. However, I’m available to help with event ticket-related requests.",
    "I regret that I cannot assist with this particular matter. Please let me know how I can support you regarding event tickets.",
    "I’m sorry, but I’m not able to help in this instance. I am, however, ready to assist with any questions about event tickets.",
    "Unfortunately, I’m unable to help with this topic. Let me know if there's anything event ticket-related I can support you with."
]

# Load model and tokenizer
model_path = '/content/mydrive/MyDrive/Chatbot_Query_Classifier/chatbot_query_classifier_distilbert_model'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Set model to evaluation mode
model.eval()

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Prediction function
def predict_OOD(query: str):
    # Tokenize (same max_length as in training) and move input to device
    inputs = tokenizer(query, return_tensors="pt", truncation=True, padding=True, max_length=256)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get model output
    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    class_id = torch.argmax(logits, dim=1).item()

    return class_id  # 0 = No, 1 = Yes

In [3]:
for i in range(2):
    user_input = input(f"[{i+1}/2] Your query: ")

    if user_input.lower() in ['exit', 'quit']:
        print("👋 Exiting the chatbot. Have a great day!")
        break

    prediction = predict_OOD(user_input)

    if prediction == 1:
        # If OOD, show random polite fallback
        fallback_response = random.choice(responses)
        print(f"\n🔍 Prediction: Out of Domain ❌\n💬 Response: {fallback_response}\n")
    else:
        # If in-domain, respond accordingly
        print("\n🔍 Prediction: In Domain ✅\n💬 Response: This seems like a valid event ticket query. How can I assist you further?\n")

print("✅ Done. Thank you!")

[1/2] Your query: How can i sell my ticket?

🔍 Prediction: In Domain ✅
💬 Response: This seems like a valid event ticket query. How can I assist you further?

[2/2] Your query: Explain how large language model works?

🔍 Prediction: Out of Domain ❌
💬 Response: Apologies, but this request falls outside of my support scope. If you need help with event tickets, I’m happy to assist.

✅ Done. Thank you!
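
The branching in the loop above can also be pulled into a small function, which makes the reply logic testable without loading the model. This is a sketch under assumptions: `build_reply` and `sample_responses` are hypothetical names, the two sample strings are copied from the fallback list, and the seeded `random.Random` is there only to make the choice reproducible.

```python
import random

IN_DOMAIN_REPLY = "This seems like a valid event ticket query. How can I assist you further?"

def build_reply(class_id: int, fallback_responses: list, rng=random) -> str:
    """Return a random fallback for OOD (class 1), else the in-domain reply."""
    if class_id == 1:
        return rng.choice(fallback_responses)
    return IN_DOMAIN_REPLY

sample_responses = [
    "I regret that I am unable to assist in this case. Please reach out if you need support related to event tickets.",
    "Unfortunately, I am not equipped to assist with this. If you need help with event tickets, I am here for that.",
]

rng = random.Random(0)  # fixed seed so repeated runs pick the same fallback
print(build_reply(0, sample_responses, rng))
print(build_reply(1, sample_responses, rng))
```

Passing the random generator as an argument keeps the default behaviour random in production while letting tests fix the seed.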

