# üéì BERT to DistilBERT Knowledge Distillation Tutorial<div style="text-align: center; padding: 20px 0; margin-bottom: 30px; border-bottom: 3px solid #667eea;"><h1 style="color: #667eea; font-size: 2em; margin: 0; font-weight: bold; line-height: 1.2;">üéì BERT to DistilBERT</h1><h2 style="color: #764ba2; font-size: 1.5em; margin: 10px 0 0 0; font-weight: 600; line-height: 1.3;">Knowledge Distillation Tutorial</h2></div><div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 10px; color: white; margin: 20px 0;"><h2 style="color: white; margin-top: 0; font-size: 1.3em; font-weight: 600;">üìö Overview</h2><p style="font-size: 1em; line-height: 1.6; color: white; margin-bottom: 0;">This notebook demonstrates <strong>Knowledge Distillation</strong> - a powerful technique to compress a large, accurate model (teacher) into a smaller, faster model (student) while preserving performance.</p></div>## üéØ What is Knowledge Distillation?<div style="background-color: #f8f9fa; padding: 15px; border-left: 4px solid #667eea; margin: 15px 0;"><ul style="margin: 0; padding-left: 20px; color: #212529;"><li style="color: #212529;"><strong style="color: #000;">üë®‚Äçüè´ Teacher Model</strong>: Large, pre-trained BERT model fine-tuned on SST-2 (sentiment analysis)</li><li style="color: #212529;"><strong style="color: #000;">üë®‚Äçüéì Student Model</strong>: Smaller DistilBERT model that learns from the teacher</li><li style="color: #212529;"><strong style="color: #000;">üéØ Goal</strong>: Transfer the teacher's knowledge to the student, achieving similar accuracy with ~60% fewer parameters</li></ul></div>## üîë Key Concepts<div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(250px, 1fr)); gap: 15px; margin: 20px 0;"><div style="background: #e3f2fd; padding: 15px; border-radius: 8px; border-top: 3px solid #2196f3;"><h3 style="margin-top: 0; color: #1976d2; font-size: 1.1em; font-weight: 600;">üå°Ô∏è Temperature Scaling</h3><p style="margin-bottom: 0; color: #212529; font-size: 0.95em;">Softens probability distributions to reveal "dark knowledge"</p></div><div style="background: #f3e5f5; padding: 15px; border-radius: 8px; border-top: 3px solid #9c27b0;"><h3 style="margin-top: 0; color: #7b1fa2; font-size: 1.1em; font-weight: 600;">üìä KL Divergence Loss</h3><p style="margin-bottom: 0; color: #212529; font-size: 0.95em;">Measures how well student matches teacher's predictions</p></div><div style="background: #fff3e0; padding: 15px; border-radius: 8px; border-top: 3px solid #ff9800;"><h3 style="margin-top: 0; color: #e65100; font-size: 1.1em; font-weight: 600;">‚öñÔ∏è Combined Loss</h3><p style="margin-bottom: 0; color: #212529; font-size: 0.95em;">Balances hard labels (ground truth) and soft labels (teacher predictions)</p></div></div>## üìã Notebook Structure<div style="background-color: #fff; border: 2px solid #e0e0e0; border-radius: 8px; padding: 15px; margin: 15px 0;"><ol style="margin: 0; padding-left: 20px; color: #212529;"><li style="color: #212529;"><strong style="color: #000;">üîß Setup</strong>: Install dependencies and load models</li><li style="color: #212529;"><strong style="color: #000;">üß† Distillation</strong>: Custom trainer implementing knowledge distillation loss</li><li style="color: #212529;"><strong style="color: #000;">üöÄ Training</strong>: Train student model using teacher's soft predictions</li><li style="color: #212529;"><strong style="color: #000;">üìà Evaluation</strong>: Benchmark teacher vs student performance</li><li style="color: #212529;"><strong style="color: #000;">‚òÅÔ∏è Deployment</strong>: Upload distilled model to Hugging Face Hub</li></ol></div><div style="background-color: #d4edda; border-left: 4px solid #28a745; padding: 12px; margin: 20px 0;"><strong style="color: #155724;">üí° Tip:</strong> <span style="color: #155724;">Run cells sequentially to complete the distillation pipeline.</span></div>

In [1]:
!pip install transformers datasets accelerate evaluate torch

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m84.1/84.1 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.6


In [15]:
!pip install bitsandbytes accelerate scipy

Collecting bitsandbytes
  Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl (59.1 MB)
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m59.1/59.1 MB[0m [31m13.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: bitsandbytes
Successfully installed bitsandbytes-0.49.0


## üì• Step 1: Load Teacher and Student Models<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 15px; border-radius: 8px; color: white; margin-bottom: 20px;"><h3 style="color: white; margin-top: 0; font-size: 1.2em; font-weight: 600;">üë®‚Äçüè´ Teacher Model: BERT-base-uncased (Fine-tuned on SST-2)</h3></div><div style="background-color: #f5f5f5; padding: 15px; border-radius: 8px; margin: 15px 0;"><ul style="margin: 0; padding-left: 20px; color: #212529;"><li style="color: #212529;"><strong style="color: #000;">Model</strong>: <code style="background: #fff; padding: 2px 6px; border-radius: 3px; color: #d32f2f;">textattack/bert-base-uncased-SST-2</code></li><li style="color: #212529;"><strong style="color: #000;">Parameters</strong>: <span style="color: #d32f2f; font-weight: bold;">~110M parameters</span></li><li style="color: #212529;"><strong style="color: #000;">Purpose</strong>: Pre-trained and fine-tuned on Stanford Sentiment Treebank (SST-2)</li><li style="color: #212529;"><strong style="color: #000;">Role</strong>: Provides "soft labels" (probability distributions) instead of just hard labels</li></ul></div><div style="background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); padding: 15px; border-radius: 8px; color: white; margin-bottom: 20px;"><h3 style="color: white; margin-top: 0; font-size: 1.2em; font-weight: 600;">üë®‚Äçüéì Student Model: DistilBERT-base-uncased</h3></div><div style="background-color: #f5f5f5; padding: 15px; border-radius: 8px; margin: 15px 0;"><ul style="margin: 0; padding-left: 20px; color: #212529;"><li style="color: #212529;"><strong style="color: #000;">Model</strong>: <code style="background: #fff; padding: 2px 6px; border-radius: 3px; color: #388e3c;">distilbert-base-uncased</code></li><li style="color: #212529;"><strong style="color: #000;">Parameters</strong>: <span style="color: #388e3c; font-weight: bold;">~67M parameters</span> (<span style="color: #d32f2f;">~60% smaller</span>)</li><li style="color: #212529;"><strong style="color: #000;">Purpose</strong>: Smaller, faster model that will learn from teacher</li><li style="color: #212529;"><strong style="color: #000;">Initialization</strong>: Starts with generic pre-trained weights, not fine-tuned</li></ul></div>### ü§î Why This Pair?<div style="background-color: #e3f2fd; padding: 15px; border-left: 4px solid #2196f3; margin: 15px 0;"><ul style="margin: 0; padding-left: 20px; color: #212529;"><li>DistilBERT is architecturally similar to BERT but with fewer layers</li><li>Both use the same tokenizer, making knowledge transfer easier</li><li>Size reduction enables <strong style="color: #000;">faster inference</strong> and <strong style="color: #000;">lower memory usage</strong></li></ul></div>

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 1. Load the Teacher (Already fine-tuned on SST-2)
teacher_id = "textattack/bert-base-uncased-SST-2"
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_id)
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_id)

# 2. Load the Student (Smaller, generic DistilBERT)
student_id = "distilbert-base-uncased"
student_model = AutoModelForSequenceClassification.from_pretrained(
    student_id,
    num_labels=2,
    id2label=teacher_model.config.id2label,
    label2id=teacher_model.config.label2id
)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
teacher_model.to(device)
student_model.to(device)

print(f"Teacher parameters: {teacher_model.num_parameters():,}")
print(f"Student parameters: {student_model.num_parameters():,}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Teacher parameters: 109,483,778
Student parameters: 66,955,010


## üõ†Ô∏è Step 2: Custom Distillation Trainer<div style="background-color: #fff3e0; padding: 20px; border-radius: 10px; border: 2px solid #ff9800; margin: 20px 0;"><h3 style="color: #e65100; margin-top: 0; font-size: 1.3em; font-weight: 600;">DistillationTrainer Class</h3><p style="color: #212529; margin-bottom: 0;">This custom trainer extends Hugging Face's <code style="background: #f5f5f5; padding: 2px 6px; border-radius: 3px; color: #000;">Trainer</code> to implement knowledge distillation.</p></div>### üîß Key Components:<div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(280px, 1fr)); gap: 15px; margin: 20px 0;"><div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 15px; border-radius: 8px; color: white;"><h4 style="color: white; margin-top: 0; font-size: 1.1em; font-weight: 600;">üå°Ô∏è Temperature Scaling</h4><p style="margin-bottom: 5px; color: white; font-size: 0.9em;"><code style="background: rgba(255,255,255,0.2); padding: 2px 6px; border-radius: 3px; color: white;">temperature=2.0</code></p><ul style="margin: 0; padding-left: 20px; font-size: 0.9em; color: white;"><li>Divides logits by temperature before softmax</li><li>Higher temperature = softer probability distributions</li><li>Reveals relationships between classes</li></ul></div><div style="background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); padding: 15px; border-radius: 8px; color: white;"><h4 style="color: white; margin-top: 0; font-size: 1.1em; font-weight: 600;">üìä KL Divergence Loss</h4><p style="margin-bottom: 5px; color: white; font-size: 0.9em;"><code style="background: rgba(255,255,255,0.2); padding: 2px 6px; border-radius: 3px; color: white;">loss_distill</code></p><ul style="margin: 0; padding-left: 20px; font-size: 0.9em; color: white;"><li>Measures student-teacher distribution match</li><li>Formula: <code style="background: rgba(255,255,255,0.2); color: white;">KL(student_softmax || teacher_softmax)</code></li><li>Multiplied by <code style="background: rgba(255,255,255,0.2); color: white;">temperature¬≤</code> to scale back</li></ul></div><div style="background: linear-gradient(135deg, #4facfe 0%, #00f2fe 100%); padding: 15px; border-radius: 8px; color: white;"><h4 style="color: white; margin-top: 0; font-size: 1.1em; font-weight: 600;">‚öñÔ∏è Combined Loss</h4><p style="margin-bottom: 5px; color: white; font-size: 0.9em;"><code style="background: rgba(255,255,255,0.2); padding: 2px 6px; border-radius: 3px; color: white;">alpha=0.5</code></p><ul style="margin: 0; padding-left: 20px; font-size: 0.9em; color: white;"><li><code style="background: rgba(255,255,255,0.2); color: white;">loss = Œ± √ó loss_ce + (1-Œ±) √ó loss_distill</code></li><li>Balances hard labels vs teacher predictions</li><li>Typical range: 0.3-0.7</li></ul></div></div><div style="background-color: #d4edda; border-left: 4px solid #28a745; padding: 12px; margin: 20px 0;"><strong style="color: #155724;">üîí Why Freeze Teacher?</strong> <span style="color: #155724;">Teacher weights are frozen - only student learns. Teacher provides guidance without being modified.</span></div>

In [3]:
from transformers import Trainer, TrainingArguments

class DistillationTrainer(Trainer):
    def __init__(self, *args, teacher_model=None, temperature=2.0, alpha=0.5, **kwargs):
        super().__init__(*args, **kwargs)
        self.teacher_model = teacher_model
        self.temperature = temperature
        self.alpha = alpha
        # Freeze teacher weights (we only learn from them, we don't update them)
        self.teacher_model.eval()
        self.teacher_model.requires_grad_(False)

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        # 1. Forward pass student
        outputs_student = model(**inputs)
        student_logits = outputs_student.logits

        # 2. Forward pass teacher (with no gradient tracking for efficiency)
        with torch.no_grad():
            outputs_teacher = self.teacher_model(**inputs)
            teacher_logits = outputs_teacher.logits

        # 3. Calculate "Dark Knowledge" Loss (KL Divergence)
        # We soften the logits using the Temperature (T)
        loss_distill = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            F.softmax(teacher_logits / self.temperature, dim=1),
            reduction='batchmean',
        ) * (self.temperature ** 2)

        # 4. Calculate Standard Loss (Cross Entropy with Ground Truth)
        # Note: 'labels' are automatically handled by the model's internal loss calculation if present
        loss_ce = outputs_student.loss

        # 5. Combine them
        # alpha controls how much we trust the hard labels vs the teacher
        loss = (self.alpha * loss_ce) + ((1 - self.alpha) * loss_distill)

        return (loss, outputs_student) if return_outputs else loss

## üöÄ Step 3: Training with Knowledge Distillation<div style="background: linear-gradient(135deg, #fa709a 0%, #fee140 100%); padding: 20px; border-radius: 10px; margin: 20px 0;"><h3 style="margin-top: 0; color: #212529; font-size: 1.3em; font-weight: 600;">üìä Dataset: SST-2 (Stanford Sentiment Treebank)</h3><ul style="margin: 0; padding-left: 20px; color: #212529;"><li style="color: #212529;"><strong style="color: #000;">Task</strong>: Binary sentiment classification (positive/negative)</li><li style="color: #212529;"><strong style="color: #000;">Samples</strong>: ~67K training, 872 validation, 1.8K test</li><li style="color: #212529;"><strong style="color: #000;">Format</strong>: Movie review sentences with sentiment labels</li></ul></div>### üîÑ Training Process:<div style="background-color: #f5f5f5; padding: 15px; border-radius: 8px; margin: 15px 0;"><ol style="margin: 0; padding-left: 20px; color: #212529;"><li style="color: #212529;"><strong style="color: #000;">Tokenization</strong>: Convert sentences to token IDs (max_length=128)</li><li style="color: #212529;"><strong style="color: #000;">Forward Pass</strong>:    <ul style="color: #212529;">   <li>Student processes inputs ‚Üí student logits</li>   <li>Teacher processes inputs (no gradients) ‚Üí teacher logits</li>   </ul></li><li style="color: #212529;"><strong style="color: #000;">Loss Calculation</strong>:   <ul style="color: #212529;">   <li>Apply temperature scaling to both logits</li>   <li>Compute KL divergence between distributions</li>   <li>Compute cross-entropy with ground truth</li>   <li>Combine losses with alpha weighting</li>   </ul></li><li style="color: #212529;"><strong style="color: #000;">Backward Pass</strong>: Update only student model weights</li></ol></div>### ‚öôÔ∏è Training Configuration:<div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 10px; margin: 20px 0;"><div style="background: #e3f2fd; padding: 10px; border-radius: 5px; text-align: center;"><div style="font-size: 1.5em; font-weight: bold; color: #1976d2;">3</div><div style="font-size: 0.9em; color: #212529;">Epochs</div></div><div style="background: #f3e5f5; padding: 10px; border-radius: 5px; text-align: center;"><div style="font-size: 1.5em; font-weight: bold; color: #7b1fa2;">32</div><div style="font-size: 0.9em; color: #212529;">Batch Size</div></div><div style="background: #fff3e0; padding: 10px; border-radius: 5px; text-align: center;"><div style="font-size: 1.5em; font-weight: bold; color: #e65100;">2e-5</div><div style="font-size: 0.9em; color: #212529;">Learning Rate</div></div><div style="background: #e8f5e9; padding: 10px; border-radius: 5px; text-align: center;"><div style="font-size: 1.5em; font-weight: bold; color: #388e3c;">2.0</div><div style="font-size: 0.9em; color: #212529;">Temperature</div></div></div>### üìà Expected Results:<div style="background-color: #d4edda; border-left: 4px solid #28a745; padding: 15px; margin: 20px 0;"><ul style="margin: 0; padding-left: 20px; color: #155724;"><li>Student should achieve <strong style="color: #155724;">~90-92% accuracy</strong> (close to teacher's ~93%)</li><li>Model size: <strong style="color: #155724;">~67M params</strong> vs teacher's ~110M params</li><li>Inference speed: <strong style="color: #155724;">~2x faster</strong> than teacher</li></ul></div>

In [5]:
from datasets import load_dataset
import evaluate
import numpy as np

# Load Data
dataset = load_dataset("glue", "sst2")

# Tokenize
def tokenize_function(examples):
    return teacher_tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Define Metrics
accuracy_metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Training Arguments
training_args = TrainingArguments(
    output_dir="./distilled-bert-sst2",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
    eval_strategy="epoch",
    logging_steps=50,
)

# Initialize our Custom Trainer
trainer = DistillationTrainer(
    model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=teacher_tokenizer,
    compute_metrics=compute_metrics,
    temperature=4.0,  # Soften the probability distribution
    alpha=0.5         # 50% Hard Labels, 50% Teacher Knowledge
)

# Train!
trainer.train()

  super().__init__(*args, **kwargs)
  | |_| | '_ \/ _` / _` |  _/ -_)
[34m[1mwandb[0m: (1) Create a W&B account
[34m[1mwandb[0m: (2) Use an existing W&B account
[34m[1mwandb[0m: (3) Don't visualize my results
[34m[1mwandb[0m: Enter your choice:

 2


[34m[1mwandb[0m: You chose 'Use an existing W&B account'
[34m[1mwandb[0m: Logging into https://api.wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: Find your API key here: https://wandb.ai/authorize?ref=models
[34m[1mwandb[0m: Paste an API key from your profile and hit enter:

 ¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mharsha90145[0m ([33mharsha90145-university-of-alabama-at-birmingham[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2634,0.346283,0.900229
2,0.1664,0.335335,0.900229
3,0.0981,0.333426,0.90367


TrainOutput(global_step=6315, training_loss=0.24634105479632307, metrics={'train_runtime': 3400.9809, 'train_samples_per_second': 59.408, 'train_steps_per_second': 1.857, 'total_flos': 6691160124062208.0, 'train_loss': 0.24634105479632307, 'epoch': 3.0})

In [12]:
student_model.save_pretrained("./distilled-bert-sst2")

In [13]:
import torch
import torch.nn.functional as F

def predict(text, model, tokenizer, device="cpu"):
    # 1. Prepare the model
    model.to(device)
    model.eval()

    # 2. Tokenize
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

    # --- THE FIX IS HERE ---
    # BERT tokenizers create 'token_type_ids', but DistilBERT crashes if it sees them.
    # We simply remove them from the dictionary if they exist.
    if "token_type_ids" in inputs:
        del inputs["token_type_ids"]

    # Move remaining inputs to device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # 3. Run Inference
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # 4. Convert to Probabilities
    probabilities = F.softmax(logits, dim=1)

    # 5. Format Output
    prediction_id = torch.argmax(probabilities, dim=1).item()
    confidence = probabilities[0][prediction_id].item()

    label_map = {0: "NEGATIVE", 1: "POSITIVE"}
    label_name = label_map[prediction_id]

    return f"Label: {label_name} | Confidence: {confidence:.2%}"

# --- Re-run Test Cases ---
# Make sure model is on CPU for this quick test
student_model.to("cpu")

sample_1 = "The movie was overly long and the plot was confusing."
sample_2 = "An absolute masterpiece that I would watch again in a heartbeat."
sample_3 = "It was okay, not great but not terrible either."

print(f"Input: '{sample_1}'\n -> {predict(sample_1, student_model, teacher_tokenizer)}")
print(f"\nInput: '{sample_2}'\n -> {predict(sample_2, student_model, teacher_tokenizer)}")
print(f"\nInput: '{sample_3}'\n -> {predict(sample_3, student_model, teacher_tokenizer)}")

Input: 'The movie was overly long and the plot was confusing.'
 -> Label: NEGATIVE | Confidence: 99.89%

Input: 'An absolute masterpiece that I would watch again in a heartbeat.'
 -> Label: POSITIVE | Confidence: 99.97%

Input: 'It was okay, not great but not terrible either.'
 -> Label: POSITIVE | Confidence: 94.06%


## üìä Step 4: Evaluation and Benchmarking<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 15px; border-radius: 8px; color: white; margin-bottom: 20px;"><h3 style="color: white; margin-top: 0; font-size: 1.2em; font-weight: 600;">Benchmarking Function</h3><p style="margin-bottom: 0; color: white; font-size: 0.95em;">Compares multiple model variants across key metrics</p></div>### üìè Metrics Measured:<div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 10px; margin: 20px 0;"><div style="background: #e3f2fd; padding: 12px; border-radius: 5px; text-align: center;"><div style="font-size: 1.2em; font-weight: bold;">üìà</div><div style="color: #212529; font-weight: 600;">Accuracy</div><div style="font-size: 0.85em; color: #666;">Classification accuracy</div></div><div style="background: #f3e5f5; padding: 12px; border-radius: 5px; text-align: center;"><div style="font-size: 1.2em; font-weight: bold;">üíæ</div><div style="color: #212529; font-weight: 600;">Model Size</div><div style="font-size: 0.85em; color: #666;">Disk size in MB</div></div><div style="background: #fff3e0; padding: 12px; border-radius: 5px; text-align: center;"><div style="font-size: 1.2em; font-weight: bold;">‚ö°</div><div style="color: #212529; font-weight: 600;">Latency</div><div style="font-size: 0.85em; color: #666;">Inference time (ms)</div></div></div>### üî¨ Models Evaluated:<div style="background-color: #f5f5f5; padding: 15px; border-radius: 8px; margin: 15px 0;"><ol style="margin: 0; padding-left: 20px; color: #212529;"><li style="color: #212529;"><strong style="color: #000;">üë®‚Äçüè´ Teacher (BERT)</strong>: Baseline - original fine-tuned model</li><li style="color: #212529;"><strong style="color: #000;">üë®‚Äçüéì Student (FP32)</strong>: Distilled model in full precision</li><li style="color: #212529;"><strong style="color: #000;">‚ö° Student (INT8)</strong>: Quantized to 8-bit integers for CPU inference</li><li style="color: #212529;"><strong style="color: #000;">üíæ Student (4-bit)</strong>: Quantized using BitsAndBytes for GPU memory efficiency</li><li style="color: #212529;"><strong style="color: #000;">üî¨ DistilBERT (Raw)</strong>: Untrained baseline (should be ~50% accuracy)</li></ol></div><div style="background-color: #fff3cd; border-left: 4px solid #ffc107; padding: 12px; margin: 20px 0;"><strong style="color: #856404;">üîß Key Fix:</strong> <span style="color: #856404;">DistilBERT doesn't use <code style="background: #fff; padding: 2px 6px; border-radius: 3px; color: #000;">token_type_ids</code> (unlike BERT). Must remove this input to avoid errors.</span></div>### üìä Expected Performance:<table style="width: 100%; border-collapse: collapse; margin: 20px 0;"><tr style="background-color: #667eea; color: white;"><th style="padding: 10px; text-align: left; color: white; font-weight: 600;">Model</th><th style="padding: 10px; text-align: center; color: white; font-weight: 600;">Accuracy</th><th style="padding: 10px; text-align: center; color: white; font-weight: 600;">Size</th><th style="padding: 10px; text-align: center; color: white; font-weight: 600;">Speed</th></tr><tr style="background-color: #f5f5f5;"><td style="padding: 10px; color: #212529;"><strong>Teacher</strong></td><td style="padding: 10px; text-align: center; color: #212529;">~93%</td><td style="padding: 10px; text-align: center; color: #212529;">~440MB</td><td style="padding: 10px; text-align: center; color: #212529;">Baseline</td></tr><tr><td style="padding: 10px; color: #212529;"><strong>Student (FP32)</strong></td><td style="padding: 10px; text-align: center; color: #212529;">~90-92%</td><td style="padding: 10px; text-align: center; color: #212529;">~268MB</td><td style="padding: 10px; text-align: center; color: #212529;">~2x faster</td></tr><tr style="background-color: #f5f5f5;"><td style="padding: 10px; color: #212529;"><strong>Student (INT8)</strong></td><td style="padding: 10px; text-align: center; color: #212529;">Similar</td><td style="padding: 10px; text-align: center; color: #212529;">~67MB</td><td style="padding: 10px; text-align: center; color: #212529;">Fastest CPU</td></tr><tr><td style="padding: 10px; color: #212529;"><strong>Student (4-bit)</strong></td><td style="padding: 10px; text-align: center; color: #212529;">Similar</td><td style="padding: 10px; text-align: center; color: #212529;">~34MB</td><td style="padding: 10px; text-align: center; color: #212529;">Lowest VRAM</td></tr></table>

In [None]:
# !pip install transformers datasets accelerate evaluate torch bitsandbytes scipy numpy

import torch
import time
import os
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    BitsAndBytesConfig
)

# ==========================================
# 1. SETUP
# ==========================================
student_path = "./distilled_student_saved"  # Change this if your model is elsewhere
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Main Hardware: {device}")

# Check if model exists, otherwise warn user
if not os.path.exists(student_path):
    print(f"WARNING: Path '{student_path}' not found. Using generic DistilBERT for demo.")
    student_path = "distilbert-base-uncased-finetuned-sst-2-english"

# Load Data
dataset = load_dataset("glue", "sst2")
val_dataset = dataset["validation"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")

# ==========================================
# 2. BENCHMARK FUNCTION (With Bug Fix)
# ==========================================
def run_benchmark(model, name, device_type, dataset):
    print(f"\n--- Benchmarking: {name} ({device_type}) ---")

    # A. Handle Device Placement
    try:
        if device_type == "cuda" and not hasattr(model, "hf_device_map"):
            model.to("cuda")
        elif device_type == "cpu":
            model.to("cpu")
    except:
        pass # 4-bit models manage their own device map

    model.eval()

    # B. Measure Size
    try:
        torch.save(model.state_dict(), "temp.p")
        size_mb = os.path.getsize("temp.p") / (1024 * 1024)
        os.remove("temp.p")
    except:
        size_mb = 0.0 # 4-bit models/quantized models often fail state_dict save

    # C. Inference Loop
    latencies = []
    print(f"Evaluating {len(dataset)} items...", end="")

    for i, example in enumerate(dataset):
        inputs = tokenizer(example["sentence"], return_tensors="pt", truncation=True, max_length=128)

        # --- CRITICAL FIX FOR DISTILBERT ---
        if "token_type_ids" in inputs:
            # Check strictly if model class contains 'DistilBert'
            if "DistilBert" in type(model).__name__ or "DistilBert" in getattr(model.config, "architectures", [""])[0]:
                del inputs["token_type_ids"]

        # Move to device
        if device_type == "cuda":
            inputs = {k: v.cuda() for k, v in inputs.items()}

        # Timing
        start = time.time()
        with torch.no_grad():
            outputs = model(**inputs)
        end = time.time()

        latencies.append((end - start) * 1000)

        # Accuracy
        preds = torch.argmax(outputs.logits, dim=-1)
        metric.add(prediction=preds, reference=example["label"])

        if i % 200 == 0: print(".", end="")

    final_acc = metric.compute()['accuracy'] * 100
    avg_lat = np.mean(latencies)

    print(" Done.")
    return {"Model": name, "Type": device_type, "Acc": final_acc, "Size": size_mb, "Lat": avg_lat}

# ==========================================
# 3. RUN EVALUATIONS
# ==========================================
results_table = []

# A. TEACHER (BERT) - Baseline
print("\nLoading Teacher...")
teacher = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-SST-2")
results_table.append(run_benchmark(teacher, "Teacher (BERT)", "cuda", val_dataset))
del teacher
torch.cuda.empty_cache()

# B. STUDENT (FP32) - Standard Distillation Result
print("\nLoading Student (FP32)...")
student_fp = AutoModelForSequenceClassification.from_pretrained(student_path)
results_table.append(run_benchmark(student_fp, "Student (FP32)", "cuda", val_dataset))

# C. STUDENT (INT8) - CPU Speed Optimized
print("\nQuantizing Student (INT8)...")
student_cpu = student_fp.to("cpu")
student_int8 = torch.quantization.quantize_dynamic(
    student_cpu, {torch.nn.Linear}, dtype=torch.qint8
)
results_table.append(run_benchmark(student_int8, "Student (INT8)", "cpu", val_dataset))
del student_fp, student_int8
torch.cuda.empty_cache()

# D. STUDENT (4-BIT) - VRAM Optimized (BitsAndBytes)
print("\nLoading Student (4-bit)...")
try:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ) 
    student_4bit = AutoModelForSequenceClassification.from_pretrained(
        student_path, quantization_config=bnb_config, device_map="auto"
    )
    results_table.append(run_benchmark(student_4bit, "Student (4-bit)", "cuda", val_dataset))
except ImportError:
    print("Skipping 4-bit eval: bitsandbytes not installed.")

# E. BASELINE (Raw DistilBERT) - The Control Group
# This model has NOT been fine-tuned on SST-2. Accuracy should be ~50%.
print("\nLoading Raw DistilBERT (Baseline)...")
raw_student = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
# We run this on CUDA to be fair to the other FP32 models
results_table.append(run_benchmark(raw_student, "DistilBERT (Raw)", "cuda", val_dataset))
del raw_student
torch.cuda.empty_cache()

# ==========================================
# 4. FINAL TABLE
# ==========================================
print("\n" + "="*80)
print(f"{'Model':<20} | {'Type':<6} | {'Acc %':<6} | {'Size MB':<8} | {'Lat ms':<8} | {'Notes'}")
print("-" * 80)

for r in results_table:
    note = ""
    if "4-bit" in r['Model']: note = "Low VRAM"
    elif "INT8" in r['Model']: note = "Fastest CPU"
    elif "Teacher" in r['Model']: note = "Teacher"
    elif "Raw" in r['Model']: note = "Untrained Base"
    else: note = "Distilled Student"

    print(f"{r['Model']:<20} | {r['Type']:<6} | {r['Acc']:<6.2f} | {r['Size']:<8.1f} | {r['Lat']:<8.2f} | {note}")
print("="*80)

Main Hardware: cuda

Loading Teacher...

--- Benchmarking: Teacher (BERT) (cuda) ---
Evaluating 872 items........ Done.

Loading Student (FP32)...

--- Benchmarking: Student (FP32) (cuda) ---
Evaluating 872 items........ Done.

Quantizing Student (INT8)...


For migrations of users: 
1. Eager mode quantization (torch.ao.quantization.quantize, torch.ao.quantization.quantize_dynamic), please migrate to use torchao eager mode quantize_ API instead 
2. FX graph mode quantization (torch.ao.quantization.quantize_fx.prepare_fx,torch.ao.quantization.quantize_fx.convert_fx, please migrate to use torchao pt2e quantization API instead (prepare_pt2e, convert_pt2e) 
3. pt2e quantization has been migrated to torchao (https://github.com/pytorch/ao/tree/main/torchao/quantization/pt2e) 
see https://github.com/pytorch/ao/issues/2259 for more details
  student_int8 = torch.quantization.quantize_dynamic(



--- Benchmarking: Student (INT8) (cpu) ---
Evaluating 872 items........

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


 Done.

Loading Student (4-bit)...
Skipping 4-bit eval: bitsandbytes not installed.

Loading Raw DistilBERT (Baseline)...

--- Benchmarking: DistilBERT (Raw) (cuda) ---
Evaluating 872 items........ Done.

Model                | Type   | Acc %  | Size MB  | Lat ms   | Notes
--------------------------------------------------------------------------------
Teacher (BERT)       | cuda   | 92.43  | 417.7    | 11.86    | Teacher
Student (FP32)       | cuda   | 90.37  | 255.4    | 4.36     | Distilled Student
Student (INT8)       | cpu    | 90.94  | 132.3    | 22.11    | Fastest CPU
DistilBERT (Raw)     | cuda   | 49.08  | 255.4    | 4.29     | Untrained Base


## ‚òÅÔ∏è Step 5: Upload to Hugging Face Hub<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 10px; color: white; margin: 20px 0;"><h3 style="color: white; margin-top: 0; font-size: 1.3em; font-weight: 600;">Model Deployment</h3><p style="color: white; margin-bottom: 0;">After successful distillation, upload the student model to Hugging Face Hub for:</p></div><div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 15px; margin: 20px 0;"><div style="background: #e3f2fd; padding: 15px; border-radius: 8px; border-top: 3px solid #2196f3;"><h4 style="margin-top: 0; color: #1976d2; font-size: 1.1em; font-weight: 600;">üîó Sharing</h4><p style="margin-bottom: 0; color: #212529;">Make model publicly available</p></div><div style="background: #f3e5f5; padding: 15px; border-radius: 8px; border-top: 3px solid #9c27b0;"><h4 style="margin-top: 0; color: #7b1fa2; font-size: 1.1em; font-weight: 600;">üìù Versioning</h4><p style="margin-bottom: 0; color: #212529;">Track model iterations</p></div><div style="background: #fff3e0; padding: 15px; border-radius: 8px; border-top: 3px solid #ff9800;"><h4 style="margin-top: 0; color: #e65100; font-size: 1.1em; font-weight: 600;">üîå Integration</h4><p style="margin-bottom: 0; color: #212529;">Easy loading with <code style="background: #f5f5f5; padding: 2px 6px; border-radius: 3px; color: #000;">from_pretrained()</code></p></div><div style="background: #e8f5e9; padding: 15px; border-radius: 8px; border-top: 3px solid #4caf50;"><h4 style="margin-top: 0; color: #388e3c; font-size: 1.1em; font-weight: 600;">üìö Documentation</h4><p style="margin-bottom: 0; color: #212529;">Add model card with performance metrics</p></div></div>### üìã Upload Process:<div style="background-color: #f5f5f5; padding: 15px; border-radius: 8px; margin: 15px 0;"><ol style="margin: 0; padding-left: 20px; color: #212529;"><li style="color: #212529;"><strong style="color: #000;">üîê Login</strong>: Authenticate with Hugging Face token</li><li style="color: #212529;"><strong style="color: #000;">‚¨ÜÔ∏è Push</strong>: Upload model files (config.json, model weights, tokenizer)</li><li style="color: #212529;"><strong style="color: #000;">‚úÖ Verify</strong>: Check model appears on your Hugging Face profile</li></ol></div>### üìù Model Card Best Practices:<div style="background-color: #d4edda; border-left: 4px solid #28a745; padding: 15px; margin: 20px 0;"><ul style="margin: 0; padding-left: 20px; color: #155724;"><li>Document distillation parameters (temperature, alpha)</li><li>Report accuracy metrics (teacher vs student)</li><li>Include inference benchmarks</li><li>Note model size and speed improvements</li><li>Specify use cases and limitations</li></ul></div>

In [21]:
from huggingface_hub import notebook_login

# This will create a widget where you paste your token
notebook_login()

In [22]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# 1. Configuration
# =====================================================
local_model_path = "./distilled_student_saved"  # Where we saved it earlier
repo_name = "distilbert-sst2-student"         # The name you want on the Hub
username = "Harsha901"                   # OPTIONAL: It usually detects this auto-magically

# 2. Load the Model & Tokenizer
# =====================================================
print(f"Loading model from {local_model_path}...")

# Load the model (FP32 version is best for the Hub so others can quantize it themselves)
model = AutoModelForSequenceClassification.from_pretrained(local_model_path)
tokenizer = AutoTokenizer.from_pretrained(local_model_path)

# 3. Push to Hub
# =====================================================
print(f"Pushing to Hugging Face Hub: {repo_name}...")

# This pushes the weights, config, and vocabulary
# It will create the repo if it doesn't exist
model.push_to_hub(repo_name)
tokenizer.push_to_hub(repo_name)

print(f"\nSuccess! Your model is live at:")
print(f"https://huggingface.co/{username}/{repo_name}")

Loading model from ./distilled_student_saved...
Pushing to Hugging Face Hub: distilbert-sst2-student...



Success! Your model is live at:
https://huggingface.co/Harsha901/distilbert-sst2-student
