<a href="https://colab.research.google.com/github/Pankaj-Kharkwal/ProjectEuler/blob/master/Knowledge_Distillation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Knowledge Distillation Trainer Implementation

## Step 1: Define Hyperparameters for Knowledge Distillation

Knowledge distillation adds two hyperparameters, `α` and `T`, to the usual training setup.

- **α (alpha)**: Sets the balance between the two loss terms. In the trainer used here, the student's cross-entropy loss is weighted by `α` and the distillation loss by `1 - α`.
- **T (temperature)**: Determines how strongly the output probability distributions are smoothed before they are compared; higher values give softer distributions.
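
To make the effect of the temperature concrete, here is a minimal sketch in plain Python. The helper `softmax_with_temperature` and the values in `toy_logits` are illustrative assumptions, not part of the notebook; the point is only that dividing the logits by a larger `T` flattens the resulting distribution.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes
toy_logits = [4.0, 1.0, 0.5]

for T in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(toy_logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

As `T` grows, the probability of the top class shrinks and the remaining classes receive more mass, which is the extra signal the student learns from.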

### BERT-based Teacher Model
We'll use the fine-tuned BERT-base model as the teacher model for distillation.


In [None]:
!pip install datasets "transformers[torch]" evaluate "accelerate>=0.20.1"

In [None]:
from transformers import TrainingArguments

In [None]:
class KnowledgeDistillationTrainingArguments(TrainingArguments):
  def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
    # *args collects any positional arguments and **kwargs collects any keyword
    # arguments (as a dictionary), so everything that is not alpha or temperature
    # is forwarded unchanged to TrainingArguments.
    super().__init__(*args, **kwargs)
    # Store the two distillation hyperparameters on the arguments object
    self.alpha = alpha
    self.temperature = temperature
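
The subclassing pattern used here (peel off the custom keyword arguments, forward everything else to the parent) can be tried without `transformers` installed. In this sketch, `BaseArguments` and `DistillationArguments` are hypothetical stand-ins for `TrainingArguments` and the subclass defined in the notebook.

```python
class BaseArguments:
    """Hypothetical stand-in for TrainingArguments."""
    def __init__(self, output_dir, learning_rate=5e-5):
        self.output_dir = output_dir
        self.learning_rate = learning_rate

class DistillationArguments(BaseArguments):
    def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
        # alpha and temperature are consumed here; the rest goes to the parent
        super().__init__(*args, **kwargs)
        self.alpha = alpha
        self.temperature = temperature

args = DistillationArguments("out", learning_rate=1e-5, alpha=0.3)
print(args.output_dir, args.learning_rate, args.alpha, args.temperature)
```

Because `temperature` is not passed, it keeps its default, while `learning_rate` reaches the parent class untouched.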

# Let's code the new loss function
#### We will subclass Trainer and override the compute_loss() method to include the knowledge distillation loss term L_KD:

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

In [None]:
class KnowledgeDistillationTrainer(Trainer):
  def __init__(self, *args, teacher_model=None, **kwargs):
    super().__init__(*args, **kwargs)
    self.teacher_model = teacher_model
    if self.teacher_model is not None:
      self.teacher_model.eval()  # the teacher is frozen, so keep dropout disabled

  def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
    # num_items_in_batch is accepted because recent Trainer versions pass it; it is unused here.
    # Extract cross-entropy loss and logits from the student
    outputs_student = model(**inputs)
    loss_ce = outputs_student.loss
    logits_student = outputs_student.logits

    # Extract logits from the teacher; no gradients are needed because the teacher is not trained
    with torch.no_grad():
      outputs_teacher = self.teacher_model(**inputs)
    logits_teacher = outputs_teacher.logits

    # Compute the distillation loss on temperature-softened distributions.
    # nn.KLDivLoss expects log-probabilities as input and probabilities as target;
    # reduction="batchmean" sums the divergence over classes and averages it over the batch.
    loss_fct = nn.KLDivLoss(reduction="batchmean")
    loss_kd = self.args.temperature ** 2 * loss_fct(
                F.log_softmax(logits_student / self.args.temperature, dim=-1),
                F.softmax(logits_teacher / self.args.temperature, dim=-1))

    # Return the weighted sum of the cross-entropy and distillation losses
    loss = self.args.alpha * loss_ce + (1. - self.args.alpha) * loss_kd
    return (loss, outputs_student) if return_outputs else loss
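
The combined loss can be checked by hand on a single example. The sketch below re-implements the same formula in plain Python for a batch of one (so the batch mean equals the per-example value); the function names and the toy logits are assumptions made for illustration, not the notebook's API.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    # Cross-entropy of the student against the hard label (no temperature)
    loss_ce = -log_softmax(student_logits)[label]
    # KL(teacher || student) on the temperature-softened distributions
    log_p_student = log_softmax(student_logits, temperature)
    log_p_teacher = log_softmax(teacher_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_teacher, log_p_student))
    loss_kd = temperature ** 2 * kl
    total = alpha * loss_ce + (1.0 - alpha) * loss_kd
    return total, loss_ce, loss_kd

# Hypothetical logits for one three-class example whose true label is class 0
teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 0.2]

total, ce, kd = distillation_loss(student, teacher, label=0)
print(f"cross-entropy={ce:.4f}  distillation={kd:.4f}  total={total:.4f}")
```

A useful sanity check is that the distillation term vanishes when the student reproduces the teacher's logits exactly, leaving only the cross-entropy term.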

# Choosing a Good Student Initialization

## Overview

When selecting a student model for knowledge distillation, the key objective is to choose a model that is smaller than the teacher to reduce latency and memory footprint, while still being able to mimic the teacher's behavior effectively.

## Criteria for Picking a Good Student Model

- **Smaller Model than Teacher**: The student model should have fewer parameters than the teacher model. This is essential to reduce the overall latency and memory usage, which is especially important for deploying the model in resource-constrained environments.

- **Same Model Type**: Knowledge distillation tends to work best when the teacher and student are of the same model type, because the student can then mimic the teacher's behavior more directly. For instance, a BERT teacher pairs naturally with a student from the BERT family.

- **Embedding Space Compatibility**: It’s important that the teacher and student share similar output embedding spaces. For example, BERT and RoBERTa might have different output embedding spaces, which can create issues for the student model when trying to mimic the teacher.

## Choice of Student Model

In this project, **DistilBERT** is chosen as the student model for the following reasons:

- **Reduced Parameters**: DistilBERT has about 40% fewer parameters than BERT, making it a lighter model.
  
- **Strong Performance**: Despite having fewer parameters, DistilBERT has been shown to achieve strong results on a wide range of downstream tasks, making it a natural candidate for the student model in knowledge distillation.

- **Compatibility**: DistilBERT is based on the BERT architecture, so it shares the same model type and output embedding space, ensuring a smoother knowledge transfer from the teacher to the student.

## Conclusion

DistilBERT serves as an ideal student model due to its compact size, strong performance, and compatibility with BERT, which makes it an effective choice for knowledge distillation in this project.


In [None]:
from datasets import load_dataset

In [None]:
#the plus configuration refers to the subset that contains the out-of-scope training examples.
clinc = load_dataset("clinc_oos", "plus")

In [None]:
sample = clinc["train"][0]
print(sample)
#Each example in the CLINC150 dataset consists of a query in the text column and its corresponding intent.

{'text': 'what expression would i use to say i love you if i were an italian', 'intent': 61}


The intents are provided as IDs, but we can easily get the mapping to strings (and vice versa) by accessing the features attribute of the dataset:



In [None]:
intents = clinc["train"].features["intent"]
intent = intents.int2str(sample["intent"])
print(intent)

translate


# Let's preprocess (tokenize) the dataset


In [None]:
from transformers import AutoTokenizer

In [None]:
student_checkpoint = "distilbert-base-uncased"
student_tokenizer = AutoTokenizer.from_pretrained(student_checkpoint)

In [None]:
def tokenize_text(batch):
  return student_tokenizer(batch["text"], truncation=True)

In [None]:

# Tokenize every split and drop the text column, which is no longer needed afterwards
clinc_tokenized = clinc.map(tokenize_text, batched=True, remove_columns=["text"])

# Rename the intent column to labels so the trainer detects it automatically
clinc_tokenized = clinc_tokenized.rename_column("intent", "labels")


# Let's define metrics for DistillationTrainer


In [None]:
import numpy as np
import evaluate

accuracy_score = evaluate.load("accuracy")  # Load the accuracy metric using evaluate.load

def compute_metrics(pred):
  predictions, labels = pred
  predictions = np.argmax(predictions, axis=1)
  return accuracy_score.compute(predictions=predictions, references=labels)

In this function, the predictions from the sequence modeling head come in the form of logits, so we use the `np.argmax()` function to find the most confident class prediction and compare that against the ground truth label.
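
The same argmax-then-compare logic can be sketched without NumPy or `evaluate`. The helpers `argmax` and `accuracy` and the toy data below are illustrative assumptions; they mirror what `compute_metrics` does on a small batch of logits.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Four hypothetical examples with three classes each
toy_logits = [
    [0.1, 2.3, -1.0],
    [1.5, 0.2, 0.1],
    [0.0, 0.3, 0.9],
    [0.4, 0.1, 0.2],
]
toy_labels = [1, 0, 1, 0]

print(accuracy(toy_logits, toy_labels))
```

Here the third example is predicted as class 2 while its label is class 1, so three of the four predictions count as correct.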



# Let's define Training Arguments for DistillationTrainer


In [None]:
batch_size = 48
finetuned_student_ckpt = "distilbert-base-uncased-finetuned-clinc-student"

In [None]:
student_training_args = KnowledgeDistillationTrainingArguments(
    output_dir=finetuned_student_ckpt,
    eval_strategy="epoch",  # named evaluation_strategy in transformers releases before 4.41
    num_train_epochs=5, learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size, alpha=0.5, weight_decay=0.01)



## Let's initialize the student model, but first provide it with the mappings between each intent and label ID.

In [None]:
from transformers import pipeline

bert_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=bert_ckpt)

id2label = pipe.model.config.id2label
label2id = pipe.model.config.label2id

Device set to use cuda:0


In [None]:
from transformers import AutoConfig
num_labels = intents.num_classes
student_config = (AutoConfig
                  .from_pretrained(student_checkpoint, num_labels=num_labels,
                                    id2label=id2label, label2id=label2id))

In [None]:
import torch
from transformers import AutoModelForSequenceClassification
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def student_init():
  return (AutoModelForSequenceClassification.from_pretrained(student_checkpoint, config=student_config).to(device))


# Let's load the teacher checkpoint and start fine-tuning


In [None]:
teacher_checkpoint = "transformersbook/bert-base-uncased-finetuned-clinc"


In [None]:
teacher_model = (AutoModelForSequenceClassification
                     .from_pretrained(teacher_checkpoint, num_labels=num_labels)
                     .to(device))

In [None]:
# Let's start the training
distilbert_trainer = KnowledgeDistillationTrainer(model_init=student_init,
        teacher_model=teacher_model, args=student_training_args,
        train_dataset=clinc_tokenized['train'], eval_dataset=clinc_tokenized['validation'],
        compute_metrics=compute_metrics, tokenizer=student_tokenizer)
distilbert_trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 2.510853 | 0.528387 |
| 2 | 2.698400 | 2.003514 | 0.719032 |
| 3 | 2.698400 | 1.671607 | 0.783871 |
| 4 | 1.960400 | 1.485446 | 0.813226 |
| 5 | 1.584600 | 1.425194 | 0.822903 |


TrainOutput(global_step=1590, training_loss=2.04804383283891, metrics={'train_runtime': 480.2727, 'train_samples_per_second': 158.764, 'train_steps_per_second': 3.311, 'total_flos': 414689637990180.0, 'train_loss': 2.04804383283891, 'epoch': 5.0})
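
The reported `global_step=1590` and the "No log" entry for the first epoch can be reproduced with a little arithmetic. This sketch assumes a training split of 15,250 examples (the size of the CLINC150 `plus` training set), a single device with no gradient accumulation, and a logging interval of 500 steps (assumed to be the `Trainer` default, since the notebook does not set it).

```python
import math

num_train_examples = 15_250  # assumed size of the "plus" training split
batch_size = 48              # per_device_train_batch_size from the notebook
num_epochs = 5
logging_steps = 500          # assumed default logging interval

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print("steps per epoch:", steps_per_epoch, "total steps:", total_steps)

# For each epoch, find the most recent step at which a training loss was logged
for epoch in range(1, num_epochs + 1):
    end_step = epoch * steps_per_epoch
    last_logged = (end_step // logging_steps) * logging_steps
    print(epoch, end_step, last_logged if last_logged else "No log")
```

Under these assumptions the first epoch ends before step 500, so no training loss is available yet, and epochs 2 and 3 both report the value logged at step 500, which matches the repeated 2.698400 in the results.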

# Let's compare the Teacher and Student Models


In [None]:
def save_teacher_model():
  teacher_model.save_pretrained("teacher_model")
def save_student_model():
  distilbert_trainer.save_model('student_model')

In [None]:
save_teacher_model()
save_student_model()

In [None]:
from transformers import AutoModelForSequenceClassification

def compute_parameters(model_path):
  # Load the saved model and count its parameters
  model = AutoModelForSequenceClassification.from_pretrained(model_path)
  return model.num_parameters()

In [None]:
teacher_model_parameters = compute_parameters(model_path="/content/teacher_model")
print("Teacher Model: ", teacher_model_parameters)

Teacher Model:  109598359


In [None]:
student_model_parameters = compute_parameters(model_path="/content/student_model")
print("Student Model: ", student_model_parameters)

Student Model:  67069591


In [None]:
decrease = (student_model_parameters-teacher_model_parameters)/teacher_model_parameters
print(decrease*100)

-38.804201438818986


In [None]:
!ls /content/student_model -al --block-size=MB

total 270MB
drwxr-xr-x 2 root root   1MB Jan 30 03:06 .
drwxr-xr-x 1 root root   1MB Jan 30 03:06 ..
-rw-r--r-- 1 root root   1MB Jan 30 03:06 config.json
-rw-r--r-- 1 root root 269MB Jan 30 03:06 model.safetensors
-rw-r--r-- 1 root root   1MB Jan 30 03:06 special_tokens_map.json
-rw-r--r-- 1 root root   1MB Jan 30 03:06 tokenizer_config.json
-rw-r--r-- 1 root root   1MB Jan 30 03:06 tokenizer.json
-rw-r--r-- 1 root root   1MB Jan 30 03:06 training_args.bin
-rw-r--r-- 1 root root   1MB Jan 30 03:06 vocab.txt


In [None]:
!ls /content/teacher_model -al --block-size=MB

total 439MB
drwxr-xr-x 2 root root   1MB Jan 30 03:06 .
drwxr-xr-x 1 root root   1MB Jan 30 03:06 ..
-rw-r--r-- 1 root root   1MB Jan 30 03:06 config.json
-rw-r--r-- 1 root root 439MB Jan 30 03:06 model.safetensors
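
The file sizes listed above follow almost directly from the parameter counts. As a rough check, assuming every parameter is stored as a 4-byte float32 and ignoring the small safetensors header, the expected sizes are:

```python
import math

BYTES_PER_FLOAT32 = 4  # assumption: weights are saved in float32

def estimated_size_mb(num_parameters, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Approximate weight-file size in megabytes (1 MB = 1,000,000 bytes)."""
    return num_parameters * bytes_per_parameter / 1_000_000

teacher_mb = estimated_size_mb(109_598_359)  # parameter count printed earlier
student_mb = estimated_size_mb(67_069_591)   # parameter count printed earlier

print(f"teacher: {teacher_mb:.1f} MB, student: {student_mb:.1f} MB")
# ls --block-size=MB rounds sizes up to whole megabytes
print(math.ceil(teacher_mb), math.ceil(student_mb))
```

Rounded up to whole megabytes, these estimates agree with the 439MB and 269MB reported for the two `model.safetensors` files.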


In [None]:
print(clinc['train']['text'][101])
print(clinc['train']['intent'][101])

complete a transaction from savings to checking of $20000
133


In [None]:
# Warm up first so that one-time setup costs are excluded from the timing
from transformers import pipeline
import time

pipe = pipeline("text-classification", model="/content/teacher_model", tokenizer='bert-base-uncased')

sample_input = clinc['train']['text'][101]

#WARMUP
for _ in range(10):
  _ = pipe(sample_input)

start = time.time()
for _ in range(100):
  _ = pipe(sample_input)
total_time_teacher_model = time.time()-start
print("Total time to process 100 requests for Teacher Model: ",total_time_teacher_model)


Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Total time to process 100 requests for Teacher Model:  0.8713645935058594


In [None]:
pipe = pipeline("text-classification", model="/content/student_model", tokenizer="distilbert-base-uncased")

sample_input = clinc['train']['text'][101]

#WARMUP
for _ in range(10):
  _ = pipe(sample_input)

start = time.time()
for _ in range(100):
  _ = pipe(sample_input)
total_time_student_model = time.time()-start

print("Total time to process 100 requests for Student Model: ",total_time_student_model)

Device set to use cuda:0


Total time to process 100 requests for Student Model:  0.6051509380340576


In [None]:
decrease_in_time = (total_time_teacher_model-total_time_student_model)/total_time_teacher_model
print(decrease_in_time*100)

30.551351002307126
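
A single wall-clock total is sensitive to background noise, so it can help to record per-call latencies and report summary statistics. The helper below is a sketch, not the notebook's method: `benchmark` and `dummy_workload` are hypothetical names, and `dummy_workload` stands in for a call such as `pipe(sample_input)` so that the example runs without a model.

```python
import statistics
import time

def benchmark(fn, n_warmup=10, n_runs=100):
    """Time fn() n_runs times after n_warmup untimed calls; return latencies in ms."""
    for _ in range(n_warmup):
        fn()
    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "stdev_ms": statistics.stdev(latencies_ms),
    }

def dummy_workload():
    # Hypothetical stand-in for pipe(sample_input)
    return sum(i * i for i in range(10_000))

stats = benchmark(dummy_workload)
print(stats)
```

In the notebook, passing `lambda: pipe(sample_input)` for each pipeline would give comparable per-request numbers for the teacher and the student. `time.perf_counter()` is used instead of `time.time()` because it is intended for measuring short durations.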
