## Step 1: Creating a Knowledge Distillation Trainer

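The trainer needs three additions:

1. The new hyperparameters α and T
   - α controls the relative weight of the distillation loss. The total loss used later in this section is α · L_CE + (1 − α) · L_KD, so a smaller α gives the distillation term more weight.
   - T controls how much the probability distribution over the labels is smoothed; higher values give softer distributions.

2. The fine-tuned teacher model; we will use BERT-base.

3. A new loss function that combines the cross-entropy loss with the knowledge distillation loss.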
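To make the role of T concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The helper name `softmax_with_temperature` and the example logits are assumptions made for illustration; they are not part of the notebook or of any library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    max_scaled = max(scaled)
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-class problem (not taken from CLINC150).
example_logits = [4.0, 1.0, 0.5]

for T in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(example_logits, temperature=T)
    print(f"T={T}: " + ", ".join(f"{p:.3f}" for p in probs))
```

As T grows, the largest probability shrinks and the smaller ones grow, which exposes more of the teacher's relative preferences among the non-winning classes.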

Adding the new hyperparameters is simple: we subclass TrainingArguments and include them as new attributes.

In [None]:
!pip install transformers


In [None]:
from transformers import TrainingArguments

In [None]:
class KnowledgeDistillationTrainingArguments(TrainingArguments):
  def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
    # *args and **kwargs forward any positional and keyword arguments
    # unchanged to the parent TrainingArguments initializer.
    super().__init__(*args, **kwargs)
    # Store the two distillation hyperparameters alongside the standard ones.
    self.alpha = alpha
    self.temperature = temperature

### Let's code the new loss function

We will subclass Trainer and override the compute_loss() method to include the knowledge distillation loss term L_KD:



In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Trainer

In [None]:
class KnowledgeDistillationTrainer(Trainer):
  def __init__(self, *args, teacher_model=None, **kwargs):
    super().__init__(*args, **kwargs)
    self.teacher_model = teacher_model

  def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
    # **kwargs absorbs extra keyword arguments that some newer transformers
    # releases pass to compute_loss.
    # Extract cross-entropy loss and logits from the student
    outputs_student = model(**inputs)
    loss_ce = outputs_student.loss
    logits_student = outputs_student.logits

    # Extract logits from the teacher; it is not being trained, so skip gradient tracking
    with torch.no_grad():
      outputs_teacher = self.teacher_model(**inputs)
    logits_teacher = outputs_teacher.logits

    # Soften both distributions with the temperature and compute the distillation loss.
    # reduction="batchmean" sums the KL divergence over classes and averages it over the batch.
    loss_fct = nn.KLDivLoss(reduction="batchmean")
    loss_kd = self.args.temperature ** 2 * loss_fct(
        F.log_softmax(logits_student / self.args.temperature, dim=-1),
        F.softmax(logits_teacher / self.args.temperature, dim=-1))

    # Return the weighted combination of the two losses
    loss = self.args.alpha * loss_ce + (1. - self.args.alpha) * loss_kd
    return (loss, outputs_student) if return_outputs else loss
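
Written out, the loss returned by compute_loss() is shown below. This only restates the code above in equation form. The symbols are notation chosen here: z_s and z_t are the student and teacher logits, N is the batch size, and c indexes the classes.

```latex
\begin{align}
L &= \alpha \, L_{CE} + (1 - \alpha) \, L_{KD} \\
L_{KD} &= T^{2} \cdot \frac{1}{N} \sum_{i=1}^{N} D_{KL}\!\left( p_i \,\|\, q_i \right),
\qquad p_i = \mathrm{softmax}\!\left( z_t^{(i)} / T \right),
\quad q_i = \mathrm{softmax}\!\left( z_s^{(i)} / T \right) \\
D_{KL}\!\left( p \,\|\, q \right) &= \sum_{c} p_c \left( \log p_c - \log q_c \right)
\end{align}
```

The factor T² compensates for the 1/T² scaling of the soft-target gradients, so the two terms stay on a comparable scale when T changes.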


## Choosing a Good Student Initialization

How do we pick a good student model?
1. The student should be smaller than the teacher, to reduce latency and memory footprint.

2. Knowledge distillation tends to work best when the teacher and student are of the same model type. Different model types, such as BERT and RoBERTa, can have different output embedding spaces, which makes it harder for the student to mimic the teacher.

In this project, we will use DistilBERT. It is a natural candidate to initialize the student with, since it has about 40% fewer parameters than BERT-base and has been shown to achieve strong results on downstream tasks.


### Let's load the dataset first

In [None]:
!pip install datasets


In [None]:
from datasets import load_dataset

We will use the CLINC150 dataset, a benchmark for intent classification.




In [None]:
# The "plus" configuration is the subset that contains the out-of-scope training examples.
clinc = load_dataset("clinc_oos", "plus")




In [None]:
# Each example in the CLINC150 dataset consists of a query in the text column and its corresponding intent.
sample = clinc["train"][0]
print(sample)

{'text': 'what expression would i use to say i love you if i were an italian', 'intent': 61}


The intents are provided as IDs, but we can easily get the mapping to strings (and vice versa) by accessing the features attribute of the dataset:

In [None]:
intents = clinc["train"].features["intent"]
intent = intents.int2str(sample["intent"])
print(intent)

translate


### Let's tokenize the dataset

In [None]:
from transformers import AutoTokenizer

In [None]:
student_checkpoint = "distilbert-base-uncased"
student_tokenizer = AutoTokenizer.from_pretrained(student_checkpoint)

In [None]:
def tokenize_text(batch):
  return student_tokenizer(batch["text"], truncation=True)

In [None]:
# Tokenize in batches and drop the text column, which is no longer needed.
clinc_tokenized = clinc.map(tokenize_text, batched=True, remove_columns=["text"])

# Rename the intent column to labels so the trainer detects it automatically.
clinc_tokenized = clinc_tokenized.rename_column("intent", "labels")





### Let's define metrics for the distillation trainer

In [None]:
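import numpy as np
from datasets import load_metric

# Note: load_metric is deprecated in newer datasets releases in favor of the
# separate evaluate library (evaluate.load("accuracy")).
accuracy_score = load_metric("accuracy")

def compute_metrics(pred):
  # pred unpacks into the model logits and the reference labels
  predictions, labels = pred
  predictions = np.argmax(predictions, axis=1)
  return accuracy_score.compute(predictions=predictions, references=labels)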


In this function, the predictions from the classification head come in the form of logits, so we use np.argmax() to find the most confident class prediction and compare it against the ground-truth label.
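
The argmax-then-compare step can be sketched without any dependencies. The helper names and the toy logits below are made up for illustration; the notebook itself relies on np.argmax() and the accuracy metric instead.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose argmax matches the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical logits for three examples over three classes.
toy_logits = [
    [2.1, 0.3, -1.0],  # most confident class: 0
    [0.2, 0.1, 3.5],   # most confident class: 2
    [1.0, 1.2, 0.9],   # most confident class: 1
]
toy_labels = [0, 2, 0]

# Two of the three predictions match their labels.
print(accuracy_from_logits(toy_logits, toy_labels))
```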

In [None]:
!pip install transformers[torch]



### Let's define the training arguments for the distillation trainer

In [None]:
batch_size = 48
finetuned_student_ckpt = "distilbert-base-uncased-finetuned-clinc-student"

In [None]:
!pip install "accelerate>=0.20.1"

In [None]:
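# alpha=1 puts all of the weight on the cross-entropy loss, so this run
# fine-tunes the student without the teacher's signal; lower alpha to enable distillation.
# Note: newer transformers releases rename evaluation_strategy to eval_strategy.
student_training_args = KnowledgeDistillationTrainingArguments(
    output_dir=finetuned_student_ckpt, evaluation_strategy="epoch",
    num_train_epochs=1, learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size, alpha=1, weight_decay=0.01)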
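
To see how alpha trades off the two loss terms, the following sketch evaluates the combined loss for a single example in plain Python. All function names and logits are illustrative assumptions. Because there is only one example, the batch average used by the trainer reduces to the sum over classes.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log of softmax(logits / temperature), computed stably."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)
    log_norm = max_scaled + math.log(sum(math.exp(z - max_scaled) for z in scaled))
    return [z - log_norm for z in scaled]

def distillation_loss(student_logits, teacher_logits, temperature):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    log_q = log_softmax(student_logits, temperature)  # student
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl

def combined_loss(loss_ce, loss_kd, alpha):
    """Weighted sum used by the trainer: alpha * CE + (1 - alpha) * KD."""
    return alpha * loss_ce + (1.0 - alpha) * loss_kd

# Hypothetical logits for one example with three classes; the true label is 0.
student_logits = [1.0, 0.5, -0.5]
teacher_logits = [3.0, 0.2, -1.0]
label = 0

loss_ce = -log_softmax(student_logits)[label]
loss_kd = distillation_loss(student_logits, teacher_logits, temperature=2.0)

for alpha in (1.0, 0.5, 0.0):
    print(f"alpha={alpha}: loss={combined_loss(loss_ce, loss_kd, alpha):.4f}")
```

With alpha=1.0 the result equals the cross-entropy loss alone, which is the setting used in the training arguments above; with alpha=0.0 only the distillation term remains.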

## Let's initialize the student model

Before initializing the student, we provide it with the mappings between each intent and label ID. We read them from the configuration of the fine-tuned BERT checkpoint.

In [None]:
from transformers import pipeline

bert_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=bert_ckpt)

id2label = pipe.model.config.id2label
label2id = pipe.model.config.label2id



In [None]:
from transformers import AutoConfig
num_labels = intents.num_classes
student_config = (AutoConfig
                  .from_pretrained(student_checkpoint, num_labels=num_labels,
                                    id2label=id2label, label2id=label2id))

In [None]:
import torch
from transformers import AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def student_init():
  # Load a fresh student from the pretrained checkpoint and move it to the device
  return (AutoModelForSequenceClassification
          .from_pretrained(student_checkpoint, config=student_config)
          .to(device))

## Let's load the teacher checkpoint and start fine-tuning

In [None]:
teacher_checkpoint = "transformersbook/bert-base-uncased-finetuned-clinc"

In [None]:
teacher_model = (AutoModelForSequenceClassification
                     .from_pretrained(teacher_checkpoint, num_labels=num_labels)
                     .to(device))

In [None]:
# Start the training
distilbert_trainer = KnowledgeDistillationTrainer(
    model_init=student_init,
    teacher_model=teacher_model,
    args=student_training_args,
    train_dataset=clinc_tokenized["train"],
    eval_dataset=clinc_tokenized["validation"],
    compute_metrics=compute_metrics,
    tokenizer=student_tokenizer)
distilbert_trainer.train()


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.we

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | No log        | 4.158688        | 0.576129 |


TrainOutput(global_step=318, training_loss=4.55223496155169, metrics={'train_runtime': 86.196, 'train_samples_per_second': 176.922, 'train_steps_per_second': 3.689, 'total_flos': 83021977369212.0, 'train_loss': 4.55223496155169, 'epoch': 1.0})

## Let's compare the teacher and student models

In [None]:
# We will compare the two models by size and inference time

We first save the teacher and student models, then compute each model's parameter count and its size on disk in MB.

In [None]:
def save_teacher_model():
  teacher_model.save_pretrained("teacher_model")
def save_student_model():
  distilbert_trainer.save_model('student_model')


In [None]:
save_teacher_model()
save_student_model()

In [None]:
from transformers import AutoModelForSequenceClassification

def compute_parameters(model_path):
  # Load the saved model and count its parameters
  model = AutoModelForSequenceClassification.from_pretrained(model_path)
  return model.num_parameters()

In [None]:
teacher_model_parameters = compute_parameters(model_path="/content/teacher_model")
print("Teacher Model: ", teacher_model_parameters)

Teacher Model:  109598359


In [None]:
student_model_parameters = compute_parameters(model_path="/content/student_model")
print("Student Model: ", student_model_parameters)

Student Model:  67069591


In [None]:
# Relative change in parameter count, in percent; a negative value means the student is smaller
decrease = (student_model_parameters-teacher_model_parameters)/teacher_model_parameters
print(decrease*100)

-38.804201438818986


In [None]:
!ls /content/student_model -al --block-size=MB

total 270MB
drwxr-xr-x 2 root root   1MB Jun 25 12:50 .
drwxr-xr-x 1 root root   1MB Jun 25 12:50 ..
-rw-r--r-- 1 root root   1MB Jun 25 12:50 config.json
-rw-r--r-- 1 root root 269MB Jun 25 12:50 pytorch_model.bin
-rw-r--r-- 1 root root   1MB Jun 25 12:50 special_tokens_map.json
-rw-r--r-- 1 root root   1MB Jun 25 12:50 tokenizer_config.json
-rw-r--r-- 1 root root   1MB Jun 25 12:50 tokenizer.json
-rw-r--r-- 1 root root   1MB Jun 25 12:50 training_args.bin
-rw-r--r-- 1 root root   1MB Jun 25 12:50 vocab.txt


In [None]:
!ls /content/teacher_model -al --block-size=MB

total 439MB
drwxr-xr-x 2 root root   1MB Jun 25 12:50 .
drwxr-xr-x 1 root root   1MB Jun 25 12:50 ..
-rw-r--r-- 1 root root   1MB Jun 25 12:50 config.json
-rw-r--r-- 1 root root 439MB Jun 25 12:50 pytorch_model.bin
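
A quick sanity check connects the parameter counts printed earlier to the file sizes in these listings. The sketch assumes every parameter is stored as a 32-bit float (4 bytes) and ignores any serialization overhead; the helper name `estimated_size_mb` is hypothetical.

```python
BYTES_PER_FLOAT32 = 4
BYTES_PER_MB = 10**6  # `ls --block-size=MB` reports decimal megabytes

def estimated_size_mb(num_parameters, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Approximate on-disk size of the weights in decimal megabytes."""
    return num_parameters * bytes_per_parameter / BYTES_PER_MB

# Parameter counts printed by compute_parameters() above.
teacher_params = 109_598_359
student_params = 67_069_591

print(f"Teacher: {estimated_size_mb(teacher_params):.1f} MB")
print(f"Student: {estimated_size_mb(student_params):.1f} MB")
```

The estimates come to roughly 438 MB and 268 MB, which is consistent with the 439MB and 269MB reported by ls once sizes are rounded up to whole blocks.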


In [None]:
print(clinc['train']['text'][101])
print(clinc['train']['intent'][101])


complete a transaction from savings to checking of $20000
133


In [None]:
# We will time 100 repeated inferences on the same input, after a short warm-up

In [None]:
from transformers import pipeline
import time

# The saved teacher directory contains no tokenizer files, so we pass the tokenizer explicitly
pipe = pipeline("text-classification", model="/content/teacher_model", tokenizer='bert-base-uncased')

sample_input = clinc['train']['text'][101]

# Warm up so that one-time setup costs are excluded from the measurement
for _ in range(10):
  _ = pipe(sample_input)

# Time 100 requests
start = time.time()
for _ in range(100):
  _ = pipe(sample_input)
total_time_teacher_model = time.time()-start
print("Total time to process 100 requests for Teacher Model: ",total_time_teacher_model)

Total time to process 100 requests for Teacher Model:  24.418488264083862


In [None]:
pipe = pipeline("text-classification", model="/content/student_model", tokenizer="distilbert-base-uncased")

sample_input = clinc['train']['text'][101]

#WARMUP
for _ in range(10):
  _ = pipe(sample_input)

start = time.time()
for _ in range(100):
  _ = pipe(sample_input)
total_time_student_model = time.time()-start

print("Total time to process 100 requests for Student Model: ",total_time_student_model)

Total time to process 100 requests for Student Model:  13.995459079742432


In [None]:
# Relative reduction in total inference time, in percent
decrease_in_time = (total_time_teacher_model-total_time_student_model)/total_time_teacher_model
print(decrease_in_time*100)

42.684989634154505
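
In per-request terms, the totals above correspond to about 244 ms for the teacher and 140 ms for the student, a speedup of roughly 1.74x. The two timing cells can also be folded into one reusable helper. The sketch below is one possible way to do that, not the notebook's own method: the `benchmark` function is hypothetical, it uses time.perf_counter() rather than time.time() because it is intended for measuring short intervals, and a dummy function stands in for the pipeline so the example runs on its own.

```python
import statistics
import time

def benchmark(fn, sample, warmup=10, runs=100):
    """Time repeated calls of fn(sample) and summarize per-call latency in ms."""
    # Warm-up calls are executed but not measured.
    for _ in range(warmup):
        fn(sample)
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(sample)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "stdev_ms": statistics.stdev(latencies_ms),
        "runs": runs,
    }

# Stand-in workload so the sketch is self-contained. In the notebook you would
# pass the pipeline object instead, for example benchmark(pipe, sample_input).
def dummy_model(text):
    return sum(ord(ch) for ch in text)

stats = benchmark(dummy_model, "complete a transaction", warmup=2, runs=20)
print(stats)
```

Reporting the mean together with the standard deviation makes it easier to judge whether a difference between two models is larger than the run-to-run noise.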
