In this notebook we walk through model distillation, a technique for reducing the computational demands of a model without significantly compromising its performance. A smaller student model (DistilBERT) is trained to approximate the behavior of a larger, more complex teacher model (BERT), which lowers computational cost and inference time while maintaining reasonable performance.


# Install the required packages

In [None]:
! pip install torch transformers



# Model Initialization
In this part, we initialize both the teacher model (BERT) and the student model (DistilBERT). The teacher model is a larger, pre-trained model, and the student model is the smaller, distilled version of it.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import BertTokenizer, BertForSequenceClassification, DistilBertForSequenceClassification
import torch.nn.functional as F


# Initialize teacher and student models
teacher_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)  # Pretrained BERT model
teacher_model.eval()  # Set the teacher to evaluation mode


student_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)  # Smaller distilled model

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


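**BertForSequenceClassification** is a pre-trained BERT encoder with a sequence classification head on top (for tasks like sentiment analysis, etc.). When loaded from bert-base-uncased, the encoder weights are pre-trained but the classification head is newly initialized, as the warning above notes, so in practice the teacher should be fine-tuned on the target task before it is used for distillation.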

**DistilBertForSequenceClassification** is a smaller and faster model derived from BERT. It maintains much of BERT’s performance but is optimized for speed and resource efficiency.

**num_labels=2** specifies that the model will predict 2 classes (binary classification).
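
To make the size difference between two models concrete, one could count trainable parameters. The helper below, count_parameters, is a hypothetical name introduced here for illustration and is not part of the original notebook. It is demonstrated on a small stand-in layer so that it runs without downloading any weights.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Return the number of trainable parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Stand-in module (an assumption for the demo): 10 inputs, 2 outputs,
# so 10 * 2 weights plus 2 biases
tiny_layer = nn.Linear(10, 2)
tiny_count = count_parameters(tiny_layer)
print(tiny_count)
```

Calling count_parameters(teacher_model) and count_parameters(student_model) on the models loaded above would report the two sizes directly.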

# Tokenizer Setup
We initialize the tokenizer, which converts the input text into token IDs that can be passed into the model.


In [None]:
# Tokenizer for encoding text
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

**BertTokenizer.from_pretrained("bert-base-uncased")** loads the tokenizer that corresponds to the BERT model, ensuring it uses the same vocabulary and tokenization method.


# Distillation Loss Function
The distillation loss function is a custom loss that encourages the student model to mimic the teacher model. We use KL divergence to measure how far the student's predicted probability distribution is from the teacher's.

In [None]:
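# Distillation Loss Function
def distillation_loss(y_true, y_pred, teacher_logits, temperature=2.0):
    # y_true (hard labels) is accepted but unused: only the teacher's soft targets drive this loss
    student_log_probs = F.log_softmax(y_pred / temperature, dim=1)  # KLDivLoss expects log-probabilities as input
    teacher_probs = F.softmax(teacher_logits / temperature, dim=1)  # and probabilities as target
    # reduction="batchmean" matches the mathematical KL definition (the default "mean" also averages over classes)
    kl = nn.KLDivLoss(reduction="batchmean")(student_log_probs, teacher_probs)
    # Scale by temperature**2 so gradient magnitudes stay comparable across temperatures
    return kl * temperature ** 2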

**temperature=2.0:** This is a hyperparameter that controls the "softness" of the probabilities. A higher temperature results in softer probability distributions, which can help the student model better learn from the teacher.
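
The softening effect of temperature can be seen directly on a small example. The logits below are made up for illustration and do not come from the models in this notebook.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one example over two classes
logits = torch.tensor([[4.0, 1.0]])

probs_t1 = F.softmax(logits / 1.0, dim=1)  # temperature = 1: standard softmax
probs_t2 = F.softmax(logits / 2.0, dim=1)  # temperature = 2: the notebook's default
probs_t8 = F.softmax(logits / 8.0, dim=1)  # temperature = 8: much softer

# The probability of the top class shrinks toward 0.5 as the temperature rises
for name, probs in [("T=1", probs_t1), ("T=2", probs_t2), ("T=8", probs_t8)]:
    print(name, probs.squeeze().tolist())
```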

**KLDivLoss:** This computes the Kullback-Leibler divergence, a measure of how one probability distribution diverges from a second, reference distribution. It expects log-probabilities as input and probabilities as target, so we pass the student's temperature-scaled log-softmax output and the teacher's temperature-scaled softmax output.
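
The function above uses only the teacher's soft targets and leaves y_true unused. A common variant blends the soft-target term with ordinary cross-entropy on the hard labels. The sketch below is one such variant, not the notebook's method; the name combined_distillation_loss, the weighting parameter alpha, and the example logits are all assumptions introduced for illustration.

```python
import torch
import torch.nn.functional as F


def combined_distillation_loss(student_logits, teacher_logits, labels,
                               temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term with hard-label cross-entropy.

    alpha weights the soft-target term; (1 - alpha) weights cross-entropy.
    """
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2  # keep gradient scale comparable across temperatures
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


# Hypothetical logits for a batch of two examples and two classes
student_logits = torch.tensor([[0.2, -0.1], [0.0, 0.3]], requires_grad=True)
teacher_logits = torch.tensor([[2.0, -1.0], [-1.5, 1.0]])
labels = torch.tensor([0, 1])

loss = combined_distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # gradients flow only into the student logits
print(loss.item())
```

Setting alpha=1.0 reduces this to a pure soft-target loss, while alpha=0.0 reduces it to plain cross-entropy on the labels.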

# Optimizer Setup
We define the optimizer for the student model. We will use Adam, a popular choice for training deep learning models.

In [None]:
# Optimizer setup
optimizer = optim.Adam(student_model.parameters(), lr=0.001)

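**optim.Adam(student_model.parameters(), lr=0.001):** Adam adapts the step size for each parameter using running estimates of the gradient's first and second moments. The learning rate of 0.001 is used here for demonstration; fine-tuning transformer models typically uses much smaller values, on the order of 2e-5 to 5e-5.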
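
For longer training runs, a smaller learning rate with warmup and decay is a common setup. The sketch below is one possible configuration, not the notebook's: it uses AdamW with a hand-written linear warmup and decay schedule on a tiny stand-in layer so that it runs without downloading weights. The values of base_lr, warmup_steps, and total_steps, the dummy objective, and the names stand_in_model and demo_optimizer are all assumptions.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Stand-in for student_model so the sketch runs without downloading weights
stand_in_model = nn.Linear(4, 2)

base_lr = 5e-5      # assumed fine-tuning learning rate
warmup_steps = 10   # assumed number of warmup steps
total_steps = 100   # assumed total number of optimizer steps


def linear_warmup_decay(step: int) -> float:
    """Multiplier on base_lr: ramp up linearly, then decay linearly toward zero."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))


demo_optimizer = AdamW(stand_in_model.parameters(), lr=base_lr, weight_decay=0.01)
scheduler = LambdaLR(demo_optimizer, lr_lambda=linear_warmup_decay)

lr_history = []
for step in range(total_steps):
    loss = stand_in_model(torch.randn(8, 4)).pow(2).mean()  # dummy objective
    demo_optimizer.zero_grad()
    loss.backward()
    demo_optimizer.step()
    lr_history.append(scheduler.get_last_lr()[0])  # learning rate used for this step
    scheduler.step()

print(lr_history[0], lr_history[warmup_steps], lr_history[-1])
```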


# Text Tokenization and Input Preparation
Here, we prepare the text input by tokenizing it and converting it into a format that the models can understand.


In [None]:
# Example data (for demonstration, replace with actual data)
text_data = ["This is a great product", "Worst service ever"]
labels = [1, 0]  # Binary labels for sentiment (positive or negative)
# Tokenize the input text and convert to tensors
inputs = tokenizer(text_data, padding=True, truncation=True, return_tensors="pt", max_length=64)
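
Padding and truncation can be illustrated by hand. The sketch below is not the tokenizer's implementation: the token IDs are made up, special tokens such as [CLS] and [SEP] are ignored, and pad_and_truncate is a hypothetical helper that only mimics what padding=True, truncation=True, and max_length do to a batch of already-encoded sequences.

```python
from typing import Dict, List


def pad_and_truncate(sequences: List[List[int]], max_length: int,
                     pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Truncate to max_length, then pad every sequence to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for two sequences of different lengths
batch = pad_and_truncate([[11, 12, 13, 14, 15], [21, 22, 23]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The first sequence is cut to four IDs and the second is padded to match, which is why every row of the resulting tensor has the same length.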


# Remove token_type_ids for DistilBERT
DistilBERT does not need token_type_ids, which are used in BERT for sentence-pair tasks. We remove this field if it exists in the tokenized input.

In [None]:
# DistilBERT does not require 'token_type_ids', so remove it if it exists
inputs.pop('token_type_ids', None)  # Remove token_type_ids if it exists in the tokenized output

tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]])

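**inputs.pop('token_type_ids', None):** This removes token_type_ids from the input dictionary if present; the second argument, None, keeps the call from raising an error when the key is absent. This is necessary because DistilBERT does not use this field. BERT falls back to all-zero token_type_ids when none are passed, which is appropriate for single-sentence inputs like these, so the same dictionary can still be fed to the teacher.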
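
An alternative to mutating inputs in place is to build a separate dictionary for the student and leave the original intact for the teacher. This is a sketch of that option rather than what the notebook does; the tensors below are made-up stand-ins for real tokenizer output, and the names encoded and student_inputs are introduced here for illustration.

```python
import torch

# Made-up stand-in for the tokenizer output, which behaves like a dictionary
encoded = {
    "input_ids": torch.tensor([[101, 2023, 102], [101, 5409, 102]]),
    "token_type_ids": torch.zeros(2, 3, dtype=torch.long),
    "attention_mask": torch.ones(2, 3, dtype=torch.long),
}

# Copy every entry except token_type_ids for the student
student_inputs = {k: v for k, v in encoded.items() if k != "token_type_ids"}

print(sorted(encoded.keys()))
print(sorted(student_inputs.keys()))
```

The tensors themselves are shared between the two dictionaries, so this costs no extra memory for the inputs.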


# Training Loop
This is the core part of the training process, where we calculate the distillation loss and update the student model’s weights.


In [None]:
# Training loop for the student model using distillation
student_model.train()  # from_pretrained returns the model in eval mode; enable dropout for training
for epoch in range(3):  # Example of 3 epochs for training
    optimizer.zero_grad()

    # Pass data through the teacher and student models
    with torch.no_grad():  # The teacher is frozen, so no gradients are needed
        teacher_logits = teacher_model(**inputs).logits  # Teacher model's output (logits)

    student_logits = student_model(**inputs).logits  # Student model's output (logits)

    # Calculate the distillation loss (the hard labels are passed but unused by distillation_loss)
    loss = distillation_loss(torch.tensor(labels), student_logits, teacher_logits)
    # Backpropagate and update the student model
    loss.backward()
    optimizer.step()

    # Print the loss after each epoch
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")




Epoch 1, Loss: 0.00042747706174850464
Epoch 2, Loss: 0.01758313551545143
Epoch 3, Loss: 0.0032774433493614197


**optimizer.zero_grad():** Clears the old gradients before the new ones are calculated.

**teacher_logits = teacher_model(\*\*inputs).logits:** Runs the input through the teacher model to get its predictions (logits). This happens inside torch.no_grad() because the teacher is not being trained.

**student_logits = student_model(\*\*inputs).logits:** Runs the input through the student model to get its predictions (logits).

**loss.backward():** Performs backpropagation to calculate the gradients of the loss with respect to the model’s parameters.

**optimizer.step():** Updates the model’s parameters based on the calculated gradients.

**print(f"Epoch {epoch+1}, Loss: {loss.item()}"):** Optionally, you can print the loss after every epoch to monitor training progress.
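
The loop above runs three updates on a single two-example batch. With a real dataset, the same steps are usually applied per mini-batch. The sketch below shows that pattern with tiny linear layers standing in for the teacher and student so that it runs without downloading weights; the stand-in models, the random features, and all hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins for the real models (assumption): tiny linear classifiers
teacher = nn.Linear(8, 2)
nn.init.normal_(teacher.weight, std=1.0)  # make the teacher clearly differ from the student
teacher.eval()
student = nn.Linear(8, 2)

features = torch.randn(32, 8)  # random features in place of tokenized text
loader = DataLoader(TensorDataset(features), batch_size=8, shuffle=True)
opt = torch.optim.Adam(student.parameters(), lr=0.05)
temperature = 2.0


def soft_target_loss(student_logits, teacher_logits):
    """Temperature-scaled KL divergence from teacher to student predictions."""
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2


with torch.no_grad():
    initial_loss = soft_target_loss(student(features), teacher(features)).item()

student.train()
for epoch in range(30):
    for (batch,) in loader:
        with torch.no_grad():  # the teacher is frozen
            teacher_logits = teacher(batch)
        loss = soft_target_loss(student(batch), teacher_logits)
        opt.zero_grad()
        loss.backward()
        opt.step()

with torch.no_grad():
    final_loss = soft_target_loss(student(features), teacher(features)).item()
print(f"before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Swapping the stand-ins for teacher_model and student_model, and the random features for tokenized batches, would give the same structure on real data.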
