#### About

> Knowledge Distillation 

Knowledge distillation is a machine learning technique for transferring knowledge from a large or complex model (the "teacher") to a smaller or simpler model (the "student"). The goal is to train the student to mimic the teacher's behavior, so that it benefits from what the teacher has learned while using less memory and potentially running inference faster.

The basic idea is to use the teacher model's outputs (e.g., predicted probabilities or logits) as "soft targets" during training, rather than the hard labels (e.g., one-hot encoded labels) typically used in standard supervised learning. Soft targets are more informative than hard targets because they encode the teacher's confidence or uncertainty in its predictions. The student model is then trained to minimize the difference between its predictions and the soft targets produced by the teacher. This allows the student to learn not only from the ground-truth labels, but also from the knowledge the teacher captured during training. The knowledge distillation process typically involves the following steps:

1. Train the teacher model: Train a large or complex model on a large labeled dataset to achieve high accuracy. This model is the source of the knowledge transferred to the student model.

2. Gather soft targets: Use the trained teacher model to generate soft targets (such as predicted probabilities or logits) for the labeled or unlabeled samples that the student model will train on.

3. Train the student model: Use the labeled dataset and the soft targets generated by the teacher to train a smaller or simpler model (e.g., one with fewer parameters or layers). The student is typically trained to minimize the difference between its predictions and the soft targets using an appropriate loss function.

4. Refine the student model: Optionally, fine-tune the trained student on the labeled dataset with ground-truth (i.e., hard) labels to improve its performance.
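To make the difference between hard and soft targets concrete, here is a minimal NumPy sketch of a temperature-scaled softmax. The helper name `softmax_with_temperature` and the example logits are hypothetical and chosen only for illustration; they are not part of the notebook code that follows.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the given temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    # Subtract the row maximum for numerical stability (does not change the result)
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical teacher logits for one sample over three classes
example_logits = np.array([4.0, 1.0, 0.5])

hard_target = np.eye(3)[np.argmax(example_logits)]       # one-hot label
soft_t1 = softmax_with_temperature(example_logits, 1.0)  # ordinary softmax
soft_t5 = softmax_with_temperature(example_logits, 5.0)  # softened targets

print("hard target:", hard_target)
print("T=1:", np.round(soft_t1, 3))
print("T=5:", np.round(soft_t5, 3))
```

A higher temperature spreads probability mass over the non-top classes while keeping their ranking, which is the extra information a student can learn from.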

In [43]:
import numpy as np
import tensorflow as tf  # needed by the distillation loss defined below
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

In [44]:
# Generate synthetic data (labels are random and independent of the features, so held-out accuracy near 0.5 is expected)
x_train = np.random.rand(1000, 10)
y_train = np.random.randint(0, 2, size=(1000,))
y_train_onehot = np.eye(2)[y_train]


In [45]:
# Define the teacher model
teacher_model = Sequential()
teacher_model.add(Dense(units=64, activation='relu', input_dim=10))
teacher_model.add(Dense(units=2, activation='softmax'))
teacher_model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])



In [46]:
# Train the teacher model
teacher_model.fit(x_train, y_train, batch_size=32, epochs=10, verbose=1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f83a06a2f10>

In [47]:
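# Obtain the teacher's soft targets on the training data
# (the teacher ends in a softmax layer, so these are probabilities, not raw logits)
teacher_logits = teacher_model.predict(x_train)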




In [48]:
# Define the student model
student_model = Sequential()
student_model.add(Dense(units=32, activation='relu', input_dim=10))
student_model.add(Dense(units=2, activation='softmax'))


In [49]:
# Define the temperature hyperparameter
temperature = 5  # Example value, adjust as needed


In [50]:
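# Define the distillation loss function
def distillation_loss(y_true, y_pred):

    # The student ends in a softmax layer, so y_pred holds probabilities;
    # their log recovers the logits up to an additive constant
    student_logits = tf.math.log(tf.clip_by_value(y_pred, 1e-7, 1.0))

    # Soften the logits by dividing by temperature
    softened_logits = student_logits / temperature
    
    # Compute cross-entropy between the softened student predictions and the targets in y_true
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=softened_logits)
    
    # Return the loss
    return loss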
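Step 3 mentions training on both the labeled data and the teacher's soft targets. One common formulation, sketched here in plain NumPy, mixes a soft-target term with an ordinary cross-entropy term on the hard labels. The function name `combined_distillation_loss`, the weight `alpha`, the `temperature ** 2` scaling of the soft term, and the toy logits are all assumptions for illustration; the notebook's own loss uses only one set of targets.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def combined_distillation_loss(student_logits, teacher_logits, labels,
                               temperature=5.0, alpha=0.5):
    """Weighted sum of a soft-target loss and a hard-label loss (illustrative)."""
    labels = np.asarray(labels)
    # Soft term: cross-entropy between softened teacher and student distributions
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student_soft = np.log(softmax(student_logits, temperature) + 1e-12)
    soft_loss = -(p_teacher * log_p_student_soft).sum(axis=-1).mean()
    # Hard term: ordinary cross-entropy against the integer class labels
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    hard_loss = -log_p_student[np.arange(len(labels)), labels].mean()
    # Scaling the soft term by temperature**2 keeps its gradients comparable in size
    return alpha * temperature ** 2 * soft_loss + (1.0 - alpha) * hard_loss

# Toy logits for two samples and two classes (hypothetical values)
toy_teacher = np.array([[2.0, 0.0], [0.0, 3.0]])
toy_labels = np.array([0, 1])
loss_match = combined_distillation_loss(toy_teacher, toy_teacher, toy_labels)
loss_mismatch = combined_distillation_loss(toy_teacher[:, ::-1], toy_teacher, toy_labels)
print(f"student matches teacher: {loss_match:.4f}")
print(f"student contradicts teacher: {loss_mismatch:.4f}")
```

Setting `alpha` to 1 trains purely on the teacher's soft targets, and setting it to 0 falls back to standard supervised training.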

In [51]:
# Compile the student model with the distillation loss function
student_model.compile(optimizer=Adam(), loss=distillation_loss, metrics=['accuracy'])


In [52]:
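# Train the student on the teacher's outputs, softened as p**(1/T) renormalized (equal to softmax(log(p)/T))
soft_targets = teacher_logits ** (1.0 / temperature)
soft_targets /= soft_targets.sum(axis=1, keepdims=True)
student_model.fit(x_train, soft_targets, batch_size=32, epochs=10, verbose=1)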

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f83a05b1190>

In [53]:
# Generate synthetic test data
x_test = np.random.rand(200, 10)
y_test = np.random.randint(0, 2, size=(200,))
y_test_onehot = np.eye(2)[y_test]



In [54]:
# Predict using the teacher model
teacher_pred = np.argmax(teacher_model.predict(x_test), axis=-1)




In [55]:
# Calculate accuracy of teacher model
teacher_accuracy = np.mean(teacher_pred == y_test)
print(f"Accuracy of teacher model: {teacher_accuracy:.4f}")


Accuracy of teacher model: 0.5550


In [56]:
# Predict using the student model
student_pred = np.argmax(student_model.predict(x_test), axis=-1)




In [58]:
# Calculate accuracy of student model
student_accuracy = np.mean(student_pred == y_test)
print(f"Accuracy of student model: {student_accuracy:.4f}")

Accuracy of student model: 0.5250
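
Because the aim of distillation is for the student to mimic the teacher, the fraction of samples on which the two models agree can be a useful complement to accuracy. The sketch below uses a hypothetical helper `agreement_rate` and small made-up prediction arrays so that it runs on its own; in the notebook the same function could be applied to `teacher_pred` and `student_pred`.

```python
import numpy as np

def agreement_rate(preds_a, preds_b):
    """Fraction of positions where two arrays of predicted class indices match."""
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    if preds_a.shape != preds_b.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float(np.mean(preds_a == preds_b))

# Made-up class predictions for five samples (illustration only)
toy_teacher_pred = np.array([0, 1, 1, 0, 1])
toy_student_pred = np.array([0, 1, 0, 0, 1])

toy_agreement = agreement_rate(toy_teacher_pred, toy_student_pred)
print(f"Teacher-student agreement: {toy_agreement:.4f}")
```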
