### Ideas taken from:

Knowledge Distillation  Explained with Keras Example | #MLConcepts
(https://www.youtube.com/watch?v=0ZS2lLsZwBY)

https://keras.io/examples/vision/knowledge_distillation/

https://github.com/keras-team/keras-io/blob/master/examples/vision/knowledge_distillation.py


Knowledge Distillation - Keras Code Examples | Henry AI Labs
(https://www.youtube.com/watch?v=Y2K13XDqwiM)
(https://www.youtube.com/watch?v=gZPUGje1PCI)



#### TODO: 

https://huggingface.co/docs/transformers/model_doc/distilbert



    
    

### Main idea:

Knowledge Distillation is a procedure for model compression, in which a small (student) model is trained to match a large pre-trained (teacher) model. 

Knowledge is transferred from the teacher model to the student by minimizing a loss function, aimed at matching softened teacher logits as well as ground-truth labels.
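To make "softened teacher logits" concrete, here is a minimal NumPy sketch of a temperature-scaled softmax. The function name `softmax_with_temperature` and the logit values are made up for illustration; they are not part of the Keras example.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of (logits / temperature) along the last axis."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # Subtract the max for numerical stability; this does not change the result.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical teacher logits for one image and three classes.
teacher_logits = np.array([8.0, 2.0, 0.5])

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=10.0)

print("T=1 :", hard.round(3))
print("T=10:", soft.round(3))
```

A larger temperature flattens the distribution while keeping the class ranking, so the student also sees how the teacher rates the non-target classes.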



### Distiller

The custom `Distiller()` class overrides the `Model` methods `train_step`, `test_step`, and `compile()`. To use the distiller, we need:

1: A trained teacher model. Here it is a large convnet model for image classification.

2: A student model to train. Here it is a smaller convnet model for image classification.

3: A student loss function on the difference between student predictions and ground-truth labels.

4: A distillation loss function, along with a temperature, on the difference between the soft student predictions and the soft teacher labels.

5: An alpha factor to weight the student and distillation losses.

6: An optimizer for the student and (optional) metrics to evaluate performance.

In the train_step method, we perform a forward pass of both the teacher and student, calculate the loss with weighting of the student_loss and distillation_loss by alpha and 1 - alpha, respectively, and perform the backward pass. Note: only the student weights are updated, and therefore we only calculate the gradients for the student weights.

In the test_step method, we evaluate the student model on the provided dataset.
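The loss computed in `train_step` can be sketched in plain NumPy as follows. This is an illustration under assumptions, not the Keras implementation: the helper names and the demo logits are hypothetical, and the KL term is written out by hand rather than calling `keras.losses.KLDivergence`.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_objective(y_true, student_logits, teacher_logits,
                           alpha=0.1, temperature=3.0):
    """Return (total_loss, student_loss, distillation_loss) for one batch."""
    y_true = np.asarray(y_true).ravel()  # labels may arrive with shape (N, 1)
    rows = np.arange(len(y_true))

    # Student loss: sparse categorical cross-entropy on the raw student logits.
    student_loss = -np.log(softmax(student_logits)[rows, y_true]).mean()

    # Distillation loss: KL(teacher || student) on temperature-softened outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl_per_sample = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    distillation_loss = kl_per_sample.mean()

    total = alpha * student_loss + (1 - alpha) * distillation_loss
    return total, student_loss, distillation_loss

# Made-up batch of two samples and three classes.
y_demo = np.array([[0], [2]])
teacher_demo = np.array([[5.0, 1.0, 0.0], [0.0, 1.0, 4.0]])
student_demo = np.array([[2.0, 1.0, 0.0], [1.0, 1.0, 2.0]])

total, s_loss, d_loss = distillation_objective(y_demo, student_demo, teacher_demo)
print(total, s_loss, d_loss)
```

In the real `train_step`, only the student's variables receive gradients of this total; the teacher is run with `training=False` outside the gradient tape.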

### Code starts here

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

## Create student and teacher models

Initially, we create a teacher model and a smaller student model. Both models are
convolutional neural networks created using `Sequential()`,
but they could be any Keras model.
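To get a feel for how much smaller the student is, the trainable parameters can be counted by hand. The sketch below uses the layer sizes of the v3 teacher and of the student defined later in this notebook; the helper functions are hypothetical, and `model.summary()` remains the authoritative source.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(in_features, units):
    """Weights plus one bias per unit."""
    return in_features * units + units

# Each stride-2 "same" convolution halves the feature map: 32 -> 16 -> 8 -> 4.
# The stride-1 "same" max-pooling layers keep the size, and LeakyReLU,
# MaxPooling2D, Dropout and Flatten have no trainable parameters.
teacher_params = (
    conv2d_params(3, 3, 3, 256)
    + conv2d_params(3, 3, 256, 256)
    + conv2d_params(3, 3, 256, 512)
    + dense_params(4 * 4 * 512, 128)
    + dense_params(128, 10)
)

# The student has two stride-2 convolutions: 32 -> 16 -> 8.
student_params = (
    conv2d_params(3, 3, 3, 32)
    + conv2d_params(3, 3, 32, 32)
    + dense_params(8 * 8 * 32, 16)
    + dense_params(16, 10)
)

print("teacher:", teacher_params)
print("student:", student_params)
print("ratio  :", round(teacher_params / student_params, 1))
```

Under these assumptions the teacher has about 2.8 million parameters and the student about 43 thousand.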


Q: Why does the original teacher (v1) model not use the softmax activation? See https://github.com/keras-team/keras-io/issues/755

https://datascience.stackexchange.com/questions/73093/what-does-from-logits-true-do-in-sparsecategoricalcrossentropy-loss-function


The `from_logits=True` argument informs the loss function that the output values generated by the model are not normalized, i.e. they are logits.
In other words, the softmax function has not been applied to them to produce a probability distribution. Therefore, the output layer in this case does not have a softmax activation function.

In short, if the last layer of the model has no softmax, we need `from_logits=True` to indicate that the outputs are not normalized probabilities.


With `from_logits=True`, SparseCategoricalCrossentropy applies the softmax function internally.


> https://guru.tistory.com/67
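The equivalence behind `from_logits=True` can be checked with a small NumPy sketch: applying softmax first and then cross-entropy gives the same value as computing the cross-entropy directly from the logits. The function names and the demo values are hypothetical; this illustrates the idea and is not the Keras source code.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def sparse_ce_from_probs(y_true, probs):
    """Cross-entropy when the model already outputs probabilities."""
    return -np.log(probs[np.arange(len(y_true)), y_true]).mean()

def sparse_ce_from_logits(y_true, logits):
    """Cross-entropy computed from raw logits via a stable log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(y_true)), y_true].mean()

# Made-up logits for two samples and three classes.
logits_demo = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels_demo = np.array([0, 2])

loss_a = sparse_ce_from_probs(labels_demo, softmax(logits_demo))
loss_b = sparse_ce_from_logits(labels_demo, logits_demo)
print(loss_a, loss_b)
```

Keeping the model output as logits also makes it easy to divide by a temperature before the softmax, which the distillation loss relies on.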



In [None]:
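def make_teacher_model(version='v3'):
    # Fail early on an unknown version instead of hitting an unbound `teacher` at return.
    if version not in ('v1', 'v2', 'v3'):
        raise ValueError(f"Unknown teacher version: {version!r}")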
    if version == 'v1':
        # Create the teacher
        teacher = keras.Sequential(
            [
                keras.Input(shape=(32, 32, 3)),
                layers.Conv2D(256, (3, 3), strides=(2, 2), padding="same"),
                layers.LeakyReLU(alpha=0.2),
                layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
                layers.Conv2D(512, (3, 3), strides=(2, 2), padding="same"),
                layers.Flatten(),
                layers.Dense(10),
            ],
            name="teacher",
        )

    if version == 'v2':
        # add multiple conv blocks
        teacher = keras.Sequential(
            [
                keras.Input(shape=(32, 32, 3)),
                layers.Conv2D(256, (3, 3), strides=(2, 2), padding="same"),
                layers.LeakyReLU(alpha=0.2),
                layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
                layers.Conv2D(256, (3, 3), strides=(2, 2), padding="same"),
                layers.LeakyReLU(alpha=0.2),
                layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
                layers.Conv2D(512, (3, 3), strides=(2, 2), padding="same"),
                layers.LeakyReLU(alpha=0.2),
                layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
                layers.Flatten(),
                layers.Dense(128, activation='relu'),
                layers.Dense(10),
            ],
            name="teacher",
        )
        
    if version == 'v3':
        # add dropout
        teacher = keras.Sequential(
            [
                keras.Input(shape=(32, 32, 3)),
                layers.Conv2D(256, (3, 3), strides=(2, 2), padding="same"),
                layers.LeakyReLU(alpha=0.2),
                layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
                layers.Dropout(0.2),
                layers.Conv2D(256, (3, 3), strides=(2, 2), padding="same"),
                layers.LeakyReLU(alpha=0.2),
                layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
                layers.Dropout(0.3),
                layers.Conv2D(512, (3, 3), strides=(2, 2), padding="same"),
                layers.LeakyReLU(alpha=0.2),
                layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
                layers.Dropout(0.4),
                layers.Flatten(),
                layers.Dense(128, activation='relu'),
                layers.Dropout(0.5),
                layers.Dense(10),
            ],
            name="teacher",
        )

    return teacher

In [None]:
# teacher = make_teacher_model(version='v2')
teacher = make_teacher_model(version='v3')

## Prepare the dataset

Possible datasets:

[MNIST](https://keras.io/api/datasets/mnist/) (not used here)

---------

[CIFAR-10](https://keras.io/api/datasets/cifar10/)

This is a dataset of 50,000 32x32 color training images and 10,000 test images, labeled over 10 categories. 

https://www.cs.toronto.edu/~kriz/cifar.html

The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.

Pixel values range from 0 to 255.


Both the student and teacher are trained on the training set and evaluated on
the test set.


In [None]:
# Prepare the train and test dataset.
batch_size = 64
# (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


In [None]:
x_train.shape, y_train.shape

((50000, 32, 32, 3), (50000, 1))

In [None]:
x_test.shape, y_test.shape

((10000, 32, 32, 3), (10000, 1))

In [None]:
y_train[0]

array([6], dtype=uint8)

In [None]:
x_train[0]


array([[[ 59,  62,  63],
        [ 43,  46,  45],
        [ 50,  48,  43],
        ...,
        [158, 132, 108],
        [152, 125, 102],
        [148, 124, 103]],

       [[ 16,  20,  20],
        [  0,   0,   0],
        [ 18,   8,   0],
        ...,
        [123,  88,  55],
        [119,  83,  50],
        [122,  87,  57]],

       [[ 25,  24,  21],
        [ 16,   7,   0],
        [ 49,  27,   8],
        ...,
        [118,  84,  50],
        [120,  84,  50],
        [109,  73,  42]],

       ...,

       [[208, 170,  96],
        [201, 153,  34],
        [198, 161,  26],
        ...,
        [160, 133,  70],
        [ 56,  31,   7],
        [ 53,  34,  20]],

       [[180, 139,  96],
        [173, 123,  42],
        [186, 144,  30],
        ...,
        [184, 148,  94],
        [ 97,  62,  34],
        [ 83,  53,  34]],

       [[177, 144, 116],
        [168, 129,  94],
        [179, 142,  87],
        ...,
        [216, 184, 140],
        [151, 118,  84],
        [123,  92,  72]]

In [None]:
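# Normalize pixel values from [0, 255] to [0, 1]
x_train = x_train.astype("float32") / 255.0
# CIFAR-10 arrays already have shape (N, 32, 32, 3), so this reshape is a no-op here
x_train = np.reshape(x_train, (-1, 32, 32, 3))

x_test = x_test.astype("float32") / 255.0
x_test = np.reshape(x_test, (-1, 32, 32, 3))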

In [None]:
x_train.shape, y_train.shape

((50000, 32, 32, 3), (50000, 1))

In [None]:
x_train[0]

array([[[0.23137255, 0.24313726, 0.24705882],
        [0.16862746, 0.18039216, 0.1764706 ],
        [0.19607843, 0.1882353 , 0.16862746],
        ...,
        [0.61960787, 0.5176471 , 0.42352942],
        [0.59607846, 0.49019608, 0.4       ],
        [0.5803922 , 0.4862745 , 0.40392157]],

       [[0.0627451 , 0.07843138, 0.07843138],
        [0.        , 0.        , 0.        ],
        [0.07058824, 0.03137255, 0.        ],
        ...,
        [0.48235294, 0.34509805, 0.21568628],
        [0.46666667, 0.3254902 , 0.19607843],
        [0.47843137, 0.34117648, 0.22352941]],

       [[0.09803922, 0.09411765, 0.08235294],
        [0.0627451 , 0.02745098, 0.        ],
        [0.19215687, 0.10588235, 0.03137255],
        ...,
        [0.4627451 , 0.32941177, 0.19607843],
        [0.47058824, 0.32941177, 0.19607843],
        [0.42745098, 0.28627452, 0.16470589]],

       ...,

       [[0.8156863 , 0.6666667 , 0.3764706 ],
        [0.7882353 , 0.6       , 0.13333334],
        [0.7764706 , 0

## Train the teacher
In knowledge distillation we assume that the teacher is trained and fixed. Thus, we start
by training the teacher model on the training set in the usual way.


### Model params:
https://stackoverflow.com/questions/44477489/keras-difference-between-categorical-accuracy-and-sparse-categorical-accuracy

With categorical_accuracy, the target (y) must be a one-hot encoded vector; e.g. with 3 classes, when the true class is the second class, y should be (0, 1, 0).

With sparse_categorical_accuracy, you only provide the integer index of the true class; in the previous example it would be 1, as class indexing is 0-based.
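The two metrics differ only in the label format they expect. A small NumPy sketch with made-up predictions shows that both give the same accuracy once the labels are in the matching format:

```python
import numpy as np

# Hypothetical predicted probabilities for three samples and three classes.
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
])

# Sparse labels: one integer class index per sample.
y_sparse = np.array([1, 0, 1])
# One-hot labels: the same information as a vector per sample.
y_onehot = np.eye(3)[y_sparse]

predicted = probs.argmax(axis=-1)

# What sparse_categorical_accuracy measures: compare to the integer label.
sparse_acc = np.mean(predicted == y_sparse)
# What categorical_accuracy measures: compare to the argmax of the one-hot label.
categorical_acc = np.mean(predicted == y_onehot.argmax(axis=-1))

print(sparse_acc, categorical_acc)
```

CIFAR-10 labels are loaded as integers, which is why this notebook uses the sparse variants of both the loss and the metric.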


### Improve the teacher model: 

https://machinelearningmastery.com/how-to-develop-a-cnn-from-scratch-for-cifar-10-photo-classification/


It is better to use a separate validation dataset, e.g. by splitting the train dataset into train and validation sets. We will not split the data in this case, and instead use the test dataset as a validation dataset to keep the example simple.
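For reference, a separate validation set could be carved out of the training data along these lines. This is a minimal sketch: the function `train_val_split`, the 20% fraction, and the stand-in arrays are assumptions for illustration and are not used elsewhere in this notebook.

```python
import numpy as np

def train_val_split(x, y, val_fraction=0.1, seed=0):
    """Shuffle indices once and hold out a fraction of the samples for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])

# Stand-in arrays with CIFAR-10-like shapes; labels double as sample ids here.
x_demo = np.zeros((100, 32, 32, 3), dtype="float32")
y_demo = np.arange(100).reshape(-1, 1)

(x_tr, y_tr), (x_val, y_val) = train_val_split(x_demo, y_demo, val_fraction=0.2)
print(x_tr.shape, x_val.shape)
```

Keras also offers a `validation_split` argument on `fit`, which holds out the last fraction of the arrays passed in; the test set would then be kept for the final evaluation only.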

Using multiple Conv2D + MaxPooling2D blocks instead of a single block improved accuracy by more than 5%.

https://stackoverflow.com/questions/63989328/can-i-combine-conv2d-and-leakyrelu-into-a-single-layer

Adding LeakyReLU after each Conv2D helped as well: a boost of around 4%.

Adding dropout reduces the extent of overfitting.



In [None]:
# Train teacher as usual
teacher.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

SparseCategoricalCrossentropy

> https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy



In [None]:
# Train and evaluate teacher on data.
history = teacher.fit(
    x_train,
    y_train,
    epochs=20,
    batch_size=batch_size,
    validation_data=(x_test, y_test),
    verbose=1,
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
teacher.evaluate(x_test, y_test)
print(teacher.metrics_names)

['loss', 'sparse_categorical_accuracy']


Model training APIs


> https://keras.io/api/models/model_training_apis/


### Next task:

## Train student from scratch for comparison

1: Define the student.

2: Train the student without distillation (student_scratch).


We can also train an equivalent student model from scratch without the teacher, in order
to evaluate the performance gain obtained by knowledge distillation.



    

In [None]:
# Create the student
student_scratch = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, (3, 3), strides=(2, 2), padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
        layers.Conv2D(32, (3, 3), strides=(2, 2), padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(16, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10),
    ],
    name="student_scratch",
)

In [None]:
# Train the student as usual
student_scratch.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

In [None]:
# Train and evaluate student trained from scratch.
history2 = student_scratch.fit(
    x_train,
    y_train,
    epochs=20,
    batch_size=batch_size,
    validation_data=(x_test, y_test),
    verbose=1,
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
student_scratch.evaluate(x_test, y_test)
print(student_scratch.metrics_names)

['loss', 'sparse_categorical_accuracy']


### student_scratch best perf = 

63%

### Next task:

Train the student with distillation.

Can this increase the student's score?



    

In [None]:
class Distiller(keras.Model):
    def __init__(self, student, teacher):
        super(Distiller, self).__init__()
        self.teacher = teacher
        self.student = student

    def compile(
        self,
        optimizer,
        metrics,
        student_loss_fn,
        distillation_loss_fn,
        alpha=0.1,
        temperature=3,
    ):
        """ Configure the distiller.
        Args:
            optimizer: Keras optimizer for the student weights
            metrics: Keras metrics for evaluation
            student_loss_fn: Loss function of difference between student
                predictions and ground-truth
            distillation_loss_fn: Loss function of difference between soft
                student predictions and soft teacher predictions
            alpha: weight to student_loss_fn and 1-alpha to distillation_loss_fn
            temperature: Temperature for softening probability distributions.
                Larger temperature gives softer distributions.
        """
        super(Distiller, self).compile(optimizer=optimizer, metrics=metrics)
        self.student_loss_fn = student_loss_fn
        self.distillation_loss_fn = distillation_loss_fn
        self.alpha = alpha
        self.temperature = temperature

    def train_step(self, data):
        # Unpack data
        x, y = data

        # Forward pass of teacher
        teacher_predictions = self.teacher(x, training=False)

        with tf.GradientTape() as tape:
            # Forward pass of student
            student_predictions = self.student(x, training=True)

            # Compute losses
            student_loss = self.student_loss_fn(y, student_predictions)
            distillation_loss = self.distillation_loss_fn(
                tf.nn.softmax(teacher_predictions / self.temperature, axis=1),
                tf.nn.softmax(student_predictions / self.temperature, axis=1),
            )
            loss = self.alpha * student_loss + (1 - self.alpha) * distillation_loss

        # Compute gradients
        trainable_vars = self.student.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Update the metrics configured in `compile()`.
        self.compiled_metrics.update_state(y, student_predictions)

        # Return a dict of performance
        results = {m.name: m.result() for m in self.metrics}
        results.update(
            {"student_loss": student_loss, "distillation_loss": distillation_loss}
        )
        return results

    def test_step(self, data):
        # Unpack the data
        x, y = data

        # Compute predictions
        y_prediction = self.student(x, training=False)

        # Calculate the loss
        student_loss = self.student_loss_fn(y, y_prediction)

        # Update the metrics.
        self.compiled_metrics.update_state(y, y_prediction)

        # Return a dict of performance
        results = {m.name: m.result() for m in self.metrics}
        results.update({"student_loss": student_loss})
        return results


GradientTape


> https://shinslab.tistory.com/110



In [None]:
# Create the student
student = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, (3, 3), strides=(2, 2), padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
        layers.Conv2D(32, (3, 3), strides=(2, 2), padding="same"),
        layers.LeakyReLU(alpha=0.2),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(16, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10),
    ],
    name="student",
)

In [None]:
# Initialize and compile distiller
distiller = Distiller(student=student, teacher=teacher)

In [None]:
distiller.compile(
    optimizer=keras.optimizers.Adam(),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
    student_loss_fn=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    distillation_loss_fn=keras.losses.KLDivergence(),
    alpha=0.1,
    temperature=10,
)

KL Divergence

> https://angeloyeo.github.io/2020/10/27/KL_divergence.html



KLDivergence and CrossEntropy


> https://uhou.tistory.com/200
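The relation between the two quantities is that cross-entropy equals the entropy of the target distribution plus the KL divergence: H(p, q) = H(p) + KL(p || q). A quick NumPy check with two made-up distributions:

```python
import numpy as np

# Hypothetical target distribution p (e.g. softened teacher output)
# and predicted distribution q (e.g. softened student output).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

entropy = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
kl = np.sum(p * np.log(p / q))          # KL(p || q)

print(cross_entropy, entropy + kl)
```

Because H(p) does not depend on the student, minimizing the KL divergence with respect to the student is equivalent to minimizing the cross-entropy against the teacher's soft labels.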



In [None]:
# Distill teacher to student
history3 = distiller.fit(x_train, 
                         y_train, 
                         epochs=20,
                         batch_size=batch_size,
                         validation_data=(x_test, y_test), 
                         verbose=1)            

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
# Evaluate student on test dataset
distiller.evaluate(x_test, y_test)
print(distiller.metrics_names)

['sparse_categorical_accuracy']


### Experiment with the alpha hyperparameter

In [None]:
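# Initialize and compile distiller
# Note: `student` is the same object trained in the previous run,
# so training continues from those weights rather than from scratch.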
distiller = Distiller(student=student, teacher=teacher)

distiller.compile(
    optimizer=keras.optimizers.Adam(),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
    student_loss_fn=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    distillation_loss_fn=keras.losses.KLDivergence(),
    alpha=0.5,
    temperature=10,
)


# Distill teacher to student
history3b = distiller.fit(x_train, 
                         y_train, 
                         epochs=20,
                         batch_size=batch_size,
                         validation_data=(x_test, y_test), 
                         verbose=1)            




# Evaluate student on test dataset
distiller.evaluate(x_test, y_test)



Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.6266999840736389, 1.0080249309539795]

In [None]:
# Initialize and compile distiller
distiller = Distiller(student=student, teacher=teacher)

distiller.compile(
    optimizer=keras.optimizers.Adam(),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
    student_loss_fn=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    distillation_loss_fn=keras.losses.KLDivergence(),
    alpha=0.2,
    temperature=10,
)


# Distill teacher to student
history3c = distiller.fit(x_train, 
                         y_train, 
                         epochs=20,
                         batch_size=batch_size,
                         validation_data=(x_test, y_test), 
                         verbose=1)            

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.6373999714851379, 0.9663770198822021]

In [None]:
# Initialize and compile distiller
distiller = Distiller(student=student, teacher=teacher)

distiller.compile(
    optimizer=keras.optimizers.Adam(),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
    student_loss_fn=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    distillation_loss_fn=keras.losses.KLDivergence(),
    alpha=0.9,
    temperature=10,
)


# Distill teacher to student
history3d = distiller.fit(x_train, 
                         y_train, 
                         epochs=20,
                         batch_size=batch_size,
                         validation_data=(x_test, y_test), 
                         verbose=1)            




# Evaluate student on test dataset
distiller.evaluate(x_test, y_test)



Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


[0.6686000227928162, 0.9546219706535339]

### effect of alpha:

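The alpha values tried were 0.1, 0.2, 0.5, and 0.9.

loss = self.alpha * student_loss + (1 - self.alpha) * distillation_loss

A higher alpha gives more weight to the student loss and less to the distillation loss.


That is, a low alpha should give better results, since it relies more on distillation.

So far that is not evident: the recorded test accuracies are about 0.627 (alpha=0.5), 0.637 (alpha=0.2), and 0.669 (alpha=0.9). Note that every run wraps the same `student` object without re-initializing it, so each run continues training from the previous one, and accuracy rises in run order (0.5, then 0.2, then 0.9). The effect of alpha is therefore confounded with the extra training epochs.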
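Each alpha run in this notebook wraps the same `student` object, so a fairer comparison would start every run from a freshly initialized student. The sketch below shows only that control flow; `sweep_alphas`, `make_student`, and `train_and_evaluate` are hypothetical names, and the stubs stand in for the actual Keras model and training code.

```python
def sweep_alphas(alphas, make_student, train_and_evaluate):
    """Train one fresh student per alpha and collect the evaluation results."""
    results = {}
    for alpha in alphas:
        student = make_student()  # new, untrained student for every run
        results[alpha] = train_and_evaluate(student, alpha)
    return results

# Stubs so the sketch runs without TensorFlow; replace them with real code.
created = []

def make_stub_student():
    student = {"id": len(created)}
    created.append(student)
    return student

def stub_train_and_evaluate(student, alpha):
    return {"student_id": student["id"], "alpha": alpha}

demo_results = sweep_alphas([0.1, 0.2, 0.5, 0.9], make_stub_student, stub_train_and_evaluate)
print(demo_results)
```

In this notebook, `make_student` could wrap the `keras.Sequential(...)` student definition, and `train_and_evaluate` could build a `Distiller`, compile it with the given alpha, call `fit`, and return `distiller.evaluate(x_test, y_test)`.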

