In [None]:
#| include: false
from fasterai.distill.all import *
from fastai.vision.all import *



We'll illustrate how to use Knowledge Distillation to transfer the knowledge of a ResNet34 (the teacher) to a ResNet18 (the student).

Let's grab some data.

In [None]:
path = untar_data(URLs.PETS)
files = get_image_files(path/"images")

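# In the Pets dataset, cat images have filenames starting with an uppercase letter,
# so this labels each image as cat (True) or dog (False).
def label_func(f): return f[0].isupper()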

dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(64))

The first step is to train the teacher model. We start from a pretrained model to make sure we get good results on our dataset.

In [None]:
teacher = vision_learner(dls, resnet34, metrics=accuracy)
teacher.unfreeze()
teacher.fit_one_cycle(10, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.69422,0.399882,0.870771,00:02
1,0.464552,2.638947,0.7023,00:02
2,0.343979,0.304005,0.861976,00:02
3,0.235383,0.263767,0.88295,00:02
4,0.194038,0.303563,0.871448,00:02
5,0.198564,0.239137,0.907307,00:02
6,0.120475,0.220184,0.916779,00:02
7,0.082618,0.177775,0.935047,00:02
8,0.049762,0.184576,0.943166,00:02
9,0.029427,0.186081,0.937754,00:02


### Without KD

We'll now train a ResNet18 from scratch, without any help from the teacher model, to serve as a baseline.

In [None]:
student = Learner(dls, resnet18(num_classes=2), metrics=accuracy)
#student = vision_learner(dls, resnet18, metrics=accuracy)
student.fit_one_cycle(10, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.607181,0.622096,0.68065,00:02
1,0.573065,0.574762,0.686062,00:02
2,0.546618,0.573394,0.667794,00:02
3,0.499513,0.515861,0.748309,00:02
4,0.460206,0.436721,0.794993,00:02
5,0.406419,0.485153,0.749662,00:02
6,0.348736,0.376182,0.82544,00:02
7,0.282992,0.416911,0.830176,00:02
8,0.202095,0.374159,0.844384,00:02
9,0.161077,0.379357,0.847091,00:02


### With KD

And now we train the same model, but with the help of the teacher. The chosen loss is a combination of the regular classification loss (Cross-Entropy) and a loss pushing the student to learn from the teacher's predictions.
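To make the idea of a combined loss concrete, here is a minimal pure-Python sketch of a soft-target distillation loss for a single sample. It assumes the common formulation where the student matches the teacher's temperature-softened probabilities through a KL divergence scaled by T². The helper names, the temperature `T`, and the mixing `weight` are illustrative assumptions; the `SoftTarget` loss shipped with fasterai may differ in its details.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Regular classification loss against the true class index."""
    return -math.log(softmax(logits)[target])

def soft_target_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 (assumed convention)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return kl * T * T

def kd_loss(student_logits, teacher_logits, target, weight=0.5, T=4.0):
    """Hypothetical mix: (1 - weight) * cross-entropy + weight * soft-target term."""
    ce = cross_entropy(student_logits, target)
    st = soft_target_loss(student_logits, teacher_logits, T)
    return (1 - weight) * ce + weight * st

# Usage: an undecided student, a confident teacher, true class 0
print(kd_loss([0.2, -0.1], [2.5, -1.0], target=0))
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to the wrong classes.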

In [None]:
student = Learner(dls, resnet18(num_classes=2), metrics=accuracy)
#student = vision_learner(dls, resnet18, metrics=accuracy)
kd = KnowledgeDistillationCallback(teacher.model, SoftTarget)
student.fit_one_cycle(10, 1e-3, cbs=kd)

epoch,train_loss,valid_loss,accuracy,time
0,2.600283,2.541826,0.595399,00:02
1,2.418635,2.119347,0.709743,00:02
2,2.303681,1.673524,0.754398,00:02
3,2.09004,1.836362,0.746279,00:02
4,1.862276,1.576037,0.778755,00:02
5,1.588549,1.26887,0.82544,00:02
6,1.317268,1.272325,0.824087,00:02
7,0.99067,0.98784,0.84912,00:02
8,0.77217,0.921217,0.858593,00:02
9,0.644602,0.916613,0.854533,00:02


With the teacher's help, the student model performs slightly better: it reaches a final accuracy of 85.5%, against 84.7% for the baseline.

There are more elaborate KD losses, such as the one from ``Paying More Attention to Attention``, where the student tries to replicate the teacher's attention maps at intermediate layers.
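As a rough illustration of what such an attention map looks like, the pure-Python sketch below collapses a feature map into a single spatial map by averaging squared activations over channels, normalizes it, and compares the student's map to the teacher's. This follows the general idea of activation-based attention transfer; the function names and the flattened `[channels][positions]` layout are assumptions made for the example, and the `Attention` loss in fasterai may be implemented differently.

```python
import math

def attention_map(activation):
    """Collapse a feature map given as [channels][positions] into one spatial map.

    Each position gets the mean of squared activations over channels,
    then the map is L2-normalized so only its shape matters, not its scale.
    """
    n_channels = len(activation)
    n_positions = len(activation[0])
    amap = [sum(ch[i] ** 2 for ch in activation) / n_channels
            for i in range(n_positions)]
    norm = math.sqrt(sum(a * a for a in amap)) or 1.0  # guard against an all-zero map
    return [a / norm for a in amap]

def attention_loss(student_act, teacher_act):
    """Mean squared difference between the two normalized attention maps."""
    s = attention_map(student_act)
    t = attention_map(teacher_act)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

# Usage: the student has 2 channels, the teacher 3, over the same 4 positions
student_act = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 2.0, 0.0]]
teacher_act = [[0.5, 0.0, 3.0, 0.0], [0.0, 0.5, 3.0, 0.0], [0.0, 0.0, 3.0, 1.0]]
print(attention_loss(student_act, teacher_act))
```

Because the channel dimension is averaged away, the student and teacher layers may have different numbers of channels; only their spatial sizes need to match.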

Using such a loss requires specifying the layers whose attention maps we want to replicate. Layers are referred to by their `string` names, which can be obtained with the `get_model_layers` function.

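For example, we apply the loss after each residual block of our models. The student is a plain torchvision ResNet18, so its blocks are named `layer1` to `layer4`; the teacher was built with `vision_learner`, which wraps the backbone in a `Sequential`, so the corresponding blocks are named `0.4` to `0.7`: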

In [None]:
student = Learner(dls, resnet18(num_classes=2), metrics=accuracy)
kd = KnowledgeDistillationCallback(teacher.model, Attention, ['layer1', 'layer2', 'layer3', 'layer4'], ['0.4', '0.5', '0.6', '0.7'], weight=0.9)
student.fit_one_cycle(10, 1e-3, cbs=kd)

epoch,train_loss,valid_loss,accuracy,time
0,0.090781,0.08485,0.688769,00:02
1,0.082261,0.072775,0.73207,00:02
2,0.070628,0.065847,0.763193,00:02
3,0.060705,0.055545,0.81935,00:02
4,0.052722,0.061782,0.806495,00:03
5,0.045175,0.047013,0.835589,00:02
6,0.038804,0.050148,0.844384,00:02
7,0.030617,0.043418,0.866712,00:02
8,0.022949,0.044476,0.872124,00:02
9,0.019451,0.044187,0.874154,00:02
