<a href="https://colab.research.google.com/github/BedinEduardo/Colab_Repositories/blob/master/Know_Ledge_Distillation_Tutorial_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Knowledge Distillation Tutorial

Knowledge distillation is a technique that enables knowledge transfer from large, computationally expensive models to smaller ones without losing validity.
This allows deployment on less powerful hardware, making evaluation faster and more efficient.

In this tutorial, we will run a number of experiments aimed at improving the accuracy of a lightweight neural network (NN), using a more powerful network as a teacher.
The computational cost and the speed of the lightweight network will remain unaffected: our intervention focuses only on its weights, not on its forward pass.

You will learn:
* How to modify model classes to extract hidden representations and use them for further calculations.
* How to modify regular training loops in PyTorch to include additional losses on top of, for example, cross-entropy for classification.
* How to improve the performance of lightweight models by using more complex models as teachers.

## Prerequisites

* 1 GPU, 4 GB of memory
* PyTorch v2.0 or later (the `torch.accelerator` API used below requires a recent release)
* The CIFAR-10 dataset

In [None]:
# Import the required libraries
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torch.optim as optim

In [None]:
device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"

In [None]:
print(device)

cuda


## Loading CIFAR-10

CIFAR-10 is a popular image dataset with ten classes.
The objective is to predict one of these classes for each input image.

The input images are RGB, so they have 3 channels, and are 32 x 32 pixels.
Each image is therefore described by 3 x 32 x 32 = 3072 numbers ranging from 0 to 255.
A common practice in NNs is to normalize the input data, which helps avoid saturation in commonly used activation functions and increases numerical stability.
The normalization process used here consists of subtracting the mean and dividing by the standard deviation along each channel.
The tensors `mean=[0.485, 0.456, 0.406]` and `std=[0.229, 0.224, 0.225]` are precomputed per-channel means and standard deviations. These particular values are the widely used ImageNet statistics; they are close to, but not identical to, the statistics of the CIFAR-10 training set.
The same values are used for the test set: the NN is trained on features produced by subtracting and dividing by the numbers above, so we keep them to maintain consistency.
In real life, we would not be able to compute the mean and standard deviation of the test set, since that data would not be accessible at that point.
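
As a minimal sketch of the normalization described above, the snippet below computes per-channel statistics on a synthetic batch and applies them by hand. The batch is random stand-in data (an assumption, not CIFAR-10), and `normalize_per_channel` is a hypothetical helper name.

```python
import torch

def normalize_per_channel(images, mean, std):
    """Subtract the per-channel mean and divide by the per-channel std.

    images: float tensor of shape (N, C, H, W); mean, std: tensors of shape (C,).
    """
    return (images - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)

torch.manual_seed(0)
# Synthetic stand-in for a batch of RGB images scaled to [0, 1] (not real CIFAR-10 data)
fake_images = torch.rand(256, 3, 32, 32)

# Statistics are computed over the batch and both spatial dimensions, one value per channel
channel_mean = fake_images.mean(dim=(0, 2, 3))
channel_std = fake_images.std(dim=(0, 2, 3))

normalized = normalize_per_channel(fake_images, channel_mean, channel_std)
print(normalized.mean(dim=(0, 2, 3)))  # close to 0 for every channel
print(normalized.std(dim=(0, 2, 3)))   # close to 1 for every channel
```

In the tutorial itself, `transforms.Normalize` applies the same operation with the fixed values given above, and the same statistics are reused for the test set.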

In [None]:
# Preprocessing for CIFAR-10 (the batch size is set later, in the DataLoaders)
transforms_cifar = transforms.Compose([
    transforms.ToTensor(),  # convert the image to a tensor with values in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

In [None]:
# Loading the CIFAR-10 dataset:
training_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transforms_cifar)
testing_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transforms_cifar)


100%|██████████| 170M/170M [00:03<00:00, 48.6MB/s]


In [None]:
# Now define the DataLoaders
train_loader = torch.utils.data.DataLoader(training_dataset, batch_size=64, shuffle=True, num_workers=2)
# The test loader must read from testing_dataset (the held-out split), not from training_dataset
test_loader = torch.utils.data.DataLoader(testing_dataset, batch_size=64, shuffle=False, num_workers=2)

## Defining model classes and utility functions

Next, we need to define our model classes.
Several user-defined parameters need to be set here.
We use two different architectures, keeping the number of filters fixed across our experiments to ensure fair comparisons.
Both are CNNs with a different number of convolutional layers that serve as feature extractors, followed by a classifier with 10 classes.

The number of filters and neurons is smaller for the student.

Once the examples here are understood, the next steps are:

1. Replicate the code in VS Code, run it on the notebook, and then on the workstations.
2. Study and understand how to adapt other models, both larger and smaller, and repeat the same steps with them.
3. Replicate that in VS Code and run it on the notebook and the workstation.
4. Study other tutorials and run distillation for detection.
5. Replicate that code in VS Code as well.

In [None]:
# Deeper NN class to be used as a teacher:
class DeepNN(nn.Module):
  def __init__(self, num_classes=10):
    super(DeepNN, self).__init__()
    # Convolutional block: extracts features from the input image
    self.features = nn.Sequential(
        # 3 input channels (RGB), 128 filters of 3x3 pixels; padding=1 adds a one-pixel border so the 32x32 size is preserved
        nn.Conv2d(3, 128, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(128, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, padding=1),  # stride defaults to kernel_size (2): 32x32 -> 17x17
        nn.Conv2d(64, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(64, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),  # 17x17 -> 8x8
    )
    # Classifier: 32 channels x 8 x 8 = 2048 input features
    self.classifier = nn.Sequential(
        nn.Linear(2048, 512),
        nn.ReLU(),
        nn.Dropout(0.1),
        nn.Linear(512, num_classes)
    )

  def forward(self, x):
    x = self.features(x)
    x = torch.flatten(x, 1)  # flatten everything except the batch dimension
    x = self.classifier(x)
    return x


In [None]:
teacher = DeepNN(num_classes=10).to(device)
print(teacher)

DeepNN(
  (features): Sequential(
    (0): Conv2d(3, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU()
    (4): MaxPool2d(kernel_size=2, stride=2, padding=1, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU()
    (7): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU()
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (classifier): Sequential(
    (0): Linear(in_features=2048, out_features=512, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.1, inplace=False)
    (3): Linear(in_features=512, out_features=10, bias=True)
  )
)


In [None]:
# Lightweight model (student)
class LightNN(nn.Module):
  def __init__(self, num_classes=10):  # must have the same number of output classes as the teacher
    super(LightNN, self).__init__()
    self.features = nn.Sequential(
        nn.Conv2d(3,16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(16,16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )
    self.classifier = nn.Sequential(
        nn.Linear(1024, 256),
        nn.ReLU(),
        nn.Dropout(0.1),
        nn.Linear(256, num_classes)
    )

  def forward(self, x):
    x = self.features(x)
    x = torch.flatten(x,1)
    x = self.classifier(x)

    return x

In [None]:
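torch.manual_seed(42)  # arbitrary fixed seed so that later student instances can start from the same weights
student = LightNN(num_classes=10).to(device)
print(student)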

LightNN(
  (features): Sequential(
    (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (classifier): Sequential(
    (0): Linear(in_features=1024, out_features=256, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.1, inplace=False)
    (3): Linear(in_features=256, out_features=10, bias=True)
  )
)
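
The classifier input sizes (2048 for the teacher, 1024 for the student) follow from the spatial sizes produced by the convolution and pooling layers. The sketch below traces them with the standard output-size formula; `out_size` is a hypothetical helper, not part of PyTorch.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1

# Teacher (DeepNN): 3x3 convs with padding=1 keep the size; only the pooling layers shrink it
t = 32
t = out_size(t, kernel=3, padding=1)            # conv     -> 32
t = out_size(t, kernel=3, padding=1)            # conv     -> 32
t = out_size(t, kernel=2, stride=2, padding=1)  # max pool -> 17
t = out_size(t, kernel=3, padding=1)            # conv     -> 17
t = out_size(t, kernel=3, padding=1)            # conv     -> 17
t = out_size(t, kernel=2, stride=2)             # max pool -> 8
teacher_features = 32 * t * t                   # 32 output channels

# Student (LightNN)
s = 32
s = out_size(s, kernel=3, padding=1)  # conv     -> 32
s = out_size(s, kernel=2, stride=2)   # max pool -> 16
s = out_size(s, kernel=3, padding=1)  # conv     -> 16
s = out_size(s, kernel=2, stride=2)   # max pool -> 8
student_features = 16 * s * s         # 16 output channels

print(teacher_features, student_features)  # 2048 1024
```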


Two functions are used to help us produce and evaluate the results on our classification task. The first one is called `train` and takes the following arguments:
* **model**: A model instance to train (update its weights) via this function.
* **train_loader**: We defined our **train_loader** above; its job is to feed the data into the model.
* **epochs**: How many times we loop over the dataset.
* **learning_rate**: The learning rate determines how large our steps towards convergence should be.
* **device**: Determines the device to run the workload on: CPU, GPU, or another hardware accelerator.

The test function is similar, but it is invoked with **test_loader** to load images from the test set.

### Steps to train and transfer knowledge from teacher to student
1. Train the teacher and the student on the same training dataset, and evaluate them on the same test set.
2. Once the teacher has been trained, start the knowledge distillation (KD) process.
3. The teacher is put in evaluation mode, and the student stays in training mode.
4. Forward pass the same data (images) through the student (in training mode) and through the teacher (in evaluation mode).
5. Get the raw outputs (logits) of the student and compare them with those of the teacher.
6. KD adjusts the weights and biases of the student through a comparison between:
  * The losses of the teacher and the student, or;
  * The features of the teacher and the student, or;
  * The logits of the teacher and the student.

In [None]:
# Before continuing, define train and test functions for regular (non-distillation) training
def train(model, train_loader, epochs, learning_rate, device):  # learning-rate decay could also be added here
  loss_fn = nn.CrossEntropyLoss()  # loss for classification; could be exposed as a hyperparameter
  optimizer = optim.Adam(model.parameters(), lr=learning_rate)

  model.train()  # set the model to training mode

  for epoch in range(epochs):  # TODO: after this example and the detection examples, adapt the code to train for KD detection
    running_loss = 0.0
    for inputs, labels in train_loader:
      # inputs: a collection of batch_size images
      # labels: a vector of dimensionality batch_size with integers denoting the class of each image
      inputs, labels = inputs.to(device), labels.to(device)  # send to device

      optimizer.zero_grad()  # reset the gradients
      outputs = model(inputs)
      # outputs: output of the NN for the batch of images, a tensor of shape batch_size x num_classes
      # labels: the actual labels of the images, a vector of dimensionality batch_size
      loss = loss_fn(outputs, labels)  # compare the predictions with the ground truth
      loss.backward()  # compute the gradients of the loss with respect to the weights
      optimizer.step()  # update the weights using those gradients

      running_loss += loss.item()

    print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")

In [None]:
def test(model, test_loader, device):  # could be adapted to perform cross-validation
  model.to(device)
  model.eval()

  correct = 0
  total = 0

  with torch.no_grad():  # gradients are not needed for evaluation
    for inputs, labels in test_loader:
      inputs, labels = inputs.to(device), labels.to(device)

      outputs = model(inputs)
      _, predicted = torch.max(outputs, 1)  # index of the largest logit = predicted class

      total += labels.size(0)
      correct += (predicted == labels).sum().item()  # count the predictions that match the label

  accuracy = 100 * correct / total
  print(f"Test acc: {accuracy:.3f}")
  return accuracy

## Cross-Entropy runs

Start by training the teacher network using cross-entropy:


In [None]:
# now train the teacher, as the 1st step
train(model=teacher,
      train_loader=train_loader,
      epochs=11,
      learning_rate=0.001,
      device=device)
test_accuracy_teacher = test(model=teacher,
                             test_loader=test_loader,
                             device=device)

Epoch 1/11, Loss: 1.3355022575849158
Epoch 2/11, Loss: 0.8750486335409876
Epoch 3/11, Loss: 0.6935216221586823
Epoch 4/11, Loss: 0.5611832777938575
Epoch 5/11, Loss: 0.45006790917242884
Epoch 6/11, Loss: 0.3500845811098738
Epoch 7/11, Loss: 0.26924474640270635
Epoch 8/11, Loss: 0.21906665719740684
Epoch 9/11, Loss: 0.18478701465174824
Epoch 10/11, Loss: 0.16069171736802895
Epoch 11/11, Loss: 0.14771360345899373
Test acc: 97.508


In [None]:
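# Instantiate one more student network to compare performance. Backpropagation is sensitive to weight initialization,
# so both student networks must start from exactly the same initialization: reseed before instantiating.
torch.manual_seed(42)  # arbitrary fixed seed; this only matches `student` if it was created right after the same seed
new_student = LightNN(num_classes=10).to(device)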

To ensure we have created a copy of the first network, we inspect the norm of its first layer.
If the norms match, we can safely conclude that the networks are indeed the same.
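
Two instances only share their initial weights if the random number generator is in the same state when each is created. A minimal sketch of that check, using a small stand-in layer rather than `LightNN` and an arbitrary seed value of 42 (both are assumptions made for illustration):

```python
import torch
import torch.nn as nn

def make_layer(seed=None):
    """Create a small conv layer, optionally reseeding the RNG first (hypothetical helper)."""
    if seed is not None:
        torch.manual_seed(seed)
    return nn.Conv2d(3, 16, kernel_size=3, padding=1)

# Same seed before each instantiation: identical initial weights
a = make_layer(seed=42)
b = make_layer(seed=42)
same_init = torch.equal(a.weight, b.weight)

# No reseeding: the next layer draws different random weights
c = make_layer()
different_init = not torch.equal(a.weight, c.weight)

print(torch.norm(a.weight).item(), torch.norm(b.weight).item())
print(same_init, different_init)
```

A matching norm is a necessary but not sufficient check; `torch.equal` compares every weight.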

In [None]:
# Print the norm of the first layer of the initial lightweight model
print("Norm of 1st layer of student: ", torch.norm(student.features[0].weight).item())

# Print the norm of the first layer of the new lightweight model
print("Norm of 1st layer of new_student: ", torch.norm(new_student.features[0].weight).item())


Norm of 1st layer of student:  2.3673043251037598
Norm of 1st layer of new_student:  2.2718844413757324


In [None]:
# Print the total number of parameters in each model
total_params_teacher = "{:,}".format(sum(p.numel() for p in teacher.parameters()))
print(f"Teacher parameters: {total_params_teacher}")
total_params_student = "{:,}".format(sum(p.numel() for p in student.parameters()))
print(f"Student parameters: {total_params_student}")

Teacher parameters: 1,186,986
Student parameters: 267,738


In [None]:
# now train the student model
train(model=student,
      train_loader=train_loader,
      epochs=11,
      learning_rate=0.001,
      device=device)
test_accuracy_student = test(model=student,
                             test_loader=test_loader,
                             device=device)

Epoch 1/11, Loss: 1.4168188462934226
Epoch 2/11, Loss: 1.0798806783640782
Epoch 3/11, Loss: 0.9477588265295833
Epoch 4/11, Loss: 0.8508702839731865
Epoch 5/11, Loss: 0.7718450138940836
Epoch 6/11, Loss: 0.6993597454350927
Epoch 7/11, Loss: 0.633639339557694
Epoch 8/11, Loss: 0.5766048659487149
Epoch 9/11, Loss: 0.5212877927076481
Epoch 10/11, Loss: 0.4770622678729884
Epoch 11/11, Loss: 0.4244784154093174
Test acc: 90.144


In [None]:
print(f"Test Accuracy Teacher: {test_accuracy_teacher:.3f}%")
print(f"Test Accuracy Student: {test_accuracy_student:.3f}%")

Test Accuracy Teacher: 97.508%
Test Accuracy Student: 90.144%


## Knowledge distillation run

Now let's try to improve the test accuracy of the student network by incorporating the teacher.
Knowledge distillation is a straightforward technique to achieve this, based on the fact that both networks output a probability distribution over our classes.
Therefore, the two networks must share the same number of output neurons; this is an important requirement of the method.
The method works by incorporating an additional loss into the traditional cross-entropy loss, which is based on the softmax output of the teacher network.
The assumption is that the output activations of a properly trained teacher network carry additional information that can be leveraged by a student network during training.
The original work suggests that utilizing ratios of smaller probabilities in the soft targets can help achieve the underlying objective of deep neural networks, which is to build a similarity structure over the data where similar objects are mapped closer together.
For example, in CIFAR-10, a truck could be mistaken for an automobile or an airplane if its wheels are present, but it is less likely to be mistaken for a dog. Therefore, it makes sense to assume that valuable information resides not only in the top prediction of a properly trained model but in the entire output distribution.
However, cross-entropy alone does not sufficiently exploit this information, as the activations for non-predicted classes tend to be so small that the propagated gradients do not meaningfully change the weights to construct this desirable vector space.

As we continue defining our first helper function that introduces a teacher-student dynamic, we need to include a few extra parameters:

* **T**: Temperature controls the smoothness of the output distributions. Larger **T** leads to smoother distributions, thus smaller probabilities get a larger boost.
* **soft_target_loss_weight**: A weight assigned to the extra objective we are about to include.
* **ce_loss_weight**: A weight assigned to cross-entropy. Tuning these weights pushes the network towards optimizing for either objective.

1. **The distillation loss is calculated from the logits of the networks. It only returns gradients to the student.**
2. The soft targets are taken from the output layer and cover all classes, so every class contributes to the distillation loss.
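
To illustrate the effect of **T**, the sketch below applies a temperature-scaled softmax to a single made-up logit vector (the values are illustrative, not taken from the teacher).

```python
import torch
import torch.nn.functional as F

# Made-up logits for one image over four classes (illustrative values only)
logits = torch.tensor([6.0, 2.0, 1.0, -1.0])

for T in (1.0, 2.0, 5.0):
    probs = F.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 4) for p in probs.tolist()]}")

# Keep the two extremes for comparison
p_t1 = F.softmax(logits / 1.0, dim=-1)
p_t5 = F.softmax(logits / 5.0, dim=-1)
```

The ranking of the classes is unchanged; only the gaps between the probabilities shrink as T grows, so the small probabilities carry more weight in the loss.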


In [None]:
# Define a training loop for distillation from the last (output) layer
def train_knowledge_distillation(teacher, student, train_loader, epochs, learning_rate, T, soft_target_loss_weight, ce_loss_weight, device):
  ce_loss = nn.CrossEntropyLoss()  # for classification; could become a parameter in later versions
  optimizer = optim.Adam(student.parameters(), lr=learning_rate)  # only the student's weights are optimized

  teacher.eval()  # the teacher must already be trained
  student.train()  # the student stays in training mode

  for epoch in range(epochs):  # epochs is a hyperparameter; early stopping could be added in later versions
    running_loss = 0.0
    for inputs, labels in train_loader:
      inputs, labels = inputs.to(device), labels.to(device)

      optimizer.zero_grad()  # reset the gradients

      # Forward pass with the teacher model; no gradients are tracked, so the teacher's weights never change
      with torch.no_grad():
        teacher_logits = teacher(inputs)  # raw, unnormalized outputs

      # Forward pass with the student model on the same inputs
      student_logits = student(inputs)

      # Soften both outputs with temperature T: softmax for the teacher, log-softmax for the student
      soft_targets = nn.functional.softmax(teacher_logits / T, dim=-1)
      soft_prob = nn.functional.log_softmax(student_logits / T, dim=-1)

      # Soft-target loss: KL divergence averaged over the batch, scaled by T**2 as suggested by the authors of the paper
      soft_targets_loss = torch.sum(soft_targets * (soft_targets.log() - soft_prob)) / soft_prob.size()[0] * (T**2)

      # True-label loss (cross-entropy)
      label_loss = ce_loss(student_logits, labels)

      # Weighted sum of the two losses
      loss = soft_target_loss_weight * soft_targets_loss + ce_loss_weight * label_loss

      loss.backward()
      optimizer.step()

      running_loss += loss.item()

    print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")
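
The soft-target term above is the KL divergence between the softened teacher and student distributions, averaged over the batch and multiplied by T². As a sanity check on random logits (stand-ins for real model outputs), the sketch below compares the hand-written expression with `torch.nn.functional.kl_div` using `reduction="batchmean"`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 2.0
# Random stand-ins for a batch of 8 teacher and student logit vectors over 10 classes
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10)

soft_targets = F.softmax(teacher_logits / T, dim=-1)
soft_prob = F.log_softmax(student_logits / T, dim=-1)

# Expression used in the training loop
manual = torch.sum(soft_targets * (soft_targets.log() - soft_prob)) / soft_prob.size(0) * (T ** 2)

# Built-in equivalent: kl_div expects log-probabilities as input and probabilities as target
builtin = F.kl_div(soft_prob, soft_targets, reduction="batchmean") * (T ** 2)

print(manual.item(), builtin.item())
```

Either form can be used in the loop; the built-in call is shorter, while the explicit sum keeps the formula visible.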

Apply ``train_knowledge_distillation`` with a temperature of 2. The weights are set arbitrarily to 0.75 for cross-entropy (CE) and 0.25 for the distillation loss.

In [None]:
train_knowledge_distillation(teacher=teacher,
                             student=new_student,
                             train_loader=train_loader,
                             epochs=11,
                             learning_rate=0.001,
                             T=2,
                             soft_target_loss_weight=0.25,
                             ce_loss_weight=0.75,
                             device=device
                             )

Epoch 1/11, Loss: 1.7091755077357182
Epoch 2/11, Loss: 1.4989796847181247
Epoch 3/11, Loss: 1.3373480530651025
Epoch 4/11, Loss: 1.217679157891237
Epoch 5/11, Loss: 1.1024534967549318
Epoch 6/11, Loss: 1.0127524821197285
Epoch 7/11, Loss: 0.9249742182395647
Epoch 8/11, Loss: 0.85430480467389
Epoch 9/11, Loss: 0.7873108614512416
Epoch 10/11, Loss: 0.7221614835817186
Epoch 11/11, Loss: 0.6685162819636142


In [None]:
test_accuracy_light_ce_and_kd = test(new_student, test_loader, device)

In [None]:
# Now compare the student test accuracy with and without the teacher, after distillation
print(f"Teacher accuracy: {test_accuracy_teacher:.3f}%")
print(f"Student accuracy without teacher: {test_accuracy_student:.3f}%")
print(f"Student accuracy with CE + KD: {test_accuracy_light_ce_and_kd:.3f}%")

## Cosine Loss Minimization Run

Feel free to play around with the temperature parameter that controls the softness of the softmax function and with the loss coefficients.
In neural networks, it is easy to add loss functions to the main objective to achieve goals like better generalization.
Let's try including an additional objective for the student, but now let's focus on the hidden states rather than the output layers.
The goal is to convey information from the teacher's representation to the student by including a **naive loss function**, whose minimization implies that the flattened vectors that are subsequently passed to the classifiers have become more *similar* as the loss decreases.
Of course, the teacher does not update its weights, so the minimization depends only on the student's weights.
The rationale behind this method is the assumption that the teacher model has a better internal representation, one that is unlikely to be reached by the student without external intervention; therefore we artificially push the student to mimic the internal representation of the teacher.
Whether this will end up helping the student is not straightforward, though. Pushing the lightweight network to reach this point could be a good thing, assuming that we have found an internal representation that leads to better test accuracy, but it could also be harmful because the networks have different architectures and the student does not have the same learning capacity as the teacher.
In other words, there is no reason for these two vectors, the student's and the teacher's, to match per component.
The student could reach an internal representation that is a permutation of the teacher's and it would be just as efficient. Nonetheless, we can still run a quick experiment to figure out the impact of this method.
We will be using **CosineEmbeddingLoss**, which is given by the following formula:


$$
\text{loss}(x_1, x_2, y) =
\begin{cases}
1 - \cos(x_1, x_2), & \text{if } y = 1 \\
\max(0, \cos(x_1, x_2) - \text{margin}), & \text{if } y = -1
\end{cases}
$$
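
As a quick check of the formula, the sketch below evaluates both branches by hand on random vectors and compares them with `nn.CosineEmbeddingLoss` (default `margin=0`, mean reduction). The vectors are arbitrary stand-ins for hidden representations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x1 = torch.randn(4, 1024)
x2 = torch.randn(4, 1024)
cos = F.cosine_similarity(x1, x2, dim=1)  # one value per row

loss_fn = nn.CosineEmbeddingLoss()  # margin defaults to 0

# y = 1: the loss is 1 - cos, so minimizing it pulls the vectors together
manual_pos = (1 - cos).mean()
builtin_pos = loss_fn(x1, x2, torch.ones(4))

# y = -1: the loss is max(0, cos - margin), so only similar pairs are penalized
manual_neg = torch.clamp(cos - 0.0, min=0).mean()
builtin_neg = loss_fn(x1, x2, -torch.ones(4))

print(manual_pos.item(), builtin_pos.item())
print(manual_neg.item(), builtin_neg.item())
```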


Obviously, there is one thing that we need to resolve first.
When we applied distillation to the output layer, we mentioned that both networks have the same number of neurons, equal to the number of classes.
However, this is not the case for the layer following our convolutional layers.
**Here, the teacher has more neurons than the student after the flattening of the final convolutional layer**.
*Our loss function accepts two vectors of equal dimensionality as inputs, therefore we need to somehow match them*.
We will solve this by *including* an **average pooling layer** *after the teacher's convolutional layers* to reduce its dimensionality to match that of the student.

To proceed, we will modify our model classes, or create new ones. Now the **forward function** returns not only the logits of the network but also the **flattened hidden** representation after the convolutional layers. We include the aforementioned *pooling* for the modified teacher.
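
The pooling step can be sketched in isolation. `avg_pool1d` with a kernel size of 2 averages each pair of adjacent components, halving a 2048-component vector to 1024. The input below is a random stand-in for the teacher's flattened output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Random stand-in for a batch of 4 flattened teacher feature vectors
teacher_flat = torch.randn(4, 2048)

# A 2D input is treated as (channels, length), so each row is pooled independently
pooled = F.avg_pool1d(teacher_flat, 2)

# The same result written out by hand: mean of components (0, 1), (2, 3), ...
by_hand = teacher_flat.view(4, 1024, 2).mean(dim=2)

print(pooled.shape)
```

This is a fixed averaging of neighbouring components rather than a learned projection, which is part of why the approach is described as naive.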

In [None]:
# Build a modified teacher that applies average pooling to the flattened output of its conv layers,
# so that its hidden representation has the same dimensionality as the student's (the loss function only accepts vectors of equal size)

class ModifiedDeepCossine(nn.Module):
  def __init__(self, num_classes=10):  # num_classes could be set as a hyperparameter (see Daniel's code to extract the number of classes)
    super(ModifiedDeepCossine, self).__init__()  # initialize the parent nn.Module
    self.features = nn.Sequential(
        nn.Conv2d(3, 128, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(128, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, padding=1),  # same pooling as DeepNN, so the loaded weights see the same feature-map sizes
                                                 # Max pooling keeps the maximum of each 2x2 region, reducing the size of the tensor
        nn.Conv2d(64, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(64, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),  # stride=2: the window moves two pixels at each step
    )
    self.classifier = nn.Sequential(
        nn.Linear(2048, 512),
        nn.ReLU(),
        nn.Dropout(0.1),
        nn.Linear(512, num_classes)
    )

  def forward(self, x):
    x = self.features(x)
    flattened_conv_output = torch.flatten(x, 1)  # (N, C, H, W) -> (N, C*H*W)
    x = self.classifier(flattened_conv_output)
    flattened_conv_output_after_pooling = torch.nn.functional.avg_pool1d(flattened_conv_output, 2)  # 2048 -> 1024 components

    # x holds the logits; the pooled vector is taken after the conv layers (before the classifier)
    # and has its dimensionality adjusted for use in the loss function
    return x, flattened_conv_output_after_pooling

In [None]:
# Build a modified student class that returns a tuple. No pooling is applied after flattening
class ModifiedLightDeepCossine(nn.Module):
  def __init__(self,num_classes=10):
    super(ModifiedLightDeepCossine, self).__init__()
    self.features = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(16,16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )
    self.classifier = nn.Sequential(
        nn.Linear(1024, 256),
        nn.ReLU(),
        nn.Dropout(0.1),
        nn.Linear(256, num_classes)  # num_classes --> Hyperparameter
    )

  def forward(self, x):
    x = self.features(x)
    flattened_conv_output = torch.flatten(x,1)  # flatten the tensor
    x = self.classifier(flattened_conv_output)

    return x, flattened_conv_output  # flattened_conv_output will be used to adjust loss function

In [None]:
# We do not have to train the modified DNN from scratch, of course; we just load the weights of the trained teacher
modified_teacher = ModifiedDeepCossine(num_classes=10).to(device)
modified_teacher.load_state_dict(teacher.state_dict())

In [None]:
# Once again, check that the norm of the first layer is the same for both networks
print("Norm of 1st layer for deep_nn:", torch.norm(teacher.features[0].weight).item())
print("Norm of 1st layer for modified_deep_nn:", torch.norm(modified_teacher.features[0].weight).item())

In [None]:
# Initialize a modified lightweight network with the same seed as the other lightweight instances.
# This will be trained from scratch to examine the effectiveness of cosine loss minimization
torch.manual_seed(42)  # arbitrary fixed seed; reuse whatever seed the other student instances were created with
modified_student = ModifiedLightDeepCossine(num_classes=10).to(device)
print("Norm of 1st layer of modified student: ", torch.norm(modified_student.features[0].weight).item())

Now we need to modify the training loop because the models return a tuple (logits, hidden_representation).

Using a sample input tensor, we can print their shapes.

In [None]:
# Build a sample input tensor
sample_input = torch.rand(128,3,32,32).to(device)  # batch size: 128, channels: 3, image size: 32x32

In [None]:
# pass the input through the student
logits, hidden_representation = modified_student(sample_input)

In [None]:
# Print the shape of the tensors
print("Student logits shape: ", logits.shape)  # batch_size x total_classes
print("Student hidden representation shape: ", hidden_representation.shape)  # batch_size x hidden_representation_size

In [None]:
# Pass the input through the teacher
logits, hidden_representation = modified_teacher(sample_input)

In [None]:
# print the teacher shapes
print("Teacher Logits shape: ", logits.shape)
print("Teacher hidden representation shape: ", hidden_representation.shape)

In our case, **hidden_representation_size** is **1024**. This is the flattened feature map of the final convolutional layer of the student and, as you can see, it is the input to its classifier.
It is **1024** for the teacher too, because we reduced it from **2048** with **avg_pool1d**. The loss applied here only affects the weights of the student that come before the point where the loss is calculated.
In other words, it does not affect the classifier of the student.
The modified training loop is the following.

In cosine loss minimization, we want to maximize the cosine similarity of the two representations by returning gradients to the student:

In [None]:
def train_cosine_loss(teacher, student, train_loader, epochs, learning_rate, hidden_rep_loss_weight, ce_loss_weight, device):
  ce_loss = nn.CrossEntropyLoss()  # for classification
  cosine_loss = nn.CosineEmbeddingLoss()  # for the features after the conv layers
  optimizer = optim.Adam(student.parameters(), lr=learning_rate)  # learning-rate decay could be added here

  teacher.to(device)
  student.to(device)
  teacher.eval()  # the teacher must already be trained
  student.train()

  # Loop over the data
  for epoch in range(epochs):
    running_loss = 0.0
    for inputs, labels in train_loader:
      inputs, labels = inputs.to(device), labels.to(device)

      # Reset the gradients before each step
      optimizer.zero_grad()

      # Forward pass with the teacher model, keeping only the hidden representation
      with torch.no_grad():
        _, teacher_hidden_representation = teacher(inputs)  # must call the teacher here, not the student

      # Forward pass with the student model
      student_logits, student_hidden_representation = student(inputs)  # logits for classification, hidden representation for distillation

      # Calculate the cosine loss. The target is a vector of ones:
      # from the loss formula above, this is the case where minimizing the loss increases the cosine similarity.
      hidden_rep_loss = cosine_loss(student_hidden_representation, teacher_hidden_representation, target=torch.ones(inputs.size(0)).to(device))

      # Calculate the true label loss
      label_loss = ce_loss(student_logits, labels)

      # weighted sum of the two losses
      loss = hidden_rep_loss_weight * hidden_rep_loss + ce_loss_weight * label_loss

      loss.backward()
      optimizer.step()

      running_loss += loss.item()

    print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")
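
The claim that the cosine term reaches only the feature extractor can be checked directly. The sketch below uses a tiny hypothetical model (`TinyNet`, not one of the tutorial's classes) that returns `(logits, hidden)` and backpropagates the cosine loss alone.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical two-part model that returns (logits, hidden representation)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 4)
        self.classifier = nn.Linear(4, 3)

    def forward(self, x):
        hidden = self.features(x)
        return self.classifier(hidden), hidden

torch.manual_seed(0)
model = TinyNet()
inputs = torch.randn(5, 8)
teacher_hidden = torch.randn(5, 4)  # random stand-in for the teacher's representation

_, student_hidden = model(inputs)
loss = nn.CosineEmbeddingLoss()(student_hidden, teacher_hidden, torch.ones(5))
loss.backward()

# The hidden representation is computed before the classifier, so the classifier gets no gradient
features_has_grad = model.features.weight.grad is not None
classifier_has_grad = model.classifier.weight.grad is not None
print(features_has_grad, classifier_has_grad)
```

In the training loop the classifier is still updated, but only through the cross-entropy term.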

Now we need to modify the test function for the same reason.
Here we ignore the hidden representation returned by the model.

In [None]:
# A function to test models that return a tuple (logits, hidden representation)
def test_multiple_outputs(model, test_loader, device):
  model.to(device)
  model.eval()

  correct = 0  # number of correct predictions
  total = 0

  with torch.no_grad():  # gradients are not needed for evaluation
    for inputs, labels in test_loader:
      inputs, labels = inputs.to(device), labels.to(device)

      outputs, _ = model(inputs)  # disregard the second tensor of the tuple and keep the logits
      _, predicted = torch.max(outputs, 1)  # index of the largest logit = predicted class

      total += labels.size(0)  # total number of predictions
      correct += (predicted == labels).sum().item()  # count the predictions that match the label

  accuracy = 100 * correct / total
  print(f"Test Accuracy: {accuracy:.3f}")

  return accuracy

In this case, we could easily include both KD and cossine loss minimization in the same function.
It is common to combine methods to achieve better performance in teacher-student paradigms.
For now, we can run a simple train-test session.
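
As a rough illustration of combining both methods, the sketch below adds a soft-target KD term, a cosine term, and the usual cross-entropy term into one loss. The function name `combined_distillation_loss`, the temperature `T`, and the three weights are hypothetical choices, and the KD term is written as a temperature-scaled KL divergence, which may differ in detail from the KD function used earlier in this tutorial.

```python
import torch
import torch.nn.functional as F


def combined_distillation_loss(student_logits, teacher_logits,
                               student_hidden, teacher_hidden, labels,
                               T=2.0, kd_weight=0.25, cosine_weight=0.25, ce_weight=0.5):
    """Weighted sum of KD, cosine, and cross-entropy losses (all weights are assumptions)."""
    # KD term: KL divergence between temperature-softened distributions, scaled by T^2
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T ** 2)

    # Cosine term: a target of ones asks the hidden representations to be similar
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)

    # True-label term
    ce = F.cross_entropy(student_logits, labels)

    return kd_weight * kd + cosine_weight * cos + ce_weight * ce


# Usage with random stand-in tensors (batch of 8, 10 classes, 1024-dim hidden vectors)
torch.manual_seed(0)
student_logits = torch.randn(8, 10, requires_grad=True)
student_hidden = torch.randn(8, 1024, requires_grad=True)
teacher_logits = torch.randn(8, 10)
teacher_hidden = torch.randn(8, 1024)
labels = torch.randint(0, 10, (8,))

loss = combined_distillation_loss(student_logits, teacher_logits,
                                  student_hidden, teacher_hidden, labels)
loss.backward()  # gradients flow only into the student tensors
print(f"Combined loss: {loss.item():.4f}")
```

In a real training loop the teacher outputs would be computed under `torch.no_grad()`, as in `train_cosine_loss` above.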

In [None]:
# train the lightweight network with cross-entropy loss plus cosine loss
train_cosine_loss(teacher=modified_teacher,
                  student=modified_student,
                  train_loader=train_loader,
                  epochs=11,
                  learning_rate=0.001,
                  hidden_rep_loss_weight=0.25,
                  ce_loss_weight=0.75,
                  device=device)

In [None]:
test_accuracy_light_ce_and_cosine_loss = test_multiple_outputs(modified_student, test_loader, device)

## Intermediate regressor run

Our naive minimization does not guarantee better results for several reasons, one being the dimensionality of the vectors.
Cosine similarity generally works better than Euclidean distance for vectors of higher dimensionality, but we were dealing with vectors with 1024 components each, so it is much harder to extract meaningful similarities.
Furthermore, as we mentioned, pushing towards a match of the hidden representations of the teacher and the student is not supported by theory.
There are no good reasons why we should be aiming for a 1:1 match of these vectors.

We will provide a final example of training intervention by including an extra network called a regressor.
The objective is to first extract the feature map of the teacher after its Conv layers, then extract the feature map of the student after its Conv layers, and finally try to match these maps.
However, this time, we will introduce a regressor between the networks to facilitate the matching process.
The regressor will be trainable and ideally do a better job than our naive cosine loss minimization scheme.
Its main job is to match the dimensionality of these feature maps so that we can properly define a loss function between the teacher and the student.
Defining such a loss function provides a teaching "path", which is basically a flow to back-propagate gradients that will change the student's weights.
To see what the regressor has to bridge, we look at the output of the Conv layers right before each classifier in our original networks. In short:

1. Naive minimization does not guarantee better results.
2. Cosine similarity generally works better than Euclidean distance for high-dimensional vectors, but with 1024 components it is harder to extract meaningful similarities.
3. Matching the hidden representations of the teacher and the student 1:1 is not supported by theory.
4. Objective: extract the feature maps and introduce a regressor between the networks to facilitate the matching process.
5. The regressor is trainable and should ideally do a better job than our naive cosine loss minimization.
6. Its main job is to match the dimensionality, so that a loss function between the teacher and the student is properly defined.
7. Defining a loss function provides a "path" for back-propagating gradients that change the student's weights.
8. Focus on the output of the Conv layers right before the classifier of each original network; the next cell prints their shapes.



In [None]:
# Pass the sample input through the convolutional feature extractor only
convolutional_fe_output_student = student.features(sample_input)
convolutional_fe_output_teacher = teacher.features(sample_input)

In [None]:
print(f"Student feature extractor output shape: {convolutional_fe_output_student.shape}")
print(f"Teacher feature extractor output shape: {convolutional_fe_output_teacher.shape}")

We have 32 filters for the teacher and 16 filters for the student.
We will include a trainable layer that converts the feature map of the student to the shape of the feature map of the teacher.
In practice, we modify the lightweight class to return the hidden state after an intermediate regressor that matches the sizes of the Conv feature maps, and the teacher class to return the output of its convolutional feature extractor without flattening it.
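
To make the shape matching concrete, here is a minimal sketch with random stand-in feature maps. It assumes 32 x 32 inputs and two 2 x 2 max-pooling stages, so both maps are 8 x 8 spatially; the batch size of 4 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size = 4  # arbitrary, for illustration only

# Stand-ins for the feature extractor outputs: 16 channels (student) vs 32 channels (teacher)
student_map = torch.randn(batch_size, 16, 8, 8)
teacher_map = torch.randn(batch_size, 32, 8, 8)

# A single conv layer lifts the student map from 16 to 32 channels;
# kernel_size=3 with padding=1 leaves the 8 x 8 spatial size unchanged
regressor = nn.Conv2d(16, 32, kernel_size=3, padding=1)
regressor_output = regressor(student_map)

print(f"Student map:      {tuple(student_map.shape)}")
print(f"Regressor output: {tuple(regressor_output.shape)}")
print(f"Teacher map:      {tuple(teacher_map.shape)}")

# With identical shapes, an element-wise MSE between the two tensors is well defined
mse = nn.MSELoss()(regressor_output, teacher_map)
print(f"MSE: {mse.item():.4f}")
```

Without the regressor, the two maps differ in their channel dimension (16 vs 32) and cannot be compared element by element.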

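Note that the feature maps used for the loss are taken after the final `MaxPool2d` layer of each feature extractor (for the student, they then pass through the regressor).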

The trainable layer matches the shapes of intermediate tensors and MSE is properly defined:


In [None]:
class ModifiedDeepNNRegressor(nn.Module):
  def __init__(self, num_classes=10):
    super(ModifiedDeepNNRegressor, self).__init__()
    self.features = nn.Sequential(
        nn.Conv2d(3, 128, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(128,64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(64,64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(64,32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        )
    self.classifier = nn.Sequential(  # other examples may use a different architecture here
        nn.Linear(2048, 512),
        nn.ReLU(),
        nn.Dropout(0.1),
        nn.Linear(512, num_classes)
    )

  def forward(self, x):
    x = self.features(x)
    conv_features_maps = x  # keep the output of the convolutional feature extractor
    x = torch.flatten(x,1)  # flatten the feature maps before feeding them to the classifier
    x = self.classifier(x)  # compute the class logits

    return x, conv_features_maps

In [None]:
class ModifiedLightNNRegressor(nn.Module):
  def __init__(self, num_classes=10):
    super(ModifiedLightNNRegressor, self).__init__()
    self.features = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(16,16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2)
    )
    # Include an extra regressor - in this case a single conv layer with no activation (i.e. linear)
    self.regressor = nn.Sequential(
        nn.Conv2d(16,32, kernel_size=3, padding=1)
    )
    self.classifier = nn.Sequential(
        nn.Linear(1024, 256),
        nn.ReLU(),
        nn.Dropout(0.1),  # the dropout rate is also a hyperparameter
        nn.Linear(256, num_classes)
    )

  def forward(self, x):
    x = self.features(x)
    regressor_output = self.regressor(x)  # trainable layer that converts the student's feature maps to the shape of the teacher's
                                          # (this is the hidden state returned for the feature-map loss)
    x = torch.flatten(x,1)
    x = self.classifier(x)

    return x, regressor_output

After that, we have to update our train loop again. This time, we extract the regressor output of the student and the feature map of the teacher, calculate the **MSE** on these tensors (they have the exact same shape, so it is properly defined), and back-propagate gradients based on that loss, in addition to the regular cross-entropy loss of the classification task.

In [None]:
def train_mse_loss(teacher, student, train_loader, epochs, learning_rate, feature_map_weight, ce_loss_weight, device):
  ce_loss = nn.CrossEntropyLoss()  # classification loss on the true labels
  mse_loss = nn.MSELoss()  # feature-map matching loss between regressor output and teacher
  optimizer = optim.Adam(student.parameters(), lr=learning_rate)

  teacher.to(device)
  student.to(device)
  teacher.eval()  # evaluation mode
  student.train()  # student to train mode

  for epoch in range(epochs):
    running_loss = 0.0
    for inputs, labels in train_loader:
      inputs, labels = inputs.to(device), labels.to(device)

      optimizer.zero_grad()

      # again ignore the teacher logits
      with torch.no_grad():
        _, teacher_feature_map = teacher(inputs)  # get the feature maps from the teacher - not the logits

      # Forward pass with the student model.
      student_logits, regressor_feature_map = student(inputs)  # get the logits and feature maps from student

      # Calculate the loss
      hidden_rep_loss = mse_loss(regressor_feature_map, teacher_feature_map)

      # Calculate the true label loss
      label_loss = ce_loss(student_logits, labels)

      # weighted sum of the two losses
      loss = feature_map_weight * hidden_rep_loss + ce_loss_weight * label_loss  # the weights set the relative contribution of each loss

      loss.backward()
      optimizer.step()

      running_loss += loss.item()

    print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")


In [None]:
# Initialize the ModifiedLightNNRegressor
modified_student_reg = ModifiedLightNNRegressor(num_classes=10).to(device)

In [None]:
# No need to train the modified deep network from scratch: load the weights and biases from the trained teacher's state_dict
modified_teacher_reg = ModifiedDeepNNRegressor(num_classes=10).to(device)
modified_teacher_reg.load_state_dict(teacher.state_dict())
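
Loading the weights this way works because the modified class keeps the same layer names and shapes and only changes what `forward` returns. The sketch below checks this idea on two tiny hypothetical classes (`TinyNet` and `TinyNetWithFeatures` are illustrative stand-ins, not the tutorial's networks):

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical small network that returns only logits."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU())
        self.classifier = nn.Linear(4 * 8 * 8, 10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))


class TinyNetWithFeatures(TinyNet):
    """Same layers, but forward also returns the feature map."""
    def forward(self, x):
        feature_map = self.features(x)
        logits = self.classifier(torch.flatten(feature_map, 1))
        return logits, feature_map


torch.manual_seed(0)
original = TinyNet().eval()
modified = TinyNetWithFeatures().eval()

# The state_dict keys match, so the weights transfer without retraining
modified.load_state_dict(original.state_dict())

x = torch.randn(2, 3, 8, 8)
with torch.no_grad():
    logits_original = original(x)
    logits_modified, feature_map = modified(x)

same_logits = torch.allclose(logits_original, logits_modified)
print(f"Logits identical after loading the state_dict: {same_logits}")
```

A similar check could be run on the real teacher and `modified_teacher_reg` before starting distillation.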

In [None]:
# Train and test once again
train_mse_loss(teacher=modified_teacher_reg,
               student=modified_student_reg,
               train_loader=train_loader,
               epochs=10,
               learning_rate=0.001,
               feature_map_weight=0.25,
               ce_loss_weight=0.75,
               device=device)

In [None]:
test_accuracy_light_ce_and_mse_loss = test_multiple_outputs(modified_student_reg, test_loader, device)

The final method is expected to work better than **CosineLoss** because we have now allowed a trainable layer between the teacher and the student, which gives the student some wiggle room when it comes to learning, rather than pushing the student to copy the teacher's representation.
Including the extra network is the idea behind hint-based distillation.

In [None]:
print(f"Teacher accuracy: {test_accuracy_teacher:.2f}%")
print(f"Student accuracy without teacher: {test_accuracy_student:.2f}%")
print(f"Student accuracy with CE + KD: {test_accuracy_light_ce_and_kd:.2f}%")
print(f"Student accuracy with CE + CosineLoss: {test_accuracy_light_ce_and_cosine_loss:.2f}%")
print(f"Student accuracy with CE + RegressorMSE: {test_accuracy_light_ce_and_mse_loss:.2f}%")

## Conclusion

None of the methods above increases the number of parameters of the network or its inference time.
The performance increase therefore comes at the small cost of calculating gradients during training.
In ML applications, we mostly care about inference time because training happens before model deployment.
If your lightweight model is still too heavy for deployment, you can apply other ideas, such as post-training quantization.
Additional losses can be applied in many tasks, not just classification, and you can experiment with quantities like coefficients, temperature, or number of neurons.
Feel free to tune any of the numbers in this tutorial.
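
One way to sanity-check the parameter-count claim is to count trainable parameters per component. The sketch below rebuilds the three parts of `ModifiedLightNNRegressor` layer for layer; `count_parameters` is a hypothetical helper. It treats the regressor as a training-only component, which is an assumption based on the forward pass above, where the classifier never consumes the regressor output.

```python
import torch.nn as nn


def count_parameters(module: nn.Module) -> int:
    """Return the number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Layer-for-layer copies of the three parts of ModifiedLightNNRegressor
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
regressor = nn.Sequential(nn.Conv2d(16, 32, kernel_size=3, padding=1))
classifier = nn.Sequential(
    nn.Linear(1024, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 10),
)

# Parameters that produce the logits (needed at inference)
inference_params = count_parameters(features) + count_parameters(classifier)
# Parameters that only feed the feature-map loss (needed during training)
training_only_params = count_parameters(regressor)

print(f"Parameters used for prediction: {inference_params:,}")
print(f"Extra regressor parameters:     {training_only_params:,}")
```

The regressor adds a small number of weights during training, while the layers that actually produce the prediction are unchanged.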