<img style="max-width:20em; height:auto;" src="../graphics/A-Little-Book-on-Adversarial-AI-Cover.png"/>

Author: Nik Alleyne   
Author Blog: https://www.securitynik.com   
Author GitHub: github.com/securitynik   

Author Other Books: [   

            "https://www.amazon.ca/Learning-Practicing-Leveraging-Practical-Detection/dp/1731254458/",   
            
            "https://www.amazon.ca/Learning-Practicing-Mastering-Network-Forensics/dp/1775383024/"   
        ]   


This notebook ***(understanding_transfer_learning.ipynb)*** is part of the series of notebooks from ***A Little Book on Adversarial AI***, a free ebook released by Nik Alleyne.

### Understanding Transfer Learning  

### Lab Objectives:   
- Understanding Transfer Learning  
- Creating a classifier from transfer learning 
- Understand what freezing the model's parameters means 

  
### Step 1:  
Get the model 

In [1]:
# import libraries   
from torchvision.datasets import MNIST
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
import torchinfo

In [2]:
### Version of key libraries used  
print(f'Torch version used:  {torch.__version__}')
print(f'Torchinfo version used:  {torchinfo.__version__}')


Torch version used:  2.7.1+cu128
Torchinfo version used:  1.8.0


In [3]:
# Set up the device to work with
# If an accelerator is available, such as CUDA or the Apple MPS backend,
# we take advantage of it; otherwise we fall back to the CPU.

if torch.cuda.is_available():
    print('Setting the device to cuda')
    device = 'cuda'
elif torch.backends.mps.is_available():
    print('Setting the device to Apple mps')
    device = 'mps'
else:
    print('Setting the device to CPU')
    device = 'cpu'

Setting the device to cuda


In [4]:
# Load up a pretrained model
# In this case, we will use the resnet18
# https://pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html 
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).to(device)

# Looking at the model summary
torchinfo.summary(model=model)

Layer (type:depth-idx)                   Param #
ResNet                                   --
├─Conv2d: 1-1                            9,408
├─BatchNorm2d: 1-2                       128
├─ReLU: 1-3                              --
├─MaxPool2d: 1-4                         --
├─Sequential: 1-5                        --
│    └─BasicBlock: 2-1                   --
│    │    └─Conv2d: 3-1                  36,864
│    │    └─BatchNorm2d: 3-2             128
│    │    └─ReLU: 3-3                    --
│    │    └─Conv2d: 3-4                  36,864
│    │    └─BatchNorm2d: 3-5             128
│    └─BasicBlock: 2-2                   --
│    │    └─Conv2d: 3-6                  36,864
│    │    └─BatchNorm2d: 3-7             128
│    │    └─ReLU: 3-8                    --
│    │    └─Conv2d: 3-9                  36,864
│    │    └─BatchNorm2d: 3-10            128
├─Sequential: 1-6                        --
│    └─BasicBlock: 2-3                   --
│    │    └─Conv2d: 3-11                 73,728

In [5]:
# Take a closer look at the layers
model

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

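The model needs to be switched to **eval** mode. This matters because, as can be seen above, the model has **BatchNorm2d** layers, and BatchNorm2d behaves differently at training time than at test time. In training mode it normalizes with the statistics of the current batch and updates its running statistics. In eval mode it normalizes with the stored running statistics.  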
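To make the train versus eval difference concrete, here is a minimal sketch using a stand-alone `BatchNorm2d` layer. The layer, the input tensor and the variable names are assumptions made for illustration and are not taken from ResNet18.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-alone BatchNorm2d layer with 3 channels (hypothetical, not part of ResNet18)
bn = nn.BatchNorm2d(num_features=3)

# A small batch whose statistics are far from the initial running stats (mean 0, var 1)
x = torch.randn(4, 3, 8, 8) * 5 + 2

# Training mode: normalize with the statistics of this batch and update the running stats
bn.train()
out_train = bn(x)
running_mean_after_train = bn.running_mean.clone()

# Eval mode: normalize with the stored running stats, which are left untouched
bn.eval()
out_eval = bn(x)

print(f'Mean of output in train mode: {out_train.mean().item():.4f}')
print(f'Mean of output in eval mode:  {out_eval.mean().item():.4f}')
print(f'Outputs identical? {torch.allclose(out_train, out_eval)}')
print(f'Running mean changed in eval mode? {not torch.equal(running_mean_after_train, bn.running_mean)}')
```

The same input produces different outputs in the two modes, which is why the mode should be set deliberately before running the model.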

In [6]:
# Putting the model in eval mode
model.eval()

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

### Step 2:  
Review the parameters of *model.conv1.weight*. Also verify that it is trainable via its **requires_grad** attribute. If it is *True*, the parameters can be updated during training.  

In [7]:
# Get the layer information
print(f'model.conv1.weight state BEFORE training. Trainable?: {model.conv1.weight.requires_grad}')

# Now that we have the model, let's look at the first layer and its parameters
# Take the first filter of the first layer and look at its weights
print(f'model.conv1.weight parameters BEFORE training: \n{model.conv1.weight[0, 0, :, :]}')

# Let us also capture this as a variable
conv1_weights_before = model.conv1.weight[0, 0, :, :].detach().clone()

model.conv1.weight state BEFORE training. Trainable?: True
model.conv1.weight parameters BEFORE training: 
tensor([[-0.0104, -0.0061, -0.0018,  0.0748,  0.0566,  0.0171, -0.0127],
        [ 0.0111,  0.0095, -0.1099, -0.2805, -0.2712, -0.1291,  0.0037],
        [-0.0069,  0.0591,  0.2955,  0.5872,  0.5197,  0.2563,  0.0636],
        [ 0.0305, -0.0670, -0.2984, -0.4387, -0.2709, -0.0006,  0.0576],
        [-0.0275,  0.0160,  0.0726, -0.0541, -0.3328, -0.4206, -0.2578],
        [ 0.0306,  0.0410,  0.0628,  0.2390,  0.4138,  0.3936,  0.1661],
        [-0.0137, -0.0037, -0.0241, -0.0659, -0.1507, -0.0822, -0.0058]],
       device='cuda:0', grad_fn=<SliceBackward0>)


Let us take the *conv2* layer in the first *BasicBlock* of *layer2* and grab its parameters. Ultimately, we want to freeze all layers except this one. It was chosen arbitrarily, purely to illustrate transfer learning.   

In [8]:
print(f'model.layer2[0].conv2.weight state BEFORE training. Trainable?: {model.layer2[0].conv2.weight.requires_grad}')

print(f'model.layer2[0].conv2.weight BEFORE training: \n{model.layer2[0].conv2.weight[0, 0, :, :]}')

# Capture this as a variable to compare with the trained version later
conv2_weights_before = model.layer2[0].conv2.weight[0, 0, :, :].detach().clone()


model.layer2[0].conv2.weight state BEFORE training. Trainable?: True
model.layer2[0].conv2.weight BEFORE training: 
tensor([[-0.0074, -0.0098,  0.0028],
        [-0.0108,  0.0258,  0.0455],
        [-0.0272,  0.0053,  0.0132]], device='cuda:0',
       grad_fn=<SliceBackward0>)


What we want during transfer learning is to freeze these weights, so that we reuse the features learned during the original training process. 

Freezing means that when we train the network, these parameters (weights and biases) are not updated. In fine tuning, by contrast, we would retrain some of these weights.  

Let's keep it simple and stick with transfer learning. We will freeze all pretrained layers except model.layer2[0].conv2.weight. The new final layer we create later is trainable by default. 

When this is finished, we should see that the weights of model.conv1 remain the same, while the weights of model.layer2[0].conv2 change.

To freeze the layers, we set **requires_grad** to **False**.

In [9]:
# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

In [10]:
# Confirm that gradient is no longer required on the two layers we targeted above
print(f'Layer: model.conv1.weight - AFTER FREEZING but BEFORE training. Trainable?: {model.conv1.weight.requires_grad}')
print(f'Layer: model.layer2[0].conv2.weight AFTER FREEZING but BEFORE training. Trainable?: {model.layer2[0].conv2.weight.requires_grad}')

Layer: model.conv1.weight - AFTER FREEZING but BEFORE training. Trainable?: False
Layer: model.layer2[0].conv2.weight AFTER FREEZING but BEFORE training. Trainable?: False


The output above confirms that setting .requires_grad = False froze the layers, making them untrainable. 
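Another way to check the effect of freezing is to count trainable and frozen parameters. The sketch below uses a hypothetical two-convolution network (`tiny_model`) and a hypothetical helper (`count_parameters`) rather than ResNet18, so it runs without downloading any weights.

```python
import torch.nn as nn

# Hypothetical small network standing in for a pretrained backbone
tiny_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, bias=False),   # 3 * 8 * 3 * 3 = 216 parameters
    nn.ReLU(),
    nn.Conv2d(8, 4, kernel_size=3, bias=False),   # 8 * 4 * 3 * 3 = 288 parameters
)

def count_parameters(module):
    """Return (trainable, frozen) parameter counts for a module."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in module.parameters() if not p.requires_grad)
    return trainable, frozen

before = count_parameters(tiny_model)

# Freeze everything, in the same way as the notebook cell above
for param in tiny_model.parameters():
    param.requires_grad = False
after_freeze = count_parameters(tiny_model)

# Unfreeze only the second convolution
tiny_model[2].weight.requires_grad = True
after_unfreeze = count_parameters(tiny_model)

print(f'Before freezing (trainable, frozen): {before}')
print(f'After freezing (trainable, frozen):  {after_freeze}')
print(f'After unfreezing one layer:          {after_unfreeze}')
```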

Let us now unfreeze model.layer2[0].conv2.weight to see the updates.   

In [11]:
# Unfreezing the model.layer2[0].conv2 layer
model.layer2[0].conv2.weight.requires_grad = True

print(f'model.layer2[0].conv2.weight AFTER FREEZING but BEFORE training: {model.layer2[0].conv2.weight.requires_grad}')

model.layer2[0].conv2.weight AFTER FREEZING but BEFORE training: True


In [12]:
# Let's capture the information from the last layer of the network
print(f'Last layer of the network BEFORE modification: \n{model.fc}')

# Capture the number of features coming into the final layer
final_layer_in_features = model.fc.in_features

Last layer of the network BEFORE modification: 
Linear(in_features=512, out_features=1000, bias=True)


In [13]:
# Seed Torch's random number generator for reproducibility
torch.manual_seed(seed=10)

# Let us create our own final layer
# Replace the output layer with 1 neuron
model.fc = nn.Linear(in_features=final_layer_in_features, out_features=1, bias=True)

# Revisit the final layer of the network to confirm our model architecture has been updated
print(f'Last layer of the network AFTER modification: \n{model.fc}')

Last layer of the network AFTER modification: 
Linear(in_features=512, out_features=1, bias=True)


The network expects input of shape 3x224x224 (https://pytorch.org/hub/pytorch_vision_resnet/). Let's create six random samples to use. For real images, the pixel values have to be scaled to the range [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. 

We are only concerned with seeing how the parameters change state, so using real images is not important and the normalization step is skipped here. For a real problem, we would swap in a real dataset.
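For reference, the normalization described above can be sketched with plain tensor broadcasting. The `images` batch here is a hypothetical stand-in for real images already scaled to [0, 1]. The notebook itself does not apply this step to its random samples.

```python
import torch

torch.manual_seed(10)

# Channel-wise statistics quoted above, reshaped to broadcast over (N, C, H, W)
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

# Hypothetical batch of two "images" with values already in [0, 1]
images = torch.rand(2, 3, 224, 224)

# Subtract the per-channel mean and divide by the per-channel standard deviation
normalized = (images - mean) / std

print(f'Shape after normalization: {normalized.shape}')
print(f'Per-channel minimum: {normalized.amin(dim=(0, 2, 3))}')
print(f'Per-channel maximum: {normalized.amax(dim=(0, 2, 3))}')
```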

In [14]:
# Seed the random number generator for reproducibility
torch.manual_seed(seed=10)

# Create six random samples 
# Place the samples on the device  
X = torch.randint(low=0, high=2, size=(6,3,224,224), dtype=torch.float, device=device)

# Just a peek at a sample if you are interested
X[0]

tensor([[[0., 1., 1.,  ..., 0., 1., 1.],
         [0., 0., 1.,  ..., 1., 0., 1.],
         [0., 1., 0.,  ..., 0., 1., 0.],
         ...,
         [0., 1., 1.,  ..., 1., 1., 0.],
         [0., 0., 1.,  ..., 0., 1., 1.],
         [1., 1., 1.,  ..., 0., 0., 0.]],

        [[1., 1., 1.,  ..., 1., 0., 0.],
         [1., 1., 1.,  ..., 1., 0., 0.],
         [0., 0., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 0., 0., 1.],
         [0., 0., 0.,  ..., 0., 1., 1.],
         [0., 0., 0.,  ..., 0., 1., 1.]],

        [[1., 1., 0.,  ..., 1., 1., 0.],
         [0., 0., 1.,  ..., 0., 1., 1.],
         [1., 0., 0.,  ..., 0., 1., 1.],
         ...,
         [0., 1., 1.,  ..., 0., 1., 1.],
         [0., 1., 1.,  ..., 1., 0., 0.],
         [1., 0., 1.,  ..., 0., 1., 0.]]], device='cuda:0')

In [15]:
# Let's create a few labels to match those samples
y_truth = torch.tensor(data=[[0], [1], [1], [0], [1], [0]], dtype=torch.float32, device=device)
y_truth

tensor([[0.],
        [1.],
        [1.],
        [0.],
        [1.],
        [0.]], device='cuda:0')

In [16]:
# Set the manual seed 
torch.manual_seed(seed=10)

# Move the model to the device
model = model.to(device)

# Let's prepare to train the model
# First define a loss function
# BCEWithLogitsLoss is used because there is no activation function at the end of the network
loss_fn = nn.BCEWithLogitsLoss()

# Define the optimizer for Gradient Descent
# Setting a relatively high learning rate here just to speed up the learning process
optimizer = torch.optim.SGD(params=model.parameters(), lr=.1)

# Define a number of epochs
num_epochs = 5

# Train the network now for 5 epochs
for epoch in range(num_epochs):
    # Zero out the gradients to prevent them from accumulating
    optimizer.zero_grad(set_to_none=True)

    # calculate the loss on the model predictions vs the ground truth
    loss = loss_fn(input=model(X.to(device)), target=y_truth)
    
    # Perform backpropagation on the loss
    loss.backward()

    # Updates the parameters with gradient descent
    optimizer.step()

    print(f'Epoch: {epoch+1}/{num_epochs} \t loss: {loss}')


Epoch: 1/5 	 loss: 0.7258352041244507
Epoch: 2/5 	 loss: 1.1715543270111084
Epoch: 3/5 	 loss: 4.444774150848389
Epoch: 4/5 	 loss: 3.578293800354004
Epoch: 5/5 	 loss: 3.6656856536865234
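Frozen parameters never receive a gradient during backpropagation, so the optimizer skips them. A common alternative to passing every parameter is to hand the optimizer only those that still require gradients. The sketch below assumes a hypothetical two-layer model (`toy_model`) and random inputs. It is not the notebook's ResNet18.

```python
import torch
import torch.nn as nn

torch.manual_seed(10)

# Hypothetical two-layer model: the first layer is frozen, the second stays trainable
frozen_layer = nn.Linear(4, 4)
trainable_layer = nn.Linear(4, 1)
toy_model = nn.Sequential(frozen_layer, nn.Tanh(), trainable_layer)

for param in frozen_layer.parameters():
    param.requires_grad = False

# Keep copies so we can compare after one update
frozen_before = frozen_layer.weight.detach().clone()
trainable_before = trainable_layer.weight.detach().clone()

# Pass only the parameters that still require gradients to the optimizer
trainable_params = [p for p in toy_model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params=trainable_params, lr=0.1)

inputs = torch.randn(6, 4)
targets = torch.tensor([[0.], [1.], [1.], [0.], [1.], [0.]])

# One training step
optimizer.zero_grad(set_to_none=True)
loss = nn.BCEWithLogitsLoss()(toy_model(inputs), targets)
loss.backward()
optimizer.step()

print(f'Gradient of frozen layer: {frozen_layer.weight.grad}')
print(f'Frozen weights unchanged? {torch.equal(frozen_before, frozen_layer.weight)}')
print(f'Trainable weights unchanged? {torch.equal(trainable_before, trainable_layer.weight)}')
```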


In [17]:
# Get some sample predictions
# Note the sigmoid activation function
# This is needed because the model outputs raw values (logits)
F.sigmoid(model(X))

tensor([[0.0073],
        [0.0079],
        [0.0081],
        [0.0088],
        [0.0073],
        [0.0081]], device='cuda:0', grad_fn=<SigmoidBackward0>)
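The relationship between logits, the sigmoid and the loss can be checked on a few made-up values. In this sketch the `logits` and `labels` tensors are assumptions chosen for illustration. It shows that `BCEWithLogitsLoss` on raw logits matches `BCELoss` on sigmoid probabilities, and how probabilities can be thresholded into class predictions.

```python
import torch
import torch.nn as nn

# Hypothetical raw model outputs (logits) and matching binary labels
logits = torch.tensor([[-2.0], [0.5], [3.0], [-0.1]])
labels = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Loss computed directly on the logits
loss_from_logits = nn.BCEWithLogitsLoss()(logits, labels)

# Same loss computed on probabilities obtained with an explicit sigmoid
probabilities = torch.sigmoid(logits)
loss_from_probs = nn.BCELoss()(probabilities, labels)

# Threshold the probabilities at 0.5 to get class predictions
predicted_classes = (probabilities >= 0.5).float()

print(f'Loss from logits:        {loss_from_logits.item():.6f}')
print(f'Loss from probabilities: {loss_from_probs.item():.6f}')
print(f'Predicted classes: {predicted_classes.flatten().tolist()}')
```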

Now that the model has been trained and we know we can make predictions, it is time to find out if our parameters have been updated.  

In [18]:
# Confirm the layer is still frozen
# Meaning it is still not trainable
print(f'model.conv1.weight AFTER FREEZING and AFTER training: {model.conv1.weight.requires_grad}')

# Capture the variable again
conv1_weights_after_training = model.conv1.weight[0, 0, :, :].detach().clone()

# Print the original weights
print(f'model.conv1.weight: Original weights: \n{conv1_weights_before}')

# print the weights after training
print(f'\nmodel.conv1.weight AFTER FREEZING and AFTER training: \n{model.conv1.weight[0, 0, :, :]}')

model.conv1.weight AFTER FREEZING and AFTER training: False
model.conv1.weight: Original weights: 
tensor([[-0.0104, -0.0061, -0.0018,  0.0748,  0.0566,  0.0171, -0.0127],
        [ 0.0111,  0.0095, -0.1099, -0.2805, -0.2712, -0.1291,  0.0037],
        [-0.0069,  0.0591,  0.2955,  0.5872,  0.5197,  0.2563,  0.0636],
        [ 0.0305, -0.0670, -0.2984, -0.4387, -0.2709, -0.0006,  0.0576],
        [-0.0275,  0.0160,  0.0726, -0.0541, -0.3328, -0.4206, -0.2578],
        [ 0.0306,  0.0410,  0.0628,  0.2390,  0.4138,  0.3936,  0.1661],
        [-0.0137, -0.0037, -0.0241, -0.0659, -0.1507, -0.0822, -0.0058]],
       device='cuda:0')

model.conv1.weight AFTER FREEZING and AFTER training: 
tensor([[-0.0104, -0.0061, -0.0018,  0.0748,  0.0566,  0.0171, -0.0127],
        [ 0.0111,  0.0095, -0.1099, -0.2805, -0.2712, -0.1291,  0.0037],
        [-0.0069,  0.0591,  0.2955,  0.5872,  0.5197,  0.2563,  0.0636],
        [ 0.0305, -0.0670, -0.2984, -0.4387, -0.2709, -0.0006,  0.0576],
        [-0.0275,

In [19]:
# Compare conv1 before and after training
# We expect this to be unchanged, as shown above
(conv1_weights_before == conv1_weights_after_training).all()

tensor(True, device='cuda:0')

In [20]:
# Look at the model.layer2[0].conv2 layer
print(f'model.layer2[0].conv2.weight AFTER FREEZING and AFTER training. Trainable?: {model.layer2[0].conv2.weight.requires_grad}')

# Capture the new weights, now that the model has been trained
conv2_weights_after_training = model.layer2[0].conv2.weight[0, 0, :, :].detach().clone()

print(f'\n Weights before: \n{conv2_weights_before}')

print(f'\nmodel.layer2[0].conv2.weight AFTER FREEZING and AFTER training : \n{model.layer2[0].conv2.weight[0, 0, :, :]}')

model.layer2[0].conv2.weight AFTER FREEZING and AFTER training. Trainable?: True

 Weights before: 
tensor([[-0.0074, -0.0098,  0.0028],
        [-0.0108,  0.0258,  0.0455],
        [-0.0272,  0.0053,  0.0132]], device='cuda:0')

model.layer2[0].conv2.weight AFTER FREEZING and AFTER training : 
tensor([[-0.0080, -0.0106,  0.0022],
        [-0.0117,  0.0250,  0.0443],
        [-0.0277,  0.0047,  0.0124]], device='cuda:0',
       grad_fn=<SliceBackward0>)


In [21]:
# Looking at things from a different perspective.  
print(f'Are the untrained and trained weights equal? \
       \n{(conv2_weights_before == conv2_weights_after_training)}')

Are the untrained and trained weights equal?        
tensor([[False, False, False],
        [False, False, False],
        [False, False, False]], device='cuda:0')
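Beyond an element-wise comparison, it can help to ask whether two tensors are identical as a whole and how large the change is. The sketch below copies the rounded 3x3 values printed above into two tensors. The names `weights_before` and `weights_after` are chosen for illustration.

```python
import torch

# Values copied from the printed 3x3 slices above (rounded to four decimals)
weights_before = torch.tensor([[-0.0074, -0.0098, 0.0028],
                               [-0.0108,  0.0258, 0.0455],
                               [-0.0272,  0.0053, 0.0132]])
weights_after = torch.tensor([[-0.0080, -0.0106, 0.0022],
                              [-0.0117,  0.0250, 0.0443],
                              [-0.0277,  0.0047, 0.0124]])

# True only if shape and every element match exactly
identical = torch.equal(weights_before, weights_after)

# Size of the largest single change
max_abs_change = (weights_after - weights_before).abs().max().item()

print(f'Tensors identical? {identical}')
print(f'Largest absolute change: {max_abs_change:.4f}')
```

On these rounded values the largest change is about 0.0012, which shows that the unfrozen layer moved only slightly over the five epochs.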


### Lab Takeaways:
- We learned about transfer learning
- In transfer learning, we froze the layers of the trained model and modified the final layer
- Alternatively, we could have added additional layers to make a larger *head* if we wanted   
- A larger head gives the new model more capacity, which may make it easier to learn the patterns in the data.   
- How you approach the addition of extra layers will depend on the problem you are solving.
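As a sketch of what a larger *head* might look like, the block below builds a small multi-layer classifier that accepts the 512 features reported for `model.fc` earlier. The hidden size of 128, the dropout rate and the name `larger_head` are arbitrary assumptions. In the notebook it could be assigned with `model.fc = larger_head`.

```python
import torch
import torch.nn as nn

torch.manual_seed(10)

# 512 matches the in_features of the original final layer; other sizes are arbitrary choices
in_features = 512

larger_head = nn.Sequential(
    nn.Linear(in_features, 128),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(128, 1),   # a single output neuron, as in the notebook
)

# Hypothetical batch of six feature vectors standing in for the frozen backbone's output
features = torch.randn(6, in_features)

larger_head.eval()  # disable dropout for this shape check
logits = larger_head(features)

num_head_params = sum(p.numel() for p in larger_head.parameters())

print(f'Output shape: {tuple(logits.shape)}')
print(f'Trainable parameters in the head: {num_head_params}')
```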

#### Fine Tuning  
In fine tuning, we retrain most of the layers in the network at a lower learning rate. We may also introduce learning rate decay, which reduces the learning rate further as training progresses over the epochs. 
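One possible way to set this up, shown as a sketch under assumptions, is to give the pretrained layers a smaller learning rate than the new head and to attach a scheduler that decays both rates each epoch. The two `nn.Linear` layers, the learning rates and the decay factor are hypothetical placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(10)

# Hypothetical stand-ins for a pretrained backbone and a newly added head
backbone = nn.Linear(8, 8)
head = nn.Linear(8, 1)

# Per-parameter-group learning rates: small for the backbone, larger for the head
optimizer = torch.optim.SGD(
    [
        {'params': backbone.parameters(), 'lr': 1e-4},
        {'params': head.parameters(), 'lr': 1e-2},
    ],
    lr=1e-3,
)

# Learning rate decay: multiply every group's learning rate by 0.5 after each epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)

lr_history = []
for epoch in range(3):
    # Record the learning rates in use for this epoch
    lr_history.append(list(scheduler.get_last_lr()))

    # A dummy training step on random data
    optimizer.zero_grad(set_to_none=True)
    loss = head(backbone(torch.randn(4, 8))).pow(2).mean()
    loss.backward()
    optimizer.step()

    # Decay the learning rates at the end of the epoch
    scheduler.step()

for epoch, (backbone_lr, head_lr) in enumerate(lr_history, start=1):
    print(f'Epoch {epoch}: backbone lr = {backbone_lr:.6f}, head lr = {head_lr:.6f}')
```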

Additional Reference:   
- https://www.geeksforgeeks.org/top-pre-trained-models-for-image-classification/