# Transfer Learning in PyTorch

This lecture will walk through the steps for fine-tuning a pre-trained CNN (e.g., AlexNet) in PyTorch. We'll explore:

1. **Loading a pre-trained model**  
2. **Inspecting trainable parameters**  
3. **Adjusting layers: freezing and unfreezing**  
4. **Replacing the output (classification) layer**  
5. **Replacing the entire (classification) head**  


---

**Loading a pre-trained model**

PyTorch makes it easy to load pre-trained models from `torchvision.models`. 

Let's load AlexNet, a CNN that was one of the early deep learning models to perform well on ImageNet.

In [1]:
import torch
import torch.nn as nn
from torchvision import models


In [2]:
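# Load AlexNet with pre-trained ImageNet weights
# (torchvision >= 0.13; older versions use models.alexnet(pretrained=True), which is now deprecated)
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)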

Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /home/gustaf/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100.0%


The AlexNet architecture consists of convolutional layers (for feature extraction) followed by a fully connected (FC) classification head. 

The final FC layer outputs predictions for the 1,000 ImageNet classes.

In [3]:
# note that all layers are shown in order below. The first layers are printed first, and the last ones last.

print(model)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

---

**Inspecting trainable parameters**

In PyTorch, each parameter tensor (a layer's weights or biases) has a `.requires_grad` attribute that determines whether gradients are computed for it, and therefore whether it can be updated during training.
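The effect of `requires_grad` can be seen on a tiny layer. The sketch below is an illustration only (the `nn.Linear(4, 2)` layer and the random input are assumptions, not part of the lecture model): the bias is frozen, so no gradient is stored for it after the backward pass.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
layer.bias.requires_grad = False  # freeze only the bias of this layer

# Any scalar computed from the layer's output works as a stand-in loss
loss = layer(torch.randn(3, 4)).sum()
loss.backward()

print(layer.weight.grad is not None)  # True: the weight received a gradient
print(layer.bias.grad is None)        # True: the frozen bias did not
```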


In [5]:
for name, param in model.named_parameters():
    print(f'Requires grad: {param.requires_grad} | Parameter name: {name}')

Requires grad: True | Parameter name: features.0.weight
Requires grad: True | Parameter name: features.0.bias
Requires grad: True | Parameter name: features.3.weight
Requires grad: True | Parameter name: features.3.bias
Requires grad: True | Parameter name: features.6.weight
Requires grad: True | Parameter name: features.6.bias
Requires grad: True | Parameter name: features.8.weight
Requires grad: True | Parameter name: features.8.bias
Requires grad: True | Parameter name: features.10.weight
Requires grad: True | Parameter name: features.10.bias
Requires grad: True | Parameter name: classifier.1.weight
Requires grad: True | Parameter name: classifier.1.bias
Requires grad: True | Parameter name: classifier.4.weight
Requires grad: True | Parameter name: classifier.4.bias
Requires grad: True | Parameter name: classifier.6.weight
Requires grad: True | Parameter name: classifier.6.bias


Let's create a function to **inspect the trainable parameters** and calculate:
- The total number of parameters.
- The number and percentage of trainable parameters.

In [6]:
def count_parameters(model):

    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total Parameters: {total_params:,}")
    print(f"Trainable Parameters: {trainable_params:,}")
    print(f"Percentage Trainable: {100 * trainable_params / total_params:.2f}%")


count_parameters(model)

Total Parameters: 61,100,840
Trainable Parameters: 61,100,840
Percentage Trainable: 100.00%


In [17]:
# own version
import torchinfo
print(str(torchinfo.summary(model, input_size=(1, 3, 224, 224), col_names=["input_size", "output_size", "num_params", "params_percent", "kernel_size", "trainable",])))
print(f"Percentage Trainable: {100 * sum(p.numel() for p in model.parameters() if p.requires_grad) / sum(p.numel() for p in model.parameters()):.2f}%")

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Param %                   Kernel Shape              Trainable
AlexNet                                  [1, 3, 224, 224]          [1, 1000]                 --                             --                   --                        True
├─Sequential: 1-1                        [1, 3, 224, 224]          [1, 256, 6, 6]            --                             --                   --                        True
│    └─Conv2d: 2-1                       [1, 3, 224, 224]          [1, 64, 55, 55]           23,296                      0.04%                   [11, 11]                  True
│    └─ReLU: 2-2                         [1, 64, 55, 55]           [1, 64, 55, 55]           --                             --                   --                        --
│    └─MaxPool2d: 2-3                    [1, 64, 55, 55]           [1, 64, 27, 27]           --                      

As seen, all parameters are initially trainable.

---

**Adjusting layers, freezing and unfreezing**

To **freeze a layer**, set `.requires_grad = False` on each of its parameters. No gradients are then computed for those parameters, so the optimizer leaves the layer's weights unchanged during training. Freezing is typically applied to the lower (convolutional) layers, which extract generic features, while the classification head is left trainable.

Let's freeze all layers.

In [18]:
for name, param in model.named_parameters():      # loops over all parameter tensors (weights and biases)
    param.requires_grad = False                   # freezes that parameter

# Check trainable parameters again
count_parameters(model)

Total Parameters: 61,100,840
Trainable Parameters: 0
Percentage Trainable: 0.00%


In [19]:
# double check requires grad status of layers

for name, param in model.named_parameters():
    print(f'Requires grad: {param.requires_grad} | Parameter name: {name}')

Requires grad: False | Parameter name: features.0.weight
Requires grad: False | Parameter name: features.0.bias
Requires grad: False | Parameter name: features.3.weight
Requires grad: False | Parameter name: features.3.bias
Requires grad: False | Parameter name: features.6.weight
Requires grad: False | Parameter name: features.6.bias
Requires grad: False | Parameter name: features.8.weight
Requires grad: False | Parameter name: features.8.bias
Requires grad: False | Parameter name: features.10.weight
Requires grad: False | Parameter name: features.10.bias
Requires grad: False | Parameter name: classifier.1.weight
Requires grad: False | Parameter name: classifier.1.bias
Requires grad: False | Parameter name: classifier.4.weight
Requires grad: False | Parameter name: classifier.4.bias
Requires grad: False | Parameter name: classifier.6.weight
Requires grad: False | Parameter name: classifier.6.bias


**Unfreezing Specific Layers**

If your task requires fine-tuning certain layers, you can selectively unfreeze them.

In [20]:
# Unfreeze only the last classifier layer (the output layer) 
for name, param in model.named_parameters():
    if "classifier.6" in name:                
        param.requires_grad = True

# Check trainable parameters again
count_parameters(model)

Total Parameters: 61,100,840
Trainable Parameters: 4,097,000
Percentage Trainable: 6.71%


In [21]:
for name, param in model.named_parameters():
    print(f'Requires grad: {param.requires_grad} | Parameter name: {name}')

Requires grad: False | Parameter name: features.0.weight
Requires grad: False | Parameter name: features.0.bias
Requires grad: False | Parameter name: features.3.weight
Requires grad: False | Parameter name: features.3.bias
Requires grad: False | Parameter name: features.6.weight
Requires grad: False | Parameter name: features.6.bias
Requires grad: False | Parameter name: features.8.weight
Requires grad: False | Parameter name: features.8.bias
Requires grad: False | Parameter name: features.10.weight
Requires grad: False | Parameter name: features.10.bias
Requires grad: False | Parameter name: classifier.1.weight
Requires grad: False | Parameter name: classifier.1.bias
Requires grad: False | Parameter name: classifier.4.weight
Requires grad: False | Parameter name: classifier.4.bias
Requires grad: True | Parameter name: classifier.6.weight
Requires grad: True | Parameter name: classifier.6.bias


---

**Replacing the Classification Layer**

The classification layer of AlexNet is accessed as `model.classifier[6]`. It is a fully connected (FC) layer with 1,000 outputs, one per ImageNet class.

For a new task (e.g., 10-class classification), replace this with a new FC layer.

In [22]:
num_features = model.classifier[6].in_features     # get the number of input features of the current classification layer
model.classifier[6] = nn.Linear(num_features, 10)  # replace it with a new classification layer with 10 outputs

# print the updated model and check that the last layer's output size is now 10 (previously 1,000)
print(model)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

In [24]:
# count current trainable parameters
count_parameters(model)

Total Parameters: 57,044,810
Trainable Parameters: 40,970
Percentage Trainable: 0.07%


In [25]:
for name, param in model.named_parameters():
    print(f'Requires grad: {param.requires_grad} | Parameter name: {name}')

Requires grad: False | Parameter name: features.0.weight
Requires grad: False | Parameter name: features.0.bias
Requires grad: False | Parameter name: features.3.weight
Requires grad: False | Parameter name: features.3.bias
Requires grad: False | Parameter name: features.6.weight
Requires grad: False | Parameter name: features.6.bias
Requires grad: False | Parameter name: features.8.weight
Requires grad: False | Parameter name: features.8.bias
Requires grad: False | Parameter name: features.10.weight
Requires grad: False | Parameter name: features.10.bias
Requires grad: False | Parameter name: classifier.1.weight
Requires grad: False | Parameter name: classifier.1.bias
Requires grad: False | Parameter name: classifier.4.weight
Requires grad: False | Parameter name: classifier.4.bias
Requires grad: True | Parameter name: classifier.6.weight
Requires grad: True | Parameter name: classifier.6.bias


We've now replaced the original classification (output) layer with a custom layer. The new layer is initialized with random parameters and is trainable by default, since freshly created `nn.Linear` parameters have `requires_grad=True`.

---

**Replacing the Entire Classification Head**

For more flexibility, you can replace the **entire classification head**, not just the last layer. In AlexNet, the head typically starts after the feature extractor (convolutional layers).

Note also that the whole classification head is encapsulated in `model.classifier` as an `nn.Sequential` container.

We can define our own nn.Sequential head and replace the built-in one with it.


In [26]:
# example of nn.Sequential syntax

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()

        self.model = nn.Sequential(
                                    # Layer 1: Conv -> ReLU -> Pool
                                    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1), 
                                    nn.ReLU(),
                                    nn.MaxPool2d(kernel_size=2, stride=2), 

                                    # Layer 2: Conv -> ReLU -> Pool
                                    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1),
                                    nn.ReLU(),
                                    nn.MaxPool2d(kernel_size=2, stride=2), 

                                    # Layer 3: Conv -> ReLU -> Pool
                                    nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1),
                                    nn.ReLU(),
                                    nn.MaxPool2d(kernel_size=2, stride=2), 

                                    # Flatten
                                    nn.Flatten(),

                                    # One-layer classification head
                                    nn.Linear(64 * 4 * 4, num_classes)  # Assuming input size 32x32
                                )

    def forward(self, x):
        return self.model(x)
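The `64 * 4 * 4` in the final `nn.Linear` follows from the input size: each of the three pooling layers halves the spatial resolution, so 32x32 becomes 4x4 with 64 channels. The sketch below rebuilds only the convolutional part of the example (as a standalone `conv_stack`, an illustrative name) and checks this with a dummy input.

```python
import torch
import torch.nn as nn

# Convolutional part of the SimpleCNN example, without Flatten and the Linear head
conv_stack = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
)

with torch.no_grad():
    feature_map = conv_stack(torch.randn(1, 3, 32, 32))  # one fake 32x32 RGB image

print(feature_map.shape)  # torch.Size([1, 64, 4, 4])

flat_features = feature_map.flatten(start_dim=1).shape[1]
print(flat_features)  # 1024, i.e. 64 * 4 * 4
```

A different input size would give a different flattened length, and the `nn.Linear` input size would have to change accordingly.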

In [27]:
# the whole classifier is encapsulated in model.classifier

model.classifier

Sequential(
  (0): Dropout(p=0.5, inplace=False)
  (1): Linear(in_features=9216, out_features=4096, bias=True)
  (2): ReLU(inplace=True)
  (3): Dropout(p=0.5, inplace=False)
  (4): Linear(in_features=4096, out_features=4096, bias=True)
  (5): ReLU(inplace=True)
  (6): Linear(in_features=4096, out_features=10, bias=True)
)

In [28]:
# The first Linear layer in the classifier head expects 9216 in_features (256 channels * 6 * 6 after avgpool),
# so the first layer of our new head must accept the same number.

head_in_features = 9216

In [29]:
# Define a custom sequence and replace the entire classification head with it.
# Note that the architecture of our custom sequence is arbitrary and up to us to decide.
# The only constraints are the 9216 in_features of the first layer and the 10 out_features
# of the last layer (which corresponds to the problem we want to solve).


model.classifier = nn.Sequential(
                                 nn.Linear(head_in_features, 512),  # First FC layer
                                 nn.ReLU(),                         
                                 nn.Dropout(0.5),                   
                                 nn.Linear(512, 10)                 # Output layer
                                )

print(model)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Linear(in_features=9216, out_features=512, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.5, i

In [30]:
# Check parameters again after adding a new head
count_parameters(model)

Total Parameters: 7,193,930
Trainable Parameters: 4,724,234
Percentage Trainable: 65.67%


In [31]:
for name, param in model.named_parameters():
    print(f'Requires grad: {param.requires_grad} | Parameter name: {name}')

Requires grad: False | Parameter name: features.0.weight
Requires grad: False | Parameter name: features.0.bias
Requires grad: False | Parameter name: features.3.weight
Requires grad: False | Parameter name: features.3.bias
Requires grad: False | Parameter name: features.6.weight
Requires grad: False | Parameter name: features.6.bias
Requires grad: False | Parameter name: features.8.weight
Requires grad: False | Parameter name: features.8.bias
Requires grad: False | Parameter name: features.10.weight
Requires grad: False | Parameter name: features.10.bias
Requires grad: True | Parameter name: classifier.0.weight
Requires grad: True | Parameter name: classifier.0.bias
Requires grad: True | Parameter name: classifier.3.weight
Requires grad: True | Parameter name: classifier.3.bias


Maybe we'd like to unfreeze the last conv layer as well?

In [32]:
# Additionally unfreeze the last convolutional layer (features.10)
for name, param in model.named_parameters():
    if "features.10" in name:                
        param.requires_grad = True

# Check trainable parameters again
count_parameters(model)

Total Parameters: 7,193,930
Trainable Parameters: 5,314,314
Percentage Trainable: 73.87%


In [33]:
for name, param in model.named_parameters():
    print(f'Requires grad: {param.requires_grad} | Parameter name: {name}')

Requires grad: False | Parameter name: features.0.weight
Requires grad: False | Parameter name: features.0.bias
Requires grad: False | Parameter name: features.3.weight
Requires grad: False | Parameter name: features.3.bias
Requires grad: False | Parameter name: features.6.weight
Requires grad: False | Parameter name: features.6.bias
Requires grad: False | Parameter name: features.8.weight
Requires grad: False | Parameter name: features.8.bias
Requires grad: True | Parameter name: features.10.weight
Requires grad: True | Parameter name: features.10.bias
Requires grad: True | Parameter name: classifier.0.weight
Requires grad: True | Parameter name: classifier.0.bias
Requires grad: True | Parameter name: classifier.3.weight
Requires grad: True | Parameter name: classifier.3.bias


---

## Next Steps: Training the Model

Using our model for training is straightforward. It is stored in the variable `model` and can be passed to a training loop at any point, usually once we're satisfied with the freezing/unfreezing of layers and any replacement of the classification head/layers.

**Note**: 

Pre-trained models generally perform best when inputs match the preprocessing used during pre-training, so you may need to resize your images to the model's expected input size (often 224x224). 

This is easily done with `torchvision.transforms`, for example `transforms.Resize`.