# Transfer Learning for Image Classification

- ###### Transfer learning is a technique in which knowledge gained from one task is reused to solve a similar task.
- ###### For image classification, this usually means taking a model trained on a large, generic dataset (typically millions of images) and fine-tuning it on the specific dataset of interest. The model's learned features adapt quickly to the new data, which can improve both accuracy and training efficiency.

- **Topics covered in this chapter**:
    - Introducing transfer learning
    - Understanding VGG16 and ResNet architectures
    - Implementing facial key point detection
    - Multi-task learning: Implementing age estimation and gender classification
    - Introducing the torch_snippets library

**Transfer Learning High-Level Flow:**

1. **Input Normalization:**
   - Normalize input images using the same mean and standard deviation as used during the training of the pre-trained model.

2. **Fetch Pre-trained Model:**
   - Obtain the architecture of the pre-trained model.
   - Fetch the weights learned by the pre-trained model on a large dataset.

3. **Truncate Pre-trained Model:**
   - Remove the last few layers of the pre-trained model.

4. **Connect to New Layers:**
   - Connect the truncated pre-trained model to a newly initialized layer (or layers).
   - Ensure the output of the last layer matches the number of classes/outputs for prediction.

5. **Freeze Pre-trained Weights:**
   - Make the weights of the pre-trained model non-trainable (frozen) during backpropagation.
   - Train only the weights of the newly initialized layer and those connecting it to the output layer.
     - Rationale: Leverage the well-learned features of the pre-trained model for the task at hand.

6. **Training:**
   - Update the trainable parameters (the weights of the new layers) over successive epochs to fit the model to the specific dataset.
   - Gradually fine-tune the model on the smaller dataset, allowing it to adapt to specific features while retaining the knowledge from the pre-trained model.
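The flow above can be sketched end to end on a toy example. This is a minimal sketch, not the chapter's implementation: the `backbone` below is a tiny, randomly initialized stand-in for a truncated pre-trained model, the batch is random data, and the names `normalize`, `backbone`, and `head` are hypothetical. The mean and standard deviation are the ImageNet channel statistics commonly used with torchvision's pre-trained models.

```python
import torch
import torch.nn as nn

# ImageNet channel statistics commonly used with torchvision's pre-trained models.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def normalize(batch):
    """Step 1: scale a float batch in [0, 1] with the pre-training statistics."""
    return (batch - IMAGENET_MEAN) / IMAGENET_STD

torch.manual_seed(0)

# Steps 2-3: a tiny stand-in for a truncated pre-trained backbone (assumption:
# in practice this would be the feature extractor of a real pre-trained model).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Step 4: a newly initialized head whose output size matches the number of classes.
num_classes = 2
head = nn.Linear(8, num_classes)
model = nn.Sequential(backbone, head)

# Step 5: freeze the backbone so backpropagation leaves its weights untouched.
for param in backbone.parameters():
    param.requires_grad = False

# Step 6: optimize only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.rand(4, 3, 32, 32)              # fake batch with values in [0, 1]
labels = torch.randint(0, num_classes, (4,))   # fake class labels

backbone_before = backbone[0].weight.clone()
head_before = head.weight.clone()

# One training step: forward pass, loss, backward pass, parameter update.
optimizer.zero_grad()
loss = loss_fn(model(normalize(images)), labels)
loss.backward()
optimizer.step()

backbone_unchanged = torch.equal(backbone_before, backbone[0].weight)
head_changed = not torch.equal(head_before, head.weight)
print(backbone_unchanged, head_changed)
```

Passing only the trainable parameters to the optimizer is a design choice rather than a requirement: frozen parameters receive no gradients either way, but filtering them out makes the intent explicit.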

---

# Understanding the VGG16 architecture

- ###### `VGG` stands for `Visual Geometry Group`, a research group at the University of Oxford, and 16 is the number of weight layers in the model (13 convolutional and 3 linear). VGG16 was trained to classify objects for the ImageNet competition, where it was the runner-up architecture in 2014.

In [1]:
# 1. Import the required packages
import torchvision
from torchvision import models, transforms, datasets
import torch
import torch.nn as nn
import torch.nn.functional as f
from torchsummary import summary
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [3]:
# 2. Load the VGG16 model and register the model within the device
model = models.vgg16(weights='IMAGENET1K_V1').to(device)

# 3. Fetch the summary of the model
summary(model, torch.zeros(1, 3, 224, 224))

Layer (type:depth-idx)                   Output Shape              Param #
├─Sequential: 1-1                        [-1, 512, 7, 7]           --
|    └─Conv2d: 2-1                       [-1, 64, 224, 224]        1,792
|    └─ReLU: 2-2                         [-1, 64, 224, 224]        --
|    └─Conv2d: 2-3                       [-1, 64, 224, 224]        36,928
|    └─ReLU: 2-4                         [-1, 64, 224, 224]        --
|    └─MaxPool2d: 2-5                    [-1, 64, 112, 112]        --
|    └─Conv2d: 2-6                       [-1, 128, 112, 112]       73,856
|    └─ReLU: 2-7                         [-1, 128, 112, 112]       --
|    └─Conv2d: 2-8                       [-1, 128, 112, 112]       147,584
|    └─ReLU: 2-9                         [-1, 128, 112, 112]       --
|    └─MaxPool2d: 2-10                   [-1, 128, 56, 56]         --
|    └─Conv2d: 2-11                      [-1, 256, 56, 56]         295,168
|    └─ReLU: 2-12                        [-1, 256, 56, 56]      

- The 16 layers we mentioned are grouped as follows:

    - {1,2},{3,4,5},{6,7},{8,9,10},{11,12},{13,14},{15,16,17},{18,19},{20,21},{22,23,24},{25,26},{27,28},{29,30,31,32},{33,34,35},{36,37,38},{39}



- `Note` that there are ~138 million parameters in this network, of which ~124 million belong to the three linear layers at the end (roughly 102.8 + 16.8 + 4.1 million). The network comprises 13 convolution layers with an increasing number of filters, interleaved with 5 max-pooling layers, followed by 3 linear layers.
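The parameter counts can be checked from the layer shapes alone, without loading the model. The sketch below assumes the standard VGG16 configuration (13 convolution layers with 3x3 kernels and 3 linear layers); the helper names `conv_params` and `linear_params` are hypothetical. Each layer contributes its weights plus one bias per output channel or feature.

```python
# (in_channels, out_channels) of the 13 convolution layers, all with 3x3 kernels.
conv_channels = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# (in_features, out_features) of the 3 linear layers; the first one consumes
# the flattened 512 x 7 x 7 feature map.
linear_shapes = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

def conv_params(c_in, c_out, k=3):
    """Weights (c_in * c_out * k * k) plus one bias per output channel."""
    return c_in * c_out * k * k + c_out

def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output feature."""
    return n_in * n_out + n_out

conv_counts = [conv_params(i, o) for i, o in conv_channels]
linear_counts = [linear_params(i, o) for i, o in linear_shapes]
total = sum(conv_counts) + sum(linear_counts)

print(conv_counts[:5])      # matches the first Conv2d rows of the summary
print(linear_counts)        # parameters of each linear layer
print(sum(conv_counts), sum(linear_counts), total)
```

The first five convolution counts (1,792; 36,928; 73,856; 147,584; 295,168) agree with the `Param #` column in the summary output shown earlier.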

In [4]:
# Another way to understand the components of the VGG16 model is to simply print it
model

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

- The model has three sub-modules: `features`, `avgpool`, and `classifier`.
- It is common to freeze `features` and `avgpool` during transfer learning (`avgpool` has no learnable parameters, so in practice this means freezing `features`).
- Remove the `classifier` module (or only its last few layers).
- Replace it with a new classifier suited to the classes of the specific dataset.
- Make sure the new output layer predicts the desired number of classes (not the original 1,000).