# Transfer Learning

In this notebook, you'll learn how to use pre-trained networks to solved challenging problems in computer vision. Specifically, you'll use networks trained on [ImageNet](http://www.image-net.org/) [available from torchvision](http://pytorch.org/docs/0.3.0/torchvision/models.html). 

ImageNet is a massive dataset with over 1 million labeled images in 1000 categories. It's used to train deep neural networks using an architecture called convolutional layers. I'm not going to get into the details of convolutional networks here, but if you want to learn more about them, please [watch this](https://www.youtube.com/watch?v=2-Ol7ZB0MmU).

Once trained, these models work astonishingly well as feature detectors for images they weren't trained on. Using a pre-trained network on images not in the training set is called transfer learning. Here we'll use transfer learning to train a network that can classify our cat and dog photos with near perfect accuracy.

With `torchvision.models` you can download these pre-trained networks and use them in your applications. We'll include `models` in our imports now.

In [1]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt

import torch
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms, models

Most of the pretrained models require the input to be 224x224 images. Also, we'll need to match the normalization used when the models were trained. Each color channel was normalized separately, the means are `[0.485, 0.456, 0.406]` and the standard deviations are `[0.229, 0.224, 0.225]`.

In [2]:
data_dir = 'Cat_Dog_data'

# TODO: Define transforms for the training data and testing data
train_transforms = transforms.Compose([transforms.RandomRotation(30),
                                       transforms.RandomResizedCrop(224),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.ToTensor(),
                                       transforms.Normalize([0.485, 0.456, 0.406],
                                                            [0.229, 0.224, 0.225])])

test_transforms = transforms.Compose([transforms.Resize(255),
                                      transforms.CenterCrop(224),
                                      transforms.ToTensor(),
                                      transforms.Normalize([0.485, 0.456, 0.406],
                                                           [0.229, 0.224, 0.225])])

# Pass transforms in here, then run the next cell to see how the transforms look
train_data = datasets.ImageFolder(data_dir + '/train', transform=train_transforms)
test_data = datasets.ImageFolder(data_dir + '/test', transform=test_transforms)

trainloader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(test_data, batch_size=64)

We can load in a model such as [DenseNet](http://pytorch.org/docs/0.3.0/torchvision/models.html#id5). Let's print out the model architecture so we can see what's going on.

In [3]:
model = models.densenet121(pretrained=True)
model



DenseNet(
  (features): Sequential(
    (conv0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (norm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu0): ReLU(inplace=True)
    (pool0): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (denseblock1): _DenseBlock(
      (denselayer1): _DenseLayer(
        (norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplace=True)
        (conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu2): ReLU(inplace=True)
        (conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      (denselayer2): _DenseLayer(
        (norm1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu


---

### Transfer Learning with Pre-trained CNNs

This model is built out of two main parts, the features and the classifier. The features part is a stack of convolutional layers and overall works as a feature detector that can be fed into a classifier. The classifier part is a single fully-connected layer `(classifier): Linear(in_features=1024, out_features=1000)`. This layer was trained on the ImageNet dataset, so it won't work for our specific problem. That means we need to replace the classifier, but the features will work perfectly on their own. In general, I think about pre-trained networks as amazingly good feature detectors that can be used as the input for simple feed-forward classifiers.


###### The Concept: Leveraging Pre-trained Knowledge

Transfer learning exploits the hierarchical feature learning in deep convolutional neural networks. Pre-trained models like VGG, ResNet, or DenseNet trained on massive datasets (ImageNet with millions of images) develop sophisticated feature detection capabilities across multiple abstraction levels.

**Feature Hierarchy in CNNs:**
- **Early layers**: Detect basic features (edges, colors, textures)
- **Middle layers**: Combine basic features into patterns (shapes, object parts)
- **Deep layers**: Recognize complex structures (faces, objects, scenes)

###### Model Architecture: Two-Part Structure

```mermaid
graph TD
    A["Input Image"] --> B["Feature Extractor"]
    B --> C["Features (1024-dim vector)"]
    C --> D["Classifier"]
    D --> E["Output Predictions"]
    
    subgraph "Pre-trained Part (Frozen)"
        B
        B1["Conv Layer 1"] --> B2["Conv Layer 2"]
        B2 --> B3["..."]
        B3 --> B4["Conv Layer N"]
        B4 --> B5["Global Average Pool"]
    end
    
    subgraph "Custom Part (Trainable)"
        D
        D1["Linear(1024, 500)"] --> D2["ReLU"]
        D2 --> D3["Linear(500, 2)"]
        D3 --> D4["LogSoftmax"]
    end
    
    style A fill:#FFF3E0
    style B fill:#FFE0B2
    style C fill:#FFCC80
    style D fill:#FFB74D
    style E fill:#FFA726
    style B1 fill:#FF9800
    style B2 fill:#FB8C00
    style B3 fill:#F57C00
    style B4 fill:#FF8A65
    style B5 fill:#FF7043
    style D1 fill:#F48FB1
    style D2 fill:#F06292
    style D3 fill:#EC407A
    style D4 fill:#E91E63
```

---

###### Step-by-Step Implementation Process

**Step 1: Freeze Pre-trained Parameters**
```python
for param in model.parameters():
    param.requires_grad = False
```

This prevents the valuable pre-trained weights from being modified during training. The feature extractor becomes a fixed feature extraction function.

**Mathematical Impact:**
- Gradients are not computed for frozen parameters
- Memory usage reduced (no gradient storage)
- Training speed increased (fewer parameters to update)
- Pre-trained knowledge preserved

**Step 2: Design Custom Classifier**

```python
classifier = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(1024, 500)),
    ('relu', nn.ReLU()),
    ('fc2', nn.Linear(500, 2)),
    ('output', nn.LogSoftmax(dim=1))
]))
```

**Architecture Decisions:**
- **Input size (1024)**: Must match feature extractor output dimensions
- **Hidden layer (500)**: Reduces dimensionality while maintaining representation capacity
- **Output size (2)**: Matches new task classes (e.g., cats vs dogs)
- **LogSoftmax**: Appropriate for classification with cross-entropy loss

**Step 3: Replace Original Classifier**

```python
model.classifier = classifier
```

This substitution creates a hybrid model combining:
- **Pre-trained feature extraction** (frozen, sophisticated)
- **Task-specific classification** (trainable, simple)

---

###### Training Dynamics

**Forward Pass Flow:**
1. Input image → Pre-trained feature extractor → High-level features
2. High-level features → Custom classifier → Task predictions

**Backward Pass Flow:**
1. Loss computed on predictions
2. Gradients flow only to custom classifier (frozen layers ignored)
3. Only classifier weights updated

###### Advantages of This Approach

**Computational Efficiency:**
- Fewer parameters to train (only classifier layers)
- Faster training convergence
- Lower memory requirements

**Performance Benefits:**
- Leverages sophisticated feature representations
- Requires less training data
- Often achieves better accuracy than training from scratch

**Knowledge Transfer:**
- Generic visual features transfer across domains
- Reduces overfitting on small datasets
- Maintains robustness of pre-trained features

#### Key Considerations

**Feature Dimension Matching:**
The custom classifier's first layer input size must exactly match the pre-trained model's feature output size. Common dimensions:
- VGG: 4096 features
- ResNet: 2048 features  
- DenseNet: 1024 features

**Task Similarity:**
Transfer learning works best when the new task shares visual similarities with the pre-training dataset (ImageNet contains diverse natural images).

This approach transforms a powerful general-purpose vision model into a specialized tool for your specific classification task while preserving the valuable learned representations.

---

In [4]:
from collections import OrderedDict

# Freeze parameters so we don't backprop through them
for param in model.parameters():
    param.requires_grad = False

classifier = nn.Sequential(OrderedDict([
                          ('fc1', nn.Linear(1024, 500)),
                          ('relu', nn.ReLU()),
                          ('fc2', nn.Linear(500, 2)),
                          ('output', nn.LogSoftmax(dim=1))
                          ]))
    
model.classifier = classifier

With our model built, we need to train the classifier. However, now we're using a **really deep** neural network. If you try to train this on a CPU like normal, it will take a long, long time. Instead, we're going to use the GPU to do the calculations. The linear algebra computations are done in parallel on the GPU leading to 100x increased training speeds. It's also possible to train on multiple GPUs, further decreasing training time.

PyTorch, along with pretty much every other deep learning framework, uses [CUDA](https://developer.nvidia.com/cuda-zone) to efficiently compute the forward and backwards passes on the GPU. In PyTorch, you move your model parameters and other tensors to the GPU memory using `model.to('cuda')`. You can move them back from the GPU with `model.to('cpu')` which you'll commonly do when you need to operate on the network output outside of PyTorch. As a demonstration of the increased speed, I'll compare how long it takes to perform a forward and backward pass with and without a GPU.

In [7]:
#  check if MPS available 

def get_device():
    """Return the best available device: MPS > CUDA > CPU"""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    elif torch.cuda.is_available():
        return torch.device("cuda")
    else:
        return torch.device("cpu")

In [8]:
import time

# Check available devices
devices = []
devices.append(get_device())

for device in devices:
    criterion = nn.NLLLoss()
    # Only train the classifier parameters, feature parameters are frozen
    optimizer = optim.Adam(model.classifier.parameters(), lr=0.001)

    model.to(device)

    for ii, (inputs, labels) in enumerate(trainloader):
        # Move input and label tensors to the appropriate device
        inputs, labels = inputs.to(device), labels.to(device)

        start = time.time()

        outputs = model.forward(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if ii==3:
            break
        
    print(f"Device = {device}; Time per batch: {(time.time() - start)/3:.3f} seconds")

Device = mps; Time per batch: 0.033 seconds


You can write device agnostic code which will automatically use CUDA if it's enabled like so:
```python
# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

...

# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
```

From here, I'll let you finish training the model. The process is the same as before except now your model is much more powerful. You should get better than 95% accuracy easily.

>**Exercise:** Train a pretrained models to classify the cat and dog images. Continue with the DenseNet model, or try ResNet, it's also a good model to try out first. Make sure you are only training the classifier and the parameters for the features part are frozen.

In [9]:
device = get_device()
model = models.densenet121(pretrained=True)

# Freeze parameters so we don't backprop through them
for param in model.parameters():
    param.requires_grad = False
    
model.classifier = nn.Sequential(nn.Linear(1024, 256),
                                 nn.ReLU(),
                                 nn.Dropout(0.2),
                                 nn.Linear(256, 2),
                                 nn.LogSoftmax(dim=1))

criterion = nn.NLLLoss()

# Only train the classifier parameters, feature parameters are frozen
optimizer = optim.Adam(model.classifier.parameters(), lr=0.003)

model.to(device)
print(f"Using device: {device}")



Using device: mps


In [10]:
epochs = 1
steps = 0

running_loss = 0
print_every = 5


for epoch in range(epochs):
    for inputs, labels in trainloader:
        steps += 1
        # Move input and label tensors to the default device
        inputs, labels = inputs.to(device), labels.to(device)
        
        logps = model.forward(inputs)
        loss = criterion(logps, labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        
        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            model.eval()
            with torch.no_grad():
                for inputs, labels in testloader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    logps = model.forward(inputs)
                    batch_loss = criterion(logps, labels)
                    
                    test_loss += batch_loss.item()
                    
                    # Calculate accuracy
                    ps = torch.exp(logps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()
                    
            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Test loss: {test_loss/len(testloader):.3f}.. "
                  f"Test accuracy: {100 * accuracy/len(testloader):.1f}%")
            running_loss = 0
            model.train()

Epoch 1/1.. Train loss: 0.684.. Test loss: 0.292.. Test accuracy: 88.2%
Epoch 1/1.. Train loss: 0.335.. Test loss: 0.113.. Test accuracy: 98.2%
Epoch 1/1.. Train loss: 0.297.. Test loss: 0.111.. Test accuracy: 96.0%




Epoch 1/1.. Train loss: 0.266.. Test loss: 0.053.. Test accuracy: 98.3%
Epoch 1/1.. Train loss: 0.280.. Test loss: 0.047.. Test accuracy: 98.4%
Epoch 1/1.. Train loss: 0.187.. Test loss: 0.045.. Test accuracy: 98.8%
Epoch 1/1.. Train loss: 0.240.. Test loss: 0.054.. Test accuracy: 98.3%
Epoch 1/1.. Train loss: 0.182.. Test loss: 0.041.. Test accuracy: 98.9%
Epoch 1/1.. Train loss: 0.119.. Test loss: 0.038.. Test accuracy: 98.8%
Epoch 1/1.. Train loss: 0.152.. Test loss: 0.036.. Test accuracy: 98.7%
Epoch 1/1.. Train loss: 0.183.. Test loss: 0.035.. Test accuracy: 98.8%
Epoch 1/1.. Train loss: 0.129.. Test loss: 0.033.. Test accuracy: 99.1%
Epoch 1/1.. Train loss: 0.113.. Test loss: 0.032.. Test accuracy: 99.0%
Epoch 1/1.. Train loss: 0.168.. Test loss: 0.028.. Test accuracy: 99.3%
Epoch 1/1.. Train loss: 0.136.. Test loss: 0.029.. Test accuracy: 99.0%
