## Computer vision libraries in PyTorch

| PyTorch module | What does it do? |
| ----- | ----- |
| [`torchvision`](https://pytorch.org/vision/stable/index.html) | Contains datasets, model architectures and image transformations often used for computer vision problems. |
| [`torchvision.datasets`](https://pytorch.org/vision/stable/datasets.html) | Here you'll find many example computer vision datasets for a range of problems, including image classification, object detection, image captioning, video classification and more. It also contains [a series of base classes for making custom datasets](https://pytorch.org/vision/stable/datasets.html#base-classes-for-custom-datasets). |
| [`torchvision.models`](https://pytorch.org/vision/stable/models.html) | This module contains well-performing and commonly used computer vision model architectures implemented in PyTorch. You can use these with your own problems. |
| [`torchvision.transforms`](https://pytorch.org/vision/stable/transforms.html) | Images often need to be transformed (turned into numbers, processed or augmented) before being used with a model. Common image transformations are found here. |
| [`torch.utils.data.Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) | Base dataset class for PyTorch.  | 
| [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) | Creates a Python iterable over a dataset (created with `torch.utils.data.Dataset`). |
    

## Typical CNN Framework

![title](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-07-at-4-59-29-pm.png)

### What does a CNN do?

![title](https://ujwlkarn.files.wordpress.com/2016/07/screen-shot-2016-07-24-at-11-25-13-pm.png?w=254&h=230)
![title](https://ujwlkarn.files.wordpress.com/2016/07/screen-shot-2016-07-24-at-11-25-24-pm.png?w=148&h=128)

![title](https://ujwlkarn.files.wordpress.com/2016/07/convolution_schematic.gif)

![title](https://ujwlkarn.files.wordpress.com/2016/08/giphy.gif)

![title](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*eoVNuUXZy5W2zb0CIK5IhQ.png)

In practice, a CNN learns the values of these filters on its own during training. We still need to define the hyperparameters of `nn.Conv2d()`:

* `in_channels` (int) - Number of channels in the input image.
* `out_channels` (int) - Number of channels produced by the convolution. (Number of filters)
* `kernel_size` (int or tuple) - Size of the convolving kernel/filter.
* `stride` (int or tuple, optional) - How big of a step the convolving kernel takes at a time. Default: 1.
* `padding` (int, tuple or str, optional) - Padding added to all four sides of the input. Adding extra pixels around the edges of the input image lets the filter pass properly over the edges. Zero padding also lets us control the spatial size of the output volume; it is commonly used to preserve the spatial size of the input so that the input and output width and height are the same. Default: 0.
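
To make the roles of `kernel_size`, `stride` and `padding` concrete, here is a minimal pure-Python sketch of the sliding-window operation. It is not how `nn.Conv2d()` is implemented: the function name `conv2d_naive`, the single-channel input, the square kernel and the example values are all assumptions made for illustration.

```python
def conv2d_naive(image, kernel, stride=1, padding=0):
    """Slide a square `kernel` over a 2D `image` (lists of lists)."""
    # Zero-pad the image on all four sides.
    width = len(image[0])
    padded = [[0] * (width + 2 * padding) for _ in range(padding)]
    padded += [[0] * padding + list(row) + [0] * padding for row in image]
    padded += [[0] * (width + 2 * padding) for _ in range(padding)]

    k = len(kernel)  # assumes a square kernel
    out_h = (len(padded) - k) // stride + 1
    out_w = (len(padded[0]) - k) // stride + 1

    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel element-wise with the patch under it, then sum.
            total = sum(
                kernel[a][b] * padded[i * stride + a][j * stride + b]
                for a in range(k)
                for b in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# Illustrative 5x5 binary image and 3x3 filter (hypothetical values).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d_naive(image, kernel)           # 3x3 output
same_size = conv2d_naive(image, kernel, padding=1)  # 5x5 output, same as input
strided = conv2d_naive(image, kernel, stride=2)     # 2x2 output

print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
print(strided)      # [[4, 4], [2, 4]]
```

With `padding=1` the output keeps the 5x5 size of the input, and with `stride=2` the filter visits only every second position.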

![example of going through the different parameters of a Conv2d layer](https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/03-conv2d-layer.gif)

*Example of what happens when you change the hyperparameters of a `nn.Conv2d()` layer.*

In [1]:
import torch.nn as nn
import torch
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
tensor = torch.randn(1, 3, 64, 64)
output = conv_layer(tensor)
print(conv_layer)
print(output.size())
print()

conv_layer_2 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=4, stride=2, padding=1)
output_2 = conv_layer_2(tensor)
print(conv_layer_2)
print(output_2.size())

Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
torch.Size([1, 16, 64, 64])

Conv2d(3, 16, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
torch.Size([1, 16, 32, 32])


We can compute the spatial size of the output volume (O) as a function of the input volume size (W), the receptive field size of the conv layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used on the border (P). The formula for how many neurons “fit” is:

$$O = \frac{W - F + 2P}{S} + 1$$

For example, a 7x7 input and a 3x3 filter with stride 1 and padding 0 give a 5x5 output; with stride 2 the output is 3x3. The two layers in the code cell above follow the same rule: (64 − 3 + 2·1)/1 + 1 = 64 and (64 − 4 + 2·1)/2 + 1 = 32.
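
The formula can be checked with a small helper. This is a sketch: `conv_output_size` is a hypothetical name, it handles one spatial dimension, and it floors the division, which is what `nn.Conv2d()` does when the stride does not divide evenly.

```python
def conv_output_size(w, f, s=1, p=0):
    """Output size O = (W - F + 2P) / S + 1 along one spatial dimension."""
    # Integer division floors the result when S does not divide (W - F + 2P).
    return (w - f + 2 * p) // s + 1


print(conv_output_size(7, 3, s=1, p=0))   # 5  (7x7 input, 3x3 filter, stride 1)
print(conv_output_size(7, 3, s=2, p=0))   # 3  (same, stride 2)
print(conv_output_size(64, 3, s=1, p=1))  # 64 (first Conv2d layer above)
print(conv_output_size(64, 4, s=2, p=1))  # 32 (second Conv2d layer above)
```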

### Pooling Layers

In [2]:

max_pool_layer = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
output_3 = max_pool_layer(tensor)
print(tensor.shape)
print(max_pool_layer)
print(output_3.size())

torch.Size([1, 3, 64, 64])
MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
torch.Size([1, 3, 32, 32])


![title](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-10-at-3-38-39-am.png?w=988)

A pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Left: in this example, an input volume of size [224x224x64] is pooled with filter size 2 and stride 2 into an output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: the most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (a little 2x2 square).

![title](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-07-at-6-11-53-pm.png)

The function of pooling is to progressively reduce the spatial size of the input representation.
* It reduces the amount of computation in the network and the number of parameters in the layers that follow, which also helps control overfitting.
* It makes the detection of features invariant to small transformations, distortions and translations: a small distortion in the input will not change the output of pooling, since we take the maximum / average value in a local neighborhood.
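
Both points can be seen on a tiny example. The sketch below is a pure-Python max pooling over one 2D slice; `max_pool2d_naive` and the 4x4 values are assumptions for illustration, not the `nn.MaxPool2d()` implementation.

```python
def max_pool2d_naive(feature_map, kernel_size=2, stride=2):
    """Take the maximum over each kernel_size x kernel_size window."""
    out_h = (len(feature_map) - kernel_size) // stride + 1
    out_w = (len(feature_map[0]) - kernel_size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(kernel_size)
                for b in range(kernel_size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d_naive(feature_map)
print(pooled)  # [[6, 8], [3, 4]]

# Small distortion: swap two neighbouring values inside the top-left window.
distorted = [row[:] for row in feature_map]
distorted[1][0], distorted[1][1] = distorted[1][1], distorted[1][0]
print(max_pool2d_naive(distorted) == pooled)  # True
```

The 4x4 slice shrinks to 2x2, and moving a value within its 2x2 window leaves the pooled output unchanged.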

Putting it all together:



![title](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-07-at-9-15-21-pm.png)

![title](https://ujwlkarn.files.wordpress.com/2016/08/screen-shot-2016-08-10-at-12-58-30-pm.png?w=484)


Conv Layers - Transformations    
Pooling Layers - Downsampling

Key Principles of Convolutional Neural Networks:

- Sparse Interactions: In traditional neural networks, each output unit interacts with every input unit via matrix multiplication, resulting in dense connections. However, CNNs employ sparse interaction by using smaller kernels compared to the input data. For instance, while processing an image with millions of pixels, kernels can capture meaningful information within a smaller window of tens or hundreds of pixels. This sparse interaction reduces the number of parameters needed, leading to lower memory requirements and enhanced statistical efficiency of the model. It allows CNNs to focus on local patterns, efficiently capturing relevant features while ignoring irrelevant ones.

- Parameter Sharing: In a traditional fully connected layer, each weight is used exactly once when computing the output, so no parameters are shared. In CNNs, the same set of kernel weights is reused at every position of the input, promoting efficient feature extraction and allowing the network to learn patterns that apply at any spatial location.

- Equivariant Representations: Parameter sharing in CNN layers results in equivariance to translation. This property implies that if a transformation occurs in the input, the output undergoes a corresponding transformation. For instance, if an image is shifted or translated, the learned features or patterns in the output also shift accordingly. CNNs’ equivariant nature ensures that learned features retain their spatial relationships, making them robust and effective in recognizing patterns regardless of their position within the input data.
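
A rough count shows how much sparse interactions and parameter sharing save. The sketch below compares a fully connected layer with a convolutional layer that both map a 3x64x64 input to a 16x64x64 output. The shapes reuse the first `nn.Conv2d()` example above; the helper names are hypothetical.

```python
def dense_param_count(in_shape, out_shape):
    """Weights + biases when every output unit connects to every input unit."""
    n_in = in_shape[0] * in_shape[1] * in_shape[2]
    n_out = out_shape[0] * out_shape[1] * out_shape[2]
    return n_in * n_out + n_out


def conv_param_count(in_channels, out_channels, kernel_size):
    """Weights + biases of a conv layer: one small kernel per (out, in) channel pair."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


dense = dense_param_count((3, 64, 64), (16, 64, 64))
conv = conv_param_count(in_channels=3, out_channels=16, kernel_size=3)

print(f"fully connected: {dense:,} parameters")  # 805,371,904
print(f"convolutional:   {conv:,} parameters")   # 448
```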

### Transposed Convolution

Where would you find them? - In the decoder part of an autoencoder, in the generator part of a GAN, in the upsampling part of a U-Net, etc.

Purpose? - To increase the size or spatial resolution of the input.

In [3]:
conv_transpose = nn.ConvTranspose2d(in_channels=16, out_channels=3, kernel_size=3, stride=2, padding=1, output_padding=1)
input = torch.randn(1, 16, 4, 4)
output = conv_transpose(input)

print("Input shape:", input.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([1, 16, 4, 4])
Output shape: torch.Size([1, 3, 8, 8])


#### Is it the reverse of a Convolution?

- Not really. If you think about it, a convolution can't be "reverted": you cannot recover the original values from the output of a convolutional layer. The example below shows that convolution loses information (around the edges of the image).

- Although we can't recover the lost information, we can reverse the size reduction that happens in a convolutional layer, which is what transposed convolution does.
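
The size reversal can be verified with the output-size formulas alone. This is a sketch for one spatial dimension with dilation 1; the helper names are hypothetical, and the transposed formula is the one given in the `nn.ConvTranspose2d()` documentation.

```python
def conv_out(w, k, s=1, p=0):
    """Output size of a regular convolution (one dimension, dilation 1)."""
    return (w + 2 * p - k) // s + 1


def conv_transpose_out(w, k, s=1, p=0, output_padding=0):
    """Output size of a transposed convolution (one dimension, dilation 1)."""
    return (w - 1) * s - 2 * p + (k - 1) + output_padding + 1


down = conv_out(8, k=3, s=2, p=1)                             # 8 -> 4
up = conv_transpose_out(down, k=3, s=2, p=1, output_padding=1)  # 4 -> 8
print(down, up)  # 4 8

# Without output_padding the transposed layer returns 7, not 8.
# Both 7 and 8 shrink to 4 under the strided convolution, so
# output_padding selects which of the possible sizes to restore.
print(conv_transpose_out(4, k=3, s=2, p=1))  # 7
print(conv_out(7, k=3, s=2, p=1))            # 4
```

Only the shape is recovered here; the values produced by the transposed layer are learned and are not the original inputs.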


![title](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*SEvLW8SoeUqTkNZgZNFCNQ.png)

Example to show that convolution results in a loss of information (around the edges of the image).

![title](https://miro.medium.com/v2/resize:fit:1000/format:webp/1*SpxCUPzNfb9C8TiAcrRr5A.gif)

### Dilated Convolution

Dilated convolution expands the kernel by inserting holes (gaps) between its consecutive elements. In simpler terms, it is the same as a regular convolution, except that the kernel skips pixels of the input so that it covers a larger area.

- Enables the network to have a larger receptive field without increasing the number of parameters.

In [4]:
dilated_conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=1, dilation=2)
# dilation = 2 means there is one pixel of space between the kernel elements

input = torch.randn(1, 1, 8, 8)

# Apply the dilated convolutional layer to the input
output = dilated_conv(input)

print("Input shape:", input.shape)
print("Output shape:", output.shape)

Input shape: torch.Size([1, 1, 8, 8])
Output shape: torch.Size([1, 1, 6, 6])
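
The 6x6 output above follows from the effective size of the dilated kernel. Here is a short sketch of the arithmetic; the helper names are hypothetical and cover one spatial dimension.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a kernel once gaps are inserted between its elements."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv_out(w, kernel_size, stride=1, padding=0, dilation=1):
    """Output size of a (possibly dilated) convolution along one dimension."""
    k_eff = effective_kernel_size(kernel_size, dilation)
    return (w + 2 * padding - k_eff) // stride + 1


print(effective_kernel_size(3, dilation=2))           # 5: a 3x3 kernel covers a 5x5 area
print(dilated_conv_out(8, 3, padding=1, dilation=2))  # 6: matches the output shape above
print(dilated_conv_out(8, 3, padding=2, dilation=2))  # 8: padding=2 would preserve the size
```

The kernel still has only 3x3 weights, so the receptive field grows without adding parameters.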


![title](https://github.com/vdumoulin/conv_arithmetic/raw/master/gif/no_padding_no_strides.gif)    
Convolution   

![title](https://github.com/vdumoulin/conv_arithmetic/raw/master/gif/no_padding_no_strides_transposed.gif)    
Transposed Convolution    

![title](https://github.com/vdumoulin/conv_arithmetic/raw/master/gif/dilation.gif)    
Dilated Convolution

Blue maps are inputs, cyan maps are outputs

# Transfer Learning / Fine-Tuning with PyTorch


What is transfer learning?

**Transfer learning** allows us to take the patterns (also called weights) another model has learned from another problem and use them for our own problem.

For example, we can take the patterns a computer vision model has learned from datasets such as [ImageNet](https://www.image-net.org/) (millions of images of different objects) and apply them to a custom dataset. Or we could take the patterns from a [language model](https://developers.google.com/machine-learning/glossary#masked-language-model) (a model that's been through large amounts of text to learn a representation of language) and use them as the basis of a model to classify different text samples.

The premise remains: find a well-performing existing model and apply it to your own problem.

<img src="https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/06-transfer-learning-example-overview.png" alt="transfer learning overview on different problems" width=900/>

*Example of transfer learning being applied to computer vision and natural language processing (NLP). In the case of computer vision, a computer vision model might learn patterns on millions of images in ImageNet and then use those patterns to infer on another problem. And for NLP, a language model may learn the structure of language by reading all of Wikipedia (and perhaps more) and then apply that knowledge to a different problem.*

## Why use transfer learning?

There are two main benefits to using transfer learning:

1. Can leverage an existing model (usually a neural network architecture) proven to work on problems similar to our own.
2. Can leverage a working model which has **already learned** patterns on similar data to our own. This often results in achieving **great results with less custom data**.




## Where to find pretrained models

| **Location** | **What's there?** | **Link(s)** | 
| ----- | ----- | ----- |
| **PyTorch domain libraries** | Each of the PyTorch domain libraries (`torchvision`, `torchtext`, `torchaudio`, `torchrec`) comes with pretrained models of some form. The models there work right within PyTorch. | [`torchvision.models`](https://pytorch.org/vision/stable/models.html), [`torchtext.models`](https://pytorch.org/text/main/models.html), [`torchaudio.models`](https://pytorch.org/audio/stable/models.html), [`torchrec.models`](https://pytorch.org/torchrec/torchrec.models.html) |
| **HuggingFace Hub** | A series of pretrained models on many different domains (vision, text, audio and more) from organizations around the world. There are plenty of different datasets too. | https://huggingface.co/models, https://huggingface.co/datasets |
| **`timm` (PyTorch Image Models) library** | Almost all of the latest and greatest computer vision models in PyTorch code as well as plenty of other helpful computer vision features. | https://github.com/rwightman/pytorch-image-models|
| **Paperswithcode** | A collection of the latest state-of-the-art machine learning papers with code implementations attached. You can also find benchmarks here of model performance on different tasks. | https://paperswithcode.com/ | 

<img src="https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/06-transfer-learning-where-to-find-pretrained-models.png" alt="different locations to find pretrained neural network models" width=900/>



### Example

![title](https://miro.medium.com/v2/resize:fit:1400/format:webp/0*8Z3To8OAwBBIj66p.jpg)

In [5]:
import torch.nn as nn
import timm

num_classes = 4 # Replace num_classes with the number of classes in your data

# Load pre-trained model from timm
model = timm.create_model('resnet50', pretrained=True)
print(f'Original FC layer: {model.fc}')

# Modify the model head for fine-tuning
num_features = model.fc.in_features
print(f'Feature dimension: {num_features}')

# Additional linear layer and dropout layer
model.fc = nn.Sequential(
    nn.Linear(num_features, 256),  
    nn.ReLU(),         
    nn.Dropout(0.5),               
    nn.Linear(256, num_classes)    
)
print(f'Modified FC layer: {model.fc}')


Original FC layer: Linear(in_features=2048, out_features=1000, bias=True)
Feature dimension: 2048
Modified FC layer: Sequential(
  (0): Linear(in_features=2048, out_features=256, bias=True)
  (1): ReLU()
  (2): Dropout(p=0.5, inplace=False)
  (3): Linear(in_features=256, out_features=4, bias=True)
)


### Freezing Full or Partial network

Freezing means fixing the weights of specific layers, or of the entire network, during the fine-tuning process. It allows us to retain the knowledge captured by the pre-trained model while updating only certain layers to adapt to the target task.

- Freezing the entire network. If the pre-trained model has been trained on a large-scale dataset similar to the target task, freezing the entire network helps preserve the learned representations and prevents them from being overwritten. In this case, only the model’s head is replaced and trained from scratch.

- Freezing only a portion of the network. This approach is particularly useful when the pre-trained model has been trained on a dataset that is somewhat similar to the target task, but you believe that fine-tuning some of the more abstract or higher-level features could further improve performance. By doing so, you can adjust the deeper layers of the network to better suit the nuances of your specific task while still leveraging the lower-level feature representations learned from the large-scale dataset.

In [6]:
model = timm.create_model('resnet50', pretrained=True)

# Freeze all the layers of the pre-trained model
for param in model.parameters():
    param.requires_grad = False

# Modify the model's head for a new task
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

In [7]:
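# Start again from a fresh copy so that every layer is trainable
model = timm.create_model('resnet50', pretrained=True)

# Freeze only the convolutional layers of the pre-trained model.
# requires_grad lives on parameters, so iterate over modules to check the layer type.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        for param in module.parameters():
            param.requires_grad = False

# Modify the model's head for a new task
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)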

In [8]:
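# Start again from a fresh copy so that every layer is trainable
model = timm.create_model('resnet50', pretrained=True)

# Freeze specific layers (e.g., the stem convolution and the first residual stage).
# Match on the name prefix: a plain `'conv1' in name` would also match the
# conv1 inside every later block (e.g. 'layer2.0.conv1.weight').
for name, param in model.named_parameters():
    if name.startswith(('conv1.', 'layer1.')):
        param.requires_grad = False

# Modify the model's head for a new task
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)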
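
After freezing, it is worth checking which parameters will actually be trained. The sketch below uses a tiny randomly initialised network as a stand-in for a pre-trained model, so it runs without downloading weights; the stand-in model and the helper `count_parameters` are assumptions for illustration.

```python
import torch.nn as nn

# Hypothetical stand-in for a pre-trained network (randomly initialised here).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 1000),
)

# Freeze every existing parameter.
for param in model.parameters():
    param.requires_grad = False

# Replace the head; newly created layers have requires_grad=True by default.
model[4] = nn.Linear(8, 10)


def count_parameters(model):
    """Return (trainable, frozen) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen


trainable, frozen = count_parameters(model)
print(f"trainable: {trainable}, frozen: {frozen}")  # trainable: 90, frozen: 224

# List the names of the parameters that will receive gradient updates.
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
```

The same helper can be called on the `timm` model after any of the freezing cells above.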

Another way of fine-tuning is setting different learning rates for different parts of the network. For example, the learning rate for the pre-trained layers can be set to a smaller value than the learning rate for the new layers. This is because the pre-trained layers may already contain useful representations, and we don’t want to change them too much. On the other hand, the new layers are randomly initialized and need to be trained more aggressively.

In [9]:
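import torch

# Parameter groups let each part of the network use its own learning rate.
# Only the parameters listed here are updated, and only if they are
# trainable (requires_grad=True), i.e. not frozen in an earlier cell.
optimizer = torch.optim.Adam([
    {'params': model.conv1.parameters(), 'lr': 1e-4},   # pre-trained stem: small steps
    {'params': model.layer1.parameters(), 'lr': 1e-4},  # pre-trained stage: small steps
    {'params': model.fc.parameters(), 'lr': 1e-3}       # new head: larger steps
])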
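
If every pre-trained layer should be fine-tuned at the lower rate, one option is to split the parameters into "head" and "everything else". This is a sketch under assumptions: `TinyNet` is a hypothetical stand-in for the pre-trained model, and the only thing it shares with the ResNet above is a head attribute named `fc`.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pre-trained model with an `fc` head."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(self.backbone(x))


model = TinyNet()

# Everything that is not part of the head counts as backbone.
head_params = list(model.fc.parameters())
head_ids = {id(p) for p in head_params}
backbone_params = [p for p in model.parameters() if id(p) not in head_ids]

optimizer = torch.optim.Adam([
    {'params': backbone_params, 'lr': 1e-4},  # pre-trained layers: small steps
    {'params': head_params, 'lr': 1e-3},      # new head: larger steps
])

for group in optimizer.param_groups:
    print(len(group['params']), group['lr'])
```

Splitting by identity means no parameter is left out of the optimizer by accident.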

In [10]:
import torch
import torch.nn as nn

# Creating a CNN class
class ConvNeuralNet(nn.Module):
    # Define the layers of the CNN and the order in which they are applied
    def __init__(self, num_classes):
        super().__init__()

        self.conv_layers = nn.ModuleDict({
            'conv1': nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3),
            'conv2': nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3),
            'max_pool1': nn.MaxPool2d(kernel_size = 2, stride = 2),
            'conv3': nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3),
            'conv4': nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3),
            'max_pool2': nn.MaxPool2d(kernel_size = 2, stride = 2)
        })

        self.fc_layers = nn.ModuleDict({
            'fc1': nn.Linear(1600, 128),  # 1600 = 64 channels * 5 * 5 (shape after max_pool2 for a 32x32 input)
            'fc2': nn.Linear(128, num_classes)
        })
    
    # Forward pass: send the data through each layer in order
    def forward(self, x):
        
        print(f"Input shape: {x.shape}")
        print()
        for layer_name, layer in self.conv_layers.items():
            x = layer(x)
            print(f"Shape after {layer_name}: {x.shape}")

        print()
        x = x.reshape(x.size(0), -1)
        print(f"Flattened shape after Conv Layers: {x.shape}")
        print()

        for layer_name, layer in self.fc_layers.items():
            x = layer(x)
            print(f"Shape after {layer_name}: {x.shape}")
        return x

cnn_model = ConvNeuralNet(num_classes=10)  


In [11]:
random_data = torch.randn(16, 3, 32, 32)
output = cnn_model(random_data)

Input shape: torch.Size([16, 3, 32, 32])

Shape after conv1: torch.Size([16, 32, 30, 30])
Shape after conv2: torch.Size([16, 32, 28, 28])
Shape after max_pool1: torch.Size([16, 32, 14, 14])
Shape after conv3: torch.Size([16, 64, 12, 12])
Shape after conv4: torch.Size([16, 64, 10, 10])
Shape after max_pool2: torch.Size([16, 64, 5, 5])

Flattened shape after Conv Layers: torch.Size([16, 1600])

Shape after fc1: torch.Size([16, 128])
Shape after fc2: torch.Size([16, 10])
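
The input size of `fc1` (1600) can be derived without running the model by applying the output-size formula layer by layer. The sketch below assumes the same layer settings as `ConvNeuralNet` (no padding, stride 1 for convolutions, stride 2 for pooling); the helper name and the layer table are illustrative.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Output size of a conv or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


# (name, kernel_size, stride, out_channels); None means the channel count is kept.
layers = [
    ("conv1", 3, 1, 32),
    ("conv2", 3, 1, 32),
    ("max_pool1", 2, 2, None),
    ("conv3", 3, 1, 64),
    ("conv4", 3, 1, 64),
    ("max_pool2", 2, 2, None),
]

size, channels = 32, 3  # a 3x32x32 input image
for name, kernel_size, stride, out_channels in layers:
    size = conv_out(size, kernel_size, stride)
    if out_channels is not None:
        channels = out_channels
    print(f"{name}: {channels} x {size} x {size}")

flattened = channels * size * size
print(f"flattened features: {flattened}")  # 1600
```

If the input resolution or any layer setting changes, this value changes too, and the first argument of `fc1` has to be updated to match.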


References at References.md
