This is an implementation of ResNet, introduced in the paper **Deep Residual Learning for Image Recognition** found [here](https://arxiv.org/abs/1512.03385). I wrote this notebook to work through and understand what I did in the code found in `resnet.py`.

### Preparing the Project

In [None]:
%pip install torch

In [None]:
import torch
import torch.nn as nn

### Building ResNet

ResNet basically consists of 5 main layers, 4 of which are stacks of residual blocks with almost identical architecture. We can build the individual block as a reusable component (a `class`).

In [None]:
class Block(nn.Module):
    def __init__(self, in_channels, out_channels, downsample=None, stride=1):
        super().__init__()
        self.expansion = 4
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels*self.expansion, kernel_size=1, stride=1, padding=0)
        self.bn3 = nn.BatchNorm2d(out_channels*self.expansion)
        self.relu = nn.ReLU()
        self.downsample = downsample

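    def forward(self, x):
        identity = x  # saved input for the skip connection

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.conv3(x)
        x = self.bn3(x)  # no ReLU here: it is applied after the addition

        # Project the input when its shape differs from the output of conv3
        if self.downsample is not None:
            identity = self.downsample(identity)

        x += identity
        x = self.relu(x)
        return x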

A few key things to note.

`in_channels`: the number of channels going into the layer (here, the block).

`out_channels`: the number of channels produced by the layer. The last convolution of the block expands this to `out_channels * expansion`.

Starting from `conv2`, we pass `out_channels` as the number of input channels, because `conv2` operates on the result of `conv1`, and `conv1` returns a feature map with `out_channels` channels.

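`conv2` takes the block's `stride` argument instead of a fixed stride of 1. With a stride of 2, this
- makes the 3x3 convolution cheaper, because the larger kernel is evaluated at fewer positions
- reduces the computational load for the next layer (smaller spatial dimensions)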
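
To make the effect of the stride concrete, here is a small sketch of the usual output-size formula for a convolution along one dimension. The helper name `conv_output_size` and the 56-pixel input are only illustrative assumptions; the formula itself assumes a dilation of 1.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension (dilation of 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# A 3x3 convolution with padding 1, like conv2 in the block.
same = conv_output_size(56, kernel_size=3, stride=1, padding=1)
halved = conv_output_size(56, kernel_size=3, stride=2, padding=1)

print(same, halved)  # 56 28
```

With a stride of 2 the feature map has a quarter of the positions (half per side), which is where the savings for the following layers come from.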

The `if` statement that checks whether `downsample` is not `None` makes sure the input has the same dimensions as the output of the convolutions. If it does not, the input has to go through the `downsample` module before it is added to the output with `+`.
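
The shape problem can be reproduced in isolation. This is a minimal sketch with made-up sizes (256 input channels, 512 output channels, stride 2), and `main_path` is only a stand-in for the three convolutions of a block, not the block itself.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 56, 56)

# Stand-in for the main path: ends with 512 channels at half the resolution.
main_path = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)

# Projection shortcut: 1x1 convolution with the same stride, then batch norm.
projection = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=1, stride=2),
    nn.BatchNorm2d(512),
)

out = main_path(x)
shortcut = projection(x)
print(out.shape)       # torch.Size([1, 512, 28, 28])
print(shortcut.shape)  # torch.Size([1, 512, 28, 28])

# Adding the raw input fails: (1, 256, 56, 56) does not broadcast to (1, 512, 28, 28).
try:
    out + x
    raw_add_works = True
except RuntimeError:
    raw_add_works = False

result = out + shortcut  # works, because the projection matched both shapes
```

The 1x1 convolution fixes the channel count and its stride fixes the spatial size, so a single small module handles both kinds of mismatch.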

In [None]:
class ResNet(nn.Module):
    def __init__(self, block, layers, image_channels, num_classes):
        super().__init__()
        self.in_channels = 64
        self.conv1 = nn.Conv2d(image_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = self._make_layer(block, layers[0], out_channels=64, stride=1)
        self.layer2 = self._make_layer(block, layers[1], out_channels=128, stride=2)
        self.layer3 = self._make_layer(block, layers[2], out_channels=256, stride=2)
        self.layer4 = self._make_layer(block, layers[3], out_channels=512, stride=2)

        self.avgpool = nn.AdaptiveAvgPool2d((1,1))
        self.fc = nn.Linear(512 * 4, num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.reshape(x.shape[0], -1)
        x = self.fc(x)
        return x

    def _make_layer(self, block, num_res_blocks, out_channels, stride):
        downsample = None
        layers = []

        if stride != 1 or self.in_channels != out_channels * 4:
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels*4, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels*4)
            )
        
        layers.append(block(self.in_channels, out_channels, downsample, stride))
        self.in_channels = out_channels * 4

        for n in range(num_res_blocks - 1):
            layers.append(block(self.in_channels, out_channels))

        return nn.Sequential(*layers)

The interesting part here is `_make_layer` (everything else is basically taken straight from the diagram).
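
As a sanity check on the rest of the class, the sketch below traces channels and spatial size through the network using plain arithmetic. The 224x224 input is an assumption (the class itself accepts other sizes), and `output_size` is a hypothetical helper for the standard convolution and pooling size formula.

```python
def output_size(size, kernel_size, stride, padding):
    # Standard convolution/pooling size formula for one dimension (dilation of 1).
    return (size + 2 * padding - kernel_size) // stride + 1

size = 224  # assumed input resolution
trace = []

size = output_size(size, kernel_size=7, stride=2, padding=3)
trace.append(("conv1", 64, size))
size = output_size(size, kernel_size=3, stride=2, padding=1)
trace.append(("maxpool", 64, size))

# A stage changes resolution only through the stride of its first block's 3x3 convolution.
for name, out_channels, stride in [("layer1", 64, 1), ("layer2", 128, 2),
                                   ("layer3", 256, 2), ("layer4", 512, 2)]:
    size = output_size(size, kernel_size=3, stride=stride, padding=1)
    trace.append((name, out_channels * 4, size))

for name, channels, side in trace:
    print(f"{name:8s} channels={channels:5d} size={side}x{side}")
```

The last row has 2048 channels, which is why the fully connected layer is built with `512 * 4` input features after the average pooling.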

We assume that downsampling is only needed when...
1. `stride != 1` : the spatial dimensions change
2. `self.in_channels != out_channels * 4` : the number of input channels does not match the number of channels the block outputs
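
The two conditions can be checked by hand for every stage. This sketch mirrors the bookkeeping in `_make_layer` with plain integers; `needs_downsample` and `trace_stages` are hypothetical helper names used only for illustration.

```python
EXPANSION = 4

def needs_downsample(in_channels, out_channels, stride):
    # Same condition as in _make_layer.
    return stride != 1 or in_channels != out_channels * EXPANSION

def trace_stages(stages, in_channels=64):
    rows = []
    for out_channels, stride in stages:
        first_block = needs_downsample(in_channels, out_channels, stride)
        in_channels = out_channels * EXPANSION  # updated after the first block
        later_blocks = needs_downsample(in_channels, out_channels, stride=1)
        rows.append((out_channels, stride, first_block, later_blocks))
    return rows

rows = trace_stages([(64, 1), (128, 2), (256, 2), (512, 2)])
for out_channels, stride, first_block, later_blocks in rows:
    print(out_channels, stride, first_block, later_blocks)
```

Every first block needs the projection, including the one in `layer1`: its stride is 1, but it still goes from 64 to 256 channels. The remaining blocks of a stage never need it, which is why they are created without a `downsample` argument.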

The architecture gives each layer a different number of residual blocks, so `layers` lets us pass in the desired number of residual blocks per layer (by index) for our net.

In [None]:
def ResNet50(img_channels=3, num_classes=1000):
    return ResNet(Block, [3, 4, 6, 3], img_channels, num_classes)

def ResNet101(img_channels=3, num_classes=1000):
    return ResNet(Block, [3, 4, 23, 3], img_channels, num_classes)

def ResNet152(img_channels=3, num_classes=1000):
    return ResNet(Block, [3, 8, 36, 3], img_channels, num_classes)
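
The numbers in the names follow from the block counts. As a quick check, the sketch below counts the first convolution, the three convolutions in each block, and the final fully connected layer. The projection shortcuts are not counted, which is the usual convention. `count_weighted_layers` is a hypothetical helper, not part of `resnet.py`.

```python
def count_weighted_layers(blocks_per_stage, convs_per_block=3):
    # 1 stem convolution + the convolutions inside every block + 1 fully connected layer.
    return 1 + convs_per_block * sum(blocks_per_stage) + 1

depths = {
    "ResNet50": count_weighted_layers([3, 4, 6, 3]),
    "ResNet101": count_weighted_layers([3, 4, 23, 3]),
    "ResNet152": count_weighted_layers([3, 8, 36, 3]),
}
print(depths)  # {'ResNet50': 50, 'ResNet101': 101, 'ResNet152': 152}
```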