# Introduction to Residual Neural Network (ResNet)

## Residual Learning

For a CNN model, given an input $x$, it outputs a prediction $y$:

![resnet_1](./image/resnet_1.png "resnet_1")

A deep model behaves like a **composite function**, i.e. $y = f_n(f_{n-1}(f_{n-2}(\dots f_1(x)\dots))) = F(x)$, and we want this composite function to fit a desired underlying mapping $H(x)$. That is,

$$F(x) := H(x)$$

However, due to the degradation problem, the deeper the network is, the harder it becomes to fit the desired underlying mapping: beyond a certain depth, accuracy saturates and then degrades, even on the training set. To address this problem, we let the stacked layers fit another mapping, $F(x) := H(x) - x$, and the original mapping is recast as $F(x) + x$:

![resnet_2](./image/resnet_2.png "resnet_2")

As shown in the figure above, $F(x)$ is the output of the stacked layers, and $F(x) + x$ is realized by an `identity shortcut connection`, i.e. the input $x$ skips the stacked layers and is added to $F(x)$ directly.

The term $H(x) - x$ (i.e. $F(x)$) is called the `residual`, and He et al. (2015) show empirically that the residual learning framework eases the training of deeper networks.
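
To make the idea concrete, here is a minimal pure-Python sketch (not the PyTorch implementation; `stacked_layers`, `w1` and `w2` are hypothetical names used only for illustration). It assumes a toy two-layer $F(x)$ with scalar weights and shows that when the desired mapping $H(x)$ is the identity, the stacked layers only need to push their output to zero:

```python
def relu(values):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def stacked_layers(x, w1, w2):
    """Toy F(x): two elementwise 'layers' with scalar weights w1 and w2."""
    return [w2 * h for h in relu([w1 * v for v in x])]


def residual_block(x, w1, w2):
    """Identity shortcut: add the input x to the output of the stacked layers."""
    return [f + v for f, v in zip(stacked_layers(x, w1, w2), x)]


x = [1.0, -2.0, 3.5]

# With the last weight at zero, F(x) = 0 and the block reduces to the identity.
identity_out = residual_block(x, w1=0.7, w2=0.0)

# With non-zero weights, the block outputs x plus a learned correction F(x).
corrected_out = residual_block(x, w1=2.0, w2=0.5)
```

In this toy setting, `identity_out` equals `x`, while `corrected_out` is `x` shifted by the residual computed by the two layers.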

## ResNet Architectures for ImageNet

The following table shows the ResNet architectures for ImageNet:

![resnet_3](./image/resnet_3.png "resnet_3")

As shown, a ResNet contains several **stages**. Most stages (conv2_x to conv5_x) stack several **building blocks**, and each block contains several convolutional layers.
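
The depth in each model name can be recovered from the number of blocks per stage. The sketch below assumes the block counts listed in the architecture table of He et al. (2015) and counts only the weighted layers on the main path: the first convolution, the convolutions inside the blocks, and the final fully connected layer. The helper name `count_layers` is made up for illustration:

```python
def count_layers(blocks_per_stage, convs_per_block):
    """Weighted layers: first conv + convs inside all blocks + final fc layer."""
    return 1 + convs_per_block * sum(blocks_per_stage) + 1


# (blocks in conv2_x..conv5_x, conv layers per block)
# Basic blocks have 2 conv layers; bottleneck blocks have 3.
configs = {
    "resnet18": ([2, 2, 2, 2], 2),
    "resnet34": ([3, 4, 6, 3], 2),
    "resnet50": ([3, 4, 6, 3], 3),
    "resnet101": ([3, 4, 23, 3], 3),
    "resnet152": ([3, 8, 36, 3], 3),
}

depths = {name: count_layers(blocks, convs) for name, (blocks, convs) in configs.items()}

for name, depth in depths.items():
    print(f"{name}: {depth} weighted layers")
```

The 1x1 convolutions on projection shortcuts are not included in this count.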

## Bottleneck

There is a difference between `ResNet50-` (i.e. the 18-layer and 34-layer ResNets) and `ResNet50+` (50 layers or more): for ResNet50+, each block is a "bottleneck" building block:

![resnet_4](./image/resnet_4.png "resnet_4")

As shown, suppose we have a 256-d input, i.e. the number of channels is 256. After the first 1x1 layer, the number of channels drops from 256 to 64; the 3x3 layer then works on 64 channels; and the last 1x1 layer restores the number of channels from 64 back to 256. The bottleneck design reduces the dimensions and then restores them, so that the expensive 3x3 convolution operates on fewer channels and the time complexity is not increased.
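
A rough weight count illustrates why the bottleneck keeps the cost down. This sketch counts only convolution weights (no biases, no batch normalization parameters) for the 256-d example above; `conv_weights` is a helper introduced here for illustration:

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights in a square 2D convolution without bias."""
    return in_channels * out_channels * kernel_size * kernel_size


# Bottleneck block on a 256-d input: 1x1 (256->64), 3x3 (64->64), 1x1 (64->256)
bottleneck = (
    conv_weights(256, 64, 1)
    + conv_weights(64, 64, 3)
    + conv_weights(64, 256, 1)
)

# Hypothetical alternative: two plain 3x3 convolutions kept at 256 channels
plain_256 = 2 * conv_weights(256, 256, 3)

# Basic block (two 3x3 convolutions) operating on 64 channels
basic_64 = 2 * conv_weights(64, 64, 3)

print(bottleneck, plain_256, basic_64)
```

Under these assumptions, two plain 3x3 convolutions at 256 channels need roughly 17 times as many weights as the bottleneck, while the bottleneck costs about as much as a basic block operating on 64 channels.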

## PyTorch Implementation of ResNet

Now we implement ResNet in PyTorch. We choose two simple models, ResNet18 and ResNet50, whose building blocks are as follows:

![resnet_5](./image/resnet_5.png "resnet_5")

We also add Batch Normalization after each convolution in the basic blocks and the bottleneck blocks.
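
Because each convolution is followed by Batch Normalization, a convolution bias would be redundant, which is why the convolutions in this section are created with `bias=False`. The following pure-Python sketch (not PyTorch code) checks this numerically under simplifying assumptions: a single channel, training-mode statistics, and no learnable scale or shift. The values and the `batch_norm` helper are hypothetical:

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize activations to zero mean and unit variance (no affine step)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


pre_bias = [0.5, -1.0, 2.0, 3.5]  # hypothetical conv outputs for one channel
bias = 0.75                       # hypothetical conv bias

without_bias = batch_norm(pre_bias)
with_bias = batch_norm([v + bias for v in pre_bias])

# Subtracting the batch mean cancels any constant offset added by the bias.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(without_bias, with_bias))
print(same)
```

The constant offset is removed by the mean subtraction, and Batch Normalization's own learnable shift then takes over the role of the bias.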

In [1]:
import torch
import torch.nn as nn
from torch.hub import load_state_dict_from_url

model_urls = {
    'resnet18': 'https://download.pytorch.org/models/resnet18-5c106cde.pth',
    'resnet50': 'https://download.pytorch.org/models/resnet50-19c8e357.pth'
}

# 3x3 convolution, i.e. kernel_size=3
def conv3x3(in_planes, out_planes, stride=1, padding=1):
    """
    in_planes: the number of input channels
    out_planes: the number of output channels
    """
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=padding, bias=False)

# 1x1 convolution, i.e. kernel_size=1
def conv1x1(in_planes, out_planes, stride=1):
    # Why bias=False? When a convolutional layer is followed by a batch normalization layer,
    # the conv bias is unnecessary, because batch normalization already includes a bias term.
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)


# building block for ResNet50-
class BasicBlock(nn.Module):
    
    expansion=1
    
    def __init__(self, inplanes, planes, stride=1, downsample=None, norm_layer=None):
        """
        inplanes: the number of input channels
        planes: the number of output channels
        downsample: an optional module applied to the shortcut so that its shape matches the output
        norm_layer: the normalization layer to use (defaults to nn.BatchNorm2d)
        """
        super(BasicBlock, self).__init__()
        if norm_layer is None: 
            # Use the built-in Batch Normalization layer if norm_layer is not defined by hand.
            norm_layer = nn.BatchNorm2d 
        
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
       
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride
        
    
    def forward(self, x):
        identity = x
        
        # first conv layer
        out = self.conv1(x)
        out = self.bn1(out)
        
        out = self.relu(out)
        
        # second conv layer
        out = self.conv2(out)
        out = self.bn2(out)
        
        if self.downsample is not None:
            identity = self.downsample(x)  # project x so that its shape matches out
        
        out += identity # x + F(x)
        out = self.relu(out)
        
        return out


# building block for ResNet50+
class BottleNeck(nn.Module):
      
    # take ResNets for ImageNet as an example, for each bottleneck block, 64->256, 128->512,
    # 256->1024, 512->2048, the expansion size is 4.
    expansion = 4 
   
    def __init__(self, inplanes, planes, stride=1, downsample=None, norm_layer=None):
        super(BottleNeck, self).__init__()
        if norm_layer is None: 
            norm_layer = nn.BatchNorm2d 
        
        self.conv1 = conv1x1(inplanes, planes)
        self.bn1 = norm_layer(planes)
        self.conv2 = conv3x3(planes, planes, stride)
        self.bn2 = norm_layer(planes)
        self.conv3 = conv1x1(planes, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride
        
        
    def forward(self, x):
        identity = x
        
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        
        out = self.conv3(out)
        out = self.bn3(out)
        
        if self.downsample is not None:
            identity = self.downsample(x)
        
        out += identity
        out = self.relu(out)
        
        return out
    
# build ResNet for ImageNet
class ResNet(nn.Module):
    
    # for ImageNet, num_class=1000
    def __init__(self, block, layers, num_class=1000, norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None: 
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer
        
        self.inplanes = 64
        # for ImageNet, the number of input channels is 3, after the first convolutional layer,
        # there are 64 output channels.
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        
        # each layer is a stage, which contains several blocks.
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        
        self.avgpool = nn.AdaptiveAvgPool2d((1,1))
        
        self.fc = nn.Linear(512*block.expansion, num_class)
        
        # initialize parameters
        # traverse all the layers: if it is a conv layer, use Kaiming initialization;
        # if it is a normalization layer, initialize its weight to 1 and its bias to 0.
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
    
    
    # build one stage of the ResNet
    def _make_layer(self, block, planes, blocks, stride=1):
        norm_layer = self._norm_layer
        downsample = None
        
        if stride != 1 or self.inplanes != planes*block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes*block.expansion, stride),
                norm_layer(planes*block.expansion)
            )
        
        # layers=stage
        layers=[]
        layers.append(block(self.inplanes, planes, stride, downsample, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, norm_layer=norm_layer))
        return nn.Sequential(*layers) # *layers to unpack the layers
                
    
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        
        x = self.avgpool(x)
        x = torch.flatten(x, 1)  # flatten all dimensions except the batch dimension
        x = self.fc(x)
        
        return x
    


def _resnet(arch, block, layers, pretrained, progress, **kwargs):
    model = ResNet(block, layers, **kwargs)
    if pretrained: 
        state_dict = load_state_dict_from_url(model_urls[arch], progress=progress)
        model.load_state_dict(state_dict)
    return model


def resnet18(pretrained=False, progress=True, **kwargs):
    """
    progress: if True, show a progress bar while downloading the pretrained ResNet18 weights
    """
    # for ResNet18, each of the stages conv2_x, conv3_x, conv4_x and conv5_x contains 2 basic blocks
    return _resnet('resnet18', BasicBlock, [2, 2, 2, 2], pretrained, progress, **kwargs)
        


# ResNet50
def resnet50(pretrained=False, progress=True, **kwargs):   
    # for ResNet50, stage conv2_x contains 3 bottleneck blocks, conv3_x contains 4, conv4_x 6 and conv5_x 3
    return _resnet('resnet50', BottleNeck, [3, 4, 6, 3], pretrained, progress, **kwargs) 
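
As a sanity check on the code above, the following pure-Python sketch traces the channel count and spatial size through the network for a 224x224 input (the usual ImageNet crop size, assumed here), using the standard convolution output-size formula. It mirrors the strides and the `downsample` condition in `_make_layer`; the helper names `conv_out` and `trace` are made up for illustration:

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace(expansion, input_size=224):
    """Return (name, channels, spatial size, first block needs projection) per stage."""
    rows = []
    size = conv_out(input_size, kernel=7, stride=2, padding=3)
    rows.append(("conv1", 64, size, False))
    size = conv_out(size, kernel=3, stride=2, padding=1)
    rows.append(("maxpool", 64, size, False))

    inplanes = 64
    stages = [(64, 1), (128, 2), (256, 2), (512, 2)]  # (planes, stride) per stage
    for index, (planes, stride) in enumerate(stages, start=1):
        # Same condition as in _make_layer: project the shortcut when shapes differ.
        needs_projection = stride != 1 or inplanes != planes * expansion
        # Only the first block of a stage uses the stage stride (3x3 conv, padding 1).
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        inplanes = planes * expansion
        rows.append((f"layer{index}", inplanes, size, needs_projection))
    return rows


for name, channels, size, projected in trace(expansion=1):  # BasicBlock
    print(f"{name}: {channels} x {size} x {size}, projection shortcut: {projected}")
```

With `expansion=1` (basic blocks) the first stage needs no projection shortcut, whereas with `expansion=4` (bottleneck blocks) even `layer1` needs one, because the channel count grows from 64 to 256.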

In [2]:
# use a separate name so that the resnet18 function is not shadowed
model = resnet18(pretrained=True)
print(model)

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kerne

# Reference

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. "Deep Residual Learning for Image Recognition." arXiv [cs.CV]. http://arxiv.org/abs/1512.03385

[PyTorch official ResNet implementation](https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py)