# ResNet

[CVPR 2016 Best Paper [Deep Residual Learning for Image Recognition]](https://arxiv.org/pdf/1512.03385.pdf)

## 主要贡献
从AlexNet到GoogleNet、VGG，神经网络越变越深；效果也逐渐变好。理论而言神经网络越深，模型复杂度越高，模型能力越强。但是实验发现事实不是如此，越深的网络越容易退化（并不是过拟合或者梯度消失、爆炸）。

ResNet：提出了残差连接(residual shortcut)，希望加深网络不会使原来的浅层网络退化，起码要保持和浅层网络持平（加深的几层训练成恒等变换）。

如下图，假设原始输入为$ x $,希望经过block后得到的理想输出为 $f(x)$, 普通block直接去拟合$ f(x) $ ,residual block拟合残差 $f(x) - x$。 加入残差块加权运算的权重和偏置设为了0，那么$ x : f(x)$的映射就是一种恒等映射。

<div align = center> <img src ='./img/residual-block.svg'></img></div>

## 网络结构

ResNet沿用了VGG完整的 $3 \times 3$ 卷积层设计。 残差块里首先有2个有相同输出通道数的卷积层。 每个卷积层后接一个BN层和ReLU激活函数。 然后我们通过跨层数据通路，跳过这2个卷积运算，将输入直接加在最后的ReLU激活函数前。 这样的设计要求2个卷积层的输出与输入形状一样，从而使它们可以相加。 如果想改变通道数，就需要引入一个额外的 $1 \times 1$
卷积层来将输入变换成需要的形状后在做相加。

<div align = center> <img src ='./img/resnet-block.svg'></img></div>


<div align = center> <img src ='./img/DiffResnet.png'></img></div>


**残差带来的一些问题**：模型参数量虽然不大，但非常占用显存。比起普通的卷积层堆叠(VGG,单路堆叠)，计算完一层输出后，就会释放掉上一层的输出，显存中只保存一层的结果。但是如果有残差连接，需要保存两个OutPut，相加完成后才会释放，因此虽然参数量小，但显存占用大。

In [60]:
# Residual Block实现
import torch
from torch import nn
import torch.nn.functional as F

class Residual(nn.Module):
    def __init__(self, in_channels, out_channels,strides = 1):
        super(Residual,self).__init__()
        self.block = nn.Sequential(nn.Conv2d(in_channels,out_channels,kernel_size = 3, padding = 1,stride = strides,bias=False),
                                   nn.ReLU(),
                                   nn.BatchNorm2d(out_channels),
                                   nn.Conv2d(out_channels,out_channels,kernel_size = 3, padding = 1,bias=False),
                                   nn.BatchNorm2d(out_channels),
                                   nn.ReLU())
        # 1*1卷积 (需要调整通道或者分辨率时)
        if(in_channels != out_channels or strides!=1 ):
            self.conv1_1 = nn.Conv2d(in_channels,out_channels,kernel_size= 1, stride = strides)
        else:
            self.conv1_1 = None
        
    def forward(self, x):
        y = self.block(x)
        if self.conv1_1:
            x = self.conv1_1(x)
        y +=x
        return y

In [61]:
blk = Residual(in_channels = 3, out_channels = 3)
blk2 = Residual(3,6,2)
X = torch.rand(4, 3, 6, 6)
Y = blk(X)
Y2 = blk2(X)
Y.shape,Y2.shape

(torch.Size([4, 3, 6, 6]), torch.Size([4, 6, 3, 3]))

# ResNet18
ResNet18也先用了一个 $7 \times 7$ 输出通道 $64$ 的卷积，后接ReLU和 $3 \times 3 $ 步长为 $ 2 $ 的最大池化层。

ResNet使用4个由残差块组成的模块，每个模块使用若干个同样输出通道数的残差块（ResNet18 每模块2个残差块）。 第一个模块的通道数同输入通道数一致。 由于之前已经使用了步幅为2的最大汇聚层，所以无须减小高和宽。 之后的每个模块的第一个残差块里将上一个模块的通道数翻倍，并将高和宽减半。ResNet在每个卷积层后都有BN。

最后接全局平均池化，接全连接层。


<div align = center> <img src ='./img/resnet18.svg'></img></div>

# ResNet50及以上

ResNet50每个模块处理方式与ResNet18不同，由于模型层数深，因此ResNet提出了Bottle neck结构：残差Block先使用$ 1 \times 1$ 卷积进行降维度（通道数），然后使用$3 \times 3$ 卷积减半分辨率，最后再通过 $1 \times 1$ 卷积升维度（通道数）。这样做能够减少参数量。


<div align = center> <img src ='./img/bottle_neck.png'></img></div>


In [79]:
# ResNet18 
# bias 全设为False，因为后接了BN层
from torchsummary import summary

def resnet_block(in_channels,out_channels,num_residuals,isFirst = False):
    blk = []
    if isFirst:
        for i in range(num_residuals):
            blk.append(Residual(in_channels,out_channels))
    else:
        for i in range(num_residuals):
            if i == 0:
                # 通道数乘二倍，分辨率减半
                blk.append(Residual(in_channels,out_channels,strides = 2))
            else:
                blk.append(Residual(out_channels,out_channels))
    return nn.Sequential(*blk)

class ResNet18(nn.Module):
    def __init__(self, in_channels,num_classes):
        super(ResNet18,self).__init__()
        # (n,c,224,224) -> (n,64,112,112)
        #               -> (n,64,56,56)
        self.block1 = nn.Sequential(nn.Conv2d(in_channels,64,kernel_size=7,padding =3,stride =2,bias= False),
                                    nn.BatchNorm2d(64),
                                    nn.ReLU(),
                                    nn.MaxPool2d(kernel_size= 3, padding=1, stride=2))
        # 第一个残差块通道数、分辨率不变 (n,64,56,56)
        self.resblock1 = resnet_block(64,64,2,isFirst=True)
        # (n,128,28,28)
        self.resblock2 = resnet_block(64,128,2)
        # (n,256,14,14)
        self.resblock3 = resnet_block(128,256,2)
        # (n,512,7,7)
        self.resblock4 = resnet_block(256,512,2)
        # 全局平均池化
        # (n,512,1,1)
        self.pool = nn.AdaptiveAvgPool2d((1,1))
        # (全连接)
        # (n,num_classes)
        self.fc = nn.Sequential(nn.Flatten(),nn.Linear(512,num_classes))

    def forward(self, x):
        x = self.block1(x)
        x = self.resblock1(x)
        x = self.resblock2(x)
        x = self.resblock3(x)
        x = self.resblock4(x)
        x = self.pool(x)
        x = self.fc(x)
        return x

# BottleNeck 结构
class BottleNeck(nn.Module):
    def __init__(self, in_channels, out_channels,strides = 1):
        super(BottleNeck,self).__init__()
        mid_channels = int(in_channels/2)
        # 降维 - 减半 - 升维
        self.block = nn.Sequential(nn.Conv2d(in_channels,mid_channels,kernel_size = 1,bias=False),
                                   nn.BatchNorm2d(mid_channels),
                                   nn.ReLU(),
                                   nn.Conv2d(mid_channels,mid_channels,kernel_size = 3, padding = 1,stride = strides,bias=False),
                                   nn.BatchNorm2d(mid_channels),
                                   nn.ReLU(),
                                   nn.Conv2d(mid_channels,out_channels,kernel_size = 1,bias=False),
                                   nn.BatchNorm2d(out_channels),
                                   nn.ReLU())
        # 1*1卷积 (需要调整通道或者分辨率时)
        if(in_channels != out_channels or strides!=1 ):
            self.conv1_1 = nn.Conv2d(in_channels,out_channels,kernel_size= 1, stride = strides)
        else:
            self.conv1_1 = None
        
    def forward(self, x):
        y = self.block(x)
        if self.conv1_1:
            x = self.conv1_1(x)
        y +=x
        return y

In [80]:
net = ResNet18(3,10)

summary(net,(3,224,224),device="cpu")

bottle_neck = BottleNeck(128,256,2)
X = torch.rand((1,128,56,56),dtype = torch.float32)
Y = bottle_neck(X)
Y.shape


----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]          36,864
              ReLU-6           [-1, 64, 56, 56]               0
       BatchNorm2d-7           [-1, 64, 56, 56]             128
            Conv2d-8           [-1, 64, 56, 56]          36,864
       BatchNorm2d-9           [-1, 64, 56, 56]             128
             ReLU-10           [-1, 64, 56, 56]               0
         Residual-11           [-1, 64, 56, 56]               0
           Conv2d-12           [-1, 64, 56, 56]          36,864
             ReLU-13           [-1, 64, 56, 56]               0
      BatchNorm2d-14           [-1, 64,

torch.Size([1, 256, 28, 28])

这里我加载了官方预训练模型的参数，发现ResNet18卷积层都没有Bias参数，这是为什么呢？

因为BN求均值，完全无视卷积的bias效果，不管bias如何平移数据，BN均值都会将数据中心归零，都会是一个结果，所以卷积的bias无用，可以不加。而BN中beta的作用，就是卷积层中的bias的作用。可以理解为BN相当于是在卷积的卷积和bias两个环节中间，加了均值，方差，gamma三个环节把卷积的bias，变成了自己的beta。

In [67]:
# pytorch的预训练模型
resnet18 = torch.load('./resnet18-5c106cde.pth')
# for weight in resnet18:
#     print(weight,resnet18[weight].data.shape)
print(type(resnet18))

params = net.state_dict()
for weight1,weight2 in zip(params,resnet18):
    print(weight1,'         ',weight2)
# print(net.state_dict())

<class 'collections.OrderedDict'>
block1.0.weight           conv1.weight
block1.1.weight           bn1.running_mean
block1.1.bias           bn1.running_var
block1.1.running_mean           bn1.weight
block1.1.running_var           bn1.bias
block1.1.num_batches_tracked           layer1.0.conv1.weight
resblock1.0.block.0.weight           layer1.0.bn1.running_mean
resblock1.0.block.2.weight           layer1.0.bn1.running_var
resblock1.0.block.2.bias           layer1.0.bn1.weight
resblock1.0.block.2.running_mean           layer1.0.bn1.bias
resblock1.0.block.2.running_var           layer1.0.conv2.weight
resblock1.0.block.2.num_batches_tracked           layer1.0.bn2.running_mean
resblock1.0.block.3.weight           layer1.0.bn2.running_var
resblock1.0.block.4.weight           layer1.0.bn2.weight
resblock1.0.block.4.bias           layer1.0.bn2.bias
resblock1.0.block.4.running_mean           layer1.1.conv1.weight
resblock1.0.block.4.running_var           layer1.1.bn1.running_mean
resblock1.0.bl

## Residual 改进
[论文 Identity Mappings in Deep Residual Networks](https://arxiv.org/pdf/1603.05027.pdf)
<div align = center> <img src ='./img/residual_proposed.png'></img></div>

后续何恺明团队对Residual Block提出了改进，如上图所示。