## Residual Networks (ResNet)

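In principle, if we add new layers to a neural network and train the resulting model sufficiently, shouldn't it reduce the training error more effectively?
In theory, the solution space of the original model is only a subspace of the solution space of the new model. That is, if we can train the newly added layers into the identity mapping $f(x) = x$, the new model is exactly as effective as the original one. Because the new model may also find a better solution for fitting the training set, adding layers should make it easier to reduce the training error. In practice, however, once too many layers are added the training error often goes up instead of down. Even though the numerical stability provided by batch normalization makes deep models easier to train, the problem persists.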
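
The subspace argument can be written as a short inequality. The symbols are notation assumed here for illustration and do not come from the original text: $\mathcal{F}_{\text{old}}$ and $\mathcal{F}_{\text{new}}$ are the sets of functions the original and the deeper model can represent, and $L_{\text{train}}$ is the training loss.

$$
\mathcal{F}_{\text{old}} \subseteq \mathcal{F}_{\text{new}}
\quad\Longrightarrow\quad
\min_{f \in \mathcal{F}_{\text{new}}} L_{\text{train}}(f) \;\le\; \min_{f \in \mathcal{F}_{\text{old}}} L_{\text{train}}(f)
$$

The inequality only says that a solution at least as good exists in the larger space. It says nothing about whether the optimizer actually finds it, which is where the practical difficulty with very deep models comes from.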

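## Residual block
In a residual block, the input can propagate forward faster through a cross-layer data path (the skip connection).
+ ResNet follows VGG's design of using only 3\*3 convolutional layers. A residual block starts with two 3\*3 convolutional layers that have the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation. The input then skips these two convolutions and is added directly before the final ReLU activation. This design requires the output of the two convolutional layers to have the same shape as the input so that the two can be added. If the number of channels has to change, a 1\*1 convolutional layer can transform the input into the required shape before the addition.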
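
The skip connection described above can be sketched without any deep learning library. This is only an illustration under simplifying assumptions: small dense (matrix) layers stand in for the 3\*3 convolutions, batch normalization is left out, and `relu`, `linear` and `residual_block` are hypothetical helper names, not part of MXNet.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x) with F(x) = w2 * relu(w1 * x).

    The input x is added before the final ReLU, as in the description above.
    """
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + a for f, a in zip(fx, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights the residual branch F(x) vanishes,
# so the block passes a non-negative input through unchanged.
y_identity = residual_block(x, zeros, zeros)
print(y_identity)
```

In this sketch the block only has to learn the residual $F(x)$: driving the weights to zero is enough to recover the identity mapping, which a plain stack of layers would have to learn explicitly.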

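## A residual block with a configurable number of output channels, an optional extra 1\*1 convolutional layer to change the channel count, and a configurable convolution stride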

In [1]:
import sys
sys.path.append('../')

In [2]:
import gluonbook as gb
import mxnet as mx
from mxnet import nd,autograd,init
from mxnet.gluon import nn,data as gdata,loss as gloss
from mxnet import gluon


In [5]:
class Residual(nn.Block):
    def __init__(self,num_channels,use_1x1conv=False,strides=1,**kwargs):
        super(Residual,self).__init__(**kwargs)
        self.conv1 = nn.Conv2D(num_channels,kernel_size=3,strides=strides,padding=1)
        self.bn1 = nn.BatchNorm()
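        # The second conv layer's stride is fixed at 1: if both convolutions reduced
        # the spatial size, Y would no longer match the shortcut's shape at the addition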
        self.conv2 = nn.Conv2D(num_channels,kernel_size=3,strides=1,padding=1)
        self.bn2 = nn.BatchNorm()
        
        if use_1x1conv:
            self.conv3 = nn.Conv2D(num_channels,kernel_size=1,strides=strides)
        else:
            self.conv3=None
    # Original ordering, kept for reference: convolution -> batch normalization -> activation
#     def forward(self,X):
#         Y = nd.relu(self.bn1(self.conv1(X)))
#         Y = self.bn2(self.conv2(Y))
#         if self.conv3:
#             X = self.conv3(X)
#         return nd.relu(Y+X)
    # Reordered variant: batch normalization -> activation -> convolution
    def forward(self,X): 
        Y = self.conv1(nd.relu(self.bn1(X)))
        Y = self.conv2(nd.relu(self.bn2(Y)))
        if self.conv3:
            X = self.conv3(nd.relu(X))
        return Y+X

In [6]:
blk = Residual(3,use_1x1conv=True)
blk.initialize(ctx=mx.gpu())
X = nd.random.uniform(shape=(1,1,96,96),ctx=mx.gpu())
blk(X).shape

(1, 3, 96, 96)
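
The reason the second convolution keeps stride 1 can be checked with the usual output-size formula for convolutions, `floor((n + 2*padding - kernel) / stride) + 1`. The sketch below is a plain-Python illustration; `conv_out_size` is a hypothetical helper, not an MXNet function, and the "both strided" case is a made-up variant that the `Residual` class does not implement.

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution on an input of width/height n."""
    return (n + 2 * padding - kernel) // stride + 1


n = 96

# Main path of Residual(..., strides=2): conv1 uses the stride, conv2 keeps stride 1.
main = conv_out_size(conv_out_size(n, 3, stride=2, padding=1), 3, stride=1, padding=1)

# Shortcut path: the 1x1 convolution uses the same stride and no padding.
shortcut = conv_out_size(n, 1, stride=2)

# Hypothetical variant in which both 3x3 convolutions use stride 2.
both_strided = conv_out_size(conv_out_size(n, 3, stride=2, padding=1), 3, stride=2, padding=1)

print(main, shortcut, both_strided)  # prints: 48 48 24
```

With a single strided convolution the main path and the shortcut both end at 48, so they can be added. Striding twice would leave the main path at 24 against a shortcut of 48.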

## The ResNet model

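* As in GoogLeNet, the first two layers are a 7\*7 convolutional layer with 64 output channels and stride 2, followed by a 3\*3 max pooling layer with stride 2. The difference is that ResNet adds a batch normalization layer after each convolutional layer. Note that the code below uses 32 output channels rather than 64.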

In [7]:
ResNet = nn.Sequential()

In [8]:
ResNet.add(nn.Conv2D(channels=32,kernel_size=7,strides=2,padding=3),
           nn.BatchNorm(),nn.Activation('relu'),
           nn.MaxPool2D(pool_size=3,strides=2,padding=1)
            )

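* GoogLeNet follows these layers with four modules built from Inception blocks. ResNet instead uses four modules built from residual blocks, and every residual block within a module has the same number of output channels. The code below builds only three such modules, with 32, 64 and 128 output channels.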

In [9]:
def resnet_block(num_channels,num_residuals,first_block=False):
    blk  = nn.Sequential()
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.add(Residual(num_channels))
    return blk
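
To make the branching in `resnet_block` easier to see, the sketch below mirrors its loop in plain Python and returns the settings each residual block would receive instead of building MXNet layers. `block_plan` is a hypothetical helper introduced only for this illustration.

```python
def block_plan(num_channels, num_residuals, first_block=False):
    """Return (num_channels, use_1x1conv, strides) for each block in a module."""
    plan = []
    for i in range(num_residuals):
        # Only the first block of a module downsamples, and never in the first module.
        downsample = (i == 0 and not first_block)
        plan.append((num_channels, downsample, 2 if downsample else 1))
    return plan


for args in [(32, 2, True), (64, 2, False), (128, 2, False)]:
    print(block_plan(*args))
```

The first module keeps the resolution because the max pooling layer before it has already reduced the height and width. Each later module halves the resolution in its first block, which is also where the 1\*1 convolution on the shortcut is needed to match the new shape.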

In [10]:
ResNet.add(resnet_block(32,2,first_block=True),
           resnet_block(64,2),
           resnet_block(128,2),
            )

* Finally, as in GoogLeNet, a global average pooling layer is added, followed by a fully connected output layer.

In [11]:
ResNet.add(nn.GlobalAvgPool2D(),nn.Dense(10))

In [12]:
X = nd.random.uniform(shape=(1,1,224,224),ctx=mx.gpu())
ResNet.initialize(ctx=mx.gpu(),force_reinit=True)
cnt=0
for layer in ResNet:
    try:
        cnt +=1
        X = layer(X)
        print(cnt,layer.name,'output shape\t',X.shape)
    except mx.base.MXNetError as e:
        print('Error!\t',layer.name)
        print(X.shape)
        print(e)
        break

1 conv6 output shape	 (1, 32, 112, 112)
2 batchnorm4 output shape	 (1, 32, 112, 112)
3 relu0 output shape	 (1, 32, 112, 112)
4 pool0 output shape	 (1, 32, 56, 56)
5 sequential1 output shape	 (1, 32, 56, 56)
6 sequential2 output shape	 (1, 64, 28, 28)
7 sequential3 output shape	 (1, 128, 14, 14)
8 pool1 output shape	 (1, 128, 1, 1)
9 dense0 output shape	 (1, 10)
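
The heights and widths printed above can be reproduced by hand. The sketch below is a simplified plain-Python trace, assuming that each residual module can be represented by its first 3\*3 convolution (the only layer in the module whose stride differs from 1); `out_size` and `stages` are names made up for this illustration.

```python
def out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# (description, kernel, stride, padding) for each stage that can change height/width
stages = [
    ("7x7 conv", 7, 2, 3),
    ("3x3 max pool", 3, 2, 1),
    ("module 1, stride 1", 3, 1, 1),
    ("module 2, stride 2", 3, 2, 1),
    ("module 3, stride 2", 3, 2, 1),
]

size = 224
trace = [size]
for name, kernel, stride, padding in stages:
    size = out_size(size, kernel, stride, padding)
    trace.append(size)
    print(f"{name:20s} -> {size} x {size}")
```

The resulting sequence 224, 112, 56, 56, 28, 14 agrees with the shapes in the printed output before the global average pooling layer reduces the feature map to 1 x 1.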


## Load the data and train the model

In [13]:
lr,num_epochs,batch_size,ctx = 0.1,10,256,gb.try_gpu()
trainer = gluon.Trainer(ResNet.collect_params(),'sgd',{'learning_rate':lr})
ResNet.initialize(init=init.Xavier(),ctx=ctx,force_reinit=True)

In [14]:
train_iter,test_iter = gb.load_data_fashion_mnist(batch_size,resize=96)

In [15]:
gb.train_ch5(ResNet,train_iter,test_iter,batch_size,trainer,ctx,num_epochs)

training on gpu(0)
epoch 1, loss 1.1860, train acc 0.692, test acc 0.841, time 41.7 sec
epoch 2, loss 0.3766, train acc 0.860, test acc 0.881, time 41.5 sec
epoch 3, loss 0.2996, train acc 0.889, test acc 0.898, time 42.1 sec
epoch 4, loss 0.2597, train acc 0.905, test acc 0.904, time 41.4 sec
epoch 5, loss 0.2302, train acc 0.916, test acc 0.905, time 41.1 sec
epoch 6, loss 0.2049, train acc 0.925, test acc 0.911, time 40.9 sec
epoch 7, loss 0.1858, train acc 0.932, test acc 0.912, time 40.8 sec
epoch 8, loss 0.1667, train acc 0.939, test acc 0.913, time 41.1 sec
epoch 9, loss 0.1518, train acc 0.944, test acc 0.917, time 42.3 sec
epoch 10, loss 0.1376, train acc 0.949, test acc 0.908, time 41.3 sec


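## ResNet converges quickly: test accuracy reaches 0.841 after the first epoch and stays between roughly 0.90 and 0.92 from the third epoch on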

## Training results with convolution + batch normalization + activation

* training on gpu(0)
* epoch 1, loss 0.5569, train acc 0.796, test acc 0.862, time 41.9 sec
* epoch 2, loss 0.3092, train acc 0.886, test acc 0.900, time 41.0 sec
* epoch 3, loss 0.2519, train acc 0.908, test acc 0.886, time 41.2 sec
* epoch 4, loss 0.2204, train acc 0.919, test acc 0.884, time 42.1 sec
* epoch 5, loss 0.1946, train acc 0.929, test acc 0.901, time 42.0 sec
* epoch 6, loss 0.1743, train acc 0.937, test acc 0.896, time 41.1 sec
* epoch 7, loss 0.1541, train acc 0.944, test acc 0.912, time 40.3 sec
* epoch 8, loss 0.1375, train acc 0.950, test acc 0.917, time 40.9 sec
* epoch 9, loss 0.1216, train acc 0.955, test acc 0.904, time 40.6 sec
* epoch 10, loss 0.1071, train acc 0.961, test acc 0.893, time 41.5 sec
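
A quick way to compare the two training logs is to look at the best test accuracy and at the gap between training and test accuracy in the last epoch. The sketch below copies the numbers from the logs above; `summarize` is a hypothetical helper, and "first run" and "second run" simply refer to the `In [15]` output and to the bulleted log, in that order.

```python
# Test accuracy per epoch, copied from the two logs.
first_run_test = [0.841, 0.881, 0.898, 0.904, 0.905, 0.911, 0.912, 0.913, 0.917, 0.908]
second_run_test = [0.862, 0.900, 0.886, 0.884, 0.901, 0.896, 0.912, 0.917, 0.904, 0.893]

# Training accuracy in the final epoch of each log.
first_run_train_final = 0.949
second_run_train_final = 0.961


def summarize(test_acc, train_final):
    """Best test accuracy, the epoch it occurred in, and the final train/test gap."""
    best = max(test_acc)
    return {
        "best_test_acc": best,
        "best_epoch": test_acc.index(best) + 1,
        "final_gap": round(train_final - test_acc[-1], 3),
    }


first = summarize(first_run_test, first_run_train_final)
second = summarize(second_run_test, second_run_train_final)
print(first)
print(second)
```

On these numbers both runs peak at the same test accuracy (0.917), while the first run ends with a smaller gap between training and test accuracy (0.041 versus 0.068). A single run per variant is not enough to draw a firm conclusion.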

## Training results with batch normalization + activation + convolution

In the paper, the main purpose of this reordering is to improve the model so that it is easier to train.