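## Network in Network (NiN)
LeNet, AlexNet, and VGG, introduced in the previous sections, share a common design: a module built from convolutional layers first extracts spatial features, and a module built from fully connected layers then outputs the classification result. The improvements of AlexNet and VGG over LeNet lie mainly in how these two modules are widened (more channels) and deepened. This section introduces Network in Network (NiN) [1]. It proposes a different idea: **build a deep network by chaining multiple small networks, each made of a convolutional layer and "fully connected" layers**.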

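### NiN Block
The input and output of a convolutional layer are usually four-dimensional arrays (sample, channel, height, width), whereas the input and output of a fully connected layer are usually two-dimensional arrays (sample, feature). To place a convolutional layer after a fully connected layer, the output of the fully connected layer would first have to be reshaped to four dimensions. Recall the $1\times 1$ convolutional layer introduced in Section 5.3 (multiple input and output channels). It can be **viewed as a fully connected layer** in which each element in the spatial dimensions (height and width) acts as a sample and **each channel acts as a feature**. NiN therefore uses $1\times 1$ **convolutional layers in place of fully connected layers**, so that spatial information passes naturally to the subsequent layers. The following shows the main structural difference between NiN and networks such as AlexNet and VGG.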

<center>
    AlexNet and VGG: convolutional layer -> convolutional layer -> fully connected layer -> fully connected layer
</center>

<center>
    NiN: convolutional layer -> $1\times 1$ convolutional layer -> convolutional layer -> $1\times 1$ convolutional layer
</center>
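
The claim that a $1\times 1$ convolution behaves like a fully connected layer applied at every pixel can be checked numerically. The following is a minimal sketch, not part of the original notebook: the channel counts, input size, and variable names are chosen only for illustration. It copies the weights of a $1\times 1$ convolution into an `nn.Linear` layer and compares the two outputs.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the NiN model).
in_channels, out_channels = 3, 5
conv1x1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)
fc = nn.Linear(in_channels, out_channels)

# Give the linear layer the same parameters as the convolution.
# Conv weight shape: (out, in, 1, 1); linear weight shape: (out, in).
with torch.no_grad():
    fc.weight.copy_(conv1x1.weight.view(out_channels, in_channels))
    fc.bias.copy_(conv1x1.bias)

X = torch.rand(2, in_channels, 4, 4)  # (sample, channel, height, width)

Y_conv = conv1x1(X)

# Treat every pixel as a sample and the channels as its features:
# move channels last, apply the linear layer, then move channels back.
Y_fc = fc(X.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

same = torch.allclose(Y_conv, Y_fc, atol=1e-5)
print(tuple(Y_conv.shape), same)
```

Because the linear layer only ever sees one pixel at a time, the height and width of the input survive unchanged, which is what lets further convolutional layers follow.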

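The NiN block is the basic building block of NiN. It chains one convolutional layer with two $1\times 1$ convolutional layers that act as fully connected layers. **The hyperparameters of the first convolutional layer can be set freely**, while those of the second and third convolutional layers are generally fixed.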

In [1]:
import time
import torch
from torch import nn, optim
import torch.nn.functional as F

import sys
sys.path.append("..")
import d2lzh_pytorch as d2l
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

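def nin_block(in_channels, out_channels, kernel_size, stride, padding):
    # One configurable convolution followed by two 1x1 convolutions that act
    # as per-pixel fully connected layers; each layer is followed by a ReLU.
    blk = nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, 1),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, 1),
        nn.ReLU()
    )
    return blk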
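
As a quick sanity check, the sketch below passes a random input through a single NiN block and records the shape after every layer. It repeats the block definition so that it runs on its own; the input size of $224\times 224$ and the first-block hyperparameters are the ones used later in this section, while the loop itself is only an illustration.

```python
import torch
from torch import nn

def nin_block(in_channels, out_channels, kernel_size, stride, padding):
    # Same structure as the block in this section, repeated so the
    # example is self-contained.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, 1),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, 1),
        nn.ReLU(),
    )

blk = nin_block(1, 96, kernel_size=11, stride=4, padding=0)
X = torch.rand(1, 1, 224, 224)

shapes = []
for layer in blk:
    X = layer(X)
    shapes.append(tuple(X.shape))
    print(layer.__class__.__name__, 'output shape:', tuple(X.shape))
```

Only the first convolution changes the spatial size and the number of channels; the two $1\times 1$ convolutions mix channels at each position and leave the shape as it is.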

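### NiN Model
NiN was proposed shortly after AlexNet, and their convolutional layer settings are similar. NiN uses convolutional layers with window shapes of $11\times 11$, $5\times 5$, and $3\times 3$, and the corresponding numbers of output channels match those in AlexNet. Each of these NiN blocks is followed by a max pooling layer with a stride of $2$ and a window shape of $3\times 3$.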

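Besides using NiN blocks, NiN differs from AlexNet in one more significant way: NiN removes the last $3$ fully connected layers of AlexNet. In their place, NiN uses a NiN block whose number of output channels equals the number of label classes, followed by a **global average pooling layer** that averages all elements in each channel and uses the result directly for classification. Here, a global average pooling layer is an average pooling layer whose window shape equals the spatial shape of its input. The benefit of this design is that it **significantly reduces the number of model parameters**, which mitigates overfitting. However, this design **can sometimes increase the training time needed to obtain an effective model**.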
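
To get a feel for the size of the saving, the sketch below counts parameters for two classification heads. The fully connected head ($256\times 5\times 5 \to 4096 \to 4096 \to 10$) is an assumed AlexNet-style head used only for comparison; its sizes do not come from this section. The NiN head is a NiN block with 384 input channels, 10 output channels, and a $3\times 3$ first convolution, matching the last block of the model in this section. Global average pooling adds no parameters.

```python
def linear_params(in_features, out_features):
    # Weights plus biases of a fully connected layer.
    return in_features * out_features + out_features

def conv_params(in_channels, out_channels, kernel_size):
    # Weights plus biases of a square 2D convolution.
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

# Assumed AlexNet-style head (hypothetical sizes, for comparison only).
fc_head = (linear_params(256 * 5 * 5, 4096)
           + linear_params(4096, 4096)
           + linear_params(4096, 10))

# NiN head: one 3x3 convolution and two 1x1 convolutions, 10 output channels.
nin_head = (conv_params(384, 10, 3)
            + conv_params(10, 10, 1)
            + conv_params(10, 10, 1))

print('fully connected head:', fc_head)   # 43040778
print('NiN head:', nin_head)              # 34790
print('ratio: %.0f' % (fc_head / nin_head))
```

Under these assumptions the fully connected head carries on the order of a thousand times more parameters than the NiN head.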

In [2]:
class GlobalAvgPool2d(nn.Module):
    def __init__(self):
        super(GlobalAvgPool2d, self).__init__()
    def forward(self, x):
        return F.avg_pool2d(x, kernel_size=x.size()[2:])

net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, stride=4, padding=0),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nin_block(96, 256, kernel_size=5, stride=1, padding=2),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nin_block(256, 384, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Dropout(0.5),
    
    nin_block(384, 10, kernel_size=3, stride=1, padding=1),
    GlobalAvgPool2d(),
    
    d2l.FlattenLayer()
)
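
Global average pooling, as described above, is simply the mean over the height and width of each channel. The following sketch compares three ways of computing it on a random tensor: average pooling with a window equal to the input's spatial shape (the approach used in this section), a plain mean over the spatial dimensions, and PyTorch's built-in `nn.AdaptiveAvgPool2d(1)`. The tensor shape is chosen to match the output of the last NiN block; everything else is illustrative.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.rand(2, 10, 5, 5)  # (sample, channel, height, width)

# Average pooling whose window covers the whole feature map.
gap_window = F.avg_pool2d(X, kernel_size=X.size()[2:])

# Mean over the height and width dimensions, keeping them as size 1.
gap_mean = X.mean(dim=(2, 3), keepdim=True)

# Built-in adaptive pooling with a 1x1 output.
gap_adaptive = nn.AdaptiveAvgPool2d(1)(X)

print(tuple(gap_window.shape))
print(torch.allclose(gap_window, gap_mean, atol=1e-6))
print(torch.allclose(gap_window, gap_adaptive, atol=1e-6))

# Flattening the two trailing dimensions gives one score per class.
scores = gap_window.flatten(start_dim=1)
print(tuple(scores.shape))
```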

In [3]:
X = torch.rand(1, 1, 224, 224)
for name, blk in net.named_children():
    X = blk(X)
    print(name, 'output shape:', X.shape)

0 output shape: torch.Size([1, 96, 54, 54])
1 output shape: torch.Size([1, 96, 26, 26])
2 output shape: torch.Size([1, 256, 26, 26])
3 output shape: torch.Size([1, 256, 12, 12])
4 output shape: torch.Size([1, 384, 12, 12])
5 output shape: torch.Size([1, 384, 5, 5])
6 output shape: torch.Size([1, 384, 5, 5])
7 output shape: torch.Size([1, 10, 5, 5])
8 output shape: torch.Size([1, 10, 1, 1])
9 output shape: torch.Size([1, 10])
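
The spatial sizes printed above follow from the usual output-size rule for convolution and pooling windows, $\lfloor (n + 2p - k)/s \rfloor + 1$ for input size $n$, window size $k$, padding $p$, and stride $s$. The sketch below applies that rule to the layer settings of this model; the helper name `out_size` is introduced here for illustration, and the dropout layer is omitted because it does not change the shape.

```python
def out_size(n, kernel_size, stride=1, padding=0):
    # Output height/width of a convolution or pooling window.
    return (n + 2 * padding - kernel_size) // stride + 1

# (description, kernel_size, stride, padding), in the order used by the model.
layers = [
    ('NiN block, 11x11, stride 4', 11, 4, 0),
    ('max pool, 3x3, stride 2', 3, 2, 0),
    ('NiN block, 5x5, padding 2', 5, 1, 2),
    ('max pool, 3x3, stride 2', 3, 2, 0),
    ('NiN block, 3x3, padding 1', 3, 1, 1),
    ('max pool, 3x3, stride 2', 3, 2, 0),
    ('NiN block, 3x3, padding 1', 3, 1, 1),
]

n = 224
sizes = []
for name, k, s, p in layers:
    n = out_size(n, k, s, p)
    sizes.append(n)
    print('%-28s -> %d x %d' % (name, n, n))
```

The resulting sizes, 54, 26, 26, 12, 12, 5, and 5, agree with the shapes printed by the network.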


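### Getting the Data and Training the Model
We again train the model on the Fashion-MNIST dataset. Training NiN is similar to training AlexNet and VGG, but a larger learning rate is used here.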
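
For orientation, the sketch below shows what a few optimization steps with Adam and a running training accuracy might look like. It is not the implementation of `d2l.train_ch5`, whose internals are not shown in this section. The tiny model and the random tensors standing in for Fashion-MNIST batches are hypothetical and exist only to keep the example fast and self-contained.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny stand-in classifier (hypothetical, NOT the NiN model of this section).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 10, kernel_size=1),  # 10 channels, one per class
    nn.AdaptiveAvgPool2d(1),          # global average pooling
    nn.Flatten(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
loss_fn = nn.CrossEntropyLoss()

# Random data standing in for real (image, label) batches.
batches = [(torch.rand(16, 1, 28, 28), torch.randint(0, 10, (16,)))
           for _ in range(3)]

correct, seen = 0, 0
history = []
for step, (X, y) in enumerate(batches, start=1):
    y_hat = model(X)
    loss = loss_fn(y_hat, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Running accuracy over all samples seen so far.
    correct += (y_hat.argmax(dim=1) == y).sum().item()
    seen += y.shape[0]
    history.append(correct / seen)
    print('step %d, train_acc: %.4f' % (step, correct / seen))
```

With random labels the accuracy stays near chance; the point is only the order of operations in each step.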

In [4]:
batch_size = 128
# Resize the images to 224x224, the input size used in the shape check above.
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)

lr, epochs = 0.002, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, epochs)

training on: cpu
step 1, train_acc: 0.0703
step 2, train_acc: 0.0977
step 3, train_acc: 0.0938
step 4, train_acc: 0.0977
step 5, train_acc: 0.1016
step 6, train_acc: 0.1029
step 7, train_acc: 0.1038
step 8, train_acc: 0.0977
step 9, train_acc: 0.0990
step 10, train_acc: 0.0945
step 11, train_acc: 0.0945
step 12, train_acc: 0.0964
step 13, train_acc: 0.0956
step 14, train_acc: 0.0926
step 15, train_acc: 0.0943
step 16, train_acc: 0.0957
step 17, train_acc: 0.0947
step 18, train_acc: 0.0951
step 19, train_acc: 0.0950
step 20, train_acc: 0.0934
step 21, train_acc: 0.0938
step 22, train_acc: 0.0966
step 23, train_acc: 0.0985
step 24, train_acc: 0.0986
step 25, train_acc: 0.0959
step 26, train_acc: 0.0947
step 27, train_acc: 0.0929
step 28, train_acc: 0.0943
step 29, train_acc: 0.0940
step 30, train_acc: 0.0951
step 31, train_acc: 0.0948
step 32, train_acc: 0.0945
step 33, train_acc: 0.0938
step 34, train_acc: 0.0933
step 35, train_acc: 0.0924
step 36, train_acc: 0.0911
step 37, train_acc: 

step 297, train_acc: 0.3024
step 298, train_acc: 0.3037
step 299, train_acc: 0.3047
step 300, train_acc: 0.3061
step 301, train_acc: 0.3072
step 302, train_acc: 0.3082
step 303, train_acc: 0.3095
step 304, train_acc: 0.3105
step 305, train_acc: 0.3117
step 306, train_acc: 0.3131
step 307, train_acc: 0.3141
step 308, train_acc: 0.3154
step 309, train_acc: 0.3165
step 310, train_acc: 0.3177
step 311, train_acc: 0.3188
step 312, train_acc: 0.3200
step 313, train_acc: 0.3213
step 314, train_acc: 0.3224
step 315, train_acc: 0.3235
step 316, train_acc: 0.3247
step 317, train_acc: 0.3259
step 318, train_acc: 0.3269
step 319, train_acc: 0.3279
step 320, train_acc: 0.3290
step 321, train_acc: 0.3299
step 322, train_acc: 0.3309
step 323, train_acc: 0.3319
step 324, train_acc: 0.3330
step 325, train_acc: 0.3341
step 326, train_acc: 0.3352
step 327, train_acc: 0.3363
step 328, train_acc: 0.3374
step 329, train_acc: 0.3385
step 330, train_acc: 0.3395
step 331, train_acc: 0.3406
step 332, train_acc:

step 122, train_acc: 0.7525
step 123, train_acc: 0.7521
step 124, train_acc: 0.7521
step 125, train_acc: 0.7523
step 126, train_acc: 0.7528
step 127, train_acc: 0.7528
step 128, train_acc: 0.7533
step 129, train_acc: 0.7535
step 130, train_acc: 0.7532
step 131, train_acc: 0.7533
step 132, train_acc: 0.7531
step 133, train_acc: 0.7529
step 134, train_acc: 0.7533
step 135, train_acc: 0.7538
step 136, train_acc: 0.7537
step 137, train_acc: 0.7540
step 138, train_acc: 0.7537
step 139, train_acc: 0.7537
step 140, train_acc: 0.7541
step 141, train_acc: 0.7544
step 142, train_acc: 0.7548
step 143, train_acc: 0.7548
step 144, train_acc: 0.7544
step 145, train_acc: 0.7546
step 146, train_acc: 0.7545
step 147, train_acc: 0.7547
step 148, train_acc: 0.7547
step 149, train_acc: 0.7547
step 150, train_acc: 0.7546
step 151, train_acc: 0.7548
step 152, train_acc: 0.7548
step 153, train_acc: 0.7551
step 154, train_acc: 0.7553
step 155, train_acc: 0.7551
step 156, train_acc: 0.7556
step 157, train_acc:

step 415, train_acc: 0.7737
step 416, train_acc: 0.7736
step 417, train_acc: 0.7737
step 418, train_acc: 0.7738
step 419, train_acc: 0.7739
step 420, train_acc: 0.7739
step 421, train_acc: 0.7741
step 422, train_acc: 0.7741
step 423, train_acc: 0.7741
step 424, train_acc: 0.7742
step 425, train_acc: 0.7744
step 426, train_acc: 0.7745
step 427, train_acc: 0.7746
step 428, train_acc: 0.7748
step 429, train_acc: 0.7749
step 430, train_acc: 0.7749
step 431, train_acc: 0.7749
step 432, train_acc: 0.7750
step 433, train_acc: 0.7750
step 434, train_acc: 0.7750
step 435, train_acc: 0.7750
step 436, train_acc: 0.7750
step 437, train_acc: 0.7754
step 438, train_acc: 0.7755
step 439, train_acc: 0.7756
step 440, train_acc: 0.7757
step 441, train_acc: 0.7755
step 442, train_acc: 0.7756
step 443, train_acc: 0.7757
step 444, train_acc: 0.7757
step 445, train_acc: 0.7758
step 446, train_acc: 0.7759
step 447, train_acc: 0.7761
step 448, train_acc: 0.7763
step 449, train_acc: 0.7764
step 450, train_acc:

step 240, train_acc: 0.8071
step 241, train_acc: 0.8070
step 242, train_acc: 0.8070
step 243, train_acc: 0.8070
step 244, train_acc: 0.8069
step 245, train_acc: 0.8070
step 246, train_acc: 0.8070
step 247, train_acc: 0.8069
step 248, train_acc: 0.8068
step 249, train_acc: 0.8069
step 250, train_acc: 0.8071
step 251, train_acc: 0.8066
step 252, train_acc: 0.8066
step 253, train_acc: 0.8064
step 254, train_acc: 0.8067
step 255, train_acc: 0.8068
step 256, train_acc: 0.8069
step 257, train_acc: 0.8069
step 258, train_acc: 0.8068
step 259, train_acc: 0.8070
step 260, train_acc: 0.8073
step 261, train_acc: 0.8073
step 262, train_acc: 0.8073
step 263, train_acc: 0.8074
step 264, train_acc: 0.8073
step 265, train_acc: 0.8072
step 266, train_acc: 0.8074
step 267, train_acc: 0.8074
step 268, train_acc: 0.8075
step 269, train_acc: 0.8075
step 270, train_acc: 0.8076
step 271, train_acc: 0.8075
step 272, train_acc: 0.8076
step 273, train_acc: 0.8075
step 274, train_acc: 0.8074
step 275, train_acc:

step 64, train_acc: 0.8236
step 65, train_acc: 0.8244
step 66, train_acc: 0.8252
step 67, train_acc: 0.8253
step 68, train_acc: 0.8248
step 69, train_acc: 0.8256
step 70, train_acc: 0.8265
step 71, train_acc: 0.8263
step 72, train_acc: 0.8261
step 73, train_acc: 0.8265
step 74, train_acc: 0.8261
step 75, train_acc: 0.8256
step 76, train_acc: 0.8256
step 77, train_acc: 0.8255
step 78, train_acc: 0.8251
step 79, train_acc: 0.8249
step 80, train_acc: 0.8249
step 81, train_acc: 0.8247
step 82, train_acc: 0.8245
step 83, train_acc: 0.8248
step 84, train_acc: 0.8251
step 85, train_acc: 0.8257
step 86, train_acc: 0.8257
step 87, train_acc: 0.8256
step 88, train_acc: 0.8259
step 89, train_acc: 0.8258
step 90, train_acc: 0.8255
step 91, train_acc: 0.8256
step 92, train_acc: 0.8257
step 93, train_acc: 0.8253
step 94, train_acc: 0.8255
step 95, train_acc: 0.8253
step 96, train_acc: 0.8254
step 97, train_acc: 0.8252
step 98, train_acc: 0.8256
step 99, train_acc: 0.8250
step 100, train_acc: 0.8245


step 358, train_acc: 0.8282
step 359, train_acc: 0.8280
step 360, train_acc: 0.8281
step 361, train_acc: 0.8281
step 362, train_acc: 0.8282
step 363, train_acc: 0.8284
step 364, train_acc: 0.8285
step 365, train_acc: 0.8286
step 366, train_acc: 0.8287
step 367, train_acc: 0.8285
step 368, train_acc: 0.8285
step 369, train_acc: 0.8285
step 370, train_acc: 0.8286
step 371, train_acc: 0.8286
step 372, train_acc: 0.8288
step 373, train_acc: 0.8289
step 374, train_acc: 0.8289
step 375, train_acc: 0.8290
step 376, train_acc: 0.8290
step 377, train_acc: 0.8291
step 378, train_acc: 0.8290
step 379, train_acc: 0.8290
step 380, train_acc: 0.8289
step 381, train_acc: 0.8290
step 382, train_acc: 0.8288
step 383, train_acc: 0.8288
step 384, train_acc: 0.8289
step 385, train_acc: 0.8288
step 386, train_acc: 0.8288
step 387, train_acc: 0.8289
step 388, train_acc: 0.8288
step 389, train_acc: 0.8288
step 390, train_acc: 0.8288
step 391, train_acc: 0.8290
step 392, train_acc: 0.8290
step 393, train_acc:

step 183, train_acc: 0.8429
step 184, train_acc: 0.8433
step 185, train_acc: 0.8435
step 186, train_acc: 0.8438
step 187, train_acc: 0.8441
step 188, train_acc: 0.8442
step 189, train_acc: 0.8441
step 190, train_acc: 0.8442
step 191, train_acc: 0.8445
step 192, train_acc: 0.8441
step 193, train_acc: 0.8440
step 194, train_acc: 0.8437
step 195, train_acc: 0.8435
step 196, train_acc: 0.8436
step 197, train_acc: 0.8433
step 198, train_acc: 0.8429
step 199, train_acc: 0.8429
step 200, train_acc: 0.8430
step 201, train_acc: 0.8428
step 202, train_acc: 0.8427
step 203, train_acc: 0.8424
step 204, train_acc: 0.8424
step 205, train_acc: 0.8423
step 206, train_acc: 0.8423
step 207, train_acc: 0.8422
step 208, train_acc: 0.8423
step 209, train_acc: 0.8425
step 210, train_acc: 0.8425
step 211, train_acc: 0.8425
step 212, train_acc: 0.8425
step 213, train_acc: 0.8425
step 214, train_acc: 0.8426
step 215, train_acc: 0.8423
step 216, train_acc: 0.8423
step 217, train_acc: 0.8422
step 218, train_acc:

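### Summary
+ NiN builds a deep network by repeatedly using NiN blocks, each made of a convolutional layer and $1\times 1$ convolutional layers that take the place of fully connected layers.
+ NiN removes the fully connected output layers, which are prone to overfitting, and replaces them with a NiN block whose number of output channels equals the number of label classes, followed by a global average pooling layer.
+ These design ideas influenced a series of later convolutional neural network designs.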