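# Implementing Neural Network Models in PyTorch

__For detailed explanations of the network models covered here, see the Bilibili uploader 霹雳吧啦Wz (user ID: 18161609).__   

__Selected Bilibili content__: [ResNet architecture, BN, and transfer learning explained](https://www.bilibili.com/video/BV1T7411T7wa/?share_source=copy_web&vd_source=af4d80c0a2a9115a896a0378b7093d65)  

**The uploader's GitHub also has many resources**: [霹雳吧啦Wz's GitHub repository](https://github.com/WZMIAOMIAO/deep-learning-for-image-processing)  

**The uploader's CSDN blog**: <https://blog.csdn.net/qq_37541097>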

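### (1) **LeNet**

#### **1.1 Introduction**

LeNet was one of the first networks to combine convolutional layers and pooling layers. It takes a grayscale image as input and outputs the handwritten digit the image contains, and it reached remarkable accuracy on handwritten character recognition.  
Of the several LeNet versions, LeNet-5 is the best known and also the best performing.
  
![LeNet architecture](./images/LeNet结构.png)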


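#### **1.2 Model Features and Implementation**

- Establishes the basic framework of a convolutional neural network: convolutional layers, pooling layers, and fully connected layers.  

- Weight sharing: fewer parameters, less computation, and lower memory use.  

- Local connectivity in the convolutional layers preserves the spatial correlation of the image.  

- Spatial average down-sampling (average pooling) reduces the number of features.  

- Nonlinear activation functions: the original network uses sigmoid or tanh, and either works. The implementation below uses ReLU instead.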

In [None]:
# LeNet implementation

import torch
import torch.nn as nn
import torch.nn.functional as F


# Note: the network expects input of shape N * C * H * W, i.e. batch_size * channel * height * width.
# For example, 3*1*28*28 means three single-channel images of size 28*28.
class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
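            # Input is N*1*28*28; padding=2 makes it N*1*32*32, and the output is N*6*28*28.
            nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2),
            # Activation and pooling could instead be applied in forward() with the functions
            # in torch.nn.functional, e.g. F.relu() and F.avg_pool2d().
            # Those functions take the tensor and all settings as arguments on every call,
            # whereas nn modules are configured once here and then simply called on the tensor.
            nn.ReLU(),
            # Pooling turns N*6*28*28 into N*6*14*14.
            nn.AvgPool2d(kernel_size=2, stride=2),
            # Input N*6*14*14, output N*16*10*10. Note that padding defaults to 0, not 1.
            nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5),
            nn.ReLU(),
            # Pooling turns N*16*10*10 into N*16*5*5.
            nn.AvgPool2d(kernel_size=2, stride=2),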
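            # Input N*16*5*5, default padding 0, output N*120*1*1.
            nn.Conv2d(
                in_channels=16,
                out_channels=120,
                kernel_size=5,
            ),
            # Nonlinearity after the last convolution, so it does not merge with the linear layers.
            nn.ReLU(),
            # nn.Flatten() flattens a contiguous range of dimensions so the result can feed the
            # fully connected layers; by default it starts flattening at dimension 1.
            # The same can be done in forward() with torch.flatten() or view(),
            # but note that torch.flatten() starts at dimension 0 by default.
            # The input here is N*120*1*1. Flattening from dimension 1 (the 120) through the last
            # dimension (the second 1) gives N*120, so 120 is the input size of the first linear layer.
            nn.Flatten(start_dim=1, end_dim=-1),
        )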

        self.classifier = nn.Sequential(
            # bias defaults to True.
            nn.Linear(in_features=120 * 1 * 1, out_features=84),
            # Without a nonlinearity here, the two linear layers would collapse into one.
            nn.ReLU(),
            nn.Linear(in_features=84, out_features=10),
        )

    def forward(self, x):
        x = self.conv(x)
        x = self.classifier(x)
        return x

    def validate(self, x):
        """Check the output shape of the convolutional part."""
        x = self.conv(x)
        return x.shape

In [2]:
lenet5 = LeNet5()

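# 28*28 is the image size in the MNIST dataset; padding=2 in the first conv layer pads it to 32*32.
x = torch.rand([1, 1, 28, 28])  # Equivalent to one single-channel 28*28 image.

val = lenet5.validate(x)
print("Output shape after the conv layers:", val)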

y = lenet5(x)
print(y)

tensor([[ 0.0770,  0.0575,  0.1159, -0.0692, -0.0008,  0.0769, -0.0906, -0.0236,
         -0.1010,  0.0644]], grad_fn=<AddmmBackward0>)


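#### __1.3 Q&A__ 

 - LeNet5's input is supposed to be 1X1X32X32, so why does the code above feed in `x = torch.rand([1, 1, 28, 28])`?  
**Reference**: [LeNet: the first convolutional neural network](https://www.ruanx.net/lenet/).   
Because 28X28 is the image size in the MNIST dataset, and the point is simply to use that dataset. Given input of this size, the line `nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2)` in the LeNet5 model pads the image to 32X32 (padding=2 adds two rows or columns on each of the four sides) and then convolves it. The output size of a convolutional layer is: $$\lfloor (N+2P-K)/S \rfloor + 1$$  
**where N is the image height or width, P is the padding added on each side, K is the kernel size (kernel_size), and S is the stride, i.e. how many pixels the kernel moves per step.**    
So the 28X28 input is padded to 32X32, and the first convolution then turns it into a 6-channel 28X28 feature map.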
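
The output-size formula can be checked against every layer of the LeNet5 definition above. The sketch below is illustrative only: `conv_output_size` is a hypothetical helper written for this note, not a PyTorch function, and it assumes square inputs and kernels.

```python
def conv_output_size(n: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Side length after a conv or pooling layer: floor((N + 2P - K) / S) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1


# Trace the feature-map side length through the LeNet5 layers.
size = 28
size = conv_output_size(size, kernel_size=5, padding=2)  # first conv: 28 -> 28
size = conv_output_size(size, kernel_size=2, stride=2)   # first pool: 28 -> 14
size = conv_output_size(size, kernel_size=5)             # second conv: 14 -> 10
size = conv_output_size(size, kernel_size=2, stride=2)   # second pool: 10 -> 5
size = conv_output_size(size, kernel_size=5)             # third conv: 5 -> 1
print(size)
```

The same helper also covers pooling layers, because pooling follows the same size rule with its own kernel size and stride.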

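### (2) **AlexNet**    

#### __2.1 Architecture and Features__

1. **Architecture**  
**Reference**: [A review of classic CNNs: AlexNet (Zhihu)](https://zhuanlan.zhihu.com/p/618545757)  
AlexNet takes a three-channel RGB image of size 224 × 224 × 3 as input (or 227 × 227 × 3 after padding).  
AlexNet has 5 convolutional layers (three of them followed by pooling) and 3 fully connected layers. Each convolutional layer consists of convolution kernels, a bias term, and a ReLU activation; in the original paper, local response normalization (LRN) follows the first and second convolutional layers.  
The 1st, 2nd, and 5th convolutional layers are each followed by a max pooling layer, and the last three layers are fully connected. The final output layer is a softmax, which turns the network output into probabilities used to predict the image class.  

   ![AlexNet architecture](./images/AlexNet结构.jpg) 

2. **Model features**  
- **A genuinely deep convolutional neural network**. The added depth lets it learn better features and improves classification accuracy. 

- __ReLU activation__. AlexNet popularized ReLU in deep CNNs. Compared with the traditional sigmoid and tanh functions, ReLU is cheap to compute and greatly mitigates the vanishing gradient problem, which makes training more efficient.  

- __Local response normalization (LRN)__. Each kernel in a convolutional layer produces one feature map, and LRN normalizes across these feature maps. Concretely, for every position in every feature map, it sums the squares of the activations at the same position in a few adjacent feature maps (channels) and divides the current activation by a power of that sum plus a constant. In essence, LRN makes neighboring neurons compete so that relatively large responses stand out. To some extent this helps against overfitting and improves generalization.  

- **Data augmentation and Dropout**. Data augmentation increases the diversity of the training data and improves generalization. Dropout makes each neuron stop working with a given probability during the forward pass in training, so the model does not rely too heavily on particular local features, which also improves generalization.  

- **Multi-GPU training**. AlexNet was trained on two GPUs, with the kernels of the network split between them and computed in parallel, which sped up training considerably. The implementation below uses half of the paper's channel counts (for example 48 instead of 96 in the first layer), which matches the share held by one GPU. Multi-GPU and distributed training later became widely used in deep learning.

3. __Model implementation__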

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # functional activation and pooling, usable inside forward()


# Note: the network expects input of shape N * C * H * W, i.e. batch_size * channel * height * width.
class AlexNet(nn.Module):
    def __init__(self, num_classes=1000, init_weights=False):
        super().__init__()
        self.feature = nn.Sequential(
            # input [N * 3 * 224 * 224], output [N * 48 * 55 * 55]
            nn.Conv2d(
                in_channels=3, out_channels=48, kernel_size=11, padding=2, stride=4
            ),
            # inplace=True writes the output over the input, which saves memory (or GPU memory).
            # This is safe for ReLU because its gradient can be computed from the output alone.
            # Only some operations have in-place variants (e.g. tensor.sigmoid_()), and an in-place
            # write can break autograd if the overwritten values are still needed for backward,
            # so avoid in-place operations unless you know exactly what you are doing.
            nn.ReLU(inplace=True),
            # input [N * 48 * 55 * 55], output [N * 48 * 27 * 27]
            nn.MaxPool2d(kernel_size=3, stride=2),
            # input [N * 48 * 27 * 27], output [N * 128 * 27 * 27]
            nn.Conv2d(in_channels=48, out_channels=128, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            # input [N * 128 * 27 * 27], output [N * 128 * 13 * 13]
            nn.MaxPool2d(kernel_size=3, stride=2),
            # input [N * 128 * 13 * 13], output [N * 192 * 13 * 13]
            nn.Conv2d(in_channels=128, out_channels=192, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # input [N * 192 * 13 * 13], output [N * 192 * 13 * 13]
            nn.Conv2d(in_channels=192, out_channels=192, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # input [N * 192 * 13 * 13], output [N * 128 * 13 * 13]
            nn.Conv2d(in_channels=192, out_channels=128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # input [N * 128 * 13 * 13], output [N * 128 * 6 * 6]
            nn.MaxPool2d(kernel_size=3, stride=2),
            # output [N * 4608] = N * (128 * 6 * 6)
            nn.Flatten(),  # Flattening starts at dimension 1 by default, not dimension 0.
        )
        self.classifier = nn.Sequential(
            # Randomly deactivates units. Place Dropout before the layer whose inputs should be dropped.
            nn.Dropout(p=0.5),
            nn.Linear(in_features=128 * 6 * 6, out_features=2048),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(in_features=2048, out_features=2048),
            nn.ReLU(inplace=True),
            nn.Linear(in_features=2048, out_features=num_classes),
        )

        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        x = self.feature(x)
        # If nn.Flatten() were not part of self.feature,
        # torch.flatten() or view() could be used at this point instead.
        x = self.classifier(x)
        return x

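    # Weight initialization
    def _initialize_weights(self):
        for m in self.modules():
            # isinstance(object, classinfo) checks whether an object is an instance of a given type.
            # object -- the instance to check.
            # classinfo -- a class, a built-in type, or a tuple of these (subclasses also match).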
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

In [4]:
alexnet = AlexNet()
x = torch.rand([1, 3, 224, 224])
y = alexnet(x)  # y has shape [1, 1000]: one score per class.
print(y.shape)

torch.Size([1, 1000])


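#### __2.2 Q&A__  


- Where does the model definition show local response normalization (LRN)?  
**Reference**: [AlexNet explained in detail (cnblogs)](https://www.cnblogs.com/xiaoboge/p/10465534.html)  
It does not: the AlexNet definition above contains no LRN layer. In PyTorch, LRN is available as `nn.LocalResponseNorm`; in the original network it would sit after the ReLU of the first and second convolutional layers.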
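
One way to add LRN to the first stage of the network is sketched below. This is an illustration under assumptions, not part of the original implementation: `first_stage` is a hypothetical name, the hyperparameters (size 5, alpha 1e-4, beta 0.75, k 2) are the values reported in the AlexNet paper, and PyTorch's `nn.LocalResponseNorm` divides alpha by the window size, so its formula differs slightly from the paper's.

```python
import torch
import torch.nn as nn

# Hypothetical first stage of the AlexNet variant above, with LRN placed after ReLU.
first_stage = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=48, kernel_size=11, stride=4, padding=2),
    nn.ReLU(inplace=True),
    # Normalize each activation using the squared activations of neighboring channels
    # at the same spatial position (a window of 5 channels).
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

x = torch.rand(1, 3, 224, 224)
y = first_stage(x)
print(y.shape)  # LRN leaves the shape unchanged, so this matches the version without LRN
```

Because LRN has no learnable parameters and keeps the tensor shape, it can be inserted or removed without changing any other layer.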

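### (3) __VGG__

#### __3.1 VGG Architecture and Features__

Reference: [Understanding VGG in one article (Zhihu)](https://zhuanlan.zhihu.com/p/41423739)  

- __How VGG works__  
VGG's main contribution was showing that increasing network depth can, up to a point, improve final performance. The two best-known configurations are VGG16 and VGG19; they do not differ in any essential way, only in depth.  
One improvement of VGG16 over AlexNet is that it replaces the larger kernels of earlier networks such as AlexNet (11x11, 7x7, 5x5) with several consecutive 3x3 kernels.  
For a given receptive field (the local region of the input image that affects one output value), a stack of small kernels is better than one large kernel: the extra nonlinear layers add depth and let the network learn more complex patterns, and the cost is lower (fewer parameters).
- __VGG architecture__  

   ![VGG architecture](./images/VGG结构.png)  
- __Pros and cons__  
__Pros__: a simple structure, with the same kernel size and max pooling size throughout the network; a stack of several small filters outperforms one large filter; it confirmed that deepening the network can improve performance.  
__Cons__: it is resource-hungry and has many parameters, most of which come from the first fully connected layer.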
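
The claim that stacked 3x3 kernels match the receptive field of a larger kernel with fewer parameters can be checked with a little arithmetic. The sketch below is illustrative: the helper names and the channel width of 256 are assumptions chosen for the example, biases are ignored, and every layer is assumed to have the same number of input and output channels with stride 1.

```python
def stacked_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size: int, channels: int, num_layers: int = 1) -> int:
    """Weight count (bias ignored) when each layer has `channels` inputs and outputs."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 256  # illustrative channel width, not taken from the text
for num_layers, big_kernel in [(2, 5), (3, 7)]:
    field = stacked_receptive_field(num_layers)
    small_cost = conv_weights(3, channels, num_layers=num_layers)
    big_cost = conv_weights(big_kernel, channels)
    print(f"{num_layers} x 3x3: field {field}, {small_cost:,} weights; "
          f"one {big_kernel}x{big_kernel}: {big_cost:,} weights")
```

Two 3x3 layers cover a 5x5 region and three cover a 7x7 region, in both cases with fewer weights than the single large kernel.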

#### __3.2 Implementation__

In [5]:
import torch
import torch.nn as nn


class VGG(nn.Module):
    def __init__(self, features, num_classes=1000, init_weights=False):
        super().__init__()
        # The feature extractor is long, so it is built separately (see make_features).
        self.features = features
        self.classifier = nn.Sequential(
            nn.Linear(in_features=512 * 7 * 7, out_features=4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(in_features=4096, out_features=4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(in_features=4096, out_features=num_classes),
        )
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        # input [N * 3 * 224 * 224]
        x = self.features(x)
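        # torch.flatten() starts flattening at dimension 0 by default, i.e. it flattens everything,
        # whereas nn.Flatten() starts at dimension 1 by default.
        # Dimensions are counted from 0, so start_dim=1 flattens from the second dimension to the last.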
        # input [N * 512 * 7 * 7]
        x = torch.flatten(x, start_dim=1, end_dim=-1)
        # input [N * (512 * 7 * 7)]=[N * 25088]
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                # nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)


def make_features(cfg: list):
    """
    Build the feature extractor from a configuration list,
    so that models with different numbers of layers can be generated directly.
    """
    layers = []
    in_channels = 3

    for v in cfg:
        if v == "M":
            # An "M" entry adds a max pooling layer.
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            # Any other entry adds a convolution followed by an activation.
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    # "*layers" unpacks the list into separate positional arguments.
    return nn.Sequential(*layers)


cfgs = {
    "vgg11": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "vgg13": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "vgg16": [
        64,
        64,
        "M",
        128,
        128,
        "M",
        256,
        256,
        256,
        "M",
        512,
        512,
        512,
        "M",
        512,
        512,
        512,
        "M",
    ],
    "vgg19": [
        64,
        64,
        "M",
        128,
        128,
        "M",
        256,
        256,
        256,
        256,
        "M",
        512,
        512,
        512,
        512,
        "M",
        512,
        512,
        512,
        512,
        "M",
    ],
}


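def vgg(model_name="vgg16", **kwargs):
    """
    Build a VGG model.
    """
    # If the assert condition is false, an AssertionError carrying the message below is raised;
    # otherwise execution continues with the following statements.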
    assert model_name in cfgs, "Warning model_name:{} not in cfgs dict!".format(
        model_name
    )
    cfg = cfgs[model_name]
    model = VGG(make_features(cfg), **kwargs)
    return model

In [6]:
vgg16 = vgg(model_name="vgg16")
x = torch.rand([1, 3, 224, 224])
y = vgg16(x)
print(y.shape)

torch.Size([1, 1000])


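### (4) __InceptionNet (GoogLeNet)__

#### __4.1 InceptionNet Architecture and Features__

- __Background__  
**Reference**: [inception (Zhihu)](https://zhuanlan.zhihu.com/p/73857137)  
In general, the most direct way to improve network performance is to increase depth and width. Doing so blindly has drawbacks: too many parameters, which leads to overfitting when training data is limited; heavy computation, which makes the model hard to deploy; and the deeper the network, the more likely gradients are to vanish, which makes the model hard to optimize.  

- __Core idea__  
**Reference**: [Understanding the GoogLeNet architecture in depth (Zhihu)](https://zhuanlan.zhihu.com/p/32702031)  
The basic structure of an inception module is shown below; the whole Inception network is a chain of such modules connected in series. It makes two main contributions: using 1x1 convolutions to raise or reduce the channel dimension, and convolving at several kernel sizes in parallel and then aggregating the results.  
  
   ![Inception module structure](./images/inception模块结构.jpg)  

Using a 1X1 convolution for dimensionality reduction clearly cuts the number of parameters, as the figure below illustrates.  
**Reference**: [Deep learning: GoogLeNet architecture explained (CSDN)](https://blog.csdn.net/Vermont_/article/details/108836111)  

   ![1X1 dimensionality reduction in inception](./images/inception1X1降维.png)  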
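
The saving from a 1x1 reduction layer can be made concrete by counting weights. The numbers below (512 input channels, a 5x5 branch with 64 output channels, and a reduction to 24 channels) are assumptions picked for illustration, and `conv_weights` is a hypothetical helper that ignores biases.

```python
def conv_weights(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weight count of one convolutional layer, bias ignored."""
    return kernel_size * kernel_size * in_channels * out_channels


# Illustrative channel counts, assumed for this sketch.
in_ch, reduce_ch, out_ch = 512, 24, 64

# 5x5 convolution applied directly to the 512-channel input.
direct = conv_weights(in_ch, out_ch, kernel_size=5)

# 1x1 convolution down to 24 channels first, then the 5x5 convolution.
reduced = conv_weights(in_ch, reduce_ch, kernel_size=1) + conv_weights(reduce_ch, out_ch, kernel_size=5)

print(direct, reduced, round(direct / reduced, 1))
```

In the implementation that follows, the `ch3x3red` and `ch5x5red` arguments of `Inception` play the role of `reduce_ch`.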

#### **4.2 InceptionNet Implementation**

In [7]:
import torch.nn as nn
import torch
import torch.nn.functional as F


class GoogLeNet(nn.Module):
    def __init__(self, num_classes=1000, aux_logits=True, init_weights=False):
        super(GoogLeNet, self).__init__()
        self.aux_logits = aux_logits

        self.conv1 = BasicConv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.maxpool1 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.conv2 = BasicConv2d(64, 64, kernel_size=1)
        self.conv3 = BasicConv2d(64, 192, kernel_size=3, padding=1)
        self.maxpool2 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception3a = Inception(192, 64, 96, 128, 16, 32, 32)
        self.inception3b = Inception(256, 128, 128, 192, 32, 96, 64)
        self.maxpool3 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception4a = Inception(480, 192, 96, 208, 16, 48, 64)
        self.inception4b = Inception(512, 160, 112, 224, 24, 64, 64)
        self.inception4c = Inception(512, 128, 128, 256, 24, 64, 64)
        self.inception4d = Inception(512, 112, 144, 288, 32, 64, 64)
        self.inception4e = Inception(528, 256, 160, 320, 32, 128, 128)
        self.maxpool4 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception5a = Inception(832, 256, 160, 320, 32, 128, 128)
        self.inception5b = Inception(832, 384, 192, 384, 48, 128, 128)

        if self.aux_logits:
            self.aux1 = InceptionAux(512, num_classes)
            self.aux2 = InceptionAux(528, num_classes)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(1024, num_classes)
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        # N x 3 x 224 x 224
        x = self.conv1(x)
        # N x 64 x 112 x 112
        x = self.maxpool1(x)
        # N x 64 x 56 x 56
        x = self.conv2(x)
        # N x 64 x 56 x 56
        x = self.conv3(x)
        # N x 192 x 56 x 56
        x = self.maxpool2(x)

        # N x 192 x 28 x 28
        x = self.inception3a(x)
        # N x 256 x 28 x 28
        x = self.inception3b(x)
        # N x 480 x 28 x 28
        x = self.maxpool3(x)
        # N x 480 x 14 x 14
        x = self.inception4a(x)
        # N x 512 x 14 x 14
        if self.training and self.aux_logits:  # the auxiliary classifier is skipped in eval mode
            aux1 = self.aux1(x)

        x = self.inception4b(x)
        # N x 512 x 14 x 14
        x = self.inception4c(x)
        # N x 512 x 14 x 14
        x = self.inception4d(x)
        # N x 528 x 14 x 14
        if self.training and self.aux_logits:  # the auxiliary classifier is skipped in eval mode
            aux2 = self.aux2(x)

        x = self.inception4e(x)
        # N x 832 x 14 x 14
        x = self.maxpool4(x)
        # N x 832 x 7 x 7
        x = self.inception5a(x)
        # N x 832 x 7 x 7
        x = self.inception5b(x)
        # N x 1024 x 7 x 7

        x = self.avgpool(x)
        # N x 1024 x 1 x 1
        x = torch.flatten(x, 1)
        # N x 1024
        x = self.dropout(x)
        x = self.fc(x)
        # N x 1000 (num_classes)
        if self.training and self.aux_logits:  # auxiliary outputs are returned only in training mode
            return x, aux2, aux1
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)


class Inception(nn.Module):
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super(Inception, self).__init__()

        self.branch1 = BasicConv2d(in_channels, ch1x1, kernel_size=1)

        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, ch3x3red, kernel_size=1),
            BasicConv2d(
                ch3x3red, ch3x3, kernel_size=3, padding=1
            ),  # padding keeps the output size equal to the input size
        )

        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, ch5x5red, kernel_size=1),
            BasicConv2d(
                ch5x5red, ch5x5, kernel_size=5, padding=2
            ),  # padding keeps the output size equal to the input size
        )

        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            BasicConv2d(in_channels, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)

        outputs = [branch1, branch2, branch3, branch4]
        return torch.cat(outputs, 1)


class InceptionAux(nn.Module):
    def __init__(self, in_channels, num_classes):
        super(InceptionAux, self).__init__()
        self.averagePool = nn.AvgPool2d(kernel_size=5, stride=3)
        self.conv = BasicConv2d(
            in_channels, 128, kernel_size=1
        )  # output[batch, 128, 4, 4]

        self.fc1 = nn.Linear(2048, 1024)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        # aux1: N x 512 x 14 x 14, aux2: N x 528 x 14 x 14
        x = self.averagePool(x)
        # aux1: N x 512 x 4 x 4, aux2: N x 528 x 4 x 4
        x = self.conv(x)
        # N x 128 x 4 x 4
        x = torch.flatten(x, 1)
        x = F.dropout(x, 0.5, training=self.training)
        # N x 2048
        x = F.relu(self.fc1(x), inplace=True)
        x = F.dropout(x, 0.5, training=self.training)
        # N x 1024
        x = self.fc2(x)
        # N x num_classes
        return x


class BasicConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(BasicConv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, **kwargs)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = self.relu(x)
        return x

In [8]:
googleNet = GoogLeNet()
x = torch.rand([1, 3, 224, 224])
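# A freshly constructed module is in training mode, so with aux_logits=True this call returns
# a tuple (main output, aux2, aux1). Call googleNet.eval() first to get only the main output.
y = googleNet(x)
# print(y)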

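### (5) **ResNet**

#### 5.1 **ResNet Architecture**

**Reference**: [ResNet architecture explained and model construction](http://t.csdnimg.cn/F7VpE)  

1. **Highlights**:
- Introduces the **residual structure** and builds ultra-deep networks (more than 1000 layers)  
- Uses **Batch Normalization** to speed up training (and drops dropout)  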

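The ResNet34 architecture is as follows:

![ResNet34 architecture](./images/ResNet结构.jpeg)

The figure below shows the two residual structures given in the paper.  
  
![Residual structures](./images/ResNet残差结构.png)  

The residual structure on the left is for networks with fewer layers, such as ResNet18 and ResNet34. The one on the right is for deeper networks, such as ResNet101 and ResNet152.  

Why do deep networks use the structure on the right? Because it reduces both the parameter count and the amount of computation. For an input and output feature map with 256 channels, the left structure needs about 1,179,648 parameters (two 3x3 convolutions with 256 input and 256 output channels), whereas the right structure needs only 69,632 (1x1x256x64 + 3x3x64x64 + 1x1x64x256). The right structure is clearly the better fit when building deep networks.  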
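
The two parameter counts can be reproduced directly. This is a small sketch under the same assumptions as the comparison above (256 input and output channels, biases and BatchNorm parameters ignored); `conv_weights` is a hypothetical helper, not part of the implementation.

```python
def conv_weights(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weight count of one convolutional layer, bias ignored."""
    return kernel_size * kernel_size * in_channels * out_channels


# Left structure: two 3x3 convolutions, 256 channels in and out.
basic = 2 * conv_weights(256, 256, kernel_size=3)

# Right structure: 1x1 down to 64 channels, 3x3 at 64 channels, 1x1 back up to 256.
bottleneck = (
    conv_weights(256, 64, kernel_size=1)
    + conv_weights(64, 64, kernel_size=3)
    + conv_weights(64, 256, kernel_size=1)
)

print(basic, bottleneck, round(basic / bottleneck, 1))
```

The bottleneck is cheaper because the expensive 3x3 convolution runs on 64 channels instead of 256.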

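2. **Shortcut branch**:  

Clearly, the output of the main branch and the output of the shortcut branch can be added only if they have the same shape, so pay attention to the number of channels and the stride of the 1X1 convolution used on the shortcut branch.  

The residual structure of ResNet50/101/152 is shown in the figure below, with the corresponding dashed residual structure on the right. In the dashed version, the shortcut branch has a 1x1 convolutional layer with as many kernels as the third convolutional layer of the main branch; note the stride of each convolutional layer.  
(Note: in the original paper, the main branch of the dashed residual structure on the right uses stride 2 in the first 1x1 convolution and stride 1 in the 3x3 convolution. The official PyTorch implementation instead uses stride 1 in the first 1x1 convolution and stride 2 in the 3x3 convolution, which improves top-1 accuracy by roughly 0.5%. See [Resnet v1.5](https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch).) 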

![虚线残差结构](./images/虚线残差结构.png) 


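**The configurations of ResNets of different depths are shown below**:  

![ResNet configurations](./images/不同深度的ResNet结构配置.png)  

For ResNet18/34/50/101/152, the first residual block in each of the stages conv3_x, conv4_x, and conv5_x in the table is a dashed residual structure, because that first block has to adjust the shape of the incoming feature map: it halves the height and width and changes the channel depth to what the following blocks expect. In ResNet50/101/152, the first block of conv2_x also needs a dashed structure, but only to change the channel count (64 to 256); the height and width stay the same.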
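
Which stages start with a dashed (projection) shortcut follows from a single condition, the same one used by `_make_layer` in the implementation that follows. The sketch below only replays that condition on channel counts; `needs_projection` and `projection_stages` are hypothetical helper names introduced for this illustration.

```python
def needs_projection(in_channel: int, channel: int, expansion: int, stride: int) -> bool:
    """The shortcut needs a 1x1 convolution if the stride or the channel count changes."""
    return stride != 1 or in_channel != channel * expansion


def projection_stages(expansion: int) -> list:
    """Names of the stages whose first block uses a projection shortcut."""
    in_channel = 64  # channels after the stem (first convolution and max pooling)
    stages = [("conv2_x", 64, 1), ("conv3_x", 128, 2), ("conv4_x", 256, 2), ("conv5_x", 512, 2)]
    result = []
    for name, channel, stride in stages:
        if needs_projection(in_channel, channel, expansion, stride):
            result.append(name)
        in_channel = channel * expansion
    return result


print(projection_stages(expansion=1))  # BasicBlock networks (ResNet18/34)
print(projection_stages(expansion=4))  # Bottleneck networks (ResNet50/101/152)
```

With `expansion=4`, conv2_x is included as well, because the stem outputs 64 channels while the first bottleneck block outputs 256.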


#### 5.2 **ResNet Implementation**

In [1]:
import torch.nn as nn
import torch


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_channel, out_channel, stride=1, downsample=None, **kwargs):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(
            in_channels=in_channel,
            out_channels=out_channel,
            kernel_size=3,
            stride=stride,
            padding=1,
            bias=False,
        )
        self.bn1 = nn.BatchNorm2d(out_channel)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(
            in_channels=out_channel,
            out_channels=out_channel,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False,
        )
        self.bn2 = nn.BatchNorm2d(out_channel)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        if self.downsample is not None:
            identity = self.downsample(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        out += identity
        out = self.relu(out)

        return out


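class Bottleneck(nn.Module):
    """
    Note: in the original paper, the main branch of the dashed residual structure uses stride 2
    in the first 1x1 convolution and stride 1 in the 3x3 convolution.
    The official PyTorch implementation uses stride 1 in the first 1x1 convolution and stride 2
    in the 3x3 convolution, which improves top-1 accuracy by roughly 0.5%.
    See Resnet v1.5 https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch
    """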

    expansion = 4

    def __init__(
        self,
        in_channel,
        out_channel,
        stride=1,
        downsample=None,
        groups=1,
        width_per_group=64,
    ):
        super(Bottleneck, self).__init__()

        width = int(out_channel * (width_per_group / 64.0)) * groups

        self.conv1 = nn.Conv2d(
            in_channels=in_channel,
            out_channels=width,
            kernel_size=1,
            stride=1,
            bias=False,
        )  # squeeze channels
        self.bn1 = nn.BatchNorm2d(width)
        # -----------------------------------------
        self.conv2 = nn.Conv2d(
            in_channels=width,
            out_channels=width,
            groups=groups,
            kernel_size=3,
            stride=stride,
            bias=False,
            padding=1,
        )
        self.bn2 = nn.BatchNorm2d(width)
        # -----------------------------------------
        self.conv3 = nn.Conv2d(
            in_channels=width,
            out_channels=out_channel * self.expansion,
            kernel_size=1,
            stride=1,
            bias=False,
        )  # unsqueeze channels
        self.bn3 = nn.BatchNorm2d(out_channel * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        if self.downsample is not None:
            identity = self.downsample(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        out += identity
        out = self.relu(out)

        return out


class ResNet(nn.Module):

    def __init__(
        self,
        block,
        blocks_num,
        num_classes=1000,
        include_top=True,
        groups=1,
        width_per_group=64,
    ):
        super(ResNet, self).__init__()
        self.include_top = include_top
        self.in_channel = 64

        self.groups = groups
        self.width_per_group = width_per_group

        self.conv1 = nn.Conv2d(
            3, self.in_channel, kernel_size=7, stride=2, padding=3, bias=False
        )
        self.bn1 = nn.BatchNorm2d(self.in_channel)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, blocks_num[0])
        self.layer2 = self._make_layer(block, 128, blocks_num[1], stride=2)
        self.layer3 = self._make_layer(block, 256, blocks_num[2], stride=2)
        self.layer4 = self._make_layer(block, 512, blocks_num[3], stride=2)
        if self.include_top:
            self.avgpool = nn.AdaptiveAvgPool2d((1, 1))  # output size = (1, 1)
            self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")

    def _make_layer(self, block, channel, block_num, stride=1):
        downsample = None
        if stride != 1 or self.in_channel != channel * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(
                    self.in_channel,
                    channel * block.expansion,
                    kernel_size=1,
                    stride=stride,
                    bias=False,
                ),
                nn.BatchNorm2d(channel * block.expansion),
            )

        layers = []
        layers.append(
            block(
                self.in_channel,
                channel,
                downsample=downsample,
                stride=stride,
                groups=self.groups,
                width_per_group=self.width_per_group,
            )
        )
        self.in_channel = channel * block.expansion

        for _ in range(1, block_num):
            layers.append(
                block(
                    self.in_channel,
                    channel,
                    groups=self.groups,
                    width_per_group=self.width_per_group,
                )
            )

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        if self.include_top:
            x = self.avgpool(x)
            x = torch.flatten(x, 1)
            x = self.fc(x)

        return x


# Factory functions for several ResNet variants; call the one you need.
def resnet34(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnet34-333f7ec4.pth
    return ResNet(
        BasicBlock, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top
    )


def resnet50(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnet50-19c8e357.pth
    return ResNet(
        Bottleneck, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top
    )


def resnet101(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnet101-5d3b4d8f.pth
    return ResNet(
        Bottleneck, [3, 4, 23, 3], num_classes=num_classes, include_top=include_top
    )


def resnext50_32x4d(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth
    groups = 32
    width_per_group = 4
    return ResNet(
        Bottleneck,
        [3, 4, 6, 3],
        num_classes=num_classes,
        include_top=include_top,
        groups=groups,
        width_per_group=width_per_group,
    )


def resnext101_32x8d(num_classes=1000, include_top=True):
    # https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth
    groups = 32
    width_per_group = 8
    return ResNet(
        Bottleneck,
        [3, 4, 23, 3],
        num_classes=num_classes,
        include_top=include_top,
        groups=groups,
        width_per_group=width_per_group,
    )

In [2]:
# Instantiate the model with the resnet34() function defined above.
ResNet34 = resnet34()
x = torch.rand([1, 3, 224, 224])
y = ResNet34(x)
print(y.shape)

torch.Size([1, 1000])


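### (6) **MobileNet**

#### 6.1 **MobileNet Architecture**  
__Reference__: [MobileNet (v1, v2) explained and model construction](http://t.csdnimg.cn/Bry66)

##### 6.1.1 **MobileNet-V1**

MobileNet is designed specifically for mobile and embedded devices.

Its key building block is the depthwise (DW) convolution, which greatly reduces both computation and parameter count.
  
The figure below contrasts standard convolution with DW convolution.

   ![Standard convolution vs. DW convolution](./images/传统卷积和DW卷积的差别.png)  

Because a DW convolution outputs exactly as many channels as it receives, choosing the number of output channels freely requires following it with a pointwise (PW) convolution, as shown below.  

A PW convolution is simply an ordinary convolution with kernel size 1. DW and PW convolutions are usually used together, and the pair is called a depthwise separable convolution.  

![DW and PW convolution](./images/深度可分离卷积结构.png)  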


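__The figure below compares the computational cost of depthwise separable convolution and standard convolution.__ 

Here Df is the height and width of the input feature map (assumed equal), Dk is the kernel size, M is the number of input channels, and N is the number of output channels.  

The cost of a convolution is approximately kernel height x kernel width x kernel channels x number of kernels x input feature map height x input feature map width (assuming stride 1). That gives Dk x Dk x M x N x Df x Df for standard convolution and Dk x Dk x M x Df x Df + M x N x Df x Df for DW + PW, so the separable version costs a fraction $1/N + 1/D_k^2$ of the standard one. MobileNet always uses 3x3 kernels for DW convolution, so in theory standard convolution costs 8 to 9 times as much as DW + PW (the formula comes from the original paper).

![Cost of depthwise separable vs. standard convolution](./images/深度可分离卷积和传统卷积计算量对比.png)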
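
The 8 to 9 times figure can be checked numerically. In the sketch below, the function names are hypothetical and the layer size (3x3 kernel, 512 input and output channels, 14x14 feature map) is an assumption chosen for illustration; stride 1 is assumed throughout.

```python
def standard_conv_cost(dk: int, m: int, n: int, df: int) -> int:
    """Multiply-accumulate count of a standard convolution."""
    return dk * dk * m * n * df * df


def separable_conv_cost(dk: int, m: int, n: int, df: int) -> int:
    """Multiply-accumulate count of a DW convolution followed by a PW convolution."""
    depthwise = dk * dk * m * df * df  # one dk x dk filter per input channel
    pointwise = m * n * df * df        # 1x1 convolution mixing the channels
    return depthwise + pointwise


dk, m, n, df = 3, 512, 512, 14  # illustrative layer size
ratio = standard_conv_cost(dk, m, n, df) / separable_conv_cost(dk, m, n, df)
print(round(ratio, 2))
```

As N grows, the ratio approaches Dk squared, which is 9 for the 3x3 kernels used here.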

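The MobileNet-V1 architecture is shown below.

In the table, Conv denotes standard convolution, Conv dw denotes DW convolution, and s is the stride.

The original MobileNet-V1 paper also introduces two hyperparameters, α and β (the paper itself writes the second one as ρ). α is a width multiplier that scales the number of kernels, and β is a resolution multiplier that controls the size of the input image. The right side of the figure below lists the classification accuracy, computation, and parameter count of networks built with different values of α and β.

![MobileNet-V1 architecture](./images/MobileNet-V1网络结构.png)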


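##### 6.1.2 **MobileNet-V2**

The DW convolutions in MobileNet-V1 are prone to degenerating during training, so the results are not as good as hoped.  

MobileNet-V2 introduces the inverted residual block, shown below. The left side is the residual structure from ResNet and the right side is the inverted residual structure from MobileNet-V2; note that both have a shortcut branch. The ResNet bottleneck first reduces the channel count and then restores it, whereas the inverted block first expands the channels with a 1x1 convolution, applies a 3x3 DW convolution, and then projects back down with another 1x1 convolution.

The rationale given for the inverted structure is that high-dimensional information loses less when it passes through a ReLU activation. (Note that the inverted residual structure mostly uses ReLU6, but the last 1x1 convolutional layer uses a linear activation, i.e. no nonlinearity.)  

![Inverted residual structure](./images/倒残差结构.png)  

As with ResNet's residual structure, care is needed here: not every inverted residual block has a shortcut connection. A shortcut exists only when stride=1 and the input and output feature maps have the same shape. Two tensors can be added only if their shapes match, and stride=1 alone does not guarantee that the input and output have the same number of channels.

![Condition for the MobileNet-V2 shortcut branch](./images/MobileNet-V2的shortcut连接.png)  

**The table below shows the MobileNet-V2 network configuration.**

![MobileNet-V2 configuration table](./images/MobileNet-V2网络结构.png)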

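##### 6.1.3 **MobileNet-V3**
**Reference**: [MobileNetV3 network architecture](http://t.csdnimg.cn/qQEM5)

MobileNet-V3 builds on V2 by adding a squeeze-and-excitation (SE) module, i.e. an **attention mechanism**, and by updating the **activation functions**, as shown below. It also **redesigns the time-consuming layers**.

![SE module](./images/SE模块.png)

- **1. Attention mechanism**  

The idea behind this attention mechanism is simple: pool each channel down to a single value, which yields a vector with one element per channel, then pass that vector through two fully connected layers to obtain the output vector.

Note that the first fully connected layer has 1/4 as many nodes as there are channels, and the second has as many nodes as there are channels. The output can be read as an importance score for each channel of the original feature map: the more important a channel, the larger its weight, and the less important, the smaller.

The figure below illustrates the process. Average pooling first turns each channel into one value, and two fully connected layers then produce the per-channel weights; note that the second fully connected layer uses the Hard-Sigmoid activation. Multiplying these channel weights back into the original feature map gives the new feature map.
![How the SE module works](./images/SE模块原理.png)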
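
The steps described above (average pooling, two fully connected layers, Hard-Sigmoid, channel-wise scaling) can be sketched as a small PyTorch module. This is a minimal illustration, not the official implementation: the class name `SqueezeExcitation`, the use of 1x1 convolutions as the fully connected layers, and the ReLU after the first layer are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SqueezeExcitation(nn.Module):
    """Channel attention sketch; squeeze_factor=4 follows the 1/4 ratio described above."""

    def __init__(self, channels: int, squeeze_factor: int = 4):
        super().__init__()
        squeezed = channels // squeeze_factor
        # On a 1x1 feature map, a 1x1 convolution acts as a fully connected layer.
        self.fc1 = nn.Conv2d(channels, squeezed, kernel_size=1)
        self.fc2 = nn.Conv2d(squeezed, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = F.adaptive_avg_pool2d(x, output_size=1)  # N x C x 1 x 1, one value per channel
        scale = F.relu(self.fc1(scale))                  # assumed ReLU after the first layer
        scale = F.hardsigmoid(self.fc2(scale))           # per-channel weight in [0, 1]
        return x * scale                                 # broadcast over height and width


se = SqueezeExcitation(channels=16)
x = torch.rand(2, 16, 8, 8)
y = se(x)
print(y.shape)  # same shape as the input
```

The module only rescales channels, so it can be dropped into a block without changing any tensor shapes.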

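- **2. The h-swish activation function**  

Before looking at h-swish, it helps to know the h-sigmoid activation function, which is simply $\mathrm{ReLU6}(x + 3) / 6$.

h-sigmoid is very close to sigmoid, but both the function and its derivative are much simpler to compute. Since swish is x times sigmoid, h-swish is naturally defined as x times h-sigmoid.

The figure below shows that the swish and h-swish curves are very similar. The authors state in the original paper that replacing swish with h-swish and sigmoid with h-sigmoid helps inference speed and is also friendly to quantization.

Note that although h-swish is faster to compute than swish, it is still considerably slower than ReLU.

![The h-swish activation function](./images/h-swish激活函数.png)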
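
The definitions above translate directly into code. The sketch below uses plain Python scalars so that the formulas stay visible; the function names are chosen for this illustration, and in PyTorch the corresponding modules are `nn.Hardsigmoid` and `nn.Hardswish`.

```python
import math


def relu6(x: float) -> float:
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def h_sigmoid(x: float) -> float:
    """Piecewise-linear approximation of sigmoid: ReLU6(x + 3) / 6."""
    return relu6(x + 3.0) / 6.0


def h_swish(x: float) -> float:
    """x times h-sigmoid, the cheap counterpart of swish."""
    return x * h_sigmoid(x)


def swish(x: float) -> float:
    """x times sigmoid, shown for comparison."""
    return x / (1.0 + math.exp(-x))


for value in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={value:+.1f}  swish={swish(value):+.4f}  h-swish={h_swish(value):+.4f}")
```

Outside the interval from -3 to 3, h-swish is exactly 0 or exactly x, which is part of why it is cheap to evaluate and easy to quantize.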

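- **3. Redesigning the time-consuming layers**

First, the number of kernels in the first convolutional layer is reduced from 32 to 16, which cuts the parameter count and lowers inference time.

Second, the final part of the network, the last stage, is simplified, as shown below.

![Redesigned time-consuming layers](./images/MobileNet-V3重新设计耗时层.png)

- **4. MobileNet-V3 architecture**

The figure below shows the MobileNetV3-large architecture.

Input is the shape of the feature map entering the current layer, and #out is the number of output channels. exp size is the output dimension of the first, expanding 1 × 1 convolution in a bneck; SE indicates whether the attention mechanism is used; NL is the nonlinear activation function used in the layer; and s is the stride.

The number after bneck is the kernel size of the DW convolution. Note the NBN entries at the end, which mean that the convolutions in the classifier part do not use a BN layer.

Also note the first bneck: its exp size is the same as its output dimension, so the first 1 × 1 convolution performs no expansion. For that reason the official PyTorch and TensorFlow implementations omit the 1 × 1 convolution in the first bneck and start directly with the DW convolution.

![MobileNet-V3-Large architecture](./images/MobileNet-V3-Large.png)


**The MobileNet-V3-Small architecture is shown below.**

![MobileNet-V3-Small architecture](./images/MobileNet-V3-Small.png)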

#### 6.2 **MobileNet Implementation**

##### 6.2.1 **MobileNet-V2**

In [3]:
from torch import nn
import torch


def _make_divisible(ch, divisor=8, min_ch=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


class ConvBNReLU(nn.Sequential):
    def __init__(self, in_channel, out_channel, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(
                in_channel,
                out_channel,
                kernel_size,
                stride,
                padding,
                groups=groups,
                bias=False,
            ),
            nn.BatchNorm2d(out_channel),
            nn.ReLU6(inplace=True),
        )


class InvertedResidual(nn.Module):
    def __init__(self, in_channel, out_channel, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        hidden_channel = in_channel * expand_ratio
        self.use_shortcut = stride == 1 and in_channel == out_channel

        layers = []
        if expand_ratio != 1:
            # 1x1 pointwise conv
            layers.append(ConvBNReLU(in_channel, hidden_channel, kernel_size=1))
        layers.extend(
            [
                # 3x3 depthwise conv
                ConvBNReLU(
                    hidden_channel, hidden_channel, stride=stride, groups=hidden_channel
                ),
                # 1x1 pointwise conv(linear)
                nn.Conv2d(hidden_channel, out_channel, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_channel),
            ]
        )

        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_shortcut:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV2(nn.Module):
    def __init__(self, num_classes=1000, alpha=1.0, round_nearest=8):
        super(MobileNetV2, self).__init__()
        block = InvertedResidual
        input_channel = _make_divisible(32 * alpha, round_nearest)
        last_channel = _make_divisible(1280 * alpha, round_nearest)

        inverted_residual_setting = [
            # t, c, n, s
            [1, 16, 1, 1],
            [6, 24, 2, 2],
            [6, 32, 3, 2],
            [6, 64, 4, 2],
            [6, 96, 3, 1],
            [6, 160, 3, 2],
            [6, 320, 1, 1],
        ]

        features = []
        # conv1 layer
        features.append(ConvBNReLU(3, input_channel, stride=2))
        # building inverted residual blocks
        for t, c, n, s in inverted_residual_setting:
            output_channel = _make_divisible(c * alpha, round_nearest)
            for i in range(n):
                stride = s if i == 0 else 1
                features.append(
                    block(input_channel, output_channel, stride, expand_ratio=t)
                )
                input_channel = output_channel
        # building last several layers
        features.append(ConvBNReLU(input_channel, last_channel, 1))
        # combine feature layers
        self.features = nn.Sequential(*features)

        # building classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Sequential(
            nn.Dropout(0.2), nn.Linear(last_channel, num_classes)
        )

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

In [4]:
mobilenet_v2 = MobileNetV2()
x = torch.rand([1, 3, 224, 224])
y = mobilenet_v2(x)
print(y.shape)

torch.Size([1, 1000])
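The `(t, c, n, s)` table in `MobileNetV2.__init__` is compact, so it can help to expand it by hand. The sketch below repeats that table in plain Python (no PyTorch needed) and derives the number of inverted residual blocks and the total downsampling factor. `expand_setting` is a hypothetical helper written for this illustration, not part of the model code.

```python
# (t, c, n, s): expansion ratio, output channels, repeats, stride of the first repeat
SETTING = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]


def expand_setting(setting, first_conv_stride=2):
    """Return one (expand_ratio, out_channels, stride) tuple per block,
    together with the product of all strides."""
    blocks = []
    total_stride = first_conv_stride  # conv1 already halves the resolution
    for t, c, n, s in setting:
        for i in range(n):
            stride = s if i == 0 else 1  # only the first repeat downsamples
            blocks.append((t, c, stride))
            total_stride *= stride
    return blocks, total_stride


BLOCKS, TOTAL_STRIDE = expand_setting(SETTING)

if __name__ == "__main__":
    print(len(BLOCKS), "blocks, total stride", TOTAL_STRIDE)
```

With a 224 × 224 input and a total stride of 32, the feature map reaching the global average pooling layer is 7 × 7.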


##### 6.2.2 **MobileNet-V3**

In [5]:
from typing import Callable, List, Optional

import torch
from torch import nn, Tensor
from torch.nn import functional as F
from functools import partial


def _make_divisible(ch, divisor=8, min_ch=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


class ConvBNActivation(nn.Sequential):
    def __init__(
        self,
        in_planes: int,
        out_planes: int,
        kernel_size: int = 3,
        stride: int = 1,
        groups: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
        activation_layer: Optional[Callable[..., nn.Module]] = None,
    ):
        padding = (kernel_size - 1) // 2
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if activation_layer is None:
            activation_layer = nn.ReLU6
        super(ConvBNActivation, self).__init__(
            nn.Conv2d(
                in_channels=in_planes,
                out_channels=out_planes,
                kernel_size=kernel_size,
                stride=stride,
                padding=padding,
                groups=groups,
                bias=False,
            ),
            norm_layer(out_planes),
            activation_layer(inplace=True),
        )


class SqueezeExcitation(nn.Module):
    def __init__(self, input_c: int, squeeze_factor: int = 4):
        super(SqueezeExcitation, self).__init__()
        squeeze_c = _make_divisible(input_c // squeeze_factor, 8)
        self.fc1 = nn.Conv2d(input_c, squeeze_c, 1)
        self.fc2 = nn.Conv2d(squeeze_c, input_c, 1)

    def forward(self, x: Tensor) -> Tensor:
        scale = F.adaptive_avg_pool2d(x, output_size=(1, 1))
        scale = self.fc1(scale)
        scale = F.relu(scale, inplace=True)
        scale = self.fc2(scale)
        scale = F.hardsigmoid(scale, inplace=True)
        return scale * x


class InvertedResidualConfig:
    def __init__(
        self,
        input_c: int,
        kernel: int,
        expanded_c: int,
        out_c: int,
        use_se: bool,
        activation: str,
        stride: int,
        width_multi: float,
    ):
        self.input_c = self.adjust_channels(input_c, width_multi)
        self.kernel = kernel
        self.expanded_c = self.adjust_channels(expanded_c, width_multi)
        self.out_c = self.adjust_channels(out_c, width_multi)
        self.use_se = use_se
        self.use_hs = activation == "HS"  # whether using h-swish activation
        self.stride = stride

    @staticmethod
    def adjust_channels(channels: int, width_multi: float):
        return _make_divisible(channels * width_multi, 8)


class InvertedResidual(nn.Module):
    def __init__(
        self, cnf: InvertedResidualConfig, norm_layer: Callable[..., nn.Module]
    ):
        super(InvertedResidual, self).__init__()

        if cnf.stride not in [1, 2]:
            raise ValueError("illegal stride value.")

        self.use_res_connect = cnf.stride == 1 and cnf.input_c == cnf.out_c

        layers: List[nn.Module] = []
        activation_layer = nn.Hardswish if cnf.use_hs else nn.ReLU

        # expand
        if cnf.expanded_c != cnf.input_c:
            layers.append(
                ConvBNActivation(
                    cnf.input_c,
                    cnf.expanded_c,
                    kernel_size=1,
                    norm_layer=norm_layer,
                    activation_layer=activation_layer,
                )
            )

        # depthwise
        layers.append(
            ConvBNActivation(
                cnf.expanded_c,
                cnf.expanded_c,
                kernel_size=cnf.kernel,
                stride=cnf.stride,
                groups=cnf.expanded_c,
                norm_layer=norm_layer,
                activation_layer=activation_layer,
            )
        )

        if cnf.use_se:
            layers.append(SqueezeExcitation(cnf.expanded_c))

        # project
        layers.append(
            ConvBNActivation(
                cnf.expanded_c,
                cnf.out_c,
                kernel_size=1,
                norm_layer=norm_layer,
                activation_layer=nn.Identity,
            )
        )

        self.block = nn.Sequential(*layers)
        self.out_channels = cnf.out_c
        self.is_strided = cnf.stride > 1

    def forward(self, x: Tensor) -> Tensor:
        result = self.block(x)
        if self.use_res_connect:
            result += x

        return result


class MobileNetV3(nn.Module):
    def __init__(
        self,
        inverted_residual_setting: List[InvertedResidualConfig],
        last_channel: int,
        num_classes: int = 1000,
        block: Optional[Callable[..., nn.Module]] = None,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
    ):
        super(MobileNetV3, self).__init__()

        if not inverted_residual_setting:
            raise ValueError("The inverted_residual_setting should not be empty.")
        elif not (
            isinstance(inverted_residual_setting, List)
            and all(
                [
                    isinstance(s, InvertedResidualConfig)
                    for s in inverted_residual_setting
                ]
            )
        ):
            raise TypeError(
                "The inverted_residual_setting should be List[InvertedResidualConfig]"
            )

        if block is None:
            block = InvertedResidual

        if norm_layer is None:
            norm_layer = partial(nn.BatchNorm2d, eps=0.001, momentum=0.01)

        layers: List[nn.Module] = []

        # building first layer
        firstconv_output_c = inverted_residual_setting[0].input_c
        layers.append(
            ConvBNActivation(
                3,
                firstconv_output_c,
                kernel_size=3,
                stride=2,
                norm_layer=norm_layer,
                activation_layer=nn.Hardswish,
            )
        )
        # building inverted residual blocks
        for cnf in inverted_residual_setting:
            layers.append(block(cnf, norm_layer))

        # building last several layers
        lastconv_input_c = inverted_residual_setting[-1].out_c
        lastconv_output_c = 6 * lastconv_input_c
        layers.append(
            ConvBNActivation(
                lastconv_input_c,
                lastconv_output_c,
                kernel_size=1,
                norm_layer=norm_layer,
                activation_layer=nn.Hardswish,
            )
        )
        self.features = nn.Sequential(*layers)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Linear(lastconv_output_c, last_channel),
            nn.Hardswish(inplace=True),
            nn.Dropout(p=0.2, inplace=True),
            nn.Linear(last_channel, num_classes),
        )

        # initial weights
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def _forward_impl(self, x: Tensor) -> Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)

        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)


# Factory functions for mobilenet_v3_large and mobilenet_v3_small
# Build the mobilenet_v3_large model
def mobilenet_v3_large(
    num_classes: int = 1000, reduced_tail: bool = False
) -> MobileNetV3:
    """
    Constructs a large MobileNetV3 architecture from
    "Searching for MobileNetV3" <https://arxiv.org/abs/1905.02244>.

    weights_link:
    https://download.pytorch.org/models/mobilenet_v3_large-8738ca79.pth

    Args:
        num_classes (int): number of classes
        reduced_tail (bool): If True, reduces the channel counts of all feature layers
            between C4 and C5 by 2. It is used to reduce the channel redundancy in the
            backbone for Detection and Segmentation.
    """
    width_multi = 1.0
    bneck_conf = partial(InvertedResidualConfig, width_multi=width_multi)
    adjust_channels = partial(
        InvertedResidualConfig.adjust_channels, width_multi=width_multi
    )

    reduce_divider = 2 if reduced_tail else 1

    inverted_residual_setting = [
        # input_c, kernel, expanded_c, out_c, use_se, activation, stride
        bneck_conf(16, 3, 16, 16, False, "RE", 1),
        bneck_conf(16, 3, 64, 24, False, "RE", 2),  # C1
        bneck_conf(24, 3, 72, 24, False, "RE", 1),
        bneck_conf(24, 5, 72, 40, True, "RE", 2),  # C2
        bneck_conf(40, 5, 120, 40, True, "RE", 1),
        bneck_conf(40, 5, 120, 40, True, "RE", 1),
        bneck_conf(40, 3, 240, 80, False, "HS", 2),  # C3
        bneck_conf(80, 3, 200, 80, False, "HS", 1),
        bneck_conf(80, 3, 184, 80, False, "HS", 1),
        bneck_conf(80, 3, 184, 80, False, "HS", 1),
        bneck_conf(80, 3, 480, 112, True, "HS", 1),
        bneck_conf(112, 3, 672, 112, True, "HS", 1),
        bneck_conf(112, 5, 672, 160 // reduce_divider, True, "HS", 2),  # C4
        bneck_conf(
            160 // reduce_divider,
            5,
            960 // reduce_divider,
            160 // reduce_divider,
            True,
            "HS",
            1,
        ),
        bneck_conf(
            160 // reduce_divider,
            5,
            960 // reduce_divider,
            160 // reduce_divider,
            True,
            "HS",
            1,
        ),
    ]
    last_channel = adjust_channels(1280 // reduce_divider)  # C5

    return MobileNetV3(
        inverted_residual_setting=inverted_residual_setting,
        last_channel=last_channel,
        num_classes=num_classes,
    )


# Build the mobilenet_v3_small model
def mobilenet_v3_small(
    num_classes: int = 1000, reduced_tail: bool = False
) -> MobileNetV3:
    """
    Constructs a small MobileNetV3 architecture from
    "Searching for MobileNetV3" <https://arxiv.org/abs/1905.02244>.

    weights_link:
    https://download.pytorch.org/models/mobilenet_v3_small-047dcff4.pth

    Args:
        num_classes (int): number of classes
        reduced_tail (bool): If True, reduces the channel counts of all feature layers
            between C4 and C5 by 2. It is used to reduce the channel redundancy in the
            backbone for Detection and Segmentation.
    """
    width_multi = 1.0
    bneck_conf = partial(InvertedResidualConfig, width_multi=width_multi)
    adjust_channels = partial(
        InvertedResidualConfig.adjust_channels, width_multi=width_multi
    )

    reduce_divider = 2 if reduced_tail else 1

    inverted_residual_setting = [
        # input_c, kernel, expanded_c, out_c, use_se, activation, stride
        bneck_conf(16, 3, 16, 16, True, "RE", 2),  # C1
        bneck_conf(16, 3, 72, 24, False, "RE", 2),  # C2
        bneck_conf(24, 3, 88, 24, False, "RE", 1),
        bneck_conf(24, 5, 96, 40, True, "HS", 2),  # C3
        bneck_conf(40, 5, 240, 40, True, "HS", 1),
        bneck_conf(40, 5, 240, 40, True, "HS", 1),
        bneck_conf(40, 5, 120, 48, True, "HS", 1),
        bneck_conf(48, 5, 144, 48, True, "HS", 1),
        bneck_conf(48, 5, 288, 96 // reduce_divider, True, "HS", 2),  # C4
        bneck_conf(
            96 // reduce_divider,
            5,
            576 // reduce_divider,
            96 // reduce_divider,
            True,
            "HS",
            1,
        ),
        bneck_conf(
            96 // reduce_divider,
            5,
            576 // reduce_divider,
            96 // reduce_divider,
            True,
            "HS",
            1,
        ),
    ]
    last_channel = adjust_channels(1024 // reduce_divider)  # C5

    return MobileNetV3(
        inverted_residual_setting=inverted_residual_setting,
        last_channel=last_channel,
        num_classes=num_classes,
    )

In [7]:
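# Instantiate the models with the factory functions defined above.
# The variables use different names so the functions are not overwritten
# and the cell can be re-run.
large_model = mobilenet_v3_large()
small_model = mobilenet_v3_small()

x = torch.rand([1, 3, 224, 224])
large_output = large_model(x)
small_output = small_model(x)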

print(large_output.shape)
print(small_output.shape)

torch.Size([1, 1000])
torch.Size([1, 1000])


### （7）**ShuffleNet**
**Reference 1**: [ShuffleNet series explained (CSDN)](http://t.csdnimg.cn/vrRcJ)  

**Reference 2**: [ShuffleNet in detail (CSDN)](http://t.csdnimg.cn/cp4pp)

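#### 7.1 **ShuffleNet_V1 architecture**

- **Background** 

Architectures such as Xception and ResNeXt become inefficient in very small networks, because their large number of 1 × 1 convolutions wastes computing resources.

ShuffleNet therefore introduces **pointwise group convolution** to reduce the computational complexity of the 1 × 1 convolutions and, to overcome the side effects of pointwise group convolution, **channel shuffle** to help information flow across feature channels.

- **From group convolution to channel shuffle**  

**Group convolution** was first introduced in AlexNet to distribute the model across two GPUs. The depthwise separable convolution used in Xception and MobileNet also confirmed its effectiveness. The figure below shows how group convolution works. A depthwise separable convolution consists of a depthwise convolution and a pointwise convolution.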

   ![分组卷积实现](./images/分组卷积实现方式.png)

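In small networks, the expensive pointwise convolutions leave room for only a limited number of channels under a given complexity budget, which can severely hurt accuracy.

**Channel sparse connections**, such as group convolution, can greatly reduce the computational cost.

However, after grouping, each group only uses a small fraction of the input channels. This blocks the flow of information between channels and weakens the representational power of the network.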
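Both the saving from grouping and the isolation it causes can be checked with a little arithmetic. The sketch assumes a bias-free convolution whose weight count is k · k · (C_in / g) · C_out. The width of 240 channels is only an illustrative value, and `conv_weight_count` and `inputs_seen_per_output` are hypothetical helper names.

```python
def conv_weight_count(in_c: int, out_c: int, kernel: int = 1, groups: int = 1) -> int:
    """Number of weights in a bias-free convolution with the given grouping."""
    if in_c % groups or out_c % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only connects to in_c / groups input channels.
    return kernel * kernel * (in_c // groups) * out_c


def inputs_seen_per_output(in_c: int, groups: int) -> int:
    """How many input channels a single output channel can read from."""
    return in_c // groups


dense = conv_weight_count(240, 240, kernel=1, groups=1)
grouped = conv_weight_count(240, 240, kernel=1, groups=3)

if __name__ == "__main__":
    print("dense 1x1 weights:  ", dense)
    print("grouped 1x1 weights:", grouped)
    print("inputs visible to each output:", inputs_seen_per_output(240, 3))
```

With 3 groups the weight count drops to one third, but each output channel now reads only 80 of the 240 input channels.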

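**Channel shuffle** was therefore proposed to restore the flow of information between channels. It lets a group convolution take its input data from different groups, so that input and output channels become related. The principle is shown in the figure below.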

![通道混洗原理](./images/通道混洗原理.png)  

**(a) group convolution**;    **(b) channel shuffle, which gives better access to global information**;    **(c) equivalent to (b)**  
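The reordering in panels (b) and (c) can be traced on channel indices alone. The sketch below is a plain-Python illustration of the reshape, transpose, flatten sequence; `channel_shuffle_indices` is a hypothetical helper that works on index lists rather than tensors.

```python
from typing import List


def channel_shuffle_indices(num_channels: int, groups: int) -> List[int]:
    """Return the channel order produced by reshape -> transpose -> flatten."""
    if num_channels % groups:
        raise ValueError("num_channels must be divisible by groups")
    per_group = num_channels // groups
    # Reshape the flat channel axis into a (groups, per_group) grid.
    grid = [[g * per_group + i for i in range(per_group)] for g in range(groups)]
    # Transpose to (per_group, groups), then read the grid row by row.
    return [grid[g][i] for i in range(per_group) for g in range(groups)]


order = channel_shuffle_indices(num_channels=9, groups=3)

if __name__ == "__main__":
    print(order)
```

For 9 channels in 3 groups the new order is `[0, 3, 6, 1, 4, 7, 2, 5, 8]`, so each block of three consecutive channels, which is what the next group convolution sees, contains one channel from every original group.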

The design of the **shuffle unit** is shown in the figure below.

![通道混洗单元](./images/通道混洗单元.png)  

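**Figure (a)** is the bottleneck unit from ResNet, with the original 3×3 Conv replaced by a 3×3 DepthWiseConv;

**Figure (b)** replaces the 1×1 Convs at both ends of (a) with Group Convs and applies Channel Shuffle before the DepthWiseConv; this unit does not change the feature map size;

**Figure (c)** sets the stride of the DepthWiseConv to 2, adds an average pooling with stride 2 on the shortcut path, and joins the two branches with Concatenate at the end. This design enlarges the channel dimension without adding much computation.

The **ShuffleNet_V1 network architecture** is shown in the figure below; the g=3 variant is the one most commonly used.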

![ShuffleNet_V1网络结构](./images/ShuffleNet_V1网络结构.png)







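#### 7.2 **ShuffleNet_V2 architecture**

- **FLOPS**: all upper case, short for floating point operations per second. It means **the number of floating point operations per second**, i.e. computing speed, and is a measure of hardware performance.

- **FLOPs**: with a lower-case s (the plural), short for floating point operations. It means **the number of floating point operations**, i.e. the **amount of computation**, and can be used to measure the complexity of an algorithm or model.

FLOPs are commonly used to measure the computational complexity of a neural network and hence to estimate its running time.

However, a model's speed also depends on memory access, the platform and other factors, so the time actually measured on the target chip should be the criterion, rather than a one-sided pursuit of lower theoretical FLOPs.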

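The authors analysed running time from several angles:

- **G1**: equal input and output channel widths minimize the MAC (Memory Access Cost).

- **G2**: excessive use of group convolution increases the MAC.

- **G3**: network fragmentation reduces the degree of parallelism.  
Networks such as GoogLeNet often use multi-branch structures to improve accuracy, but these structures fragment the network and slow it down.

- **G4**: element-wise operations cannot be ignored.  
Element-wise operations such as ReLU, TensorAdd and BiasAdd have few FLOPs but a relatively large MAC. The authors' experiments show that removing the ReLU and the shortcut from the residual unit of a residual network gives a speed-up of about 20%.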
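Guidelines G1 and G2 can be checked numerically. The sketch below assumes a simple cost model for a 1 × 1 convolution on an h × w feature map: FLOPs B = h · w · c1 · c2 / g and MAC = h · w · (c1 + c2) + c1 · c2 / g (reading the input, writing the output, reading the weights). The function names and the sample shapes are illustrative assumptions.

```python
def flops_1x1(h: int, w: int, c1: int, c2: int, g: int = 1) -> int:
    """Multiply-adds of a 1x1 (group) convolution."""
    return h * w * c1 * c2 // g


def mac_1x1(h: int, w: int, c1: int, c2: int, g: int = 1) -> int:
    """Memory access cost: input map + output map + weights."""
    return h * w * (c1 + c2) + c1 * c2 // g


H = W = 56

# G1: same FLOPs (c1 * c2 is constant), different input/output ratios.
g1_cases = [(128, 128), (64, 256), (32, 512)]
g1_flops = [flops_1x1(H, W, c1, c2) for c1, c2 in g1_cases]
g1_macs = [mac_1x1(H, W, c1, c2) for c1, c2 in g1_cases]

# G2: same FLOPs again; more groups are spent on a wider output.
g2_cases = [(128, 128, 1), (128, 256, 2), (128, 512, 4)]
g2_flops = [flops_1x1(H, W, c1, c2, g) for c1, c2, g in g2_cases]
g2_macs = [mac_1x1(H, W, c1, c2, g) for c1, c2, g in g2_cases]

if __name__ == "__main__":
    print("G1 MAC:", g1_macs)
    print("G2 MAC:", g2_macs)
```

In both lists the FLOPs stay constant while the MAC grows, which is the effect G1 (unbalanced channel widths) and G2 (more groups at fixed FLOPs) describe.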

ShuffleNet_V2 was designed following these guidelines. The figure below compares it with ShuffleNet_V1.

![ShuffleNet_V2和V1结构对比](./images/ShuffleNet_V2和V1结构对比.png)

**(c) and (d) are the V2 structures**

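ShuffleNet V2 introduces a new operation: **channel split**.

- At the start of each unit the channels are split into 2 branches. One branch is an identity mapping, in line with G3; the other passes through several convolutions that keep the number of output channels equal to the number of input channels, in line with G1;

- The 1×1 convolutions in ShuffleNet V2 no longer use group convolution, in line with G2;

- The two branches are joined at the end by channel concatenation (concatenate) rather than TensorAdd, in line with G4.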
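The split, concatenate and shuffle steps of a stride-1 unit can be traced on channel labels. This is a deliberately simplified sketch: the convolution branch is replaced by a per-channel tag, whereas the real branch mixes its channels through 1×1 convolutions. `shuffle` and `v2_unit_stride1` are hypothetical helpers for this illustration.

```python
from typing import List


def shuffle(channels: List[str], groups: int = 2) -> List[str]:
    """Interleave the channel list like reshape -> transpose -> flatten."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


def v2_unit_stride1(channels: List[str], tag: str) -> List[str]:
    """Stride-1 unit: split in half, transform only the second half, concat, shuffle."""
    half = len(channels) // 2
    identity, to_transform = channels[:half], channels[half:]
    # Stand-in for the convolution branch: just tag each channel it touches.
    transformed = [f"{tag}({c})" for c in to_transform]
    return shuffle(identity + transformed, groups=2)


x = ["c0", "c1", "c2", "c3"]
after_one = v2_unit_stride1(x, "f")
after_two = v2_unit_stride1(after_one, "g")

if __name__ == "__main__":
    print(after_one)
    print(after_two)
```

After the first unit, identity and transformed channels are interleaved, so the second unit sends a different half of the channels through its convolution branch. This is how the shuffle keeps information moving between the two branches.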

The ShuffleNet_V2 architecture is shown in the figure below.

![V2结构表](./images/ShuffleNet_V2结构表.png)

#### 7.3 **ShuffleNet code implementation**

In [6]:
from typing import List, Callable

import torch
from torch import Tensor
import torch.nn as nn


def channel_shuffle(x: Tensor, groups: int) -> Tensor:
    """Channel shuffle."""
    batch_size, num_channels, height, width = x.size()
    channels_per_group = num_channels // groups

    # reshape
    # [batch_size, num_channels, height, width] -> [batch_size, groups, channels_per_group, height, width]
    x = x.view(batch_size, groups, channels_per_group, height, width)

    # contiguous() copies the transposed tensor into contiguous memory so that view() can be used.
    x = torch.transpose(x, 1, 2).contiguous()

    # flatten
    x = x.view(batch_size, -1, height, width)

    return x


class InvertedResidual(nn.Module):
    def __init__(self, input_c: int, output_c: int, stride: int):
        super(InvertedResidual, self).__init__()

        if stride not in [1, 2]:
            raise ValueError("illegal stride value.")
        self.stride = stride

        assert output_c % 2 == 0
        branch_features = output_c // 2
        # When stride is 1, input_c must be twice branch_features.
        # In Python, '<<' is the left bit shift; `x << 1` equals x * 2.
        assert (self.stride != 1) or (input_c == branch_features << 1)

        if self.stride == 2:
            self.branch1 = nn.Sequential(
                self.depthwise_conv(
                    input_c, input_c, kernel_s=3, stride=self.stride, padding=1
                ),
                nn.BatchNorm2d(input_c),
                nn.Conv2d(
                    input_c,
                    branch_features,
                    kernel_size=1,
                    stride=1,
                    padding=0,
                    bias=False,
                ),
                nn.BatchNorm2d(branch_features),
                nn.ReLU(inplace=True),
            )
        else:
            self.branch1 = nn.Sequential()

        self.branch2 = nn.Sequential(
            nn.Conv2d(
                input_c if self.stride > 1 else branch_features,
                branch_features,
                kernel_size=1,
                stride=1,
                padding=0,
                bias=False,
            ),
            nn.BatchNorm2d(branch_features),
            nn.ReLU(inplace=True),
            self.depthwise_conv(
                branch_features,
                branch_features,
                kernel_s=3,
                stride=self.stride,
                padding=1,
            ),
            nn.BatchNorm2d(branch_features),
            nn.Conv2d(
                branch_features,
                branch_features,
                kernel_size=1,
                stride=1,
                padding=0,
                bias=False,
            ),
            nn.BatchNorm2d(branch_features),
            nn.ReLU(inplace=True),
        )

    @staticmethod
    def depthwise_conv(
        input_c: int,
        output_c: int,
        kernel_s: int,
        stride: int = 1,
        padding: int = 0,
        bias: bool = False,
    ) -> nn.Conv2d:
        return nn.Conv2d(
            in_channels=input_c,
            out_channels=output_c,
            kernel_size=kernel_s,
            stride=stride,
            padding=padding,
            bias=bias,
            groups=input_c,
        )

    def forward(self, x: Tensor) -> Tensor:
        if self.stride == 1:
            x1, x2 = x.chunk(2, dim=1)
            out = torch.cat((x1, self.branch2(x2)), dim=1)
        else:
            out = torch.cat((self.branch1(x), self.branch2(x)), dim=1)

        out = channel_shuffle(out, 2)

        return out


class ShuffleNetV2(nn.Module):
    def __init__(
        self,
        stages_repeats: List[int],
        stages_out_channels: List[int],
        num_classes: int = 1000,
        inverted_residual: Callable[..., nn.Module] = InvertedResidual,
    ):
        super(ShuffleNetV2, self).__init__()

        if len(stages_repeats) != 3:
            raise ValueError("expected stages_repeats as list of 3 positive ints")
        if len(stages_out_channels) != 5:
            raise ValueError("expected stages_out_channels as list of 5 positive ints")
        self._stage_out_channels = stages_out_channels

        # input RGB image
        input_channels = 3
        output_channels = self._stage_out_channels[0]

        self.conv1 = nn.Sequential(
            nn.Conv2d(
                input_channels,
                output_channels,
                kernel_size=3,
                stride=2,
                padding=1,
                bias=False,
            ),
            nn.BatchNorm2d(output_channels),
            nn.ReLU(inplace=True),
        )
        input_channels = output_channels

        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # Static annotations for mypy
        self.stage2: nn.Sequential
        self.stage3: nn.Sequential
        self.stage4: nn.Sequential

        stage_names = ["stage{}".format(i) for i in [2, 3, 4]]
        for name, repeats, output_channels in zip(
            stage_names, stages_repeats, self._stage_out_channels[1:]
        ):
            seq = [inverted_residual(input_channels, output_channels, 2)]
            for i in range(repeats - 1):
                seq.append(inverted_residual(output_channels, output_channels, 1))
            setattr(self, name, nn.Sequential(*seq))
            input_channels = output_channels

        output_channels = self._stage_out_channels[-1]
        self.conv5 = nn.Sequential(
            nn.Conv2d(
                input_channels,
                output_channels,
                kernel_size=1,
                stride=1,
                padding=0,
                bias=False,
            ),
            nn.BatchNorm2d(output_channels),
            nn.ReLU(inplace=True),
        )

        self.fc = nn.Linear(output_channels, num_classes)

    def _forward_impl(self, x: Tensor) -> Tensor:
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.maxpool(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)
        x = self.conv5(x)
        x = x.mean([2, 3])  # global pool
        x = self.fc(x)
        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)


def shufflenet_v2_x0_5(num_classes=1000):
    """
    Constructs a ShuffleNetV2 with 0.5x output channels, as described in
    `"ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design"
    <https://arxiv.org/abs/1807.11164>`.
    weight: https://download.pytorch.org/models/shufflenetv2_x0.5-f707e7126e.pth

    :param num_classes:
    :return:
    """
    model = ShuffleNetV2(
        stages_repeats=[4, 8, 4],
        stages_out_channels=[24, 48, 96, 192, 1024],
        num_classes=num_classes,
    )

    return model


def shufflenet_v2_x1_0(num_classes=1000):
    """
    Constructs a ShuffleNetV2 with 1.0x output channels, as described in
    `"ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design"
    <https://arxiv.org/abs/1807.11164>`.
    weight: https://download.pytorch.org/models/shufflenetv2_x1-5666bf0f80.pth

    :param num_classes:
    :return:
    """
    model = ShuffleNetV2(
        stages_repeats=[4, 8, 4],
        stages_out_channels=[24, 116, 232, 464, 1024],
        num_classes=num_classes,
    )

    return model


def shufflenet_v2_x1_5(num_classes=1000):
    """
    Constructs a ShuffleNetV2 with 1.5x output channels, as described in
    `"ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design"
    <https://arxiv.org/abs/1807.11164>`.
    weight: https://download.pytorch.org/models/shufflenetv2_x1_5-3c479a10.pth

    :param num_classes:
    :return:
    """
    model = ShuffleNetV2(
        stages_repeats=[4, 8, 4],
        stages_out_channels=[24, 176, 352, 704, 1024],
        num_classes=num_classes,
    )

    return model


def shufflenet_v2_x2_0(num_classes=1000):
    """
    Constructs a ShuffleNetV2 with 2.0x output channels, as described in
    `"ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design"
    <https://arxiv.org/abs/1807.11164>`.
    weight: https://download.pytorch.org/models/shufflenetv2_x2_0-8be3c8ee.pth

    :param num_classes:
    :return:
    """
    model = ShuffleNetV2(
        stages_repeats=[4, 8, 4],
        stages_out_channels=[24, 244, 488, 976, 2048],
        num_classes=num_classes,
    )

    return model

In [7]:
shufflenet = shufflenet_v2_x0_5()
x = torch.rand([1, 3, 224, 224])
y = shufflenet(x)
print(y.shape)

torch.Size([1, 1000])


### （8）**EfficientNet**
**Reference**: [EfficientNet explained](http://t.csdnimg.cn/39lIK)

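#### 8.1 **EfficientNet_V1 architecture**

The EfficientNet paper mainly uses **NAS (Neural Architecture Search)** to search for a sensible configuration of three parameters: the input image **resolution r**, the network **depth**, and the **channel width**.

Increasing the network depth, width or image resolution has traditionally come with problems:

- Increasing the depth lets the network capture richer, more complex features that transfer well to other tasks. However, very deep networks suffer from vanishing gradients and are hard to train.

- Increasing the width captures more fine-grained features and makes the network easier to train, but networks that are very wide and shallow often struggle to learn higher-level features.

- Increasing the input resolution can potentially capture more fine-grained patterns, but at very high resolutions the accuracy gain diminishes, and large images increase the amount of computation.

The EfficientNet_B0 architecture is shown in the figure below.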

![EfficientNet_B0结构表](./images/EfficientNet_B0结构表.png)

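(B1-B7 are obtained from B0 by modifying Resolution, Channels and Layers.) The network is divided into 9 Stages. Stage1 is an ordinary convolution layer with a 3x3 kernel and stride 2 (with BN and the Swish activation).

Stage2 to Stage8 repeatedly stack MBConv blocks (the Layers column, the last one in the table, gives the number of MBConv repeats in each Stage).

Stage9 consists of an ordinary 1x1 convolution layer (with BN and Swish), an average pooling layer and a fully connected layer.

Each MBConv in the table is followed by the number 1 or 6. This is the expansion factor n: the first 1x1 convolution in the MBConv expands the channels of the input feature map by a factor of n. k3x3 or k5x5 is the kernel size of the Depthwise Conv in the MBConv. Channels is the number of channels of the feature map output by the Stage.

**MBConv** is essentially the InvertedResidualBlock from MobileNetV3, with two differences. First, the activation function is different (MBConv in EfficientNet uses Swish throughout). Second, every MBConv includes an SE (Squeeze-and-Excitation) module. The figure below shows the MBConv structure.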

![MBConv结构](./images/MBConv结构.png)

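As shown in the figure, an MBConv block consists of a 1x1 ordinary convolution (expansion, with BN and Swish), a kxk Depthwise Conv (with BN and Swish; k is either 3 or 5, see the EfficientNet-B0 table), an SE module, a 1x1 ordinary convolution (projection, with BN only) and a Dropout layer.

A few points need attention when building the network:
- The first 1x1 expansion convolution has n times as many filters as the input feature map has channels, with n ∈ {1, 6}.

- When n = 1, the first 1x1 expansion convolution is omitted, so the MBConv blocks in Stage2 have no expansion convolution (similar to MobileNetV3).

- The shortcut connection exists only when the input and output feature maps of the MBConv have the same shape (in code this can be checked with `stride == 1 and input_channels == output_channels`).

- The SE module, shown below, consists of a global average pooling and two fully connected layers. The first fully connected layer has one quarter as many nodes as the channels of the feature map entering the MBConv, and uses the Swish activation. The second has as many nodes as the channels of the feature map output by the Depthwise Conv, and uses the Sigmoid activation.

- The dropout_rate of the Dropout layer corresponds to drop_connect_rate in the TensorFlow Keras source, which is described in more detail below (note that in the source, the Dropout layer is present only when the shortcut is used).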

![SE模块](./images/EfficientNet中的SE模块.png)

The detailed parameters of each EfficientNet version are as follows:

![EfficientNet各版本的详细参数](./images/EfficientNet各版本的详细参数.png)

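- `input_size` is the size of the images fed to the network during training.

- `width_coefficient` is the multiplier on the `channel` dimension. For example, the 3x3 convolution in `Stage1` of `EfficientNetB0` uses 32 filters, so in B6 it becomes 32 × 1.8 = 57.6, which is then rounded to the nearest multiple of 8, i.e. 56. The other `Stage`s are handled the same way.

- `depth_coefficient` is the multiplier on the `depth` dimension (applied to Stage2 through Stage8 only). For example, `Stage7` of `EfficientNetB0` has L = 4, so in B6 it becomes 4 × 2.6 = 10.4, which is rounded up to 11.

- `drop_connect_rate` is the drop_rate used by the dropout layer inside the `MBConv` block. In the official Keras implementation, the drop_rate of the MBConv blocks increases from 0 up to `drop_connect_rate` (see the official source for details; **note that in the source, the Dropout layer is present only when the shortcut is used**).  
Also note that this Dropout layer is Stochastic Depth: it randomly drops the entire main branch of a block (leaving only the shortcut branch, which amounts to skipping the block), and can be understood as reducing the depth of the network. See the paper `Deep Networks with Stochastic Depth` for details.

- `dropout_rate` is the dropout_rate of the dropout layer before the final fully connected layer (between Pooling and FC in Stage9).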
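The rounding rules in the list above can be reproduced in a few lines. The sketch below reuses the rounding logic of `_make_divisible` for widths and `math.ceil` for depths. The linear per-block schedule in `block_drop_rates` is an assumption about how "increases from 0 up to `drop_connect_rate`" might be realised, and the 16 blocks and rate 0.2 are illustrative values. All helper names are hypothetical.

```python
import math
from typing import List


def make_divisible(ch: float, divisor: int = 8) -> int:
    """Round to the nearest multiple of `divisor`, never dropping by more than 10%."""
    new_ch = max(divisor, int(ch + divisor / 2) // divisor * divisor)
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


def scale_width(channels: int, width_coefficient: float) -> int:
    """Scale a channel count and snap it to a multiple of 8."""
    return make_divisible(channels * width_coefficient, 8)


def scale_depth(repeats: int, depth_coefficient: float) -> int:
    """Scale a repeat count and round it up."""
    return int(math.ceil(repeats * depth_coefficient))


def block_drop_rates(num_blocks: int, drop_connect_rate: float) -> List[float]:
    """Assumed linear schedule: block b gets drop_connect_rate * b / num_blocks."""
    return [drop_connect_rate * b / num_blocks for b in range(num_blocks)]


# B6 coefficients quoted in the text: width 1.8, depth 2.6.
b6_stage1_channels = scale_width(32, 1.8)
b6_stage7_repeats = scale_depth(4, 2.6)
rates = block_drop_rates(num_blocks=16, drop_connect_rate=0.2)

if __name__ == "__main__":
    print(b6_stage1_channels, b6_stage7_repeats)
    print([round(r, 4) for r in rates])
```

Under this schedule the first block is never dropped and later blocks are dropped with increasing probability, which is consistent with the Stochastic Depth description above.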

#### 8.3 **EfficientNet_V1 code implementation**

In [8]:
import math
import copy
from functools import partial
from collections import OrderedDict
from typing import Optional, Callable

import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import functional as F


def _make_divisible(ch, divisor=8, min_ch=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


def drop_path(x, drop_prob: float = 0.0, training: bool = False):
    """
    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
    "Deep Networks with Stochastic Depth", https://arxiv.org/pdf/1603.09382.pdf

    This function is taken from the rwightman.
    It can be seen here:
    https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/layers/drop.py#L140
    """
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (
        x.ndim - 1
    )  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize
    output = x.div(keep_prob) * random_tensor
    return output


class DropPath(nn.Module):
    """
    Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    "Deep Networks with Stochastic Depth", https://arxiv.org/pdf/1603.09382.pdf
    """

    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)


class ConvBNActivation(nn.Sequential):
    def __init__(
        self,
        in_planes: int,
        out_planes: int,
        kernel_size: int = 3,
        stride: int = 1,
        groups: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
        activation_layer: Optional[Callable[..., nn.Module]] = None,
    ):
        padding = (kernel_size - 1) // 2
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if activation_layer is None:
            activation_layer = nn.SiLU  # alias Swish  (torch>=1.7)

        super(ConvBNActivation, self).__init__(
            nn.Conv2d(
                in_channels=in_planes,
                out_channels=out_planes,
                kernel_size=kernel_size,
                stride=stride,
                padding=padding,
                groups=groups,
                bias=False,
            ),
            norm_layer(out_planes),
            activation_layer(),
        )


class SqueezeExcitation(nn.Module):
    def __init__(
        self,
        input_c: int,  # block input channel
        expand_c: int,  # block expand channel
        squeeze_factor: int = 4,
    ):
        super(SqueezeExcitation, self).__init__()
        squeeze_c = input_c // squeeze_factor
        self.fc1 = nn.Conv2d(expand_c, squeeze_c, 1)
        self.ac1 = nn.SiLU()  # alias Swish
        self.fc2 = nn.Conv2d(squeeze_c, expand_c, 1)
        self.ac2 = nn.Sigmoid()

    def forward(self, x: Tensor) -> Tensor:
        scale = F.adaptive_avg_pool2d(x, output_size=(1, 1))
        scale = self.fc1(scale)
        scale = self.ac1(scale)
        scale = self.fc2(scale)
        scale = self.ac2(scale)
        return scale * x


class InvertedResidualConfig:
    # kernel_size, in_channel, out_channel, exp_ratio, strides, use_SE, drop_connect_rate
    def __init__(
        self,
        kernel: int,  # 3 or 5
        input_c: int,
        out_c: int,
        expanded_ratio: int,  # 1 or 6
        stride: int,  # 1 or 2
        use_se: bool,  # True
        drop_rate: float,
        index: str,  # 1a, 2a, 2b, ...
        width_coefficient: float,
    ):
        self.input_c = self.adjust_channels(input_c, width_coefficient)
        self.kernel = kernel
        self.expanded_c = self.input_c * expanded_ratio
        self.out_c = self.adjust_channels(out_c, width_coefficient)
        self.use_se = use_se
        self.stride = stride
        self.drop_rate = drop_rate
        self.index = index

    @staticmethod
    def adjust_channels(channels: int, width_coefficient: float):
        return _make_divisible(channels * width_coefficient, 8)


class InvertedResidual(nn.Module):
    def __init__(
        self, cnf: InvertedResidualConfig, norm_layer: Callable[..., nn.Module]
    ):
        super(InvertedResidual, self).__init__()

        if cnf.stride not in [1, 2]:
            raise ValueError("illegal stride value.")

        self.use_res_connect = cnf.stride == 1 and cnf.input_c == cnf.out_c

        layers = OrderedDict()
        activation_layer = nn.SiLU  # alias Swish

        # expand
        if cnf.expanded_c != cnf.input_c:
            layers.update(
                {
                    "expand_conv": ConvBNActivation(
                        cnf.input_c,
                        cnf.expanded_c,
                        kernel_size=1,
                        norm_layer=norm_layer,
                        activation_layer=activation_layer,
                    )
                }
            )

        # depthwise
        layers.update(
            {
                "dwconv": ConvBNActivation(
                    cnf.expanded_c,
                    cnf.expanded_c,
                    kernel_size=cnf.kernel,
                    stride=cnf.stride,
                    groups=cnf.expanded_c,
                    norm_layer=norm_layer,
                    activation_layer=activation_layer,
                )
            }
        )

        if cnf.use_se:
            layers.update({"se": SqueezeExcitation(cnf.input_c, cnf.expanded_c)})

        # project
        layers.update(
            {
                "project_conv": ConvBNActivation(
                    cnf.expanded_c,
                    cnf.out_c,
                    kernel_size=1,
                    norm_layer=norm_layer,
                    activation_layer=nn.Identity,
                )
            }
        )

        self.block = nn.Sequential(layers)
        self.out_channels = cnf.out_c
        self.is_strided = cnf.stride > 1

        # Apply DropPath (stochastic depth) only when the shortcut connection is used
        if self.use_res_connect and cnf.drop_rate > 0:
            self.dropout = DropPath(cnf.drop_rate)
        else:
            self.dropout = nn.Identity()

    def forward(self, x: Tensor) -> Tensor:
        result = self.block(x)
        result = self.dropout(result)
        if self.use_res_connect:
            result += x

        return result


class EfficientNet(nn.Module):
    def __init__(
        self,
        width_coefficient: float,
        depth_coefficient: float,
        num_classes: int = 1000,
        dropout_rate: float = 0.2,
        drop_connect_rate: float = 0.2,
        block: Optional[Callable[..., nn.Module]] = None,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
    ):
        super(EfficientNet, self).__init__()

        # kernel_size, in_channel, out_channel, exp_ratio, strides, use_SE, drop_connect_rate, repeats
        default_cnf = [
            [3, 32, 16, 1, 1, True, drop_connect_rate, 1],
            [3, 16, 24, 6, 2, True, drop_connect_rate, 2],
            [5, 24, 40, 6, 2, True, drop_connect_rate, 2],
            [3, 40, 80, 6, 2, True, drop_connect_rate, 3],
            [5, 80, 112, 6, 1, True, drop_connect_rate, 3],
            [5, 112, 192, 6, 2, True, drop_connect_rate, 4],
            [3, 192, 320, 6, 1, True, drop_connect_rate, 1],
        ]

        def round_repeats(repeats):
            """Round number of repeats based on depth multiplier."""
            return int(math.ceil(depth_coefficient * repeats))

        if block is None:
            block = InvertedResidual

        if norm_layer is None:
            norm_layer = partial(nn.BatchNorm2d, eps=1e-3, momentum=0.1)

        adjust_channels = partial(
            InvertedResidualConfig.adjust_channels, width_coefficient=width_coefficient
        )

        # build inverted_residual_setting
        bneck_conf = partial(
            InvertedResidualConfig, width_coefficient=width_coefficient
        )

        b = 0
        num_blocks = float(sum(round_repeats(i[-1]) for i in default_cnf))
        inverted_residual_setting = []
        for stage, args in enumerate(default_cnf):
            cnf = copy.copy(args)
            for i in range(round_repeats(cnf.pop(-1))):
                if i > 0:
                    # strides equal 1 except first cnf
                    cnf[-3] = 1  # strides
                    cnf[1] = cnf[2]  # input_channel equal output_channel

                cnf[-1] = args[-2] * b / num_blocks  # update dropout ratio
                index = str(stage + 1) + chr(i + 97)  # 1a, 2a, 2b, ...
                inverted_residual_setting.append(bneck_conf(*cnf, index))
                b += 1

        # create layers
        layers = OrderedDict()

        # first conv
        layers.update(
            {
                "stem_conv": ConvBNActivation(
                    in_planes=3,
                    out_planes=adjust_channels(32),
                    kernel_size=3,
                    stride=2,
                    norm_layer=norm_layer,
                )
            }
        )

        # building inverted residual blocks
        for cnf in inverted_residual_setting:
            layers.update({cnf.index: block(cnf, norm_layer)})

        # build top
        last_conv_input_c = inverted_residual_setting[-1].out_c
        last_conv_output_c = adjust_channels(1280)
        layers.update(
            {
                "top": ConvBNActivation(
                    in_planes=last_conv_input_c,
                    out_planes=last_conv_output_c,
                    kernel_size=1,
                    norm_layer=norm_layer,
                )
            }
        )

        self.features = nn.Sequential(layers)
        self.avgpool = nn.AdaptiveAvgPool2d(1)

        classifier = []
        if dropout_rate > 0:
            classifier.append(nn.Dropout(p=dropout_rate, inplace=True))
        classifier.append(nn.Linear(last_conv_output_c, num_classes))
        self.classifier = nn.Sequential(*classifier)

        # initial weights
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def _forward_impl(self, x: Tensor) -> Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)

        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)


def efficientnet_b0(num_classes=1000):
    # input image size 224x224
    return EfficientNet(
        width_coefficient=1.0,
        depth_coefficient=1.0,
        dropout_rate=0.2,
        num_classes=num_classes,
    )


def efficientnet_b1(num_classes=1000):
    # input image size 240x240
    return EfficientNet(
        width_coefficient=1.0,
        depth_coefficient=1.1,
        dropout_rate=0.2,
        num_classes=num_classes,
    )


def efficientnet_b2(num_classes=1000):
    # input image size 260x260
    return EfficientNet(
        width_coefficient=1.1,
        depth_coefficient=1.2,
        dropout_rate=0.3,
        num_classes=num_classes,
    )


def efficientnet_b3(num_classes=1000):
    # input image size 300x300
    return EfficientNet(
        width_coefficient=1.2,
        depth_coefficient=1.4,
        dropout_rate=0.3,
        num_classes=num_classes,
    )


def efficientnet_b4(num_classes=1000):
    # input image size 380x380
    return EfficientNet(
        width_coefficient=1.4,
        depth_coefficient=1.8,
        dropout_rate=0.4,
        num_classes=num_classes,
    )


def efficientnet_b5(num_classes=1000):
    # input image size 456x456
    return EfficientNet(
        width_coefficient=1.6,
        depth_coefficient=2.2,
        dropout_rate=0.4,
        num_classes=num_classes,
    )


def efficientnet_b6(num_classes=1000):
    # input image size 528x528
    return EfficientNet(
        width_coefficient=1.8,
        depth_coefficient=2.6,
        dropout_rate=0.5,
        num_classes=num_classes,
    )


def efficientnet_b7(num_classes=1000):
    # input image size 600x600
    return EfficientNet(
        width_coefficient=2.0,
        depth_coefficient=3.1,
        dropout_rate=0.5,
        num_classes=num_classes,
    )

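The factory functions above differ only in their width and depth coefficients (plus the dropout rate). As a rough illustration of what those coefficients do, the sketch below re-implements the two rounding rules from the code (`_make_divisible` for channels, `ceil` for repeats) without building the network. `BASE_STAGES`, `make_divisible` and `scaled_stages` are hypothetical names used only for this example; the base values are copied from `default_cnf`.

```python
import math

# (out_channels, repeats) for the seven stages, copied from default_cnf (B0).
BASE_STAGES = [(16, 1), (24, 2), (40, 2), (80, 3), (112, 3), (192, 4), (320, 1)]


def make_divisible(ch, divisor=8, min_ch=None):
    """Same rounding rule as _make_divisible in the cell above."""
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Never round down by more than 10%.
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


def scaled_stages(width_coefficient, depth_coefficient):
    """Hypothetical helper: (out_channels, repeats) per stage after scaling."""
    return [
        (make_divisible(c * width_coefficient), int(math.ceil(depth_coefficient * r)))
        for c, r in BASE_STAGES
    ]


b0 = scaled_stages(1.0, 1.0)  # coefficients of efficientnet_b0
b2 = scaled_stages(1.1, 1.2)  # coefficients of efficientnet_b2

print("B0:", b0)
print("B2:", b2)
# Total number of inverted residual blocks in each variant.
print(sum(r for _, r in b0), sum(r for _, r in b2))
```

Because the repeat count is rounded up and the channel count is rounded to a multiple of 8, a small change in a coefficient may leave some stages unchanged while others grow by a full block or by 8 channels.
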
In [9]:
efficientnet = efficientnet_b0()
x = torch.rand([1, 3, 224, 224])
y = efficientnet(x)
print(y.shape)

torch.Size([1, 1000])
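
Inside `EfficientNet.__init__`, the line `cnf[-1] = args[-2] * b / num_blocks` gives every block its own drop rate, growing linearly with the block's position `b`. The sketch below reproduces just that bookkeeping for the B0 settings (`depth_coefficient=1.0`, `drop_connect_rate=0.2`), assuming the same repeat counts as `default_cnf`; `base_repeats` and `schedule` are hypothetical names for this example.

```python
import math

drop_connect_rate = 0.2
depth_coefficient = 1.0  # B0
base_repeats = [1, 2, 2, 3, 3, 4, 1]  # last column of default_cnf

repeats = [int(math.ceil(depth_coefficient * r)) for r in base_repeats]
num_blocks = sum(repeats)

schedule = {}  # block index ("1a", "2a", "2b", ...) -> drop rate
b = 0
for stage, n in enumerate(repeats):
    for i in range(n):
        index = str(stage + 1) + chr(i + 97)  # same naming as in the model
        schedule[index] = drop_connect_rate * b / num_blocks
        b += 1

for index, rate in schedule.items():
    print(index, round(rate, 4))
```

Two consequences follow from the formula: the first block always gets a rate of 0 (so it uses `nn.Identity` instead of `DropPath`), and because `b` stops at `num_blocks - 1`, the last block's rate stays slightly below `drop_connect_rate`.
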

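To see what `drop_path` does to a batch, the following sketch copies its logic into a standalone function and applies it to a tensor of ones. The seed, the batch size and the drop probability are arbitrary choices for illustration, not values used by the model.

```python
import torch


def drop_path(x, drop_prob=0.0, training=False):
    """Same logic as drop_path above: drop whole samples, rescale the survivors."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1 - drop_prob
    # One random draw per sample, broadcast over all remaining dimensions.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    mask.floor_()  # 1 with probability keep_prob, otherwise 0
    return x.div(keep_prob) * mask


torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable
x = torch.ones(8, 3, 4, 4)
out = drop_path(x, drop_prob=0.25, training=True)

# Every sample is either all zeros (dropped) or all 1 / 0.75 (kept and rescaled).
print(out.flatten(1)[:, 0])

# Outside training mode the input is returned unchanged.
print(drop_path(x, drop_prob=0.25, training=False) is x)
```

Dividing the kept samples by `keep_prob` keeps the expected value of the output equal to the input, which is why no rescaling is needed at inference time.
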