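# ShuffleNet V2: Key Observations and Innovations

ShuffleNet V2 is an improved version of the lightweight convolutional neural network ShuffleNet, designed to further raise efficiency and accuracy on mobile and embedded devices. Compared with the first-generation ShuffleNet, V2's design puts more weight on balancing actual inference speed against accuracy.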

## Key Observations

When designing ShuffleNet V2, the authors identified the following factors that strongly affect model efficiency:

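| Observation | Explanation |
|-------------|-------------|
| 1. Equal input and output channel counts minimize memory access cost | For a fixed FLOP budget, the memory access cost (MAC) of a 1x1 convolution is lowest when the input and output channel counts are equal |
| 2. Excessive group convolution increases memory access overhead | Group convolution reduces computation, but at a fixed FLOP budget it raises MAC, which can become the bottleneck |
| 3. Network fragmentation reduces parallelism | Multi-branch structures such as those in Inception split the work into many small operators, lowering efficiency on parallel hardware such as GPUs |
| 4. Element-wise operations are not negligible | Operations such as Add and ReLU take up a sizable share of runtime in lightweight models |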

These observations come from extensive experiments and theoretical analysis, and they guide the design of the ShuffleNet V2 unit.
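Observations 1 and 2 can be checked with a few lines of arithmetic. The sketch below assumes the cost model the ShuffleNet V2 paper uses for a 1x1 convolution on an `h x w` feature map, namely FLOPs = `h*w*c_in*c_out/g` and MAC = `h*w*(c_in + c_out) + c_in*c_out/g`. The function names and the 56x56 feature-map size are illustrative choices, not part of the notebook's code.

```python
def conv1x1_flops(h, w, c_in, c_out, groups=1):
    """Multiply-adds of a 1x1 (group) convolution on an h x w feature map."""
    return h * w * c_in * c_out // groups


def conv1x1_mac(h, w, c_in, c_out, groups=1):
    """Memory access cost: read the input, write the output, read the weights."""
    return h * w * (c_in + c_out) + c_in * c_out // groups


H = W = 56  # illustrative feature-map size

# Observation 1: identical FLOPs, increasingly unbalanced channel ratios.
ratio_results = [
    (c_in, c_out, conv1x1_flops(H, W, c_in, c_out), conv1x1_mac(H, W, c_in, c_out))
    for c_in, c_out in [(128, 128), (64, 256), (32, 512)]
]

# Observation 2: identical FLOPs, more groups (channels widened to compensate).
group_results = [
    (g, c, conv1x1_flops(H, W, c, c, g), conv1x1_mac(H, W, c, c, g))
    for g, c in [(1, 128), (4, 256), (16, 512)]
]

for row in ratio_results + group_results:
    print(row)
```

In both lists the FLOPs column stays constant while the MAC column grows, which is the point of the first two rows of the table: FLOPs alone do not predict how much memory traffic a layer generates.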

---

## Innovations

Based on the observations above, ShuffleNet V2 introduces the following key innovations:

<img src="resources/shufflenet_v1_block.png" alt="drawing" width="80%"/>

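### 1. Improved unit structure (Channel Split + Channel Shuffle)

- **Channel Split** divides the input channels into two parts:
  - One part goes through the convolution branch (1x1 convolution, 3x3 depthwise convolution, 1x1 convolution)
  - The other part is passed straight to the output as an identity mapping
- The two parts are then concatenated, and **Channel Shuffle** mixes information between them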

> ✅ Avoids excessive group convolution  
> ✅ Reduces computation and memory access overhead  
> ✅ Improves model efficiency and training stability
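The stride-1 unit described above can be sketched in PyTorch as follows. This is a minimal illustration that assumes the standard layout from the ShuffleNet V2 paper. The names `channel_shuffle` and `BasicUnitSketch` are hypothetical, and the code is not the implementation in `hdd.models.cnn.shufflenetv2`, which may differ in detail.

```python
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across `groups` so the branches exchange information."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)


class BasicUnitSketch(nn.Module):
    """Stride-1 unit: split, transform one half, concatenate, shuffle."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0, "channel split needs an even channel count"
        half = channels // 2
        # Input and output widths are equal in every convolution (observation 1),
        # and the 1x1 convolutions are not grouped (observation 2).
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, kernel_size=1, bias=False),
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
            nn.Conv2d(half, half, kernel_size=3, padding=1, groups=half, bias=False),
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, kernel_size=1, bias=False),
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity, residual = x.chunk(2, dim=1)  # channel split
        out = torch.cat((identity, self.branch(residual)), dim=1)  # Concat, not Add
        return channel_shuffle(out, groups=2)


unit = BasicUnitSketch(116)
x = torch.randn(2, 116, 28, 28)
y = unit(x)
print(y.shape)
```

The output has the same shape as the input, and after the shuffle the untouched identity half sits on the even channel indices, so the next unit's split sends a mix of both halves into each branch.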

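### 2. Removing the element-wise Add from the unit

- Concat replaces Add, reducing the share of runtime spent in element-wise operations
- Information still flows through both branches, preserving the model's representational capacity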

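### 3. A more efficient overall architecture

- The network is built from stacked Shuffle Units
- Each stage increases the channel count, allocating compute sensibly across resolutions
- Depthwise separable convolutions further reduce the parameter count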
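To make the stage layout concrete, the sketch below tabulates channels and spatial resolution per stage. It assumes the standard ShuffleNet V2 layout from the paper (a stride-2 3x3 convolution, a stride-2 max pool, three stages whose first unit downsamples by 2, then a 1x1 convolution) and takes its default values from the `stage_layers` and `stage_out_channels` arguments used later in this notebook. `stage_plan` is a hypothetical helper, not part of the `hdd` package.

```python
def stage_plan(
    input_size=224,
    stage_layers=(4, 8, 4),
    stage_out_channels=(24, 116, 232, 464, 1024),
):
    """Return (name, out_channels, spatial_size, num_units) for each block."""
    plan = []
    size = input_size // 2  # conv1: 3x3, stride 2
    plan.append(("conv1", stage_out_channels[0], size, 1))
    size //= 2  # max pool: 3x3, stride 2
    plan.append(("maxpool", stage_out_channels[0], size, 1))
    stages = zip(stage_layers, stage_out_channels[1:-1])
    for index, (num_units, channels) in enumerate(stages, start=2):
        size //= 2  # the first unit of every stage uses stride 2
        plan.append((f"stage{index}", channels, size, num_units))
    plan.append(("conv5", stage_out_channels[-1], size, 1))  # 1x1, stride 1
    return plan


for name, channels, size, num_units in stage_plan():
    print(f"{name:8s} channels={channels:4d} size={size:3d} units={num_units}")
```

Resolution halves while the channel count roughly doubles from stage to stage, so the per-stage compute stays in the same range instead of concentrating in the early, high-resolution layers.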

---

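## Summary

ShuffleNet V2 succeeds because it looks beyond FLOPs and parameter counts and starts from **actual execution efficiency on hardware**, taking memory access, parallelism, and element-wise operations into account together.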
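Since the argument rests on measured speed and not on FLOPs, it helps to time a model directly on the target device. The helper below is a generic, standard-library-only sketch. `measure_latency` and its parameters are hypothetical names introduced here, not part of this notebook's `hdd` utilities.

```python
import time
from statistics import median


def measure_latency(fn, warmup=10, iters=50, sync=None):
    """Return the median wall-clock time of fn() in milliseconds.

    `sync` is an optional callable that blocks until the device has finished
    all queued work; without it, asynchronous backends would be under-timed.
    """
    for _ in range(warmup):  # let caches and lazy initialization settle
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append((time.perf_counter() - start) * 1e3)
    return median(samples)


# Stand-in workload; replace with a model forward pass.
elapsed_ms = measure_latency(lambda: sum(range(10_000)), warmup=2, iters=5)
print(f"median latency: {elapsed_ms:.3f} ms")
```

For a PyTorch model on CUDA, one would wrap the forward pass in `torch.no_grad()` and pass `sync=torch.cuda.synchronize`, because CUDA kernels are launched asynchronously.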

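### Main advantages

- Faster inference
- Higher accuracy
- Better suited to deployment on mobile and embedded devices
- A simple, efficient unit structure that is easy to implement

If you are building an efficient vision system for mobile devices, ShuffleNet V2 is a backbone well worth trying.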

---

In [2]:
# Automatically reload external modules so that code changes take effect without re-importing
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

from hdd.device.utils import get_device

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Path to the training data
DATA_ROOT = "~/workspace/hands-dirty-on-dl/dataset"
# Path for TensorBoard logs
TENSORBOARD_ROOT = "~/workspace/hands-dirty-on-dl/dataset"
# Path for pretrained model weights
TORCH_HUB_PATH = "~/workspace/hands-dirty-on-dl/pretrained_models"
torch.hub.set_dir(TORCH_HUB_PATH)
# Pick the best available training device
DEVICE = get_device(["cuda", "cpu"])
print("Device: ", DEVICE)

Device:  cuda


In [3]:
from hdd.dataset.imagenette_in_memory import ImagenetteInMemory
from hdd.data_util.auto_augmentation import ImageNetPolicy

from hdd.data_util.transforms import RandomResize
from torch.utils.data import DataLoader

TRAIN_MEAN = [0.4625, 0.4580, 0.4295]
TRAIN_STD = [0.2452, 0.2390, 0.2469]
train_dataset_transforms = transforms.Compose(
    [
        RandomResize([256, 296, 384]),  # resize to one of three sizes chosen at random
        transforms.RandomCrop(224),
        transforms.RandomHorizontalFlip(),
        ImageNetPolicy(),
        transforms.ToTensor(),
        transforms.Normalize(mean=TRAIN_MEAN, std=TRAIN_STD),
    ]
)
val_dataset_transforms = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=TRAIN_MEAN, std=TRAIN_STD),
    ]
)
train_dataset = ImagenetteInMemory(
    root=DATA_ROOT,
    split="train",
    size="full",
    download=True,
    transform=train_dataset_transforms,
)
val_dataset = ImagenetteInMemory(
    root=DATA_ROOT,
    split="val",
    size="full",
    download=True,
    transform=val_dataset_transforms,
)


def build_dataloader(batch_size, train_dataset, val_dataset):
    train_dataloader = DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True, num_workers=8
    )
    val_dataloader = DataLoader(
        val_dataset, batch_size=batch_size, shuffle=False, num_workers=8
    )
    return train_dataloader, val_dataloader

In [8]:
import torchsummary
from hdd.models.cnn.shufflenetv2 import ShuffleNetV2

net = ShuffleNetV2(
    num_classes=1000,
    stage_layers=[4, 8, 4],
    stage_out_channels=[24, 116, 232, 464, 1024],
).to(DEVICE)
torchsummary.summary(net, (3, 224, 224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 24, 112, 112]             648
       BatchNorm2d-2         [-1, 24, 112, 112]              48
              ReLU-3         [-1, 24, 112, 112]               0
           _Conv2d-4         [-1, 24, 112, 112]               0
         MaxPool2d-5           [-1, 24, 56, 56]               0
            Conv2d-6           [-1, 24, 28, 28]             216
       BatchNorm2d-7           [-1, 24, 28, 28]              48
           _Conv2d-8           [-1, 24, 28, 28]               0
            Conv2d-9           [-1, 58, 28, 28]           1,392
      BatchNorm2d-10           [-1, 58, 28, 28]             116
             ReLU-11           [-1, 58, 28, 28]               0
          _Conv2d-12           [-1, 58, 28, 28]               0
           Conv2d-13           [-1, 58, 56, 56]           1,392
      BatchNorm2d-14           [-1, 58,

In [9]:
from hdd.train.classification_utils import (
    naive_train_classification_model,
    eval_image_classifier,
)
from hdd.models.nn_utils import count_trainable_parameter


def train_net(
    train_dataloader,
    val_dataloader,
    net,
    lr=1e-3,
    weight_decay=0,
    max_epochs=200,
) -> dict[str, list[float]]:

    print(f"#Parameter: {count_trainable_parameter(net)}")
    criteria = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(net.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, max_epochs, eta_min=lr / 100
    )
    training_stats = naive_train_classification_model(
        net,
        criteria,
        max_epochs,
        train_dataloader,
        val_dataloader,
        DEVICE,
        optimizer,
        scheduler,
        verbose=True,
    )
    return training_stats


train_dataloader, val_dataloader = build_dataloader(64, train_dataset, val_dataset)

net = ShuffleNetV2(
    num_classes=10,
    stage_layers=[4, 8, 4],
    stage_out_channels=[24, 116, 232, 464, 1024],
).to(DEVICE)
width_multiplier_1 = train_net(
    train_dataloader,
    val_dataloader,
    net,
    lr=0.001,
    weight_decay=0,
)

eval_result = eval_image_classifier(net, val_dataloader.dataset, DEVICE)
ss = [result.gt_label == result.predicted_label for result in eval_result]
print(f"#Parameter: {count_trainable_parameter(net)} Accuracy: {sum(ss) / len(ss)}")

#Parameter: 1263854
Epoch: 1/200 Train Loss: 2.1699 Accuracy: 0.2207 Time: 4.27057  | Val Loss: 2.1491 Accuracy: 0.2599
Epoch: 2/200 Train Loss: 1.9332 Accuracy: 0.3273 Time: 4.32623  | Val Loss: 1.6790 Accuracy: 0.4125
Epoch: 3/200 Train Loss: 1.7936 Accuracy: 0.3863 Time: 4.35719  | Val Loss: 1.4286 Accuracy: 0.5208
Epoch: 4/200 Train Loss: 1.7058 Accuracy: 0.4188 Time: 4.32647  | Val Loss: 1.4172 Accuracy: 0.5246
Epoch: 5/200 Train Loss: 1.6045 Accuracy: 0.4559 Time: 4.38447  | Val Loss: 1.1889 Accuracy: 0.6166
Epoch: 6/200 Train Loss: 1.5400 Accuracy: 0.4865 Time: 4.40782  | Val Loss: 1.1843 Accuracy: 0.6084
Epoch: 7/200 Train Loss: 1.4893 Accuracy: 0.5057 Time: 4.40190  | Val Loss: 1.0707 Accuracy: 0.6540
Epoch: 8/200 Train Loss: 1.4105 Accuracy: 0.5267 Time: 4.37572  | Val Loss: 1.0230 Accuracy: 0.6703
Epoch: 9/200 Train Loss: 1.3553 Accuracy: 0.5484 Time: 4.42825  | Val Loss: 1.1059 Accuracy: 0.6448
Epoch: 10/200 Train Loss: 1.3013 Accuracy: 0.5610 Time: 4.25508  | Val Loss: 0.9