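# SqueezeNet Summary

[SqueezeNet](https://arxiv.org/abs/1602.07360) is a lightweight convolutional neural network proposed jointly by DeepScale, UC Berkeley, and Stanford. It aims to match AlexNet-level accuracy with far fewer parameters, which makes it well suited to resource-constrained devices such as mobile or embedded systems.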

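## Key Innovations

### 1. **Fire Module**
The core building block of SqueezeNet is the **Fire Module**, which has two parts:

- **Squeeze Layer**: uses 1x1 convolutions to reduce the number of input channels.
- **Expand Layer**: applies 1x1 and 3x3 convolutions in parallel and concatenates their outputs along the channel dimension.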

```python
import torch
import torch.nn as nn


class Fire(nn.Module):
    def __init__(self, inplanes: int, squeeze_planes: int, expand1x1_planes: int, expand3x3_planes: int) -> None:
        super().__init__()
        self.inplanes = inplanes
        # Squeeze: 1x1 convolution that reduces the channel count.
        self.squeeze = nn.Conv2d(inplanes, squeeze_planes, kernel_size=1)
        self.squeeze_activation = nn.ReLU(inplace=True)
        # Expand: parallel 1x1 and 3x3 branches; padding=1 keeps the spatial size equal.
        self.expand1x1 = nn.Conv2d(squeeze_planes, expand1x1_planes, kernel_size=1)
        self.expand1x1_activation = nn.ReLU(inplace=True)
        self.expand3x3 = nn.Conv2d(squeeze_planes, expand3x3_planes, kernel_size=3, padding=1)
        self.expand3x3_activation = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.squeeze_activation(self.squeeze(x))
        # Concatenate both expand branches along the channel dimension (dim=1).
        return torch.cat(
            [self.expand1x1_activation(self.expand1x1(x)), self.expand3x3_activation(self.expand3x3(x))], 1
        )
```

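This design cuts the parameter count substantially: the 3x3 filters only see the reduced number of channels produced by the squeeze layer, while the two parallel expand branches preserve rich feature representations.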
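To make the saving concrete, the following sketch counts the parameters of a Fire module and compares them with a single 3x3 convolution that has the same input and output channel counts. The helper names `conv_params` and `fire_params` and the channel configuration (96 in, squeeze to 16, expand to 64 + 64) are assumptions chosen for illustration.

```python
def conv_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights plus biases of a 2D convolution with a square kernel."""
    return in_ch * out_ch * kernel * kernel + out_ch


def fire_params(inplanes: int, squeeze: int, expand1x1: int, expand3x3: int) -> int:
    """Total parameters of a Fire module: squeeze layer plus both expand branches."""
    return (
        conv_params(inplanes, squeeze, 1)
        + conv_params(squeeze, expand1x1, 1)
        + conv_params(squeeze, expand3x3, 3)
    )


# Illustrative configuration (assumed): 96 -> squeeze 16 -> expand 64 + 64 = 128 channels.
fire = fire_params(96, 16, 64, 64)
# A plain 3x3 convolution mapping 96 channels directly to 128 channels.
plain = conv_params(96, 128, 3)

print(fire)                     # 11920
print(plain)                    # 110720
print(round(plain / fire, 1))   # 9.3
```

Under these assumptions the Fire module needs roughly nine times fewer parameters than the plain 3x3 layer while producing the same number of output channels.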

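### 2. **Heavy Use of 1x1 Convolutions**
For the same input and output channel counts, a 1x1 convolution has 9x fewer weights than a 3x3 convolution (and even fewer relative to larger kernels), which reduces both computation and parameters while still mixing information across channels.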
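A 1x1 convolution can be read as a small linear map applied independently at every spatial position. The sketch below shows this for a single pixel in plain Python and compares per-filter weight counts. The function names and the toy values are hypothetical, and biases are omitted for brevity.

```python
def conv1x1(pixel: list[float], weights: list[list[float]]) -> list[float]:
    """Apply a bias-free 1x1 convolution at one spatial position.

    `pixel` holds the input channel values; each row of `weights` is one output filter.
    """
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]


def weights_per_filter(in_ch: int, kernel: int) -> int:
    """Number of weights in one filter of a square-kernel convolution."""
    return in_ch * kernel * kernel


pixel = [1.0, 2.0, 3.0]            # 3 input channels at one (h, w) location
weights = [
    [1.0, 0.0, -1.0],              # output channel 0: channel 0 minus channel 2
    [0.5, 0.5, 0.5],               # output channel 1: half the sum of all channels
]
mixed = conv1x1(pixel, weights)
print(mixed)                       # [-2.0, 3.0]

# A 3x3 filter carries 9x the weights of a 1x1 filter over the same channels.
ratio = weights_per_filter(64, 3) // weights_per_filter(64, 1)
print(ratio)                       # 9
```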

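### 3. **Delayed Downsampling**
By avoiding early downsampling (for example, strided convolutions) in the first layers, more of the convolution layers operate on large activation maps, which helps accuracy.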
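The effect of where the strides are placed can be traced with the standard convolution output-size formula, `floor((n + 2p - k) / s) + 1`. The two four-layer stacks below are hypothetical and are not the actual SqueezeNet layout; they only contrast early and late placement of the same two stride-2 layers.

```python
def out_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace(size: int, layers: list[tuple[int, int, int]]) -> list[int]:
    """Return the activation map size after each (kernel, stride, padding) layer."""
    sizes = []
    for kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


# Hypothetical stacks of 3x3 layers with padding 1 on a 224x224 input.
early = trace(224, [(3, 2, 1), (3, 2, 1), (3, 1, 1), (3, 1, 1)])
late = trace(224, [(3, 1, 1), (3, 1, 1), (3, 2, 1), (3, 2, 1)])

print(early)  # [112, 56, 56, 56]
print(late)   # [224, 224, 112, 56]
```

Both stacks end at the same resolution, but with late downsampling the intermediate layers work on larger activation maps, at the cost of more computation in those layers.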

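### 4. **Small Model Size**
SqueezeNet has about 50x fewer parameters than AlexNet. Combined with model compression, it can be shrunk to under **0.5MB**, roughly 1/510 of AlexNet's size, which makes it a good fit for storage-constrained devices.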
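As a rough sanity check, the uncompressed size of a model is the parameter count times the bytes per parameter. The sketch below applies this to the 10-class SqueezeNet trained later in this notebook, which reports 727,626 trainable parameters. The helper `size_mib` is a hypothetical name, and the estimate ignores file-format overhead and non-trainable buffers.

```python
def size_mib(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate storage in MiB for `num_params` parameters."""
    return num_params * bytes_per_param / 2**20


# Parameter count printed by the training cell without batch norm (10 classes).
num_params = 727_626

fp32 = round(size_mib(num_params), 2)      # 32-bit floats, 4 bytes each
fp16 = round(size_mib(num_params, 2), 2)   # 16-bit floats, 2 bytes each
print(fp32)  # 2.78
print(fp16)  # 1.39
```

Getting well below this size requires compression beyond a simple change of data type, such as pruning and quantization.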

### 5. **Easy to Compress**
SqueezeNet was designed with later model compression in mind (for example, Deep Compression), so the model can be shrunk further with almost no loss of accuracy.
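Deep Compression combines pruning, quantization, and Huffman coding. The sketch below illustrates only the simplest idea behind the pruning stage, zeroing the smallest-magnitude weights, on a plain Python list. It is a simplified illustration under that assumption, not the actual Deep Compression pipeline, and `magnitude_prune` is a hypothetical helper.

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    num_to_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending absolute value; the first ones get pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:num_to_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

In practice the pruned network is usually fine-tuned afterwards to recover accuracy, and the surviving weights are stored in a sparse format so the zeros take no space.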

![SqueezeNet architecture](resources/squeezenet_arch.png)

---

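## Drawbacks and Limitations

### 1. **Inference Is Not Necessarily Faster**
Although the parameter count is small, the network chains many convolution operations (especially the parallel branches inside each Fire Module), so actual inference speed is not necessarily better than that of structurally simpler models.

### 2. **Demanding on Hardware Optimization**
Realizing the benefit of the lightweight design requires good hardware support and efficient convolution implementations (for example, kernels optimized for 1x1 convolutions).

### 3. **Limited Accuracy Ceiling**
SqueezeNet performs well among lightweight models, but its Top-1 accuracy on large datasets such as ImageNet still lags behind deeper, more complex models (such as ResNet and DenseNet).

### 4. **Less Scalable Than Modern Lightweight Models**
Compared with later lightweight models such as MobileNet and ShuffleNet, SqueezeNet falls somewhat short in mobile-oriented optimization and in the balance between accuracy and efficiency.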

---

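## Conclusion

SqueezeNet is one of the representative early explorations of lightweight CNNs, and the design idea behind its Fire Module influenced later models. Despite some performance bottlenecks, its potential for model compression and deployment remains valuable, especially in applications that are sensitive to model size.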

In [1]:
# Automatically reload external modules so code changes take effect without re-importing
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

from hdd.device.utils import get_device
from hdd.dataset.imagenette_in_memory import ImagenetteInMemory

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Path to the training data
DATA_ROOT = "~/workspace/hands-dirty-on-dl/dataset"
# Path for TensorBoard logs
TENSORBOARD_ROOT = "~/workspace/hands-dirty-on-dl/dataset"
# Path for pretrained model weights
TORCH_HUB_PATH = "~/workspace/hands-dirty-on-dl/pretrained_models"
torch.hub.set_dir(TORCH_HUB_PATH)
# Pick the most suitable training device
DEVICE = get_device(["cuda", "cpu"])
print("Use device: ", DEVICE)

Use device:  cuda


In [2]:
from hdd.data_util.transforms import RandomResize
from torch.utils.data import DataLoader

TRAIN_MEAN = [0.4625, 0.4580, 0.4295]
TRAIN_STD = [0.2452, 0.2390, 0.2469]
train_dataset_transforms = transforms.Compose(
    [
        RandomResize([256, 296, 384]),  # Randomly pick one of the three sizes for the resize
        transforms.RandomRotation(10),
        transforms.RandomCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=TRAIN_MEAN, std=TRAIN_STD),
    ]
)
val_dataset_transforms = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=TRAIN_MEAN, std=TRAIN_STD),
    ]
)
train_dataset = ImagenetteInMemory(
    root=DATA_ROOT,
    split="train",
    size="full",
    download=True,
    transform=train_dataset_transforms,
)
val_dataset = ImagenetteInMemory(
    root=DATA_ROOT,
    split="val",
    size="full",
    download=True,
    transform=val_dataset_transforms,
)


def build_dataloader(batch_size, train_dataset, val_dataset):
    train_dataloader = DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True, num_workers=8
    )
    val_dataloader = DataLoader(
        val_dataset, batch_size=batch_size, shuffle=False, num_workers=8
    )
    return train_dataloader, val_dataloader

In [3]:
from typing import Tuple
from hdd.models.cnn.squeezenet import SqueezeNet
from hdd.train.classification_utils import (
    naive_train_classification_model,
    eval_image_classifier,
    _train_classifier_naive,
)
from hdd.models.nn_utils import count_trainable_parameter


def train_net(
    train_dataloader,
    val_dataloader,
    add_norm,
    dropout,
    lr=1e-3,
    weight_decay=1e-5,
    max_epochs=150,
    train_classifier=None,
) -> tuple[SqueezeNet, dict[str, list[float]]]:
    net = SqueezeNet(num_classes=10, add_norm=add_norm, dropout=dropout).to(DEVICE)
    print(f"#Parameter: {count_trainable_parameter(net)}")
    criteria = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        net.parameters(), lr=lr, momentum=0.9, weight_decay=weight_decay
    )

    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, max_epochs, eta_min=lr / 100
    )
    if train_classifier is None:
        train_classifier = _train_classifier_naive
    training_stats = naive_train_classification_model(
        net,
        criteria,
        max_epochs,
        train_dataloader,
        val_dataloader,
        DEVICE,
        optimizer,
        scheduler,
        verbose=True,
        train_classifier=train_classifier,
    )
    return net, training_stats


train_dataloader, val_dataloader = build_dataloader(128, train_dataset, val_dataset)


def train_classifier_with_gradient_clipping(
    net: nn.Module,
    criteria: nn.CrossEntropyLoss,
    optimizer: optim.Optimizer,
    train_loader: torch.utils.data.DataLoader,
    device: torch.device,
) -> Tuple[float, float]:
    """Train a classifier for one epoch with gradient clipping (max norm 1.0).

    Args:
        net: network instance.
        criteria: Loss function. Typically nn.CrossEntropyLoss
        optimizer: optimizer.
        train_loader: train data
        device: device to run the training.

    Returns:
        avg train loss and train accuracy.
    """

    train_loss = 0.0
    correct_items = 0
    total_items = 0
    net.train()
    for Xs, ys in train_loader:
        Xs, ys = Xs.to(device), ys.to(device)
        optimizer.zero_grad()
        logits = net(Xs)
        loss = criteria(logits, ys)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
        optimizer.step()
        train_loss += loss.item()
        correct_items += torch.sum(torch.argmax(logits, dim=1) == ys).item()
        total_items += Xs.shape[0]

    avg_train_loss = train_loss / len(train_loader)
    accuracy = correct_items / total_items
    return avg_train_loss, accuracy


# Without batch norm, gradient clipping is used here; otherwise the gradients explode
net, no_norm_stats = train_net(
    train_dataloader,
    val_dataloader,
    add_norm=False,
    dropout=0,
    lr=0.01,
    weight_decay=0,
    max_epochs=250,
    train_classifier=train_classifier_with_gradient_clipping,
)

eval_result = eval_image_classifier(net, val_dataloader.dataset, DEVICE)
ss = [result.gt_label == result.predicted_label for result in eval_result]
print(f"#Parameter: {count_trainable_parameter(net)} Accuracy: {sum(ss) / len(ss)}")

#Parameter: 727626
Epoch: 1/250 Train Loss: 6.4859 Accuracy: 0.1574 Time: 6.43627  | Val Loss: 2.1918 Accuracy: 0.1740
Epoch: 2/250 Train Loss: 2.1915 Accuracy: 0.1795 Time: 6.24756  | Val Loss: 2.1425 Accuracy: 0.2003
Epoch: 3/250 Train Loss: 2.1748 Accuracy: 0.1951 Time: 6.22408  | Val Loss: 2.1112 Accuracy: 0.2191
Epoch: 4/250 Train Loss: 2.1465 Accuracy: 0.2179 Time: 6.21362  | Val Loss: 2.0906 Accuracy: 0.2306
Epoch: 5/250 Train Loss: 2.0998 Accuracy: 0.2355 Time: 6.28024  | Val Loss: 2.0212 Accuracy: 0.2810
Epoch: 6/250 Train Loss: 2.0417 Accuracy: 0.2709 Time: 6.22656  | Val Loss: 1.9415 Accuracy: 0.2971
Epoch: 7/250 Train Loss: 2.0223 Accuracy: 0.2744 Time: 6.32947  | Val Loss: 1.9426 Accuracy: 0.3057
Epoch: 8/250 Train Loss: 2.0173 Accuracy: 0.2756 Time: 6.23133  | Val Loss: 1.9583 Accuracy: 0.3126
Epoch: 9/250 Train Loss: 2.0000 Accuracy: 0.3018 Time: 6.23780  | Val Loss: 1.9481 Accuracy: 0.2973
Epoch: 10/250 Train Loss: 1.9419 Accuracy: 0.3363 Time: 6.29902  | Val Loss: 1.79

In [4]:
train_dataloader, val_dataloader = build_dataloader(128, train_dataset, val_dataset)
net, norm_stats = train_net(
    train_dataloader,
    val_dataloader,
    add_norm=True,
    dropout=0,
    lr=0.01,
    weight_decay=0,
    max_epochs=250,
)

eval_result = eval_image_classifier(net, val_dataloader.dataset, DEVICE)
ss = [result.gt_label == result.predicted_label for result in eval_result]
print(f"#Parameter: {count_trainable_parameter(net)} Accuracy: {sum(ss) / len(ss)}")

#Parameter: 733514
Epoch: 1/250 Train Loss: 3.0885 Accuracy: 0.2329 Time: 7.63528  | Val Loss: 2.0716 Accuracy: 0.3004
Epoch: 2/250 Train Loss: 2.0264 Accuracy: 0.3179 Time: 7.60864  | Val Loss: 1.9494 Accuracy: 0.3549
Epoch: 3/250 Train Loss: 1.8853 Accuracy: 0.3781 Time: 7.59596  | Val Loss: 1.8920 Accuracy: 0.3814
Epoch: 4/250 Train Loss: 1.7319 Accuracy: 0.4372 Time: 7.57057  | Val Loss: 1.8101 Accuracy: 0.4125
Epoch: 5/250 Train Loss: 1.6501 Accuracy: 0.4624 Time: 7.59889  | Val Loss: 1.5483 Accuracy: 0.4922
Epoch: 6/250 Train Loss: 1.4677 Accuracy: 0.5215 Time: 7.60585  | Val Loss: 1.4188 Accuracy: 0.5271
Epoch: 7/250 Train Loss: 1.3366 Accuracy: 0.5610 Time: 7.57479  | Val Loss: 1.3409 Accuracy: 0.5661
Epoch: 8/250 Train Loss: 1.2924 Accuracy: 0.5825 Time: 7.57574  | Val Loss: 1.2044 Accuracy: 0.6140
Epoch: 9/250 Train Loss: 1.2035 Accuracy: 0.6147 Time: 7.60885  | Val Loss: 1.3576 Accuracy: 0.5824
Epoch: 10/250 Train Loss: 1.1685 Accuracy: 0.6252 Time: 7.58097  | Val Loss: 1.22