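### 1. **Historical Background and the Problem Addressed**
#### **Background: The Long Dominance of CNNs**
- Before ViT, computer vision (CV) tasks were dominated by convolutional neural networks (CNNs) such as ResNet and the YOLO series. CNNs extract spatial features through local convolutions and rely on inductive biases (locality, translation equivariance) to simplify learning.
- The Transformer architecture, proposed in 2017, achieved breakthroughs in natural language processing (NLP) with models such as BERT and GPT. Its use in CV was held back because images are not naturally sequences: treating every pixel as a token produces a very long sequence (a 224×224 image has 50,176 pixels), and self-attention cost grows as O(N²) in the sequence length N.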
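
To make the O(N²) cost concrete, the arithmetic can be sketched as below. The helper name `pixel_attention_cost` is hypothetical, and the memory figure assumes a single attention map for one image and one head, stored naively in float32.

```python
def pixel_attention_cost(height: int, width: int, bytes_per_score: int = 4):
    """Return (N, N*N, GiB) for one attention map over per-pixel tokens.

    Assumes one token per pixel and a fully materialized N x N score matrix.
    """
    n_tokens = height * width
    n_scores = n_tokens * n_tokens
    gib = n_scores * bytes_per_score / 2**30
    return n_tokens, n_scores, gib


n_tokens, n_scores, gib = pixel_attention_cost(224, 224)
print(n_tokens)          # 50176
print(n_scores)          # 2517630976
print(f"{gib:.2f} GiB")  # 9.38 GiB
```

Under these assumptions a single attention map already needs several gigabytes, which is why attending over raw pixels is impractical at this resolution.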

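#### **Core Problem: The Bottleneck in Modeling Global Dependencies**
- A CNN's local receptive field makes it hard to model **long-range dependencies** in an image (for example, relationships between objects in distant regions). Enlarging the receptive field requires stacking many layers, which is inefficient.
- Earlier attempts to bring Transformers into CV (such as DETR) still relied on a CNN backbone for feature extraction, so they were not truly "pure Transformer" architectures.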

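#### **The Introduction of ViT**
- In 2020, the Google Research team published the paper "An Image is Worth 16x16 Words", proposing **ViT, a pure-Transformer vision model**. With large-scale pre-training it outperformed CNNs on ImageNet classification, triggering a paradigm shift in CV.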

---

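### 2. **Innovations and Impact of the Model**
#### **Core Innovations**
- **Serializing the image into patches**  
  The input image is split into fixed-size patches (e.g., 16×16 pixels). Each patch is flattened into a vector and treated as a "visual word". For example, a 224×224 image becomes a sequence of 196 patches (14×14), which greatly shortens the sequence.
- **Position encoding preserves spatial information**  
  Learnable position encodings are added to the patch embedding vectors so that the model can perceive spatial structure.
- **Global self-attention**  
  Multi-head self-attention (MHSA) layers model the relationships among all patches, giving the model **global context awareness**. In classification, for example, the model can relate key features in the corners of an image to those at its center.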
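
The three ideas above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the `hdd` implementation used later in this notebook: the class name `PatchEmbedding` and the sizes (`embed_dim=768`, 12 heads) are assumptions, and the class token, LayerNorm, and MLP blocks of a full ViT are omitted.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and embed each one."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A convolution with kernel == stride == patch_size is equivalent to
        # flattening each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
        )
        # Learnable position embedding: one vector per patch position.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x):
        x = self.proj(x)                  # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, N, D)
        return x + self.pos_embed         # add position information


embed = PatchEmbedding()
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

images = torch.randn(2, 3, 224, 224)
tokens = embed(images)
# Self-attention: query, key, and value are all the same patch sequence.
out, weights = attn(tokens, tokens, tokens)
print(tokens.shape)   # torch.Size([2, 196, 768])
print(weights.shape)  # torch.Size([2, 196, 196])
```

Each row of `weights` (averaged over heads by default) spans all 196 patches, which is what lets a single layer relate a corner patch to a central one.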

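#### **Breakthrough Impact**
- **Breaking the CNN monopoly and establishing a new paradigm**  
  ViT showed that, when pre-trained on large-scale data (such as JFT-300M), a pure Transformer surpasses ResNet in ImageNet accuracy, supporting the claim that "attention is sufficient to replace convolution".
- **Driving multimodal fusion**  
  ViT became the foundation of vision-language multimodal models (such as CLIP), enabling cross-modal (image-text) alignment and supporting applications such as zero-shot retrieval and generative AI.
- **Inspiring efficient architecture innovations**  
  - **Hierarchical design**: Swin Transformer introduces local window attention to reduce computational complexity  
  - **Overfitting mitigation**: DropKey randomly drops Keys to mitigate overfitting on small datasets  
  - **Model compression**: ViT-Slim jointly optimizes dimensions such as patches and attention heads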
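
As a rough illustration of why local window attention is cheaper, the sketch below groups a token grid into non-overlapping windows and counts attention pairs. The function name `window_partition` and the sizes (a 56×56 grid, 7×7 windows, 96 channels) are assumptions chosen for illustration. Swin's shifted windows and relative position bias are not shown.

```python
import torch


def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Group a (B, H, W, C) token grid into non-overlapping windows.

    Returns a tensor of shape (B * num_windows, window_size**2, C), so that
    attention can be computed independently inside each window.
    """
    b, h, w, c = x.shape
    assert h % window_size == 0 and w % window_size == 0
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(-1, window_size * window_size, c)


grid = torch.randn(1, 56, 56, 96)  # illustrative token grid
windows = window_partition(grid, window_size=7)
print(windows.shape)  # torch.Size([64, 49, 96])

n = 56 * 56
global_pairs = n * n                                         # all-to-all attention
local_pairs = windows.shape[0] * windows.shape[1] ** 2       # attention within windows
print(global_pairs, local_pairs)  # 9834496 153664
```

With these sizes, windowed attention scores 64 times fewer token pairs than global attention, at the price of restricting each layer to local context.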

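#### **Limitations and Follow-up Improvements**
| **Challenge**                          | **Solution**                                           | **Representative Work** |
|----------------------------------------|--------------------------------------------------------|-------------------------|
| High data requirements                 | Pre-training on 100-billion-scale datasets             | Google DeepMind         |
| High computational complexity (O(N²))  | Local attention, progressive token reduction           | Swin, As-ViT            |
| Inflexible position encoding           | Relative position encoding, learnable dynamic encoding | OCR-ViT                 |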

In [1]:
# Automatically reload external modules so that code changes take effect without re-importing
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

from hdd.device.utils import get_device

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Path to the training data
DATA_ROOT = "~/workspace/hands-dirty-on-dl/dataset"
# Path for TensorBoard logs
TENSORBOARD_ROOT = "~/workspace/hands-dirty-on-dl/dataset"
# Path for pretrained model weights
TORCH_HUB_PATH = "~/workspace/hands-dirty-on-dl/pretrained_models"
torch.hub.set_dir(TORCH_HUB_PATH)
# Pick the best available training device
DEVICE = get_device(["cuda", "cpu"])
print("Use device: ", DEVICE)

Use device:  cuda


In [2]:
from hdd.data_util.auto_augmentation import CIFAR10Policy

CIFAR_10_MEAN = [0.4914, 0.4822, 0.4465]
CIFAR_10_STD = [0.2470, 0.2435, 0.2616]
train_transform = transforms.Compose(
    [
        transforms.RandomCrop(size=32, padding=4),
        transforms.RandomHorizontalFlip(),
        CIFAR10Policy(),
        transforms.ToTensor(),
        transforms.Normalize(CIFAR_10_MEAN, CIFAR_10_STD),
    ]
)
test_transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize(CIFAR_10_MEAN, CIFAR_10_STD),
    ]
)
BATCH_SIZE = 128

train_dataloader = torch.utils.data.DataLoader(
    datasets.CIFAR10(
        root=DATA_ROOT, train=True, download=True, transform=train_transform
    ),
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=8,
    pin_memory=True,
)

val_dataloader = torch.utils.data.DataLoader(
    datasets.CIFAR10(
        root=DATA_ROOT, train=False, download=True, transform=test_transform
    ),
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=8,
    pin_memory=True,
)

Files already downloaded and verified
Files already downloaded and verified


In [None]:
from hdd.models.transformer.vit import ViT
from hdd.train.classification_utils import naive_train_classification_model

net = ViT(
    num_classes=10,
    image_size=32,
    patch_size=8,
    embed_dim=384,
    n_heads=12,
    diff_dim=384,
    dropout=0.0,
    num_layers=7,
).to(DEVICE)
criteria = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.Adam(
    net.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=5e-5
)
max_epochs = 250
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, max_epochs, eta_min=1e-5
)

no_warmup = naive_train_classification_model(
    net,
    criteria,
    max_epochs,
    train_dataloader,
    val_dataloader,
    DEVICE,
    optimizer,
    scheduler,
    verbose=True,
)

Epoch: 1/500 Train Loss: 2.2178 Accuracy: 0.2067 Time: 3.59992  | Val Loss: 1.9592 Accuracy: 0.3120
Epoch: 2/500 Train Loss: 2.0401 Accuracy: 0.2692 Time: 3.62594  | Val Loss: 1.8889 Accuracy: 0.3595
Epoch: 3/500 Train Loss: 2.0095 Accuracy: 0.2852 Time: 3.66299  | Val Loss: 1.8547 Accuracy: 0.3617
Epoch: 4/500 Train Loss: 1.9932 Accuracy: 0.2947 Time: 3.47408  | Val Loss: 1.8284 Accuracy: 0.3777
Epoch: 5/500 Train Loss: 1.9751 Accuracy: 0.3043 Time: 3.54693  | Val Loss: 1.8417 Accuracy: 0.3712
Epoch: 6/500 Train Loss: 1.9617 Accuracy: 0.3129 Time: 3.62829  | Val Loss: 1.8122 Accuracy: 0.3892
Epoch: 7/500 Train Loss: 1.9416 Accuracy: 0.3221 Time: 3.75675  | Val Loss: 1.8119 Accuracy: 0.3830
Epoch: 8/500 Train Loss: 1.9299 Accuracy: 0.3302 Time: 3.73331  | Val Loss: 1.7764 Accuracy: 0.4050
Epoch: 9/500 Train Loss: 1.9148 Accuracy: 0.3344 Time: 3.45934  | Val Loss: 1.7344 Accuracy: 0.4242
Epoch: 10/500 Train Loss: 1.9057 Accuracy: 0.3405 Time: 3.41721  | Val Loss: 1.7087 Accuracy: 0.4431
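
For reference, the learning-rate curve produced by `CosineAnnealingLR` in the cell above can be sketched as follows. This assumes the scheduler is stepped once per epoch (how `naive_train_classification_model` steps it is not shown here) and uses a dummy parameter in place of the ViT. The closed form is η_t = η_min + ½(η_max − η_min)(1 + cos(π·t / T_max)).

```python
import math

import torch

max_epochs, base_lr, eta_min = 250, 1e-3, 1e-5


def cosine_lr(epoch: int) -> float:
    """Closed-form cosine annealing value at a given epoch."""
    cos_term = 1 + math.cos(math.pi * epoch / max_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


# Dummy parameter so that an optimizer and scheduler can be built.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=base_lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, max_epochs, eta_min=eta_min
)

lrs = []
for epoch in range(max_epochs + 1):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # no gradients here; keeps the step order PyTorch expects
    scheduler.step()

for epoch in (0, 125, 250):
    print(epoch, f"{lrs[epoch]:.6f}", f"{cosine_lr(epoch):.6f}")
```

The run starts at the full learning rate of 1e-3 from the very first epoch, which is the behavior the warm-up variant in the next cell changes.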

In [5]:
from hdd.train.warmup_scheduler import GradualWarmupScheduler
from hdd.models.transformer.vit import ViT
from hdd.train.classification_utils import naive_train_classification_model
from hdd.models.nn_utils import count_trainable_parameter

net = ViT(
    num_classes=10,
    image_size=32,
    patch_size=8,
    embed_dim=384,
    n_heads=12,
    diff_dim=384,
    dropout=0.0,
    num_layers=7,
).to(DEVICE)
print(f"#Parameter: {count_trainable_parameter(net)}")
criteria = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.Adam(
    net.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=5e-5
)

base_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, max_epochs, eta_min=1e-5
)
scheduler = GradualWarmupScheduler(
    optimizer,
    multiplier=1.0,
    total_epoch=10,
    after_scheduler=base_scheduler,
)
warm_up = naive_train_classification_model(
    net,
    criteria,
    max_epochs,
    train_dataloader,
    val_dataloader,
    DEVICE,
    optimizer,
    scheduler,
    verbose=True,
)

#Parameter: 6306442
Epoch: 1/500 Train Loss: 3.0377 Accuracy: 0.0939 Time: 3.64167  | Val Loss: 3.1738 Accuracy: 0.0803
Epoch: 2/500 Train Loss: 2.1973 Accuracy: 0.2169 Time: 3.79464  | Val Loss: 1.9497 Accuracy: 0.3179
Epoch: 3/500 Train Loss: 2.0488 Accuracy: 0.2719 Time: 3.82010  | Val Loss: 1.9190 Accuracy: 0.3490
Epoch: 4/500 Train Loss: 1.9540 Accuracy: 0.3182 Time: 3.75939  | Val Loss: 1.7680 Accuracy: 0.4115
Epoch: 5/500 Train Loss: 1.8845 Accuracy: 0.3555 Time: 3.83421  | Val Loss: 1.7415 Accuracy: 0.4184
Epoch: 6/500 Train Loss: 1.8396 Accuracy: 0.3786 Time: 3.80307  | Val Loss: 1.6645 Accuracy: 0.4621
Epoch: 7/500 Train Loss: 1.8113 Accuracy: 0.3914 Time: 3.91323  | Val Loss: 1.6416 Accuracy: 0.4726
Epoch: 8/500 Train Loss: 1.7870 Accuracy: 0.4013 Time: 3.52725  | Val Loss: 1.6124 Accuracy: 0.4789
Epoch: 9/500 Train Loss: 1.7705 Accuracy: 0.4137 Time: 3.71879  | Val Loss: 1.5908 Accuracy: 0.4971
Epoch: 10/500 Train Loss: 1.7639 Accuracy: 0.4116 Time: 3.87672  | Val Loss: 1.6

KeyboardInterrupt: 
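
A comparable warm-up schedule can also be sketched with PyTorch's built-in schedulers. This is not the `GradualWarmupScheduler` from `hdd` used above, whose internals are not shown in this notebook. It swaps in `LinearLR` and `SequentialLR` instead, and `start_factor=0.1` is an assumed starting point for the ramp.

```python
import torch

max_epochs, warmup_epochs = 250, 10

# Dummy parameter standing in for the model's parameters.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-3)

# Linear ramp from 10% of the base LR to the full base LR over the warm-up.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_epochs
)
# Cosine decay over the remaining epochs.
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, max_epochs - warmup_epochs, eta_min=1e-5
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs]
)

lrs = []
for epoch in range(max_epochs):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # no gradients here; keeps the step order PyTorch expects
    scheduler.step()

print([f"{lr:.2e}" for lr in lrs[: warmup_epochs + 1]])
```

The learning rate rises over the first 10 epochs, peaks at the base value, and then follows the cosine decay. The idea is to avoid large updates while Adam's moment estimates and the attention layers are still poorly calibrated.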