# Introduction to PyTorch

## Why do we use deep learning frameworks?

* In this class, we want you to be ready to use one of these frameworks for your project so you can experiment more efficiently than if you were writing every feature you want to use by hand. 
* We want you to stand on the shoulders of giants! PyTorch is an excellent framework that will make your life a lot easier, and now that you understand its guts, you are free to use it :) 
* Finally, we want you to be exposed to the sort of deep learning code you might run into in academia or industry.

## What is PyTorch?

PyTorch is a system for executing dynamic computational graphs over Tensor objects that behave similarly to NumPy ndarrays. It comes with a powerful automatic differentiation engine that removes the need for manual back-propagation. 

## How do I learn PyTorch?
You can find the detailed [API doc](http://pytorch.org/docs/stable/index.html) here. If you have other questions that are not addressed by the API docs, the [PyTorch forum](https://discuss.pytorch.org/) is a much better place to ask than StackOverflow.

# Table of Contents

This assignment has 5 parts. You will learn PyTorch on **three different levels of abstraction**, which will help you understand it better and prepare you for the final project. 

1. Part I, Preparation: we will use the CIFAR-10 dataset.
2. Part II, Barebones PyTorch: **Abstraction level 1**, we will work directly with the lowest-level PyTorch Tensors. 
3. Part III, PyTorch Module API: **Abstraction level 2**, we will use `nn.Module` to define arbitrary neural network architectures. 
4. Part IV, PyTorch Sequential API: **Abstraction level 3**, we will use `nn.Sequential` to define a linear feed-forward network very conveniently. 
5. Part V, CIFAR-10 open-ended challenge: please implement your own network to get as high accuracy as possible on CIFAR-10. You can experiment with any layer, optimizer, hyperparameters or other advanced features. 

Here is a table of comparison:

| API           | Flexibility | Convenience |
|---------------|-------------|-------------|
| Barebone      | High        | Low         |
| `nn.Module`     | High        | Medium      |
| `nn.Sequential` | Low         | High        |

# GPU

If you have a GPU device, then switch your kernel to it.

In [5]:
import os
import torch

import torch.nn as nn
import torch.optim as optim
import torchvision.datasets as dset
import torchvision.transforms as T
from torch.utils.data import DataLoader, sampler

# Try to speed up the download via the Tsinghua mirror.
# Note: torchvision may not read this environment variable, so it may have no effect.
os.environ['TORCHVISION_CIFAR10_URL'] = 'https://mirrors.tuna.tsinghua.edu.cn/pytorch/datasets/CIFAR-10/'

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() and torch.cuda.device_count() > 0 else 'cpu')
print(f'Using device: {device}')

# Preprocessing with data augmentation
transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
    # Per-channel mean and standard deviation commonly used for CIFAR-10
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

def create_dataloader(dataset, sampler=None, shuffle=False):
    """Factory function that builds a DataLoader with shared settings."""
    return DataLoader(
        dataset,
        batch_size=64,
        sampler=sampler,
        shuffle=shuffle,
        num_workers=4,                # tune to the CPU core count (suggested: physical cores / 2)
        pin_memory=True,              # page-locked host memory for faster transfers to CUDA
        persistent_workers=True,      # keep worker processes alive between epochs
        prefetch_factor=2             # each worker prefetches 2 batches
    )

# Initialize the datasets; on failure, print manual download instructions
try:
    cifar10_train = dset.CIFAR10('./cifar10', train=True, download=True, transform=transform)
    cifar10_val = dset.CIFAR10('./cifar10', train=True, download=False, transform=transform)
    cifar10_test = dset.CIFAR10('./cifar10', train=False, download=True, transform=transform)
except Exception:
    print("Automatic download failed; please download manually: https://mirrors.tuna.tsinghua.edu.cn/pytorch/datasets/CIFAR-10/cifar-10-python.tar.gz")
    print("After downloading, extract it into the './cifar10' directory")
    raise  # the datasets are undefined at this point, so do not continue

# Create the data loaders
NUM_TRAIN = 49000
loader_train = create_dataloader(
    cifar10_train,
    sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN))
)

loader_val = create_dataloader(
    cifar10_val,
    sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000))
)

loader_test = create_dataloader(cifar10_test)

# Example model definition
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        ).to(device)  # move the layers to the selected device

    def forward(self, x):
        return self.net(x.to(device))  # make sure the input is on the same device

# Optimizer configuration
model = CNN()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Using device: cuda


# Part I. Preparation

Now, let's load the CIFAR-10 dataset. This might take a couple minutes the first time you do it, but the files should stay cached after that.

In previous parts of the assignment we had to write our own code to download the CIFAR-10 dataset, preprocess it, and iterate through it in minibatches; PyTorch provides convenient tools to automate this process for us.

# Part II. Barebones PyTorch

PyTorch ships with high-level APIs to help us define model architectures conveniently, which we will cover in Part III of this tutorial. In this section, we will start with the barebones PyTorch elements to understand the autograd engine better. After this exercise, you will come to appreciate the high-level model API more.

We will start with a simple two-layer fully-connected ReLU network (one hidden layer) with no biases for CIFAR classification. 
This implementation computes the forward pass using operations on PyTorch Tensors, and uses PyTorch autograd to compute gradients. It is important that you understand every line, because you will write a harder version after the example.

When we create a PyTorch Tensor with `requires_grad=True`, then operations involving that Tensor will not just compute values; they will also build up a computational graph in the background, allowing us to easily backpropagate through the graph to compute gradients of a downstream loss with respect to some Tensors. Concretely, if `x` is a Tensor with `x.requires_grad == True`, then after backpropagation `x.grad` will be another Tensor holding the gradient of the scalar loss at the end with respect to `x`.
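
As a minimal sketch of this behaviour (the tensor values and the squared-sum loss are chosen purely for illustration), the snippet below tracks one small tensor and checks the gradient that autograd produces:

```python
import torch

# A leaf tensor that autograd should track.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Every operation on x is recorded in the computational graph.
loss = (x ** 2).sum()   # scalar loss: 1 + 4 + 9

# Backpropagate from the scalar loss to every tracked leaf tensor.
loss.backward()

# d(loss)/dx = 2 * x
print(x.grad)           # tensor([2., 4., 6.])
```

Here `x.grad` has the same shape as `x`, which is what lets us use it directly in a parameter update later on.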

### PyTorch Tensors: Flatten Function
A PyTorch Tensor is conceptually similar to a numpy array: it is an n-dimensional grid of numbers, and like numpy, PyTorch provides many functions to efficiently operate on Tensors. As a simple example, we provide a `flatten` function below which reshapes image data for use in a fully-connected neural network.

Recall that image data is typically stored in a Tensor of shape N x C x H x W, where:

* N is the number of datapoints
* C is the number of channels
* H is the height of the intermediate feature map in pixels
* W is the width of the intermediate feature map in pixels

This is the right way to represent the data when we are doing something like a 2D convolution, which needs a spatial understanding of where the intermediate features are relative to each other. When we use fully connected affine layers to process the image, however, we want each datapoint to be represented by a single vector -- it's no longer useful to segregate the different channels, rows, and columns of the data. So, we use a "flatten" operation to collapse the `C x H x W` values per representation into a single long vector. The flatten function below first reads the batch size N from a given batch of data, and then returns a "view" of that data. `view` is analogous to numpy's `reshape` method: it reshapes x's dimensions to be N x ??, where ?? is allowed to be anything (in this case, it will be C x H x W, but we don't need to specify that explicitly). 

In [9]:
def flatten(x):
    N = x.shape[0]  # read in N, the batch size
    return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image

def test_flatten():
    x = torch.arange(12).view(2, 1, 3, 2)
    print('Before flattening: ', x)
    print('After flattening: ', flatten(x))

test_flatten()

Before flattening:  tensor([[[[ 0,  1],
          [ 2,  3],
          [ 4,  5]]],


        [[[ 6,  7],
          [ 8,  9],
          [10, 11]]]])
After flattening:  tensor([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11]])


### Barebones PyTorch: Two-Layer Network

Here we define a function `two_layer_fc` which performs the forward pass of a two-layer fully-connected ReLU network on a batch of image data. After defining the forward pass we check that it doesn't crash and that it produces outputs of the right shape by running zeros through the network.

You don't have to write any code here, but it's important that you read and understand the implementation.

In [11]:
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, sampler
from torchvision import datasets, transforms

# Global configuration (explicit device index)
dtype = torch.float32
device = torch.device(f"cuda:{torch.cuda.current_device()}" if torch.cuda.is_available() else "cpu")

# Force CUDA context initialization (works around lazy initialization)
if torch.cuda.is_available():
    torch.cuda.set_device(device)
    torch.cuda.synchronize()  # make sure the device state is synchronized
    _ = torch.zeros(1, device=device)  # small allocation to warm up GPU memory

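# Preprocessing with data augmentation.
# Samples stay on the CPU here and are moved to `device` inside the training and
# evaluation loops: moving tensors to CUDA inside DataLoader worker processes is
# error-prone and conflicts with pin_memory=True.
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    # Per-channel mean and standard deviation commonly used for CIFAR-10
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])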

def create_dataloader(dataset, sampler=None, shuffle=False):
    """Factory function that builds a DataLoader with shared settings."""
    return DataLoader(
        dataset,
        batch_size=64,
        sampler=sampler,
        shuffle=shuffle,
        num_workers=4,
        pin_memory=True,
        persistent_workers=True,
        prefetch_factor=2
    )

# Load the datasets
try:
    cifar10_train = datasets.CIFAR10('./cifar10', train=True, download=True, transform=transform)
    cifar10_val = datasets.CIFAR10('./cifar10', train=True, download=False, transform=transform)
    cifar10_test = datasets.CIFAR10('./cifar10', train=False, download=True, transform=transform)
except Exception as e:
    print(f"Failed to load the data: {e}")
    print("Manual download URL: https://mirrors.tuna.tsinghua.edu.cn/pytorch/datasets/CIFAR-10/cifar-10-python.tar.gz")
    raise  # re-raise instead of calling exit(), which would kill a notebook kernel

# Initialize the data loaders
NUM_TRAIN = 49000
loader_train = create_dataloader(cifar10_train, sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))
loader_val = create_dataloader(cifar10_val, sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))
loader_test = create_dataloader(cifar10_test)

def flatten(x):
    """Flatten each example into a vector (with a device check)."""
    assert x.device.type == device.type and x.device.index == device.index, "input is on the wrong device"
    return x.view(x.size(0), -1)

def two_layer_fc(x, params):
    """Forward pass of a two-layer fully-connected network (with sanity checks)."""
    w1, w2 = params
    
    # Check both device and dtype
    assert w1.device.type == device.type and w1.device.index == device.index, \
        f"w1 device mismatch: {w1.device} vs {device}"
    assert w2.device.type == device.type and w2.device.index == device.index, \
        f"w2 device mismatch: {w2.device} vs {device}"
    assert x.dtype == w1.dtype == w2.dtype == dtype, "dtype mismatch"
    
    # Mixed-precision forward pass
    with torch.amp.autocast(device_type='cuda', dtype=torch.float16):
        x = flatten(x)
        x = F.relu(x.mm(w1))
        x = x.mm(w2)
    return x

def two_layer_fc_test():
    """Test function with device and memory reporting."""
    print("\n=== Test start ===")
    print(f"Current device: {device} (type: {device.type}, index: {device.index})")
    print(f"Initial GPU memory: {torch.cuda.memory_allocated()/1024**2:.2f}MB")
    
    # Create the tensors on the global device
    hidden_layer_size = 42
    x = torch.zeros((64, 50), dtype=dtype, device=device)
    w1 = torch.zeros((50, hidden_layer_size), dtype=dtype, device=device)
    w2 = torch.zeros((hidden_layer_size, 10), dtype=dtype, device=device)
    
    # Report the parameter state
    print("\nParameter check:")
    print(f"x device: {x.device} | dtype: {x.dtype} | shape: {x.shape}")
    print(f"w1 device: {w1.device} | dtype: {w1.dtype} | shape: {w1.shape}")
    print(f"w2 device: {w2.device} | dtype: {w2.dtype} | shape: {w2.shape}")
    
    # Check that the dimensions line up
    assert x.shape[1] == w1.shape[0], f"input dimension mismatch: {x.shape} vs {w1.shape}"
    assert w1.shape[1] == w2.shape[0], f"hidden dimension mismatch: {w1.shape} vs {w2.shape}"
    
    # Forward pass
    scores = two_layer_fc(x, [w1, w2])
    print("\nOutput check:")
    print(f"Output size: {scores.size()} (expected [64, 10])")
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated()/1024**2:.2f}MB")
    
    # Release memory
    del x, w1, w2, scores
    torch.cuda.empty_cache()
    print(f"\nGPU memory after release: {torch.cuda.memory_allocated()/1024**2:.2f}MB")
    print("=== Test passed ===")

if __name__ == "__main__":
    # Environment check
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"Detected {torch.cuda.device_count()} GPU(s)")
    if torch.cuda.is_available():
        print(f"Current GPU: {torch.cuda.get_device_name(0)}")
        print(f"Compute capability: {torch.cuda.get_device_capability(0)}")
    
    # Run the test
    two_layer_fc_test()

PyTorch version: 2.7.0+cu126
CUDA available: True
Detected 1 GPU(s)
Current GPU: NVIDIA GeForce RTX 3060 Laptop GPU
Compute capability: (8, 6)

=== Test start ===
Current device: cuda:0 (type: cuda, index: 0)
Initial GPU memory: 4.09MB

Parameter check:
x device: cuda:0 | dtype: torch.float32 | shape: torch.Size([64, 50])
w1 device: cuda:0 | dtype: torch.float32 | shape: torch.Size([50, 42])
w2 device: cuda:0 | dtype: torch.float32 | shape: torch.Size([42, 10])

Output check:
Output size: torch.Size([64, 10]) (expected [64, 10])
Peak GPU memory: 12.25MB

GPU memory after release: 12.21MB
=== Test passed ===


### Barebones PyTorch: Three-Layer ConvNet

Here you will complete the implementation of the function `three_layer_convnet`, which will perform the forward pass of a three-layer convolutional network. Like above, we can immediately test our implementation by passing zeros through the network. The network should have the following architecture:

1. A convolutional layer (with bias) with `channel_1` filters, each with shape `KW1 x KH1`, and zero-padding of two
2. ReLU nonlinearity
3. A convolutional layer (with bias) with `channel_2` filters, each with shape `KW2 x KH2`, and zero-padding of one
4. ReLU nonlinearity
5. Fully-connected layer with bias, producing scores for C classes.

Note that we have **no softmax activation** here after our fully-connected layer: this is because PyTorch's cross entropy loss performs a softmax activation for you, and bundling that step into the loss makes the computation more efficient.
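
To make the bundling concrete, here is a small sketch (the random scores and labels are made up for illustration) showing that `F.cross_entropy` on raw scores gives the same value as an explicit log-softmax followed by the negative log-likelihood loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(4, 10)          # raw, unnormalized class scores (logits)
y = torch.tensor([3, 0, 9, 1])       # integer class labels

# cross_entropy consumes the raw scores directly ...
loss_fused = F.cross_entropy(scores, y)

# ... and matches an explicit log-softmax followed by negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(scores, dim=1), y)

print(torch.allclose(loss_fused, loss_manual))   # True
```

This is why the network itself should return scores rather than probabilities: applying a softmax in the model and then calling `F.cross_entropy` would normalize twice.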

**HINT**: For convolutions: http://pytorch.org/docs/stable/nn.html#torch.nn.functional.conv2d; pay attention to the shapes of convolutional filters!
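
As a sketch of the shape convention, `F.conv2d` expects filters laid out as `(C_out, C_in, KH, KW)`. The helper `conv_out_size` below is a hypothetical name introduced only for this example; it applies the standard output-size formula for a convolution without dilation.

```python
import torch
import torch.nn.functional as F

def conv_out_size(size, kernel, padding, stride=1):
    """Spatial output size of a convolution: (size + 2*padding - kernel) // stride + 1."""
    return (size + 2 * padding - kernel) // stride + 1

x = torch.zeros(2, 3, 32, 32)        # (N, C_in, H, W)
w = torch.zeros(6, 3, 5, 5)          # (C_out, C_in, KH, KW)
b = torch.zeros(6)                   # one bias per output channel

out = F.conv2d(x, w, b, padding=2)
print(out.shape)                     # torch.Size([2, 6, 32, 32])
print(conv_out_size(32, kernel=5, padding=2))   # 32
print(conv_out_size(32, kernel=3, padding=1))   # 32
```

With 5x5 filters and padding 2, or 3x3 filters and padding 1, the spatial size stays at 32x32, which determines the input dimension of the fully-connected layer.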

In [13]:
def three_layer_convnet(x, params):
    """
    Performs the forward pass of a three-layer convolutional network with the
    architecture defined above.

    Inputs:
    - x: A PyTorch Tensor of shape (N, 3, H, W) giving a minibatch of images
    - params: A list of PyTorch Tensors giving the weights and biases for the
      network; should contain the following:
      - conv_w1: PyTorch Tensor of shape (channel_1, 3, KH1, KW1) giving weights
        for the first convolutional layer
      - conv_b1: PyTorch Tensor of shape (channel_1,) giving biases for the first
        convolutional layer
      - conv_w2: PyTorch Tensor of shape (channel_2, channel_1, KH2, KW2) giving
        weights for the second convolutional layer
      - conv_b2: PyTorch Tensor of shape (channel_2,) giving biases for the second
        convolutional layer
      - fc_w: PyTorch Tensor giving weights for the fully-connected layer. Can you
        figure out what the shape should be?
      - fc_b: PyTorch Tensor giving biases for the fully-connected layer. Can you
        figure out what the shape should be?
    
    Returns:
    - scores: PyTorch Tensor of shape (N, C) giving classification scores for x

    Hint: use flatten(x) instead of x.flatten() if needed.
    """
    conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b = params
    scores = None
    ################################################################################
    # TODO: Implement the forward pass for the three-layer ConvNet.                #
    ################################################################################
    # *****START OF YOUR CODE *****

    x = F.conv2d(x, conv_w1, conv_b1, padding=2)
    x = F.relu(x)
    x = F.conv2d(x, conv_w2, conv_b2, padding=1)
    x = F.relu(x)
    x = flatten(x)
    scores = x.mm(fc_w) + fc_b

    # *****END OF YOUR CODE *****
    return scores

After defining the forward pass of the ConvNet above, run the following cell to test your implementation.

When you run this function, scores should have shape (64, 10).

In [15]:
import torch
import torch.nn.functional as F

# Global device configuration (explicit device index)
dtype = torch.float32
device = torch.device(f"cuda:{torch.cuda.current_device()}" if torch.cuda.is_available() else "cpu")

# Force CUDA context initialization (works around lazy initialization)
if torch.cuda.is_available():
    torch.cuda.set_device(device)
    torch.cuda.synchronize()

def flatten(x):
    """Flatten each example into a vector (with a device check)."""
    assert x.device.type == device.type and x.device.index == device.index, \
        f"input device mismatch: tensor on {x.device} vs global device {device}"
    return x.view(x.size(0), -1)

def three_layer_convnet(x, params):
    """
    Three-layer convolutional network (with a device consistency check).
    Parameter order: [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b]
    """
    # Check that every parameter lives on the global device
    for i, param in enumerate(params):
        assert param.device.type == device.type and param.device.index == device.index, \
            f"parameter {i} device mismatch: {param.device} vs {device}"
    
    # Unpack the parameters
    conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b = params
    
    # First convolutional layer
    x = F.conv2d(x, conv_w1, conv_b1, padding=2)  # 5x5 filters with padding 2 keep the 32x32 size
    x = F.relu(x)
    
    # Second convolutional layer
    x = F.conv2d(x, conv_w2, conv_b2, padding=1)  # 3x3 filters with padding 1 keep the feature-map size
    x = F.relu(x)
    
    # Flatten
    x = flatten(x)
    
    # Fully-connected layer
    scores = x.mm(fc_w) + fc_b
    return scores

def three_layer_convnet_test():
    """Test function with device and memory reporting."""
    print(f"Global device: {device} (type: {device.type}, index: {device.index})")
    
    # Create the test input on the global device
    x = torch.zeros((64, 3, 32, 32), dtype=dtype, device=device)
    
    # Convolutional layer parameters
    conv_w1 = torch.zeros((6, 3, 5, 5), dtype=dtype, device=device)
    conv_b1 = torch.zeros((6,), dtype=dtype, device=device)
    conv_w2 = torch.zeros((9, 6, 3, 3), dtype=dtype, device=device)
    conv_b2 = torch.zeros((9,), dtype=dtype, device=device)
    
    # Fully-connected layer parameters
    fc_w = torch.zeros((9 * 32 * 32, 10), dtype=dtype, device=device)
    fc_b = torch.zeros(10, dtype=dtype, device=device)
    
    # Report device state
    print(f"Input device: {x.device} | shape: {x.shape}")
    print(f"conv_w1 device: {conv_w1.device} | shape: {conv_w1.shape}")
    print(f"fc_w device: {fc_w.device} | shape: {fc_w.shape}")
    
    # Forward pass
    scores = three_layer_convnet(x, [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b])
    print(f"Output size check: {scores.size()} (expected [64, 10])")
    
    # GPU memory monitoring
    print(f"Peak GPU memory: {torch.cuda.max_memory_allocated()/1024**2:.2f}MB")

if __name__ == "__main__":
    # Environment check
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"Current GPU: {torch.cuda.get_device_name(0)}")
    
    # Run the test
    three_layer_convnet_test()

PyTorch version: 2.7.0+cu126
CUDA available: True
Current GPU: NVIDIA GeForce RTX 3060 Laptop GPU
Global device: cuda:0 (type: cuda, index: 0)
Input device: cuda:0 | shape: torch.Size([64, 3, 32, 32])
conv_w1 device: cuda:0 | shape: torch.Size([6, 3, 5, 5])
fc_w device: cuda:0 | shape: torch.Size([9216, 10])
Output size check: torch.Size([64, 10]) (expected [64, 10])
Peak GPU memory: 18.19MB


### Barebones PyTorch: Initialization
Let's write a couple utility methods to initialize the weight matrices for our models.

- `random_weight(shape)` initializes a weight tensor with the Kaiming normal initialization method.
- `zero_weight(shape)` initializes a weight tensor with all zeros. Useful for instantiating bias parameters.

The `random_weight` function uses the Kaiming normal initialization method, described in:

He et al, *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*, ICCV 2015, https://arxiv.org/abs/1502.01852
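
As a rough sanity check of what Kaiming normal initialization means, the sketch below (layer sizes chosen arbitrarily for illustration) draws weights from a zero-mean Gaussian with standard deviation `sqrt(2 / fan_in)` and compares the empirical spread with the target:

```python
import math
import torch

torch.manual_seed(0)
fan_in, fan_out = 1024, 512          # illustrative layer sizes

# Kaiming normal: zero-mean Gaussian with std = sqrt(2 / fan_in).
target_std = math.sqrt(2.0 / fan_in)
w = torch.randn(fan_in, fan_out) * target_std

print(f"target std: {target_std:.4f}, empirical std: {w.std().item():.4f}")
```

The factor of 2 compensates for ReLU zeroing out roughly half of the activations, which keeps the scale of the signal roughly constant from layer to layer.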

In [17]:
import torch

# Global configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float32  # single dtype used throughout

def random_weight(shape):
    """
    Create a weight tensor with Kaiming normal initialization.
    Implemented in pure PyTorch (no numpy dependency); the tensor is created
    directly on the global device with the global dtype.
    """
    if len(shape) == 2:  # fully-connected layer: [in_features, out_features]
        fan_in = shape[0]
    else:  # convolutional layer: [out_ch, in_ch, kH, kW]
        # fan_in = in_ch * kH * kW
        fan_in = torch.prod(torch.tensor(shape[1:])).item()  
    
    # Scale standard-normal samples by sqrt(2 / fan_in)
    w = torch.randn(shape, device=device, dtype=dtype) * torch.sqrt(torch.tensor(2. / fan_in))
    w.requires_grad = True
    return w

def zero_weight(shape):
    """Create a zero-initialized tensor (useful for biases)."""
    return torch.zeros(shape, device=device, dtype=dtype, requires_grad=True)

# Test code
if __name__ == "__main__":
    weight = random_weight((3, 5))
    print(f"Weight tensor device: {weight.device}")
    print(f"Weight tensor dtype: {weight.dtype}")
    print(f"Gradient tracking: {weight.requires_grad}")

Weight tensor device: cuda:0
Weight tensor dtype: torch.float32
Gradient tracking: True


### Barebones PyTorch: Check Accuracy
When training the model we will use the following function to check the accuracy of our model on the training or validation sets.

When checking accuracy we don't need to compute any gradients; as a result we don't need PyTorch to build a computational graph for us when we compute scores. To prevent a graph from being built we scope our computation under a `torch.no_grad()` context manager.
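
A small sketch of the effect (the tensors are made up for illustration): the same computation produces a result that is attached to the graph outside the context manager and detached from it inside.

```python
import torch

w = torch.ones(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

tracked = (w * x).sum()              # recorded in the computational graph
with torch.no_grad():
    untracked = (w * x).sum()        # same value, but no graph is recorded

print(tracked.requires_grad, untracked.requires_grad)   # True False
```

Skipping the graph saves memory and time during evaluation, since no intermediate values need to be kept for a backward pass.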

In [19]:
def check_accuracy_part2(loader, model_fn, params):
    """
    Check the accuracy of a classification model.
    
    Inputs:
    - loader: A DataLoader for the data split we want to check
    - model_fn: A function that performs the forward pass of the model,
      with the signature scores = model_fn(x, params)
    - params: List of PyTorch Tensors giving parameters of the model
    
    Returns: Nothing, but prints the accuracy of the model
    """
    split = 'val' if loader.dataset.train else 'test'
    print('Checking accuracy on the %s set' % split)
    num_correct, num_samples = 0, 0
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.int64)
            scores = model_fn(x, params)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))

### BareBones PyTorch: Training Loop
We can now set up a basic training loop to train our network. We will train the model using stochastic gradient descent without momentum. We will use `torch.nn.functional.cross_entropy` to compute the loss; you can [read about it here](http://pytorch.org/docs/stable/nn.html#cross-entropy).

The training loop takes as input the neural network function, a list of initialized parameters (`[w1, w2]` in our example), and learning rate.
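
The core of such a loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the assignment's reference implementation: `sgd_step` and `toy_two_layer_fc` are hypothetical names, and a fixed batch of random data stands in for CIFAR-10 minibatches.

```python
import torch
import torch.nn.functional as F

def sgd_step(model_fn, params, x, y, learning_rate):
    """One step of plain SGD (no momentum) on a single minibatch."""
    scores = model_fn(x, params)
    loss = F.cross_entropy(scores, y)
    loss.backward()                      # fills w.grad for every parameter

    # Update in place without recording the update in the graph.
    with torch.no_grad():
        for w in params:
            w -= learning_rate * w.grad
            w.grad.zero_()               # gradients accumulate, so reset them
    return loss.item()

def toy_two_layer_fc(x, params):
    """Stand-in model function with the signature scores = model_fn(x, params)."""
    w1, w2 = params
    return F.relu(x.mm(w1)).mm(w2)

# Synthetic data standing in for one minibatch.
torch.manual_seed(0)
x = torch.randn(64, 50)
y = torch.randint(0, 10, (64,))
w1 = (torch.randn(50, 42) * (2.0 / 50) ** 0.5).requires_grad_()
w2 = (torch.randn(42, 10) * (2.0 / 42) ** 0.5).requires_grad_()

losses = [sgd_step(toy_two_layer_fc, [w1, w2], x, y, learning_rate=1e-2)
          for _ in range(20)]
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

A full training loop would wrap this step in a loop over `loader_train`, move each batch to `device`, and periodically call the accuracy check on the validation loader.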



### BareBones PyTorch: Train a Two-Layer Network
Now we are ready to run the training loop. We need to explicitly allocate tensors for the fully connected weights, `w1` and `w2`. 

Each minibatch of CIFAR has 64 examples, so the tensor shape is `[64, 3, 32, 32]`. 

After flattening, `x` shape should be `[64, 3 * 32 * 32]`. This will be the size of the first dimension of `w1`. 
The second dimension of `w1` is the hidden layer size, which will also be the first dimension of `w2`. 

Finally, the output of the network is a 10-dimensional vector of unnormalized scores (logits) over the 10 classes; the cross-entropy loss turns these scores into a probability distribution. 

You don't need to tune any hyperparameters but you should see accuracies above 40% after training for one epoch.
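
The shape bookkeeping above can be checked with a short sketch. The hidden layer size of 256 is an arbitrary value picked for illustration, and zeros are used only to verify shapes; real training would use `random_weight`.

```python
import torch
import torch.nn.functional as F

hidden_layer_size = 256                # hypothetical value for illustration
x = torch.zeros(64, 3, 32, 32)         # one CIFAR-10 minibatch
w1 = torch.zeros(3 * 32 * 32, hidden_layer_size)
w2 = torch.zeros(hidden_layer_size, 10)

x_flat = x.view(x.shape[0], -1)        # [64, 3072]
hidden = F.relu(x_flat.mm(w1))         # [64, hidden_layer_size]
scores = hidden.mm(w2)                 # [64, 10]

print(x_flat.shape, hidden.shape, scores.shape)
```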

### BareBones PyTorch: Training a ConvNet

Below, you should use the functions defined above to train a three-layer convolutional network on CIFAR. The network should have the following architecture:

1. Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2
2. ReLU
3. Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1
4. ReLU
5. Fully-connected layer (with bias) to compute scores for 10 classes

You should initialize your weight matrices using the `random_weight` function defined above, and you should initialize your bias vectors using the `zero_weight` function above.

You don't need to tune any hyperparameters, but if everything works correctly you should achieve an accuracy above 42% after one epoch.
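
For orientation, the parameter shapes implied by this architecture can be sketched as below. Zeros are used purely to check shapes on a small batch; this is an assumption-laden illustration, and the actual training run should create the weights with `random_weight` and the biases with `zero_weight`.

```python
import torch
import torch.nn.functional as F

channel_1, channel_2 = 32, 16

conv_w1 = torch.zeros(channel_1, 3, 5, 5)            # 32 filters of size 5x5 over 3 input channels
conv_b1 = torch.zeros(channel_1)
conv_w2 = torch.zeros(channel_2, channel_1, 3, 3)    # 16 filters of size 3x3 over 32 channels
conv_b2 = torch.zeros(channel_2)
fc_w = torch.zeros(channel_2 * 32 * 32, 10)          # padding preserves the 32x32 spatial size
fc_b = torch.zeros(10)

x = torch.zeros(4, 3, 32, 32)                        # small batch, shapes only
h = F.relu(F.conv2d(x, conv_w1, conv_b1, padding=2))
h = F.relu(F.conv2d(h, conv_w2, conv_b2, padding=1))
scores = h.view(h.shape[0], -1).mm(fc_w) + fc_b
print(scores.shape)                                  # torch.Size([4, 10])
```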

# Part III. PyTorch Module API

Barebone PyTorch requires that we track all the parameter tensors by hand. This is fine for small networks with a few tensors, but it would be extremely inconvenient and error-prone to track tens or hundreds of tensors in larger networks.

PyTorch provides the `nn.Module` API for you to define arbitrary network architectures, while tracking every learnable parameter for you. In Part II, we implemented SGD ourselves. PyTorch also provides the `torch.optim` package that implements all the common optimizers, such as RMSProp, Adagrad, and Adam. It even supports approximate second-order methods like L-BFGS! You can refer to the [doc](http://pytorch.org/docs/master/optim.html) for the exact specifications of each optimizer.

To use the Module API, follow the steps below:

1. Subclass `nn.Module`. Give your network class an intuitive name like `TwoLayerFC`. 

2. In the constructor `__init__()`, define all the layers you need as class attributes. Layer objects like `nn.Linear` and `nn.Conv2d` are themselves `nn.Module` subclasses and contain learnable parameters, so that you don't have to instantiate the raw tensors yourself. `nn.Module` will track these internal parameters for you. Refer to the [doc](http://pytorch.org/docs/master/nn.html) to learn more about the dozens of builtin layers. **Warning**: don't forget to call the `super().__init__()` first!

3. In the `forward()` method, define the *connectivity* of your network. You should use the attributes defined in `__init__` as function calls that take tensor as input and output the "transformed" tensor. Do *not* create any new layers with learnable parameters in `forward()`! All of them must be declared upfront in `__init__`. 

After you define your Module subclass, you can instantiate it as an object and call it just like the NN forward function in part II.
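
The three steps can be sketched with a deliberately tiny module. `TinyNet` and its sizes are hypothetical and chosen only for illustration; the point is that `nn.Module` registers the parameters and `torch.optim` performs the update that Part II wrote by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class TinyNet(nn.Module):              # hypothetical name for illustration
    def __init__(self, in_features, num_classes):
        super().__init__()             # must run before layers are assigned
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.fc(x)

torch.manual_seed(0)
model = TinyNet(8, 3)

# nn.Module registered fc.weight and fc.bias automatically.
names = [name for name, _ in model.named_parameters()]
print(names)                           # ['fc.weight', 'fc.bias']

# torch.optim replaces the hand-written SGD update.
optimizer = optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 8), torch.randint(0, 3, (16,))
before = model.fc.weight.detach().clone()

loss = F.cross_entropy(model(x), y)
optimizer.zero_grad()                  # clear gradients from any previous step
loss.backward()
optimizer.step()                       # update every registered parameter
```

Calling `model(x)` invokes `forward` through `nn.Module.__call__`, so the instance is used just like the forward functions in Part II.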

### Module API: Two-Layer Network
Here is a concrete example of a 2-layer fully connected network:

In [25]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu in forward()

# Global device definition (must be declared before any function uses it)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.float32

class TwoLayerFC(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        # Create the layers directly on the target device
        self.fc1 = nn.Linear(input_size, hidden_size).to(device)
        self.fc2 = nn.Linear(hidden_size, num_classes).to(device)
        
        # Kaiming normal initialization of the weights
        nn.init.kaiming_normal_(self.fc1.weight, mode='fan_out', nonlinearity='relu')
        nn.init.kaiming_normal_(self.fc2.weight, mode='fan_out', nonlinearity='relu')
    
    def forward(self, x):
        # Flatten with the built-in view method instead of the external flatten helper
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        scores = self.fc2(x)
        return scores

def test_TwoLayerFC():
    input_size = 50
    # Create the input explicitly on the device
    x = torch.zeros((64, input_size), dtype=dtype, device=device)
    model = TwoLayerFC(input_size, 42, 10).to(device)  # move the model explicitly
    
    # Device consistency check (for debugging)
    print(f"Input device: {x.device}")
    print(f"Model parameter device: {next(model.parameters()).device}")
    
    scores = model(x)
    print(scores.size())  # expected: torch.Size([64, 10])

if __name__ == "__main__":
    test_TwoLayerFC()

Input device: cuda:0
Model parameter device: cuda:0
torch.Size([64, 10])


### Module API: Three-Layer ConvNet
It's your turn to implement a 3-layer ConvNet: two convolutional layers followed by a fully connected layer. The network architecture should be the same as in Part II:

1. Convolutional layer with `channel_1` 5x5 filters with zero-padding of 2
2. ReLU
3. Convolutional layer with `channel_2` 3x3 filters with zero-padding of 1
4. ReLU
5. Fully-connected layer to `num_classes` classes

You should initialize the weight matrices of the model using the Kaiming normal initialization method.

**HINT**: http://pytorch.org/docs/stable/nn.html#conv2d

After you implement the three-layer ConvNet, the `test_ThreeLayerConvNet` function will run your implementation; it should print `(64, 10)` for the shape of the output scores.

In [27]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# 新增全局设备配置（必须在所有函数前声明）
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.float32

class ThreeLayerConvNet(nn.Module):
    def __init__(self, in_channel, channel_1, channel_2, num_classes):
        super().__init__()
        # Convolutional layers: Kaiming-normal weights, zero biases.
        # Layers are created on the CPU; callers move the whole model with .to(device).
        self.conv1 = nn.Conv2d(in_channel, channel_1, kernel_size=5, padding=2)
        nn.init.kaiming_normal_(self.conv1.weight, mode='fan_out', nonlinearity='relu')
        nn.init.zeros_(self.conv1.bias)
        
        self.conv2 = nn.Conv2d(channel_1, channel_2, kernel_size=3, padding=1)
        nn.init.kaiming_normal_(self.conv2.weight, mode='fan_out', nonlinearity='relu')
        nn.init.zeros_(self.conv2.bias)
        
        # Fully connected layer; the padding above preserves the 32x32 spatial size
        self.fc = nn.Linear(channel_2 * 32 * 32, num_classes)
        nn.init.kaiming_normal_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        # Flatten to (N, C*H*W) with view instead of a custom flatten helper
        x = x.view(x.size(0), -1)
        return self.fc(x)

def test_ThreeLayerConvNet():
    # Create the input explicitly on the target device
    x = torch.zeros((64, 3, 32, 32), dtype=dtype, device=device)
    
    # Move the model to the same device as the input
    model = ThreeLayerConvNet(in_channel=3, channel_1=12, channel_2=8, num_classes=10).to(device)
    
    scores = model(x)
    print(f"Output size check: {scores.size()} (expected torch.Size([64, 10]))")

if __name__ == "__main__":
    test_ThreeLayerConvNet()

Output size check: torch.Size([64, 10]) (expected torch.Size([64, 10]))


### Module API: Check Accuracy
Given the validation or test set, we can check the classification accuracy of a neural network. 

This version is slightly different from the one in Part II: you no longer pass in the parameters manually.
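
The core of the accuracy check is turning a matrix of class scores into predicted labels and comparing them with the ground truth. Here is a small sketch of that step in isolation; the scores and labels are made-up values for illustration.

```python
import torch

# Hypothetical scores for 4 samples and 3 classes.
scores = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.1, 3.0],
                       [0.3, 1.5, 0.2],
                       [1.0, 0.9, 0.8]])
y = torch.tensor([0, 2, 0, 0])          # made-up ground-truth labels

_, preds = scores.max(1)                # index of the largest score in each row
num_correct = (preds == y).sum().item() # count matching predictions as a Python int
acc = num_correct / y.size(0)
print(preds.tolist(), num_correct, acc)
```

`scores.max(1)` returns both the maximum values and their indices; only the indices are needed as predictions.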

In [29]:
def check_accuracy_part34(loader, model):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')   
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))

### Module API: Training Loop
We also use a slightly different training loop. Rather than updating the values of the weights ourselves, we use an Optimizer object from the `torch.optim` package, which abstracts the notion of an optimization algorithm and provides implementations of most of the algorithms commonly used to optimize neural networks.
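
To make the optimizer's role concrete, here is a minimal sketch of the zero-grad, backward, step cycle on a toy regression problem. The model, data, and learning rate are made up for illustration and are unrelated to CIFAR-10.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(3, 1)                        # toy model, illustrative only
optimizer = optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 3)
y = x.sum(dim=1, keepdim=True)                 # a target the model can fit exactly

losses = []
for _ in range(50):
    loss = F.mse_loss(model(x), y)
    optimizer.zero_grad()    # clear gradients left over from the previous step
    loss.backward()          # populate .grad on every parameter
    optimizer.step()         # apply the SGD update to the parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The optimizer only needs the parameters it was constructed with; it reads their `.grad` fields and updates them in place.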

In [31]:
def train_part34(model, optimizer, num_workers=0, epochs=1):
    """
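    Train a model on CIFAR-10 using the PyTorch Module API.
    
    Inputs:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - num_workers: (Optional) Number of DataLoader worker processes; 0 loads
      data in the main process
    - epochs: (Optional) A Python integer giving the number of epochs to train for
    
    Returns: Nothing, but prints model accuracies during training.
    """
    # Build the training loader here so the caller can control num_workers.
    # `dataset_train` must already exist as a global (it is expected to be the
    # CIFAR-10 training set prepared in Part I); otherwise this raises a NameError.
    loader_train = DataLoader(
        dataset_train,
        batch_size=64,
        num_workers=num_workers,
        pin_memory=(device.type == 'cuda'),  # pinned memory speeds up CPU-to-GPU copies
    )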
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    for e in range(epochs):
        for t, (x, y) in enumerate(loader_train):
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # This is the backwards pass: compute the gradient of the loss with
            # respect to each  parameter of the model.
            loss.backward()

            # Actually update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()

            if t % print_every == 0:
                print('Iteration %d, loss = %.4f' % (t, loss.item()))
                check_accuracy_part34(loader_val, model)
                print()

### Module API: Train a Two-Layer Network
Now we are ready to run the training loop. In contrast to part II, we don't explicitly allocate parameter tensors anymore.

Simply pass the input size, hidden layer size, and number of classes (i.e. output size) to the constructor of `TwoLayerFC`. 

You also need to define an optimizer that tracks all the learnable parameters inside `TwoLayerFC`.

You don't need to tune any hyperparameters, but you should see model accuracies above 40% after training for one epoch.

In [33]:
# Training call
if __name__ == '__main__':  # guard required for multiprocessing on Windows
    hidden_layer_size = 4000
    learning_rate = 1e-2
    
    # Move the model to the device first, then create the optimizer
    model = TwoLayerFC(3 * 32 * 32, hidden_layer_size, 10).to(device)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    
    # Force single-process data loading (Windows compatibility)
    train_part34(model, optimizer, num_workers=0)

NameError: name 'dataset_train' is not defined

### Module API: Train a Three-Layer ConvNet
You should now use the Module API to train a three-layer ConvNet on CIFAR-10. This should look very similar to training the two-layer network! You don't need to tune any hyperparameters, but you should achieve above 45% accuracy after training for one epoch.

You should train the model using stochastic gradient descent without momentum.

In [None]:
learning_rate = 3e-3
channel_1 = 32
channel_2 = 16

model = None
optimizer = None
################################################################################
# TODO: Instantiate your ThreeLayerConvNet model and a corresponding optimizer #
################################################################################
# *****START OF YOUR CODE *****

# Instantiate the three-layer ConvNet (3 input channels, 32 and 16 filters, 10 classes)
model = ThreeLayerConvNet(in_channel=3,
                          channel_1=channel_1,
                          channel_2=channel_2,
                          num_classes=10)
# Plain SGD without momentum, as the instructions above require
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

# *****END OF YOUR CODE *****

train_part34(model, optimizer)

# Part IV. PyTorch Sequential API

Part III introduced the PyTorch Module API, which allows you to define arbitrary learnable layers and their connectivity. 

For simple models like a stack of feed-forward layers, you still need to go through three steps: subclass `nn.Module`, assign layers to class attributes in `__init__`, and call each layer one by one in `forward()`. Is there a more convenient way? 

Fortunately, PyTorch provides a container Module called `nn.Sequential`, which merges the above steps into one. It is not as flexible as `nn.Module`, because you cannot specify more complex topology than a feed-forward stack, but it's good enough for many use cases.
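
As a small illustration of how the three steps collapse into one, the sketch below builds the same tiny network both ways and checks that they compute the same function once the weights are shared. The class `TinyMLP` and the layer sizes are hypothetical and chosen only for this comparison.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMLP(nn.Module):
    """Hypothetical two-layer network written with the Module API."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

module_model = TinyMLP(8, 16, 4)

# The same stack as a Sequential container: no subclass and no forward() needed.
seq_model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

# Copy the weights so that both models compute the same function.
seq_model[0].load_state_dict(module_model.fc1.state_dict())
seq_model[2].load_state_dict(module_model.fc2.state_dict())

x = torch.randn(5, 8)
same = torch.allclose(module_model(x), seq_model(x))
print(same)
```

Layers inside `nn.Sequential` are addressed by position (`seq_model[0]`) rather than by attribute name.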

### Sequential API: Two-Layer Network
Let's see how to rewrite our two-layer fully connected network example with `nn.Sequential`, and train it using the training loop defined above.

Again, you don't need to tune any hyperparameters here, but you should achieve above 40% accuracy after one epoch of training.

In [None]:
# We need to wrap `flatten` function in a module in order to stack it
# in nn.Sequential
class Flatten(nn.Module):
    def forward(self, x):
        return flatten(x)

hidden_layer_size = 4000
learning_rate = 1e-2

model = nn.Sequential(
    Flatten(),
    nn.Linear(3 * 32 * 32, hidden_layer_size),
    nn.ReLU(),
    nn.Linear(hidden_layer_size, 10),
)

# you can use Nesterov momentum in optim.SGD
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                     momentum=0.9, nesterov=True)

train_part34(model, optimizer)

### Sequential API: Three-Layer ConvNet
Here you should use `nn.Sequential` to define and train a three-layer ConvNet with the same architecture we used in Part III:

1. Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2
2. ReLU
3. Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1
4. ReLU
5. Fully-connected layer (with bias) to compute scores for 10 classes

You can use the default PyTorch weight initialization.

You should optimize your model using stochastic gradient descent with Nesterov momentum 0.9.

Again, you don't need to tune any hyperparameters but you should see accuracy above 55% after one epoch of training.
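
For intuition about what `nesterov=True` changes, here is a rough scalar sketch of the Nesterov momentum update in the form described in the `torch.optim.SGD` documentation (with zero dampening and no weight decay). The function name `sgd_nesterov_step` and the quadratic test problem are assumptions for illustration.

```python
def sgd_nesterov_step(p, grad, buf, lr=1e-2, momentum=0.9):
    """One Nesterov-momentum update on a scalar parameter."""
    buf = momentum * buf + grad        # update the velocity buffer
    step = grad + momentum * buf       # look-ahead gradient used by Nesterov
    return p - lr * step, buf

# Minimize f(p) = p**2, whose gradient is 2 * p.
p, buf = 5.0, 0.0
for _ in range(500):
    p, buf = sgd_nesterov_step(p, 2.0 * p, buf)
print(p)
```

With plain momentum the step would be `buf` alone; Nesterov adds the current gradient on top of the scaled buffer.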

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Custom flatten layer so the reshape can be stacked inside nn.Sequential
class Flatten(nn.Module):
    def forward(self, x):
        return x.view(x.size(0), -1)

channel_1 = 32
channel_2 = 16
learning_rate = 1e-2

model = nn.Sequential(
    # Conv layer 1: 3 input channels -> 32 filters of size 5x5
    nn.Conv2d(3, channel_1, kernel_size=5, padding=2),
    nn.ReLU(),
    # Conv layer 2: 32 channels -> 16 filters of size 3x3
    nn.Conv2d(channel_1, channel_2, kernel_size=3, padding=1),
    nn.ReLU(),
    Flatten(),  # flatten layer replaces a manual reshape
    # Fully connected layer; the padding above keeps the feature maps at 32x32
    nn.Linear(channel_2 * 32 * 32, 10)
)

# SGD with Nesterov momentum 0.9, as specified above
optimizer = optim.SGD(model.parameters(), lr=learning_rate,
                      momentum=0.9, nesterov=True)

# Train with the loop defined in Part III
train_part34(model, optimizer)

# Part V. CIFAR-10 open-ended challenge

In this section, you can experiment with whatever ConvNet architecture you'd like on CIFAR-10. 

Now it's your job to experiment with architectures, hyperparameters, loss functions, and optimizers to train a model that achieves **at least 70%** accuracy on the CIFAR-10 **validation** set within 10 epochs. You can use the `check_accuracy_part34` and `train_part34` functions from above. You can use either the `nn.Module` or the `nn.Sequential` API. 

Describe what you did at the end of this notebook.

Here is the official API documentation for each component. One note: what we call "spatial batch norm" in class is called `BatchNorm2d` in PyTorch.

* Layers in torch.nn package: http://pytorch.org/docs/stable/nn.html
* Activations: http://pytorch.org/docs/stable/nn.html#non-linear-activations
* Loss functions: http://pytorch.org/docs/stable/nn.html#loss-functions
* Optimizers: http://pytorch.org/docs/stable/optim.html


### Things you might try:
- **Filter size**: Above we used 5x5; would smaller filters be more efficient?
- **Number of filters**: Above we used 32 filters. Do more or fewer do better?
- **Pooling vs Strided Convolution**: Do you use max pooling or just strided convolutions?
- **Batch normalization**: Try adding spatial batch normalization after convolution layers and vanilla batch normalization after affine layers. Do your networks train faster?
- **Network architecture**: The network above has three layers of trainable parameters. Can you do better with a deeper network? Good architectures to try include:
    - [conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
    - [conv-relu-conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
    - [batchnorm-relu-conv]xN -> [affine]xM -> [softmax or SVM]
- **Global Average Pooling**: Instead of flattening and then having multiple affine layers, perform convolutions until your image gets small (7x7 or so) and then perform an average pooling operation to get a 1x1 feature map of shape (1, 1, Filter#), which is then reshaped into a (Filter#) vector. This is used in [Google's Inception Network](https://arxiv.org/abs/1512.00567) (see Table 1 for their architecture).
- **Regularization**: Add L2 weight regularization, or perhaps use Dropout.
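
As one example from the list above, global average pooling can be sketched in a few lines. The feature-map shape (64 images, 128 channels, 7x7) is an assumed example; `nn.AdaptiveAvgPool2d(1)` is one of several ways to express the operation.

```python
import torch
import torch.nn as nn

features = torch.randn(64, 128, 7, 7)    # (batch, channels, height, width)

gap = nn.AdaptiveAvgPool2d(1)            # average each channel down to 1x1
pooled = gap(features)                   # shape (64, 128, 1, 1)
vector = pooled.flatten(1)               # shape (64, 128), one value per filter

# The same result as taking the mean over the two spatial dimensions.
matches_mean = torch.allclose(vector, features.mean(dim=(2, 3)), atol=1e-6)
print(tuple(vector.shape), matches_mean)
```

The resulting vector can feed a single `nn.Linear` layer, which usually needs far fewer parameters than a linear layer on the flattened feature map.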

### Tips for training
For each network architecture that you try, you should tune the learning rate and other hyperparameters. When doing this, there are a couple of important things to keep in mind:

- If the hyperparameters are working well, you should see improvement within a few hundred iterations.
- Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.
- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.
- You should use the validation set for hyperparameter search, and save your test set for evaluating your architecture on the best parameters as selected by the validation set.
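
The coarse-to-fine approach can be sketched as two rounds of random sampling on a log scale. The ranges, the number of candidates, and the assumed "best" learning rate below are illustrative choices; in practice each candidate would be passed to a short training run and scored on the validation set.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)

# Coarse stage: wide ranges, each candidate trained only briefly.
coarse = [(sample_log_uniform(1e-5, 1e-1, rng),    # learning rate
           sample_log_uniform(1e-6, 1e-2, rng))    # weight decay
          for _ in range(10)]

# Fine stage: suppose the coarse runs favored a learning rate near 1e-3
# (an assumed outcome); narrow the range around it and train for longer.
best_lr = 1e-3
fine = [sample_log_uniform(best_lr / 3, best_lr * 3, rng) for _ in range(10)]

print(coarse[:2])
print(fine[:2])
```

Sampling on a log scale spreads candidates evenly across orders of magnitude, which suits hyperparameters such as learning rate and weight decay.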


In [None]:
################################################################################
# TODO:                                                                        #         
# Experiment with any architectures, optimizers, and hyperparameters.          #
# Achieve AT LEAST 70% accuracy on the *validation set* within 10 epochs.      #
#                                                                              #
# Note that you can use check_accuracy_part34 to evaluate on either the test   #
# set or the validation set, by passing either loader_test or loader_val as    #
# the first argument. You should not touch the test set until you have         #
# finished your architecture and hyperparameter tuning, and only run the       #
# test set once at the end to report a final value.                            #
################################################################################
model = None
optimizer = None

# *****START OF YOUR CODE *****

# Deeper ConvNet with BatchNorm and Dropout
model = nn.Sequential(
    # Conv block 1: [3x32x32] -> [64x32x32]
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    
    # Conv block 2: [64x32x32] -> [128x32x32]
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    
    # Max pooling: [128x32x32] -> [128x16x16]
    nn.MaxPool2d(2),
    
    # Conv block 3: [128x16x16] -> [256x16x16]
    nn.Conv2d(128, 256, kernel_size=3, padding=1),
    nn.BatchNorm2d(256),
    nn.ReLU(),
    
    # Max pooling: [256x16x16] -> [256x8x8]
    nn.MaxPool2d(2),
    
    # Flatten features: [256x8x8] -> 16384
    Flatten(),
    
    # Fully connected layers with dropout
    nn.Dropout(0.3),
    nn.Linear(256 * 8 * 8, 1024),
    nn.ReLU(),
    
    nn.Dropout(0.3),
    nn.Linear(1024, 512),
    nn.ReLU(),
    
    nn.Linear(512, 10)
)

# Adam with a reduced learning rate and L2 weight decay
optimizer = optim.Adam(model.parameters(),
                       lr=0.0005,
                       weight_decay=1e-3)

# *****END OF YOUR CODE *****

# You should get at least 70% accuracy.
# You may modify the number of epochs to any number below 15.
train_part34(model, optimizer, epochs=10)

## Test set -- run this only once

Now that we've gotten a result we're happy with, we evaluate our final model (which you should store in `best_model`) on the test set. Think about how this compares to your validation set accuracy.

In [None]:
best_model = model  # the network trained in the cell above
check_accuracy_part34(loader_test, best_model)

### Let's Go Further!!

Don't stop at 70% accuracy. Design and train a model that achieves the best accuracy you can, say above 95%. 

### Going above and beyond
If you are feeling adventurous there are many other features you can implement to try and improve your performance. You are **not required** to implement any of these, but don't miss the fun if you have time!

- Alternative optimizers: you can try Adam, Adagrad, RMSprop, etc.
- Alternative activation functions such as leaky ReLU, parametric ReLU, ELU, or MaxOut.
- Model ensembles
- Data augmentation
- New Architectures
  - [ResNets](https://arxiv.org/abs/1512.03385) where the input from the previous layer is added to the output.
  - [DenseNets](https://arxiv.org/abs/1608.06993) where inputs into previous layers are concatenated together.
  - [This blog has an in-depth overview](https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32)
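
To show what "the input is added to the output" looks like in code, here is a simplified sketch of a residual block. The class name `ResidualBlock` and the layer choices are assumptions for illustration and are not taken from the papers' reference implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 conv layers whose output is added to the block's input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)      # skip connection: add the input back

block = ResidualBlock(16)
x = torch.randn(4, 16, 32, 32)
out = block(x)
print(tuple(out.shape))
```

Because the skip connection needs matching shapes, this block keeps both the channel count and the spatial size unchanged; a DenseNet-style block would use `torch.cat` along the channel dimension instead of addition.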

### Have fun and happy training! 

In [None]:
model = None
optimizer = None

# *****START OF YOUR CODE *****

# Deeper ConvNet with BatchNorm and Dropout
model = nn.Sequential(
    # Conv block 1: [3x32x32] -> [64x32x32]
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    
    # Conv block 2: [64x32x32] -> [128x32x32]
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    
    # Max pooling: [128x32x32] -> [128x16x16]
    nn.MaxPool2d(2),
    
    # Conv block 3: [128x16x16] -> [256x16x16]
    nn.Conv2d(128, 256, kernel_size=3, padding=1),
    nn.BatchNorm2d(256),
    nn.ReLU(),
    
    # Max pooling: [256x16x16] -> [256x8x8]
    nn.MaxPool2d(2),
    
    # Flatten features: [256x8x8] -> 16384
    Flatten(),
    
    # Fully connected layers with dropout
    nn.Dropout(0.3),
    nn.Linear(256 * 8 * 8, 1024),
    nn.ReLU(),
    
    nn.Dropout(0.3),
    nn.Linear(1024, 512),
    nn.ReLU(),
    
    nn.Linear(512, 10)
)

# Adam optimizer with L2 regularization (weight decay)
optimizer = optim.Adam(model.parameters(),
                       lr=0.001,
                       weight_decay=1e-3)

# *****END OF YOUR CODE *****

# Feel free to modify the number of epochs, or any other thing you like.
train_part34(model, optimizer, epochs=20)

print('Accuracy on test set:')
check_accuracy_part34(loader_test, model)

print('\n\nAccuracy on val set:')
check_accuracy_part34(loader_val, model)