*Accompanying code examples of the book "Introduction to Artificial Neural Networks and Deep Learning: A Practical Guide with Applications in Python" by [Sebastian Raschka](https://sebastianraschka.com). All code examples are released under the [MIT license](https://github.com/rasbt/deep-learning-book/blob/master/LICENSE). If you find this content useful, please consider supporting the work by buying a [copy of the book](https://leanpub.com/ann-and-deeplearning).*
  
Other code examples and content are available on [GitHub](https://github.com/rasbt/deep-learning-book). The PDF and ebook versions of the book are available through [Leanpub](https://leanpub.com/ann-and-deeplearning).

In [7]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p torch

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
Author: Sebastian Raschka

Python implementation: CPython
Python version       : 3.11.11
IPython version      : 9.0.2

torch: 2.6.0+cu126



# Model Zoo -- ResNet-34 CIFAR-10 Classifier with Pinned Memory

This is an example notebook comparing the speed of model training with and without page-locked memory.

Page-locked memory can be enabled by setting `pin_memory=True` in PyTorch's `DataLoader` class (it is disabled by default).

In theory, pinning the memory should speed up data transfer by minimizing the transfer cost between the CPU and the CUDA device; hence, enabling `pin_memory=True` should make model training faster by a small margin.

> Host (CPU) data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory, as illustrated below... (Source: https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/)
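
As a minimal sketch of what the quoted passage describes, the snippet below pins a CPU tensor explicitly with `Tensor.pin_memory()` and then copies it to the GPU. The batch shape mirrors the CIFAR-10 batches used later in this notebook. Note that `non_blocking=True` is an assumption added for illustration and is not used in this notebook's training loops; it requests an asynchronous copy, which can only overlap with other work when the source tensor is pinned. On a machine without CUDA the transfer is skipped.

```python
import torch

# A CPU batch with the same shape as the CIFAR-10 batches in this notebook
batch = torch.randn(256, 3, 32, 32)

if torch.cuda.is_available():
    # Copy the batch into page-locked (pinned) host memory
    pinned = batch.pin_memory()
    assert pinned.is_pinned()

    # With a pinned source, the host-to-device copy can run asynchronously
    on_gpu = pinned.to("cuda:0", non_blocking=True)
    torch.cuda.synchronize()  # wait until the copy has finished
    print(on_gpu.device)
else:
    print("CUDA not available; skipping the pinned-memory transfer")
```

Setting `pin_memory=True` in a `DataLoader` performs the pinning step automatically for every batch it returns.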

After the model preamble, this notebook is divided into two subsections, "Training without Pinned Memory" and "Training with Pinned Memory", to investigate whether there is a noticeable difference in training time when toggling `pin_memory` on and off.

### Network Architecture

The network in this notebook is an implementation of the ResNet-34 [1] architecture, trained on the CIFAR-10 dataset [2] as an image classifier.

References

- [1] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778). ([CVPR Link](https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html))
- [2] https://www.cs.toronto.edu/~kriz/cifar.html

![](../images/resnets/resnet34/resnet34-arch.png)

The following figure illustrates a residual block with a skip connection in which the input passed via the shortcut already matches the dimensions of the main path's output, which allows the network to learn identity functions.

![](../images/resnets/resnet-ex-1-1.png)

The ResNet-34 architecture actually uses residual blocks with skip connections in which the input passed via the shortcut is resized to match the dimensions of the main path's output. Such a residual block is illustrated below:

![](../images/resnets/resnet-ex-1-2.png)

For a more detailed explanation see the other notebook, [resnet-ex-1.ipynb](resnet-ex-1.ipynb).
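
To make the "resizing" of the shortcut concrete, here is a small sketch based on the standard convolution output-size formula. It assumes the layer settings used in the model code of this notebook: 3x3 convolutions with padding 1 on the main path, and a 1x1 convolution with the same stride and no padding on the shortcut. The helper name `conv_out_size` is made up for this illustration.

```python
def conv_out_size(n, kernel_size, stride, padding):
    """Spatial output size of a convolution for an input of size n."""
    return (n + 2 * padding - kernel_size) // stride + 1


for n in (32, 16, 8, 4):
    # Main path of a downsampling block: 3x3 conv (stride 2), then 3x3 conv (stride 1)
    main = conv_out_size(conv_out_size(n, 3, 2, 1), 3, 1, 1)
    # Shortcut path: 1x1 conv with stride 2 and no padding
    shortcut = conv_out_size(n, 1, 2, 0)
    print(n, main, shortcut)
    assert main == shortcut
```

Because both paths halve the spatial size, the two tensors can be added elementwise at the end of the block.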

## Imports

In [8]:
import os
import time

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

from torchvision import datasets
from torchvision import transforms

import matplotlib.pyplot as plt
from PIL import Image


if torch.cuda.is_available():
    torch.backends.cudnn.deterministic = True
    torch.cuda.set_per_process_memory_fraction(0.5, device=0)

## Model Settings

In [9]:
##########################
### SETTINGS
##########################

# Hyperparameters
RANDOM_SEED = 1  # random seed for reproducibility
LEARNING_RATE = 0.001
BATCH_SIZE = 256
NUM_EPOCHS = 10

# Architecture
NUM_FEATURES = 32*32  # pixels per channel of a 32x32 CIFAR-10 image (not used by the ResNet below)
NUM_CLASSES = 10  # CIFAR-10 has 10 classes

# Other
DEVICE = "cuda:0"  # the first CUDA device
GRAYSCALE = False  # False: RGB input (3 channels); True: grayscale input (1 channel)

The following code cell, which implements the ResNet-34 architecture, is derived from the code provided at https://pytorch.org/docs/0.4.0/_modules/torchvision/models/resnet.html.

In [10]:
##########################
### MODEL
##########################


def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)  # no bias, since batch norm follows


class BasicBlock(nn.Module):
    expansion = 1  # ratio of the block's output channels to `planes`

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)  # first conv layer (may downsample)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)  # second conv layer (stride 1)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample  # optional projection for the shortcut
        self.stride = stride

    def forward(self, x):
        residual = x  # keep the input for the skip connection

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:  # resize the shortcut if needed
            residual = self.downsample(x)

        out += residual  # skip connection
        out = self.relu(out)

        return out


class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes, grayscale):
        self.inplanes = 64  # number of channels entering the first residual stage
        if grayscale:
            in_dim = 1  # grayscale input: 1 channel
        else:
            in_dim = 3  # RGB input: 3 channels
        super(ResNet, self).__init__()
        self.conv1 = nn.Conv2d(in_dim, 64, kernel_size=7, stride=2, padding=3,  # stem convolution
                               bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # Four stages of residual blocks
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AvgPool2d(7, stride=1)  # defined but not used in forward()
        self.fc = nn.Linear(512 * block.expansion, num_classes)  # classification layer

        # Initialize the parameters
        for m in self.modules():
            if isinstance(m, nn.Conv2d):  # He-style normal initialization for conv layers
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, (2. / n)**.5)
            elif isinstance(m, nn.BatchNorm2d):  # batch norm: scale 1, shift 0
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            # The shortcut needs a projection if the stride is not 1
            # or the number of input and output channels differ
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))  # first block of the stage
        self.inplanes = planes * block.expansion  # update the input channel count
        for i in range(1, blocks):  # remaining blocks of the stage
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        # For 32x32 CIFAR-10 inputs the feature map is already 1x1 at this point,
        # so average pooling is skipped; enable avgpool for larger input images
        # x = self.avgpool(x)

        x = x.view(x.size(0), -1)  # flatten to (batch_size, 512)
        logits = self.fc(x)
        probas = F.softmax(logits, dim=1)  # class-membership probabilities
        return logits, probas


def resnet34(num_classes):
    """Constructs a ResNet-34 model."""
    model = ResNet(block=BasicBlock, 
                   layers=[3, 4, 6, 3],  # residual blocks per stage in ResNet-34
                   num_classes=num_classes,
                   grayscale=GRAYSCALE)  # uses the global GRAYSCALE setting
    return model
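
The following arithmetic sketch shows where the "34" comes from and why the average pooling layer can be skipped for CIFAR-10. It assumes the configuration above (`layers=[3, 4, 6, 3]`, two 3x3 convolutions per `BasicBlock`) and the usual convention of not counting the 1x1 shortcut convolutions; the helper `out_size` is introduced only for this illustration.

```python
layers = [3, 4, 6, 3]      # residual blocks per stage, as passed to ResNet above
convs_per_block = 2        # each BasicBlock holds two 3x3 convolutions

# Stem convolution + block convolutions + final fully connected layer
depth = 1 + convs_per_block * sum(layers) + 1
print(depth)


def out_size(n, kernel_size, stride, padding):
    """Spatial output size of a conv/pool layer for an input of size n."""
    return (n + 2 * padding - kernel_size) // stride + 1


size = 32                          # CIFAR-10 height/width
size = out_size(size, 7, 2, 3)     # conv1
size = out_size(size, 3, 2, 1)     # maxpool
trace = [size]
for stride in (1, 2, 2, 2):        # layer1 .. layer4 (the first block sets the stride)
    size = out_size(size, 3, stride, 1)
    trace.append(size)
print(trace)
```

The remaining convolutions in each stage use stride 1 with padding 1 and therefore keep the spatial size, so the feature map entering the fully connected layer is 512 channels of size 1x1.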


## Training without Pinned Memory

In [11]:
##########################
### CIFAR-10 Dataset
##########################


# Note that transforms.ToTensor() scales input images to the 0-1 range
train_dataset = datasets.CIFAR10(root='data', 
                                 train=True,  # training split
                                 transform=transforms.ToTensor(),
                                 download=True)  # download the dataset if it is not present

test_dataset = datasets.CIFAR10(root='data', 
                                train=False,  # test split
                                transform=transforms.ToTensor())


train_loader = DataLoader(dataset=train_dataset, 
                          batch_size=BATCH_SIZE,
                          num_workers=8,  # number of worker processes for data loading
                          shuffle=True)  # reshuffle the training data every epoch

test_loader = DataLoader(dataset=test_dataset, 
                         batch_size=BATCH_SIZE,
                         num_workers=8,  # number of worker processes for data loading
                         shuffle=False)  # keep the test data in order


# Checking the dataset
for images, labels in train_loader:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break  # only inspect the first batch

# Checking the dataset again
for images, labels in train_loader:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break  # only inspect the first batch

Image batch dimensions: torch.Size([256, 3, 32, 32])
Image label dimensions: torch.Size([256])
Image batch dimensions: torch.Size([256, 3, 32, 32])
Image label dimensions: torch.Size([256])
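
As a quick sanity check on the loaders, the number of batches per epoch follows directly from the dataset and batch sizes. The sketch below assumes the standard CIFAR-10 split of 50,000 training and 10,000 test images and the `DataLoader` default of keeping the last, smaller batch.

```python
import math

BATCH_SIZE = 256
num_train, num_test = 50_000, 10_000  # CIFAR-10 split sizes

train_batches = math.ceil(num_train / BATCH_SIZE)
test_batches = math.ceil(num_test / BATCH_SIZE)
last_batch = num_train - (train_batches - 1) * BATCH_SIZE
print(train_batches, test_batches, last_batch)  # 196 40 80
```

The 196 training batches match the `Batch 0000/0196` counter in the training logs of this notebook.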


In [12]:
torch.manual_seed(RANDOM_SEED)

model = resnet34(NUM_CLASSES)
model.to(DEVICE)

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)  

In [13]:
def compute_accuracy(model, data_loader, device):
    correct_pred, num_examples = 0, 0  # running counts of correct predictions and examples
    for i, (features, targets) in enumerate(data_loader):
        
        features = features.to(device)  # move the inputs to the target device
        targets = targets.to(device)  # move the labels to the target device

        logits, probas = model(features)
        _, predicted_labels = torch.max(probas, 1)  # index of the most probable class
        num_examples += targets.size(0)
        correct_pred += (predicted_labels == targets).sum()
    return correct_pred.float()/num_examples * 100  # accuracy in percent


start_time = time.time()  # start of the timed run
for epoch in range(NUM_EPOCHS):
    
    model.train()  # training mode (affects batch norm)
    for batch_idx, (features, targets) in enumerate(train_loader):
        
        features = features.to(DEVICE)  # host-to-device copy of the inputs
        targets = targets.to(DEVICE)  # host-to-device copy of the labels
            
        ### FORWARD AND BACK PROP
        logits, probas = model(features)
        cost = F.cross_entropy(logits, targets)  # cross-entropy loss on the logits
        optimizer.zero_grad()  # clear gradients from the previous step
        
        cost.backward()  # compute gradients
        
        ### UPDATE MODEL PARAMETERS
        optimizer.step()
        
        ### LOGGING
        if not batch_idx % 150:  # log every 150 batches
            print('Epoch: %03d/%03d | Batch %04d/%04d | Cost: %.4f' 
                  % (epoch+1, NUM_EPOCHS, batch_idx, 
                     len(train_loader), cost))

    model.eval()  # evaluation mode
    with torch.set_grad_enabled(False):  # no gradients needed during inference; saves memory
        print('Epoch: %03d/%03d | Train: %.3f%%' % (
              epoch+1, NUM_EPOCHS, 
              compute_accuracy(model, train_loader, device=DEVICE)))  # training accuracy after this epoch
        
    print('Time elapsed: %.2f min' % ((time.time() - start_time)/60))  # cumulative time so far
    
print('Total Training Time: %.2f min' % ((time.time() - start_time)/60))


with torch.set_grad_enabled(False):  # no gradients needed during inference; saves memory
    print('Test accuracy: %.2f%%' % (compute_accuracy(model, test_loader, device=DEVICE)))
    
print('Total Time: %.2f min' % ((time.time() - start_time)/60))  # training plus test evaluation

Epoch: 001/010 | Batch 0000/0196 | Cost: 2.6471
Epoch: 001/010 | Batch 0150/0196 | Cost: 1.1742
Epoch: 001/010 | Train: 42.820%
Time elapsed: 0.13 min
Epoch: 002/010 | Batch 0000/0196 | Cost: 1.2281
Epoch: 002/010 | Batch 0150/0196 | Cost: 1.0295
Epoch: 002/010 | Train: 64.700%
Time elapsed: 0.24 min
Epoch: 003/010 | Batch 0000/0196 | Cost: 0.8946
Epoch: 003/010 | Batch 0150/0196 | Cost: 0.9104
Epoch: 003/010 | Train: 65.626%
Time elapsed: 0.36 min
Epoch: 004/010 | Batch 0000/0196 | Cost: 0.7971
Epoch: 004/010 | Batch 0150/0196 | Cost: 0.9031
Epoch: 004/010 | Train: 67.766%
Time elapsed: 0.47 min
Epoch: 005/010 | Batch 0000/0196 | Cost: 0.5787
Epoch: 005/010 | Batch 0150/0196 | Cost: 0.5928
Epoch: 005/010 | Train: 54.922%
Time elapsed: 0.58 min
Epoch: 006/010 | Batch 0000/0196 | Cost: 0.6084
Epoch: 006/010 | Batch 0150/0196 | Cost: 0.5809
Epoch: 006/010 | Train: 80.326%
Time elapsed: 0.70 min
Epoch: 007/010 | Batch 0000/0196 | Cost: 0.3953
Epoch: 007/010 | Batch 0150/0196 | Cost: 0.553
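
A side note on `compute_accuracy`: it takes the arg-max over `probas`, while the loss is computed on `logits` (`F.cross_entropy` applies the log-softmax internally). Since softmax preserves the ordering of its inputs, the predicted label is the same either way. The plain-Python sketch below illustrates this with made-up scores; the helper functions are written only for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


logits = [1.2, -0.3, 4.0, 0.7]  # made-up scores for illustration
probas = softmax(logits)

assert argmax(logits) == argmax(probas)   # same predicted class
assert abs(sum(probas) - 1.0) < 1e-12     # probabilities sum to one
```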

## Training with Pinned Memory

In [14]:
##########################
### CIFAR-10 Dataset
##########################

# Note that transforms.ToTensor() scales input images to the 0-1 range
train_dataset = datasets.CIFAR10(root='data', 
                                 train=True,  # training split
                                 transform=transforms.ToTensor(),
                                 download=True)  # download the dataset if it is not present

test_dataset = datasets.CIFAR10(root='data', 
                                train=False,  # test split
                                transform=transforms.ToTensor())


# Training data loader
# (num_workers is left at its default of 0 here, unlike the loaders in the previous section)
train_loader = DataLoader(dataset=train_dataset, 
                          batch_size=BATCH_SIZE,
                          pin_memory=True,  # return batches in page-locked memory
                          shuffle=True)  # reshuffle the training data every epoch

# Test data loader
test_loader = DataLoader(dataset=test_dataset, 
                         batch_size=BATCH_SIZE,
                         pin_memory=True,  # return batches in page-locked memory
                         shuffle=False)  # keep the test data in order

# Checking the dataset
for images, labels in train_loader:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break  # only inspect the first batch

# Checking the dataset again
for images, labels in train_loader:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break  # only inspect the first batch


Image batch dimensions: torch.Size([256, 3, 32, 32])
Image label dimensions: torch.Size([256])
Image batch dimensions: torch.Size([256, 3, 32, 32])
Image label dimensions: torch.Size([256])


In [15]:
torch.manual_seed(RANDOM_SEED)

model = resnet34(NUM_CLASSES)
model.to(DEVICE)

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)  

In [16]:
def compute_accuracy(model, data_loader, device):
    correct_pred, num_examples = 0, 0  # running counts of correct predictions and examples
    for i, (features, targets) in enumerate(data_loader):
            
        features = features.to(device)  # move the inputs to the target device
        targets = targets.to(device)  # move the labels to the target device

        logits, probas = model(features)  # unnormalized scores and softmax probabilities
        _, predicted_labels = torch.max(probas, 1)  # index of the most probable class
        num_examples += targets.size(0)
        correct_pred += (predicted_labels == targets).sum()
    return correct_pred.float()/num_examples * 100  # accuracy in percent

start_time = time.time()  # start of the timed run
for epoch in range(NUM_EPOCHS):
    
    model.train()  # training mode (affects batch norm)
    for batch_idx, (features, targets) in enumerate(train_loader):
        
        features = features.to(DEVICE)  # host-to-device copy of the (pinned) inputs
        targets = targets.to(DEVICE)  # host-to-device copy of the (pinned) labels
            
        ### FORWARD AND BACK PROP
        logits, probas = model(features)
        cost = F.cross_entropy(logits, targets)  # cross-entropy loss on the logits
        optimizer.zero_grad()  # clear gradients from the previous step
        
        cost.backward()  # compute gradients
        
        ### UPDATE MODEL PARAMETERS
        optimizer.step()
        
        ### LOGGING
        if not batch_idx % 150:  # log every 150 batches
            print('Epoch: %03d/%03d | Batch %04d/%04d | Cost: %.4f' 
                  % (epoch+1, NUM_EPOCHS, batch_idx, 
                     len(train_loader), cost))

    model.eval()  # evaluation mode
    with torch.set_grad_enabled(False):  # no gradients needed during inference; saves memory
        print('Epoch: %03d/%03d | Train: %.3f%%' % (
              epoch+1, NUM_EPOCHS, 
              compute_accuracy(model, train_loader, device=DEVICE)))  # training accuracy after this epoch
        
    print('Time elapsed: %.2f min' % ((time.time() - start_time)/60))  # cumulative time so far
    
print('Total Training Time: %.2f min' % ((time.time() - start_time)/60))


with torch.set_grad_enabled(False):  # no gradients needed during inference; saves memory
    print('Test accuracy: %.2f%%' % (compute_accuracy(model, test_loader, device=DEVICE)))
    
print('Total Time: %.2f min' % ((time.time() - start_time)/60))  # training plus test evaluation

Epoch: 001/010 | Batch 0000/0196 | Cost: 2.6471
Epoch: 001/010 | Batch 0150/0196 | Cost: 1.1742
Epoch: 001/010 | Train: 42.820%
Time elapsed: 0.12 min
Epoch: 002/010 | Batch 0000/0196 | Cost: 1.2281
Epoch: 002/010 | Batch 0150/0196 | Cost: 1.0295
Epoch: 002/010 | Train: 64.700%
Time elapsed: 0.24 min
Epoch: 003/010 | Batch 0000/0196 | Cost: 0.8946
Epoch: 003/010 | Batch 0150/0196 | Cost: 0.9104
Epoch: 003/010 | Train: 65.626%
Time elapsed: 0.36 min
Epoch: 004/010 | Batch 0000/0196 | Cost: 0.7971
Epoch: 004/010 | Batch 0150/0196 | Cost: 0.9031
Epoch: 004/010 | Train: 67.766%
Time elapsed: 0.49 min
Epoch: 005/010 | Batch 0000/0196 | Cost: 0.5787
Epoch: 005/010 | Batch 0150/0196 | Cost: 0.5928
Epoch: 005/010 | Train: 54.922%
Time elapsed: 0.61 min
Epoch: 006/010 | Batch 0000/0196 | Cost: 0.6084
Epoch: 006/010 | Batch 0150/0196 | Cost: 0.5809
Epoch: 006/010 | Train: 80.326%
Time elapsed: 0.73 min
Epoch: 007/010 | Batch 0000/0196 | Cost: 0.3953
Epoch: 007/010 | Batch 0150/0196 | Cost: 0.553

## Conclusions

Based on the training times without and with `pin_memory=True`, there doesn't seem to be a speed-up when using page-locked (or "pinned") memory; in fact, pinning the memory even slowed down the training slightly. (I reran the code in the opposite order, i.e., `pin_memory=True` first, and got the same results.)
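
To put a number on the difference, the cumulative `Time elapsed` values from the two logs above can be turned into per-epoch durations. This sketch uses only the six epochs for which both logs show complete timings; keep in mind that each value also includes the training-accuracy evaluation at the end of the epoch.

```python
# Cumulative "Time elapsed" values (minutes) copied from the logs above, epochs 1-6
no_pin = [0.13, 0.24, 0.36, 0.47, 0.58, 0.70]
pin = [0.12, 0.24, 0.36, 0.49, 0.61, 0.73]


def per_epoch(cumulative):
    """Turn cumulative timings into per-epoch durations."""
    previous = [0.0] + cumulative[:-1]
    return [round(t - p, 2) for t, p in zip(cumulative, previous)]


print(per_epoch(no_pin))
print(per_epoch(pin))

slowdown = (pin[-1] - no_pin[-1]) / no_pin[-1] * 100
print(f"{slowdown:.1f}% slower with pinned memory after 6 epochs")
```

The gap is a few hundredths of a minute over six epochs, which is close to the two-decimal resolution of the logged timings.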

This could be due to the relatively small dataset size, the batch size, and the hardware and software configuration that I was using:

In [17]:
%watermark -iv

pandas     : 2.2.3
PIL        : 11.1.0
matplotlib : 3.10.1
torchvision: 0.21.0+cu126
torch      : 2.6.0+cu126
numpy      : 1.26.4
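
One way to check whether the host-to-device copy is a bottleneck on a given machine is to time the transfer in isolation rather than the whole training loop. The following is a rough sketch under stated assumptions, not part of the original experiment: `time_h2d_copy` is a hypothetical helper, the batch is random data with the same shape as one CIFAR-10 batch (about 3 MB of `float32` values), and the function returns `None` when no CUDA device is available.

```python
import time

import torch


def time_h2d_copy(tensor, repeats=20):
    """Average seconds per host-to-device copy, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    tensor.to("cuda:0")              # warm-up (e.g., CUDA context creation)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        tensor.to("cuda:0")
    torch.cuda.synchronize()         # make sure all copies have finished
    return (time.perf_counter() - start) / repeats


batch = torch.randn(256, 3, 32, 32)  # same shape as one CIFAR-10 batch above
pageable_time = time_h2d_copy(batch)
pinned_time = time_h2d_copy(batch.pin_memory()) if torch.cuda.is_available() else None
print("pageable:", pageable_time, "pinned:", pinned_time)
```

If both timings are small compared with the time of one forward and backward pass, the choice of `pin_memory` is unlikely to change the overall training time much.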

