<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
<br>Chinese translation repository: <a href="https://github.com/GoatCsu/CN-LLMs-from-scratch.git">https://github.com/GoatCsu/CN-LLMs-from-scratch.git</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Appendix A: Introduction to PyTorch

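Many scientific computing libraries do not immediately support the newest Python release. When installing PyTorch, it is therefore advisable to use a Python version one or two releases older than the newest one. For example, if the latest release is Python 3.13, Python 3.11 or Python 3.12 is recommended.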

`pip install torch`

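Suppose your computer has a CUDA-compatible GPU. In that case, provided the Python environment you are using has the necessary dependencies (such as pip) installed, this command typically installs a PyTorch build with CUDA acceleration automatically.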


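This book uses PyTorch 2.4.0. To ensure compatibility with the book, it is recommended that you install that exact version with the following command: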
`pip install torch==2.4.0`

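It is also recommended that you visit the official PyTorch website and use the installation menu there to select the install command that fits your operating system.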

In [1]:
import sys
print(sys.version)

3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]


## A.1 What is PyTorch

In [2]:
import torch

print(torch.__version__)

2.8.0+cu126


In [3]:
print(torch.cuda.is_available())

False


<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A1/1.png" width="400px">

<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A1/2.png" width="300px">

<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A1/3.png" width="300px">

<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A1/4.png" width="500px">

<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A1/5.png" width="500px">

## A.2 Understanding tensors

<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A2/1.png" width="400px">

PyTorch adopts most of the NumPy array API and syntax for its tensor operations. If you are not familiar with NumPy, you can get up to speed with my article:
Scientific Computing in Python: Introduction to NumPy and Matplotlib >>> https://sebastianraschka.com/blog/2020/numpy-intro.html

### A.2.1 Scalars, vectors, matrices, and tensors

In [4]:
import torch
import numpy as np

# Create a 0D tensor (scalar) from a Python integer
tensor0d = torch.tensor(1)

# Create a 1D tensor (vector) from a Python list
tensor1d = torch.tensor([1, 2, 3])

# Create a 2D tensor from a nested Python list
tensor2d = torch.tensor([[1, 2],
                         [3, 4]])

# Create a 3D tensor from a nested Python list
tensor3d_1 = torch.tensor([[[1, 2], [3, 4]],
                           [[5, 6], [7, 8]]])

# Create a tensor from a NumPy array
ary3d = np.array([[[1, 2], [3, 4]],
                  [[5, 6], [7, 8]]])
tensor3d_2 = torch.tensor(ary3d)  # Copies the NumPy array
tensor3d_3 = torch.from_numpy(ary3d)  # Shares memory with the NumPy array

In [5]:
print(tensor0d)
print(tensor1d)
print(tensor2d)
print(tensor3d_1)
print(tensor3d_2)
print(tensor3d_3)

tensor(1)
tensor([1, 2, 3])
tensor([[1, 2],
        [3, 4]])
tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])
tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])
tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])


In [6]:
ary3d[0, 0, 0] = 999
print(tensor3d_2)  # Unchanged, because torch.tensor copied the data

tensor([[[1, 2],
         [3, 4]],

        [[5, 6],
         [7, 8]]])


In [7]:
print(tensor3d_3)  # Changed, because it shares memory with the NumPy array

tensor([[[999,   2],
         [  3,   4]],

        [[  5,   6],
         [  7,   8]]])
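
The copy-versus-share behavior shown in the cells above can also be checked directly. The following is a minimal sketch, assuming NumPy's `np.shares_memory` helper and a small illustrative array named `ary` (a name that does not appear in the original notebook):

```python
import numpy as np
import torch

ary = np.array([1, 2, 3])       # hypothetical example array

copied = torch.tensor(ary)      # torch.tensor copies the data
shared = torch.from_numpy(ary)  # torch.from_numpy reuses the NumPy buffer

# Tensor.numpy() returns a NumPy view of a CPU tensor's memory,
# so np.shares_memory can compare the underlying buffers
print(np.shares_memory(ary, copied.numpy()))
print(np.shares_memory(ary, shared.numpy()))

# Modifying the array only affects the tensor that shares its memory
ary[0] = 999
print(copied)
print(shared)
```

Sharing memory avoids a copy, but it also means that in-place changes to either object are visible through the other.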


### A.2.2 Tensor data types

In [8]:
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d.dtype)  # Check the data type; PyTorch uses a 64-bit integer dtype by default for Python integers

torch.int64


In [9]:
floatvec = torch.tensor([1.0, 2.0, 3.0])
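# Tensors created from Python floats default to 32-bit precision, a balance
# between precision and computational efficiency:
# 1. 32-bit floats are precise enough for most deep learning tasks while
#    using less memory and compute than 64-bit floats.
# 2. GPU architectures are optimized for 32-bit computation.
print(floatvec.dtype)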

torch.float32


In [10]:
floatvec = tensor1d.to(torch.float32)  # A tensor's .to method can change its precision
print(floatvec.dtype)

torch.float32


### A.2.3 Common PyTorch tensor operations

In [11]:
tensor2d = torch.tensor([[1, 2, 3],
                         [4, 5, 6]])
tensor2d

tensor([[1, 2, 3],
        [4, 5, 6]])

In [12]:
tensor2d.shape  # The .shape attribute gives access to the tensor's shape

torch.Size([2, 3])

In [13]:
tensor2d.reshape(3, 2)  # Reshape the tensor into a 3×2 tensor

tensor([[1, 2],
        [3, 4],
        [5, 6]])

In [14]:
tensor2d.view(3, 2)  # In PyTorch, the more commonly used command for reshaping tensors is .view()

tensor([[1, 2],
        [3, 4],
        [5, 6]])

In [15]:
print(tensor2d)

tensor([[1, 2, 3],
        [4, 5, 6]])


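The subtle difference between `.view()` and `.reshape()` lies in how they handle memory layout: `.view()` requires the original data to be contiguous and fails if it is not, whereas `.reshape()` works either way, copying the data if necessary to produce the requested shape.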
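
To make this difference concrete, here is a minimal sketch that uses a transposed tensor as the non-contiguous input. The variable names (`t`, `t_T`, `view_failed`, and so on) are illustrative and not part of the original notebook:

```python
import torch

t = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
t_T = t.T  # transposing only swaps strides, so the result is not contiguous

print(t_T.is_contiguous())

# .view() cannot flatten the non-contiguous tensor and raises a RuntimeError
view_failed = False
try:
    t_T.view(6)
except RuntimeError:
    view_failed = True
print("view failed:", view_failed)

# .reshape() succeeds by copying the data into a new contiguous layout
flat = t_T.reshape(6)
print(flat)

# Making the tensor contiguous first lets .view() work as well
flat_via_view = t_T.contiguous().view(6)
print(flat_via_view)
```

If you are unsure whether a tensor is contiguous, `.reshape()` is the safer default; `.view()` is useful when you want a guarantee that no copy is made.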

In [16]:
tensor2d.T  # Transpose the tensor, flipping it across its diagonal

tensor([[1, 4],
        [2, 5],
        [3, 6]])

In [17]:
tensor2d.matmul(tensor2d.T)  # .matmul is the common way to multiply matrices in PyTorch; the @ operator does the same thing more compactly

tensor([[14, 32],
        [32, 77]])

In [18]:
tensor2d @ tensor2d.T

tensor([[14, 32],
        [32, 77]])

## A.3 Seeing models as computation graphs

<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A3/1.png" width="600px">

In [19]:
import torch.nn.functional as F

y = torch.tensor([1.0])  # True label
x1 = torch.tensor([1.1]) # Input feature
w1 = torch.tensor([2.2]) # Weight parameter
b = torch.tensor([0.0])  # Bias unit

z = x1 * w1 + b          # Net input
a = torch.sigmoid(z)     # Activation and output

loss = F.binary_cross_entropy(a, y)
print(loss)

tensor(0.0852)


## A.4 Automatic differentiation made easy

<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A4/1.png" width="600px">

In [20]:
import torch.nn.functional as F #Imports the functional interface for neural network operations from PyTorch.
from torch.autograd import grad #Imports the grad function specifically for calculating gradients.

y = torch.tensor([1.0]) #Defines the true label as a PyTorch tensor with a value of 1.0.
x1 = torch.tensor([1.1]) #Defines the input feature as a PyTorch tensor with a value of 1.1.
w1 = torch.tensor([2.2], requires_grad=True) #Defines the weight parameter as a PyTorch tensor with a value of 2.2. requires_grad=True is crucial here, as it tells PyTorch to track operations on this tensor so that gradients can be computed later.
b = torch.tensor([0.0], requires_grad=True) #Defines the bias term as a PyTorch tensor with a value of 0.0

z = x1 * w1 + b  #Calculates the net input z by performing a linear transformation on the input feature x1 using the weight w1 and adding the bias b.
a = torch.sigmoid(z) #Applies the sigmoid activation function to the net input z to get the output a

loss = F.binary_cross_entropy(a, y) #Calculates the binary cross-entropy loss between the output a and the true label y

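# By default, PyTorch destroys the computation graph after computing gradients to
# free memory. Because we reuse the graph shortly, retain_graph=True keeps it in memory.
grad_L_w1 = grad(loss, w1, retain_graph=True)
grad_L_b = grad(loss, b, retain_graph=True)

print(grad_L_w1)
print(grad_L_b)
# Here we call the grad function manually, which is useful for experimentation, debugging, and demonstrating concepts.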

(tensor([-0.0898]),)
(tensor([-0.0817]),)


In [21]:
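# In practice, PyTorch provides higher-level tools that automate this process. For example,
# calling .backward() on the loss computes the gradients of all leaf nodes in the graph
# that require gradients, and stores them in the tensors' .grad attributes.
loss.backward()

print(w1.grad)
print(b.grad)

# With .backward(), PyTorch handles the calculus for us: we do not need to work out any derivatives or gradients by hand.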

tensor([-0.0898])
tensor([-0.0817])
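
As a sanity check on these numbers: for a single sigmoid unit with binary cross-entropy loss, the chain rule gives the closed-form gradients ∂L/∂w1 = (a − y)·x1 and ∂L/∂b = a − y. The following sketch compares these hand-derived expressions with what autograd computes. The names `manual_grad_w1` and `manual_grad_b` are illustrative, and the tolerance passed to `torch.allclose` is an arbitrary choice:

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

a = torch.sigmoid(x1 * w1 + b)
loss = F.binary_cross_entropy(a, y)
loss.backward()  # autograd fills w1.grad and b.grad

# Hand-derived gradients for sigmoid + binary cross-entropy
with torch.no_grad():
    manual_grad_w1 = (a - y) * x1
    manual_grad_b = a - y

print(torch.allclose(w1.grad, manual_grad_w1, atol=1e-6))
print(torch.allclose(b.grad, manual_grad_b, atol=1e-6))
```

For deeper networks such closed forms quickly become unwieldy, which is exactly where automatic differentiation pays off.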


## A.5 Implementing multilayer neural networks

<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A5/1.png" width="500px">

In [53]:
'''
In summary, this class creates a simple feedforward neural network with two hidden layers
using ReLU activation and an output layer. The number of input features and output units
are customizable through the constructor.
'''
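class NeuralNetwork(torch.nn.Module):  # To implement a neural network in PyTorch, we can subclass torch.nn.Module to define our own custom architecture
    def __init__(self, num_inputs, num_outputs):  # Taking the numbers of inputs and outputs as arguments lets us reuse the same code for datasets with different numbers of features and classes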
        super().__init__() #This line calls the constructor of the parent class (torch.nn.Module). This is necessary to properly initialize the module.

        self.layers = torch.nn.Sequential(

            # 1st hidden layer
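            torch.nn.Linear(num_inputs, 30),  # A Linear layer takes the numbers of input and output nodes as arguments
            torch.nn.ReLU(),  # Nonlinear activation functions are placed between the hidden layers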

            # 2nd hidden layer
            torch.nn.Linear(30, 20),  # The number of output nodes of one hidden layer must match the number of input nodes of the next layer
            torch.nn.ReLU(),

            # output layer
            torch.nn.Linear(20, num_outputs),
        )

    def forward(self, x): #This method defines the forward pass of the neural network. It takes an input tensor x and passes it through the defined layers.
        logits = self.layers(x)
        return logits  # The outputs of the last layer are called logits

In [54]:
model = NeuralNetwork(50, 3)  # Instantiate a new neural network object
#(50, 3) are the arguments passed to the __init__ method of the NeuralNetwork class.

In [62]:
'''
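Note that we used the Sequential class when implementing the NeuralNetwork class.
Sequential is not required, but it makes life easier when we have a series of layers
that we want to execute in a specific order, as is the case here. After instantiating
self.layers = Sequential(...) in the __init__ constructor, we only need to call
self.layers in the forward method of NeuralNetwork instead of calling each layer individually.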
'''
print(model)

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)


In [56]:
# Check the total number of trainable parameters of this model
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)  # Every parameter with requires_grad=True counts as trainable and is updated during training
print("Total number of trainable model parameters:", num_params)

Total number of trainable model parameters: 2213


`num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)`: This line calculates the sum of the number of elements (parameters) in all the layers of the `model` that require gradients.

- `model.parameters()`: This method returns an iterator over all the parameters (weights and biases) of the model.
- `for p in model.parameters()`: This iterates through each parameter p in the model.
- `if p.requires_grad`: This condition filters the parameters to include only those that have `requires_grad` set to `True`. These are the parameters that will be updated during the training process.
- `p.numel()`: This method returns the total number of elements (scalars) in a given parameter tensor p.
- `sum(...)`: This sums up the number of elements for all the parameters that meet the condition.
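
The same total can be reproduced by hand: a `Linear` layer with `n_in` inputs and `n_out` outputs holds `n_in * n_out` weights plus `n_out` biases. The sketch below applies that formula to the three layer sizes used in this network and compares it with a count over freshly created `torch.nn.Linear` layers. The names `layer_sizes`, `manual_count`, and `counted` are illustrative:

```python
import torch

# (inputs, outputs) of the three Linear layers in the network above
layer_sizes = [(50, 30), (30, 20), (20, 3)]

# Weights (n_in * n_out) plus biases (n_out) per layer
manual_count = sum(n_in * n_out + n_out for n_in, n_out in layer_sizes)
print(manual_count)  # 1530 + 620 + 63 = 2213

# Cross-check by counting the elements of real Linear layers
layers = [torch.nn.Linear(n_in, n_out) for n_in, n_out in layer_sizes]
counted = sum(p.numel() for layer in layers for p in layer.parameters())
print(counted)
```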

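For the neural network model with two hidden layers described above, these trainable parameters live in the `torch.nn.Linear` layers. A `Linear` layer multiplies its input by a weight matrix and adds a bias vector. This is sometimes called a **feedforward layer** or a **fully connected layer**.

Based on the `print(model)` call executed here, we can see that the first `Linear` layer sits at index 0 of the `layers` attribute.

The corresponding weight parameter matrix can be accessed as follows: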

In [57]:
print(model.layers[0].weight)  # Access the weight parameter matrix

Parameter containing:
tensor([[ 0.1388,  0.0159,  0.1215,  ...,  0.1032,  0.0296,  0.0102],
        [ 0.0229,  0.0260, -0.0458,  ..., -0.0358,  0.0362,  0.0497],
        [-0.0896,  0.0113,  0.1370,  ...,  0.1037,  0.1230, -0.0929],
        ...,
        [-0.1362, -0.0713, -0.0010,  ...,  0.1176,  0.1054, -0.1012],
        [ 0.1226,  0.0937, -0.1409,  ...,  0.1321, -0.0613,  0.0086],
        [-0.0045, -0.0604,  0.0535,  ...,  0.0697,  0.0373,  0.0923]],
       requires_grad=True)


In [28]:
print(model.layers[0].weight.shape)  # Inspect its dimensions
# The weight matrix here is a 30×50 matrix, and requires_grad is set to True (meaning the matrix is trainable), which is the default for weights and biases in torch.nn.Linear

torch.Size([30, 50])


In [58]:
print(model.layers[0].bias)  # Access the bias vector

Parameter containing:
tensor([-0.0181,  0.1404,  0.0374,  0.1102,  0.0045,  0.0788,  0.1013,  0.0211,
         0.1191, -0.1204, -0.0152, -0.0222, -0.0056,  0.0466, -0.0365, -0.0321,
         0.0927, -0.1029, -0.0093,  0.1047,  0.1279, -0.1176,  0.0445,  0.0583,
         0.0263,  0.0459, -0.0549,  0.0258,  0.0305,  0.0463],
       requires_grad=True)


In [59]:
print(model.layers[0].bias.shape)

torch.Size([30])


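If you run the preceding code on your own computer, the numbers in the weight matrix will likely differ from those shown here.

The model weights are initialized with small random numbers, which differ each time the network is instantiated. In deep learning, weights are initialized with small random numbers to break symmetry during training. Otherwise, the nodes would perform the same operations and receive the same updates during backpropagation, which would prevent the network from learning complex mappings from inputs to outputs.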

In [61]:
torch.manual_seed(123)  # Seeding PyTorch's random number generator with manual_seed makes the random initialization reproducible

model = NeuralNetwork(50, 3)
print(model.layers[0].weight)

Parameter containing:
tensor([[-0.0577,  0.0047, -0.0702,  ...,  0.0222,  0.1260,  0.0865],
        [ 0.0502,  0.0307,  0.0333,  ...,  0.0951,  0.1134, -0.0297],
        [ 0.1077, -0.1108,  0.0122,  ...,  0.0108, -0.1049, -0.1063],
        ...,
        [-0.0787,  0.1259,  0.0803,  ...,  0.1218,  0.1303, -0.1351],
        [ 0.1359,  0.0175, -0.0673,  ...,  0.0674,  0.0676,  0.1058],
        [ 0.0790,  0.1343, -0.0293,  ...,  0.0344, -0.0971, -0.0509]],
       requires_grad=True)
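
The effect of seeding can be verified in isolation. The following sketch, which uses a small standalone `torch.nn.Linear` layer and an arbitrary seed value rather than the network above, shows that reseeding before construction reproduces the same weights, while constructing another layer without reseeding does not:

```python
import torch

torch.manual_seed(0)               # arbitrary seed chosen for illustration
layer_a = torch.nn.Linear(4, 2)

torch.manual_seed(0)               # reseed with the same value
layer_b = torch.nn.Linear(4, 2)

layer_c = torch.nn.Linear(4, 2)    # no reseeding: continues the random stream

print(torch.equal(layer_a.weight, layer_b.weight))  # identical initialization
print(torch.equal(layer_a.weight, layer_c.weight))  # different initialization
```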


In [29]:
'''
This code snippet demonstrates how to perform a forward pass through the model
with a single input sample and how to prevent gradient calculation during inference.
'''

torch.manual_seed(123)

X = torch.rand((1, 50)) #This line creates a random tensor named X with a shape of (1, 50).
# torch.rand() creates a tensor filled with random numbers from a uniform distribution between 0 and 1.
# (1, 50) specifies the shape of the tensor. A shape of (1, 50) represents a single input sample with 50 features, which matches the num_inputs specified when the NeuralNetwork was instantiated earlier.

out = model(X) #This line performs a forward pass through the model using the input tensor X
#When you call a torch.nn.Module instance like a function (model(X)), it automatically executes the forward method you defined in the class. The output of the forward pass (the logits in this case) is stored in the out variable.
print(out)

#You'll notice that the output tensor has grad_fn=<AddmmBackward0>, indicating that PyTorch is tracking the operations performed to compute out for potential gradient calculations later.

tensor([[-0.1262,  0.1080, -0.1792]], grad_fn=<AddmmBackward0>)


The output `tensor([[-0.1262, 0.1080, -0.1792]], grad_fn=<AddmmBackward0>)` represents the **logits** produced by the neural network for the single input sample.

Here's what each part means:

- `tensor([[-0.1262, 0.1080, -0.1792]])`: This is the actual output data. It's a tensor with **one row and three columns**. Since the NeuralNetwork was instantiated with `num_outputs=3`, each of these three values **corresponds to the raw, unnormalized score (or logit)** for each of the three possible output classes. A higher logit value for a class generally indicates that the model has higher confidence that the input belongs to that class.
- `grad_fn=<AddmmBackward0>`: This part indicates that this tensor is the result of an operation (`AddmmBackward0` corresponds to a matrix multiplication followed by an addition, which is what happens in a linear layer), and that PyTorch is tracking the operations performed to create this tensor. This tracking is essential for automatic differentiation, allowing PyTorch to compute gradients during the backward pass for training.

In a typical classification task, these logits would be passed through a softmax function to convert them into probabilities for each class, which sum up to 1.

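The three returned values correspond to the scores assigned to each of the three output nodes. Note that the output tensor also includes a `grad_fn` value.

`grad_fn=<AddmmBackward0>` means that the tensor we are inspecting was created by a matrix multiplication and addition operation. PyTorch uses this information when it computes gradients during backpropagation. The `<AddmmBackward0>` part specifies the operation that was performed, in this case an `Addmm` operation: a matrix multiplication (`mm`) followed by an addition (`Add`).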

In [30]:
'''
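If we only want to use the network for prediction, without training or backpropagation
(for example, when making predictions after training), building this computation graph
for backpropagation wastes resources, because it performs unnecessary computations and
consumes extra memory. When using a model for inference (for instance, making predictions)
rather than training, the best practice is therefore to use the torch.no_grad() context
manager. It tells PyTorch that it does not need to track gradients, which can save a
significant amount of memory and computation.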
'''
with torch.no_grad():
    out = model(X)
print(out)

tensor([[-0.1262,  0.1080, -0.1792]])


In [31]:
with torch.no_grad():
    out = torch.softmax(model(X), dim=1) #dim=1: This specifies the dimension along which the softmax function is applied. In this case, dim=1 means the softmax is applied across the columns (the different output classes) for each input sample (row). This is the standard way to apply softmax for classification tasks where each row represents a single instance and each column represents a class score.
print(out)  # The values can now be interpreted as class-membership probabilities, and they sum to approximately 1

tensor([[0.3113, 0.3934, 0.2952]])


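**Models usually return logits:** When building classification models in PyTorch, it is customary for the last layer to perform only a linear transformation. Its outputs are raw values that have not passed through a Softmax or Sigmoid activation, and these values are called logits.

**The loss functions integrate Softmax/Sigmoid:** PyTorch provides common loss functions such as `CrossEntropyLoss` for multiclass classification and `BCEWithLogitsLoss` for binary classification. These loss functions already combine the Softmax (or Sigmoid) operation with the negative log-likelihood loss computation internally.

**Reason: numerical efficiency and stability.** Computing Softmax/Sigmoid together with the loss is more efficient and more numerically stable than computing Softmax/Sigmoid first and the loss afterward. Softmax and Sigmoid involve exponentials, so very large or very small inputs can cause overflow or underflow. Combining them with the log-likelihood loss allows mathematical tricks (such as the LogSumExp trick) to avoid these problems, which improves the precision and stability of the computation.

**Call Softmax explicitly only when probabilities are needed:** If you only want to compute the loss and backpropagate during training, you can pass the model's logits directly to one of PyTorch's loss functions that integrate Softmax/Sigmoid. You need to apply Softmax to the logits explicitly only when you want to inspect the predicted probability for each class (for example, during inference or model evaluation), as in the code above.

In summary, this convention gives better numerical behavior during training while still making it easy to obtain class probabilities when you need them.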
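
The stability argument can be illustrated with a small experiment. The sketch below first checks that `F.cross_entropy` on logits matches a two-step Softmax-then-log computation for moderate values, and then repeats the comparison with a deliberately extreme logit. The target labels and the extreme logit values are assumptions chosen only for illustration:

```python
import torch
import torch.nn.functional as F

# Moderate logits: the fused loss and the two-step computation agree
logits = torch.tensor([[-0.1262, 0.1080, -0.1792]])
target = torch.tensor([1])  # hypothetical true class

fused = F.cross_entropy(logits, target)
two_step = F.nll_loss(torch.log(torch.softmax(logits, dim=1)), target)
print(torch.allclose(fused, two_step))

# Extreme logits: the softmax probability of class 1 underflows to 0,
# so taking its log separately produces an infinite loss
extreme_logits = torch.tensor([[1000.0, 0.0]])
extreme_target = torch.tensor([1])

naive_loss = F.nll_loss(
    torch.log(torch.softmax(extreme_logits, dim=1)), extreme_target
)
stable_loss = F.cross_entropy(extreme_logits, extreme_target)
print(naive_loss)   # infinite
print(stable_loss)  # finite
```

The fused loss works in log space throughout, so it never has to represent the vanishingly small probability explicitly.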

## A.6 Setting up efficient data loaders

<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A6/1.png" width="600px">

In [32]:
X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])

y_train = torch.tensor([0, 0, 0, 1, 1])

In [33]:
X_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])

y_test = torch.tensor([0, 1])

In [34]:
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

In [35]:
len(train_ds)

5

In [36]:
from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0
)

In [37]:
test_ds = ToyDataset(X_test, y_test)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

In [38]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
Batch 2: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 3: tensor([[ 2.7000, -1.5000]]) tensor([1])


In [39]:
train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True
)

In [40]:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

Batch 1: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 2: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])


<img src="https://raw.githubusercontent.com/MLNLP-World/LLMs-from-scratch-CN/main/imgs/A6/2.png" width="600px">

## A.7 A typical training loop

In [41]:
import torch.nn.functional as F


torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3

for epoch in range(num_epochs):

    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):

        logits = model(features)

        loss = F.cross_entropy(logits, labels)  # Loss function

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        ### Logging
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:.2f}")

    model.eval()
    # Insert optional model evaluation code

Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00
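
The training loop calls `optimizer.zero_grad()` before every `loss.backward()` because PyTorch accumulates gradients in `.grad` rather than overwriting them. The following sketch demonstrates this with a single parameter `w` and a toy loss `3 * w`, both made up for illustration:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(3.0 * w).sum().backward()
first = w.grad.clone()
print(first)   # gradient of 3*w with respect to w

(3.0 * w).sum().backward()
second = w.grad.clone()
print(second)  # the second backward pass was added on top of the first

# Reset the gradients; depending on the PyTorch version this either
# zeroes the tensor or sets .grad to None
optimizer.zero_grad()
print(w.grad)
```

Without the reset, each update step would use the sum of the gradients from all previous batches.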


In [42]:
model.eval()

with torch.no_grad():
    outputs = model(X_train)

print(outputs)

tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])


In [43]:
torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1)
print(probas)

predictions = torch.argmax(probas, dim=1)
print(predictions)

tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])
tensor([0, 0, 0, 1, 1])


In [44]:
predictions = torch.argmax(outputs, dim=1)
print(predictions)

tensor([0, 0, 0, 1, 1])


In [45]:
predictions == y_train

tensor([True, True, True, True, True])

In [46]:
torch.sum(predictions == y_train)

tensor(5)

In [47]:
def compute_accuracy(model, dataloader):

    model = model.eval()
    correct = 0.0
    total_examples = 0

    for idx, (features, labels) in enumerate(dataloader):

        with torch.no_grad():
            logits = model(features)

        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions
        correct += torch.sum(compare)
        total_examples += len(compare)

    return (correct / total_examples).item()

In [48]:
compute_accuracy(model, train_loader)

1.0

In [49]:
compute_accuracy(model, test_loader)

1.0

## A.8 Saving and loading models

In [50]:
torch.save(model.state_dict(), "model.pth")

In [51]:
model = NeuralNetwork(2, 2)  # The architecture must match the originally saved model exactly
model.load_state_dict(torch.load("model.pth", weights_only=True))

<All keys matched successfully>
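
To confirm that a save-and-load round trip restores the exact parameters, here is a minimal sketch that uses a standalone `torch.nn.Linear` layer and a temporary directory instead of the model and the `model.pth` file above. The file name `linear.pth` and the variable names are illustrative:

```python
import os
import tempfile

import torch

torch.manual_seed(123)
original = torch.nn.Linear(2, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear.pth")  # hypothetical file name
    torch.save(original.state_dict(), path)

    # A fresh layer starts with different random weights ...
    restored = torch.nn.Linear(2, 2)
    # ... until the saved state_dict is loaded into it
    restored.load_state_dict(torch.load(path, weights_only=True))

# Compare every saved tensor with its restored counterpart
same = all(
    torch.equal(p, q)
    for p, q in zip(original.state_dict().values(),
                    restored.state_dict().values())
)
print(same)
```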

## A.9 Optimizing training performance with GPUs

### A.9.1 PyTorch computations on GPU devices

See [code-part2.ipynb](code-part2.ipynb)

### A.9.2 Single-GPU training

See [code-part2.ipynb](code-part2.ipynb)

### A.9.3 Training with multiple GPUs

See [DDP-script.py](DDP-script.py)