<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Understanding PyTorch Buffers

In essence, PyTorch buffers are tensor attributes associated with a PyTorch module or model. They are similar to parameters, but unlike parameters, buffers are not updated by the optimizer during training.

Buffers are particularly useful for GPU computation because they are transferred between devices (for example, from CPU to GPU) alongside the model's parameters. Unlike parameters, buffers do not require gradient computation, but they still need to be on the same device as the tensors they interact with; otherwise, operations that combine them fail.

In chapter 3, we use PyTorch buffers via `self.register_buffer`, which is only briefly explained in the book. Since the concept and purpose are not immediately clear, this code notebook offers a longer explanation with a hands-on example.
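The difference between parameters and buffers described above can be checked directly. The following is a minimal sketch, not code from the book: `TinyModule` and its `scale` buffer are hypothetical names chosen for illustration. It lists the module's parameters and buffers and then takes one optimizer step to show that only the parameter changes.

```python
import torch
import torch.nn as nn


class TinyModule(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 2, bias=False)      # trainable parameter
        self.register_buffer("scale", torch.ones(2))   # buffer: module state, not trained

    def forward(self, x):
        return self.linear(x) * self.scale


torch.manual_seed(0)
model = TinyModule()

# Parameters and buffers are tracked in separate registries
param_names = [name for name, _ in model.named_parameters()]
buffer_names = [name for name, _ in model.named_buffers()]
print("parameters:", param_names)
print("buffers:", buffer_names)
print("scale.requires_grad:", model.scale.requires_grad)

# One optimizer step: the optimizer only receives model.parameters()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
weight_before = model.linear.weight.detach().clone()
loss = model(torch.ones(1, 3)).sum()
loss.backward()
optimizer.step()

weight_changed = not torch.equal(weight_before, model.linear.weight)
buffer_unchanged = torch.equal(model.scale, torch.ones(2))
print("weight changed:", weight_changed)
print("buffer unchanged:", buffer_unchanged)
```

Because the optimizer is constructed from `model.parameters()`, it never sees the buffer; code in `forward` can still overwrite a buffer explicitly if it needs to.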

## An example without buffers

Suppose we have the following code, which is based on code from chapter 3. This version has been modified to exclude buffers. It implements the causal self-attention mechanism used in LLMs:

In [1]:
# Import PyTorch and its neural network module
import torch
import torch.nn as nn

# Causal attention module that does not use buffers
class CausalAttentionWithoutBuffers(nn.Module):

    # Takes the input and output dimensions, context length, dropout rate, and bias flag
    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        # Store the output dimension
        self.d_out = d_out
        # Linear layers for the queries, keys, and values
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        # Dropout layer
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular causal mask, stored as a plain tensor attribute
        self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)

    # Forward pass
    def forward(self, x):
        # Unpack the input shape: batch size, number of tokens, input dimension
        b, num_tokens, d_in = x.shape
        # Compute the key, query, and value vectors
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Compute the attention scores
        attn_scores = queries @ keys.transpose(1, 2)
        # Set the attention scores of future tokens to negative infinity
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        # Compute the attention weights (scaled dot-product attention)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        # Apply dropout
        attn_weights = self.dropout(attn_weights)

        # Compute and return the context vectors
        context_vec = attn_weights @ values
        return context_vec

We can initialize and run the module as follows on some example data:

In [2]:
# Set the random seed for reproducibility
torch.manual_seed(123)

# Input tensor; each row is the embedding vector of one token
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

# Build a batch by stacking two copies of the inputs
batch = torch.stack((inputs, inputs), dim=0)
# Context length (sequence length)
context_length = batch.shape[1]
# Input dimension
d_in = inputs.shape[1]
# Output dimension
d_out = 2

# Initialize the causal attention module (version without buffers)
ca_without_buffer = CausalAttentionWithoutBuffers(d_in, d_out, context_length, 0.0)

# Run the forward pass without tracking gradients
with torch.no_grad():
    context_vecs = ca_without_buffer(batch)

# Print the resulting context vectors
print(context_vecs)

tensor([[[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]],

        [[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]]])


So far, everything has worked fine.

However, when training LLMs, we typically use GPUs to accelerate the process. Therefore, let's transfer the `CausalAttentionWithoutBuffers` module onto a GPU device.

Please note that this operation requires the code to be run in an environment equipped with a GPU.
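For readers without a GPU, one common way to keep such code runnable is to select the device at runtime. The snippet below is a minimal sketch under that assumption; the notebook itself hard-codes `"cuda"`, and the `device` and `example` names are introduced here only for illustration.

```python
import torch

# Fall back to the CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Hypothetical tensor, for illustration only
example = torch.ones(2, 3).to(device)
print("example.device:", example.device)
```

On a CPU-only machine everything stays on the CPU, so the device mismatch discussed in this section cannot be reproduced there.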

In [3]:
# Check whether the machine has a GPU available
print("Machine has GPU:", torch.cuda.is_available())

# Move the input batch to the GPU
batch = batch.to("cuda")

# Move the model to the GPU
ca_without_buffer.to("cuda");

Machine has GPU: True


Now, let's run the code again:

In [4]:
# Run the forward pass without tracking gradients
with torch.no_grad():
    context_vecs = ca_without_buffer(batch)

# Print the resulting context vectors
print(context_vecs)

RuntimeError: expected self and mask to be on the same device, but got mask on cpu and self on cuda:0

Running the code resulted in an error. What happened? According to the message, the in-place `masked_fill_` call received attention scores on the GPU and a mask on the CPU. But we moved the module to the GPU!?

Let's double-check the device locations of some of the tensors:

In [5]:
# Print the device of the W_query weight tensor
print("W_query.device:", ca_without_buffer.W_query.weight.device)
# Print the device of the mask tensor
print("mask.device:", ca_without_buffer.mask.device)

W_query.device: cuda:0
mask.device: cpu


In [6]:
# Check the type of the mask attribute
type(ca_without_buffer.mask)

torch.Tensor

As we can see, the `mask` was not moved onto the GPU. That's because it is a plain `torch.Tensor` attribute rather than a PyTorch parameter like the weights (e.g., `W_query.weight`), so calling `.to("cuda")` on the module does not touch it.

This means we have to manually move it to the GPU via `.to("cuda")`:

In [7]:
# Move the mask tensor to the GPU
ca_without_buffer.mask = ca_without_buffer.mask.to("cuda")
# Print the device of the mask tensor after the move
print("mask.device:", ca_without_buffer.mask.device)

mask.device: cuda:0


Let's try our code again:

In [8]:
# Disable gradient tracking
with torch.no_grad():
    # Process the batch with the ca_without_buffer module
    context_vecs = ca_without_buffer(batch)

# Print the resulting context vectors
print(context_vecs)

tensor([[[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]],

        [[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]]], device='cuda:0')


This time, it worked!

However, remembering to move individual tensors to the GPU can be tedious. As we will see in the next section, it's easier to use `register_buffer` to register the `mask` as a buffer.
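The behavior seen in this section can also be reproduced without a GPU. The sketch below uses a dtype cast as a CPU-only stand-in for the device move, on the assumption that `nn.Module.to` applies a dtype conversion to the same registered tensors that it would move to another device. `PlainVsBuffer`, `plain_mask`, and `buffer_mask` are hypothetical names used only for this illustration.

```python
import torch
import torch.nn as nn


class PlainVsBuffer(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(3, 2)
        # Plain tensor attribute: not registered with the module
        self.plain_mask = torch.ones(2, 2)
        # Registered buffer: tracked by the module
        self.register_buffer("buffer_mask", torch.ones(2, 2))


module = PlainVsBuffer()

# Stand-in for .to("cuda"): convert the module's floating-point tensors to float64
module.to(torch.float64)

print("weight dtype:     ", module.linear.weight.dtype)
print("buffer_mask dtype:", module.buffer_mask.dtype)
print("plain_mask dtype: ", module.plain_mask.dtype)
```

The parameter and the registered buffer are converted, while the plain attribute keeps its original dtype, mirroring how the unregistered `mask` stayed on the CPU.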

## An example with buffers

Let's now modify the causal attention class to register the causal `mask` as a buffer:

In [9]:
# Import the PyTorch packages
import torch
import torch.nn as nn

# Causal attention module that registers the mask as a buffer
class CausalAttentionWithBuffer(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        # Initialize the parent class
        super().__init__()
        # Store the output dimension
        self.d_out = d_out
        # Linear layer for the queries
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        # Linear layer for the keys
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        # Linear layer for the values
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        # Dropout layer
        self.dropout = nn.Dropout(dropout)
        # Old:
        # self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)

        # New:
        # Register the upper-triangular mask as a buffer
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        # Unpack the input shape: batch size, number of tokens, input dimension
        b, num_tokens, d_in = x.shape
        # Compute the key vectors
        keys = self.W_key(x)
        # Compute the query vectors
        queries = self.W_query(x)
        # Compute the value vectors
        values = self.W_value(x)

        # Attention scores: matrix product of queries and keys
        attn_scores = queries @ keys.transpose(1, 2)
        # Set the attention scores of future positions to negative infinity
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        # Attention weights: softmax over the scaled scores
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        # Apply dropout to the attention weights
        attn_weights = self.dropout(attn_weights)

        # Context vectors: matrix product of attention weights and values
        context_vec = attn_weights @ values
        # Return the context vectors
        return context_vec

Now, conveniently, if we move the module to the GPU, the mask will be located on the GPU as well:

In [10]:
# Create an instance of the causal attention module with a buffer
ca_with_buffer = CausalAttentionWithBuffer(d_in, d_out, context_length, 0.0)
# Move the model to the CUDA device (GPU)
ca_with_buffer.to("cuda")

# Print the devices of the query weight matrix and the mask
print("W_query.device:", ca_with_buffer.W_query.weight.device)
print("mask.device:", ca_with_buffer.mask.device)

W_query.device: cuda:0
mask.device: cuda:0


In [11]:
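# Disable gradient tracking with the torch.no_grad() context manager
with torch.no_grad():
    # Process the input batch with the module that uses a buffer
    context_vecs = ca_with_buffer(batch)

# Print the resulting context vectors
print(context_vecs)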

tensor([[[0.4772, 0.1063],
         [0.5891, 0.3257],
         [0.6202, 0.3860],
         [0.5478, 0.3589],
         [0.5321, 0.3428],
         [0.5077, 0.3493]],

        [[0.4772, 0.1063],
         [0.5891, 0.3257],
         [0.6202, 0.3860],
         [0.5478, 0.3589],
         [0.5321, 0.3428],
         [0.5077, 0.3493]]], device='cuda:0')


As we can see above, registering a tensor as a buffer can make our lives a lot easier: We don't have to remember to move tensors to a target device like a GPU manually. (The values differ from the earlier outputs only because `ca_with_buffer` is a new module with its own randomly initialized weights.)

## Buffers and `state_dict`

- Another advantage of PyTorch buffers over regular tensors is that they are included in a model's `state_dict`
- For example, consider the `state_dict` of the causal attention object without buffers

In [12]:
# Print the state dict of the causal attention module without buffers
# It contains only the weight parameters, not the mask
ca_without_buffer.state_dict()

OrderedDict([('W_query.weight',
              tensor([[-0.2354,  0.0191, -0.2867],
                      [ 0.2177, -0.4919,  0.4232]], device='cuda:0')),
             ('W_key.weight',
              tensor([[-0.4196, -0.4590, -0.3648],
                      [ 0.2615, -0.2133,  0.2161]], device='cuda:0')),
             ('W_value.weight',
              tensor([[-0.4900, -0.3503, -0.2120],
                      [-0.1135, -0.4404,  0.3780]], device='cuda:0'))])

- The mask is not included in the `state_dict` above
- However, the mask *is* included in the `state_dict` below, thanks to registering it as a buffer

In [13]:
# Print the state dict of the causal attention module with a buffer
# It contains the weight parameters and the mask buffer
ca_with_buffer.state_dict()

OrderedDict([('mask',
              tensor([[0., 1., 1., 1., 1., 1.],
                      [0., 0., 1., 1., 1., 1.],
                      [0., 0., 0., 1., 1., 1.],
                      [0., 0., 0., 0., 1., 1.],
                      [0., 0., 0., 0., 0., 1.],
                      [0., 0., 0., 0., 0., 0.]], device='cuda:0')),
             ('W_query.weight',
              tensor([[-0.1362,  0.1853,  0.4083],
                      [ 0.1076,  0.1579,  0.5573]], device='cuda:0')),
             ('W_key.weight',
              tensor([[-0.2604,  0.1829, -0.2569],
                      [ 0.4126,  0.4611, -0.5323]], device='cuda:0')),
             ('W_value.weight',
              tensor([[ 0.4929,  0.2757,  0.2516],
                      [ 0.2377,  0.4800, -0.0762]], device='cuda:0'))])

- A `state_dict` is useful, for example, when saving and loading trained PyTorch models
- In this particular case, saving and loading the `mask` is of limited use because it remains unchanged during training; so, for demonstration purposes, let's assume it was modified so that all `1`s were changed to `2`s:

In [14]:
# Change every mask entry equal to 1 to 2 in the module with a buffer
ca_with_buffer.mask[ca_with_buffer.mask == 1.] = 2.
# Display the modified mask
ca_with_buffer.mask

tensor([[0., 2., 2., 2., 2., 2.],
        [0., 0., 2., 2., 2., 2.],
        [0., 0., 0., 2., 2., 2.],
        [0., 0., 0., 0., 2., 2.],
        [0., 0., 0., 0., 0., 2.],
        [0., 0., 0., 0., 0., 0.]], device='cuda:0')

- Then, if we save and load the model, we can see that the mask is restored with the modified values

In [15]:
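# Save the state dict of the module with a buffer to a file
torch.save(ca_with_buffer.state_dict(), "model.pth")

# Create a new instance of the module with a buffer
new_ca_with_buffer = CausalAttentionWithBuffer(d_in, d_out, context_length, 0.0)
# Load the saved state dict from the file into the new module
new_ca_with_buffer.load_state_dict(torch.load("model.pth"))

# Display the new module's mask to verify that the values were restored
new_ca_with_buffer.mask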

tensor([[0., 2., 2., 2., 2., 2.],
        [0., 0., 2., 2., 2., 2.],
        [0., 0., 0., 2., 2., 2.],
        [0., 0., 0., 0., 2., 2.],
        [0., 0., 0., 0., 0., 2.],
        [0., 0., 0., 0., 0., 0.]])

- Without buffers, the modified mask is not restored, because a plain tensor attribute is never written to the `state_dict`:

In [16]:
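# Change every mask entry equal to 1 to 2 in the module without buffers
ca_without_buffer.mask[ca_without_buffer.mask == 1.] = 2.

# Save the state dict of the module without buffers to a file
torch.save(ca_without_buffer.state_dict(), "model.pth")

# Create a new instance of the module without buffers
new_ca_without_buffer = CausalAttentionWithoutBuffers(d_in, d_out, context_length, 0.0)
# Load the saved state dict from the file into the new module
new_ca_without_buffer.load_state_dict(torch.load("model.pth"))

# Display the new module's mask; it still holds the freshly initialized values
new_ca_without_buffer.mask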

tensor([[0., 1., 1., 1., 1., 1.],
        [0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1.],
        [0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0.]])
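As noted earlier, storing a constant mask in a checkpoint adds little, since it can always be rebuilt from `context_length`. If only the device handling is wanted, `register_buffer` accepts a `persistent=False` argument that keeps the tensor out of the `state_dict`. The following is a minimal sketch of that option, not the approach used in the book; `MaskHolder` is a hypothetical module introduced for illustration.

```python
import torch
import torch.nn as nn


class MaskHolder(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, context_length):
        super().__init__()
        self.linear = nn.Linear(3, 2, bias=False)
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        # persistent=False: still a buffer, but excluded from state_dict
        self.register_buffer("mask", mask, persistent=False)


holder = MaskHolder(context_length=6)

state_keys = list(holder.state_dict().keys())
buffer_names = [name for name, _ in holder.named_buffers()]

print("state_dict keys:", state_keys)
print("buffers:", buffer_names)
```

The mask still appears among the module's buffers, so it follows the module across devices, but it is no longer saved or expected when loading a checkpoint.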