# Chapter 3: Coding Attention Mechanisms
# 第三章：编码注意力机制

- 如果文本顺序为先英文，后中文，则为原文翻译；先中文，后英文，则为译者后期注释。
- If the text order is English first, then Chinese, it will be the original text; if Chinese comes first, followed by English, then it will be translator's annotations later.

Packages that are being used in this notebook:

在这个笔记本中使用的软件包：

In [1]:
from importlib.metadata import version
import torch

print("torch version:", version("torch"))

torch version: 2.1.0


## 3.1 The problem with modeling long sequences
## 3.1 建模长序列的问题

- No code in this section
- 本节没有代码

## 3.2 Capturing data dependencies with attention mechanisms
## 3.2 使用注意力机制捕获数据依赖关系

- No code in this section
- 本节没有代码

## 3.3 Attending to different parts of the input with self-attention
## 3.3 使用自注意力机制关注输入的不同部分

### 3.3.1 A simple self-attention mechanism without trainable weights
### 3.3.1 一个简单的无可训练权重的自注意力机制

- This section explains a very simplified variant of self-attention, which does not contain any trainable weights. This is purely for illustration purposes and NOT the attention mechanism that is used in transformers. The next section, section 3.3.2, will extend this simple attention mechanism to implement the real self-attention mechanism.
- 本节解释了一种非常简化的自注意力机制变体，它不包含任何可训练的权重。这纯粹是为了说明目的，而不是transformer中使用的注意力机制。下一节，3.3.2节，将扩展这个简单的注意力机制来实现真正的自注意力机制。
- Suppose we are given an input sequence $x^{(1)}$ to $x^{(T)}$.
- 假设我们有一个输入序列 $x^{(1)}$ 到 $x^{(T)}$。
  - The input is a text (for example, a sentence like "Your journey starts with one step") that has already been converted into token embeddings as described in chapter 2.
  - 输入是一段文本（例如，一句话，如“Your journey starts with one step（你的旅程从一步开始）”），已经根据第二章中描述的方式转换为标记嵌入。
  - For instance, $x^{(1)}$ is a d-dimensional vector representing the word "Your", and so forth.
  - 例如，$x^{(1)}$ 是一个表示单词 "Your" 的d维向量，依此类推。
- **Goal:** compute context vectors $z^{(i)}$ for each input sequence element $x^{(i)}$ in $x^{(1)}$ to $x^{(T)}$ (where $z$ and $x$ have the same dimension).
- **目标：** 为 $x^{(1)}$ 到 $x^{(T)}$ 中的每个输入序列元素 $x^{(i)}$ 计算上下文向量 $z^{(i)}$（其中 $z$ 和 $x$ 具有相同的维度）。
    - A context vector $z^{(i)}$ is a weighted sum over the inputs $x^{(1)}$ to $x^{(T)}$.
    - 一个上下文向量 $z^{(i)}$ 是对输入 $x^{(1)}$ 到 $x^{(T)}$ 的加权和。
    - The context vector is "context"-specific to a certain input.
    - 上下文向量是针对某个特定输入的“上下文”特定的。
      - Instead of $x^{(i)}$ as a placeholder for an arbitrary input token, let's consider the second input, $x^{(2)}$.
      - 不是将$x^{(i)}$作为任意输入标记的占位符，我们将考虑第二个输入$x^{(2)}$。
      - And to continue with a concrete example, instead of the placeholder $z^{(i)}$, we consider the second output context vector, $z^{(2)}$.
      - 继续使用一个具体的例子，不是使用占位符$z^{(i)}$，我们考虑第二个输出上下文向量$z^{(2)}$。
      - The second context vector, $z^{(2)}$, is a weighted sum over all inputs $x^{(1)}$ to $x^{(T)}$ weighted with respect to the second input element, $x^{(2)}$. The attention weights are the weights that determine how much each of the input elements contributes to the weighted sum when computing $z^{(2)}$.
      - 第二个上下文向量 $z^{(2)}$ 是对所有输入 $x^{(1)}$ 到 $x^{(T)}$ 的加权和，其权重相对于第二个输入元素 $x^{(2)}$ 而言。注意力权重确定在计算 $z^{(2)}$ 时每个输入元素对加权和的贡献程度。
    - In short, think of $z^{(2)}$ as a modified version of $x^{(2)}$ that also incorporates information about all other input elements that are relevant to a given task at hand.
    - 简而言之，可以将$z^{(2)}$视为$x^{(2)}$的修改版本，它还整合了有关所有其他输入元素的信息，这些输入元素对于手头的给定任务是相关的。

- By convention, the unnormalized attention weights are referred to as **"attention scores"** whereas the normalized attention scores, which sum to 1, are referred to as **"attention weights"**.
- 根据惯例，未归一化（或称未标准化）的注意力权重被称为 **“注意力分数”**，而归一化（又称标准化）的注意力分数，其总和为1，被称为 **“注意力权重”**。
- The attention weights and context vector calculation are summarized in the figure below:
- 注意力权重和上下文向量的计算如下图所示：

<img src="figures/attention.png" width="600px">

- The code below walks through the figure above step by step.
- 下面的代码逐步展示了上图中的过程。

<br>

- **Step 1:** compute unnormalized attention scores $\omega$.
- **步骤1：** 计算未归一化的注意力分数 $\omega$。
- Suppose we use the second input token as the query, that is, $q^{(2)} = x^{(2)}$, we compute the unnormalized attention scores via dot products:
- 假设我们使用第二个输入标记作为查询，即 $q^{(2)} = x^{(2)}$，我们通过点积计算未归一化的注意力分数：
    - $\omega_{21} = x^{(1)} q^{(2)\top}$
    - $\omega_{22} = x^{(2)} q^{(2)\top}$
    - $\omega_{23} = x^{(3)} q^{(2)\top}$
    - ...
    - $\omega_{2T} = x^{(T)} q^{(2)\top}$
- Above, $\omega$ is the Greek letter "omega" used to symbolize the unnormalized attention scores.
- 上面，$\omega$ 是希腊字母“omega”，用来表示未归一化的注意力分数。
    - The subscript "21" in $\omega_{21}$ means that input sequence element 2 was used as a query against input sequence element 1.
    - $\omega_{21}$ 中的下标 "21" 表示输入序列元素2被用作查询，针对输入序列元素1。

<img src="figures/dot-product.png" width="450px">

- Suppose we have the following input sentence that is already embedded in 3-dimensional vectors as described in chapter 3 (we use a very small embedding dimension here for illustration purposes, so that it fits onto the page without line breaks):
- 假设我们有以下输入句子，已经按照第3章中描述的方式嵌入为3维向量（为了说明的目的，我们在这里使用了非常小的嵌入维度，以便它可以适合在页面上显示而不换行）：

In [2]:
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

- We use input sequence element 2, $x^{(2)}$, as an example to compute context vector $z^{(2)}$; later in this section, we will generalize this to compute all context vectors.
- 我们以输入序列的第2个元素 $x^{(2)}$ 为例来计算上下文向量 $z^{(2)}$；在本节的后面，我们将推广这种方法来计算所有的上下文向量。
- The first step is to compute the unnormalized attention scores by computing the dot product between the query $x^{(2)}$ and all other input tokens:
- 第一步是通过计算查询 $x^{(2)}$ 与所有其他输入标记的点积来计算非归一化的注意力分数：


In [None]:
query = inputs[1]  # 2nd input token is the query # 第二个输入标记作为查询

attn_scores_2 = torch.empty(inputs.shape[0]) # 创建一个空的一维张量，用于存储每个输入向量与查询向量的点积 # Create an empty one-dimensional tensor to store the dot product of each input vector with the query vector.

for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query) # dot product (transpose not necessary here since they are 1-dim vectors) # 点积（由于它们是一维向量，这里不需要转置）

print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


- Side note: a dot product is essentially a shorthand for multiplying two vectors elements-wise and summing the resulting products:
- 旁注：点积本质上是将两个向量的元素逐个相乘，然后将得到的乘积求和的简写方法：

In [4]:
res = 0. # 初始化为浮点数 0.0 # Initialization to the floating-point number 0.0

for idx, element in enumerate(inputs[0]):
    res += inputs[0][idx] * query[idx]

print(res)
print(torch.dot(inputs[0], query))

tensor(0.9544)
tensor(0.9544)


- **Step 2:** normalize the unnormalized attention scores ("omegas", $\omega$) so that they sum up to 1.
- **第二步：** 将非标准化的注意力分数（"omegas", $\omega$）进行归一化，使它们的总和为1。
- Here is a simple way to normalize the unnormalized attention scores to sum up to 1 (a convention, useful for interpretation, and important for training stability):
- 这里有一个简单的方法可以将非标准化的注意力分数归一化到总和为1（这是一个惯例，有助于解释，对训练稳定性也很重要）：


In [5]:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum() #  元素级别    #  element-wise

print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)


- However, in practice, using the softmax function for normalization, which is better at handling extreme values and has more desirable gradient properties during training, is common and recommended.
- 然而，在实际应用中，通常推荐使用softmax函数进行归一化，因为它在处理极端值时表现更好，并且在训练过程中具有更理想的梯度属性。
- Here's a naive implementation of a softmax function for scaling, which also normalizes the vector elements such that they sum up to 1:
- 下面是一个用于缩放的softmax函数的简单实现，它也使得向量元素归一化，使得它们的总和为1：

In [6]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)

print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


- The naive implementation above can suffer from numerical instability issues for large or small input values due to overflow and underflow issues.
- 上面的简单实现可能会因为溢出和下溢问题，在输入值很大或很小时遇到数值不稳定的问题。
- Hence, in practice, it's recommended to use the PyTorch implementation of softmax instead, which has been highly optimized for performance:
- 因此，在实践中，建议使用PyTorch的softmax实现，它已经针对性能进行了高度优化：

In [7]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


- **Step 3**: compute the context vector $z^{(2)}$ by multiplying the embedded input tokens, $x^{(i)}$ with the attention weights and sum the resulting vectors:
- **步骤3：** 通过将嵌入的输入标记 $x^{(i)}$ 与注意力权重相乘，并对结果向量求和，计算上下文向量 $z^{(2)}$：

In [8]:
query = inputs[1] # 2nd input token is the query # 第二个输入标记作为查询

context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i]*x_i

print(context_vec_2)

tensor([0.4419, 0.6515, 0.5683])


### 3.3.2 Computing attention weights for all input tokens
### 3.3.2 计算所有输入标记的注意力权重

#### Generalize to all input sequence tokens:
#### 推广到所有输入序列标记：

- Above, we computed the attention weights and context vector for input 2 (as illustrated in the highlighted row in the figure below).
- 在上面，我们计算了输入2的注意力权重和上下文向量（如下图中突出显示的行所示）。
- Next, we are generalizing this computation to compute all attention weights and context vectors.
- 接下来，我们将这个计算推广到计算所有注意力权重和上下文向量。

<img src="figures/attention-matrix.png" width="400px">

- Apply previous **step 1** to all pairwise elements to compute the unnormalized attention score matrix:
- 对所有成对元素应用前面的**步骤1**来计算未归一化的注意力分数矩阵：

In [9]:
attn_scores = torch.empty(6, 6)

for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)

print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- We can achieve the same as above more efficiently via matrix multiplication:
- 我们可以通过矩阵乘法更有效地实现与上述相同的结果：

In [10]:
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- Similar to **step 2** previously, we normalize each row so that the values in each row sum to 1:
- 类似于之前的**步骤2**，我们对每一行进行归一化，使得每一行的值总和为1：

In [11]:
attn_weights = torch.softmax(attn_scores, dim=1)
print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


- Quick verification that the values in each row indeed sum to 1:
- 快速验证每一行的值确实总和为1：

In [12]:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)

print("All row sums:", attn_weights.sum(dim=1))

Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


- Apply previous **step 3** to compute all context vectors:
- 应用前面的**步骤3**来计算所有上下文向量：

In [13]:
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


- As a sanity check, the previously computed context vector $z^{(2)} = [0.4419, 0.6515, 0.5683]$ can be found in the 2nd row in above: 
- 作为一个健全性检查，之前计算的上下文向量 $z^{(2)} = [0.4419, 0.6515, 0.5683]$ 可以在上面的第2行找到：

In [14]:
print("Previous 2nd context vector:", context_vec_2)

Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])


## 3.4 Implementing self-attention with trainable weights
## 3.4 使用可训练权重实现自注意力

### 3.4.1 Computing the attention weights step by step
### 3.4.1 逐步计算注意力权重

- In this section, we are implementing the self-attention mechanism that is used in the original transformer architecture, the GPT models, and most other popular LLMs.
- 在本节中，我们正在实现用于原始变换器架构、GPT模型以及大多数其他流行的LLMs中的自注意力机制。
- This self-attention mechanism is also called "scaled dot-product attention".
- 这种自注意力机制也被称为“缩放点积注意力”。
- The overall idea is similar to before:
- 总体思路与之前类似：
  - We want to compute context vectors as weighted sums over the input vectors specific to a certain input element.
  - 我们想要计算上下文向量作为特定输入元素与输入向量的加权和。
  - For the above, we need attention weights.
  - 为了上述目的，我们需要注意力权重。
- As you will see, there are only slight differences compared to the basic attention mechanism introduced earlier:
- 正如你将看到的，与早期介绍的基本注意力机制相比，只有轻微的差异：
  - The most notable difference is the introduction of weight matrices that are updated during model training.
  - 最显著的区别是引入了在模型训练期间更新的权重矩阵。
  - These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce "good" context vectors.
  - 这些可训练的权重矩阵至关重要，以便模型（特别是模型内部的注意力模块）可以学习产生“好”的上下文向量。

- Implementing the self-attention mechanism step by step, we will start by introducing the three training weight matrices $W_q$, $W_k$, and $W_v$.
- 逐步实现自注意力机制，我们将从介绍三个训练权重矩阵$W_q$、$W_k$和$W_v$开始。
- These three matrices are used to project the embedded input tokens, $x^{(i)}$, into query, key, and value vectors via matrix multiplication:
- 这三个矩阵用于通过矩阵乘法将嵌入的输入令牌$x^{(i)}$投影到查询向量、键向量和值向量上：
  - Query vector: $q^{(i)} = W_q \,x^{(i)}$
  - 查询向量：$q^{(i)} = W_q \,x^{(i)}$
  - Key vector: $k^{(i)} = W_k \,x^{(i)}$
  - 键向量：$k^{(i)} = W_k \,x^{(i)}$
  - Value vector: $v^{(i)} = W_v \,x^{(i)}$
  - 值向量：$v^{(i)} = W_v \,x^{(i)}$

<img src="figures/weight-selfattn-1.png" width="600px">

- The embedding dimensions of the input $x$ and the query vector $q$ can be the same or different, depending on the model's design and specific implementation.
- 输入 $x$ 和查询向量 $q$ 的嵌入维度可以相同也可以不同，这取决于模型的设计和具体实现。
- In GPT models, the input and output dimensions are usually the same, but for illustration purposes, to better follow the computation, we choose different input and output dimensions here:
- 在GPT模型中，输入和输出维度通常是相同的，但为了说明的目的，为了更好地理解计算，我们在这里选择了不同的输入和输出维度：

In [15]:
x_2 = inputs[1] # second input element # 第二个输入元素
d_in = inputs.shape[1] # the input embedding size, d=3 # 输入嵌入长度，维度为d=3
d_out = 2 # the output embedding size, d=2 # 输出嵌入长度，维度为d=2

- Below, we initialize the three weight matrices; note that we are setting `requires_grad=False` to reduce clutter in the outputs for illustration purposes, but if we were to use the weight matrices for model training, we would set `requires_grad=True` to update these matrices during model training.
- 下面，我们初始化了三个权重矩阵；请注意，出于说明目的，我们设置了 `requires_grad=False` 以减少输出中的混乱，但如果我们要在模型训练中使用这些权重矩阵，我们会设置 `requires_grad=True` 以在模型训练期间更新这些矩阵。

In [16]:
torch.manual_seed(123)

W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

- Next we compute the query, key, and value vectors:
- 接下来我们计算查询、键和值向量：

In [17]:
query_2 = x_2 @ W_query # _2 because it's with respect to the 2nd input element # _2 是因为它是相对于第二个输入元素的
key_2 = x_2 @ W_key 
value_2 = x_2 @ W_value

print(query_2)

tensor([0.4306, 1.4551])


- As we can see below, we successfully projected the 6 input tokens from a 3D onto a 2D embedding space:
- 如下所示，我们成功地将6个输入标记从3D投影到2D嵌入空间：

In [18]:
keys = inputs @ W_key 
values = inputs @ W_value

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])


- In the next step, **step 2**, we compute the unnormalized attention scores by computing the dot product between the query and each key vector:
- 在下一步中，**步骤2**，我们通过计算查询向量和每个键向量之间的点积来计算未归一化的注意力分数：

<img src="figures/weight-selfattn-2.png" width="600px">

In [19]:
keys_2 = keys[1] # Python starts index at 0 # Python 中索引从 0 开始
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

tensor(1.8524)


- Since we have 6 inputs, we have 6 attention scores for the given query vector:
- 由于我们有6个输入，对于给定的查询向量，我们有6个注意力分数：

In [20]:
attn_scores_2 = query_2 @ keys.T # All attention scores for given query # 给定查询向量的所有注意力分数
print(attn_scores_2)

tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])


<img src="figures/weight-selfattn-3.png" width="600px">

- Next, in **step 3**, we compute the attention weights (normalized attention scores that sum up to 1) using the softmax function we used earlier.
- 接下来，在**步骤3**中，我们使用我们之前使用过的 softmax 函数计算注意力权重（归一化的注意力分数，总和为1）。
- The difference to earlier is that we now scale the attention scores by dividing them by the square root of the embedding dimension, $\sqrt{d_k}$ (i.e., `d_k**0.5`):
- 与之前的不同之处在于，我们现在通过将注意力分数除以嵌入维度的平方根 $\sqrt{d_k}$（即 `d_k**0.5`）来对其进行缩放：

In [21]:
d_k = keys.shape[1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)

tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])


<img src="figures/weight-selfattn-4.png" width="600px">

- In **step 4**, we now compute the context vector for input query vector 2:
- 在**步骤4**中，我们现在计算输入查询向量2的上下文向量：

In [22]:
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)

tensor([0.3061, 0.8210])


### 3.4.2 Implementing a compact SelfAttention class
### 3.4.2 实现一个简洁的SelfAttention类

- Putting it all together, we can implement the self-attention mechanism as follows:
- 综合考虑以上所有内容，我们可以实现自注意力机制如下：

In [23]:
import torch.nn as nn

class SelfAttention_v1(nn.Module): # 继承自nn.Module基类 # Inherited from the nn.Module base class 

    def __init__(self, d_in, d_out):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        
        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs)) # 默认调用forward() # Forward() is called by default

tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)


- We can streamline the implementation above using PyTorch's Linear layers, which are equivalent to a matrix multiplication if we disable the bias units.
- 我们可以使用 PyTorch 的线性层来简化上面的实现，这些层相当于矩阵乘法，如果我们禁用偏置单元。
- Another big advantage of using `nn.Linear` over our manual `nn.Parameter(torch.rand(...)` approach is that `nn.Linear` has a preferred weight initialization scheme, which leads to more stable model training.
- 使用 `nn.Linear` 而不是我们手动的 `nn.Parameter(torch.rand(...)` 方法的另一个重大优势是，`nn.Linear` 具有首选的权重初始化方案，这导致模型训练更加稳定。

In [24]:
class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias) # y = xA^T + b，其中 A 是qkv的权重矩阵，b 是偏置向量 
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias) # y = xA^T + b, where A is the weight matrix of qkv, and b is the bias vector
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)


- Note that `SelfAttention_v1` and `SelfAttention_v2` give different outputs because they use different initial weights for the weight matrices.
- 请注意，`SelfAttention_v1` 和 `SelfAttention_v2` 给出不同的输出，因为它们对权重矩阵使用不同的初始权重。

## 3.5 Hiding future words with causal attention
## 3.5 使用因果注意力隐藏未来词语

### 3.5.1 Applying a causal attention mask
### 3.5.1 应用一个因果注意力掩码

- In this section, we are converting the previous self-attention mechanism into a causal self-attention mechanism.
- 在本节中，我们将先前的自注意力机制转换为因果自注意力机制。
- Causal self-attention ensures that the model's prediction for a certain position in a sequence is only dependent on the known outputs at previous positions, not on future positions.
- 因果自注意力确保模型对序列中某一位置的预测只依赖于先前位置的已知输出，而不依赖于未来位置。
- In simpler words, this ensures that each next word prediction should only depend on the preceding words.
- 简单来说，这确保了每个后续单词的预测只应依赖于前面的单词。
- To achieve this, for each given token, we mask out the future tokens (the ones that come after the current token in the input text):
- 为了实现这一点，对于每个给定的令牌（此前也称为标记），我们屏蔽掉未来的令牌（输入文本中当前令牌之后的令牌）：

<img src="figures/masked.png" width="600px">

- To illustrate and implement causal self-attention, let's work with the attention scores and weights from the previous section: 
- 为了说明和实现因果自注意力，让我们使用上一节中的注意力分数和权重：

In [25]:
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)
print(attn_weights)

tensor([[0.1972, 0.1910, 0.1894, 0.1361, 0.1344, 0.1520],
        [0.1476, 0.2164, 0.2134, 0.1365, 0.1240, 0.1621],
        [0.1479, 0.2157, 0.2129, 0.1366, 0.1260, 0.1608],
        [0.1505, 0.1952, 0.1933, 0.1525, 0.1375, 0.1711],
        [0.1571, 0.1874, 0.1885, 0.1453, 0.1819, 0.1399],
        [0.1473, 0.2033, 0.1996, 0.1500, 0.1160, 0.1839]])


- The simplest way to mask out future attention weights is by creating a mask via PyTorch's tril function with elements below the main diagonal (including the diagonal itself) set to 1 and above the main diagonal set to 0:
- 屏蔽未来的注意力权重的最简单方法是通过使用 PyTorch 的 `tril` 函数创建一个掩码，使主对角线以下的元素（包括对角线本身）设置为1，而主对角线以上的元素设置为0：

In [26]:
block_size = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(block_size, block_size))
print(mask_simple)

tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])


- Then, we can multiply the attention weights with this mask to zero out the attention scores above the diagonal:
- 然后，我们可以将注意力权重与这个掩码相乘，以将对角线以上的注意力分数归零：

In [27]:
masked_simple = attn_weights*mask_simple # 逐元素乘法 # element-wise multiplication
print(masked_simple)

tensor([[0.1972, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1476, 0.2164, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1479, 0.2157, 0.2129, 0.0000, 0.0000, 0.0000],
        [0.1505, 0.1952, 0.1933, 0.1525, 0.0000, 0.0000],
        [0.1571, 0.1874, 0.1885, 0.1453, 0.1819, 0.0000],
        [0.1473, 0.2033, 0.1996, 0.1500, 0.1160, 0.1839]])


- However, if the mask were applied after softmax, like above, it would disrupt the probability distribution created by softmax. Softmax ensures that all output values sum to 1. Masking after softmax would require re-normalizing the outputs to sum to 1 again, which complicates the process and might lead to unintended effects.
- 但是，如果在softmax之后应用掩码，就像上面一样，它会破坏softmax创建的概率分布。 Softmax确保所有输出值总和为1。 在softmax之后应用掩码将需要重新对输出进行归一化，以使其总和再次为1，这会使过程变得复杂，并可能导致意想不到的效果。

- To make sure that the rows sum to 1, we can normalize the attention weights as follows:
- 为了确保行总和为1，我们可以按如下方式对注意力权重进行归一化：

In [28]:
row_sums = masked_simple.sum(dim=1, keepdim=True)   # keepdim=True 保持原始张量的维度结构    # keepdim=True Preserve the dimensional structure of the original tensor.
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4056, 0.5944, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2566, 0.3741, 0.3693, 0.0000, 0.0000, 0.0000],
        [0.2176, 0.2823, 0.2796, 0.2205, 0.0000, 0.0000],
        [0.1826, 0.2178, 0.2191, 0.1689, 0.2115, 0.0000],
        [0.1473, 0.2033, 0.1996, 0.1500, 0.1160, 0.1839]])


- While we are technically done with coding the causal attention mechanism now, let's briefly look at a more efficient approach to achieve the same as above.
- 尽管我们在技术上已经完成了编写因果注意力机制的工作，但让我们简要地看一下实现相同效果的更有效方法。
- So, instead of zeroing out attention weights above the diagonal and renormalizing the results, we can mask the unnormalized attention scores above the diagonal with negative infinity before they enter the softmax function:
- 因此，我们可以在进入softmax函数之前用负无穷大掩盖对角线以上的未归一化的注意力分数，而不是将注意力权重归零并重新归一化结果：

In [29]:
mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)   # 根据布尔值替换为-inf  # Replace with -inf based on boolean value
print(masked)

tensor([[0.9995,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.9544, 1.4950,   -inf,   -inf,   -inf,   -inf],
        [0.9422, 1.4754, 1.4570,   -inf,   -inf,   -inf],
        [0.4753, 0.8434, 0.8296, 0.4937,   -inf,   -inf],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654,   -inf],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- 通过使用`-torch.inf`作为掩码值，这确保了在应用softmax之前，被掩码的位置不会对最终结果产生影响（因为`e^-inf`趋近于0）。
- By using `-torch.inf` as the mask value, this ensures that the masked positions have no impact on the final result (as `e^-inf` approaches 0) before softmax is applied. 

- As we can see below, now the attention weights in each row correctly sum to 1 again:
- 正如我们下面所看到的，现在每一行中的注意力权重正确地再次总和为1：

In [30]:
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=1)
print(attn_weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4056, 0.5944, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2566, 0.3741, 0.3693, 0.0000, 0.0000, 0.0000],
        [0.2176, 0.2823, 0.2796, 0.2205, 0.0000, 0.0000],
        [0.1826, 0.2178, 0.2191, 0.1689, 0.2115, 0.0000],
        [0.1473, 0.2033, 0.1996, 0.1500, 0.1160, 0.1839]])


### 3.5.2 Masking additional attention weights with dropout
### 3.5.2 使用dropout屏蔽额外的注意力权重

- In addition, we also apply dropout to reduce overfitting during training.
- 此外，我们还应用dropout来减少训练过程中的过拟合。
- Dropout can be applied in several places:
- Dropout可以应用在几个地方：
  - for example, after computing the attention weights;
  - 例如，在计算注意力权重之后；
  - or after multiplying the attention weights with the value vectors.
  - 或者在将注意力权重与值向量相乘之后。
- Here, we will apply the dropout mask after computing the attention weights because it's more common.
- 在这里，我们将在计算注意力权重之后应用dropout掩码，因为这更常见。

- Furthermore, in this specific example, we use a dropout rate of 50%, which means randomly masking out half of the attention weights. (When we train the GPT model later, we will use a lower dropout rate, such as 0.1 or 0.2.)
- 此外，在这个特定示例中，我们使用50%的dropout率，这意味着随机屏蔽掉一半的注意力权重。（当我们稍后训练GPT模型时，我们将使用更低的dropout率，例如0.1或0.2。）

<img src="figures/dropout.png" width="500px">

- If we apply a dropout rate of 0.5 (50%), the non-dropped values will be scaled accordingly by a factor of 1/0.5 = 2.
- 如果我们应用一个dropout率为0.5（50％），那么未被删除的值将按照1/0.5 = 2的因子进行相应的缩放。

In [31]:
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) # dropout rate of 50% # dropout 率为 50% 
example = torch.ones(6, 6) # create a matrix of ones # 创建一个全是1的矩阵

print(dropout(example))

tensor([[2., 2., 0., 2., 2., 0.],
        [0., 0., 0., 2., 0., 2.],
        [2., 2., 2., 2., 0., 2.],
        [0., 2., 2., 0., 0., 2.],
        [0., 2., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.]])


- Dropout基本思想是在训练过程中随机“丢弃”（即设置为零）神经网络中的一部分神经元（包括它们的输入和输出连接），从而减少模型对训练数据的过拟合。
- The basic idea of Dropout is to randomly "drop out" (i.e., set to zero) a portion of neurons in the neural network, including their input and output connections, during the training process, thereby reducing overfitting of the model to the training data.
- 当使用dropout时，实际上是在每个训练批次中，增加噪声，训练了一个“稀疏”的网络版本。
- When using dropout, in fact, noise is added in each training batch, training a "sparse" version of the network.
- 测试时，为了利用完整的网络能力并保持模型输出的期望不变，所有单元都保持激活状态。为了补偿训练时单元的丢弃，需要对网络权重进行缩放。
- During testing, to utilize the full capacity of the network while maintaining the expected output of the model, all units remain active. To compensate for the dropped units during training, the network weights need to be scaled.

In [32]:
torch.manual_seed(123)
print(dropout(attn_weights))

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5132, 0.7482, 0.7386, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.5646, 0.5592, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4357, 0.0000, 0.3378, 0.0000, 0.0000],
        [0.0000, 0.4065, 0.3991, 0.2999, 0.2320, 0.0000]])


### 3.5.3 Implementing a compact causal self-attention class
### 3.5.3 实现一个简洁的因果自注意力类

- Now, we are ready to implement a working implementation of self-attention, including the causal and dropout masks. 
- 现在，我们准备好实现一个可工作的自注意力实现，包括因果和dropout掩码。
- One more thing is to implement the code to handle batches consisting of more than one input so that our `CausalAttention` class supports the batch outputs produced by the data loader we implemented in chapter 2.
- 还有一件事是实现处理包含多个输入的批次的代码，这样我们的 `CausalAttention` 类就支持我们在第2章实现的数据加载器产生的批次输出。
- For simplicity, to simulate such batch input, we duplicate the input text example:
- 为了简单起见，为了模拟这样的批量输入，我们复制了输入文本示例：

In [33]:
batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape) # 2 inputs with 6 tokens each, and each token has embedding dimension 3 # 2个输入，每个有6个标记，每个标记的嵌入维度为3

torch.Size([2, 6, 3])


In [34]:
class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, block_size, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # New    # 新增的
        self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1)) # New  # 新增的（开辟了一片名为mask的缓存）

    def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b   # 新的批次维度 b
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose    # 修改了转置操作
        attn_scores.masked_fill_(  # New, _ ops are in-place    # 新增的，带下划线的操作是原地操作（是指fill后的下划线）
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) # 切片操作，将 mask 矩阵调整为合适的大小  # Slicing operation, adjusting the mask matrix to a suitable size
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=1)
        attn_weights = self.dropout(attn_weights) # New # 新增的

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)

block_size = batch.shape[1]
ca = CausalAttention(d_in, d_out, block_size, 0.0)

context_vecs = ca(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[-0.0844,  0.0414],
         [-0.2264, -0.0039],
         [-0.4163, -0.0564],
         [-0.5014, -0.1011],
         [-0.7754, -0.1867],
         [-1.1632, -0.3303]],

        [[-0.0844,  0.0414],
         [-0.2264, -0.0039],
         [-0.4163, -0.0564],
         [-0.5014, -0.1011],
         [-0.7754, -0.1867],
         [-1.1632, -0.3303]]], grad_fn=<UnsafeViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])


- Note that dropout is only applied during training, not during inference.
- 注意，dropout仅在训练期间应用，而不是在推理期间应用。

## 3.6 Extending single-head attention to multi-head attention
## 3.6 将单头注意力扩展到多头注意力

### 3.6.1 Stacking multiple single-head attention layers
### 3.6.1 堆叠多个单头注意力层

- Below is a summary of the self-attention implemented previously (causal and dropout masks not shown for simplicity).
- 以下是之前实现的自注意力机制的总结（为简单起见，未显示因果和dropout掩码）。

- This is also called single-head attention:
- 这也被称为单头注意力：

<img src="figures/single-head.png" width="600px">

- We simply stack multiple single-head attention modules to obtain a multi-head attention module:
- 我们简单地堆叠多个单头注意力模块以获得多头注意力模块：

<img src="figures/multi-head.png" width="600px">

- The main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions.
- 多头注意力背后的主要思想是用不同的学习线性投影（并行）多次运行注意力机制。这允许模型共同关注来自不同表示子空间在不同位置的信息。

In [35]:
class MultiHeadAttentionWrapper(nn.Module):

    def __init__(self, d_in, d_out, block_size, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList( # nn.ModuleList 创建了一个名为 heads 的列表对象 # nn.ModuleList creates a list object named heads.
            [CausalAttention(d_in, d_out, block_size, dropout, qkv_bias) 
             for _ in range(num_heads)] 
             # 下划线 _ 是一个惯用法，用来表示循环中的临时变量是不重要的，我们不打算在循环体内使用它。
             # The underscore _ is a convention used to indicate that a temporary variable in a loop is unimportant, and we do not intend to use it inside the loop body.
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)
        # 迭代每个 head 调用其 forward 方法（也就是用 head(x) 来计算输出），并沿着最后一个维度（dim=-1）拼接起来。
        # Iterate over each head and call its forward method (i.e., compute the output with head(x)), and concatenate them along the last dimension (dim=-1).

torch.manual_seed(123)

block_size = batch.shape[1] # This is the number of tokens  # 这是令牌的数量
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(d_in, d_out, block_size, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[-0.0844,  0.0414,  0.0766,  0.0171],
         [-0.2264, -0.0039,  0.2143,  0.1185],
         [-0.4163, -0.0564,  0.3878,  0.2453],
         [-0.5014, -0.1011,  0.4992,  0.3401],
         [-0.7754, -0.1867,  0.7387,  0.4868],
         [-1.1632, -0.3303,  1.1224,  0.8460]],

        [[-0.0844,  0.0414,  0.0766,  0.0171],
         [-0.2264, -0.0039,  0.2143,  0.1185],
         [-0.4163, -0.0564,  0.3878,  0.2453],
         [-0.5014, -0.1011,  0.4992,  0.3401],
         [-0.7754, -0.1867,  0.7387,  0.4868],
         [-1.1632, -0.3303,  1.1224,  0.8460]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])


- In the implementation above, the embedding dimension is 4, because we `d_out=2` as the embedding dimension for the key, query, and value vectors as well as the context vector. And since we have 2 attention heads, we have the output embedding dimension 2*2=4.
- 在上面的实现中，嵌入维度为4，因为我们将`d_out=2`作为键、查询和值向量以及上下文向量的嵌入维度。而且由于我们有2个注意力头,我们有输出嵌入维度2*2=4。

- If we want to have an output dimension of 2, as earlier in single-head attention, we can have to change the projection dimension `d_out` to 1:
- 如果我们希望与早期的单头注意力一样有输出维度2,我们必须将投影维度`d_out`更改为1:

In [36]:
torch.manual_seed(123)

d_out = 1
mha = MultiHeadAttentionWrapper(d_in, d_out, block_size, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[-9.1476e-02,  3.4164e-02],
         [-2.6796e-01, -1.3427e-03],
         [-4.8421e-01, -4.8909e-02],
         [-6.4808e-01, -1.0625e-01],
         [-8.8380e-01, -1.7140e-01],
         [-1.4744e+00, -3.4327e-01]],

        [[-9.1476e-02,  3.4164e-02],
         [-2.6796e-01, -1.3427e-03],
         [-4.8421e-01, -4.8909e-02],
         [-6.4808e-01, -1.0625e-01],
         [-8.8380e-01, -1.7140e-01],
         [-1.4744e+00, -3.4327e-01]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])


### 3.6.2 Implementing multi-head attention with weight splits
### 3.6.2 实现带有权重分割的多头注意力

- While the above is an intuitive and fully functional implementation of multi-head attention (wrapping the single-head attention `CausalAttention` implementation from earlier), we can write a stand-alone class called `MultiHeadAttention` to achieve the same.
- 虽然上面是一个直观且功能完整的多头注意力实现（包装了早期的单头注意力`CausalAttention`实现），但我们可以编写一个名为`MultiHeadAttention`的独立类来实现相同的功能。

- We don't concatenate single attention heads for this stand-alone `MultiHeadAttention` class. Instead, we create single W_query, W_key, and W_value weight matrices and then split those into individual matrices for each attention head:
- 对于这个独立的`MultiHeadAttention`类,我们不会连接单个注意力头。相反,我们创建单个W_query、W_key和W_value权重矩阵,然后将其拆分为每个注意力头的单个矩阵:

In [37]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, block_size, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by n_heads" # 断言（assert）确保 `d_out` 能够被 `num_heads` 整除。  # Assert that `d_out` is divisible by `num_heads`.

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim  # 将投影维度减少以匹配所需的输出维度

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs # 线性层以合并头输出
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)    # 形状：(b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # 我们通过添加一个 `num_heads` 维度来隐式分割矩阵
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        # 展开最后一个维度：(b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        # 转置：(b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        # 使用因果掩码计算缩放点积注意力（又名自注意力）
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head   # 每个头部的点积（头数相当于，类比于批次数）
        # Original mask truncated to the number of tokens and converted to boolean
        # 原始掩码截断到令牌数量并转换为布尔值
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        # Unsqueeze the mask twice to match dimensions
        # 对掩码进行两次扩展以匹配维度
        mask_unsqueezed = mask_bool.unsqueeze(0).unsqueeze(0)
        # Use the unsqueezed mask to fill attention scores
        # 使用扩展后的掩码填充注意力分数
        attn_scores.masked_fill_(mask_unsqueezed, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)   # 形状：(b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2) 

        # Combine heads, where self.d_out = self.num_heads * self.head_dim   # 合并头部，其中 self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)# 将上下文向量变为连续并改变其形状   # Make the context vector contiguous and reshape it
        context_vec = self.out_proj(context_vec) # optional projection # 对上下文向量进行可选的投影操作

        return context_vec

torch.manual_seed(123)

batch_size, block_size, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, block_size, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]],

        [[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])


- Note that the above is essentially a rewritten version of `MultiHeadAttentionWrapper` that is more efficient.
- 注意,上面实质上是一个重写的`MultiHeadAttentionWrapper`版本,更加高效。

- The resulting output looks a bit different since the random weight initializations differ, but both are fully functional implementations that can be used in the GPT class we will implement in the upcoming chapters.
- 由于随机权重初始化不同,因此得到的输出看起来有一点不同,但两者都是可以在我们即将实现的GPT类中使用的功能完整的实现。

- Note that in addition, we added a linear projection layer (`self.out_proj `) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. It's a standard convention to use such a projection layer in LLM implementation, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter)
- 另请注意,我们还在上面的`MultiHeadAttention`类中添加了一个线性投影层(`self.out_proj`)。这只是一个不改变维度的线性变换。在LLM实现中使用这种投影层是标准惯例,但并非是绝对必要的(最新研究表明,它可以被移除而不会影响建模性能;请参阅本章末尾的进一步阅读部分)

- Note that if you are interested in a compact and efficient implementation of the above, you can also consider the [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) class in PyTorch.
- 请注意，如果您对上述内容的紧凑和高效实现感兴趣，您也可以考虑使用 PyTorch 中的 [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) 类。

- Since the above implementation may look a bit complex at first glance, let's look at what happens when executing `attn_scores = queries @ keys.transpose(2, 3)`:
- 鉴于上述实现乍看之下可能有些复杂，让我们来看看执行 `attn_scores = queries @ keys.transpose(2, 3)` 时发生了什么：

In [38]:
# (b, num_heads, num_tokens, head_dim) = (1, 2, 3, 4)
a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573],
                    [0.8993, 0.0390, 0.9268, 0.7388],
                    [0.7179, 0.7058, 0.9156, 0.4340]],

                   [[0.0772, 0.3565, 0.1479, 0.5331],
                    [0.4066, 0.2318, 0.4545, 0.9737],
                    [0.4606, 0.5159, 0.4220, 0.5786]]]])

print(a @ a.transpose(2, 3))

tensor([[[[1.3208, 1.1631, 1.2879],
          [1.1631, 2.2150, 1.8424],
          [1.2879, 1.8424, 2.0402]],

         [[0.4391, 0.7003, 0.5903],
          [0.7003, 1.3737, 1.0620],
          [0.5903, 1.0620, 0.9912]]]])


- In this case, the matrix multiplication implementation in PyTorch will handle the 4-dimensional input tensor so that the matrix multiplication is carried out between the 2 last dimensions (num_tokens, head_dim) and then repeated for the individual heads.
- 在这种情况下,PyTorch中的矩阵乘法实现将处理4维输入张量,以便在最后2个维度(num_tokens,head_dim)之间进行矩阵乘法,然后对每个单独的头重复此操作。

- For instance, the above becomes a more compact way to compute the matrix multiplication for each head separately:
- 例如,上面成为了一种更加紧凑的方式,用于单独为每个头计算矩阵乘法:

In [39]:
first_head = a[0, 0, :, :]
first_res = first_head @ first_head.T
print("First head:\n", first_res)

second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)

First head:
 tensor([[1.3208, 1.1631, 1.2879],
        [1.1631, 2.2150, 1.8424],
        [1.2879, 1.8424, 2.0402]])

Second head:
 tensor([[0.4391, 0.7003, 0.5903],
        [0.7003, 1.3737, 1.0620],
        [0.5903, 1.0620, 0.9912]])


In [40]:
block_size = 1024
d_in, d_out = 768, 768
num_heads = 12

mha = MultiHeadAttention(d_in, d_out, block_size, 0.0, num_heads)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(mha)

2360064

# Summary and takeaways
# 总结与要点

- See the [./multihead-attention.ipynb](./multihead-attention.ipynb) code notebook, which is a concise version of the data loader (chapter 2) plus the multi-head attention class that we implemented in this chapter and will need for training the GPT model in upcoming chapters.
- 请参阅 [./multihead-attention.ipynb](./multihead-attention.ipynb) 代码笔记本，它是数据加载器（第2章）的简洁版本，加上我们在本章实现的多头注意力类，并且在接下来的章节中训练 GPT 模型时将需要它。