<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 3: Coding Attention Mechanisms

Packages that are being used in this notebook:

In [2]:
# Import the version helper from importlib.metadata
from importlib.metadata import version

# Print the installed PyTorch version
print("torch version:", version("torch"))

torch version: 2.5.0


- This chapter covers attention mechanisms, the engine of LLMs:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/01.webp?123" width="500px">

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/02.webp" width="600px">

## 3.1 The problem with modeling long sequences

- No code in this section
- Translating a text word by word isn't feasible due to the differences in grammatical structures between the source and target languages:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/03.webp" width="400px">

- Prior to the introduction of transformer models, encoder-decoder RNNs were commonly used for machine translation tasks
- In this setup, the encoder processes a sequence of tokens from the source language, using a hidden state (a kind of intermediate layer within the neural network) to generate a condensed representation of the entire input sequence; because the decoder only sees this single condensed state, information from early tokens can get lost in long sequences:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/04.webp" width="500px">

## 3.2 Capturing data dependencies with attention mechanisms

- No code in this section
- Through an attention mechanism, the text-generating decoder part of the network can selectively access all input tokens, which means that some input tokens matter more than others when generating a specific output token:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/05.webp" width="500px">

- Self-attention in transformers is a technique designed to enhance input representations by enabling each position in a sequence to interact with, and determine the relevance of, every other position within the same sequence

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/06.webp" width="300px">

## 3.3 Attending to different parts of the input with self-attention

### 3.3.1 A simple self-attention mechanism without trainable weights

- This section explains a very simplified variant of self-attention, which does not contain any trainable weights
- This is purely for illustration purposes and NOT the attention mechanism that is used in transformers
- The next section, section 3.3.2, will extend this simple attention mechanism to implement the real self-attention mechanism
- Suppose we are given an input sequence $x^{(1)}$ to $x^{(T)}$
  - The input is a text (for example, a sentence like "Your journey starts with one step") that has already been converted into token embeddings as described in chapter 2
  - For instance, $x^{(1)}$ is a d-dimensional vector representing the word "Your", and so forth
- **Goal:** compute context vectors $z^{(i)}$ for each input sequence element $x^{(i)}$ in $x^{(1)}$ to $x^{(T)}$ (where $z$ and $x$ have the same dimension)
    - A context vector $z^{(i)}$ is a weighted sum over the inputs $x^{(1)}$ to $x^{(T)}$
    - The context vector is "context"-specific to a certain input
      - Instead of $x^{(i)}$ as a placeholder for an arbitrary input token, let's consider the second input, $x^{(2)}$
      - And to continue with a concrete example, instead of the placeholder $z^{(i)}$, we consider the second output context vector, $z^{(2)}$
      - The second context vector, $z^{(2)}$, is a weighted sum over all inputs $x^{(1)}$ to $x^{(T)}$ weighted with respect to the second input element, $x^{(2)}$
      - The attention weights are the weights that determine how much each of the input elements contributes to the weighted sum when computing $z^{(2)}$
      - In short, think of $z^{(2)}$ as a modified version of $x^{(2)}$ that also incorporates information about all other input elements that are relevant to a given task at hand

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/07.webp" width="400px">

- (Please note that the numbers in this figure are truncated to one digit after the decimal point to reduce visual clutter; similarly, other figures may also contain truncated values)

- By convention, the unnormalized attention weights are referred to as **"attention scores"** whereas the normalized attention scores, which sum to 1, are referred to as **"attention weights"**


- The code below walks through the figure above step by step
 
 <br>
 
- **Step 1:** compute unnormalized attention scores $\omega$
- Suppose we use the second input token as the query, that is, $q^{(2)} = x^{(2)}$; we then compute the unnormalized attention scores via dot products:
    - $\omega_{21} = x^{(1)} q^{(2)\top}$
    - $\omega_{22} = x^{(2)} q^{(2)\top}$
    - $\omega_{23} = x^{(3)} q^{(2)\top}$
    - ...
    - $\omega_{2T} = x^{(T)} q^{(2)\top}$
- Above, $\omega$ is the Greek letter "omega" used to symbolize the unnormalized attention scores
    - The subscript "21" in $\omega_{21}$ means that input sequence element 2 was used as a query against input sequence element 1

- Suppose we have the following input sentence that is already embedded in 3-dimensional vectors as described in chapter 2 (we use a very small embedding dimension here for illustration purposes, so that it fits onto the page without line breaks):

In [3]:
# Import PyTorch
import torch

# Each row is the 3-dimensional embedding vector of one token
# in the 6-token sentence "Your journey starts with one step"
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

- (In this book, we follow the common machine learning and deep learning convention where training examples are represented as rows and feature values as columns; in the case of the tensor shown above, each row represents a word, and each column represents an embedding dimension)

- The primary objective of this section is to demonstrate how the context vector $z^{(2)}$ is calculated using the second input element, $x^{(2)}$, as a query

- The figure depicts the initial step in this process, which involves calculating the attention scores $\omega$ between the query $x^{(2)}$ and every input element (including $x^{(2)}$ itself) through a dot product operation

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/08.webp" width="400px">

- We use input sequence element 2, $x^{(2)}$, as an example to compute context vector $z^{(2)}$; later in this section, we will generalize this to compute all context vectors
- The first step is to compute the unnormalized attention scores by computing the dot product between the query $x^{(2)}$ and every input token:

In [4]:
# Use the 2nd input token as the query
query = inputs[1]

# Allocate one attention score per input token
attn_scores_2 = torch.empty(inputs.shape[0])

# Score every input token against the query
for i, x_i in enumerate(inputs):
    # torch.dot multiplies element-wise and sums: a1*b1 + a2*b2 + ... + an*bn
    # (no transpose is needed because both operands are 1-dim vectors)
    attn_scores_2[i] = torch.dot(x_i, query)

# Print the unnormalized attention scores
print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
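
- As a side sketch (not part of the original notebook), the loop above can also be written as a single matrix-vector product; the names `scores_loop` and `scores_vec` are illustrative assumptions:

```python
import torch

# Same toy embeddings as above
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)
query = inputs[1]

# Loop version, mirroring the cell above
scores_loop = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    scores_loop[i] = torch.dot(x_i, query)

# Vectorized version: (6, 3) @ (3,) -> (6,)
scores_vec = inputs @ query

print(scores_vec)
print(torch.allclose(scores_loop, scores_vec))
```

- Both versions compute the same six dot products; the vectorized form simply leaves the iteration to PyTorch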


- Side note: a dot product is essentially a shorthand for multiplying two vectors element-wise and summing the resulting products:

In [5]:
# Accumulator for the manually computed dot product
res = 0.

# Multiply matching elements of inputs[0] and query, then sum them up
for idx, element in enumerate(inputs[0]):
    res += inputs[0][idx] * query[idx]

# Manual result
print(res)
# torch.dot result, to confirm that both agree
print(torch.dot(inputs[0], query))

tensor(0.9544)
tensor(0.9544)


- **Step 2:** normalize the unnormalized attention scores ("omegas", $\omega$) so that they sum up to 1
- Here is a simple way to normalize the unnormalized attention scores to sum up to 1 (a convention, useful for interpretation, and important for training stability):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/09.webp" width="500px">

In [6]:
# Normalize by dividing each score by the sum of all scores
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()

# Print the resulting attention weights
print("Attention weights:", attn_weights_2_tmp)
# Print their sum to confirm that it is 1
print("Sum:", attn_weights_2_tmp.sum())

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)


- However, in practice, it is common and recommended to use the softmax function for normalization, since it is better at handling extreme values and has more desirable gradient properties during training
- Here's a naive implementation of a softmax function, which also normalizes the vector elements so that they sum up to 1:

In [7]:
# A naive softmax implementation
def softmax_naive(x):
    # torch.exp maps every real number to a positive value (e.g., exp(2) ≈ 7.389),
    # so dividing by the sum of the exponentials yields positive weights that sum to 1
    return torch.exp(x) / torch.exp(x).sum(dim=0)

# Compute the attention weights with the naive softmax
attn_weights_2_naive = softmax_naive(attn_scores_2)

# Print the resulting attention weights
print("Attention weights:", attn_weights_2_naive)
# Print their sum to confirm that it is 1
print("Sum:", attn_weights_2_naive.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


- The naive implementation above can suffer from numerical instability for large or small input values due to overflow and underflow
- Hence, in practice, it's recommended to use the PyTorch implementation of softmax instead, which has been highly optimized for performance:

In [8]:
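# Compute the attention weights with PyTorch's softmax
# dim=0 applies softmax along dimension 0 (the only dimension of this 1-dim tensor)
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

# Print the resulting attention weights
print("Attention weights:", attn_weights_2)
# Print their sum to confirm that it is 1
print("Sum:", attn_weights_2.sum())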

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
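
- To make the overflow issue concrete, here is a minimal sketch; `softmax_stable` is a hypothetical helper that uses the common "subtract the maximum" trick, and it is not claimed to be how PyTorch implements softmax internally:

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Subtracting the maximum does not change the result mathematically,
    # but the largest exponent becomes exp(0) = 1, which avoids overflow
    shifted = x - x.max()
    exp_shifted = torch.exp(shifted)
    return exp_shifted / exp_shifted.sum(dim=0)

# Deliberately large scores (assumed values, chosen to trigger overflow)
large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

# exp(1000.) overflows to inf in float32, so inf / inf yields nan
naive = softmax_naive(large_scores)
stable = softmax_stable(large_scores)

print("naive: ", naive)
print("stable:", stable)
print("torch: ", torch.softmax(large_scores, dim=0))
```

- For the small attention scores in this chapter, all three variants agree; the difference only shows up for extreme inputs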


- **Step 3:** compute the context vector $z^{(2)}$ by multiplying the embedded input tokens, $x^{(i)}$, with the attention weights and summing the resulting vectors:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/10.webp" width="500px">

In [9]:
# Use the 2nd input token as the query
query = inputs[1]

# Start from a zero vector with the same shape as the query
context_vec_2 = torch.zeros(query.shape)

# Weighted sum: scale each input token by its attention weight and accumulate
for i,x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i]*x_i

# Print the resulting context vector
print(context_vec_2)

tensor([0.4419, 0.6515, 0.5683])
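
- As a hedged aside, the weighted sum in step 3 is a vector-matrix product, so the loop can be collapsed into one line; the variable names `context_loop` and `context_vec` below are illustrative assumptions:

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)
query = inputs[1]

# Steps 1 and 2: scores against the query, normalized with softmax
attn_weights_2 = torch.softmax(inputs @ query, dim=0)

# Step 3 as a loop, mirroring the cell above
context_loop = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_loop += attn_weights_2[i] * x_i

# Step 3 as a vector-matrix product: (6,) @ (6, 3) -> (3,)
context_vec = attn_weights_2 @ inputs

print(context_vec)
print(torch.allclose(context_loop, context_vec))
```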


### 3.3.2 Computing attention weights for all input tokens

#### Generalize to all input sequence tokens:

- Above, we computed the attention weights and context vector for input 2 (as illustrated in the highlighted row in the figure below)
- Next, we generalize this computation to compute all attention weights and context vectors

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/11.webp" width="400px">

- (Please note that the numbers in this figure are truncated to two digits after the decimal point to reduce visual clutter; the values in each row should add up to 1.0 or 100%; similarly, digits in other figures are truncated)

- In self-attention, the process starts with the calculation of attention scores, which are subsequently normalized to derive attention weights that sum to 1
- These attention weights are then used to generate the context vectors through a weighted summation of the inputs

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/12.webp" width="400px">

- Apply the previous **step 1** to all pairs of elements to compute the unnormalized attention score matrix:

In [10]:
# Allocate a 6x6 tensor for the pairwise attention scores
attn_scores = torch.empty(6, 6)

# Nested loops over all pairs of input tokens
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        # Attention score between the i-th and j-th input token
        attn_scores[i, j] = torch.dot(x_i, x_j)

# Print the attention score matrix
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- We can achieve the same result more efficiently via matrix multiplication:

In [11]:
# Compute all attention scores with one matrix multiplication
# @ is the matrix multiplication operator; inputs.T is the transpose of inputs
# inputs is 6x3 and inputs.T is 3x6, so the product is the 6x6 score matrix
# This is more concise and efficient than the nested loops above
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- Similar to **step 2** previously, we normalize each row so that the values in each row sum to 1:

In [12]:
# Normalize each row (dim=-1) with softmax to obtain the attention weights
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


- Quick verification that the values in each row indeed sum to 1:

In [13]:
# Sum of the second row, using the values printed above
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)

# Row sums computed directly from the tensor
print("All row sums:", attn_weights.sum(dim=-1))

Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])
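
- The role of the `dim` argument can be checked programmatically; the sketch below (variable names such as `col_normalized` are assumptions for illustration) contrasts `dim=-1`, which normalizes each row, with `dim=0`, which would normalize each column instead:

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)
attn_scores = inputs @ inputs.T
ones = torch.ones(inputs.shape[0])

# dim=-1 normalizes across the last dimension, so every row sums to 1
attn_weights = torch.softmax(attn_scores, dim=-1)
print(torch.allclose(attn_weights.sum(dim=-1), ones))

# dim=0 would normalize every column instead, which is not what we want here
col_normalized = torch.softmax(attn_scores, dim=0)
print(torch.allclose(col_normalized.sum(dim=0), ones))
print(torch.allclose(col_normalized.sum(dim=-1), ones))
```

- Using `torch.allclose` avoids relying on hand-copied, rounded values when checking the sums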


- Apply the previous **step 3** to compute all context vectors:

In [14]:
# Multiply the attention weights (6x6) with the inputs (6x3) to get all context vectors
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


- As a sanity check, the previously computed context vector $z^{(2)} = [0.4419, 0.6515, 0.5683]$ can be found in the 2nd row above:


In [15]:
# Print the context vector computed earlier for the 2nd input
print("Previous 2nd context vector:", context_vec_2)

Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])
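
- The three steps of this section can be bundled into one small function; this is a minimal sketch, and `simple_self_attention` is a hypothetical name that does not appear in the original notebook:

```python
import torch

def simple_self_attention(x):
    """Self-attention without trainable weights (for illustration only)."""
    attn_scores = x @ x.T                              # step 1: pairwise dot products
    attn_weights = torch.softmax(attn_scores, dim=-1)  # step 2: row-wise normalization
    return attn_weights @ x                            # step 3: weighted sums of the inputs

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)

all_context_vecs = simple_self_attention(inputs)
print(all_context_vecs.shape)
print(all_context_vecs[1])  # context vector for the 2nd input
```

- The output has the same shape as the input, because each context vector here is a weighted sum of the original embeddings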


## 3.4 Implementing self-attention with trainable weights


- A conceptual framework illustrating how the self-attention mechanism developed in this section fits into the overall narrative and structure of this book and chapter


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/13.webp" width="400px">

### 3.4.1 Computing the attention weights step by step


- In this section, we are implementing the self-attention mechanism that is used in the original transformer architecture, the GPT models, and most other popular LLMs
- This self-attention mechanism is also called "scaled dot-product attention"
- The overall idea is similar to before:
  - We want to compute context vectors as weighted sums over the input vectors specific to a certain input element
  - For the above, we need attention weights
- As you will see, there are only slight differences compared to the basic attention mechanism introduced earlier:
  - The most notable difference is the introduction of weight matrices that are updated during model training
  - These trainable weight matrices are crucial so that the model (specifically, the attention module inside the model) can learn to produce "good" context vectors

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/14.webp" width="600px">

- Implementing the self-attention mechanism step by step, we will start by introducing the three trainable weight matrices $W_q$, $W_k$, and $W_v$
- These three matrices are used to project the embedded input tokens, $x^{(i)}$, into query, key, and value vectors via matrix multiplication (each $x^{(i)}$ is treated as a row vector, matching the `x @ W` products in the code below):

  - Query vector: $q^{(i)} = x^{(i)} \,W_q$
  - Key vector: $k^{(i)} = x^{(i)} \,W_k$
  - Value vector: $v^{(i)} = x^{(i)} \,W_v$


- The embedding dimensions of the input $x$ and the query vector $q$ can be the same or different, depending on the model's design and specific implementation
- In GPT models, the input and output dimensions are usually the same, but to make the computation easier to follow, we choose different input and output dimensions here:

In [16]:
x_2 = inputs[1]  # the second input element
d_in = inputs.shape[1]  # the input embedding size, d=3
d_out = 2  # the output embedding size, d=2

- Below, we initialize the three weight matrices; note that we are setting `requires_grad=False` to reduce clutter in the outputs for illustration purposes, but if we were to use the weight matrices for model training, we would set `requires_grad=True` to update these matrices during model training


In [17]:
# Fix the random seed so that the results are reproducible
torch.manual_seed(123)

# torch.rand draws values uniformly from the interval [0, 1)
# torch.nn.Parameter marks a tensor as a model parameter; with requires_grad=False,
# gradient tracking is switched off for this illustration

# Query weight matrix of shape (d_in, d_out)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
# Key weight matrix of shape (d_in, d_out)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
# Value weight matrix of shape (d_in, d_out)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

print(W_query)
print(W_key)
print(W_value)

Parameter containing:
tensor([[0.2961, 0.5166],
        [0.2517, 0.6886],
        [0.0740, 0.8665]])
Parameter containing:
tensor([[0.1366, 0.1025],
        [0.1841, 0.7264],
        [0.3153, 0.6871]])
Parameter containing:
tensor([[0.0756, 0.1966],
        [0.3164, 0.4017],
        [0.1186, 0.8274]])


- Next, we compute the query, key, and value vectors for the second input element:


In [18]:
# Query vector for the second input element
query_2 = x_2 @ W_query
# Key vector for the second input element
key_2 = x_2 @ W_key 
# Value vector for the second input element
value_2 = x_2 @ W_value

print(query_2)

tensor([0.4306, 1.4551])


- As we can see below, projecting all inputs at once maps the 6 input tokens from a 3D onto a 2D embedding space:

In [19]:
# Key vectors for all inputs
keys = inputs @ W_key 
# Value vectors for all inputs
values = inputs @ W_value

# Print the shapes of the key and value tensors
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])
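
- The same projection works for the queries; the following sketch uses a plain tensor instead of `nn.Parameter` for brevity (an assumption made for this illustration) and checks that the 2nd row of the batched result matches the single-token projection:

```python
import torch

torch.manual_seed(123)

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)
d_in, d_out = inputs.shape[1], 2

# Plain tensor as a stand-in for the query weight matrix
W_query = torch.rand(d_in, d_out)

# Project all six tokens at once: (6, 3) @ (3, 2) -> (6, 2)
queries = inputs @ W_query
# Project only the second token: (3,) @ (3, 2) -> (2,)
query_2 = inputs[1] @ W_query

print("queries.shape:", queries.shape)
print(torch.allclose(queries[1], query_2))
```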


- In the next step, **step 2**, we compute the unnormalized attention scores by computing the dot product between the query and each key vector:


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/15.webp" width="600px">

In [20]:
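# Key vector at index 1 (the second input element)
keys_2 = keys[1]
# Unnormalized attention score: dot product of the query and this key
attn_score_22 = query_2.dot(keys_2)
# Print the unnormalized attention score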
print(attn_score_22)

tensor(1.8524)


- Since we have 6 inputs, we have 6 attention scores for the given query vector:

In [21]:
# All attention scores for the given query: (2,) @ (2, 6) -> (6,)
attn_scores_2 = query_2 @ keys.T
print(attn_scores_2)

tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/16.webp" width="600px">

- Next, in **step 3**, we compute the attention weights (normalized attention scores that sum up to 1) using the softmax function we used earlier
- The difference to earlier is that we now scale the attention scores by dividing them by the square root of the embedding dimension of the keys, $\sqrt{d_k}$ (i.e., `d_k**0.5`):

In [22]:
# Embedding dimension of the key vectors
d_k = keys.shape[1]
# Divide the scores by sqrt(d_k), then apply softmax to get the attention weights
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
# Print the resulting attention weights
print(attn_weights_2)

tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])
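
- A small sketch of what the scaling does; the scores are copied (rounded) from the output above, while the random vectors, the seed, and the dimensions 2 and 512 are assumptions chosen only for illustration:

```python
import torch

# Attention scores for query 2, copied from the output above
scores = torch.tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])
d_k = 2

unscaled = torch.softmax(scores, dim=-1)
scaled = torch.softmax(scores / d_k**0.5, dim=-1)

# Dividing by sqrt(d_k) flattens the distribution slightly
print("max weight, unscaled:", unscaled.max().item())
print("max weight, scaled:  ", scaled.max().item())

# Dot products of random unit-variance vectors spread out more as the
# dimension grows; dividing by sqrt(d) brings the spread back to about 1
torch.manual_seed(0)
for d in (2, 512):
    q = torch.randn(1000, d)
    k = torch.randn(1000, d)
    dots = (q * k).sum(dim=-1)
    print(d, dots.std().item(), (dots / d**0.5).std().item())
```

- Large unscaled scores push softmax toward a near one-hot output, where gradients become very small; keeping the scores in a moderate range is the motivation for the $\sqrt{d_k}$ factor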


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/17.webp" width="600px">

- In **step 4**, we now compute the context vector for input query vector 2:


In [23]:
# Context vector: attention-weighted sum of the value vectors
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)

tensor([0.3061, 0.8210])


### 3.4.2 Implementing a compact SelfAttention class


- Putting it all together, we can implement the self-attention mechanism as follows:

In [24]:
import torch.nn as nn

# Self-attention with trainable query, key, and value weight matrices
class SelfAttention_v1(nn.Module):

    def __init__(self, d_in, d_out):
        # Initialize the nn.Module base class
        super().__init__()
        # Trainable projection matrices, each of shape (d_in, d_out)
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        # Project the inputs into keys, queries, and values
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        
        # Unnormalized attention scores
        attn_scores = queries @ keys.T # omega
        # Scale by sqrt(d_k), then normalize each row with softmax
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )

        # Context vectors: attention-weighted sums of the values
        context_vec = attn_weights @ values
        return context_vec

# Fix the random seed so that the initial weights are reproducible
torch.manual_seed(123)
# Instantiate the module
sa_v1 = SelfAttention_v1(d_in, d_out)
# Print the context vectors for all six input tokens
print(sa_v1(inputs))

tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/18.webp" width="400px">

- We can streamline the implementation above using PyTorch's Linear layers, which are equivalent to a matrix multiplication if we disable the bias units
- Another big advantage of using `nn.Linear` over our manual `nn.Parameter(torch.rand(...))` approach is that `nn.Linear` has a preferred weight initialization scheme, which leads to more stable model training

In [25]:
class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        # nn.Linear layers replace the manual weight matrices; with bias disabled
        # they perform a plain matrix multiplication
        # qkv_bias controls whether the layers include bias units
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        # Project the inputs into keys, queries, and values
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        # Unnormalized attention scores
        attn_scores = queries @ keys.T
        # Scale by sqrt(d_k), then normalize each row with softmax
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        # Context vectors: attention-weighted sums of the values
        context_vec = attn_weights @ values
        return context_vec

# Fix the random seed so that the initial weights are reproducible
torch.manual_seed(789)
# Instantiate the module
sa_v2 = SelfAttention_v2(d_in, d_out)
# Print the context vectors for all six input tokens
print(sa_v2(inputs))

tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)


- Note that `SelfAttention_v1` and `SelfAttention_v2` give different outputs because they use different initial weights for the weight matrices
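
- To check that the two classes really differ only in their initial weights, the sketch below copies the weights from a `SelfAttention_v2` instance into a `SelfAttention_v1` instance; since `nn.Linear` stores its weight with shape `(d_out, d_in)`, the matrices are transposed when copied (the copying code is an illustrative assumption, not part of the original notebook):

```python
import torch
import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys, queries, values = x @ self.W_key, x @ self.W_query, x @ self.W_value
        attn_weights = torch.softmax(queries @ keys.T / keys.shape[-1]**0.5, dim=-1)
        return attn_weights @ values

class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys, queries, values = self.W_key(x), self.W_query(x), self.W_value(x)
        attn_weights = torch.softmax(queries @ keys.T / keys.shape[-1]**0.5, dim=-1)
        return attn_weights @ values

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55]]
)

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(3, 2)
sa_v1 = SelfAttention_v1(3, 2)

# nn.Linear computes x @ weight.T, so weight.T has the (d_in, d_out) layout of v1
with torch.no_grad():
    sa_v1.W_query.copy_(sa_v2.W_query.weight.T)
    sa_v1.W_key.copy_(sa_v2.W_key.weight.T)
    sa_v1.W_value.copy_(sa_v2.W_value.weight.T)

print(torch.allclose(sa_v1(inputs), sa_v2(inputs), atol=1e-6))
```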

## 3.5 Hiding future words with causal attention

- In causal attention, the attention weights above the diagonal are masked, ensuring that for any given input, the LLM is unable to utilize future tokens while calculating the context vectors with the attention weights

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/19.webp" width="400px">

### 3.5.1 Applying a causal attention mask


- In this section, we are converting the previous self-attention mechanism into a causal self-attention mechanism
- Causal self-attention ensures that the model's prediction for a certain position in a sequence is only dependent on the known outputs at previous positions, not on future positions
- In simpler words, this ensures that each next-word prediction depends only on the preceding words
- To achieve this, for each given token, we mask out the future tokens (the ones that come after the current token in the input text):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/20.webp" width="600px">

- To illustrate and implement causal self-attention, let's work with the attention scores and weights from the previous section:

In [26]:
# Reuse the query and key weight matrices of the
# SelfAttention_v2 object from the previous section for convenience
queries = sa_v2.W_query(inputs)
keys = sa_v2.W_key(inputs) 
# Unnormalized attention scores: queries times the transposed keys
attn_scores = queries @ keys.T

# Scale by sqrt(d_k), then normalize each row with softmax
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
# Print the attention weights
print(attn_weights)

tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)


- The simplest way to mask out future attention weights is by creating a mask via PyTorch's `tril` function, with elements on and below the main diagonal set to 1 and elements above the main diagonal set to 0:

In [27]:
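# The context length is the number of rows in the attention score matrix
context_length = attn_scores.shape[0]
# torch.ones creates an all-ones matrix; torch.tril keeps the lower triangle
# (including the diagonal) and zeroes out everything above it
mask_simple = torch.tril(torch.ones(context_length, context_length))
# Print the mask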
print(mask_simple)

tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])


- Then, we can multiply the attention weights with this mask to zero out the attention weights above the diagonal:

In [28]:
# Element-wise multiplication zeroes the attention weights above the diagonal
masked_simple = attn_weights * mask_simple
print(masked_simple)

tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<MulBackward0>)


- However, applying the mask after softmax, as above, disrupts the probability distribution that softmax created
- Softmax ensures that the values in each row sum to 1
- After masking, the rows no longer sum to 1, so the outputs have to be re-normalized, which complicates the process and might lead to unintended effects

- To make sure that the rows sum to 1, we can normalize the attention weights as follows:


In [29]:
# Sum each row, keeping the dimension so the result broadcasts in the division
row_sums = masked_simple.sum(dim=-1, keepdim=True)
# Divide each row by its sum so that it sums to 1 again
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<DivBackward0>)
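- Why renormalizing works can be seen from the softmax definition. The notation below is introduced only for illustration: $s_{ij}$ is the scaled score between query $i$ and key $j$, $w_{ij}$ is the corresponding softmax weight, and $T$ is the number of tokens:

```latex
w_{ij} = \frac{e^{s_{ij}}}{\sum_{k=1}^{T} e^{s_{ik}}}
\quad\Longrightarrow\quad
\frac{w_{ij}}{\sum_{k \le i} w_{ik}}
= \frac{e^{s_{ij}}}{\sum_{k \le i} e^{s_{ik}}}
\qquad (j \le i)
```

- The full-row denominator cancels, so the renormalized weights equal a softmax taken over the unmasked positions only; the masked (future) positions do not influence the result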


 - While we are technically done with coding the causal attention mechanism now, let's briefly look at a more efficient way to achieve the same result
 - Instead of zeroing out attention weights above the diagonal and renormalizing the results, we can mask the unnormalized attention scores above the diagonal with negative infinity before they enter the softmax function:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/21.webp" width="450px">

In [30]:
# torch.triu with diagonal=1 keeps only the elements strictly above the main
# diagonal, so the mask is 1 at future positions and 0 elsewhere
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
# Fill the masked (future) positions of the attention scores with -inf
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)

tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
        [0.4594, 0.1703, 0.1731,   -inf,   -inf,   -inf],
        [0.2642, 0.1024, 0.1036, 0.0186,   -inf,   -inf],
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,   -inf],
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
       grad_fn=<MaskedFillBackward0>)


- As we can see below, the attention weights in each row now correctly sum to 1 again:

In [31]:
# Scale the masked scores by the square root of the key dimension and apply
# softmax along the last dimension; exp(-inf) = 0, so masked positions get weight 0
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
print(attn_weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)
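- As a quick sanity check, the two masking routes can be compared directly. The sketch below is a minimal, self-contained illustration: it uses random scores instead of the `sa_v2` outputs, and the names `scores`, `weights_renorm`, and `weights_inf` are assumptions made for this example rather than part of the chapter's code:

```python
import torch

torch.manual_seed(0)
n, d_k = 6, 2                      # illustrative sizes
scores = torch.randn(n, n)         # stand-in for unnormalized attention scores

# Route 1: softmax first, then zero out future positions and renormalize
weights = torch.softmax(scores / d_k**0.5, dim=-1)
masked_weights = weights * torch.tril(torch.ones(n, n))
weights_renorm = masked_weights / masked_weights.sum(dim=-1, keepdim=True)

# Route 2: fill future positions with -inf, then apply softmax once
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
weights_inf = torch.softmax(scores.masked_fill(future, -torch.inf) / d_k**0.5, dim=-1)

# Both routes should agree up to floating-point error
print(torch.allclose(weights_renorm, weights_inf))
# Each row of the masked weights sums to 1
print(weights_inf.sum(dim=-1))
```

- The second route needs only one normalization pass, which is why it is the one used in the remainder of this section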


### 3.5.2 Masking additional attention weights with dropout

 - In addition, we also apply dropout to reduce overfitting during training
 - Dropout can be applied in several places:
   - for example, after computing the attention weights;
   - or after multiplying the attention weights with the value vectors
 - Here, we will apply the dropout mask after computing the attention weights because it's more common

 - Furthermore, in this specific example, we use a dropout rate of 50%, which means randomly masking out about half of the attention weights. (When we train the GPT model later, we will use a lower dropout rate, such as 0.1 or 0.2.)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/22.webp" width="400px">

- If we apply a dropout rate of 0.5 (50%), the non-dropped values are scaled by a factor of 1/(1 - 0.5) = 2 to compensate.

In [32]:
# Fix the random seed so the result is reproducible
torch.manual_seed(123)

# Dropout layer with rate 0.5: each element is zeroed with probability 0.5
dropout = torch.nn.Dropout(0.5)

# 6x6 matrix of ones as example input
example = torch.ones(6, 6)

# Surviving values are scaled by 1 / (1 - 0.5) = 2
print(dropout(example))

tensor([[2., 2., 2., 2., 2., 2.],
        [0., 2., 0., 0., 0., 0.],
        [0., 0., 2., 0., 2., 0.],
        [2., 2., 0., 0., 0., 2.],
        [2., 0., 0., 0., 0., 2.],
        [0., 2., 0., 0., 0., 0.]])
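- The factor of 2 is a special case of the general rule that dropout with rate `p` rescales the surviving values by `1 / (1 - p)`, which keeps the expected value of each element unchanged. A minimal sketch, assuming an illustrative rate of 0.2 and a large tensor of ones (both chosen only for this demonstration):

```python
import torch

torch.manual_seed(123)
p = 0.2                                   # illustrative dropout rate
dropout = torch.nn.Dropout(p)             # a freshly created module is in training mode

x = torch.ones(1000, 1000)
y = dropout(x)

# Only two distinct values remain: 0 (dropped) and 1 / (1 - p) = 1.25 (kept)
print(y.unique())
# The mean stays close to 1, i.e. the expected value is preserved
print(y.mean())
```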


In [33]:
# Fix the random seed so the result is reproducible
torch.manual_seed(123)
# Apply dropout to the attention weights
print(dropout(attn_weights))

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.8966, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4921, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4350, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.0000, 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)


- Note that the resulting dropout outputs may look different depending on your operating system; you can read more about this inconsistency [here on the PyTorch issue tracker](https://github.com/pytorch/pytorch/issues/121595)

### 3.5.3 Implementing a compact causal self-attention class

- Now, we are ready to implement a working version of self-attention that includes the causal and dropout masks
- One more thing is to handle batches consisting of more than one input, so that our `CausalAttention` class supports the batch outputs produced by the data loader we implemented in chapter 2
- For simplicity, we simulate such a batch input by duplicating the input text example:

In [40]:
print(inputs)
# torch.stack adds a new leading batch dimension: two tensors of shape [6, 3]
# become one tensor of shape [2, 6, 3]
batch = torch.stack((inputs, inputs), dim=0)
# 2 inputs with 6 tokens each; each token has embedding dimension 3
print(batch.shape)

tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]])
torch.Size([2, 6, 3])


In [35]:
# Causal self-attention with a dropout mask and batch support
class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        # Linear layers that project the inputs to queries, keys, and values
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Register the causal mask as a buffer (1 above the diagonal, 0 elsewhere)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        # x has shape (batch size, number of tokens, input dimension)
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Swap dimensions 1 and 2 of keys so the batched matrix product lines up:
        # (b, num_tokens, d_out) @ (b, d_out, num_tokens) -> (b, num_tokens, num_tokens)
        attn_scores = queries @ keys.transpose(1, 2)
        # Apply the causal mask in place, truncated to the actual number of tokens,
        # so that each token attends only to itself and the tokens before it
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        # Context vectors: attention-weighted sum of the value vectors
        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(123)

# The context length equals the number of tokens per input
context_length = batch.shape[1]
# A dropout rate of 0.0 disables dropout for this example
ca = CausalAttention(d_in, d_out, context_length, 0.0)

context_vecs = ca(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]],

        [[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]]], grad_fn=<UnsafeViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
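- One way to convince ourselves that the causal mask prevents information from leaking backwards is to change a future token and confirm that earlier context vectors stay the same. The sketch below is a simplified stand-in for `CausalAttention`, not the class itself: it assumes plain random weight tensors instead of `nn.Linear` layers, a single unbatched sequence, and no dropout, and the helper `causal_context` is a hypothetical name used only here:

```python
import torch

torch.manual_seed(123)
num_tokens, d_in, d_out = 6, 3, 2          # same sizes as the running example
x = torch.rand(num_tokens, d_in)            # random stand-in for `inputs`
W_q, W_k, W_v = (torch.rand(d_in, d_out) for _ in range(3))

def causal_context(x):
    # Single-sequence causal attention without dropout
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T
    future = torch.triu(torch.ones(len(x), len(x)), diagonal=1).bool()
    weights = torch.softmax(scores.masked_fill(future, -torch.inf) / d_out**0.5, dim=-1)
    return weights @ v

out = causal_context(x)

# Change only the last token and recompute
x_changed = x.clone()
x_changed[-1] += 1.0
out_changed = causal_context(x_changed)

# The first five rows cannot see the last token, so they should be unchanged
print(torch.allclose(out[:-1], out_changed[:-1]))
# The last row attends to the changed token, so it should differ
print(torch.allclose(out[-1], out_changed[-1]))
```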


- Note that dropout is only applied during training, not during inference
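- In PyTorch, this switch is controlled by the module's mode. A minimal sketch using a stand-alone `nn.Dropout` layer (the tensor `x` and the variable names are illustrative assumptions):

```python
import torch

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)
x = torch.ones(2, 4)

dropout.train()                 # training mode: entries are zeroed at random, the rest scaled by 2
y_train = dropout(x)
print(y_train)

dropout.eval()                  # inference mode: dropout passes the input through unchanged
y_eval = dropout(x)
print(y_eval)
```

- Calling `.eval()` or `.train()` on a parent module propagates to its submodules, so switching a `CausalAttention` instance to evaluation mode also disables its dropout layer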

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/23.webp" width="500px">

## 3.6 Extending single-head attention to multi-head attention

### 3.6.1 Stacking multiple single-head attention layers

- Below is a summary of the self-attention mechanism implemented previously (causal and dropout masks not shown for simplicity)

- This is also called single-head attention:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/24.webp" width="400px">

- We simply stack multiple single-head attention modules to obtain a multi-head attention module:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/25.webp" width="400px">

- The main idea behind multi-head attention is to run the attention mechanism multiple times (in parallel) with different, learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions.

In [36]:
# Wrapper that runs several CausalAttention heads and concatenates their outputs
class MultiHeadAttentionWrapper(nn.Module):

    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        # One independent CausalAttention module per head
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        # Concatenate the outputs of all heads along the last dimension
        return torch.cat([head(x) for head in self.heads], dim=-1)


torch.manual_seed(123)

context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
# Two attention heads, dropout disabled
mha = MultiHeadAttentionWrapper(
    d_in, d_out, context_length, 0.0, num_heads=2
)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]],

        [[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])


- In the implementation above, the embedding dimension is 4, because we use `d_out=2` as the embedding dimension for the key, query, and value vectors as well as the context vector. Since we have 2 attention heads, the output embedding dimension is 2*2=4
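- The shape arithmetic can be checked in isolation. In the sketch below, zero tensors stand in for the per-head context vectors (an assumption for illustration; only the shapes matter here):

```python
import torch

batch_size, num_tokens, d_out, num_heads = 2, 6, 2, 2   # values from the example above
# Stand-ins for the per-head context vectors, each of shape (2, 6, 2)
head_outputs = [torch.zeros(batch_size, num_tokens, d_out) for _ in range(num_heads)]

# Concatenating along the last dimension gives num_heads * d_out features per token
combined = torch.cat(head_outputs, dim=-1)
print(combined.shape)   # torch.Size([2, 6, 4])
```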

### 3.6.2 Implementing multi-head attention with weight splits

- While the above is an intuitive and fully functional implementation of multi-head attention (wrapping the single-head attention `CausalAttention` implementation from earlier), we can write a stand-alone class called `MultiHeadAttention` to achieve the same

- We don't concatenate single attention heads for this stand-alone `MultiHeadAttention` class

- Instead, we create single `W_query`, `W_key`, and `W_value` weight matrices and then split those into individual matrices for each attention head:

In [37]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # Dimension of each attention head

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        # Causal mask: 1 above the diagonal (future positions), 0 elsewhere
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x)  # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Split the last dimension into heads by adding a num_heads dimension:
        # (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape after the transpose: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # Optional projection

        return context_vec

torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
# Two heads with d_out=2 give head_dim=1; dropout is disabled
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]],

        [[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
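- The weight split used in `MultiHeadAttention` can be examined on its own. The sketch below assumes illustrative sizes (`d_out=4` with 2 heads, so each head has dimension 2) and looks only at the query projection. It checks that one large projection followed by `view` and `transpose` gives the same per-head queries as projecting with separate row blocks of the weight matrix:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
b, num_tokens, d_in, d_out, num_heads = 2, 6, 3, 4, 2   # illustrative sizes
head_dim = d_out // num_heads

x = torch.rand(b, num_tokens, d_in)
W_query = nn.Linear(d_in, d_out, bias=False)

# One large projection, then split the last dimension into heads
queries = W_query(x).view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
print(queries.shape)    # torch.Size([2, 2, 6, 2]): (b, num_heads, num_tokens, head_dim)

# The same result, computed head by head from row blocks of the weight matrix
# (nn.Linear stores its weight with shape (d_out, d_in))
for h in range(num_heads):
    W_h = W_query.weight[h * head_dim:(h + 1) * head_dim]   # shape (head_dim, d_in)
    print(torch.allclose(queries[:, h], x @ W_h.T, atol=1e-6))
```

- A single large matrix multiplication replaces one multiplication per head, which is the main reason this variant is more efficient than the wrapper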


- Note that the above is essentially a rewritten version of `MultiHeadAttentionWrapper` that is more efficient

- The resulting output looks a bit different since the random weight initializations differ and `d_out` is now the combined dimension of all heads (output shape `[2, 6, 2]` rather than `[2, 6, 4]`), but both are fully functional implementations that can be used in the GPT class we will implement in the upcoming chapters

- Note that in addition, we added a linear projection layer (`self.out_proj`) to the `MultiHeadAttention` class above. This is simply a linear transformation that doesn't change the dimensions. It's a standard convention to use such a projection layer in LLM implementations, but it's not strictly necessary (recent research has shown that it can be removed without affecting the modeling performance; see the further reading section at the end of this chapter)


<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/26.webp" width="400px">

- If you are interested in a compact and efficient implementation of the above, you can also consider the [`torch.nn.MultiheadAttention`](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) class in PyTorch
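- A minimal usage sketch of the built-in class with a causal mask follows. The sizes and variable names are assumptions for illustration. Unlike the `MultiHeadAttention` class above, the built-in module by default expects inputs whose last dimension already equals `embed_dim`, so the example uses 4-dimensional inputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
batch_size, num_tokens, embed_dim, num_heads = 2, 6, 4, 2   # illustrative sizes
x = torch.rand(batch_size, num_tokens, embed_dim)

# batch_first=True makes the module accept (batch, tokens, features) inputs
builtin_mha = nn.MultiheadAttention(embed_dim, num_heads, dropout=0.0, batch_first=True)

# Boolean mask: True marks positions a token is NOT allowed to attend to
causal_mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()

# Self-attention: query, key, and value are all the same input
out, weights = builtin_mha(x, x, x, attn_mask=causal_mask)
print(out.shape)       # torch.Size([2, 6, 4])
print(weights.shape)   # torch.Size([2, 6, 6]), averaged over the heads
```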

- Since the above implementation may look a bit complex at first glance, let's look at what happens when executing `attn_scores = queries @ keys.transpose(2, 3)`:

In [38]:
# 4-D tensor a with shape (b, num_heads, num_tokens, head_dim) = (1, 2, 3, 4):
# each of the 2 heads holds 3 tokens, each represented by a 4-dimensional vector
a = torch.tensor([[[[0.2745, 0.6584, 0.2775, 0.8573],   # head 1, token 1
                    [0.8993, 0.0390, 0.9268, 0.7388],   # head 1, token 2
                    [0.7179, 0.7058, 0.9156, 0.4340]],  # head 1, token 3

                   [[0.0772, 0.3565, 0.1479, 0.5331],   # head 2, token 1
                    [0.4066, 0.2318, 0.4545, 0.9737],   # head 2, token 2
                    [0.4606, 0.5159, 0.4220, 0.5786]]]]) # head 2, token 3

# Multiply a by its transpose over the last two dimensions:
# for each head, this yields the dot products between all pairs of tokens
print(a @ a.transpose(2, 3))

tensor([[[[1.3208, 1.1631, 1.2879],
          [1.1631, 2.2150, 1.8424],
          [1.2879, 1.8424, 2.0402]],

         [[0.4391, 0.7003, 0.5903],
          [0.7003, 1.3737, 1.0620],
          [0.5903, 1.0620, 0.9912]]]])


- In this case, the matrix multiplication implementation in PyTorch handles the 4-dimensional input tensor so that the matrix multiplication is carried out over the last 2 dimensions (num_tokens, head_dim) and repeated for each head

- For instance, the batched multiplication above is a more compact way to compute the matrix multiplication for each head separately, which looks as follows:

In [39]:
# a[0, 0, :, :] selects the first batch entry and the first head,
# keeping all tokens and all feature dimensions
first_head = a[0, 0, :, :]
# Dot products between the tokens of the first head
first_res = first_head @ first_head.T
print("First head:\n", first_res)

# Same for the second head
second_head = a[0, 1, :, :]
second_res = second_head @ second_head.T
print("\nSecond head:\n", second_res)

First head:
 tensor([[1.3208, 1.1631, 1.2879],
        [1.1631, 2.2150, 1.8424],
        [1.2879, 1.8424, 2.0402]])

Second head:
 tensor([[0.4391, 0.7003, 0.5903],
        [0.7003, 1.3737, 1.0620],
        [0.5903, 1.0620, 0.9912]])


# Summary and takeaways

- See the [./multihead-attention.ipynb](./multihead-attention.ipynb) code notebook, which is a concise version of the data loader (chapter 2) plus the multi-head attention class that we implemented in this chapter and will need for training the GPT model in upcoming chapters

- You can find the exercise solutions in [./exercise-solutions.ipynb](./exercise-solutions.ipynb)