<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Understanding the Difference Between Embedding Layers and Linear Layers

 - Embedding layers in PyTorch produce the same result as linear layers that multiply a one-hot encoded input by a weight matrix; we use embedding layers because a direct look-up is computationally more efficient
 - We will walk through this relationship step by step using code examples in PyTorch

In [1]:
# Import the PyTorch library
import torch

# Print the PyTorch version
print("PyTorch version:", torch.__version__)

PyTorch version: 2.5.0+cpu


<br>
&nbsp;

## Using nn.Embedding

In [4]:
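# Suppose we have the following 3 training examples,
# which may represent token IDs in an LLM context
idx = torch.tensor([2, 3, 1])

# The number of rows in the embedding matrix can be determined
# by taking the largest token ID + 1.
# If the highest token ID is 3, we need 4 rows
# for the possible token IDs 0, 1, 2, 3
num_idx = max(idx)+1

# The desired embedding dimension is a hyperparameter: it is set manually
# rather than learned during training. Good values are usually found
# by experiment; in practice it might be 32, 64, 128, and so on
out_dim = 5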

 - Let's implement a simple embedding layer:

In [5]:
# We set the random seed for reproducibility since
# weights in the embedding layer are initialized with
# small random values
torch.manual_seed(123)

# Create an embedding layer instance
# num_idx: vocabulary size (number of rows in the embedding matrix)
# out_dim: embedding vector dimension (number of columns in the embedding matrix)
embedding = torch.nn.Embedding(num_idx, out_dim)

 - We can optionally take a look at the embedding weights:

In [6]:
# Inspect the weight parameter of the embedding layer
# This is a matrix of shape [num_idx, out_dim]
# Each row is the embedding vector for one token ID
embedding.weight

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035, -0.5880,  1.5810],
        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015],
        [ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953]], requires_grad=True)

 - We can then use the embedding layer to obtain the vector representation of a training example with ID 1:

In [7]:
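# Use the embedding layer to convert the token with ID 1 into an embedding vector
# The input is a tensor containing the single element 1
# The output is a tensor of shape [1, out_dim]: the embedding vector for token ID 1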
embedding(torch.tensor([1]))

tensor([[ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],
       grad_fn=<EmbeddingBackward0>)

 - Below is a visualization of what happens under the hood:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/embeddings-and-linear-layers/1.png" width="400px">

 - Similarly, we can use the embedding layer to obtain the vector representation of a training example with ID 2:

In [8]:
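# Use the embedding layer to convert the token with ID 2 into an embedding vector
# The input is a tensor containing the single element 2
# The output is a tensor of shape [1, out_dim]: the embedding vector for token ID 2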
embedding(torch.tensor([2]))

tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315]],
       grad_fn=<EmbeddingBackward0>)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/embeddings-and-linear-layers/2.png" width="400px">

 - Now, let's convert all the training examples we defined previously:

In [7]:
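# Create a tensor containing the token IDs [2, 3, 1]
idx = torch.tensor([2, 3, 1])
# Use the embedding layer to convert the token IDs into embedding vectors
# The input has shape [3]; the output has shape [3, out_dim]
# Each row is the embedding vector of one token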
embedding(idx)

tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],
        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],
       grad_fn=<EmbeddingBackward0>)

- Under the hood, it's still the same look-up concept:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/embeddings-and-linear-layers/3.png" width="450px">
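 - The look-up can also be written as plain row indexing into the weight matrix. The following is a minimal sketch that rebuilds the layer with the same sizes as above (4 rows, 5 columns); the variable names `looked_up` and `indexed` are only illustrative:

```python
import torch

torch.manual_seed(123)
embedding = torch.nn.Embedding(4, 5)  # 4 possible token IDs, 5-dimensional vectors
idx = torch.tensor([2, 3, 1])

# Calling the layer returns one row of the weight matrix per token ID ...
looked_up = embedding(idx)
# ... which is the same as indexing the weight matrix by row
indexed = embedding.weight[idx]

print(torch.equal(looked_up, indexed))  # True
print(looked_up.shape)                  # torch.Size([3, 5])
```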

<br>
&nbsp;

## Using nn.Linear

 - Now, we will demonstrate that the embedding layer above produces exactly the same result as an `nn.Linear` layer applied to a one-hot encoded representation in PyTorch
 - First, let's convert the token IDs into a one-hot representation:

In [9]:
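# Use the one_hot function to convert the token IDs into a one-hot encoding
# idx has shape [3]; the output has shape [3, num_idx]
# A one-hot encoding represents each of n classes as an n-dimensional vector
# with a 1 at the position of that class and 0 everywhere else
# For example, with 4 classes (0 to 3), class 2 is encoded as [0, 0, 1, 0]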
onehot = torch.nn.functional.one_hot(idx)
onehot

tensor([[0, 0, 1, 0],
        [0, 0, 0, 1],
        [0, 1, 0, 0]])
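 - Note that `one_hot` infers the number of columns from the largest value in its input when `num_classes` is not given. That works here because `idx` contains the highest token ID, 3. The sketch below uses a hypothetical batch that lacks token ID 3 to show why passing `num_classes` explicitly can be the safer choice; `vocab_size` and `batch` are assumed names, not part of the code above:

```python
import torch
import torch.nn.functional as F

vocab_size = 4                # assumed vocabulary: token IDs 0, 1, 2, 3
batch = torch.tensor([2, 1])  # hypothetical batch in which token ID 3 does not occur

# Width inferred from the largest ID in the batch (2), giving 3 columns
inferred = F.one_hot(batch)
# Width fixed to the vocabulary size, giving 4 columns
explicit = F.one_hot(batch, num_classes=vocab_size)

print(inferred.shape)  # torch.Size([2, 3])
print(explicit.shape)  # torch.Size([2, 4])
```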

 - Next, we initialize a `Linear` layer, which carries out a matrix multiplication $X W^\top$:

In [10]:
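# Set the random seed so the results are reproducible
torch.manual_seed(123)
# Create a linear layer with input dimension num_idx and output dimension out_dim
# bias=False means no bias term is used
linear = torch.nn.Linear(num_idx, out_dim, bias=False)
# Inspect the weight matrix of the linear layer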
linear.weight

Parameter containing:
tensor([[-0.2039,  0.0166, -0.2483,  0.1886],
        [-0.4260,  0.3665, -0.3634, -0.3975],
        [-0.3159,  0.2264, -0.1847,  0.1871],
        [-0.4244, -0.3034, -0.1836, -0.0983],
        [-0.3814,  0.3274, -0.1179,  0.1605]], requires_grad=True)
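 - The weight matrix has shape `[out_dim, num_idx]`, which is why the layer multiplies by the transpose. As a small sketch, the layer's output can be compared against an explicit `x @ W.T`; the input `x` below is a hypothetical random tensor used only for this check:

```python
import torch

torch.manual_seed(123)
linear = torch.nn.Linear(4, 5, bias=False)  # in_features=4, out_features=5
x = torch.rand(3, 4)                        # hypothetical input with 3 rows

out_layer = linear(x)              # what the layer computes
out_manual = x @ linear.weight.T   # explicit X W^T

print(linear.weight.shape)                    # torch.Size([5, 4])
print(torch.allclose(out_layer, out_manual))  # True
```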

 - Note that the linear layer in PyTorch is also initialized with small random weights; to compare it directly with the `Embedding` layer above, both layers must use the same weights, which is why we reassign them here:

In [11]:
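# Set the linear layer's weights to the transpose of the embedding weights
# so that both layers compute with the same values
# Wrapping the tensor in nn.Parameter keeps it a trainable parameter that tracks gradients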
linear.weight = torch.nn.Parameter(embedding.weight.T)

 - Now we can use the linear layer on the one-hot encoded representation of the inputs:

In [12]:
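# Convert the one-hot encoding to floating point and pass it through the linear layer
# The linear layer computes the matrix product X @ W^T,
# where X is the one-hot matrix [3, 4] and W is the weight matrix [5, 4]
# The output has shape [3, 5], the same as the embedding layer's output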
linear(onehot.float())

tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],
        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]], grad_fn=<MmBackward0>)

 - As we can see, this is exactly the same as what we got when we used the embedding layer:

In [13]:
embedding(idx)

tensor([[ 0.6957, -1.8061, -1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096, -0.4076,  0.7953],
        [ 1.3010,  1.2753, -0.2010, -0.1606, -0.4015]],
       grad_fn=<EmbeddingBackward0>)
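 - Instead of comparing the printed tensors by eye, the match can be checked programmatically. The sketch below repeats the setup in one self-contained block; as an assumption that differs slightly from the cell above, it copies the transposed weights with `.detach().clone()` so that the two layers do not share storage:

```python
import torch

torch.manual_seed(123)
idx = torch.tensor([2, 3, 1])
embedding = torch.nn.Embedding(4, 5)
linear = torch.nn.Linear(4, 5, bias=False)

# Give the linear layer a copy of the transposed embedding weights
linear.weight = torch.nn.Parameter(embedding.weight.T.detach().clone())

onehot = torch.nn.functional.one_hot(idx, num_classes=4)
out_linear = linear(onehot.float())
out_embedding = embedding(idx)

print(torch.allclose(out_linear, out_embedding))  # True
```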

 - What happens under the hood is the following computation for the first training example's token ID:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/embeddings-and-linear-layers/4.png" width="450px">

 - And for the second training example's token ID:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/embeddings-and-linear-layers/5.png" width="450px">
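 - Written out with the weights printed earlier, the first training example (token ID 2, one-hot row $[0, 0, 1, 0]$) gives the following product, where the matrix on the right is $W^\top$, that is, the embedding weight matrix:

$$
\begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix}
0.3374 & -0.1778 & -0.3035 & -0.5880 & 1.5810 \\
1.3010 & 1.2753 & -0.2010 & -0.1606 & -0.4015 \\
0.6957 & -1.8061 & -1.1589 & 0.3255 & -0.6315 \\
-2.8400 & -0.7849 & -1.4096 & -0.4076 & 0.7953
\end{bmatrix}
=
\begin{bmatrix} 0.6957 & -1.8061 & -1.1589 & 0.3255 & -0.6315 \end{bmatrix}
$$

 - The single 1 in the one-hot row picks out the row with index 2, and every other row is multiplied by 0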

 - Since all but one entry in each one-hot encoded row are 0 (by design), this matrix multiplication is essentially a look-up of the weight-matrix row selected by the single 1
 - Matrix multiplication on one-hot encodings is therefore equivalent to the embedding layer look-up, but it can be inefficient for large embedding matrices because most of the multiplications are wasted multiplications by zero
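 - To get a rough sense of the scale, the sketch below counts the work for each approach. The sizes are assumptions chosen for illustration (the vocabulary size and embedding dimension of the smallest GPT-2 model, and a hypothetical 1,024 input tokens); it is a back-of-the-envelope operation count, not a benchmark:

```python
vocab_size = 50_257  # assumed vocabulary size (GPT-2 tokenizer)
emb_dim = 768        # assumed embedding dimension (smallest GPT-2 model)
num_tokens = 1_024   # hypothetical number of input tokens

# One-hot route: a [num_tokens, vocab_size] input times a [vocab_size, emb_dim] matrix
onehot_entries = num_tokens * vocab_size
matmul_macs = num_tokens * vocab_size * emb_dim  # multiply-accumulate operations

# Look-up route: copy one row of emb_dim values per token
lookup_reads = num_tokens * emb_dim

print(f"one-hot entries to store: {onehot_entries:,}")
print(f"matmul multiply-adds:     {matmul_macs:,}")
print(f"values read by look-up:   {lookup_reads:,}")
print(f"ratio:                    {matmul_macs // lookup_reads:,}")
```

 - Under these assumptions, the ratio equals the vocabulary size: all but one of the multiplications per output value involve a zero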