<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Converting Llama 2 to Llama 3.2 From Scratch

- This is a follow-up notebook to [Converting a From-Scratch GPT Architecture to Llama 2](./converting-gpt-to-llama2.ipynb); it converts Meta AI's Llama 2 architecture step by step to Llama 3, Llama 3.1, and Llama 3.2

- The explanations are purposefully kept minimal in this notebook so as not to bloat it unnecessarily and to keep the focus on the main code

- For more information about the architectures, please see the Llama 2 and Llama 3 papers
  - [Llama 2: Open Foundation and Fine-Tuned Chat Models (2023)](https://arxiv.org/abs/2307.09288)
  - [The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/gpt2-to-llama2-llama3.webp?1">

In [1]:
# pip install -r requirements-extra.txt

- Packages that are being used in this notebook:

In [2]:
from importlib.metadata import version

pkgs = [
    "blobfile",         # to download pretrained weights
    "huggingface_hub",  # to download pretrained weights
    "tiktoken",         # to implement the tokenizer
    "torch",            # to implement the model
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

blobfile version: 3.0.0
huggingface_hub version: 0.24.7
tiktoken version: 0.8.0
torch version: 2.4.1+cu121


&nbsp;
# 1. Convert the Llama model implementation step by step

- If you are new to implementing LLM architectures, I recommend starting with [chapter 4](../../ch04/01_main-chapter-code/ch04.ipynb), which walks you through the implementation of the original GPT architecture step by step

- The [Converting a From-Scratch GPT Architecture to Llama 2](./converting-gpt-to-llama2.ipynb) notebook then implements the Llama-specific components, such as RMSNorm layers, SiLU and SwiGLU activations, RoPE (rotary position embeddings), and the SentencePiece tokenizer

- This notebook takes the Llama 2 architecture and transforms it into the Llama 3 architecture by
    1. modifying the rotary embeddings
    2. implementing grouped-query attention
    3. and using a customized version of the GPT-4 tokenizer

- Later, we load the original Llama 3 weights shared by Meta AI into the architecture

&nbsp;
## 1.1 Reusing Llama 2 components

- Llama 2 is actually quite similar to Llama 3, as mentioned above and illustrated in the figure at the top of this notebook
- This means that we can import several building blocks from the [Llama 2 notebook](./converting-gpt-to-llama2.ipynb) using the following code

In [3]:
import os
import sys
import io
import nbformat
import types

def import_from_notebook():
    def import_definitions_from_notebook(fullname, names):
        # Build the normalized path to the notebook in the current working directory
        current_dir = os.getcwd()
        path = os.path.join(current_dir, fullname + ".ipynb")
        path = os.path.normpath(path)

        # Check that the notebook file exists
        if not os.path.exists(path):
            raise FileNotFoundError(f"Notebook file not found at: {path}")

        # Open and read the notebook
        with io.open(path, "r", encoding="utf-8") as f:
            nb = nbformat.read(f, as_version=4)

        # Create a module to store the imported functions and classes
        mod = types.ModuleType(fullname)
        sys.modules[fullname] = mod

        # Go through the notebook cells and only execute function or class definitions
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell_code = cell.source
                for name in names:
                    # Check whether the cell contains the function or class definition
                    if f"def {name}" in cell_code or f"class {name}" in cell_code:
                        exec(cell_code, mod.__dict__)
        return mod

    # Notebook file name (without the .ipynb extension)
    fullname = "converting-gpt-to-llama2"
    # Names of the functions and classes to import
    names = ["precompute_rope_params", "compute_rope", "SiLU", "FeedForward", "RMSNorm", "MultiHeadAttention"]

    return import_definitions_from_notebook(fullname, names)

In [4]:
# Import the definitions from the Llama 2 notebook
imported_module = import_from_notebook()

# We need to redefine precompute_rope_params
# precompute_rope_params = getattr(imported_module, "precompute_rope_params", None)

compute_rope = getattr(imported_module, "compute_rope", None)
SiLU = getattr(imported_module, "SiLU", None)
FeedForward = getattr(imported_module, "FeedForward", None)
RMSNorm = getattr(imported_module, "RMSNorm", None)

# MultiHeadAttention is only needed for comparison purposes
MultiHeadAttention = getattr(imported_module, "MultiHeadAttention", None)

&nbsp;
## 1.2 Modified RoPE

- Llama 3 uses rotary position embeddings (RoPE) similar to Llama 2 (for a detailed explanation, please see the [RoPE paper](https://arxiv.org/abs/2104.09864))

- There are some subtle differences in the RoPE settings, though

 - Llama 3 now supports up to 8,192 tokens, twice as many as Llama 2 (4,096)

 - The base value for the so-called RoPE $\theta$ was increased from 10,000 (Llama 2) to 500,000 (Llama 3) in the following equation (adapted from the [RoPE paper](https://arxiv.org/abs/2104.09864))

$$\Theta = \left\{\theta_i = \text{base}^{\frac{-2(i-1)}{d}}, i \in \left[1, 2, ..., d/2\right]\right\}$$

- These $\theta$ values are a set of predefined parameters that are used to determine the rotational angles in the rotary matrix, where $d$ is the dimensionality of the embedding space

- Increasing the base from 10,000 to 500,000 makes every $\theta_i$ with $i > 1$ smaller, so the frequencies (rotation angles per position) fall off faster across the dimensions; the higher dimensions are therefore associated with smaller angles and longer wavelengths ($2\pi/\theta_i$) than before, i.e., the frequencies are spread over a wider range
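
- As a quick numerical check of the equation above, the following sketch computes the $\theta_i$ values for both bases; it uses plain Python, a hypothetical helper name (`rope_thetas`), and a small $d = 16$ that is chosen only for illustration

```python
def rope_thetas(base, d):
    # theta_i = base^(-2(i-1)/d) for i = 1, ..., d/2 (hypothetical helper for illustration)
    return [base ** (-2 * (i - 1) / d) for i in range(1, d // 2 + 1)]

d = 16  # illustrative dimensionality, not a Llama setting
thetas_llama2 = rope_thetas(10_000, d)
thetas_llama3 = rope_thetas(500_000, d)

for i, (t2, t3) in enumerate(zip(thetas_llama2, thetas_llama3), start=1):
    print(f"i={i}: base 10,000 -> {t2:.6f} | base 500,000 -> {t3:.6f}")
```

- The first value is 1 for both bases; for every later index, the larger base yields the smaller $\theta_i$, that is, a smaller rotation per position and a longer wavelength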

- In addition, we introduce a `freq_config` section in the code below that adjusts the frequencies; however, we won't need it for Llama 3 (only for Llama 3.1 and Llama 3.2), so we will revisit this `freq_config` later (it's set to `None` and ignored by default)

In [5]:
import torch

# Precompute the rotary position embedding (RoPE) parameters
def precompute_rope_params(head_dim, theta_base=10_000, context_length=4096, freq_config=None):
    assert head_dim % 2 == 0, "Embedding dimension must be even"

    # Compute the inverse frequencies
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

    ################################ NEW ###############################################
    # Frequency adjustments
    if freq_config is not None:
        # Wavelength thresholds for the low- and high-frequency regions
        low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]
        high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]

        # Wavelength of each frequency
        wavelen = 2 * torch.pi / inv_freq

        # Scale down the frequencies whose wavelength exceeds the low-frequency threshold
        inv_freq_llama = torch.where(
            wavelen > low_freq_wavelen, inv_freq / freq_config["factor"], inv_freq
        )

        # Smoothing factor for the medium-frequency region
        smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (
            freq_config["high_freq_factor"] - freq_config["low_freq_factor"]
        )

        # Interpolate between the scaled and the unscaled frequencies
        smoothed_inv_freq = (
            (1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq
        )

        # Apply the smoothed frequencies in the medium-frequency region
        is_medium_freq = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
        inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
        inv_freq = inv_freq_llama
    ####################################################################################

    # Generate position indices
    positions = torch.arange(context_length)

    # Compute the angles
    angles = positions[:, None] * inv_freq[None, :]  # Shape: (context_length, head_dim // 2)

    # Expand angles to match the head_dim
    angles = torch.cat([angles, angles], dim=1)  # Shape: (context_length, head_dim)

    # Precompute sine and cosine
    cos = torch.cos(angles)
    sin = torch.sin(angles)

    return cos, sin
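
- To make the three frequency regions of the `freq_config` branch easier to follow, here is a scalar sketch of the same logic in plain Python; the helper name `scale_inv_freq` and the numbers in `example_cfg` are assumptions chosen for illustration and are not settings taken from this section

```python
import math

def scale_inv_freq(inv_freq, cfg):
    # Scalar counterpart of the freq_config branch (hypothetical helper name)
    low_freq_wavelen = cfg["original_context_length"] / cfg["low_freq_factor"]
    high_freq_wavelen = cfg["original_context_length"] / cfg["high_freq_factor"]
    wavelen = 2 * math.pi / inv_freq

    if wavelen > low_freq_wavelen:    # low-frequency region: scale down by `factor`
        return inv_freq / cfg["factor"]
    if wavelen < high_freq_wavelen:   # high-frequency region: leave unchanged
        return inv_freq

    # Medium-frequency region: interpolate between the scaled and unscaled value
    smooth = (cfg["original_context_length"] / wavelen - cfg["low_freq_factor"]) / (
        cfg["high_freq_factor"] - cfg["low_freq_factor"]
    )
    return (1 - smooth) * inv_freq / cfg["factor"] + smooth * inv_freq

example_cfg = {  # illustrative values only
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_context_length": 8192,
}

high = scale_inv_freq(1.0, example_cfg)                    # wavelength ~6.3
low = scale_inv_freq(1e-4, example_cfg)                    # wavelength ~62,832
medium = scale_inv_freq(2 * math.pi / 4096, example_cfg)   # wavelength ~4,096
print(high, low, medium)
```

- With these illustrative values, `high` is returned unchanged, `low` is divided by `factor`, and `medium` ends up between its scaled and unscaled value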

- To summarize, what's new so far for Llama 3 compared to Llama 2 are the context length and the theta base parameter:

In [6]:
# Instantiate RoPE parameters

llama_2_context_len = 4096
llama_3_context_len = 8192

llama_2_theta_base = 10_000
llama_3_theta_base = 500_000

- The usage remains the same as before in Llama 2:

In [7]:
# Settings
batch_size = 2
num_heads = 4
head_dim = 16

# Instantiate RoPE parameters
cos, sin = precompute_rope_params(
    head_dim=head_dim,
    theta_base=llama_3_theta_base,
    context_length=llama_3_context_len
)

# Dummy query and key tensors
torch.manual_seed(123)
queries = torch.randn(batch_size, num_heads, llama_3_context_len, head_dim)
keys = torch.randn(batch_size, num_heads, llama_3_context_len, head_dim)

# Apply rotary position embeddings
queries_rot = compute_rope(queries, cos, sin)
keys_rot = compute_rope(keys, cos, sin)

&nbsp;
## 1.3 Grouped-query attention

- In this section, we replace multi-head attention (MHA) with an alternative mechanism called grouped-query attention (GQA)

- In short, one can think of GQA as a more compute- and parameter-efficient version of MHA

- In GQA, we reduce the number of key and value projections by sharing them among multiple attention heads

- Each attention head still has its own query, but the heads that belong to the same group attend to the same keys and values

- Below is an illustration of GQA with 2 key-value-groups (kv-groups):

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/grouped-query-attention.webp" width="500px">


- The main idea behind GQA is to reduce the number of unique key-value groups that the query heads attend to, reducing the size of some of the matrix multiplications and the number of parameters in MHA without significantly reducing modeling performance

- The GQA code is very similar to MHA (I highlighted the changes below via the "NEW" sections)

- In short, the main change in GQA is that each key-value group needs to be repeated to match the number of heads it is associated with, as implemented below

- In addition, we also introduce a `SharedBuffers` class that will allow us to reuse the `mask`, `cos`, and `sin` tensors in the transformer blocks to improve efficiency (this will be crucial when working with models such as Llama 3.1 and 3.2 later, which support up to 131k input tokens)

In [8]:
import torch.nn as nn


############################# NEW  #############################
class SharedBuffers:
    _buffers = {}  # Class-level cache shared by all attention modules

    @staticmethod
    def get_buffers(context_length, head_dim, rope_base, freq_config, dtype=torch.float32):
        key = (context_length, head_dim, rope_base, tuple(freq_config.values()) if freq_config else freq_config, dtype)

        if key not in SharedBuffers._buffers:
            # Create the buffers only once per unique key
            mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
            cos, sin = precompute_rope_params(head_dim, rope_base, context_length, freq_config)
            if dtype is not None:
                cos = cos.to(dtype)
                sin = sin.to(dtype)
            SharedBuffers._buffers[key] = (mask, cos, sin)

        return SharedBuffers._buffers[key]
############################# NEW  #############################


class GroupedQueryAttention(nn.Module):
    def __init__(
            self, d_in, d_out, context_length, num_heads,
            num_kv_groups,       # NEW: number of key-value groups
            rope_base=10_000,    # NEW: RoPE base
            rope_config=None,    # NEW: RoPE frequency config
            dtype=None
        ):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        ############################# NEW  #############################
        # self.W_key = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        # self.W_value = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups  # Number of query heads per kv-group
        ################################################################

        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)

        ############################# NEW  #############################
        # Fetch buffers using SharedBuffers
        mask, cos, sin = SharedBuffers.get_buffers(context_length, self.head_dim, rope_base, rope_config, dtype)
        ############################# NEW  #############################

        self.register_buffer("mask", mask)
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        queries = self.W_query(x)  # Shape: (b, num_tokens, d_out)
        keys = self.W_key(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)
        values = self.W_value(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)

        # Reshape queries, keys, and values
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        ##################### NEW  #####################
        # keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        # values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim)
        values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim)
        ################################################

        # Transpose keys, values, and queries
        keys = keys.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        values = values.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        queries = queries.transpose(1, 2)  # Shape: (b, num_heads, num_tokens, head_dim)

        # Apply RoPE
        keys = compute_rope(keys, self.cos, self.sin)
        queries = compute_rope(queries, self.cos, self.sin)

        ##################### NEW  #####################
        # Expand keys and values to match the number of heads
        # Shape: (b, num_heads, num_tokens, head_dim)

        keys = keys.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)
        values = values.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)
        # For example, before repeat_interleave along dim=1 (kv-groups):
        #   [K1, K2]
        # After repeat_interleave (each kv-group is repeated group_size times):
        #   [K1, K1, K2, K2]
        # If we used regular repeat instead of repeat_interleave, we'd get:
        #   [K1, K2, K1, K2]
        ################################################

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        # Shape: (b, num_heads, num_tokens, num_tokens)
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        assert keys.shape[-1] == self.head_dim

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)  # Optional projection

        return context_vec
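
- The key-value expansion step in the forward pass can also be checked in isolation; the sketch below uses tiny, made-up tensor sizes and fills each kv-group with its own constant so that the ordering produced by `repeat_interleave` and by the regular `repeat` can be compared directly

```python
import torch

b, num_kv_groups, num_tokens, head_dim = 1, 2, 3, 4  # tiny illustrative sizes
group_size = 2  # num_heads // num_kv_groups, so num_heads = 4 here

# Fill each kv-group with its own constant (1.0 and 2.0) to tell the groups apart
keys = torch.stack(
    [torch.full((num_tokens, head_dim), float(g + 1)) for g in range(num_kv_groups)]
).unsqueeze(0)  # Shape: (b, num_kv_groups, num_tokens, head_dim)

interleaved = keys.repeat_interleave(group_size, dim=1)  # [K1, K1, K2, K2]
tiled = keys.repeat(1, group_size, 1, 1)                 # [K1, K2, K1, K2]

print(interleaved.shape)
print(interleaved[0, :, 0, 0].tolist())
print(tiled[0, :, 0, 0].tolist())
```

- Both results have `num_heads` entries along dimension 1, but only the interleaved layout keeps neighboring query heads paired with the same kv-group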

- To illustrate the parameter savings, consider the following multi-head attention example from the GPT and Llama 2 code:

In [9]:
# Settings
batch_size = 1
context_len = 3000
max_context_len = 8192
embed_dim = 4096
num_heads = 32

# Random input tensor that serves as an example batch
example_batch = torch.randn((batch_size, context_len, embed_dim))

mha = MultiHeadAttention(
    d_in=embed_dim,
    d_out=embed_dim,
    context_length=max_context_len,
    num_heads=num_heads
)

# Forward pass through the multi-head attention module
mha(example_batch)

# Print the shapes of the weight matrices
print("W_key:", mha.W_key.weight.shape)
print("W_value:", mha.W_value.weight.shape)
print("W_query:", mha.W_query.weight.shape)

W_key: torch.Size([4096, 4096])
W_value: torch.Size([4096, 4096])
W_query: torch.Size([4096, 4096])


- Now, if we use grouped-query attention instead, with 8 kv-groups (that's how many Llama 3 8B uses), we can see that the number of rows of the key and value matrices is reduced by a factor of 4 (because 32 attention heads divided by 8 kv-groups is 4)

In [10]:
gqa = GroupedQueryAttention(
    d_in=embed_dim,
    d_out=embed_dim,
    context_length=max_context_len,
    num_heads=num_heads,
    num_kv_groups=8,
    rope_base=llama_3_theta_base
)

# Forward pass through the grouped-query attention module
gqa(example_batch)

# Print the shapes of the weight matrices
print("W_key:", gqa.W_key.weight.shape)
print("W_value:", gqa.W_value.weight.shape)
print("W_query:", gqa.W_query.weight.shape)

W_key: torch.Size([1024, 4096])
W_value: torch.Size([1024, 4096])
W_query: torch.Size([4096, 4096])


- As a side note, to make the GroupedQueryAttention equivalent to standard multi-head attention, you can set the number of key-value groups (`num_kv_groups`) equal to the number of heads (`num_heads`)
- Lastly, let's compare the number of parameters below:

In [11]:
print("Total number of parameters:")

# Parameter count of the multi-head attention module
mha_total_params = sum(p.numel() for p in mha.parameters())
print(f"MHA: {mha_total_params:,}")

# Parameter count of the grouped-query attention module
gqa_total_params = sum(p.numel() for p in gqa.parameters())
print(f"GQA: {gqa_total_params:,}")

Total number of parameters:
MHA: 67,108,864
GQA: 41,943,040
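
- The two totals can be reproduced with simple arithmetic; the sketch below assumes that both attention modules consist only of the bias-free query, key, value, and output projection matrices, which is consistent with the weight shapes printed earlier

```python
emb_dim, num_heads, num_kv_groups = 4096, 32, 8
head_dim = emb_dim // num_heads        # 128
kv_dim = num_kv_groups * head_dim      # 1024

# MHA: W_query, W_key, W_value, and out_proj are all (emb_dim x emb_dim)
mha_params = 4 * emb_dim * emb_dim

# GQA: W_query and out_proj stay (emb_dim x emb_dim);
# W_key and W_value shrink to (kv_dim x emb_dim)
gqa_params = 2 * emb_dim * emb_dim + 2 * kv_dim * emb_dim

print(f"MHA: {mha_params:,}")
print(f"GQA: {gqa_params:,}")
```

- In other words, the savings come entirely from the key and value projections, which are each 4 times smaller in this configuration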


In [12]:
# Free up memory:
del mha
del gqa

&nbsp;
## 1.4 Update the TransformerBlock module

- Next, we update the `TransformerBlock`
- Here, we simply swap `MultiHeadAttention` with `GroupedQueryAttention` and add the new RoPE settings

In [13]:
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Grouped-query attention replaces the previous multi-head attention
        self.att =  GroupedQueryAttention(  # MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            num_kv_groups=cfg["n_kv_groups"],  # NEW
            rope_base=cfg["rope_base"],        # NEW
            rope_config=cfg["rope_freq"],      # NEW
            dtype=cfg["dtype"]
        )
        self.ff = FeedForward(cfg)
        self.norm1 = RMSNorm(cfg["emb_dim"], eps=1e-5)
        self.norm2 = RMSNorm(cfg["emb_dim"], eps=1e-5)

    def forward(self, x):
        # Shortcut connection for the attention block
        shortcut = x
        x = self.norm1(x)
        # Cast to bfloat16 before applying attention
        x = self.att(x.to(torch.bfloat16))   # Shape [batch_size, num_tokens, emb_size]
        x = x + shortcut  # Add the original input back

        # Shortcut connection for the feed-forward block
        shortcut = x
        x = self.norm2(x)
        # Cast to bfloat16 before applying the feed-forward network
        x = self.ff(x.to(torch.bfloat16))
        x = x + shortcut  # Add the original input back

        return x

&nbsp;
## 1.5 Defining the model class

- When setting up the model class, we fortunately don't have to do much; we just update the name to `Llama3Model`

In [14]:
# class Llama2Model(nn.Module):
class Llama3Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Token embedding layer: maps token IDs to embedding vectors
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])

        # Stack of n_layers transformer blocks
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        # Final RMSNorm layer
        self.final_norm = RMSNorm(cfg["emb_dim"], eps=1e-5)
        # Output head: maps the embedding dimension back to the vocabulary size
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])

    def forward(self, in_idx):
        tok_embeds = self.tok_emb(in_idx)
        x = tok_embeds
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        # Cast to bfloat16 before computing the logits
        logits = self.out_head(x.to(torch.bfloat16))
        return logits

&nbsp;
# 2. Initialize model

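- Now we can define a Llama 3 config file (the Llama 2 config file is shown for comparison)

In [15]:
LLAMA2_CONFIG_7B = {
    "vocab_size": 32_000,    # Vocabulary size
    "context_length": 4096,  # Context length
    "emb_dim": 4096,         # Embedding dimension
    "n_heads": 32,           # Number of attention heads
    "n_layers": 32,          # Number of layers
    "hidden_dim": 11_008,    # Size of the intermediate dimension in FeedForward
    "dtype": torch.bfloat16  # Lower-precision dtype to reduce memory usage
}

In [16]:
LLAMA3_CONFIG_8B = {
    "vocab_size": 128_256,   # NEW: Larger vocabulary size (128k)
    "context_length": 8192,  # NEW: Larger context length (8k)
    "emb_dim": 4096,         # Embedding dimension
    "n_heads": 32,           # Number of attention heads
    "n_layers": 32,          # Number of layers
    "hidden_dim": 14_336,    # NEW: Larger size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,        # NEW: Key-value groups for grouped-query attention
    "rope_base": 500_000.0,  # NEW: The base in RoPE's "theta" was increased to 500k
    "rope_freq": None,       # NEW: Additional configuration for adjusting the RoPE frequencies
    "dtype": torch.bfloat16  # Lower-precision dtype to reduce memory usage
}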

- Using these settings, we can now initialize a Llama 3 8B model
- Note that this requires ~34 GB of memory (for comparison, Llama 2 7B required ~26 GB of memory)

In [17]:
# Initialize the model with the Llama 3 config
model = Llama3Model(LLAMA3_CONFIG_8B)

- Each of the following lines is expected to print `True`, confirming that the buffers are reused instead of being (wastefully) recreated:

In [None]:
# Check that the buffers are shared across transformer blocks by comparing
# the mask, cos, and sin tensors of the first and the last block
print(model.trf_blocks[0].att.mask is model.trf_blocks[-1].att.mask)
print(model.trf_blocks[0].att.cos is model.trf_blocks[-1].att.cos)
print(model.trf_blocks[0].att.sin is model.trf_blocks[-1].att.sin)
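
- The identity checks rely on a class-level cache that hands out the same object whenever the same settings are requested; a minimal plain-Python sketch of this pattern (with a hypothetical `BufferCache` class and a nested list standing in for the tensors) looks as follows

```python
class BufferCache:
    # Hypothetical stand-in for the SharedBuffers pattern: one object per unique key
    _buffers = {}

    @staticmethod
    def get(context_length, head_dim):
        key = (context_length, head_dim)
        if key not in BufferCache._buffers:
            # Stand-in for the expensive mask/cos/sin computation
            BufferCache._buffers[key] = [[0.0] * head_dim for _ in range(context_length)]
        return BufferCache._buffers[key]

first = BufferCache.get(8, 4)
second = BufferCache.get(8, 4)   # same key: the cached object is returned
other = BufferCache.get(16, 4)   # different key: a new object is created

print(first is second)
print(first is other)
```

- Because the second call with the same key returns the cached object, `first is second` holds, whereas a different key yields a separate object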

- Let's now also compute the number of trainable parameters:

In [18]:
# Sum the number of elements over all model parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

Total number of parameters: 8,030,261,248
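
- The printed total can be reproduced from the config values alone; the sketch below assumes that `FeedForward` consists of three bias-free linear layers and that each `RMSNorm` holds a single weight vector of size `emb_dim` (both are defined in the Llama 2 notebook, not in this one)

```python
cfg = {
    "vocab_size": 128_256, "emb_dim": 4096, "n_heads": 32,
    "n_layers": 32, "hidden_dim": 14_336, "n_kv_groups": 8,
}

head_dim = cfg["emb_dim"] // cfg["n_heads"]
kv_dim = cfg["n_kv_groups"] * head_dim

embedding = cfg["vocab_size"] * cfg["emb_dim"]          # tok_emb
out_head = cfg["emb_dim"] * cfg["vocab_size"]           # out_head (no bias)
attention = 2 * cfg["emb_dim"] ** 2 + 2 * kv_dim * cfg["emb_dim"]
feed_forward = 3 * cfg["emb_dim"] * cfg["hidden_dim"]   # assumption: three bias-free linear layers
norms = 2 * cfg["emb_dim"]                              # assumption: one weight vector per RMSNorm

per_block = attention + feed_forward + norms
# The extra emb_dim term accounts for final_norm
total = embedding + cfg["n_layers"] * per_block + cfg["emb_dim"] + out_head

print(f"Parameters per transformer block: {per_block:,}")
print(f"Total number of parameters: {total:,}")
```

- Under these assumptions, the arithmetic matches the total reported by `model.parameters()`; note that the token embedding and the output head together account for roughly 1 billion of the parameters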


- As shown above, the model contains 8 billion parameters
- Additionally, we can calculate the memory requirements for this model using the code below:

In [19]:
def model_memory_size(model, input_dtype=torch.float32):
    total_params = 0
    total_grads = 0
    for param in model.parameters():
        # Calculate the total number of elements per parameter
        param_size = param.numel()
        total_params += param_size
        # Check if gradients are stored for this parameter
        if param.requires_grad:
            total_grads += param_size

    # Calculate the buffer size (non-parameters that require memory)
    total_buffers = sum(buf.numel() for buf in model.buffers())

    # Size in bytes = (number of elements) * (size of each element in bytes)
    # We assume parameters and gradients are stored in the same type as input dtype
    element_size = torch.tensor(0, dtype=input_dtype).element_size()
    total_memory_bytes = (total_params + total_grads + total_buffers) * element_size

    # Convert bytes to gigabytes
    total_memory_gb = total_memory_bytes / (1024**3)

    return total_memory_gb

print(f"float32 (PyTorch default): {model_memory_size(model, input_dtype=torch.float32):.2f} GB")
print(f"bfloat16: {model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB")

float32 (PyTorch default): 68.08 GB
bfloat16: 34.04 GB


- Lastly, we can also transfer the model to an NVIDIA or Apple Silicon GPU if applicable:

In [20]:
# Prefer an NVIDIA GPU, then an Apple Silicon GPU, and fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Move the model to the selected device
model.to(device);

&nbsp;
# 3. Load tokenizer

- In this section, we are going to load the tokenizer for the model

- Llama 2 used Google's [SentencePiece](https://github.com/google/sentencepiece) tokenizer instead of OpenAI's BPE tokenizer based on the [Tiktoken](https://github.com/openai/tiktoken) library

- Llama 3, however, switched back to the BPE tokenizer from Tiktoken; specifically, it uses the GPT-4 tokenizer with an extended vocabulary

- You can find the original Tiktoken adaptation by Meta AI [here](https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py) in their official Llama 3 repository

- Below, I rewrote the tokenizer code to make it more readable and minimal for this notebook (but the behavior should be similar)

In [21]:
import os
from pathlib import Path

import tiktoken
from tiktoken.load import load_tiktoken_bpe


class Tokenizer:
    def __init__(self, model_path):
        # Check that the tokenizer model file exists
        assert os.path.isfile(model_path), f"Model file {model_path} not found"
        # Load the tiktoken BPE vocabulary
        mergeable_ranks = load_tiktoken_bpe(model_path)

        # Special tokens and their IDs
        self.special_tokens = {
            "<|begin_of_text|>": 128000,
            "<|end_of_text|>": 128001,
            "<|start_header_id|>": 128006,
            "<|end_header_id|>": 128007,
            "<|eot_id|>": 128009,
        }
        # Add reserved special tokens for the IDs that are not already taken
        self.special_tokens.update({
            f"<|reserved_{i}|>": 128002 + i for i in range(256) if (128002 + i) not in self.special_tokens.values()
        })

        # Initialize the tiktoken encoding
        self.model = tiktoken.Encoding(
            name=Path(model_path).name,
            pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
            mergeable_ranks=mergeable_ranks,
            special_tokens=self.special_tokens
        )

    def encode(self, text, bos=False, eos=False, allowed_special=set(), disallowed_special=()):
        # Optionally prepend the beginning-of-text token
        if bos:
            tokens = [self.special_tokens["<|begin_of_text|>"]]
        else:
            tokens = []

        # Encode the text
        tokens += self.model.encode(text, allowed_special=allowed_special, disallowed_special=disallowed_special)

        # Optionally append the end-of-text token
        if eos:
            tokens.append(self.special_tokens["<|end_of_text|>"])
        return tokens

    def decode(self, tokens):
        # Decode token IDs back into text
        return self.model.decode(tokens)

- Meta AI shared the original Llama 3 model weights and tokenizer vocabulary on the Hugging Face Hub
- Meta AI 在 Hugging Face Hub 上共享了原始的 Llama 3 模型权重和分词器词表
- We will first download the tokenizer vocabulary from the Hub and load it into the code above  
- 我们将首先从 Hub 下载分词器词表并将其加载到上面的代码中

- Please note that Meta AI requires that you accept the Llama 3 licensing terms before you can download the files; to do this, you have to create a Hugging Face Hub account and visit the [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) repository to accept the terms
- 请注意,Meta AI 要求您在下载文件之前接受 Llama 3 许可条款;为此,您需要创建一个 Hugging Face Hub 账户并访问 [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) 仓库来接受条款

- Next, you will need an access token; to generate one with READ permissions, click on the profile picture in the upper right and then on "Settings"

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/settings.webp?1" width="300px">

- Then, create and copy the access token so that the next code cell can use it (the cell reads the token from the `HF_ACCESS_TOKEN` entry of a local `config.json` file)

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/access-token.webp?1" width="600px">

In [22]:
from huggingface_hub import login
import json

# Read the Hugging Face access token from a local config file
with open("config.json", "r") as config_file:
    config = json.load(config_file)
    access_token = config["HF_ACCESS_TOKEN"]

# Log in to the Hugging Face Hub with the access token
login(token=access_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
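
- The login cell above expects a `config.json` file that contains an `HF_ACCESS_TOKEN` key; the sketch below shows that layout together with a hypothetical helper, `read_hf_token`, which falls back to an environment variable of the same name (the fallback is an assumption added for illustration and is not part of the notebook)

```python
import json
import os
import tempfile
from pathlib import Path


def read_hf_token(config_path="config.json", env_var="HF_ACCESS_TOKEN"):
    """Return the token from the config file, else from the environment, else None."""
    path = Path(config_path)
    if path.exists():
        with path.open("r", encoding="utf-8") as config_file:
            token = json.load(config_file).get("HF_ACCESS_TOKEN")
        if token:
            return token
    # Hypothetical fallback: an environment variable with the same name
    return os.environ.get(env_var)


# Demonstrate the expected file layout with a placeholder value (not a real token)
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / "config.json"
    example_path.write_text(
        json.dumps({"HF_ACCESS_TOKEN": "hf_placeholder"}), encoding="utf-8"
    )
    print(read_hf_token(example_path))
```

- Keeping the token in a file or environment variable, instead of pasting it into a cell, avoids committing it to version control by accident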


- After logging in with the access token, which is necessary to verify that we accepted the Llama 3 licensing terms, we can download the tokenizer vocabulary:

In [23]:
from huggingface_hub import hf_hub_download

# Download the tokenizer file from the Hugging Face Hub
# - repo_id: ID of the model repository
# - filename: file to download
# - local_dir: local directory to save the file in
tokenizer_file_path = hf_hub_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    filename="original/tokenizer.model",
    local_dir="Llama-3-8B"
)

- Note that working with the Llama 3 files may require the `blobfile` package, which is used for handling datasets or models stored in cloud storage solutions such as Google Cloud Storage (GCS), Azure Blob Storage, or Amazon S3
- You can install this dependency by uncommenting and executing the `pip` command below


In [24]:
# pip install blobfile

In [25]:
# Create the tokenizer from the downloaded tokenizer file
tokenizer = Tokenizer(tokenizer_file_path)

- We can now use the `generate` function to have the Llama 3 model generate new text:

In [26]:
from previous_chapters import generate, text_to_token_ids, token_ids_to_text


# Fix the random seed for reproducibility
torch.manual_seed(123)

# Generate text with the model
# - model: the model instantiated above
# - idx: the prompt "Every effort" converted to token IDs and moved to the device
# - max_new_tokens: generate at most 30 new tokens
# - context_size: the context length defined in LLAMA3_CONFIG_8B
# - top_k: 1 keeps only the most probable next token
# - temperature: 0 always picks the most probable token
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort", tokenizer).to(device),
    max_new_tokens=30,
    context_size=LLAMA3_CONFIG_8B["context_length"],
    top_k=1,
    temperature=0.
)

# Convert the generated token IDs back into text and print the result
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort_dead aeros Ingredients başında.extensionégor clangmissions güc như submodule.and report官方%，.Reader(",");
ामल ندار Parliamentary !!! HigginsDynamicZhgmt writeln Globalsletion 사진------


- As we can see above, the text is nonsensical because we haven't trained the Llama 3 model yet
- In the next section, instead of training it ourselves, which would cost tens to hundreds of thousands of dollars, we load the pretrained weights from Meta AI

&nbsp;
## 4. Load pretrained weights

- We are loading the ["meta-llama/Meta-Llama-3-8B"](https://huggingface.co/meta-llama/Meta-Llama-3-8B) base model below, which is a plain text-completion model that has not been finetuned
- Alternatively, you can load the instruction-finetuned and aligned ["meta-llama/Meta-Llama-3-8B-Instruct"](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model by modifying the string in the next code cell accordingly
- Combined, the weight files are about 16 GB in size
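
- As a quick sanity check of the 16 GB figure, the sketch below multiplies the parameter count that this notebook reports for the 8B architecture by 2 bytes per parameter; it assumes that all weights are stored as bfloat16, and the shard sizes are the rounded values shown by the download progress bars

```python
# Back-of-the-envelope estimate of the checkpoint size.
# Assumption: every weight is stored as bfloat16 (2 bytes per parameter).
n_params = 8_030_261_248   # parameter count reported for the 8B architecture
bytes_per_param = 2        # bfloat16

size_gb = n_params * bytes_per_param / 1e9
print(f"Estimated size: {size_gb:.2f} GB")

# Rounded shard sizes (in GB) as shown by the download progress bars
shard_sizes_gb = [4.98, 5.00, 4.92, 1.17]
print(f"Sum of shards:  {sum(shard_sizes_gb):.2f} GB")
```

- Both numbers land at roughly 16 GB, which is also the minimum amount of memory needed just to hold the weights in bfloat16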

In [27]:
from safetensors.torch import load_file

# Collect the tensors from all shards in a single dictionary
combined_weights = {}

# The checkpoint is split across 4 safetensors files
for i in range(1, 5):
    weights_file = hf_hub_download(
        repo_id="meta-llama/Meta-Llama-3-8B",
        filename=f"model-0000{i}-of-00004.safetensors",
        local_dir="Llama-3-8B"
    )
    current_weights = load_file(weights_file)
    combined_weights.update(current_weights)

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

- The `combined_weights` dictionary contains the following tensors (only the first 15 are shown for simplicity):

In [28]:
# Show the first 15 keys of the combined weights dictionary
list(combined_weights.keys())[:15]

['model.embed_tokens.weight',
 'model.layers.0.input_layernorm.weight',
 'model.layers.0.mlp.down_proj.weight',
 'model.layers.0.mlp.gate_proj.weight',
 'model.layers.0.mlp.up_proj.weight',
 'model.layers.0.post_attention_layernorm.weight',
 'model.layers.0.self_attn.k_proj.weight',
 'model.layers.0.self_attn.o_proj.weight',
 'model.layers.0.self_attn.q_proj.weight',
 'model.layers.0.self_attn.v_proj.weight',
 'model.layers.1.input_layernorm.weight',
 'model.layers.1.mlp.down_proj.weight',
 'model.layers.1.mlp.gate_proj.weight',
 'model.layers.1.mlp.up_proj.weight',
 'model.layers.1.post_attention_layernorm.weight']

- The following function, modeled after the `load_weights_into_gpt` function in [chapter 5](../01_main-chapter-code/ch05.ipynb), loads the pretrained weights into our Llama 3 model:

In [29]:
def assign(left, right, tensor_name="unknown"):
    # Make sure the shapes of the model parameter and the loaded tensor match
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch in tensor '{tensor_name}'. Left: {left.shape}, Right: {right.shape}")

    # Copy tensors via clone().detach(); convert anything else to a tensor first
    if isinstance(right, torch.Tensor):
        return torch.nn.Parameter(right.clone().detach())
    else:
        return torch.nn.Parameter(torch.tensor(right))


def load_weights_into_llama(model, param_config, params):
    # Load the token embedding weights
    model.tok_emb.weight = assign(model.tok_emb.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")

    # Iterate over the transformer blocks
    for l in range(param_config["n_layers"]):

        # Load the attention weights
        model.trf_blocks[l].att.W_query.weight = assign(
            model.trf_blocks[l].att.W_query.weight,
            params[f"model.layers.{l}.self_attn.q_proj.weight"],
            f"model.layers.{l}.self_attn.q_proj.weight"
        )
        model.trf_blocks[l].att.W_key.weight = assign(
            model.trf_blocks[l].att.W_key.weight,
            params[f"model.layers.{l}.self_attn.k_proj.weight"],
            f"model.layers.{l}.self_attn.k_proj.weight"
        )
        model.trf_blocks[l].att.W_value.weight = assign(
            model.trf_blocks[l].att.W_value.weight,
            params[f"model.layers.{l}.self_attn.v_proj.weight"],
            f"model.layers.{l}.self_attn.v_proj.weight"
        )
        model.trf_blocks[l].att.out_proj.weight = assign(
            model.trf_blocks[l].att.out_proj.weight,
            params[f"model.layers.{l}.self_attn.o_proj.weight"],
            f"model.layers.{l}.self_attn.o_proj.weight"
        )
        model.trf_blocks[l].norm1.weight = assign(
            model.trf_blocks[l].norm1.weight,
            params[f"model.layers.{l}.input_layernorm.weight"],
            f"model.layers.{l}.input_layernorm.weight"
        )

        # Load the feed-forward weights
        model.trf_blocks[l].ff.fc1.weight = assign(
            model.trf_blocks[l].ff.fc1.weight,
            params[f"model.layers.{l}.mlp.gate_proj.weight"],
            f"model.layers.{l}.mlp.gate_proj.weight"
        )
        model.trf_blocks[l].ff.fc2.weight = assign(
            model.trf_blocks[l].ff.fc2.weight,
            params[f"model.layers.{l}.mlp.up_proj.weight"],
            f"model.layers.{l}.mlp.up_proj.weight"
        )
        model.trf_blocks[l].ff.fc3.weight = assign(
            model.trf_blocks[l].ff.fc3.weight,
            params[f"model.layers.{l}.mlp.down_proj.weight"],
            f"model.layers.{l}.mlp.down_proj.weight"
        )
        model.trf_blocks[l].norm2.weight = assign(
            model.trf_blocks[l].norm2.weight,
            params[f"model.layers.{l}.post_attention_layernorm.weight"],
            f"model.layers.{l}.post_attention_layernorm.weight"
        )

    # Load the final normalization layer weights
    model.final_norm.weight = assign(model.final_norm.weight, params["model.norm.weight"], "model.norm.weight")

    # Load the output layer weights; if the checkpoint has no lm_head,
    # reuse the token embedding weights instead (weight tying)
    if "lm_head.weight" in params.keys():
        model.out_head.weight = assign(model.out_head.weight, params["lm_head.weight"], "lm_head.weight")
    else:
        model.out_head.weight = assign(model.out_head.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")
        print("Model uses weight tying.")


# Load the weights into the model and move it to the target device
load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)
model.to(device);
del combined_weights  # free up memory

- Next, we are ready to use the model for text generation

In [30]:
# Fix the random seed for reproducibility
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort", tokenizer).to(device),
    max_new_tokens=25,  # generate at most 25 new tokens
    context_size=LLAMA3_CONFIG_8B["context_length"],
    top_k=1,  # keep only the most probable token
    temperature=0.  # 0 makes the output deterministic
)

# Convert the generated token IDs back into text and print the result
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort has been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any


&nbsp;
## 5. Using the instruction-finetuned model

- Above, we used the pretrained base model; if you want to use a model capable of following instructions, use the `"meta-llama/Meta-Llama-3-8B-Instruct"` model instead, as shown below

In [31]:
# Free up memory
import gc

del model

gc.collect()  # Run Python garbage collector

# Clear the GPU cache if a CUDA device is available
if torch.cuda.is_available():
    torch.cuda.empty_cache()

In [32]:
# Collect the tensors from all shards in a single dictionary
combined_weights = {}

# The checkpoint is split across 4 safetensors files
for i in range(1, 5):
    weights_file = hf_hub_download(
        repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
        filename=f"model-0000{i}-of-00004.safetensors",
        local_dir="Llama-3-8B-Instruct"
    )
    current_weights = load_file(weights_file)
    combined_weights.update(current_weights)


# Instantiate the model, load the weights, and move it to the target device
model = Llama3Model(LLAMA3_CONFIG_8B)
load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)
model.to(device)
del combined_weights  # free up memory

model-00001-of-00004.safetensors:  36%|###6      | 1.81G/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

- Note that the Llama 3 model should ideally be used with the same prompt template that was used during finetuning (as discussed in chapter 7)
- Below is a wrapper class around the tokenizer, based on Meta AI's Llama 3-specific [ChatFormat code](https://github.com/meta-llama/llama3/blob/11817d47e1ba7a4959b025eb1ca308572e0e3963/llama/tokenizer.py#L202), that constructs the prompt template

In [33]:
class ChatFormat:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def encode_header(self, message):
        tokens = []
        # Start-of-header token
        tokens.append(self.tokenizer.special_tokens["<|start_header_id|>"])
        # Role name (e.g., user or assistant)
        tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))
        # End-of-header token
        tokens.append(self.tokenizer.special_tokens["<|end_header_id|>"])
        # Two newlines separate the header from the content
        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
        return tokens

    def encode(self, text):
        # Wrap the text in a message with the "user" role
        message = {
            "role": "user",
            "content": text
        }

        # Encode the header first
        tokens = self.encode_header(message)
        # Encode the content with leading and trailing whitespace removed
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
        )
        # End-of-turn token
        tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])
        return tokens

    def decode(self, token_ids):
        # Convert the token IDs back into text
        return self.tokenizer.decode(token_ids)


chat_tokenizer = ChatFormat(tokenizer)

- The usage is as follows:

In [34]:
# Encode the text into token IDs using the chat template
token_ids = chat_tokenizer.encode("Hello World!")
print(token_ids)

[128006, 882, 128007, 271, 9906, 4435, 0, 128009]


In [35]:
# Decode the token IDs back into text
tokenizer.decode(token_ids)

'<|start_header_id|>user<|end_header_id|>\n\nHello World!<|eot_id|>'
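
- To make the template easier to see, the sketch below rebuilds the same single-turn user prompt at the string level, without a tokenizer; `format_user_prompt` is a hypothetical helper written for this illustration and mirrors only what `ChatFormat.encode` does above

```python
def format_user_prompt(text):
    """Return the single-turn user prompt as plain text (illustration only)."""
    header = "<|start_header_id|>user<|end_header_id|>\n\n"  # role header + two newlines
    content = text.strip()                                    # content without surrounding whitespace
    end_of_turn = "<|eot_id|>"                                # end-of-turn marker
    return header + content + end_of_turn


print(repr(format_user_prompt("Hello World!")))
```

- The printed string matches the decoded token IDs shown above; note that the template ends after the user turn, so the model itself generates the assistant header before its answer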

- Let's now see the Llama 3 instruction model in action:

In [36]:
# Fix the random seed for reproducibility
torch.manual_seed(123)

# Generate a response with the instruction-finetuned model
# - idx: the prompt wrapped in the chat template, converted to token IDs, and moved to the device
# - max_new_tokens: generate at most 150 new tokens
# - top_k=1, temperature=0: always pick the most probable token for deterministic output
token_ids = generate(
    model=model,
    idx=text_to_token_ids("What do llamas eat?", chat_tokenizer).to(device),
    max_new_tokens=150,
    context_size=LLAMA3_CONFIG_8B["context_length"],
    top_k=1,
    temperature=0.
)

# Convert the generated token IDs back into text
output_text = token_ids_to_text(token_ids, tokenizer)


def clean_text(text, header_end="assistant<|end_header_id|>\n\n"):
    """
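    Strip the prompt and the assistant header from the generated text.

    Args:
        text: The raw decoded text.
        header_end: The string that marks the end of the assistant header.

    Returns:
        The text after the header with surrounding whitespace removed,
        or the unchanged text if the header is not found.
    """
    # Locate the end of the assistant header
    index = text.find(header_end)

    if index != -1:
        # Return everything after the header, stripped of surrounding whitespace
        return text[index + len(header_end):].strip()
    else:
        # No header found: return the text unchanged
        return text

print("Output text:\n", clean_text(output_text))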

Output text:
 Llamas are herbivores, which means they primarily eat plants and plant-based foods. Here are some of the things llamas like to eat:

1. Grass: Llamas love to graze on grass, especially in the spring and summer months.
2. Hay: Hay is a staple in a llama's diet. They like to eat timothy hay, alfalfa hay, and other types of hay.
3. Grains: Llamas may also be fed grains like oats, barley, and corn. However, grains should not make up more than 10-15% of a llama's diet.
4. Fruits and vegetables: Llamas may enjoy fruits and vegetables as treats, such as


&nbsp;
# Llama 3.1 8B

- A few months after the initial Llama 3 release, Meta AI followed up with their Llama 3.1 suite of models (see the official [Introducing Llama 3.1: Our most capable models to date](https://ai.meta.com/blog/meta-llama-3-1/) announcement blog post for details)

- Conveniently, we can reuse our previous Llama 3 code from above to implement Llama 3.1 8B

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/llama3-to-llama31.webp" width="700px">

- The architecture is identical; the only change is a rescaling of the RoPE frequencies, as indicated in the configuration below


In [37]:
LLAMA3_CONFIG_8B = {
    "vocab_size": 128_256,   # Vocabulary size
    "context_length": 8192,  # Context length
    "emb_dim": 4096,         # Embedding dimension
    "n_heads": 32,           # Number of attention heads
    "n_layers": 32,          # Number of layers
    "hidden_dim": 14_336,    # Size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,        # Key-value groups for grouped-query attention
    "rope_base": 500_000.0,  # The base in RoPE's "theta"
    "rope_freq": None,       # Additional configuration for adjusting the RoPE frequencies (unused here)
    "dtype": torch.bfloat16  # Lower-precision dtype to reduce memory usage
}

LLAMA31_CONFIG_8B = {
    "vocab_size": 128_256,      # Vocabulary size
    "context_length": 131_072,  # NEW: Larger supported context length
    "emb_dim": 4096,            # Embedding dimension
    "n_heads": 32,              # Number of attention heads
    "n_layers": 32,             # Number of layers
    "hidden_dim": 14_336,       # Size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,           # Key-value groups for grouped-query attention
    "rope_base": 500_000.0,     # The base in RoPE's "theta"
    "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
    "rope_freq": {              # NEW: RoPE frequency scaling
        "factor": 8.0,                    # Overall scaling factor
        "low_freq_factor": 1.0,           # Low-frequency factor
        "high_freq_factor": 4.0,          # High-frequency factor
        "original_context_length": 8192,  # Original context length
    }
}

- Reduce the context length so that the model runs on a MacBook Air (if you have more RAM, feel free to comment out the lines below):

In [10]:
# Remember the original context length
old_context_length = LLAMA31_CONFIG_8B["context_length"]
# Use a smaller context length of 8192 to reduce memory usage
LLAMA31_CONFIG_8B["context_length"] = 8192


def rescale_theta(theta_old, context_length_old, context_length_new):
    # Scale theta linearly with the ratio of new to old context length
    scaling_factor = context_length_new / context_length_old
    theta_new = theta_old * scaling_factor
    return theta_new

# Rescale rope_base to match the reduced context length
LLAMA31_CONFIG_8B["rope_base"] = rescale_theta(
    LLAMA31_CONFIG_8B["rope_base"],
    old_context_length,
    LLAMA31_CONFIG_8B["context_length"]
)

print("New RoPE theta:", LLAMA31_CONFIG_8B["rope_base"])

New RoPE theta: 31250.0
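
- Written out for the values used in the cell above, the linear rescaling in `rescale_theta` is (the symbols $\theta$ and $L$ for the RoPE base and the context length are notation introduced here):

$$\theta_{\text{new}} = \theta_{\text{old}} \cdot \frac{L_{\text{new}}}{L_{\text{old}}} = 500{,}000 \cdot \frac{8{,}192}{131{,}072} = \frac{500{,}000}{16} = 31{,}250$$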


- As we've seen in the code earlier, the RoPE method uses sinusoidal functions (sine and cosine) to embed positional information directly into the attention mechanism
- In Llama 3.1, the additional `rope_freq` configuration introduces further adjustments to the inverse frequency calculations
- These adjustments influence how different frequency components contribute to the positional embeddings (a detailed explanation is a topic for another time)
- Let's try out the Llama 3.1 model in practice; first, we clear out the old model to free up some GPU memory

In [38]:
# Free up memory
del model

gc.collect()  # Run Python garbage collector

if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Clear the CUDA cache
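
- As a rough illustration of the RoPE frequency adjustment mentioned earlier in this section, the pure-Python sketch below applies the `rope_freq` settings to the inverse frequencies: long-wavelength (low-frequency) components are divided by `factor`, short-wavelength components stay unchanged, and the band in between is interpolated; the function name `scale_inv_freqs` and the head dimension of 128 (`emb_dim / n_heads = 4096 / 32`) are choices made for this sketch, which follows the Llama 3.1 scaling rule as commonly implemented and is not a drop-in replacement for the tensor-based code in this notebook

```python
import math


def scale_inv_freqs(head_dim, theta_base, factor, low_freq_factor,
                    high_freq_factor, original_context_length):
    """Return (base, scaled) lists of RoPE inverse frequencies (illustration only)."""
    # Standard RoPE inverse frequencies: one per pair of head dimensions
    base = [1.0 / theta_base ** (i / head_dim) for i in range(0, head_dim, 2)]

    # Wavelength thresholds that separate the three frequency bands
    low_freq_wavelen = original_context_length / low_freq_factor
    high_freq_wavelen = original_context_length / high_freq_factor

    scaled = []
    for inv_freq in base:
        wavelen = 2 * math.pi / inv_freq
        if wavelen > low_freq_wavelen:
            # Low-frequency band: stretch by the full factor
            scaled.append(inv_freq / factor)
        elif wavelen < high_freq_wavelen:
            # High-frequency band: leave unchanged
            scaled.append(inv_freq)
        else:
            # Medium band: blend between the scaled and unscaled value
            smooth = (original_context_length / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            scaled.append((1 - smooth) * inv_freq / factor + smooth * inv_freq)
    return base, scaled


base, scaled = scale_inv_freqs(
    head_dim=128, theta_base=500_000.0, factor=8.0,
    low_freq_factor=1.0, high_freq_factor=4.0, original_context_length=8192,
)
unchanged = sum(b == s for b, s in zip(base, scaled))
print(f"{unchanged} of {len(base)} frequencies are left unchanged")
```

- The effect is that only the slowly rotating components, which encode long-range position, are stretched to cover the longer context, while the fast components that resolve nearby tokens keep their original resolution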

- Next, we download the tokenizer
- Note that since the Llama 3.1 family is distinct from the Llama 3 family, you have to go to the [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) repository and acknowledge the license terms for your Hugging Face access token to work for the download
- Tip: For simplicity, we only load the base model below, but there's also an instruction-finetuned version you can use by replacing `"meta-llama/Llama-3.1-8B"` with `"meta-llama/Llama-3.1-8B-Instruct"`

In [39]:
# Download the tokenizer file from the Hugging Face Hub
tokenizer_file_path = hf_hub_download(
    repo_id="meta-llama/Llama-3.1-8B",
    filename="original/tokenizer.model",
    local_dir="Llama-3.1-8B"
)

# Create the tokenizer from the downloaded file
tokenizer = Tokenizer(tokenizer_file_path)

In [40]:
# Instantiate the model with the Llama 3.1 configuration
model = Llama3Model(LLAMA31_CONFIG_8B)

# Count the parameters and print the total with thousands separators
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

Total number of parameters: 8,030,261,248


In [41]:
# Collect the tensors from all shards in a single dictionary
combined_weights = {}

# The checkpoint is split across 4 safetensors files
for i in range(1, 5):
    weights_file = hf_hub_download(
        repo_id="meta-llama/Llama-3.1-8B",
        filename=f"model-0000{i}-of-00004.safetensors",
        local_dir="Llama-3.1-8B"
    )
    current_weights = load_file(weights_file)
    combined_weights.update(current_weights)

# Load the weights into the model and move it to the target device
load_weights_into_llama(model, LLAMA31_CONFIG_8B, combined_weights)
model.to(device);
del combined_weights  # free up memory

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

In [42]:
# Fix the random seed for reproducibility
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort", tokenizer).to(device),
    max_new_tokens=25,  # generate at most 25 new tokens
    context_size=LLAMA31_CONFIG_8B["context_length"],
    top_k=1,  # keep only the most probable token
    temperature=0.  # 0 makes the output deterministic
)

# Convert the generated token IDs back into text and print the result
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort has been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any


&nbsp;
# Llama 3.2 1B

- As of this writing, Meta AI's latest models are the Llama 3.2 models announced [here](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/)
- The code for the Llama 3.2 text model is similar to that of Llama 3.1, except that the model has shrunk in size (there are 1B and 3B versions)
- The other efficiency tweak is that they added back weight tying (a concept that was originally used in the GPT-2 architecture); here, they reuse the same weight parameter values in the input (token) embedding layer and the output layer
- The small size of Llama 3.2 1B is quite convenient, since it can even run on many mobile devices
- The architectural differences between Llama 3.1 8B and Llama 3.2 1B are illustrated in the figure below

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/llama31-to-llama32.webp?1" width="700px">

- As we can see in the figure above, the main difference between the Llama 3.1 8B and Llama 3.2 1B architectures is their respective sizes
- A small additional change is an increased RoPE rescaling factor, which is reflected in the configuration below

In [43]:
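LLAMA31_CONFIG_8B = {
    "vocab_size": 128_256,      # Vocabulary size
    "context_length": 131_072,  # NEW: Larger supported context length
    "emb_dim": 4096,            # Embedding dimension
    "n_heads": 32,              # Number of attention heads
    "n_layers": 32,             # Number of layers
    "hidden_dim": 14_336,       # Size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,           # Key-value groups for grouped-query attention
    "rope_base": 500_000.0,     # The base in RoPE's "theta"
    "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
    "rope_freq": {              # NEW: RoPE frequency scaling
        "factor": 8.0,                    # Overall scaling factor
        "low_freq_factor": 1.0,           # Low-frequency factor
        "high_freq_factor": 4.0,          # High-frequency factor
        "original_context_length": 8192,  # Original context length
    }
}


LLAMA32_CONFIG_1B = {
    "vocab_size": 128_256,      # Vocabulary size
    "context_length": 131_072,  # Context length
    "emb_dim": 2048,            # NEW: Half the embedding dimension
    "n_heads": 32,              # Number of attention heads
    "n_layers": 16,             # NEW: Half the number of layers
    "hidden_dim": 8192,         # NEW: Almost half the size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,           # Key-value groups for grouped-query attention
    "rope_base": 500_000.0,     # The base in RoPE's "theta"
    "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
    "rope_freq": {              # RoPE frequency scaling
        "factor": 32.0,                   # NEW: Adjusted scaling factor
        "low_freq_factor": 1.0,           # Low-frequency factor
        "high_freq_factor": 4.0,          # High-frequency factor
        "original_context_length": 8192,  # Original context length
    }
}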

- Reduce the context length so that the model runs on a MacBook Air (if you have more RAM, feel free to comment out the lines below):

In [10]:
# Remember the original context length
old_context_length = LLAMA32_CONFIG_1B["context_length"]
# Use a smaller context length to reduce memory usage
LLAMA32_CONFIG_1B["context_length"] = 8192

# Rescale rope_base to match the reduced context length
LLAMA32_CONFIG_1B["rope_base"] = rescale_theta(
    LLAMA32_CONFIG_1B["rope_base"],
    old_context_length,
    LLAMA32_CONFIG_1B["context_length"]
)

print("New RoPE theta:", LLAMA32_CONFIG_1B["rope_base"])

New RoPE theta: 31250.0


- Below, we can reuse the code from the Llama 3.1 8B section to load the Llama 3.2 1B model
- Again, since the Llama 3.2 family is distinct from the Llama 3.1 family, you have to go to the [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) repository and acknowledge the license terms for your Hugging Face access token to work for the download
- Tip: For simplicity, we only load the base model below, but there's also an instruction-finetuned version you can use by replacing `"meta-llama/Llama-3.2-1B"` with `"meta-llama/Llama-3.2-1B-Instruct"`

In [44]:
# Free up memory
del model


gc.collect()  # Run Python garbage collector

if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Clear the CUDA cache

In [45]:
# Download the tokenizer file from the Hugging Face Hub
tokenizer_file_path = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B",
    filename="original/tokenizer.model",
    local_dir="Llama-3.2-1B"
)

# Create the tokenizer from the downloaded file
tokenizer = Tokenizer(tokenizer_file_path)

In [46]:
# Instantiate the model with the Llama 3.2 1B configuration
model = Llama3Model(LLAMA32_CONFIG_1B)

# Count all parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

# Account for weight tying: the token embedding and the output layer share
# their weights, so subtract the embedding parameters once
total_params_normalized = total_params - model.tok_emb.weight.numel()
print(f"\nTotal number of unique parameters: {total_params_normalized:,}")

Total number of parameters: 1,498,482,688

Total number of unique parameters: 1,235,814,400
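
- The two totals above can be reproduced from the configuration values alone; the sketch below does so with a hypothetical helper, `count_params`, assuming that all linear layers are bias-free, that the key and value projections map to `n_kv_groups * head_dim` outputs, and that each normalization layer holds a single weight vector (assumptions about the model code defined earlier in this notebook)

```python
def count_params(cfg):
    """Return (total, unique) parameter counts derived from a config dict."""
    emb, hidden = cfg["emb_dim"], cfg["hidden_dim"]
    head_dim = emb // cfg["n_heads"]
    kv_dim = cfg["n_kv_groups"] * head_dim

    embedding = cfg["vocab_size"] * emb            # token embedding matrix
    attention = 2 * emb * emb + 2 * emb * kv_dim   # query + output, key + value
    feed_forward = 3 * emb * hidden                # fc1, fc2, fc3
    norms = 2 * emb                                # norm1, norm2
    per_block = attention + feed_forward + norms

    # Embedding + all blocks + final norm + output head (same shape as the embedding)
    total = embedding + cfg["n_layers"] * per_block + emb + embedding
    unique = total - embedding                     # count tied weights only once
    return total, unique


llama32_1b = {"vocab_size": 128_256, "emb_dim": 2048, "n_heads": 32,
              "n_layers": 16, "hidden_dim": 8192, "n_kv_groups": 8}
total, unique = count_params(llama32_1b)
print(f"Total: {total:,}")
print(f"Unique: {unique:,}")
```

- With the 8B settings (`emb_dim=4096`, `n_layers=32`, `hidden_dim=14_336`), the same function yields the 8,030,261,248 parameters reported for Llama 3.1 8B; the two embedding-sized matrices account for about 35% of the 1B model's total, which is why weight tying pays off most for small models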


In [47]:
# Download the weights file from the Hugging Face Hub
weights_file = hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B",
    filename="model.safetensors",
    local_dir="Llama-3.2-1B"
)

# Load the weights into memory
current_weights = load_file(weights_file)

# Load the weights into the model and move it to the target device
load_weights_into_llama(model, LLAMA32_CONFIG_1B, current_weights)
model.to(device);

del current_weights  # free up memory

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Model uses weight tying.


In [48]:
# Check that the token embedding and the output layer hold identical weight values (weight tying)
print("Weight tying:", torch.equal(model.tok_emb.weight, model.out_head.weight))

Weight tying: True


In [49]:
# Fix the random seed for reproducibility
torch.manual_seed(123)

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort", tokenizer).to(device),
    max_new_tokens=25,  # generate at most 25 new tokens
    context_size=LLAMA32_CONFIG_1B["context_length"],
    top_k=1,  # keep only the most probable token
    temperature=0.  # 0 makes the output deterministic
)

# Convert the generated token IDs back into text and print the result
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort is made to ensure that the information on this website is accurate. However, we cannot guarantee that the information is accurate, complete


&nbsp;
# What's next?

- This notebook concludes the conversion from GPT to Llama 3.2
- If you are interested in a more compact, standalone notebook that contains only the Llama 3.2 code, check out the [standalone-llama32.ipynb](standalone-llama32.ipynb) notebook