<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Llama 3.2 From Scratch (A Standalone Notebook)

- This notebook is purposefully minimal and focuses on the code to implement the Llama 3.2 1B and 3B LLMs
- For a step-by-step guide that explains the individual components and the relationship between GPT, Llama 2, and Llama 3, please see the following companion notebooks:

  - [Converting a From-Scratch GPT Architecture to Llama 2](converting-gpt-to-llama2.ipynb)
  - [Converting Llama 2 to Llama 3.2 From Scratch](converting-llama2-to-llama3.ipynb)
  
<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/llama32.webp" width="700px">
  
- About the code:
  - all code is my own code, mapping the Llama 3 architecture onto the model code implemented in my [Build A Large Language Model (From Scratch)](http://mng.bz/orYv) book; the code is released under a permissive open-source Apache 2.0 license (see [LICENSE.txt](https://github.com/rasbt/LLMs-from-scratch/blob/main/LICENSE.txt))
  - the tokenizer code is inspired by the original [Llama 3 tokenizer code](https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py), which Meta AI used to extend the Tiktoken GPT-4 tokenizer
  - the RoPE rescaling section is inspired by the [_compute_llama3_parameters function](https://github.com/huggingface/transformers/blob/5c1027bf09717f664b579e01cbb8ec3ef5aeb140/src/transformers/modeling_rope_utils.py#L329-L347) in the `transformers` library

In [None]:
# pip install -r https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/refs/heads/main/ch05/07_gpt_to_llama/requirements-extra.txt

In [3]:
from importlib.metadata import version

pkgs = [
    "blobfile",         # to download pretrained weights
    "huggingface_hub",  # to download pretrained weights
    "tiktoken",         # to implement the tokenizer
    "torch",            # to implement the model
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

blobfile version: 3.0.0
huggingface_hub version: 0.25.2
tiktoken version: 0.8.0
torch version: 2.5.0


&nbsp;
# 1. Architecture code

In [4]:
# Import PyTorch
import torch
import torch.nn as nn


# Feed-forward module with a SiLU-gated linear unit
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Gate projection: embedding dimension -> hidden dimension
        self.fc1 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        # Up projection: embedding dimension -> hidden dimension
        self.fc2 = nn.Linear(cfg["emb_dim"], cfg["hidden_dim"], dtype=cfg["dtype"], bias=False)
        # Down projection: hidden dimension -> embedding dimension
        self.fc3 = nn.Linear(cfg["hidden_dim"], cfg["emb_dim"], dtype=cfg["dtype"], bias=False)

    def forward(self, x):
        # Gate branch
        x_fc1 = self.fc1(x)
        # Up branch
        x_fc2 = self.fc2(x)
        # Apply SiLU to the gate branch and multiply it elementwise with the up branch
        x = nn.functional.silu(x_fc1) * x_fc2
        # Project back to the embedding dimension
        return self.fc3(x)
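
The gating step in `FeedForward.forward` can be illustrated without PyTorch. The following is a minimal sketch in plain Python; the helper names `silu` and `gated_product` and the input values are made up for illustration and are not part of the notebook's code.

```python
import math


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def gated_product(gate, up):
    """Elementwise silu(gate) * up, mirroring silu(x_fc1) * x_fc2."""
    return [silu(g) * u for g, u in zip(gate, up)]


# Hypothetical outputs of fc1 (gate branch) and fc2 (up branch) for one token
gate = [-2.0, 0.0, 2.0]
up = [1.0, 5.0, 1.0]

hidden = gated_product(gate, up)
print(hidden)
```

A gate value of 0 suppresses the corresponding up-branch value entirely, while large positive gate values pass it through almost unchanged.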

In [5]:
# Precompute the rotary position embedding (RoPE) parameters
def precompute_rope_params(head_dim, theta_base=10_000, context_length=4096, freq_config=None):
    # The head dimension must be even
    assert head_dim % 2 == 0, "Embedding dimension must be even"

    # Compute the inverse frequencies
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2)[: (head_dim // 2)].float() / head_dim))

    # Optionally rescale the frequencies if a frequency config is provided
    if freq_config is not None:
        # Wavelength thresholds that delimit the low- and high-frequency bands
        low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]
        high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]

        # Wavelength of each frequency component
        wavelen = 2 * torch.pi / inv_freq

        # Low-frequency band (long wavelengths): divide by the scaling factor
        inv_freq_llama = torch.where(
            wavelen > low_freq_wavelen, inv_freq / freq_config["factor"], inv_freq
        )

        # Interpolation weight used in the medium-frequency band
        smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (
            freq_config["high_freq_factor"] - freq_config["low_freq_factor"]
        )

        # Interpolate between the scaled and the unscaled inverse frequencies
        smoothed_inv_freq = (
            (1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq
        )

        # Apply the interpolation only in the medium-frequency band
        is_medium_freq = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
        inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
        inv_freq = inv_freq_llama

    # Generate position indices
    positions = torch.arange(context_length)

    # Compute the angles
    angles = positions[:, None] * inv_freq[None, :]  # Shape: (context_length, head_dim // 2)

    # Duplicate the angles to match head_dim
    angles = torch.cat([angles, angles], dim=1)  # Shape: (context_length, head_dim)

    # Precompute sine and cosine
    cos = torch.cos(angles)
    sin = torch.sin(angles)

    return cos, sin


# Apply the rotary position embedding to a tensor
def compute_rope(x, cos, sin):
    # x: (batch_size, num_heads, seq_len, head_dim)
    batch_size, num_heads, seq_len, head_dim = x.shape
    # The head dimension must be even
    assert head_dim % 2 == 0, "Head dimension must be even"

    # Split x into its first and second halves
    x1 = x[..., : head_dim // 2]  # First half
    x2 = x[..., head_dim // 2 :]  # Second half

    # Truncate cos and sin to the sequence length and add batch and head dimensions
    cos = cos[:seq_len, :].unsqueeze(0).unsqueeze(0)  # Shape: (1, 1, seq_len, head_dim)
    sin = sin[:seq_len, :].unsqueeze(0).unsqueeze(0)

    # Apply the rotary transformation
    rotated = torch.cat((-x2, x1), dim=-1)
    x_rotated = (x * cos) + (rotated * sin)

    # Return the result in the same dtype as the input
    return x_rotated.to(dtype=x.dtype)
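
To make the three frequency bands in `precompute_rope_params` more concrete, here is a minimal sketch in plain Python that mirrors the same rescaling rule on a list of floats. The function names `rope_inv_freq` and `llama3_rescale` are hypothetical; the numeric settings are the `rope_base` and `rope_freq` values from the 1B configuration used later in this notebook, with a head dimension of 64 (2048 / 32).

```python
import math


def rope_inv_freq(head_dim, theta_base=10_000.0):
    """Inverse frequencies 1 / theta_base^(i / head_dim) for i = 0, 2, 4, ..."""
    return [1.0 / (theta_base ** (i / head_dim)) for i in range(0, head_dim, 2)]


def llama3_rescale(inv_freq, factor, low_freq_factor, high_freq_factor,
                   original_context_length):
    """Rescale inverse frequencies band by band (sketch of the rule above)."""
    low_freq_wavelen = original_context_length / low_freq_factor
    high_freq_wavelen = original_context_length / high_freq_factor
    rescaled = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            # High-frequency band: left unchanged
            rescaled.append(f)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: divided by the scaling factor
            rescaled.append(f / factor)
        else:
            # Medium band: interpolate between scaled and unscaled values
            smooth = (original_context_length / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            rescaled.append((1 - smooth) * (f / factor) + smooth * f)
    return rescaled


inv_freq = rope_inv_freq(head_dim=64, theta_base=500_000.0)
rescaled = llama3_rescale(inv_freq, factor=32.0, low_freq_factor=1.0,
                          high_freq_factor=4.0, original_context_length=8192)

unchanged = sum(r == f for r, f in zip(rescaled, inv_freq))
print(f"{unchanged} of {len(inv_freq)} frequency components are left unchanged")
```

Only the components whose wavelength exceeds the original context length are divided by the full factor; short-wavelength components, which encode local position, keep their original values.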

In [6]:
# Shared buffer cache so that identical masks and RoPE tables are created only once
class SharedBuffers:
    # Class-level dictionary that stores the buffers
    _buffers = {}

    @staticmethod
    def get_buffers(context_length, head_dim, rope_base, freq_config, dtype=torch.float32):
        # Build a hashable key from the input arguments
        key = (context_length, head_dim, rope_base, tuple(freq_config.values()) if freq_config else freq_config, dtype)

        # Create the buffers only if they are not cached yet
        if key not in SharedBuffers._buffers:
            # Create the causal mask
            mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
            # Precompute the RoPE parameters
            cos, sin = precompute_rope_params(head_dim, rope_base, context_length, freq_config)
            # Convert to the requested dtype
            if dtype is not None:
                cos = cos.to(dtype)
                sin = sin.to(dtype)
            # Store the buffers in the cache
            SharedBuffers._buffers[key] = (mask, cos, sin)

        # Return the cached buffers
        return SharedBuffers._buffers[key]


# Grouped-query attention
class GroupedQueryAttention(nn.Module):
    def __init__(
            self, d_in, d_out, context_length, num_heads,
            num_kv_groups,
            rope_base=10_000,
            rope_config=None,
            dtype=None
        ):
        super().__init__()
        # The output dimension must be divisible by the number of heads
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        # The number of heads must be divisible by the number of key-value groups
        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"

        # Store the model dimensions
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        # Key and value projections produce one head per key-value group
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups

        # Query and output projections use the full number of heads
        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)

        # Fetch the shared buffers
        mask, cos, sin = SharedBuffers.get_buffers(context_length, self.head_dim, rope_base, rope_config, dtype)
        # Register the buffers
        self.register_buffer("mask", mask)
        self.register_buffer("cos", cos)
        self.register_buffer("sin", sin)

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        # Compute queries, keys, and values
        queries = self.W_query(x)  # Shape: (b, num_tokens, d_out)
        keys = self.W_key(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)
        values = self.W_value(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)

        # Split the last dimension into heads (queries) or groups (keys, values)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
        keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim)
        values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim)

        # Move the head/group dimension in front of the token dimension
        keys = keys.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        values = values.transpose(1, 2)  # Shape: (b, num_kv_groups, num_tokens, head_dim)
        queries = queries.transpose(1, 2)  # Shape: (b, num_heads, num_tokens, head_dim)

        # Apply RoPE
        keys = compute_rope(keys, self.cos, self.sin)
        queries = compute_rope(queries, self.cos, self.sin)

        # Repeat keys and values so that each query head has a matching key-value head
        keys = keys.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)
        values = values.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)

        # Compute the attention scores
        attn_scores = queries @ keys.transpose(2, 3)  # Shape: (b, num_heads, num_tokens, num_tokens)

        # Truncate the mask to the number of tokens and convert it to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Mask out future positions
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        # Compute the attention weights
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        assert keys.shape[-1] == self.head_dim

        # Compute the context vectors
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine the heads
        context_vec = context_vec.reshape(b, num_tokens, self.d_out)
        # Apply the output projection
        context_vec = self.out_proj(context_vec)

        return context_vec
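
The effect of `repeat_interleave(self.group_size, dim=1)` is that consecutive query heads share one key-value head. A minimal sketch in plain Python, assuming the head counts of the 1B configuration used later in this notebook (32 heads, 8 key-value groups, embedding dimension 2048); the helper name `kv_group_for_head` is hypothetical:

```python
def kv_group_for_head(num_heads, num_kv_groups):
    """Return, for each query head, the index of the key-value group it uses."""
    assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"
    group_size = num_heads // num_kv_groups
    # repeat_interleave repeats each group group_size times in a row,
    # so head h ends up paired with group h // group_size
    return [h // group_size for h in range(num_heads)]


num_heads, num_kv_groups, emb_dim = 32, 8, 2048
head_dim = emb_dim // num_heads

mapping = kv_group_for_head(num_heads, num_kv_groups)
print(mapping[:8])

# Width of the key (or value) projection compared with the query projection
kv_width = num_kv_groups * head_dim
print(f"W_key maps {emb_dim} -> {kv_width}, W_query maps {emb_dim} -> {emb_dim}")
```

With these settings the key and value projections are four times narrower than the query projection, which is where grouped-query attention saves parameters.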

In [7]:
# Transformer block: one basic unit of the transformer architecture
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()

        # Grouped-query attention layer
        self.att =  GroupedQueryAttention(
            d_in=cfg["emb_dim"],          # Input dimension
            d_out=cfg["emb_dim"],         # Output dimension
            context_length=cfg["context_length"],  # Context length
            num_heads=cfg["n_heads"],      # Number of attention heads
            num_kv_groups=cfg["n_kv_groups"], # Number of key-value groups
            rope_base=cfg["rope_base"],    # RoPE base
            rope_config=cfg["rope_freq"],  # RoPE frequency scaling config
            dtype=cfg["dtype"]             # Data type
        )

        # Feed-forward layer
        self.ff = FeedForward(cfg)

        # Two RMSNorm layers, one before attention and one before the feed-forward layer
        self.norm1 = nn.RMSNorm(cfg["emb_dim"], eps=1e-5)
        self.norm2 = nn.RMSNorm(cfg["emb_dim"], eps=1e-5)

    def forward(self, x):
        # Shortcut connection for the attention block
        shortcut = x                           # Save the input for the residual connection
        x = self.norm1(x)                      # Apply the first normalization layer
        x = self.att(x.to(torch.bfloat16))    # Apply attention; cast the input to bfloat16
        x = x + shortcut                       # Add the residual connection

        # Shortcut connection for the feed-forward block
        shortcut = x                           # Save the intermediate result for the residual connection
        x = self.norm2(x)                      # Apply the second normalization layer
        x = self.ff(x.to(torch.bfloat16))     # Apply the feed-forward layer; cast the input to bfloat16
        x = x + shortcut                       # Add the residual connection

        return x
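
The block above normalizes the input before each sublayer and then adds the unnormalized input back. As a minimal sketch in plain Python, assuming the usual RMSNorm definition (divide by the root mean square of the features, then multiply by a learned weight); `rms_norm` and `pre_norm_residual` are hypothetical helper names, not part of the notebook's code:

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal root mean square of its elements, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def pre_norm_residual(x, weight, sublayer):
    """Compute x + sublayer(rms_norm(x)), the pattern used twice per block."""
    normed = rms_norm(x, weight)
    return [a + b for a, b in zip(x, sublayer(normed))]


x = [3.0, 4.0]
weight = [1.0, 1.0]

normed = rms_norm(x, weight, eps=0.0)
print(normed)

# A stand-in sublayer that halves its input (hypothetical, for illustration only)
out = pre_norm_residual(x, weight, lambda v: [0.5 * t for t in v])
print(out)
```

After normalization the mean of the squared features is 1 regardless of the input scale, while the residual path keeps the original signal intact.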

In [8]:
class Llama3Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()

        # Token embedding layer: maps vocabulary indices to embedding vectors
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])

        # Stack of n_layers transformer blocks
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        # Final RMSNorm layer
        self.final_norm = nn.RMSNorm(cfg["emb_dim"], eps=1e-5)

        # Output layer: maps the embedding dimension to the vocabulary size (no bias)
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])

    def forward(self, in_idx):
        # Convert token indices to embedding vectors
        tok_embeds = self.tok_emb(in_idx)
        x = tok_embeds

        # Pass through all transformer blocks in order
        x = self.trf_blocks(x)

        # Apply the final normalization layer
        x = self.final_norm(x)

        # Compute the logits; cast the input to bfloat16
        logits = self.out_head(x.to(torch.bfloat16))

        return logits

&nbsp;
# 2. Initialize model

- The remainder of this notebook uses the Llama 3.2 1B model; to use the 3B model variant, just uncomment the second configuration dictionary in the following code cell

In [9]:
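# Llama 3.2 1B

LLAMA32_CONFIG = {
    "vocab_size": 128_256,      # Vocabulary size
    "context_length": 131_072,  # Context length
    "emb_dim": 2048,            # Embedding dimension
    "n_heads": 32,              # Number of attention heads
    "n_layers": 16,             # Number of layers
    "hidden_dim": 8192,         # Size of the intermediate dimension in FeedForward
    "n_kv_groups": 8,           # Key-value groups for grouped-query attention
    "rope_base": 500_000.0,     # The base in RoPE's "theta"
    "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
    "rope_freq": {              # RoPE frequency scaling
        "factor": 32.0,         # Scaling factor
        "low_freq_factor": 1.0, # Low-frequency factor
        "high_freq_factor": 4.0,# High-frequency factor
        "original_context_length": 8192, # Original context length
    }
}

# Llama 3.2 3B

# LLAMA32_CONFIG = {
#     "vocab_size": 128_256,      # Vocabulary size
#     "context_length": 131_072,  # Context length
#     "emb_dim": 3072,            # Embedding dimension
#     "n_heads": 24,              # Number of attention heads
#     "n_layers": 28,             # Number of layers
#     "hidden_dim": 8192,         # Size of the intermediate dimension in FeedForward
#     "n_kv_groups": 8,           # Key-value groups for grouped-query attention
#     "rope_base": 500_000.0,     # The base in RoPE's "theta"
#     "dtype": torch.bfloat16,    # Lower-precision dtype to reduce memory usage
#     "rope_freq": {              # RoPE frequency scaling
#         "factor": 32.0,         # Scaling factor
#         "low_freq_factor": 1.0, # Low-frequency factor
#         "high_freq_factor": 4.0,# High-frequency factor
#         "original_context_length": 8192, # Original context length
#     }
# }

# Derive the model size label (1B or 3B) from the embedding dimension
LLAMA_SIZE_STR = "1B" if LLAMA32_CONFIG["emb_dim"] == 2048 else "3B"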

- Reduce the context length so the model runs fine on a MacBook Air (if you have more RAM, feel free to comment out the lines below):

In [10]:
# Keep the original context length
old_context_length = LLAMA32_CONFIG["context_length"]
# Use a smaller context length to save memory
LLAMA32_CONFIG["context_length"] = 8192


def rescale_theta(theta_old, context_length_old, context_length_new):
    """Rescale the RoPE theta value for a new context length.

    Args:
        theta_old: original theta value
        context_length_old: original context length
        context_length_new: new context length

    Returns:
        theta_new: rescaled theta value
    """
    scaling_factor = context_length_new / context_length_old
    theta_new = theta_old * scaling_factor
    return theta_new

# Recompute the RoPE base for the new context length
LLAMA32_CONFIG["rope_base"] = rescale_theta(
    LLAMA32_CONFIG["rope_base"],
    old_context_length,
    LLAMA32_CONFIG["context_length"]
)

# Print the new RoPE theta value
print("New RoPE theta:", LLAMA32_CONFIG["rope_base"])

New RoPE theta: 31250.0
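
To see why the shorter context length matters for memory, consider the causal mask alone, which is a `context_length` x `context_length` matrix. The following is a rough estimate in plain Python, assuming the mask is stored with 4 bytes per element (PyTorch's default float32 for `torch.ones`); `mask_gib` is a hypothetical helper name.

```python
def mask_gib(context_length, bytes_per_element=4):
    """Approximate size in GiB of a square (context_length x context_length) mask."""
    return context_length ** 2 * bytes_per_element / 1024 ** 3


full_mask = mask_gib(131_072)   # original Llama 3.2 context length
small_mask = mask_gib(8192)     # reduced context length used in this notebook
print(f"Mask at 131,072 tokens: {full_mask:.2f} GiB")
print(f"Mask at 8,192 tokens:   {small_mask:.2f} GiB")

# The theta rescaling above is a plain proportional scaling
scaling_factor = 8192 / 131_072
theta_new = 500_000.0 * scaling_factor
print("Scaling factor:", scaling_factor, "-> theta:", theta_new)
```

Under this assumption the mask shrinks by a factor of 256 when the context length shrinks by a factor of 16, because its size grows quadratically with the context length.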


In [11]:
# Create the Llama 3 model instance from the configuration
model = Llama3Model(LLAMA32_CONFIG)

- The following is expected to print True to confirm buffers are reused instead of being (wastefully) recreated:

In [12]:
# Check that the buffers are reused:
# the first and last transformer blocks should hold the very same mask object
print(model.trf_blocks[0].att.mask is model.trf_blocks[-1].att.mask)
# ... the same cosine table
print(model.trf_blocks[0].att.cos is model.trf_blocks[-1].att.cos)
# ... and the same sine table
print(model.trf_blocks[0].att.sin is model.trf_blocks[-1].att.sin)

True
True
True


In [13]:
# Count the total number of model parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")

# Account for weight tying: the output layer shares its weights with the
# token embedding layer, so subtract the embedding parameters once
total_params_normalized = total_params - model.tok_emb.weight.numel()
print(f"\nTotal number of unique parameters: {total_params_normalized:,}")

Total number of parameters: 1,498,482,688

Total number of unique parameters: 1,235,814,400
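
The two totals above can be reproduced by hand from the 1B configuration values and the layer shapes defined in section 1. The following is a sketch in plain Python; `count_llama_params` is a hypothetical helper, and the dictionary repeats only the configuration entries needed for counting.

```python
def count_llama_params(cfg):
    """Count weights of the architecture above (no bias terms anywhere)."""
    emb_dim = cfg["emb_dim"]
    head_dim = emb_dim // cfg["n_heads"]
    kv_dim = cfg["n_kv_groups"] * head_dim

    embedding = cfg["vocab_size"] * emb_dim       # tok_emb (out_head has the same size)
    attention = 2 * emb_dim * emb_dim             # W_query and out_proj
    attention += 2 * emb_dim * kv_dim             # W_key and W_value
    feed_forward = 3 * emb_dim * cfg["hidden_dim"]  # fc1, fc2, fc3
    norms = 2 * emb_dim                           # norm1 and norm2

    per_layer = attention + feed_forward + norms
    total = 2 * embedding + cfg["n_layers"] * per_layer + emb_dim  # + final_norm
    unique = total - embedding                    # count tied weights once
    return total, unique


cfg_1b = {
    "vocab_size": 128_256,
    "emb_dim": 2048,
    "n_heads": 32,
    "n_layers": 16,
    "hidden_dim": 8192,
    "n_kv_groups": 8,
}

total, unique = count_llama_params(cfg_1b)
print(f"Total number of parameters: {total:,}")
print(f"Total number of unique parameters: {unique:,}")
```

The token embedding matrix alone accounts for 262,668,288 weights, which is why counting it once or twice changes the total noticeably.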


In [14]:
# Estimate the memory footprint of a model
def model_memory_size(model, input_dtype=torch.float32):
    total_params = 0
    total_grads = 0
    for param in model.parameters():
        # Number of elements in this parameter
        param_size = param.numel()
        total_params += param_size
        # Gradients are stored only for parameters that require them
        if param.requires_grad:
            total_grads += param_size

    # Buffers are non-parameter tensors that also occupy memory
    total_buffers = sum(buf.numel() for buf in model.buffers())

    # Bytes per element for the given dtype
    element_size = torch.tensor(0, dtype=input_dtype).element_size()
    # Total bytes = (number of elements) * (bytes per element)
    total_memory_bytes = (total_params + total_grads + total_buffers) * element_size

    # Convert bytes to gigabytes
    total_memory_gb = total_memory_bytes / (1024**3)

    return total_memory_gb

# Memory footprint with float32 (the PyTorch default)
print(f"float32 (PyTorch default): {model_memory_size(model, input_dtype=torch.float32):.2f} GB")
# Memory footprint with bfloat16
print(f"bfloat16: {model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB")

float32 (PyTorch default): 11.42 GB
bfloat16: 5.71 GB
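
The estimate above can be checked with plain arithmetic. This sketch assumes that the shared mask, cosine, and sine buffers are counted once rather than once per block, and it takes the parameter total from the output printed earlier; `memory_gb` is a hypothetical helper name.

```python
total_params = 1_498_482_688      # from the parameter count printed earlier
context_length, head_dim = 8192, 64

# One causal mask plus one cos and one sin table (assumed to be counted once)
buffer_elements = context_length ** 2 + 2 * context_length * head_dim


def memory_gb(bytes_per_element):
    """Parameters + gradients (same count as parameters) + buffers, in GB."""
    elements = 2 * total_params + buffer_elements
    return elements * bytes_per_element / 1024 ** 3


print(f"float32:  {memory_gb(4):.2f} GB")
print(f"bfloat16: {memory_gb(2):.2f} GB")
```

Note that the estimate includes one gradient per parameter, so it describes a training-style footprint; running inference only needs roughly half of that for the weights.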


In [15]:
# Use a CUDA GPU if one is available
if torch.cuda.is_available():
    device = torch.device("cuda")
# Otherwise use an Apple Silicon GPU (MPS) if one is available
elif torch.backends.mps.is_available():
    device = torch.device("mps")
# Fall back to the CPU
else:
    device = torch.device("cpu")

# Move the model to the selected device
model.to(device);

&nbsp;
# 3. Load tokenizer

In [16]:
# Standard library imports
import os
from pathlib import Path

# tiktoken tokenizer imports
import tiktoken
from tiktoken.load import load_tiktoken_bpe


class Tokenizer:
    def __init__(self, model_path):
        # Make sure the tokenizer model file exists
        assert os.path.isfile(model_path), f"Model file {model_path} not found"
        # Load the BPE (byte pair encoding) merge ranks
        mergeable_ranks = load_tiktoken_bpe(model_path)

        # Special tokens and their IDs
        self.special_tokens = {
            "<|begin_of_text|>": 128000,
            "<|end_of_text|>": 128001,
            "<|start_header_id|>": 128006,
            "<|end_header_id|>": 128007,
            "<|eot_id|>": 128009,
        }
        # Add reserved special tokens for the IDs that are not taken above
        self.special_tokens.update({
            f"<|reserved_{i}|>": 128002 + i for i in range(256) if (128002 + i) not in self.special_tokens.values()
        })

        # Create the tiktoken encoding
        self.model = tiktoken.Encoding(
            name=Path(model_path).name,
            pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
            mergeable_ranks=mergeable_ranks,
            special_tokens=self.special_tokens
        )


    def encode(self, text, bos=False, eos=False, allowed_special=set(), disallowed_special=()):
        # Optionally prepend the beginning-of-text token
        if bos:
            tokens = [self.special_tokens["<|begin_of_text|>"]]
        else:
            tokens = []

        # Encode the text
        tokens += self.model.encode(text, allowed_special=allowed_special, disallowed_special=disallowed_special)

        # Optionally append the end-of-text token
        if eos:
            tokens.append(self.special_tokens["<|end_of_text|>"])
        return tokens

    def decode(self, tokens):
        # Decode token IDs back into text
        return self.model.decode(tokens)


class ChatFormat:
    def __init__(self, tokenizer):
        # Wrap a tokenizer to produce chat-formatted token sequences
        self.tokenizer = tokenizer

    def encode_header(self, message):
        # Encode the message header: <|start_header_id|>role<|end_header_id|>\n\n
        tokens = []
        tokens.append(self.tokenizer.special_tokens["<|start_header_id|>"])
        tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))
        tokens.append(self.tokenizer.special_tokens["<|end_header_id|>"])
        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
        return tokens

    def encode(self, text):
        # Format the text as a user chat message and encode it
        message = {
            "role": "user",
            "content": text
        }

        tokens = self.encode_header(message)
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
        )
        tokens.append(self.tokenizer.special_tokens["<|eot_id|>"])
        return tokens

    def decode(self, token_ids):
        # Decode token IDs back into text
        return self.tokenizer.decode(token_ids)
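
`ChatFormat.encode` builds its token sequence piece by piece. Written out as text, the sequence corresponds to the template below; this is a sketch in plain Python that only shows the string layout, and `format_user_message` is a hypothetical helper that does not tokenize anything.

```python
def format_user_message(text, role="user"):
    """Text equivalent of the token layout produced by ChatFormat.encode."""
    header = f"<|start_header_id|>{role}<|end_header_id|>\n\n"
    return header + text.strip() + "<|eot_id|>"


formatted = format_user_message("  What do llamas eat?  ")
print(repr(formatted))
```

As in the class above, the content is stripped of surrounding whitespace, and no `<|begin_of_text|>` token is prepended because the tokenizer is called with `bos=False`.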

- Please note that Meta AI requires that you accept the Llama 3.2 licensing terms before you can download the files; to do this, you have to create a Hugging Face Hub account and visit the [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) repository to accept the terms

- Next, you will need to create an access token; to generate an access token with READ permissions, click on the profile picture in the upper right and click on "Settings"

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/settings.webp?1" width="300px">

- Then, create the access token and copy it so you can paste it into the next code cell

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/access-token.webp?1" width="600px">

In [17]:
# Import the login function from huggingface_hub
from huggingface_hub import login

# Log in to the Hugging Face Hub
login()

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /teamspace/studios/this_studio/.cache/huggingface/token
Login successful


In [18]:
# Import the download function from huggingface_hub
from huggingface_hub import hf_hub_download

# Download the tokenizer model file from the Hugging Face Hub
tokenizer_file_path = hf_hub_download(
    repo_id=f"meta-llama/Llama-3.2-{LLAMA_SIZE_STR}-Instruct",  # Model repository ID
    filename="original/tokenizer.model",  # Tokenizer file name
    local_dir=f"Llama-3.2-{LLAMA_SIZE_STR}-Instruct"  # Local target directory
)

In [19]:
# Create the tokenizer
tokenizer = Tokenizer(tokenizer_file_path)
# Create the chat formatter that wraps prompts in the chat template
chat_tokenizer = ChatFormat(tokenizer)

&nbsp;
# 4. Load pretrained weights

In [20]:
def assign(left, right, tensor_name="unknown"):
    # Raise an error if the shapes of the two tensors do not match
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch in tensor '{tensor_name}'. Left: {left.shape}, Right: {right.shape}")

    # Clone and detach if right is already a tensor; otherwise convert it to a tensor
    if isinstance(right, torch.Tensor):
        return torch.nn.Parameter(right.clone().detach())
    else:
        return torch.nn.Parameter(torch.tensor(right))


def load_weights_into_llama(model, param_config, params):
    # Load the token embedding weights
    model.tok_emb.weight = assign(model.tok_emb.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")

    # Iterate over the transformer blocks
    for l in range(param_config["n_layers"]):

        # Load the attention weights
        model.trf_blocks[l].att.W_query.weight = assign(
            model.trf_blocks[l].att.W_query.weight,
            params[f"model.layers.{l}.self_attn.q_proj.weight"],
            f"model.layers.{l}.self_attn.q_proj.weight"
        )
        model.trf_blocks[l].att.W_key.weight = assign(
            model.trf_blocks[l].att.W_key.weight,
            params[f"model.layers.{l}.self_attn.k_proj.weight"],
            f"model.layers.{l}.self_attn.k_proj.weight"
        )
        model.trf_blocks[l].att.W_value.weight = assign(
            model.trf_blocks[l].att.W_value.weight,
            params[f"model.layers.{l}.self_attn.v_proj.weight"],
            f"model.layers.{l}.self_attn.v_proj.weight"
        )
        model.trf_blocks[l].att.out_proj.weight = assign(
            model.trf_blocks[l].att.out_proj.weight,
            params[f"model.layers.{l}.self_attn.o_proj.weight"],
            f"model.layers.{l}.self_attn.o_proj.weight"
        )
        model.trf_blocks[l].norm1.weight = assign(
            model.trf_blocks[l].norm1.weight,
            params[f"model.layers.{l}.input_layernorm.weight"],
            f"model.layers.{l}.input_layernorm.weight"
        )

        # Load the feed-forward weights
        model.trf_blocks[l].ff.fc1.weight = assign(
            model.trf_blocks[l].ff.fc1.weight,
            params[f"model.layers.{l}.mlp.gate_proj.weight"],
            f"model.layers.{l}.mlp.gate_proj.weight"
        )
        model.trf_blocks[l].ff.fc2.weight = assign(
            model.trf_blocks[l].ff.fc2.weight,
            params[f"model.layers.{l}.mlp.up_proj.weight"],
            f"model.layers.{l}.mlp.up_proj.weight"
        )
        model.trf_blocks[l].ff.fc3.weight = assign(
            model.trf_blocks[l].ff.fc3.weight,
            params[f"model.layers.{l}.mlp.down_proj.weight"],
            f"model.layers.{l}.mlp.down_proj.weight"
        )
        model.trf_blocks[l].norm2.weight = assign(
            model.trf_blocks[l].norm2.weight,
            params[f"model.layers.{l}.post_attention_layernorm.weight"],
            f"model.layers.{l}.post_attention_layernorm.weight"
        )

    # Load the final normalization weights
    model.final_norm.weight = assign(model.final_norm.weight, params["model.norm.weight"], "model.norm.weight")

    # Load the output layer weights: use lm_head if present,
    # otherwise reuse the token embedding weights (weight tying)
    if "lm_head.weight" in params.keys():
        model.out_head.weight = assign(model.out_head.weight, params["lm_head.weight"], "lm_head.weight")
    else:
        model.out_head.weight = assign(model.out_head.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")
        print("Model uses weight tying.")
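
Since `load_weights_into_llama` indexes the checkpoint dictionary by name, a missing tensor only surfaces as a `KeyError` partway through loading. One possible pre-check, sketched in plain Python, lists the names the function reads; `expected_checkpoint_keys` and `missing_keys` are hypothetical helpers, and the key names are copied from the function above.

```python
PER_LAYER_SUFFIXES = [
    "self_attn.q_proj.weight",
    "self_attn.k_proj.weight",
    "self_attn.v_proj.weight",
    "self_attn.o_proj.weight",
    "input_layernorm.weight",
    "mlp.gate_proj.weight",
    "mlp.up_proj.weight",
    "mlp.down_proj.weight",
    "post_attention_layernorm.weight",
]


def expected_checkpoint_keys(n_layers):
    """Names read by load_weights_into_llama (lm_head.weight is optional)."""
    keys = ["model.embed_tokens.weight", "model.norm.weight"]
    for l in range(n_layers):
        keys.extend(f"model.layers.{l}.{suffix}" for suffix in PER_LAYER_SUFFIXES)
    return keys


def missing_keys(params, n_layers):
    """Return the expected names that are absent from the params dictionary."""
    return [key for key in expected_checkpoint_keys(n_layers) if key not in params]


# Usage with a stand-in dictionary (real checkpoints map names to tensors)
fake_params = {key: None for key in expected_checkpoint_keys(2)}
del fake_params["model.layers.1.mlp.up_proj.weight"]
print(missing_keys(fake_params, n_layers=2))
```

Running such a check before loading gives one complete list of missing names instead of stopping at the first failure.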

In [21]:
# Import load_file from safetensors to read the model weights
from safetensors.torch import load_file


# The 1B model is stored in a single weights file
if LLAMA_SIZE_STR == "1B":
    # Download the weights file from the Hugging Face Hub
    weights_file = hf_hub_download(
        repo_id=f"meta-llama/Llama-3.2-{LLAMA_SIZE_STR}-Instruct",
        filename="model.safetensors",
        local_dir=f"Llama-3.2-{LLAMA_SIZE_STR}-Instruct"
    )
    # Load the weights
    combined_weights = load_file(weights_file)


# The 3B model is split across multiple weight shards
else:
    # Dictionary that collects the weights from all shards
    combined_weights = {}
    # Iterate over the two shards
    for i in range(1, 3):
        # Download the shard from the Hugging Face Hub
        weights_file = hf_hub_download(
            repo_id=f"meta-llama/Llama-3.2-{LLAMA_SIZE_STR}-Instruct",
            filename=f"model-0000{i}-of-00002.safetensors",
            local_dir=f"Llama-3.2-{LLAMA_SIZE_STR}-Instruct"
        )
        # Load the shard and merge it into the combined dictionary
        current_weights = load_file(weights_file)
        combined_weights.update(current_weights)


# Load the weights into the model
load_weights_into_llama(model, LLAMA32_CONFIG, combined_weights)
# Move the model to the selected device
model.to(device)
del combined_weights  # free up memory

Model uses weight tying.


In [22]:
# Check for weight tying by comparing the token embedding weights
# with the output layer weights
print("Weight tying:", torch.equal(model.tok_emb.weight, model.out_head.weight))

Weight tying: True


&nbsp;
# 5. Generate text

In [23]:
# Convert text to token IDs
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text)
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)  # Add the batch dimension
    return encoded_tensor


# Convert token IDs back to text
def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # Remove the batch dimension
    return tokenizer.decode(flat.tolist())


# Text generation function
def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):

    # Generate up to max_new_tokens new tokens
    for _ in range(max_new_tokens):
        # Condition on the last context_size tokens
        idx_cond = idx[:, -context_size:]
        # Run the model without tracking gradients
        with torch.no_grad():
            logits = model(idx_cond)
        # Keep only the logits of the last time step
        logits = logits[:, -1, :]

        # Filter the logits with top-k sampling
        if top_k is not None:
            # Keep the top_k largest logits
            top_logits, _ = torch.topk(logits, top_k)
            # Use the k-th largest value per row as the threshold
            # (kept as a column so that it broadcasts across the vocabulary)
            min_val = top_logits[:, -1:]
            # Set all logits below the threshold to negative infinity
            logits = torch.where(logits < min_val, torch.tensor(float('-inf')).to(logits.device), logits)

        # Apply temperature scaling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, vocab_size)

            # Sample the next token from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # With temperature 0, pick the token with the highest logit
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        # Stop early if an eos_id is specified and the end-of-sequence token is generated
        if eos_id is not None and torch.all(idx_next == eos_id):
            break

        # Append the new token to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

    return idx
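
The top-k filtering and temperature scaling steps inside `generate` can be tried on a handful of numbers. The following is a minimal sketch in plain Python for a single row of logits; `top_k_filter` and `softmax_with_temperature` are hypothetical helper names, and the logit values are made up.

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits and set all others to negative infinity."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [v if v >= threshold else float("-inf") for v in logits]


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a stable softmax."""
    scaled = [v / temperature for v in logits]
    largest = max(scaled)
    exps = [math.exp(v - largest) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 2.0, 0.5]

filtered = top_k_filter(logits, k=2)
probs = softmax_with_temperature(filtered, temperature=1.0)
sharper = softmax_with_temperature(filtered, temperature=0.5)

print(filtered)
print(probs)
print(sharper)
```

Filtered tokens receive probability 0, and a temperature below 1 shifts more probability mass onto the largest remaining logit. With `top_k=1`, only one token survives, so sampling and `argmax` select the same token.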

In [24]:
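# Set the prompt
PROMPT = "What do llamas eat?"

# Set the random seed for reproducibility
torch.manual_seed(123)

# Generate text
token_ids = generate(
    model=model,  # Model with the pretrained weights loaded
    idx=text_to_token_ids(PROMPT, chat_tokenizer).to(device),  # Chat-formatted prompt as token IDs on the selected device
    max_new_tokens=150,  # Generate at most 150 new tokens
    context_size=LLAMA32_CONFIG["context_length"],  # Context length from the configuration
    top_k=1,  # Keep only the single most likely token
    temperature=0.  # Temperature 0: always pick the most likely token
)

# Convert the generated token IDs back to text
output_text = token_ids_to_text(token_ids, tokenizer)


def clean_text(text, header_end="assistant<|end_header_id|>\n\n"):
    """
    Clean the generated text by removing everything up to the assistant header.

    Args:
        text: raw text to clean
        header_end: string that marks the end of the header,
            "assistant<|end_header_id|>\n\n" by default
    """
    # Find the end of the header
    index = text.find(header_end)

    if index != -1:
        # Header found: return the text after it, stripped of surrounding whitespace
        return text[index + len(header_end):].strip()
    else:
        # Header not found: return the original text
        return text

# Print the cleaned output text
print("Output text:\n", clean_text(output_text))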

Output text:
 Llamas are herbivores, which means they primarily eat plants. Their diet consists mainly of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses and grassy meadows.
2. Hay: Llamas also eat hay, which is a dry, compressed form of grass or other plants.
3. Alfalfa: Alfalfa is a legume that is commonly fed to llamas. It is high in protein and fiber.
4. Other plants: Llamas will also eat other plants, such as wild grasses, shrubs, and trees.

It's worth noting that the diet of llamas can vary depending on the region, climate,


&nbsp;
# What's next?

- The notebook was kept purposefully minimal; if you are interested in additional explanations of the individual components, check out the following two companion notebooks:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gpt-to-llama/gpt-and-all-llamas.webp">

  1. [Converting a From-Scratch GPT Architecture to Llama 2](converting-gpt-to-llama2.ipynb)
  2. [Converting Llama 2 to Llama 3.2 From Scratch](converting-llama2-to-llama3.ipynb)
  
- If you want a comprehensive guide to building a large language model from scratch and a deeper understanding of its mechanics, you might like my book [Build a Large Language Model (From Scratch)](http://mng.bz/orYv)

<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>