<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Chapter 5 Exercise solutions

In [1]:
from importlib.metadata import version

pkgs = ["numpy", 
        "tiktoken", 
        "torch",
        "tensorflow" # For OpenAI's pretrained weights
       ]
for p in pkgs:
    print(f"{p} version: {version(p)}")

numpy version: 1.26.4
tiktoken version: 0.7.0
torch version: 2.4.0
tensorflow version: 2.16.1


# Exercise 5.1: Temperature-scaled softmax scores and sampling probabilities

- We can print the number of times the word "pizza" is sampled using the `print_sampled_tokens` function we defined in this section
- Let's start with the code we defined in section 5.3.1

- "pizza" is sampled 0x if the temperature is 1 or 0.1, and it is sampled 32x if the temperature is scaled up to 5. The estimated probability is 32/1000 * 100% = 3.2%

- The actual probability is 4.3% and is contained in the rescaled softmax probability tensor (`scaled_probas[2][6]`)

- Below is a self-contained example using code from chapter 5:

In [2]:
# Import the PyTorch library
import torch

# Vocabulary that maps each word to its token index
vocab = {
    "closer": 0,
    "every": 1,
    "effort": 2,
    "forward": 3,
    "inches": 4,
    "moves": 5,
    "pizza": 6,
    "toward": 7,
    "you": 8,
}
# Inverse vocabulary that maps each token index back to its word
inverse_vocab = {v: k for k, v in vocab.items()}

# Logits for the next token
next_token_logits = torch.tensor(
    [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
)

# Print how often each token is sampled
def print_sampled_tokens(probas):
    # Fix the random seed for reproducibility
    torch.manual_seed(123)
    # Draw 1,000 samples from the probability distribution
    sample = [torch.multinomial(probas, num_samples=1).item() for i in range(1_000)]
    # Count how often each token ID was sampled; IDs above the highest
    # sampled ID are not part of the result and are therefore not printed
    sampled_ids = torch.bincount(torch.tensor(sample))
    # Print the sampling frequency of each token
    for i, freq in enumerate(sampled_ids):
        print(f"{freq} x {inverse_vocab[i]}")


# Softmax with temperature scaling
def softmax_with_temperature(logits, temperature):
    # Scale the logits by the temperature
    scaled_logits = logits / temperature
    # Return the softmax probabilities
    return torch.softmax(scaled_logits, dim=0)


# Temperature values to compare
temperatures = [1, 0.1, 5]  # Original, lower, and higher temperature
# Probability distribution at each temperature
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]

- Now, we can iterate over the `scaled_probas` and print the sampling frequencies in each case:

In [3]:
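# Iterate over the probability distributions for the different temperatures
for i, probas in enumerate(scaled_probas):
    # Print the current temperature
    print("\n\nTemperature:", temperatures[i])
    # Print the sampled tokens and their frequencies at this temperature
    print_sampled_tokens(probas)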



Temperature: 1
73 x closer
0 x every
0 x effort
582 x forward
2 x inches
0 x moves
0 x pizza
343 x toward


Temperature: 0.1
0 x closer
0 x every
0 x effort
985 x forward
0 x inches
0 x moves
0 x pizza
15 x toward


Temperature: 5
165 x closer
75 x every
42 x effort
239 x forward
71 x inches
46 x moves
32 x pizza
227 x toward
103 x you


- Note that sampling only offers an approximation of the actual probability that the word "pizza" is sampled
- E.g., if it is sampled 32/1000 times, the estimated probability is 3.2%
- To obtain the actual probability, we can check the probabilities directly by accessing the corresponding entry in `scaled_probas`

- Since "pizza" is the 7th entry in the vocabulary (index 6), for the temperature of 5, we obtain it as follows:

In [4]:
# Index of the temperature-5 distribution in `scaled_probas`
temp5_idx = 2
# Index of "pizza" in the vocabulary
pizza_idx = 6

# Probability that "pizza" is sampled at a temperature of 5
scaled_probas[temp5_idx][pizza_idx]

tensor(0.0430)

There is a 4.3% probability that the word "pizza" is sampled if the temperature is set to 5
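
The gap between the sampled estimate (3.2%) and the exact value (4.3%) is roughly what we would expect from only 1,000 draws. The sketch below recomputes the temperature-scaled softmax with the Python standard library and compares the gap to the binomial standard error. The helper name `softmax_with_temperature_py` and the standard-error check are illustrative assumptions and not part of the chapter's code.

```python
import math

# Logits from the example above; "pizza" has index 6
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]
pizza_idx = 6


def softmax_with_temperature_py(logits, temperature):
    # Hypothetical pure-Python counterpart of softmax_with_temperature
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


probas = softmax_with_temperature_py(next_token_logits, temperature=5)
p_exact = probas[pizza_idx]

# Observed frequency from the sampling run above: 32 out of 1,000 draws
n_samples = 1_000
p_estimated = 32 / n_samples

# Standard error of a frequency estimated from n independent draws
std_error = math.sqrt(p_exact * (1 - p_exact) / n_samples)
gap_in_std_errors = abs(p_exact - p_estimated) / std_error

print(f"Exact probability:      {p_exact:.4f}")
print(f"Estimated probability:  {p_estimated:.4f}")
print(f"Standard error:         {std_error:.4f}")
print(f"Gap in standard errors: {gap_in_std_errors:.2f}")
```

The estimate lies within about two standard errors of the exact value, so the difference is consistent with sampling noise; drawing more samples should shrink the gap.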

# Exercise 5.2: Different temperature and top-k settings

- Both temperature and top-k settings have to be adjusted based on the individual LLM (a trial-and-error process until it generates desirable outputs)
- The desirable outcomes are also application-specific, though
  - Lower top-k and temperatures result in less random outcomes, which is desired when creating educational content, technical writing or question answering, data analyses, code generation, and so forth
  - Higher top-k and temperatures result in more diverse and random outputs, which is more desirable for brainstorming tasks, creative writing, and so forth
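
To make the effect of both settings concrete, the following sketch applies top-k filtering and temperature scaling to the logits from exercise 5.1 and reports how many tokens remain candidates and how spread out the distribution is (measured by entropy). It is a minimal pure-Python illustration; the function names `top_k_temperature_probas` and `entropy` and the three example settings are assumptions made for this sketch, not recommended values.

```python
import math

# Logits from exercise 5.1
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]


def top_k_temperature_probas(logits, top_k=None, temperature=1.0):
    # Keep only the top_k largest logits and mask the rest with -inf
    # (ties at the threshold would keep more than top_k entries)
    if top_k is not None:
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= threshold else float("-inf") for x in logits]
    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probas):
    # Shannon entropy in nats; higher values mean more random sampling
    return -sum(p * math.log(p) for p in probas if p > 0)


decoding_settings = [
    {"top_k": 3, "temperature": 0.5},     # conservative
    {"top_k": None, "temperature": 1.0},  # no filtering, no scaling
    {"top_k": None, "temperature": 5.0},  # diverse
]
for s in decoding_settings:
    probas = top_k_temperature_probas(next_token_logits, **s)
    n_candidates = sum(p > 0 for p in probas)
    print(s, f"candidates={n_candidates}", f"entropy={entropy(probas):.3f}")
```

The entropy grows from the conservative to the diverse setting, which mirrors the trade-off between predictable and varied outputs described above.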

# Exercise 5.3: Deterministic behavior in the decoding functions

There are multiple ways to force deterministic behavior with the `generate` function:

1. Setting `top_k=None` and applying no temperature scaling (`temperature=0.0`);

2. Setting `top_k=1`.

Below is a self-contained example using code from chapter 5:

In [5]:
# Import the required libraries
import tiktoken  # Tokenizer
import torch  # PyTorch deep learning framework
from previous_chapters import GPTModel  # GPT model defined in the previous chapters


# Configuration of the 124M-parameter GPT model
GPT_CONFIG_124M = {
    "vocab_size": 50257,  # Vocabulary size
    "context_length": 256,       # Shortened context length (orig: 1024)
    "emb_dim": 768,       # Embedding dimension
    "n_heads": 12,        # Number of attention heads
    "n_layers": 12,       # Number of layers
    "drop_rate": 0.1,     # Dropout rate
    "qkv_bias": False     # Query-key-value bias
}


# Fix the random seed for reproducibility
torch.manual_seed(123)

# Initialize the tokenizer and the model
tokenizer = tiktoken.get_encoding("gpt2")  # GPT-2 tokenizer
model = GPTModel(GPT_CONFIG_124M)  # Create the model instance
model.load_state_dict(torch.load("model.pth", weights_only=True))  # Load the saved model weights
model.eval();  # Switch the model to evaluation mode

In [6]:
# Import the text generation helpers from the gpt_generate module
from gpt_generate import generate, text_to_token_ids, token_ids_to_text
# Import the simple text generation function from the previous_chapters module
from previous_chapters import generate_text_simple

In [7]:
# Deterministic function that uses torch.argmax

# Start context
start_context = "Every effort moves you"

# Generate text with the simple text generation function
# model: the GPT model loaded above
# idx: the start context converted to token IDs
# max_new_tokens: generate 25 new tokens
# context_size: the context length defined in the model configuration
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"]
)

# Convert the generated token IDs back to text and print it
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun


In [8]:
# Deterministic behavior: no top-k sampling, no temperature scaling

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=None,  # Do not restrict the number of candidate tokens
    temperature=0.0  # With a temperature of 0, the most likely token is always selected
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun


- Note that re-executing the previous code cell will produce the exact same generated text:

In [9]:
# Deterministic behavior: no top-k sampling, no temperature scaling
# This cell generates exactly the same text because:
# - top_k=None: the number of candidate tokens is not restricted
# - temperature=0.0: the most likely token is always selected

token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=GPT_CONFIG_124M["context_length"],
    top_k=None,
    temperature=0.0
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you know," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed lun
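
The cells above demonstrate the first option only. The second option, `top_k=1`, can be illustrated on a single decoding step with the toy logits from exercise 5.1. The sketch below is a simplified pure-Python stand-in written under the assumption that decoding first applies top-k filtering and then samples only when the temperature is greater than 0; `sample_next_token` is a hypothetical helper, not the chapter's `generate` function.

```python
import math
import random

# Logits from exercise 5.1; the largest logit (6.75) has index 3 ("forward")
next_token_logits = [4.51, 0.89, -1.90, 6.75, 1.63, -1.62, -1.89, 6.28, 1.79]


def sample_next_token(logits, top_k=None, temperature=0.0, rng=random):
    # Hypothetical, simplified stand-in for one decoding step
    if top_k is not None:
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= threshold else float("-inf") for x in logits]
    if temperature > 0.0:
        scaled = [x / temperature for x in logits]
        max_scaled = max(scaled)
        # Unnormalized softmax weights; masked entries get weight 0.0
        weights = [math.exp(x - max_scaled) for x in scaled]
        return rng.choices(range(len(logits)), weights=weights, k=1)[0]
    # No temperature scaling: greedy choice of the largest logit
    return max(range(len(logits)), key=lambda i: logits[i])


rng = random.Random(123)
n_trials = 1_000
greedy = {sample_next_token(next_token_logits) for _ in range(n_trials)}
top1 = {
    sample_next_token(next_token_logits, top_k=1, temperature=1.5, rng=rng)
    for _ in range(n_trials)
}
top3 = {
    sample_next_token(next_token_logits, top_k=3, temperature=1.5, rng=rng)
    for _ in range(n_trials)
}

print("top_k=None, temperature=0.0:", greedy)
print("top_k=1, temperature=1.5:   ", top1)
print("top_k=3, temperature=1.5:   ", top3)
```

With `top_k=1`, every logit except the largest is masked, so the only token with nonzero probability is the greedy choice, regardless of the temperature. With `top_k=3`, several different tokens are drawn.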


# Exercise 5.4: Continued pretraining

- If we are still in the Python session where we first trained the model in chapter 5, to continue the pretraining for one more epoch, we just have to load the model and optimizer that we saved in the main chapter and call the `train_model_simple` function again

- It takes a couple more steps to make this reproducible in this new code environment

- First, we load the tokenizer, model, and optimizer:

In [10]:
# Import the tiktoken library for tokenization
import tiktoken
# Import the PyTorch library
import torch
# Import the GPTModel class from previous_chapters
from previous_chapters import GPTModel


# Configuration of the 124M-parameter GPT model
GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

# Use the GPU if available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize the GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Load the saved checkpoint
checkpoint = torch.load("model_and_optimizer.pth", weights_only=True)
# Initialize the GPT model
model = GPTModel(GPT_CONFIG_124M)
# Restore the model state
model.load_state_dict(checkpoint["model_state_dict"])
# Move the model to the selected device
model.to(device)

# Initialize the AdamW optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
# Restore the optimizer state
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
# Switch the model to training mode
model.train();
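
The optimizer state is restored along with the model weights because adaptive optimizers keep running statistics of past gradients. The toy sketch below illustrates this with a hand-written, single-weight Adam update on a made-up quadratic loss: resuming with the saved state reproduces uninterrupted training exactly, whereas resuming with a fresh state does not. It is a minimal illustration under simplifying assumptions (plain Adam without weight decay, one scalar weight); all names in it are hypothetical and it does not use PyTorch.

```python
import math


def adam_step(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for a single scalar weight; updates `state` in place
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def train(w, state, steps):
    for _ in range(steps):
        grad = 2 * (w - 3.0)  # gradient of the toy loss (w - 3)^2
        w = adam_step(w, grad, state)
    return w


def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0}


# Reference: 10 uninterrupted training steps
w_continuous = train(0.0, new_state(), steps=10)

# Train for 5 steps, then "checkpoint" the weight and the optimizer state
state = new_state()
w_half = train(0.0, state, steps=5)
checkpoint = {"w": w_half, "optimizer_state": dict(state)}

# Resume for 5 more steps with the saved state vs. with a fresh state
w_resumed = train(checkpoint["w"], dict(checkpoint["optimizer_state"]), steps=5)
w_fresh = train(checkpoint["w"], new_state(), steps=5)

print("Continuous:              ", w_continuous)
print("Resumed with saved state:", w_resumed)
print("Resumed with fresh state:", w_fresh)
```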

- Next, we initialize the data loader:

In [11]:
# Import the required libraries
import os
import urllib.request
from previous_chapters import create_dataloader_v1


# File path and download URL
file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

# Download the file if it does not exist yet
if not os.path.exists(file_path):
    # Download the text data from the URL
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    # Write the text data to the file
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    # Read the file directly if it already exists
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()


# Training/validation split ratio
train_ratio = 0.90
# Index at which the text is split
split_idx = int(train_ratio * len(text_data))
# Split the text into training and validation portions
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]


# Fix the random seed for reproducibility
torch.manual_seed(123)

# Create the training data loader
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,                                    # Batch size
    max_length=GPT_CONFIG_124M["context_length"],    # Maximum sequence length
    stride=GPT_CONFIG_124M["context_length"],        # Stride
    drop_last=True,                                  # Drop the last incomplete batch
    shuffle=True,                                    # Shuffle the data
    num_workers=0                                    # Number of data loading workers
)

# Create the validation data loader
val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,                                    # Batch size
    max_length=GPT_CONFIG_124M["context_length"],    # Maximum sequence length
    stride=GPT_CONFIG_124M["context_length"],        # Stride
    drop_last=False,                                 # Keep the last incomplete batch
    shuffle=False,                                   # Do not shuffle the data
    num_workers=0                                    # Number of data loading workers
)

- Lastly, we use the `train_model_simple` function to train the model:

In [12]:
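# Import the training function
from gpt_train import train_model_simple

# Number of training epochs
num_epochs = 1

# Train the model and collect the training losses, validation losses, and tokens seen
train_losses, val_losses, tokens_seen = train_model_simple(
    model,                                  # Model
    train_loader,                          # Training data loader
    val_loader,                            # Validation data loader
    optimizer,                             # Optimizer
    device,                                # Device (CPU/GPU)
    num_epochs=num_epochs,                 # Number of epochs
    eval_freq=5,                           # Evaluation frequency (in steps)
    eval_iter=5,                           # Number of batches per evaluation
    start_context="Every effort moves you", # Start context for the sample text
    tokenizer=tokenizer                    # Tokenizer
)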

Ep 1 (Step 000000): Train loss 0.271, Val loss 6.545
Ep 1 (Step 000005): Train loss 0.244, Val loss 6.614
Every effort moves you?"  "Yes--quite insensible to the irony. She wanted him vindicated--and by me!"  He laughed again, and threw back his head to look up at the sketch of the donkey. "There were days when I


# Exercise 5.5: Training and validation set losses of the pretrained model

- We can use the following code to calculate the training and validation set losses of the GPT model:

```python
train_loss = calc_loss_loader(train_loader, gpt, device)
val_loss = calc_loss_loader(val_loader, gpt, device)
```

- The resulting losses for the 124M-parameter model are as follows:

```
Training loss: 3.754748503367106
Validation loss: 3.559617757797241
```

- The main observation is that the training and validation set performances are in the same ballpark
- This can have multiple explanations:

1. The Verdict was not part of the pretraining dataset when OpenAI trained GPT-2. Hence, the model is not explicitly overfitting to the training set and performs similarly well on The Verdict's training and validation set portions. (The validation set loss is slightly lower than the training set loss, which is unusual in deep learning. However, it's likely due to random noise since the dataset is relatively small. In practice, if there is no overfitting, the training and validation set performances are expected to be roughly identical.)

2. The Verdict was part of GPT-2's training dataset. In this case, we can't tell whether the model is overfitting the training data because the validation set would have been used for training as well. To evaluate the degree of overfitting, we'd need a new dataset generated after OpenAI finished training GPT-2 to make sure that it couldn't have been part of the pretraining.

The code below is a reproducible standalone example for this new notebook.

In [13]:
# Import the required libraries
import tiktoken
import torch
from previous_chapters import GPTModel


# Configuration of the 124M-parameter GPT model
GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}


# Fix the random seed for reproducibility
torch.manual_seed(123)

# Initialize the GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

In [14]:
# Import the function that downloads and loads the GPT-2 weights
from gpt_download import download_and_load_gpt2

# Download and load the 124M-parameter GPT-2 model
# model_size: size of the model ("124M")
# models_dir: directory in which the model files are stored ("gpt2")
settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")

File already exists and is up-to-date: gpt2/124M/checkpoint
File already exists and is up-to-date: gpt2/124M/encoder.json
File already exists and is up-to-date: gpt2/124M/hparams.json
File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/124M/model.ckpt.index
File already exists and is up-to-date: gpt2/124M/model.ckpt.meta
File already exists and is up-to-date: gpt2/124M/vocab.bpe


In [15]:
# Define the model configurations in a dictionary for compactness
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# Copy the base configuration and update it with the model-specific settings
model_name = "gpt2-small (124M)"  # Example model name
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

gpt = GPTModel(NEW_CONFIG)
gpt.eval();

In [16]:
# Import the function that loads the weights into the model
from gpt_generate import load_weights_into_gpt


# Use the GPU if available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the pretrained weights into the GPT model
load_weights_into_gpt(gpt, params)
# Move the model to the selected device
gpt.to(device);

In [17]:
# Import the required modules
import os
import urllib.request
from previous_chapters import create_dataloader_v1


# File path and download URL
file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"

# Download and save the file if it does not exist yet; otherwise read it directly
if not os.path.exists(file_path):
    # Download the text data from the URL
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    # Write the text data to the file
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    # Read the text data directly if the file already exists
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()


# Training/validation split ratio
train_ratio = 0.90
# Index at which the text is split
split_idx = int(train_ratio * len(text_data))
# Split the data into training and validation portions
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]


# Fix the random seed for reproducibility
torch.manual_seed(123)

# Create the training data loader
train_loader = create_dataloader_v1(
    train_data,                                    # Training data
    batch_size=2,                                  # Batch size
    max_length=GPT_CONFIG_124M["context_length"],  # Maximum sequence length
    stride=GPT_CONFIG_124M["context_length"],      # Stride
    drop_last=True,                                # Drop the last incomplete batch
    shuffle=True,                                  # Shuffle the data
    num_workers=0                                  # Number of data loading workers
)

# Create the validation data loader
val_loader = create_dataloader_v1(
    val_data,                                      # Validation data
    batch_size=2,                                  # Batch size
    max_length=GPT_CONFIG_124M["context_length"],  # Maximum sequence length
    stride=GPT_CONFIG_124M["context_length"],      # Stride
    drop_last=False,                               # Keep the last incomplete batch
    shuffle=False,                                 # Do not shuffle the data
    num_workers=0                                  # Number of data loading workers
)

In [18]:
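# Import the function that computes the loss over a data loader
from gpt_train import calc_loss_loader

# Fix the random seed for reproducibility (the training data loader shuffles the data)
torch.manual_seed(123)

# Compute the training set loss
train_loss = calc_loss_loader(train_loader, gpt, device)
# Compute the validation set loss
val_loss = calc_loss_loader(val_loader, gpt, device)

# Print the training and validation losses
print("Training loss:", train_loss)
print("Validation loss:", val_loss)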

Training loss: 3.7547486888037787
Validation loss: 3.5596182346343994


We can also repeat this for the largest GPT-2 model, but don't forget to update the context length:

In [19]:
# Download and load the 1558M-parameter GPT-2 model
settings, params = download_and_load_gpt2(model_size="1558M", models_dir="gpt2")

# Select the GPT-2 XL model
model_name = "gpt2-xl (1558M)"
# Copy the base configuration
NEW_CONFIG = GPT_CONFIG_124M.copy()
# Update it with the model-specific settings
NEW_CONFIG.update(model_configs[model_name])
# Update the context length and the QKV bias setting
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

# Initialize the GPT model with the new configuration
gpt = GPTModel(NEW_CONFIG)
# Switch the model to evaluation mode
gpt.eval()

# Load the pretrained weights into the model
load_weights_into_gpt(gpt, params)
# Move the model to the selected device
gpt.to(device)

# Fix the random seed for reproducibility
torch.manual_seed(123)
# Compute the training set loss
train_loss = calc_loss_loader(train_loader, gpt, device)
# Compute the validation set loss
val_loss = calc_loss_loader(val_loader, gpt, device)

# Print the training and validation losses
print("Training loss:", train_loss)
print("Validation loss:", val_loss)

checkpoint: 100%|███████████████████████████| 77.0/77.0 [00:00<00:00, 43.5kiB/s]
encoder.json: 100%|███████████████████████| 1.04M/1.04M [00:00<00:00, 2.75MiB/s]
hparams.json: 100%|█████████████████████████| 91.0/91.0 [00:00<00:00, 60.2kiB/s]
model.ckpt.data-00000-of-00001: 100%|█████| 6.23G/6.23G [06:02<00:00, 17.2MiB/s]
model.ckpt.index: 100%|████████████████████| 20.7k/20.7k [00:00<00:00, 171kiB/s]
model.ckpt.meta: 100%|████████████████████| 1.84M/1.84M [00:00<00:00, 4.27MiB/s]
vocab.bpe: 100%|████████████████████████████| 456k/456k [00:00<00:00, 1.73MiB/s]


Training loss: 3.3046312861972384
Validation loss: 3.1195147037506104
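
To put the four losses side by side, the sketch below computes the gap between validation and training loss for both models and converts each loss to a perplexity via `exp(loss)`. It assumes that the values reported above are mean per-token cross-entropy losses in nats; the dictionary layout and variable names are illustrative only.

```python
import math

# Losses reported above as (training loss, validation loss)
losses = {
    "gpt2-small (124M)": (3.7547486888037787, 3.5596182346343994),
    "gpt2-xl (1558M)": (3.3046312861972384, 3.1195147037506104),
}

gaps = {}
perplexities = {}
for name, (train_loss, val_loss) in losses.items():
    # Negative gap: the validation loss is lower than the training loss
    gaps[name] = val_loss - train_loss
    # Perplexity under the assumption of a per-token cross-entropy loss in nats
    perplexities[name] = (math.exp(train_loss), math.exp(val_loss))
    print(
        f"{name}: val - train = {gaps[name]:+.3f}, "
        f"train perplexity = {perplexities[name][0]:.1f}, "
        f"val perplexity = {perplexities[name][1]:.1f}"
    )
```

For both model sizes the validation loss is slightly below the training loss by a similar margin, and the larger model reaches lower losses on both portions of the text.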


# Exercise 5.6: Trying larger models

- In the main chapter, we experimented with the smallest GPT-2 model, which has only 124M parameters
- The reason was to keep the resource requirements as low as possible
- However, you can easily experiment with larger models with minimal code changes
- For example, to load the 1558M model instead of the 124M model in chapter 5, the only 2 lines of code that we have to change are

```python
settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")
model_name = "gpt2-small (124M)"
```

- The updated code becomes

```python
settings, params = download_and_load_gpt2(model_size="1558M", models_dir="gpt2")
model_name = "gpt2-xl (1558M)"
```
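
The two lines can also be tied together so that only the model name has to change. The sketch below derives the download size string from the model name and builds the matching configuration without touching the base dictionary. `build_config` and `model_size_from_name` are hypothetical helpers introduced for illustration; the sketch only manipulates dictionaries and strings and does not download or instantiate a model.

```python
# Base configuration and per-model settings as used in this notebook
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}


def build_config(model_name):
    # Hypothetical helper: copy the base config, then apply the overrides
    config = GPT_CONFIG_124M.copy()
    config.update(model_configs[model_name])
    config.update({"context_length": 1024, "qkv_bias": True})
    return config


def model_size_from_name(model_name):
    # Hypothetical helper: "gpt2-xl (1558M)" -> "1558M"
    return model_name.split(" ")[-1].strip("()")


for model_name in model_configs:
    config = build_config(model_name)
    print(
        model_size_from_name(model_name),
        {key: config[key] for key in ("emb_dim", "n_layers", "n_heads")},
    )
```

The string returned by `model_size_from_name(model_name)` could then be passed as the `model_size` argument of `download_and_load_gpt2`, and the result of `build_config(model_name)` to `GPTModel`.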

In [20]:
# Import the required libraries
import tiktoken
import torch
from previous_chapters import GPTModel


# Configuration of the small GPT-2 model (124M parameters)
GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}


# Initialize the GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

In [21]:
# Import the functions for downloading GPT-2 and loading its weights
from gpt_download import download_and_load_gpt2
from gpt_generate import load_weights_into_gpt


# Configuration settings of the different GPT-2 model sizes
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},    # Small model
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},  # Medium model
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},   # Large model
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},     # XL model
}

# Select the GPT-2 XL model
model_name = "gpt2-xl (1558M)"
# Copy the base configuration and update it with the selected model's settings
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
# Update the context length and the QKV bias setting
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

# Initialize the GPT model with the updated configuration
gpt = GPTModel(NEW_CONFIG)
# Switch the model to evaluation mode
gpt.eval()

# Download and load the weights of the GPT-2 XL model
settings, params = download_and_load_gpt2(model_size="1558M", models_dir="gpt2")
# Load the pretrained weights into the model
load_weights_into_gpt(gpt, params)

File already exists and is up-to-date: gpt2/1558M/checkpoint
File already exists and is up-to-date: gpt2/1558M/encoder.json
File already exists and is up-to-date: gpt2/1558M/hparams.json
File already exists and is up-to-date: gpt2/1558M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/1558M/model.ckpt.index
File already exists and is up-to-date: gpt2/1558M/model.ckpt.meta
File already exists and is up-to-date: gpt2/1558M/vocab.bpe


In [22]:
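# Import the text generation helpers from the gpt_generate module:
# generate: main function for generating text
# text_to_token_ids: converts text into a sequence of token IDs
# token_ids_to_text: converts a sequence of token IDs back into text
from gpt_generate import generate, text_to_token_ids, token_ids_to_text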

In [23]:
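# Fix the random seed for reproducibility
torch.manual_seed(123)

# Generate text with the GPT model
# model: the GPT model loaded above
# idx: the input prompt converted to a sequence of token IDs
# max_new_tokens: generate 25 new tokens
# context_size: the context length defined in the model configuration
# top_k: only consider the 50 most likely tokens
# temperature: a temperature of 1.5, which increases the sampling randomness
token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)

# Convert the generated token IDs back into readable text and print it
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))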

Output text:
 Every effort moves you toward finding an ideal life. You don't have to accept your current one at once, because if you do you'll never
