<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Chapter 6: Finetuning for Text Classification

In [None]:
# Import the version lookup helper
from importlib.metadata import version

# Packages whose versions we want to check
pkgs = ["matplotlib",
        "numpy",
        "tiktoken",
        "torch",
        "tensorflow", # For OpenAI's pretrained weights
        "pandas"      # Dataset loading
       ]

# Print the installed version of each package
for p in pkgs:
    print(f"{p} version: {version(p)}")

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/chapter-overview.webp" width=500px>

## 6.1 Different categories of finetuning

- No code in this section

- The most common ways to finetune language models are instruction-finetuning and classification finetuning
- Instruction-finetuning, depicted below, is the topic of the next chapter

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/instructions.webp" width=500px>

- Classification finetuning, the topic of this chapter, is a procedure you may already be familiar with if you have a background in machine learning -- it's similar to training a convolutional network to classify handwritten digits, for example
- In classification finetuning, we have a specific number of class labels (for example, "spam" and "not spam") that the model can output
- A classification-finetuned model can only predict classes it has seen during training (for example, "spam" or "not spam"), whereas an instruction-finetuned model can usually perform many tasks
- We can think of a classification-finetuned model as a very specialized model; in practice, it is much easier to create a specialized model than a generalist model that performs well on many different tasks

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/spam-non-spam.webp" width=500px>

## 6.2 Preparing the dataset

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/overview-1.webp" width=500px>

- This section prepares the dataset we use for classification finetuning
- We use a dataset consisting of spam and non-spam text messages to finetune the LLM to classify them
- First, we download and unzip the dataset

In [None]:
# Import the required libraries
import urllib.request
import zipfile
import os
from pathlib import Path

# Dataset URL and local file paths
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"  # Where the zip file is saved
extracted_path = "sms_spam_collection"  # Extraction directory
data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"  # Final data file path

# Download and unzip the dataset
def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
    # Skip everything if the data file already exists
    if data_file_path.exists():
        print(f"{data_file_path} already exists. Skipping download and extraction.")
        return

    # Download the file
    with urllib.request.urlopen(url) as response:
        with open(zip_path, "wb") as out_file:
            out_file.write(response.read())

    # Unzip the file
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extracted_path)

    # Rename the extracted file so that it carries a .tsv extension
    original_file_path = Path(extracted_path) / "SMSSpamCollection"
    os.rename(original_file_path, data_file_path)
    print(f"File downloaded and saved as {data_file_path}")

# Run the download and extraction
download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)

- The dataset is saved as a tab-separated text file, which we can load into a pandas DataFrame

In [None]:
# Import pandas
import pandas as pd

# Read the data file with pandas
# data_file_path: path to the data file
# sep="\t": columns are tab-separated
# header=None: the file has no header row
# names=["Label", "Text"]: column names to assign
df = pd.read_csv(data_file_path, sep="\t", header=None, names=["Label", "Text"])

# Display the DataFrame
df

- When we check the class distribution, we see that the data contains "ham" (i.e., "not spam") much more frequently than "spam"

In [None]:
# Count and print the number of occurrences of each label
print(df["Label"].value_counts())

- For simplicity, and because we prefer a small dataset for educational purposes anyway (it will make it possible to finetune the LLM faster), we subsample (undersample) the dataset so that it contains 747 instances from each class
- (Besides undersampling, there are several other ways to deal with class imbalances, but they are out of the scope of a book on LLMs; you can find examples and more information in the [`imbalanced-learn` user guide](https://imbalanced-learn.org/stable/user_guide.html))

In [None]:
def create_balanced_dataset(df):
    """
    Create a balanced dataset by undersampling the majority class.

    Args:
        df: pandas DataFrame with "Label" and "Text" columns (the original dataset)

    Returns:
        balanced_df: pandas DataFrame with the same number of "spam" and "ham" rows
    """

    # Count the instances of "spam"
    num_spam = df[df["Label"] == "spam"].shape[0]

    # Randomly sample "ham" instances to match the number of "spam" instances
    # random_state=123 makes the sampling reproducible
    ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)

    # Combine the "ham" subset with all "spam" rows
    balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]])

    return balanced_df

# Create the balanced dataset
balanced_df = create_balanced_dataset(df)
# Print the number of examples per class
print(balanced_df["Label"].value_counts())

- Next, we change the string class labels "ham" and "spam" into integer class labels 0 and 1:

In [6]:
# Use map to convert the string labels into integer labels:
# "ham" becomes 0 and "spam" becomes 1
balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})

- Let's now define a function that randomly divides the dataset into training, validation, and test subsets

In [7]:
def random_split(df, train_frac, validation_frac):
    """
    Randomly split a dataset into training, validation, and test subsets.

    Args:
        df: pandas DataFrame to split
        train_frac: float, fraction of rows used for training
        validation_frac: float, fraction of rows used for validation

    Returns:
        train_df: pandas DataFrame, training set
        validation_df: pandas DataFrame, validation set
        test_df: pandas DataFrame, test set (the remaining rows)
    """
    # Shuffle the entire DataFrame
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)

    # Calculate split indices
    train_end = int(len(df) * train_frac)
    validation_end = train_end + int(len(df) * validation_frac)

    # Split the DataFrame
    train_df = df[:train_end]
    validation_df = df[train_end:validation_end]
    test_df = df[validation_end:]  # Whatever remains becomes the test set

    return train_df, validation_df, test_df

# Use 70% of the data for training and 10% for validation; the remaining 20% is the test set
train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)

# Save the subsets as CSV files
train_df.to_csv("train.csv", index=None)
validation_df.to_csv("validation.csv", index=None)
test_df.to_csv("test.csv", index=None)
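
- As a quick sanity check on the split arithmetic, the sketch below recomputes the subset sizes that `random_split` would produce. It assumes the balanced dataset has 2 × 747 = 1,494 rows (747 per class, as stated above); the variable names are illustrative and not part of the chapter's code

```python
# Recompute the split boundaries used by random_split
# Assumption: 747 "ham" + 747 "spam" rows after undersampling
num_rows = 2 * 747
train_frac, validation_frac = 0.7, 0.1

train_end = int(num_rows * train_frac)
validation_end = train_end + int(num_rows * validation_frac)

split_sizes = {
    "train": train_end,
    "validation": validation_end - train_end,
    "test": num_rows - validation_end,  # the remainder goes to the test set
}
print(split_sizes)
```

- Because `int()` truncates, the test set absorbs the rounding remainder, so it ends up marginally larger than an exact 20% share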

## 6.3 Creating data loaders

- Note that the text messages have different lengths; if we want to combine multiple training examples in a batch, we have to either
  1. truncate all messages to the length of the shortest message in the dataset or batch
  2. pad all messages to the length of the longest message in the dataset or batch

- We choose option 2 and pad all messages to the longest message in the dataset

- For that, we use `<|endoftext|>` as a padding token, as discussed in chapter 2
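
- To illustrate the padding idea in isolation, here is a minimal sketch in plain Python. The helper `pad_or_truncate` and the token IDs in `batch` are made up for illustration; only the padding ID 50256 (the ID of `<|endoftext|>` in the GPT-2 vocabulary) comes from this chapter

```python
PAD_TOKEN_ID = 50256  # ID of <|endoftext|> in the GPT-2 vocabulary

def pad_or_truncate(token_ids, max_length, pad_token_id=PAD_TOKEN_ID):
    """Return a list of exactly max_length token IDs (hypothetical helper)."""
    truncated = token_ids[:max_length]
    return truncated + [pad_token_id] * (max_length - len(truncated))

# Made-up token IDs, for illustration only
batch = [[11, 22, 33], [44], [55, 66, 77, 88, 99]]
longest = max(len(seq) for seq in batch)
padded = [pad_or_truncate(seq, longest) for seq in batch]
print(padded)
```

- After padding, every sequence has the same length, so the batch can be stacked into a single tensor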

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/pad-input-sequences.webp?123" width=500px>

In [None]:
# Import tiktoken for tokenization
import tiktoken

# Get the GPT-2 tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Print the token ID of <|endoftext|>
# allowed_special permits encoding the special token
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

- The `SpamDataset` class below identifies the longest sequence in the training dataset and adds the padding token to the others to match that sequence length

In [9]:
# Import the required PyTorch modules
import torch
from torch.utils.data import Dataset


class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):
        # Read the CSV file
        self.data = pd.read_csv(csv_file)

        # Pre-tokenize the texts
        self.encoded_texts = [
            tokenizer.encode(text) for text in self.data["Text"]
        ]

        if max_length is None:
            # If no maximum length is given, use the longest sequence in the dataset
            self.max_length = self._longest_encoded_length()
        else:
            # Use the specified maximum length
            self.max_length = max_length
            # Truncate sequences that are longer than max_length
            self.encoded_texts = [
                encoded_text[:self.max_length]
                for encoded_text in self.encoded_texts
            ]

        # Pad all sequences to max_length
        self.encoded_texts = [
            encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))
            for encoded_text in self.encoded_texts
        ]

    def __getitem__(self, index):
        # Look up the encoded text and label at the given index
        encoded = self.encoded_texts[index]
        label = self.data.iloc[index]["Label"]
        # Convert both to tensors
        return (
            torch.tensor(encoded, dtype=torch.long),
            torch.tensor(label, dtype=torch.long)
        )

    def __len__(self):
        # Number of examples in the dataset
        return len(self.data)

    def _longest_encoded_length(self):
        # Find the length of the longest encoded sequence
        max_length = 0
        for encoded_text in self.encoded_texts:
            encoded_length = len(encoded_text)
            if encoded_length > max_length:
                max_length = encoded_length
        return max_length

In [None]:
# Create the training dataset
# max_length=None means the longest sequence in the dataset sets the length
train_dataset = SpamDataset(
    csv_file="train.csv",
    max_length=None,
    tokenizer=tokenizer
)

# Print the length of the longest training sequence
print(train_dataset.max_length)

- We also pad the validation and test sets to the length of the longest training sequence
- Note that validation and test set samples that are longer than the longest training example are truncated via `encoded_text[:self.max_length]` in the `SpamDataset` code
- This behavior is entirely optional, and it would also work well if we set `max_length=None` in both the validation and test set cases

In [11]:
# Create the validation dataset
# using the same maximum length as the training set
val_dataset = SpamDataset(
    csv_file="validation.csv",
    max_length=train_dataset.max_length,
    tokenizer=tokenizer
)

# Create the test dataset
# again using the same maximum length as the training set
test_dataset = SpamDataset(
    csv_file="test.csv",
    max_length=train_dataset.max_length,
    tokenizer=tokenizer
)

- Next, we use the datasets to instantiate the data loaders, which is similar to creating the data loaders in previous chapters

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/batch.webp" width=500px>

In [12]:
# Import the DataLoader class
from torch.utils.data import DataLoader

# Data loading settings
num_workers = 0  # Number of worker processes for data loading
batch_size = 8   # Number of examples per batch

# Set the random seed for reproducibility
torch.manual_seed(123)

# Create the training data loader
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,              # Shuffle the training data
    num_workers=num_workers,
    drop_last=True,            # Drop the last incomplete batch
)

# Create the validation data loader
val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False,           # Keep the last incomplete batch
)

# Create the test data loader
test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    num_workers=num_workers,
    drop_last=False,           # Keep the last incomplete batch
)

- As a verification step, we iterate through the data loaders and ensure that the batches contain 8 training examples each, where each training example consists of 120 tokens

In [None]:
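print("Train loader:")

# Iterate through the training loader; after the loop,
# input_batch and target_batch hold the last batch
for input_batch, target_batch in train_loader:
    pass

# Print the dimensions of the input batch
print("Input batch dimensions:", input_batch.shape)
# Print the dimensions of the label batch
print("Label batch dimensions", target_batch.shape)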

- Lastly, let's print the total number of batches in each dataset

In [None]:
# Print the total number of batches in each data loader
print(f"{len(train_loader)} training batches")
print(f"{len(val_loader)} validation batches")
print(f"{len(test_loader)} test batches")
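
- The batch counts follow directly from the dataset sizes and the `drop_last` setting. The sketch below reproduces that arithmetic in plain Python; the helper `num_batches` is hypothetical, and the subset sizes (1,045 / 149 / 300) are assumed from a 70/10/20 split of 1,494 rows

```python
import math

def num_batches(num_examples, batch_size, drop_last):
    """Number of batches a data loader yields for a dataset of the given size."""
    if drop_last:
        # The incomplete final batch is discarded
        return num_examples // batch_size
    # The incomplete final batch is kept
    return math.ceil(num_examples / batch_size)

batch_size = 8
# Assumed subset sizes (70/10/20 split of 1,494 rows)
train_batches = num_batches(1045, batch_size, drop_last=True)
val_batches = num_batches(149, batch_size, drop_last=False)
test_batches = num_batches(300, batch_size, drop_last=False)
print(train_batches, val_batches, test_batches)
```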

## 6.4 Initializing a model with pretrained weights

- In this section, we initialize the pretrained model we worked with in the previous chapter

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/overview-2.webp" width=500px>

In [15]:

# Choose which GPT-2 model size to use
CHOOSE_MODEL = "gpt2-small (124M)"
# Input prompt text
INPUT_PROMPT = "Every effort moves"

# Settings shared by all model sizes
BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

# Size-specific settings for the different GPT-2 models
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# Add the settings of the chosen model to the base configuration
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

# Make sure the longest training sequence fits into the model's context length
assert train_dataset.max_length <= BASE_CONFIG["context_length"], (
    f"Dataset length {train_dataset.max_length} exceeds model's context "
    f"length {BASE_CONFIG['context_length']}. Reinitialize data sets with "
    f"`max_length={BASE_CONFIG['context_length']}`"
)

In [None]:
# Import the required modules
from gpt_download import download_and_load_gpt2
from previous_chapters import GPTModel, load_weights_into_gpt

# Extract the model size (e.g., "124M") from CHOOSE_MODEL
model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
# Download and load the pretrained GPT-2 weights
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")

# Initialize the GPT model with BASE_CONFIG
model = GPTModel(BASE_CONFIG)
# Load the pretrained weights into the model
load_weights_into_gpt(model, params)
# Put the model into evaluation mode
model.eval();

- To ensure that the model was loaded correctly, let's double-check that it generates coherent text

In [None]:
# Import the required functions
from previous_chapters import (
    generate_text_simple,    # Simple text generation function
    text_to_token_ids,       # Converts text to token IDs
    token_ids_to_text        # Converts token IDs back to text
)


# Input prompt text
text_1 = "Every effort moves you"

# Generate text with the model
token_ids = generate_text_simple(
    model=model,                                # Pretrained GPT-2 model
    idx=text_to_token_ids(text_1, tokenizer),   # Prompt converted to token IDs
    max_new_tokens=15,                          # Generate 15 new tokens
    context_size=BASE_CONFIG["context_length"]  # Use the model's context length
)

# Convert the generated token IDs back to text and print it
print(token_ids_to_text(token_ids, tokenizer))

- Before we finetune the model as a classifier, let's see if the model can perhaps already classify spam messages via prompting

In [None]:
# Prompt asking the model whether a message is spam
text_2 = (
    "Is the following text 'spam'? Answer with 'yes' or 'no':"  # Instruction
    " 'You are a winner you have been specially"                # Message to classify
    " selected to receive $1000 cash or a $2000 award.'"        # (a typical spam message)
)

# Generate the model's answer
token_ids = generate_text_simple(
    model=model,                                # Pretrained GPT-2 model
    idx=text_to_token_ids(text_2, tokenizer),   # Prompt converted to token IDs
    max_new_tokens=23,                          # Generate 23 new tokens
    context_size=BASE_CONFIG["context_length"]  # Use the model's context length
)

# Convert the generated token IDs back to text and print it
print(token_ids_to_text(token_ids, tokenizer))

- As we can see, the model is not very good at following instructions
- This is expected, since it has only been pretrained and not instruction-finetuned (instruction finetuning will be covered in the next chapter)

## 6.5 Adding a classification head

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/lm-head.webp" width=500px>

- In this section, we modify the pretrained LLM to make it ready for classification finetuning
- Let's take a look at the model architecture first

In [None]:
# Print the model architecture
print(model)

- Above, we can see the architecture we implemented in chapter 4 neatly laid out
- The goal is to replace and finetune the output layer
- To achieve this, we first freeze the model, meaning that we make all layers non-trainable

In [20]:
# Freeze all model parameters so that they are not trainable
for param in model.parameters():
    param.requires_grad = False

- Then, we replace the output layer (`model.out_head`), which originally maps the layer inputs to 50,257 dimensions (the size of the vocabulary)
- Since we finetune the model for binary classification (predicting 2 classes, "spam" and "not spam"), we can replace the output layer as shown below, which will be trainable by default
- Note that we use `BASE_CONFIG["emb_dim"]` (which is equal to 768 in the `"gpt2-small (124M)"` model) to keep the code below more general

In [21]:
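# Set the random seed for reproducibility
torch.manual_seed(123)

# Number of output classes ("spam" / "not spam")
num_classes = 2
# Replace the output layer so that it maps the embedding dimension to the number of classes
model.out_head = torch.nn.Linear(in_features=BASE_CONFIG["emb_dim"], out_features=num_classes)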

- Technically, it's sufficient to only train the output layer
- However, as I found in [Finetuning Large Language Models](https://magazine.sebastianraschka.com/p/finetuning-large-language-models), experiments show that finetuning additional layers can noticeably improve the performance
- So, we are also making the last transformer block and the final `LayerNorm` module connecting the last transformer block to the output layer trainable

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/trainable.webp" width=500px>

In [22]:
# Make the last transformer block trainable
for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True

# Make the final LayerNorm module trainable
for param in model.final_norm.parameters():
    param.requires_grad = True
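
- One way to check which parameters remain trainable after freezing and replacing a layer is to count the parameters with `requires_grad=True`. The sketch below shows the idea on a small stand-in model; `toy_model` is a hypothetical two-layer network, not the GPT model from this chapter

```python
import torch

torch.manual_seed(123)

# Toy stand-in for a pretrained model (not the GPT model from this chapter)
toy_model = torch.nn.Sequential(
    torch.nn.Linear(4, 4),
    torch.nn.Linear(4, 3),  # plays the role of the original output layer
)

# Freeze every existing parameter
for param in toy_model.parameters():
    param.requires_grad = False

# Replace the output layer; newly created modules are trainable by default
toy_model[1] = torch.nn.Linear(4, 2)

trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy_model.parameters())
print(f"Trainable parameters: {trainable} of {total}")
```

- The same two `sum(...)` expressions should also work on the chapter's `model`, where they would report how many parameters the new output layer, the last transformer block, and the final `LayerNorm` contribute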

- We can still use this model in much the same way as in previous chapters
- For example, let's feed it some text input

In [None]:
# Encode the input text
inputs = tokenizer.encode("Do you have time")
# Convert the token IDs to a tensor and add a batch dimension
inputs = torch.tensor(inputs).unsqueeze(0)
print("Inputs:", inputs)
print("Inputs dimensions:", inputs.shape) # shape: (batch_size, num_tokens)

- What's different compared to previous chapters is that it now has two output dimensions instead of 50,257

In [None]:
# Use torch.no_grad() to avoid tracking gradients
with torch.no_grad():
    # Pass the inputs through the model
    outputs = model(inputs)

print("Outputs:\n", outputs)
print("Outputs dimensions:", outputs.shape) # shape: (batch_size, num_tokens, num_classes)

- As discussed in previous chapters, for each input token, there's one output vector
- Since we fed the model a text sample with 4 input tokens, the output above consists of four 2-dimensional output vectors

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/input-and-output.webp" width=500px>

- In chapter 3, we discussed the attention mechanism, which connects each input token to each other input token

- In chapter 3, we also introduced the causal attention mask that is used in GPT-like models; this causal mask lets a current token only attend to the current and previous token positions

- Based on this causal attention mechanism, the 4th (last) token contains the most information among all tokens because it's the only token that includes information about all other tokens

- Hence, we are particularly interested in this last token, which we will finetune for the spam classification task

In [None]:
# Print the output vector of the last token
print("Last output token:", outputs[:, -1, :])

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/attention-mask.webp" width=200px>
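
- The reasoning about the last token can be made concrete with a small sketch that builds a causal mask for 4 tokens in plain Python. The names `mask` and `visible` are illustrative; the mask only marks which positions a token may attend to and does not compute attention weights

```python
num_tokens = 4  # e.g., "Do you have time" is encoded as 4 tokens

# mask[i][j] == 1 means that token i may attend to token j
mask = [[1 if j <= i else 0 for j in range(num_tokens)] for i in range(num_tokens)]
for row in mask:
    print(row)

# Number of positions each token can see
visible = [sum(row) for row in mask]
print(visible)
```

- Only the last row of the mask is all ones, which is why the output at the last position is the natural choice for classification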

## 6.6 Calculating the classification loss and accuracy

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/overview-3.webp?1" width=500px>

- Before explaining the loss calculation, let's have a brief look at how the model outputs are turned into class labels

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/class-argmax.webp" width=600px>

In [None]:
# Print the output vector of the last token
print("Last output token:", outputs[:, -1, :])

- Similar to chapter 5, we convert the outputs (logits) into probability scores via the `softmax` function and then obtain the index position of the largest probability value via the `argmax` function

In [None]:
# Convert the last-token logits into probabilities via softmax
probas = torch.softmax(outputs[:, -1, :], dim=-1)
# The index of the largest probability is the predicted class label
label = torch.argmax(probas)
print("Class label:", label.item())

- Note that the softmax function is optional here, as explained in chapter 5, because the largest outputs correspond to the largest probability scores

In [None]:
# Get the logits of the last output token
logits = outputs[:, -1, :]
# The index of the largest logit is the predicted class label
label = torch.argmax(logits)
print("Class label:", label.item())
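
- The reason softmax can be skipped is that it is strictly increasing in each logit, so it never changes which entry is largest. The following plain-Python sketch checks this on made-up logits; the `softmax` and `argmax` helpers are written here for illustration and are not the PyTorch functions

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    max_value = max(values)
    exps = [math.exp(v - max_value) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [-3.5, 3.9]  # made-up values, not actual model outputs
probas = softmax(logits)
print(probas, argmax(logits), argmax(probas))
```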

- We can apply this concept to calculate the so-called classification accuracy, which computes the percentage of correct predictions in a given dataset
- To calculate the classification accuracy, we can apply the preceding `argmax`-based prediction code to all examples in a dataset and calculate the fraction of correct predictions as follows:

In [29]:
def calc_accuracy_loader(data_loader, model, device, num_batches=None):
    # Put the model into evaluation mode
    model.eval()
    # Running counts of correct predictions and examples seen
    correct_predictions, num_examples = 0, 0

    # If no number of batches is given, use the whole data loader
    if num_batches is None:
        num_batches = len(data_loader)
    else:
        # Otherwise, never use more batches than the data loader contains
        num_batches = min(num_batches, len(data_loader))
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            # Move the inputs and targets to the target device
            input_batch, target_batch = input_batch.to(device), target_batch.to(device)

            # Forward pass without gradient tracking
            with torch.no_grad():
                logits = model(input_batch)[:, -1, :]  # Logits of the last output token
            # Predicted class labels
            predicted_labels = torch.argmax(logits, dim=-1)

            # Update the running counts
            num_examples += predicted_labels.shape[0]
            correct_predictions += (predicted_labels == target_batch).sum().item()
        else:
            break
    # Return the fraction of correct predictions
    return correct_predictions / num_examples
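
- Stripped of the data loader and device handling, the accuracy computation reduces to comparing `argmax` predictions with the targets. A minimal sketch on made-up numbers (the logits and targets below are invented for illustration):

```python
# Made-up last-token logits for a batch of 4 messages: [not spam, spam]
batch_logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.5], [0.9, 0.4]]
targets = [0, 1, 0, 0]

# argmax over each row gives the predicted class label
predicted = [row.index(max(row)) for row in batch_logits]
correct = sum(int(p == t) for p, t in zip(predicted, targets))
accuracy = correct / len(targets)
print(predicted, accuracy)
```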

- Let's apply the function to calculate the classification accuracies for the different datasets:

In [None]:
# Use the GPU if a CUDA device is available; otherwise use the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Note:
# Uncommenting the following lines allows the code to run on Apple Silicon chips, if applicable,
# which was roughly 2x faster than the Apple CPU when measured on an M3 MacBook Air.
# With PyTorch 2.4, the results obtained via CPU and MPS were identical.
# In earlier versions of PyTorch, however, MPS may produce different results.

#if torch.cuda.is_available():
#    device = torch.device("cuda")
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
#else:
#    device = torch.device("cpu")
#print(f"Running on {device} device.")

# Move the model to the device (no assignment model = model.to(device) is necessary for nn.Module classes)
model.to(device)

# Set the random seed so that the shuffling in the training data loader is reproducible
torch.manual_seed(123)

# Compute the accuracy on the training, validation, and test sets (first 10 batches each)
train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)
val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)
test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)

# Print the accuracies
print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")

- As we can see, the prediction accuracies are not very good, since we haven't finetuned the model yet

- Before we can start finetuning (training), we first have to define the loss function we want to optimize during training

- The goal is to maximize the spam classification accuracy of the model; however, classification accuracy is not a differentiable function

- Hence, instead, we minimize the cross-entropy loss as a proxy for maximizing the classification accuracy (you can learn more about this topic in lecture 8 of my freely available [Introduction to Deep Learning](https://sebastianraschka.com/blog/2021/dl-course.html#l08-multinomial-logistic-regression--softmax-regression) class)

- The `calc_loss_batch` function is the same here as in chapter 5, except that we are only interested in optimizing the last token `model(input_batch)[:, -1, :]` instead of all tokens `model(input_batch)`

In [31]:
def calc_loss_batch(input_batch, target_batch, model, device):
    # Move the inputs and targets to the target device
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)

    # Logits of the last output token
    # shape: [batch_size, num_classes]
    logits = model(input_batch)[:, -1, :]

    # Cross-entropy loss
    # logits shape: [batch_size, num_classes]
    # target_batch shape: [batch_size]
    loss = torch.nn.functional.cross_entropy(logits, target_batch)

    return loss
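
- To make the cross-entropy computation more tangible, the sketch below evaluates it by hand for a single example with two class logits. The `cross_entropy` helper and the logit values are illustrative assumptions; `torch.nn.functional.cross_entropy` additionally averages this per-example value over the batch by default

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    max_logit = max(logits)
    # log-sum-exp, shifted by the maximum logit for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]

# Made-up logits in the order [not spam, spam]
confident_right = cross_entropy([-2.0, 3.0], target=1)
confident_wrong = cross_entropy([-2.0, 3.0], target=0)
undecided = cross_entropy([0.0, 0.0], target=1)
print(confident_right, confident_wrong, undecided)
```

- Unlike accuracy, this value changes smoothly with the logits: a confident correct prediction yields a loss close to 0, an undecided one yields log(2), and a confident wrong prediction is penalized heavily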

- The `calc_loss_loader` function is exactly the same as in chapter 5

In [32]:
# Same as in chapter 5
def calc_loss_loader(data_loader, model, device, num_batches=None):
    # Running sum of the batch losses
    total_loss = 0.

    # Return NaN if the data loader is empty
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        # If no number of batches is given, use all batches in the data loader
        num_batches = len(data_loader)
    else:
        # Reduce the number of batches to match the total number of batches
        # in the data loader if num_batches exceeds it
        num_batches = min(num_batches, len(data_loader))

    # Iterate over the batches in the data loader
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i < num_batches:
            # Compute the loss for the current batch
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            total_loss += loss.item()
        else:
            break

    # Return the average loss per batch
    return total_loss / num_batches

- Using the `calc_loss_loader` function, we compute the initial training, validation, and test set losses before we start training

In [None]:
with torch.no_grad(): # Disable gradient tracking for efficiency because we are not training yet
    # Compute the losses on the first 5 batches of each set
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
    test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)

# Print the losses
print(f"Training loss: {train_loss:.3f}")
print(f"Validation loss: {val_loss:.3f}")
print(f"Test loss: {test_loss:.3f}")

- In the next section, we train the model to improve the loss values and consequently the classification accuracy

## 6.7 Finetuning the model on supervised data

- In this section, we define and use the training function to improve the classification accuracy of the model
- The `train_classifier_simple` function below is practically the same as the `train_model_simple` function we used for pretraining the model in chapter 5
- The only two differences are that we now
  1. track the number of training examples seen (`examples_seen`) instead of the number of tokens seen
  2. calculate the accuracy after each epoch instead of printing a sample text after each epoch

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/training-loop.webp?1" width=500px>

In [34]:
# Overall the same as `train_model_simple` in chapter 5
def train_classifier_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                            eval_freq, eval_iter):
    # Initialize lists to track losses and accuracies
    train_losses, val_losses, train_accs, val_accs = [], [], [], []
    examples_seen, global_step = 0, -1  # Examples seen so far and global step counter

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set the model to training mode

        # Iterate over the batches in the training data loader
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()  # Reset loss gradients from the previous batch
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()  # Calculate loss gradients
            optimizer.step()  # Update model weights using loss gradients
            examples_seen += input_batch.shape[0]  # Track examples instead of tokens
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Calculate the accuracy after each epoch
        train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=eval_iter)
        val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=eval_iter)
        print(f"Training accuracy: {train_accuracy*100:.2f}% | ", end="")
        print(f"Validation accuracy: {val_accuracy*100:.2f}%")
        train_accs.append(train_accuracy)
        val_accs.append(val_accuracy)

    return train_losses, val_losses, train_accs, val_accs, examples_seen

- The `evaluate_model` function used in `train_classifier_simple` is the same as the one we used in chapter 5

In [35]:
# Same function as in chapter 5
def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    """Evaluate the model on the training and validation sets.

    Args:
        model: The model to evaluate
        train_loader: Training data loader
        val_loader: Validation data loader
        device: Device to run on (CPU/GPU)
        eval_iter: Number of batches to use for the evaluation

    Returns:
        train_loss: Loss on the training set
        val_loss: Loss on the validation set
    """
    model.eval()  # Set model to evaluation mode
    with torch.no_grad():  # Disable gradient tracking
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)  # Training loss
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)  # Validation loss
    model.train()  # Switch the model back to training mode
    return train_loss, val_loss
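
- To see why the function switches between `model.eval()` and `model.train()`, consider the minimal sketch below. It uses a single `torch.nn.Dropout` layer as a stand-in for a model that contains dropout (an assumption for illustration; the GPT model itself is not needed here). In training mode, dropout randomly zeroes inputs and rescales the rest; in evaluation mode, it passes the input through unchanged

```python
import torch

torch.manual_seed(123)
dropout = torch.nn.Dropout(p=0.5)  # stand-in for a model that contains dropout
x = torch.ones(2, 4)

dropout.train()                    # training mode: elements are randomly zeroed
train_out = dropout(x)             # surviving elements are scaled by 1 / (1 - p)

dropout.eval()                     # evaluation mode: dropout is a no-op
with torch.no_grad():              # no gradient tracking during evaluation
    eval_out = dropout(x)

print(train_out)
print(eval_out)

dropout.train()                    # switch back before training continues
```

- Without the switch to evaluation mode, the reported losses would fluctuate with the random dropout masks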

- The training takes about 5 minutes on an M3 MacBook Air laptop computer and less than half a minute on a V100 or A100 GPU

In [None]:
# Import the time module to measure the training time
import time

# Record the start time
start_time = time.time()

# Set the random seed for reproducibility
torch.manual_seed(123)

# Create the AdamW optimizer with the given learning rate and weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)

# Train for 5 epochs
num_epochs = 5
# Run the training function, which returns the metrics tracked during training
train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=50, eval_iter=5,
)

# Record the end time
end_time = time.time()
# Compute the total training time in minutes
execution_time_minutes = (end_time - start_time) / 60
# Print the total training time
print(f"Training completed in {execution_time_minutes:.2f} minutes.")

- Similar to chapter 5, we use matplotlib to plot the loss for the training and validation set

In [37]:
# Import matplotlib for plotting
import matplotlib.pyplot as plt

def plot_values(epochs_seen, examples_seen, train_values, val_values, label="loss"):
    # Create a 5x3 inch figure with one axes object
    fig, ax1 = plt.subplots(figsize=(5, 3))

    # Plot training and validation values against epochs
    ax1.plot(epochs_seen, train_values, label=f"Training {label}")  # Training curve
    ax1.plot(epochs_seen, val_values, linestyle="-.", label=f"Validation {label}")  # Validation curve
    ax1.set_xlabel("Epochs")  # Label for the x-axis
    ax1.set_ylabel(label.capitalize())  # Label for the y-axis
    ax1.legend()  # Add a legend

    # Create a second x-axis for the number of examples seen
    ax2 = ax1.twiny()  # Second x-axis that shares the same y-axis
    ax2.plot(examples_seen, train_values, alpha=0)  # Invisible plot for aligning ticks
    ax2.set_xlabel("Examples seen")  # Label for the second x-axis

    # Adjust the layout to make room for all elements
    fig.tight_layout()
    # Save the figure as a PDF file
    plt.savefig(f"{label}-plot.pdf")
    # Show the figure
    plt.show()

In [None]:
# Evenly spaced tensor from 0 to num_epochs with the same length as train_losses
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
# Evenly spaced tensor from 0 to examples_seen with the same length as train_losses
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))

# Plot the training and validation loss curves
plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses)
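
- The evenly spaced x-axis values above are a convenient approximation rather than the exact positions of the recorded losses. The sketch below replays only the step-counting logic of `train_classifier_simple` in plain Python; `evaluation_steps` is a hypothetical helper and `batches_per_epoch=130` is an assumed value chosen for illustration

```python
def evaluation_steps(num_epochs, batches_per_epoch, eval_freq):
    """Return the global steps at which the training loop records a loss."""
    steps, global_step = [], -1
    for _ in range(num_epochs):
        for _ in range(batches_per_epoch):
            global_step += 1
            if global_step % eval_freq == 0:
                steps.append(global_step)
    return steps

batches_per_epoch = 130  # assumed number of training batches per epoch
num_epochs = 5
steps = evaluation_steps(num_epochs, batches_per_epoch, eval_freq=50)

# Evenly spaced positions, analogous to torch.linspace(0, num_epochs, len(steps))
n = len(steps)
evenly_spaced = [num_epochs * i / (n - 1) for i in range(n)]

# Position of each record measured in epochs actually completed
exact = [(step + 1) / batches_per_epoch for step in steps]

print(n, steps[-1])
print(evenly_spaced[-1], round(exact[-1], 3))
```

- Under these assumptions, the last loss is recorded before the final epoch is complete, while the evenly spaced axis places it at exactly `num_epochs`; the difference is small and does not change the shape of the curve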

- The downward slope of the loss curves above shows that the model learns well
- Furthermore, the fact that the training and validation losses are very close indicates that the model does not tend to overfit the training data
- Similarly, we can plot the accuracy below

In [None]:
# Evenly spaced tensor from 0 to num_epochs with the same length as train_accs
epochs_tensor = torch.linspace(0, num_epochs, len(train_accs))
# Evenly spaced tensor from 0 to examples_seen with the same length as train_accs
examples_seen_tensor = torch.linspace(0, examples_seen, len(train_accs))

# Plot the training and validation accuracy curves
plot_values(epochs_tensor, examples_seen_tensor, train_accs, val_accs, label="accuracy")

- Based on the accuracy plot above, we can see that the model achieves a relatively high training and validation accuracy after epochs 4 and 5
- However, we have to keep in mind that we specified `eval_iter=5` in the training function earlier, which means that the training and validation accuracies were only estimated on 5 batches each
- We can compute the training, validation, and test set performances over the complete datasets as follows
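
- The effect of limiting the number of batches can be sketched in plain Python. The helper `accuracy_over_batches` is hypothetical and only mimics the assumed behavior of `calc_accuracy_loader` (using the first `num_batches` batches, or all batches if `num_batches` is `None`); the predictions are made up

```python
def accuracy_over_batches(batches, num_batches=None):
    """Fraction of correct predictions over the first `num_batches` batches.

    `batches` is a list of batches; each batch is a list of
    (predicted_label, true_label) pairs. None means "use all batches".
    """
    if num_batches is None:
        num_batches = len(batches)
    num_batches = min(num_batches, len(batches))

    correct, total = 0, 0
    for batch in batches[:num_batches]:
        correct += sum(pred == true for pred, true in batch)
        total += len(batch)
    return correct / total

# Made-up predictions: the first two batches happen to be perfect, the third is not
batches = [
    [(1, 1), (0, 0)],
    [(0, 0), (1, 1)],
    [(1, 0), (0, 0)],
]

estimate = accuracy_over_batches(batches, num_batches=2)  # quick estimate
full = accuracy_over_batches(batches)                     # complete dataset

print(f"Estimate: {estimate*100:.2f}% | Full: {full*100:.2f}%")
```

- A small number of batches keeps the evaluation during training fast, at the cost of a noisier estimate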

In [None]:
# Compute the accuracy on the complete training, validation, and test sets
train_accuracy = calc_accuracy_loader(train_loader, model, device)
val_accuracy = calc_accuracy_loader(val_loader, model, device)
test_accuracy = calc_accuracy_loader(test_loader, model, device)

# Print the accuracy for each dataset
print(f"Training accuracy: {train_accuracy*100:.2f}%")
print(f"Validation accuracy: {val_accuracy*100:.2f}%")
print(f"Test accuracy: {test_accuracy*100:.2f}%")

- We can see that the training and validation set performances are practically identical
- However, the slightly lower test set performance shows that the model overfits the training data to a very small degree, as well as the validation data, which has been used for tweaking some of the hyperparameters, such as the learning rate
- This is normal, and the gap could potentially be reduced further by increasing the model's dropout rate (`drop_rate`) or the `weight_decay` in the optimizer setting
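
- As a minimal sketch of what `weight_decay` does in PyTorch's `AdamW`: the decay shrinks each parameter by a factor of `1 - lr * weight_decay` per step, independently of the gradient. The example sets the gradient to zero so that only the decay term acts; the values `lr=0.1` and `weight_decay=0.1` are illustrative and much larger than the `lr=5e-5` used in this chapter

```python
import torch

param = torch.nn.Parameter(torch.tensor([1.0, -2.0], dtype=torch.float64))
lr, weight_decay = 0.1, 0.1  # illustrative values, not the chapter's settings
optimizer = torch.optim.AdamW([param], lr=lr, weight_decay=weight_decay)

param.grad = torch.zeros_like(param)  # zero gradient isolates the decay term
optimizer.step()

expected = torch.tensor([1.0, -2.0], dtype=torch.float64) * (1 - lr * weight_decay)
print(param.data)
print(expected)
```

- A larger `weight_decay` pulls the weights toward zero more strongly, which acts as a regularizer; whether it actually narrows the gap for this model would have to be checked experimentally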

## 6.8 Using the LLM as a spam classifier

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch06_compressed/overview-4.webp" width=500px>

- Finally, let's put the finetuned GPT model into action
- The `classify_review` function below implements data preprocessing steps similar to those of the `SpamDataset` we implemented earlier
- Then, the function obtains the predicted integer class label from the model and returns the corresponding class name

In [41]:
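def classify_review(text, model, tokenizer, device, max_length=None, pad_token_id=50256):
    model.eval()

    # Prepare the input for the model
    input_ids = tokenizer.encode(text)
    supported_context_length = model.pos_emb.weight.shape[0]
    # Note: In the book, this was originally written as pos_emb.weight.shape[1] by mistake
    # That didn't break the code but would have caused unnecessary truncation (to 768 instead of 1024)

    # Fall back to the supported context length if no max_length is given,
    # and never exceed the context length the model supports
    if max_length is None:
        max_length = supported_context_length
    max_length = min(max_length, supported_context_length)

    # Truncate the sequence if it is too long
    input_ids = input_ids[:max_length]

    # Pad the sequence to max_length
    input_ids += [pad_token_id] * (max_length - len(input_ids))
    input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)  # Add batch dimension

    # Model inference
    with torch.no_grad():
        logits = model(input_tensor)[:, -1, :]  # Logits of the last output token
    predicted_label = torch.argmax(logits, dim=-1).item()

    # Return the classification result
    return "spam" if predicted_label == 1 else "not spam"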
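
- The truncation and padding step can be tried in isolation, without a model or tokenizer. The helper `truncate_and_pad` below is a hypothetical stand-alone sketch, and the token IDs are made up; only the default `pad_token_id=50256` is taken from the function above

```python
def truncate_and_pad(input_ids, max_length, context_length, pad_token_id=50256):
    """Truncate to the usable length, then right-pad to exactly that length."""
    target_length = min(max_length, context_length)
    input_ids = input_ids[:target_length]
    return input_ids + [pad_token_id] * (target_length - len(input_ids))

# A short sequence is padded on the right
short = truncate_and_pad([11, 22, 33], max_length=5, context_length=1024)
# A long sequence is cut off after max_length tokens
long = truncate_and_pad(list(range(10)), max_length=5, context_length=1024)

print(short)
print(long)
```

- Padding on the right to the same length that was used during training keeps the inference input consistent with what the model saw while finetuning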

- Let's try it out on a few examples below

In [None]:
text_1 = (
    "You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award."
)

# Classify the text
print(classify_review(
    text_1, model, tokenizer, device, max_length=train_dataset.max_length
))

In [None]:
# Define the text to classify
text_2 = (
    "Hey, just wanted to check if we're still on"
    " for dinner tonight? Let me know!"
)

# Call classify_review to classify the text and print the result
print(classify_review(
    text_2, model, tokenizer, device, max_length=train_dataset.max_length
))

- Finally, let's save the model in case we want to reuse it later without having to train it again

In [44]:
# Save the model's state dict to the file "review_classifier.pth"
torch.save(model.state_dict(), "review_classifier.pth")

- Then, in a new session, after creating a model with the same architecture, we could load the saved weights as follows

In [None]:
# Load the model's state dict from the file
model_state_dict = torch.load("review_classifier.pth", map_location=device, weights_only=True)
# Load the state dict into the model
model.load_state_dict(model_state_dict)
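
- The save-and-load round trip can be checked with a small stand-in model. In the sketch below, a `torch.nn.Linear` layer replaces the finetuned GPT model and the file name is hypothetical; the point is that the receiving model must have the same architecture as the saved one (for the classifier, this includes the replaced output head)

```python
import os
import tempfile

import torch

torch.manual_seed(123)
model_a = torch.nn.Linear(4, 2)  # tiny stand-in for the finetuned model
model_b = torch.nn.Linear(4, 2)  # freshly initialized model, same architecture

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_classifier.pth")  # hypothetical file name
    torch.save(model_a.state_dict(), path)
    state_dict = torch.load(path, map_location="cpu", weights_only=True)

model_b.load_state_dict(state_dict)

# After loading, both models hold identical parameters
weights_match = all(
    torch.equal(p_a, p_b)
    for p_a, p_b in zip(model_a.parameters(), model_b.parameters())
)
print(weights_match)
```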

## Summary and takeaways

- See the [./gpt_class_finetune.py](./gpt_class_finetune.py) script, a self-contained script for classification finetuning
- You can find the exercise solutions in [./exercise-solutions.ipynb](./exercise-solutions.ipynb)
- In addition, interested readers can find an introduction to parameter-efficient training with low-rank adaptation (LoRA) in [appendix E](../../appendix-E)