## Instruction finetuning workflow
Stage 1 Preparing the dataset:
1.Data download and formatting
2.Batching the dataset
3.Creating data loaders
Stage 2 Finetuning the LLM:
4.Loading a pretrained LLM
5.Instruction finetuning the LLM
6.Inspecting the modeling loss
Stage 3 Evaluating the LLM:
7.Extracting responses
8.Qualitative evaluation 
9.Scoring the responses

# Stage 1 Preparing the dataset:
1.Data download and formatting
The JSON dataset contains 1,100 instruction-response pairs.

In [7]:
# Listing 7.1 Downloading the dataset
import json
import os
import urllib.request

def download_and_load_file(file_path, url):
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)
    else: #A
        with open(file_path, "r", encoding="utf-8") as file:
            text_data = file.read()
    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)
    return data

file_path = "instruction-data.json"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_mainchapter-code/instruction-data.json"

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))
print(data[-1])  # show the last entry

#A Skip the download if the file already exists

Number of entries: 1100
{'instruction': "Change the sentence 'You should have called me.' into a question.", 'input': '', 'output': 'Should you have called me?'}


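Instruction finetuning, also called supervised instruction finetuning, trains a model on a dataset of explicit input-output pairs. There are two common ways to format these JSON entries, known as prompt styles: Alpaca and Phi-3. Here we focus on the Alpaca prompt style.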

In [9]:
# Listing 7.2 Implementing the prompt formatting function
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    # If the 'input' field is empty, the "### Input:" section is omitted
    return instruction_text + input_text

In [10]:
model_input = format_input(data[1099])
desired_response = f"\n\n### Response:\n{data[1099]['output']}"
print(model_input + desired_response)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Change the sentence 'You should have called me.' into a question.

### Response:
Should you have called me?
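
For comparison, the Phi-3 prompt style mentioned in this section can be sketched as follows. This is a minimal sketch, not code from the original notebook: the helper name `format_input_phi3` is hypothetical, and the `<|user|>` / `<|assistant|>` markers are an assumption about how the Phi-3 template is usually written.

```python
def format_input_phi3(entry):
    # Hypothetical helper: the <|user|> marker is an assumed Phi-3 style tag
    user_text = f"<|user|>\n{entry['instruction']}"
    # Append the optional input only when it is non-empty
    input_text = f"\n{entry['input']}" if entry["input"] else ""
    return user_text + input_text


entry = {
    "instruction": "Change the sentence 'You should have called me.' into a question.",
    "input": "",
    "output": "Should you have called me?",
}

# The <|assistant|> marker is likewise an assumption
prompt = format_input_phi3(entry) + f"\n<|assistant|>\n{entry['output']}"
print(prompt)
```

This template is noticeably shorter than the Alpaca one, so each training example uses fewer tokens.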


In [11]:
# Split the data into training, validation, and test sets
train_portion = int(len(data) * 0.85) # 85% for training
test_portion = int(len(data) * 0.1) # 10% for testing
val_portion = len(data) - train_portion - test_portion # Remaining 5% for validation

train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]

print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))

Training set length: 935
Validation set length: 55
Test set length: 110


2.Batching the dataset
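In the previous chapter, when building the dataset of spam messages, we relied on the DataLoader's default behavior, default_collate, which simply stacks the samples into a batch. In some cases, however, default stacking does not meet the data-processing requirements:
Variable-length sequences: for example, sentences of different lengths in NLP tasks, which need padding or truncation.
Graph data: in graph neural networks (GNNs), irregularly structured data must be merged by a custom collate_fn.
Multimodal data: for example, tasks that combine images and text, where each modality needs its own handling.
Sparse matrices or sparse tensors: some sparse formats may need a custom collate_fn to implement a specific merge operation.
In these cases we have to define a custom collate_fn.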
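
To make the variable-length case concrete, here is a minimal sketch (not part of the original notebook) that feeds two tensors of different lengths to a DataLoader without a custom collate_fn. The toy `samples` list and the flag `default_collate_failed` are illustrative names only.

```python
import torch
from torch.utils.data import DataLoader

# Two "tokenized sentences" of different lengths
samples = [torch.tensor([0, 1, 2, 3, 4]), torch.tensor([5, 6])]

# No collate_fn given, so the default stacking behavior is used
loader = DataLoader(samples, batch_size=2)

try:
    next(iter(loader))
    default_collate_failed = False
except RuntimeError as err:
    # Stacking tensors of unequal size is not possible
    default_collate_failed = True
    print("Default collate failed:", err)
```

The default collate tries to stack the samples into one tensor, which requires all of them to have the same shape.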

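The batching process:
1. Format the data with the prompt template
2. Tokenize the formatted data into token IDs
3. Pad the token sequences to a uniform length by appending 50256
4. Create the target token IDs for training
5. Replace the padding tokens with the placeholder -100 so that they are masked out in the loss function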

In [14]:
# Listing 7.4 Implementing an instruction dataset class
import torch
from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_texts = []
        for entry in data:                                           #A
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

#A Pretokenize the texts

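We do not pad all samples to one global length before initializing the dataset, as we did in the previous chapter, because we want a finer-grained padding scheme in which sequences only need to share the same length within each batch. That requires a custom collate_fn.
Compared with global padding, this approach reduces wasted padding and lowers the compute cost. Some deep learning models become unstable or less efficient on overly long sequences, especially models built on self-attention, where the computational cost grows quadratically with the sequence length.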
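
The saving can be illustrated with a small count of padding tokens. This is a sketch with made-up sequence lengths and a batch size of 3, not a measurement on the instruction dataset.

```python
# Hypothetical sequence lengths, for illustration only
lengths = [5, 2, 3, 12, 4, 6]
batch_size = 3

# Global padding: every sequence is padded to the longest one in the dataset
global_max = max(lengths)
global_pad = sum(global_max - n for n in lengths)

# Per-batch padding: each sequence is padded to the longest one in its batch
batch_pad = 0
for start in range(0, len(lengths), batch_size):
    batch = lengths[start:start + batch_size]
    batch_pad += sum(max(batch) - n for n in batch)

print("Padding tokens with global padding:", global_pad)
print("Padding tokens with per-batch padding:", batch_pad)
```

With these lengths, global padding adds 40 padding tokens and per-batch padding adds 19, because the first batch never has to be stretched to the length of the longest sequence in the second batch.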

In [16]:
def custom_collate_draft_1(
    batch,
    pad_token_id=50256,
    device="cpu"
):
    batch_max_length = max(len(item)+1 for item in batch)          #A
    inputs_lst = []

    for item in batch:                                             #B
        new_item = item.copy()
        new_item += [pad_token_id]

        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))

        inputs = torch.tensor(padded[:-1])                         #C
        inputs_lst.append(inputs)

    inputs_tensor = torch.stack(inputs_lst).to(device)             #D
    return inputs_tensor


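#A Find the longest sequence in the batch
#B Pad each item and prepare the inputs
#C Remove the extra padding token added earlier
#D Convert the list of inputs to a tensor and move it to the target device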


You may wonder why the code first does new_item += [pad_token_id] and then inputs = torch.tensor(padded[:-1]). This is tied to the complete function further below, so it may look redundant at this point.

In [18]:
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]
batch = (
    inputs_1,
    inputs_2,
    inputs_3
)
print(type(batch))
print(custom_collate_draft_1(batch))

<class 'tuple'>
tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])


In [19]:
def custom_collate_draft_2(
    batch,
    pad_token_id=50256,
    device="cpu"
):
    batch_max_length = max(len(item)+1 for item in batch)
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])             #A
        targets = torch.tensor(padded[1:])             #B
        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

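This explains the seemingly redundant code in draft 1: it is there for draft 2, where the target tokens can be produced directly by slicing with an index shifted by one.
inputs_lst is a list of 1-D tensors, like [tensor(...), tensor(...)], and torch.stack combines that list into a single 2-D tensor with one row per sample.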
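
A minimal sketch of both ideas on one toy sequence, assuming the same pad token ID of 50256; the variable names are illustrative.

```python
import torch

pad_token_id = 50256
item = [7, 8, 9]

# Append one pad token so the targets can be obtained by shifting by one position
padded = item + [pad_token_id]
inputs = torch.tensor(padded[:-1])   # all tokens except the last
targets = torch.tensor(padded[1:])   # all tokens except the first

# torch.stack turns a list of equally sized 1-D tensors into one 2-D tensor
stacked = torch.stack([inputs, targets])
print(inputs.tolist(), targets.tolist(), stacked.shape)
```

Each target position holds the token that follows the corresponding input position, which is the next-token prediction setup.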

In [21]:
inputs, targets = custom_collate_draft_2(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256, 50256, 50256, 50256],
        [    8,     9, 50256, 50256, 50256]])


In [22]:
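# Replace the padding tokens in the targets with the placeholder -100; this special value keeps padding tokens out of the training loss
# We keep one end-of-text token (50256), however, so that the LLM learns when to generate the end token in response to an instruction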
def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    batch_max_length = max(len(item)+1 for item in batch)
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1]) # Truncate the last token for inputs
        targets = torch.tensor(padded[1:]) # Shift +1 to the right for targets

        mask = targets == pad_token_id
        # Boolean comparison: mask is a boolean tensor with the same shape as targets
        indices = torch.nonzero(mask).squeeze()
        # nonzero returns the indices of the True entries; squeeze removes the extra dimension
        if indices.numel() > 1:
            # Keep the first 50256 token as the end-of-text marker
            targets[indices[1:]] = ignore_index
        # Truncate if allowed_max_length is set
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor


In [23]:
inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)

tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])
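
The masking step can be traced in isolation on a single row. The sketch below repeats the same nonzero/squeeze logic on two toy target tensors; the helper name `mask_padding` is hypothetical.

```python
import torch


def mask_padding(targets, pad_token_id=50256, ignore_index=-100):
    # Hypothetical helper that isolates the masking step of the collate function
    targets = targets.clone()
    mask = targets == pad_token_id
    indices = torch.nonzero(mask).squeeze()
    # Only mask when there is more than one pad token; the first one is kept
    if indices.numel() > 1:
        targets[indices[1:]] = ignore_index
    return targets


several_pads = mask_padding(torch.tensor([8, 9, 50256, 50256, 50256]))
single_pad = mask_padding(torch.tensor([1, 2, 3, 4, 50256]))
print(several_pads)
print(single_pad)
```

When a row contains only one 50256 token, squeeze produces a zero-dimensional tensor whose numel is 1, so the row is left unchanged and the end-of-text token stays a valid target.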


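# Why set the padding tokens in the targets to -100?
The cross_entropy function has a default parameter ignore_index=-100, which means that any target equal to -100 is excluded from the loss. Padding positions exist only to make the sequence lengths match, so the model should neither learn from them nor be penalized at them. If the padding values were not set to the ignored value, the model would be trained to predict these meaningless positions, which causes spurious gradient updates and hurts learning.
Take the two tensors printed above as an example. For the first row, inputs [0, 1, 2, 3, 4] and targets [1, 2, 3, 4, 50256], all five positions contribute to the cross entropy. For the second row, inputs [5, 6, 50256, 50256, 50256] and targets [6, 50256, -100, -100, -100], only the targets 6 and 50256 contribute.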
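
This behavior can be checked with a small sketch using random logits over a toy vocabulary of 5 tokens (the sizes and target values are made up for illustration).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
logits = torch.randn(3, 5)              # 3 positions, toy vocabulary of 5 tokens
targets = torch.tensor([1, 4, -100])    # the last position is treated as padding

# ignore_index defaults to -100, so the last position is skipped
loss_masked = F.cross_entropy(logits, targets)

# Loss computed on the first two positions only
loss_first_two = F.cross_entropy(logits[:2], targets[:2])

print(torch.isclose(loss_masked, loss_first_two))
```

The two losses match because ignored positions are left out of both the sum and the averaging denominator.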


# Creating the data loaders

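In Python, functools.partial is a handy tool that pre-fills some of a function's arguments and returns a new function, so callers of the new function do not have to repeat those arguments. Because every data loader initialized below would otherwise need the same collate_fn arguments, we create one reusable function here.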
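
A minimal sketch of functools.partial on a toy function; `pad_sequence` and `pad_to_5` are hypothetical names used only for this illustration.

```python
from functools import partial


def pad_sequence(seq, length, pad_value=0):
    # Hypothetical helper: right-pad a list to the requested length
    return seq + [pad_value] * (length - len(seq))


# Pre-fill length and pad_value; only seq remains to be passed
pad_to_5 = partial(pad_sequence, length=5, pad_value=50256)

result = pad_to_5([7, 8, 9])
print(result)
```

The new function behaves like the original one with the pre-filled keyword arguments already supplied.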

In [27]:
from functools import partial
device = torch.device("cpu")
customized_collate_fn = partial(custom_collate_fn, device=device, allowed_max_length=1024)

In [44]:
# Listing 7.6 Initializing the data loaders
import tiktoken
from torch.utils.data import DataLoader

tokenizer = tiktoken.get_encoding("gpt2")
num_workers = 4         # number of worker processes; set to 0 if multiprocess loading is unavailable on your platform
batch_size = 8

torch.manual_seed(123)

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)

val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)

test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)


In [None]:
print("Train loader:")
for inputs, targets in train_loader:
    print(inputs.shape, targets.shape)