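# Supervised Fine-Tuning (SFT) of Language Models

SFT is a method for turning a general-purpose language model into a task-oriented assistant.

The model is trained on pairs of prompts and ideal responses and learns to imitate the example answers, so that it can follow instructions, exhibit the desired behavior, and call tools correctly.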

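The SFT training pipeline is as follows:
1. Base model: an LLM that has not been tuned yet.

2. Labeled dataset: collect and curate pairs of user prompts and ideal assistant responses.

3. SFT training: fine-tune the model on the paired dataset by minimizing the cross-entropy loss of the responses, that is, the negative log-likelihood of each response given its prompt.

    $$\mathcal{L}_{\text{SFT}} = -\sum_{i=1}^N \log \bigl(p_\theta(\text{Response}(i)\mid \text{Prompt}(i))\bigr)$$

    The cross-entropy loss penalizes outputs that deviate from the labeled response, so SFT is essentially teaching the model to "imitate".

4. Fine-tuned model: after training, the model can give appropriate replies to new queries.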

![SFT.png](images/SFT.png)
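
As a minimal sketch of the loss above, the snippet below computes the summed negative log-likelihood for one toy sequence in pure Python. It relies on the standard factorization of the response log-probability into a sum of per-token log-probabilities. The helper names (`softmax`, `sft_loss`), the three-token vocabulary, and the `loss_mask` that excludes prompt positions are illustrative assumptions, not part of any library.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(step_logits, target_ids, loss_mask):
    """Sum -log p(target token) over positions where loss_mask is 1."""
    loss = 0.0
    for logits, target, keep in zip(step_logits, target_ids, loss_mask):
        if keep:
            loss -= math.log(softmax(logits)[target])
    return loss

# Hypothetical toy example: vocabulary of 3 tokens, 3 positions.
# Position 0 belongs to the prompt (mask 0); positions 1-2 are the response.
uniform = [[0.0, 0.0, 0.0]] * 3
confident = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
targets = [2, 0, 1]
mask = [0, 1, 1]

loss_uniform = sft_loss(uniform, targets, mask)      # model is guessing
loss_confident = sft_loss(confident, targets, mask)  # model favors the labels
print(f"uniform: {loss_uniform:.4f}, confident: {loss_confident:.4f}")
```

With uniform logits every response token has probability 1/3, so the loss is 2·ln 3; logits that favor the labeled tokens give a smaller loss, which is the sense in which the loss rewards imitation.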


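## When to Use SFT

- Jump-starting new model behaviors

    - Turning a pretrained model into an assistant that follows instructions.

    - Teaching basic reasoning to a model that does not reason yet.

    - Letting the model use specific tools without explicit instructions in the prompt.

- Improving model capabilities

    - Using a strong large model to **generate high-quality synthetic data**, then "distilling" those capabilities into a small model through training.

When you need the model to adapt quickly to a new behavior and you have example data, SFT is often the right choice.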

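## Principles of SFT Data Curation

The effect of SFT depends on data quality: good data lets the model learn more useful behaviors.

Common data curation methods are:

- Distillation: generate responses with a stronger instruction-tuned model, then train a smaller model to imitate them, transferring the strong model's capabilities to the weaker one.

- Best-of-K / rejection sampling: generate several candidate responses for the same prompt, then use a reward function to pick the best one as training data.

- Filtering: select samples with high response quality and good prompt diversity from a large SFT dataset to form a compact, high-quality dataset.

SFT forces the model to imitate everything it sees, including bad answers.

The core of data curation is therefore that **quality matters more than quantity**.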
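
The Best-of-K idea can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: `toy_generate` and `toy_reward` are hypothetical stand-ins for a real sampler and a real reward function, and `to_sft_example` packs the winner into the chat-style `messages` record used by the dataset later in this notebook.

```python
def best_of_k(prompt, generate, reward, k=4):
    """Sample k candidate responses and keep the one with the highest reward."""
    candidates = [generate(prompt, seed) for seed in range(k)]
    return max(candidates, key=lambda response: reward(prompt, response))

def to_sft_example(prompt, response):
    """Pack a prompt/response pair into a chat-style 'messages' record."""
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]}

# Hypothetical stand-ins for a real sampler and a real reward model.
CANNED = ["1", "The answer is 1.", "I am not sure.", "1+1-1 = 1."]

def toy_generate(prompt, seed):
    return CANNED[seed % len(CANNED)]

def toy_reward(prompt, response):
    correct = 1.0 if "1" in response else 0.0  # contains the right answer
    return correct - 0.01 * len(response)      # mild preference for brevity

prompt = "Calculate 1+1-1"
best = best_of_k(prompt, toy_generate, toy_reward, k=4)
example = to_sft_example(prompt, best)
print(example)
```

The choice of reward decides what the model will later imitate: this toy reward favors short correct answers, so a dataset built with it would push the model toward terse replies.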

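## SFT Fine-Tuning Methods

### Full Fine-Tuning

Full fine-tuning adds a full-size weight update matrix $\Delta W$ to the weights of every layer.

This modifies all parameters of the model. It can improve performance significantly, but it also increases the compute and memory cost.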

![SFT-full-fine-tuning.png](images/SFT-full-fine-tuning.png)

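### Parameter-Efficient Fine-Tuning (PEFT)

Parameter-efficient fine-tuning (for example LoRA) adjusts the model by introducing small low-rank matrices A and B in each layer instead of a full-size update.

This reduces the number of trainable parameters and saves GPU memory. The drawback is that both learning and forgetting are more limited, because fewer parameters are updated.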

![SFT-PEFT.png](images/SFT-PEFT.png)
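
To make the saving concrete, here is a small arithmetic sketch. It assumes the low-rank update takes the LoRA form $\Delta W = BA$ with $B$ of shape (d_out, r) and $A$ of shape (r, d_in); the layer size 1024 x 1024 and the rank 8 are hypothetical values chosen for illustration, not measurements from the model used below.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters of a full-size update matrix Delta W."""
    return d_out * d_in

def low_rank_update_params(d_out, d_in, rank):
    """Trainable parameters of B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

# Hypothetical layer shape and rank, for illustration only.
d_out, d_in, rank = 1024, 1024, 8

full = full_update_params(d_out, d_in)
low_rank = low_rank_update_params(d_out, d_in, rank)
ratio = low_rank / full

print(f"full update:     {full:,} parameters")
print(f"low-rank update: {low_rank:,} parameters ({ratio:.2%} of full)")
```

For this layer the low-rank update trains 16,384 parameters instead of 1,048,576, which is why memory use drops and also why the update can express less than a full $\Delta W$.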


# SFT in Practice

## Environment Setup

Create a virtual environment: conda create --prefix=F:env/posttraining python=3.12

Install PyTorch: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Install the other packages: pip install numpy==1.26.4 transformers==4.52.4 huggingface-hub==0.33.0 datasets==2.21 trl==0.14.0 jinja2==3.1.2 markupsafe==2.0.1 tabulate==0.9.0 pandas==2.3.0
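
After installation it can help to confirm that the pinned versions are the ones actually active in the environment. The sketch below uses only the standard library; the helper name `check_versions` is an assumption made for this example, and the comparison is a simple prefix match (so the pin `2.21` accepts an installed `2.21.0`).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install command above.
PINNED = {
    "numpy": "1.26.4",
    "transformers": "4.52.4",
    "huggingface-hub": "0.33.0",
    "datasets": "2.21",
    "trl": "0.14.0",
    "pandas": "2.3.0",
}

def check_versions(pinned):
    """Return {package: (wanted, found, matches)}; found is None if missing."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        matches = found is not None and found.startswith(wanted)
        report[name] = (wanted, found, matches)
    return report

for name, (wanted, found, matches) in check_versions(PINNED).items():
    status = "ok" if matches else "MISMATCH"
    print(f"{name:16s} wanted {wanted:8s} found {found} [{status}]")
```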

## Importing Libraries

In [1]:
import torch
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM, SFTConfig

# Suppress tqdm warnings
import warnings
from tqdm import TqdmWarning

warnings.filterwarnings("ignore", category=TqdmWarning)




## Defining Helper Functions

### Response Generation Function

In [2]:
def generate_responses(model, tokenizer, user_message, system_message=None, max_new_tokens=100):
    """
    Generate the model's reply to a user message.

    Args:
        model: the model
        tokenizer: the tokenizer
        user_message: the user's input
        system_message: optional system message used to steer the model's behavior
        max_new_tokens: maximum number of newly generated tokens (default 100)

    Returns:
        The reply generated by the model.
    """
    # Build the message list (the model "sees" every message in it)
    messages = []

    # Add the system message, if any
    if system_message:
        messages.append({"role": "system", "content": system_message})

    # Add the user message
    messages.append({"role": "user", "content": user_message})

    # Use the tokenizer's apply_chat_template to turn the messages into a prompt string
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    # Convert the prompt into input tensors on the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate the reply (greedy decoding, no gradients needed)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Length of the prompt in tokens
    input_len = inputs["input_ids"].shape[1]
    # Slice off the prompt and keep only the newly generated tokens
    generated_ids = outputs[0][input_len:]
    # Decode into readable text
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return response


### Response Test Function

In [3]:
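def test_model_with_questions(model, tokenizer, questions, system_message=None, title="Model Output"):
    """
    Print the model's replies to a list of test questions.

    Args:
        model: the model
        tokenizer: the tokenizer
        questions: list of test questions
        system_message: optional system message used to steer the model's behavior
        title: title printed above the results (default "Model Output")

    Returns:
        None
    """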
    print(f"\n=== {title} ===")
    for i, question in enumerate(questions, 1):
        response = generate_responses(model, tokenizer, question, system_message)
        print(f"\nModel Input {i}:\n{question}\nModel Output {i}:\n{response}\n")


### Model and Tokenizer Loading Function

In [4]:
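def load_model_and_tokenizer(model_name_or_path, use_gpu=False):
    """
    Load a model and its tokenizer.

    Args:
        model_name_or_path: model name or path (a local model is used here)
        use_gpu: whether to move the model to the GPU (default False)

    Returns:
        model: the loaded model
        tokenizer: the loaded tokenizer
    """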

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)

    if use_gpu:
        model = model.to("cuda")

    # Set a fallback chat template if the tokenizer does not provide one
    if not tokenizer.chat_template:
        tokenizer.chat_template = """{% for message in messages %}
                {% if message['role'] == 'system' %}System: {{ message['content'] }}\n
                {% elif message['role'] == 'user' %}User: {{ message['content'] }}\n
                {% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }} 
                {% endif %}
                {% endfor %}"""

    # Use the EOS token as pad_token if none is defined
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer
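
To see what kind of prompt the fallback template is meant to produce, the sketch below re-implements its intent in plain Python. `render_fallback_prompt` is a hypothetical helper written for this illustration; it is not how the tokenizer renders templates, and the exact whitespace emitted by the real Jinja template may differ. Note that the fallback template never references `add_generation_prompt`, so this sketch does not append an assistant cue either.

```python
# Role labels used by the fallback template above.
ROLE_LABELS = {"system": "System", "user": "User", "assistant": "Assistant"}

def render_fallback_prompt(messages):
    """Approximate the fallback template: one 'Role: content' line per message."""
    lines = []
    for message in messages:
        label = ROLE_LABELS.get(message["role"])
        if label is None:
            continue  # the fallback template emits nothing for other roles
        lines.append(f"{label}: {message['content']}")
    return "\n".join(lines)

messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Calculate 1+1-1"},
]
rendered = render_fallback_prompt(messages)
print(rendered)
```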


### Dataset Display Function

In [5]:
def display_dataset(dataset):
    """
    Display the first three examples of a dataset as a table.

    Args:
        dataset: the dataset to display

    Returns:
        None
    """
    rows = []
    for i in range(3):
        example = dataset[i]
        user_msg = next(m['content'] for m in example['messages'] if m['role'] == 'user')
        assistant_msg = next(m['content'] for m in example['messages'] if m['role'] == 'assistant')
        rows.append({
            'User Prompt': user_msg,
            'Assistant Response': assistant_msg
        })

    # Display as table
    df = pd.DataFrame(rows)
    pd.set_option('display.max_colwidth', None)  # Avoid truncating long strings
    display(df)


## Loading the Qwen3-0.6B Model

Download site (mirror in mainland China): https://www.modelscope.cn/models/Qwen/Qwen3-0.6B/files

![Qwen3-0.6B](images/Qwen-0.6B.png)

In [6]:
USE_GPU = True
# Load the model and tokenizer
model, tokenizer = load_model_and_tokenizer("Qwen-0.6B", USE_GPU)


questions = [
    "Give me an 1-sentence introduction of LLM.",
    "Calculate 1+1-1",
    "What's the difference between thread and process?"
]
# Test the model's replies
test_model_with_questions(model, tokenizer, questions, title="Base Model (Before SFT) Output")


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



=== Base Model (Before SFT) Output ===





Model Input 1:
Give me an 1-sentence introduction of LLM.
Model Output 1:
A large language model is a system designed to understand and generate human language.


Model Input 2:
Calculate 1+1-1
Model Output 2:
The expression $1 + 1 - 1$ can be evaluated step by step:

1. Add the first two 1s:  
   $1 + 1 = 2$

2. Subtract the third 1:  
   $2 - 1 = 1$

So, the final result is:  
**1**.


Model Input 3:
What's the difference between thread and process?
Model Output 3:
The difference between **thread** and **process** is important in operating systems and concurrent programming. Here's a clear breakdown:

### 1. **Process**:
- A **process** is a **thread** that is running independently.
- It is a **unit of execution** in a program.
- Processes are managed by the operating system and can be created, stopped, or terminated.
- Processes are isolated from other processes, meaning they share the same memory space and resources.
- Example:



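Questions 2 and 3 show that the base model's replies are far from ideal: they are verbose and somewhat disorganized, and the reply to question 3 breaks off mid-answer (generation is capped at max_new_tokens=100).

## Loading the SFT Dataset

Dataset download path: https://hf-mirror.com/datasets/banghua/DL-SFT-Dataset/tree/main/data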

![DL-SFT-Dataset](images/DL-SFT-Dataset.png)

In [7]:
# Load the dataset
train_dataset = load_dataset("parquet", data_files="DL-SFT-Dataset/train-00000-of-00001.parquet")["train"]

if not USE_GPU:
    train_dataset = train_dataset.select(range(100))

display_dataset(train_dataset)


Unnamed: 0,User Prompt,Assistant Response
0,"- The left child should have a value less than the parent node's value, and the right child should have a value greater than the parent node's value.","This statement is correct. In a binary search tree, nodes in the left subtree of a particular node have values less than the node's value, while nodes in the right subtree have values greater than the node's value. This property helps in the efficient search, insertion, and deletion of nodes in the tree."
1,"To pass three levels must be the plan.\nThen tackle Two, when that is done.\nOf 100 that start, at the end will be 20.\nFinQuiz is a website that helps you prepare.\nUse it to be stress-free, and not lose your hair.\nThen, take the exam with a smile on your face.\nBe confident that you will gain your place.\nSo make this the goal to which you aspire. How many individuals out of 100 will successfully complete all three levels of preparation for the exam?","Based on the given information, out of 100 individuals who start, only 20 will make it to the end. There is no information provided on how many individuals will successfully complete all three levels of preparation specifically."
2,"Can you translate the text material into Spanish or any other language?: He really is, you know.\nThings a hero should show.\nHe loves me more than a zillion things.\nHe loves me when I sing my jolly folktale rhymes.\nHe's good, not just good, in fact he's great!\nBut because he's my best mate!\nWOW !!! I love it!!!!","¿Puede traducir el texto a español o a cualquier otro idioma?: \nRealmente lo es, ya sabes.\nCosas que un héroe debería demostrar.\nMe quiere más que un millón de cosas.\nMe quiere cuando canto mis alegres rimas de cuentos populares.\nEs bueno, no solo bueno, ¡de hecho es genial!\n¡Pero porque es mi mejor amigo!\n¡WOW! ¡Me encanta!"


## Training the Model with SFT

### Configuring Parameters

In [8]:
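# SFTTrainer configuration
sft_config = SFTConfig(
    learning_rate=8e-5,  # Learning rate; SFT usually uses a small one to avoid destroying pretrained knowledge.
    num_train_epochs=1,  # Number of epochs; SFT usually uses few to avoid overfitting.
    per_device_train_batch_size=1,  # Batch size per GPU.
    gradient_accumulation_steps=8,  # Gradient accumulation steps; simulates a larger batch without extra GPU memory.
    gradient_checkpointing=False,  # Gradient checkpointing is disabled.
    logging_steps=2,  # Log every two steps.
)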


### Creating and Starting the Trainer

In [9]:
sft_trainer = SFTTrainer(
    model=model,  # Model to train
    args=sft_config,  # Training configuration
    train_dataset=train_dataset,  # Training dataset
    processing_class=tokenizer,  # Tokenizer used to process the text
)
sft_trainer.train()


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Step,Training Loss
2,2.6477
4,2.5485
6,2.2975
8,2.1558
10,1.9763
12,2.2618
14,2.3549
16,2.1953
18,2.231
20,2.1549


TrainOutput(global_step=371, training_loss=2.1162383318590026, metrics={'train_runtime': 531.8997, 'train_samples_per_second': 5.567, 'train_steps_per_second': 0.697, 'total_flos': 1255369248866304.0, 'train_loss': 2.1162383318590026, 'epoch': 1.0})
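
The step count in the output above can be sanity-checked with a little arithmetic. The sketch below assumes a single GPU and that a trailing partial accumulation group still counts as one optimizer step, which is consistent with the logged `global_step=371`. The dataset size is not stated in this notebook; it is back-computed here from `train_runtime` times `train_samples_per_second`, so treat it as an estimate.

```python
import math

def optimizer_steps(num_samples, per_device_batch_size, grad_accum_steps,
                    num_devices=1, epochs=1):
    """Return (effective batch size, number of optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps = math.ceil(num_samples / effective_batch) * epochs
    return effective_batch, steps

# Estimated from the logged metrics, not read from the dataset itself.
train_runtime = 531.8997
train_samples_per_second = 5.567
estimated_samples = round(train_runtime * train_samples_per_second)

effective_batch, steps = optimizer_steps(
    estimated_samples, per_device_batch_size=1, grad_accum_steps=8
)
print(f"samples ~ {estimated_samples}, effective batch = {effective_batch}, steps = {steps}")
```

With a per-device batch size of 1 and 8 accumulation steps, each optimizer step sees 8 samples, so roughly 2961 samples give 371 steps for one epoch.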

## Model Replies After SFT

In [10]:
if not USE_GPU:
    sft_trainer.model.to("cpu")
test_model_with_questions(sft_trainer.model, tokenizer, questions,
                          title="Base Model (After SFT) Output")



=== Base Model (After SFT) Output ===

Model Input 1:
Give me an 1-sentence introduction of LLM.
Model Output 1:
I am not capable of creating a specific introduction for a llm. However, I can provide a general introduction for a llm that is designed to assist users in various tasks, such as language processing, natural language generation, and information retrieval.


Model Input 2:
Calculate 1+1-1
Model Output 2:
1+1-1 = 1.


Model Input 3:
What's the difference between thread and process?
Model Output 3:
A thread is a lightweight process that runs within a single program or application. It is a separate instance of the same program or application that can be controlled independently from the main program or application. Threads are created by the operating system when a program or application starts executing, and they are responsible for handling user input and output. 

On the other hand, a process is a separate instance of a program or application that is independent from the mai

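After SFT, the replies to questions 2 and 3 change noticeably for the better, as expected: both are more direct and less cluttered than before.

## Saving the Model

Save the fine-tuned model weights (pytorch_model.bin or model.safetensors) and the configuration file (config.json).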

![Qwen-0.6B-SFT](images/Qwen-0.6B-SFT.png)

In [None]:
output_dir = "Qwen-0.6B-SFT"
sft_trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model saved to: {output_dir}")


Model saved to: Qwen-0.6B-SFT
