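## Step 1: Install the required dependencies

First, install Unsloth and vLLM, the two tools this tutorial relies on for efficient fine-tuning:

- **Unsloth**: a library for fast fine-tuning of large language models, with support for LoRA and QLoRA
- **vLLM**: a high-performance inference engine for large language models

Note: the `--no-deps` flag tells pip to install only the named packages and skip their dependencies. This keeps pip from replacing packages that are already installed (for example, PyTorch), but any missing dependencies must then be installed separately.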


In [None]:
# pip install --no-deps unsloth vllm==0.8.5.post1
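
Because `--no-deps` skips dependency resolution, it can be useful to check which packages are actually present before continuing. The following is a minimal sketch using only the standard library; the helper name `installed_version` and the list of packages checked are illustrative choices, not part of the original notebook.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages used later in this tutorial (the list itself is an assumption).
for name in ("unsloth", "vllm", "torch", "transformers", "trl", "datasets"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```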

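## Step 2: Load the pretrained model and tokenizer

In this step we will:

1. **Import the required libraries**: `FastModel` is Unsloth's fast model-loading interface
2. **Set parameters**: define the maximum sequence length, which bounds how much text the model can process at once
3. **Load the model**: load the pretrained Gemma3-4B model from a local path
4. **Configure precision**: train in 16-bit precision, which balances quality against GPU memory use

**Key parameters**:
- `max_seq_length=1024`: the maximum number of tokens the model handles per sequence
- `load_in_4bit=False, load_in_8bit=False`: load without quantization, so the weights stay in 16-bit precision
- `full_finetuning=False`: use LoRA rather than full-parameter fine-tuning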

In [4]:
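# Import the required libraries
from unsloth import FastModel  # Unsloth's fast model-loading interface
import torch  # PyTorch deep learning framework

# Maximum sequence length
# This sets the longest text, measured in tokens, that the model will process
max_seq_length = 1024

# Load the pretrained model and tokenizer
# Here we load the instruction-tuned version of Gemma3-4B
model, tokenizer = FastModel.from_pretrained(
    model_name = "/opt/tiger/test0/models/gemma-3-4b-it",  # local model path
    max_seq_length = max_seq_length,  # maximum sequence length
    load_in_4bit = False,   # no 4-bit quantization
    load_in_8bit = False,   # no 8-bit quantization
    full_finetuning = False,  # fine-tune with LoRA rather than updating all parameters
)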

==((====))==  Unsloth 2025.6.11: Fast Gemma3 patching. Transformers: 4.52.4. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA A800-SXM4-80GB. Num GPUs = 1. Max memory: 79.325 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


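## Step 3: Configure LoRA (Low-Rank Adaptation)

LoRA is a parameter-efficient fine-tuning technique: instead of modifying the original model weights directly, it adds small low-rank adapter matrices and trains only those. This has three benefits:

1. **Lower GPU memory use**: only a small number of new parameters are trained
2. **Base model preserved**: the original weights stay frozen, which makes adapters easy to manage
3. **Higher efficiency**: training is faster

**Key parameters**:
- `r=8`: the LoRA rank, which sets the size of the adapter matrices. A larger rank gives the adapter more capacity but adds parameters and can overfit
- `lora_alpha=8`: the LoRA scaling factor, usually set equal to or somewhat larger than `r`
- `lora_dropout=0`: the dropout rate applied to the adapters; 0 disables it
- `finetune_attention_modules=True`: fine-tune the attention modules, which this tutorial treats as especially important for GRPO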
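
To make the effect of `r` and `lora_alpha` concrete, here is a small sketch of the standard LoRA bookkeeping: for a weight matrix of shape `d_out x d_in`, LoRA trains two matrices `A` (`r x d_in`) and `B` (`d_out x r`), and the update `B @ A` is scaled by `lora_alpha / r`. The layer size used below is a hypothetical value chosen for illustration, not one read from the Gemma3 configuration.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r


def lora_scale(lora_alpha: float, r: int) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / r


# Hypothetical square projection layer, for illustration only.
d_in, d_out, r = 2560, 2560, 8
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full matrix: {full:,} parameters")
print(f"LoRA adapter: {added:,} parameters")
print(f"scale for lora_alpha=8, r=8: {lora_scale(8, 8)}")
```

With `lora_alpha` equal to `r`, the scale is 1.0, so the adapter output is added to the frozen layer's output without extra amplification.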


In [5]:
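# Configure the LoRA (Low-Rank Adaptation) parameters
# This wraps the base model as a PEFT (Parameter-Efficient Fine-Tuning) model
model = FastModel.get_peft_model(
    model,
    # Layer selection: which parts of the model take part in fine-tuning
    finetune_vision_layers     = False, # skip the vision layers (text-only task)
    finetune_language_layers   = True,  # fine-tune the language layers (required)
    finetune_attention_modules = True,  # fine-tune the attention modules (important for GRPO)
    finetune_mlp_modules       = True,  # fine-tune the MLP modules (recommended to keep on)

    # Core LoRA parameters
    r = 8,              # LoRA rank: adapter size; larger means more capacity but a higher risk of overfitting
    lora_alpha = 8,     # LoRA scaling factor: typically equal to r or somewhat larger
    lora_dropout = 0,   # LoRA dropout rate: 0 disables dropout
    bias = "none",      # do not train bias terms
    random_state = 3407, # random seed, for reproducibility
)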

Unsloth: Making `model.base_model.model.model.language_model` require gradients


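## Step 4: Load and explore the GSM8K dataset

GSM8K is a dataset of grade-school math word problems and is well suited to testing a model's reasoning ability. We will:

1. **Load the dataset**: load the GSM8K training split from a local path
2. **Explore the data structure**: look at the format of the questions and answers
3. **Understand the answer format**: each GSM8K answer contains the reasoning steps followed by the final answer, which is marked with `####`

Let's start with the dataset's basic information and a sample: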


In [None]:
# Load the GSM8K dataset
from datasets import load_dataset

# Load the training split of GSM8K from a local path
# GSM8K is a dataset of grade-school math reasoning problems
dataset = load_dataset("/opt/tiger/test0/datasets/gsm8k", "main", split = "train")

# Show basic information about the dataset
print(f"Dataset size: {len(dataset)} records")
print(f"Dataset features: {dataset.features}")
dataset

In [7]:
# Look at the question in the first sample
# This is a typical grade-school math problem
print("Sample question:")
print(dataset[0]["question"])

'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?'

In [8]:
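# Look at the answer in the first sample
# Note the format: reasoning steps, then the final answer after ####
print("Sample answer:")
print(dataset[0]["answer"])
print("\nObservations:")
print("1. The answer contains detailed reasoning steps")
print("2. The final answer follows the #### marker")
print("3. This format helps the model learn the reasoning process")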

'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'

In [9]:
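# Define a function that extracts the final answer
# In GSM8K, the final answer comes after the #### marker
def extract_hash_answer(text):
    """
    Extract the final numeric answer from a GSM8K answer string.

    Args:
        text (str): the full text, containing the reasoning and the final answer

    Returns:
        str or None: the extracted final answer, or None if there is no #### marker
    """
    if "####" not in text:
        return None
    # Split on the marker, take the part after it, and strip surrounding whitespace
    return text.split("####")[1].strip()

# Test the extraction function
final_answer = extract_hash_answer(dataset[0]["answer"])
print(f"Extracted final answer: {final_answer}")

# Display the result to verify it
final_answer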

'72'
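
The extracted answer is a string, and the reward functions later convert it with `float()`. Python's `float()` rejects thousands separators, so a reference answer written as `1,500` would fail that conversion. If your copy of the dataset contains such answers (an assumption worth checking), a small normalization step could be applied after extraction. The helper `normalize_numeric_answer` below is hypothetical and not part of the original notebook.

```python
def normalize_numeric_answer(answer: str) -> str:
    """Hypothetical helper: strip whitespace, thousands separators, and a leading '$'."""
    return answer.strip().replace(",", "").lstrip("$")


for raw in ["72", " 1,500 ", "$2,000"]:
    cleaned = normalize_numeric_answer(raw)
    # float() succeeds on the cleaned string; on "1,500" it would raise ValueError.
    print(repr(raw), "->", repr(cleaned), "->", float(cleaned))
```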

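## Step 5: Design the output format and system prompt

To teach the model to answer in a specific format, we need to:

1. **Define format markers**: set start and end markers that delimit each part of the output
2. **Write a system prompt**: tell the model how to structure its output
3. **Enforce consistency**: check format compliance during training

The format has two parts:
- **Reasoning**: placed between `<start_working_out>` and `<end_working_out>`
- **Final answer**: placed between `<SOLUTION>` and `</SOLUTION>`

Structured output like this helps to:
- Evaluate the quality of the model's reasoning
- Extract the final answer easily
- Make the output more readable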


In [10]:
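# Define the markers for the output format
# These markers let us identify and score the different parts of the model output
reasoning_start = "<start_working_out>"  # start of the reasoning
reasoning_end   = "<end_working_out>"    # end of the reasoning
solution_start = "<SOLUTION>"            # start of the final answer
solution_end = "</SOLUTION>"             # end of the final answer

# Build the system prompt
# This prompt tells the model to answer in the format we expect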
system_prompt = f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""

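print("System prompt:")
print(system_prompt)
print("\nThe prompt tells the model to:")
print("1. Think about the problem")
print("2. Put its reasoning between the working-out markers")
print("3. Put its final answer between the SOLUTION markers")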

system_prompt

'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION>'

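## Step 6: Convert the dataset format

Next we convert the raw GSM8K data into a format suitable for GRPO training:

1. **Build the conversation format**: turn each question into a conversation with a system message and a user message
2. **Extract the reference answer**: use the function defined earlier to pull out the final answer
3. **Assemble the training samples**: each sample holds a prompt (`prompt`) and a reference answer (`answer`)

Format after conversion:
- `prompt`: a list of messages containing the system message and the user's question
- `answer`: the extracted numeric answer, used to score the correctness of the model output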


In [11]:
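# Convert the dataset format
# Turn the raw records into conversation-style prompts for training
dataset = dataset.map(lambda x: {
    # Build the conversation prompt from the system prompt and the user's question
    "prompt" : [
        {"role": "system", "content": system_prompt},  # system message: specifies the output format
        {"role": "user",   "content": x["question"]},  # user message: the math problem itself
    ],
    # Extract the reference answer, used later to compute rewards
    "answer": extract_hash_answer(x["answer"]),
})

print("Converted data format:")
print("1. prompt holds the system prompt and the user's question")
print("2. answer is the extracted numeric answer")
print(f"3. The dataset size is unchanged: {len(dataset)} records")

# Look at the first converted sample
dataset[0]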

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'answer': '72',
 'prompt': [{'content': 'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION>',
   'role': 'system'},
  {'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
   'role': 'user'}]}
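
The same conversion can be reproduced on a single record without the `datasets` library, which makes the resulting structure easier to inspect. This is a sketch only: `to_grpo_sample` is a hypothetical name, the system prompt is replaced by a placeholder string, and the question text is shortened.

```python
def extract_hash_answer(text):
    """Return the text after the #### marker, or None if the marker is absent."""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


def to_grpo_sample(record, system_prompt):
    """Build the prompt/answer pair for one raw GSM8K-style record."""
    return {
        "prompt": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": record["question"]},
        ],
        "answer": extract_hash_answer(record["answer"]),
    }


record = {
    "question": "How many clips did Natalia sell altogether in April and May?",
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
              "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
              "#### 72",
}
sample = to_grpo_sample(record, system_prompt="(system prompt goes here)")
print(sample["answer"])
print([message["role"] for message in sample["prompt"]])
```

Note that the mapped `answer` field replaces the original full-text answer, which is why the converted sample above shows `'answer': '72'`.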

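## Step 7: Design the reward functions

GRPO is driven by reward functions that guide what the model learns. We will design four reward functions, each scoring a different aspect of the model output:

1. **Exact format match** (`match_format_exactly`): checks whether the output follows the full format
2. **Approximate format match** (`match_format_approximately`): checks how often each format marker appears
3. **Answer correctness** (`check_answer`): checks whether the extracted answer is correct
4. **Number extraction** (`check_numbers`): checks whether a number can be extracted from the solution and whether it equals the reference answer

### 7.1 Define a regular expression that matches the expected format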


In [12]:
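# Import the regular expression library
import re

# Define the regular expression for the expected output format
# It requires all the markers to be present and in the correct order
match_format = re.compile(
    rf"^[\s]{{0,}}"      # optional leading whitespace
    rf"{reasoning_start}.+?{reasoning_end}.*?"  # reasoning section (non-greedy)
    rf"{solution_start}(.+?){solution_end}"     # solution section (the group captures the answer)
    rf"[\s]{{0,}}$",     # optional trailing whitespace
    flags = re.MULTILINE | re.DOTALL  # MULTILINE: ^ and $ match at line boundaries; DOTALL: . also matches newlines
)

print("About the regular expression:")
print("1. Matches the reasoning from <start_working_out> to <end_working_out>")
print("2. Matches the final answer from <SOLUTION> to </SOLUTION>")
print("3. Captures the content inside the SOLUTION markers as the answer")
print("4. Allows leading and trailing whitespace")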

In [13]:
# Test the regular expression
# Use a simple example to verify that it matches the format
test_text = "<start_working_out>Let me think!<end_working_out>" + "<SOLUTION>2</SOLUTION>"

match_result = match_format.search(test_text)

print("Test text:", test_text)
print("Match result:", match_result)
if match_result:
    print("Extracted answer:", match_result.group(1))
    print("✓ The regular expression works")
else:
    print("✗ The regular expression failed to match")

match_result

<re.Match object; span=(0, 71), match='<start_working_out>Let me think!<end_working_out>>
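
One subtlety of this pattern is the `re.MULTILINE` flag: `^` and `$` then match at the start and end of any line, not only at the start and end of the whole string. As a result, extra text on an earlier or later line does not prevent a match, while extra text on the same line as the opening marker does. The sketch below rebuilds the same pattern and tries a few inputs; the test strings are made up for illustration.

```python
import re

reasoning_start, reasoning_end = "<start_working_out>", "<end_working_out>"
solution_start, solution_end = "<SOLUTION>", "</SOLUTION>"

# Same pattern as in the cell above.
match_format = re.compile(
    rf"^[\s]{{0,}}"
    rf"{reasoning_start}.+?{reasoning_end}.*?"
    rf"{solution_start}(.+?){solution_end}"
    rf"[\s]{{0,}}$",
    flags=re.MULTILINE | re.DOTALL,
)

body = "<start_working_out>2 + 2 = 4<end_working_out><SOLUTION>4</SOLUTION>"
cases = {
    "exact": body,
    "preamble on an earlier line": "Here is my answer:\n" + body,
    "text on a later line": body + "\nHope that helps!",
    "preamble on the same line": "Answer: " + body,
    "missing reasoning": "<SOLUTION>4</SOLUTION>",
}
results = {name: match_format.search(text) for name, text in cases.items()}
for name, m in results.items():
    print(f"{name}: {'match, answer=' + m.group(1) if m else 'no match'}")
```

If a stricter check is wanted, dropping `re.MULTILINE` anchors the pattern to the whole completion; whether that is preferable depends on how much surrounding text the reward should tolerate.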

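### 7.2 Reward function 1: exact format match

This function checks whether the model output matches the format defined above. A matching output earns this function's full reward of 3.0 points.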


In [14]:
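def match_format_exactly(completions, **kwargs):
    """
    Reward function 1: check whether the output follows the required format.

    Args:
        completions: list of model completions (each a list of message dicts)
        **kwargs: other arguments (unused)

    Returns:
        list: one reward score per completion
    """
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]

        # Reward outputs that match the expected format
        if match_format.search(response) is not None:
            score += 3.0

        scores.append(score)

    return scores

print("Reward function 1:")
print("- Checks that the output has the complete reasoning and solution format")
print("- Format correct: +3.0 points")
print("- Format incorrect: 0 points")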

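### 7.3 Reward function 2: approximate format match

This function counts how many times each format marker appears. It rewards a marker that appears exactly once and penalizes one that is missing or repeated.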


In [15]:
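def match_format_approximately(completions, **kwargs):
    """
    Reward function 2: count the occurrences of each format marker.

    This check is more lenient than the exact match: it tests each marker
    separately. A marker that appears exactly once earns a reward; one that
    appears zero times or more than once is penalized.

    Args:
        completions: list of model completions (each a list of message dicts)
        **kwargs: other arguments (unused)

    Returns:
        list: one reward score per completion
    """
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]

        # Count each marker; ideally every marker appears exactly once
        score += 0.5 if response.count(reasoning_start) == 1 else -0.5
        score += 0.5 if response.count(reasoning_end)   == 1 else -0.5
        score += 0.5 if response.count(solution_start)  == 1 else -0.5
        score += 0.5 if response.count(solution_end)    == 1 else -0.5

        scores.append(score)
    return scores

print("Reward function 2:")
print("- Counts the occurrences of each format marker")
print("- Marker appears exactly once: +0.5 points")
print("- Marker appears zero times or more than once: -0.5 points")
print("- Total score range: -2.0 to +2.0")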
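
A few worked cases show how the per-marker scoring adds up. The sketch below reimplements the counting logic for a single response string; `marker_score` is a hypothetical helper and the responses are invented examples.

```python
MARKERS = ("<start_working_out>", "<end_working_out>", "<SOLUTION>", "</SOLUTION>")


def marker_score(response: str) -> float:
    """+0.5 for each marker that appears exactly once, -0.5 otherwise."""
    return sum(0.5 if response.count(marker) == 1 else -0.5 for marker in MARKERS)


well_formed = "<start_working_out>48 + 24 = 72<end_working_out><SOLUTION>72</SOLUTION>"
unclosed = "<start_working_out>48 + 24 = 72<end_working_out><SOLUTION>72"
repeated = well_formed + "<SOLUTION>72</SOLUTION>"
no_markers = "The answer is 72."

for name, text in [("well formed", well_formed), ("unclosed solution", unclosed),
                   ("repeated solution", repeated), ("no markers", no_markers)]:
    print(f"{name}: {marker_score(text)}")
```

An output that is almost right (one marker missing) still scores 1.0 here, so the model gets a learning signal before it can satisfy the exact-format check.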

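### 7.4 Reward function 3: answer correctness

This is the most important reward function: it evaluates whether the model gave the correct answer. It scores in tiers:

1. **Exact match**: the extracted answer equals the reference answer exactly (+3.0 points)
2. **Whitespace-insensitive match**: the answers are equal after stripping surrounding whitespace (+1.5 points)
3. **Numerically close**: the answer is within 10% (+0.5 points) or 20% (+0.25 points) of the reference
4. **Penalty for errors**: a wrong answer loses 1.0 point, and an answer that cannot be parsed as a number loses 0.5 points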


In [16]:
def check_answer(prompts, completions, answer, **kwargs):
    """
    Reward function 3: check the correctness of the answer.

    Scores in tiers, from exact match down to approximate numeric match.

    Args:
        prompts: list of input prompts
        completions: list of model completions
        answer: list of reference answers
        **kwargs: other arguments

    Returns:
        list: one reward score per completion
    """
    question = prompts[0][-1]["content"]  # not used in scoring; handy when debugging
    responses = [completion[0]["content"] for completion in completions]

    # Extract the answer from each model output
    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0

        # No answer could be extracted: score 0
        if guess is None:
            scores.append(0)
            continue

        # Exact match: highest reward
        if guess == true_answer:
            score += 3.0
        # Match after stripping whitespace: high reward
        elif guess.strip() == true_answer.strip():
            score += 1.5
        else:
            # Numeric closeness: allow some error for numeric answers
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1:    # within 10%
                    score += 0.5
                elif ratio >= 0.8 and ratio <= 1.2:    # within 20%
                    score += 0.25
                else:
                    score -= 1.0 # penalty for a wrong answer
            except (ValueError, ZeroDivisionError):
                score -= 0.5 # penalty when the answer is not a number (or the reference is 0)

        scores.append(score)
    return scores

print("Reward function 3:")
print("- Exact match: +3.0 points")
print("- Match after stripping whitespace: +1.5 points")
print("- Within 10%: +0.5 points")
print("- Within 20%: +0.25 points")
print("- Wrong answer: -1.0 points")
print("- Cannot be parsed: -0.5 points")
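
The tiers are easier to follow on concrete inputs. The sketch below applies the same rules to one already-extracted guess; `score_guess` is a hypothetical helper written for illustration, and the example values are invented.

```python
def score_guess(guess: str, true_answer: str) -> float:
    """Apply the tiered scoring to one extracted answer string."""
    if guess == true_answer:
        return 3.0
    if guess.strip() == true_answer.strip():
        return 1.5
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):
        return -0.5
    if 0.9 <= ratio <= 1.1:
        return 0.5
    if 0.8 <= ratio <= 1.2:
        return 0.25
    return -1.0


examples = ["72", " 72 ", "72.0", "75", "60", "100", "seventy-two"]
for guess in examples:
    print(repr(guess), "->", score_guess(guess, "72"))
```

Note that `"72.0"` is numerically equal to the reference but is not the same string, so it falls through to the ratio check and earns only 0.5 here. The number-based reward function in the next section compares values as floats and would give such an answer credit.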

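### 7.5 Reward function 4: number extraction

This function checks whether the model puts a valid number inside the SOLUTION markers. It uses a simpler regular expression that extracts the first number after `<SOLUTION>`.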


In [17]:
# Define the regular expression used to extract numbers
# It captures the first run of digits and dots after the <SOLUTION> marker
match_numbers = re.compile(
    rf"{solution_start}.*?([\d\.]{{1,}})",  # first number after <SOLUTION> (decimals included)
    flags = re.MULTILINE | re.DOTALL        # DOTALL lets . match newlines
)

# Test the number extraction
test_solution = "<SOLUTION>  0.34  </SOLUTION>"
extracted_numbers = match_numbers.findall(test_solution)

print(f"Test text: {test_solution}")
print(f"Extracted numbers: {extracted_numbers}")
print("✓ The number-extraction regex works" if extracted_numbers else "✗ Number extraction failed")

extracted_numbers

['0.34']
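
Because the pattern only looks for digits and dots, it has a few edge cases worth knowing about: it drops a leading minus sign, stops at a thousands separator, and does not require the closing `</SOLUTION>` marker. The sketch below rebuilds the same pattern and shows these cases on invented inputs.

```python
import re

solution_start = "<SOLUTION>"

# Same pattern as in the cell above.
match_numbers = re.compile(
    rf"{solution_start}.*?([\d\.]{{1,}})",
    flags=re.MULTILINE | re.DOTALL,
)


def first_number(text: str):
    """Return the first captured number string, or None if nothing matches."""
    m = match_numbers.search(text)
    return m.group(1) if m else None


cases = [
    "<SOLUTION>0.34</SOLUTION>",
    "<SOLUTION>The answer is 72 dollars</SOLUTION>",
    "<SOLUTION>1,500</SOLUTION>",          # stops at the comma
    "<SOLUTION>-5</SOLUTION>",             # the sign is lost
    "<SOLUTION>none</SOLUTION> see 42",    # runs past the closing marker
    "no solution marker, just 7",
]
for text in cases:
    print(repr(text), "->", first_number(text))
```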

In [18]:
def check_numbers(prompts, completions, answer, **kwargs):
    """
    Reward function 4: check number extraction.

    Checks whether the model puts a valid number inside the SOLUTION markers
    and compares it with the reference answer for exact numeric equality.

    Args:
        prompts: list of input prompts
        completions: list of model completions
        answer: list of reference answers
        **kwargs: other arguments

    Returns:
        list: one reward score per completion
    """
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    # Extract a number from each response with the number-extraction regex
    extracted_responses = [
        guess.group(1)
        if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]

    scores = []

    # Print debugging information (shown during training)
    print('*'*20, f"Question:\n{question}",
          f"\nAnswer:\n{answer[0]}",
          f"\nResponse:\n{responses[0]}",
          f"\nExtracted:\n{extracted_responses[0]}")

    for guess, true_answer in zip(extracted_responses, answer):
        # No number could be extracted: score 0
        if guess is None:
            scores.append(0)
            continue

        # Convert the extracted answer and the reference answer to numbers and compare
        try:
            true_answer_num = float(true_answer.strip())
            guess_num = float(guess.strip())
            # Reward an exact numeric match; otherwise score 0
            scores.append(1.5 if guess_num == true_answer_num else 0.0)
        except (ValueError, AttributeError):
            # Conversion failed: score 0
            scores.append(0)
            continue

    return scores

print("Reward function 4:")
print("- Checks number extraction from the SOLUTION markers")
print("- Exact numeric match: +1.5 points")
print("- No number extracted, or no match: 0 points")
print("- Encourages the model to output a valid number")

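## Step 8: Configure the GRPO training parameters

Now we configure the GRPO training parameters, which control every aspect of the training run:

### Key parameters:

**Learning rate**:
- `learning_rate=5e-6`: a small learning rate, for stable training
- `warmup_ratio=0.1`: learning-rate warmup; the rate ramps up over the first 10% of steps

**Batching and generation**:
- `per_device_train_batch_size=1`: the per-device batch size. Unsloth expects this to be a multiple of `num_generations` and raises it to 4 for this configuration (see the message printed after the cell)
- `num_generations=4`: generate 4 candidate answers per prompt and compare them

**Sequence lengths**:
- `max_prompt_length=256`: the maximum length of the input prompt, in tokens
- `max_completion_length`: the maximum length of the generated completion, set to `max_seq_length - max_prompt_length` = 768 tokens

**Training control**:
- `max_steps=50`: a short run for demonstration (use more steps for real training)
- `report_to="swanlab"`: use SwanLab for visual monitoring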
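
The derived quantities follow from simple arithmetic on the values above. The sketch below works them out and also shows one common formulation of linear warmup followed by cosine decay; the exact schedule used by the training library may differ in details, so `lr_at` should be read as an illustration rather than the library's implementation.

```python
import math

max_seq_length = 1024
max_prompt_length = 256
max_steps = 50
warmup_ratio = 0.1
learning_rate = 5e-6
num_generations = 4
effective_batch_size = 4  # batch size after it is raised to num_generations

max_completion_length = max_seq_length - max_prompt_length   # tokens left for the answer
prompts_per_step = effective_batch_size // num_generations   # distinct prompts per step
warmup_steps = math.ceil(max_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay to zero (illustrative formulation)."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"max_completion_length = {max_completion_length}")
print(f"prompts per step = {prompts_per_step}")
print(f"warmup steps = {warmup_steps}")
for step in (0, 1, 5, 25, 50):
    print(f"step {step:2d}: lr = {lr_at(step):.2e}")
```

With a batch of 4 and 4 generations per prompt, each optimizer step sees a single prompt, which is one reason the per-step reward statistics in a short run can be noisy.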


In [21]:
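# Maximum prompt length, in tokens
max_prompt_length = 256

# Import the GRPO configuration and trainer
from trl import GRPOConfig, GRPOTrainer

# Create the GRPO training configuration
training_args = GRPOConfig(
    # Optimizer parameters
    learning_rate = 5e-6,           # learning rate: GRPO typically uses a small one
    adam_beta1 = 0.9,               # Adam beta1
    adam_beta2 = 0.99,              # Adam beta2
    weight_decay = 0.1,             # weight decay, to limit overfitting
    optim = "adamw_torch_fused",    # fused AdamW optimizer, which is more efficient

    # Learning-rate schedule
    warmup_ratio = 0.1,             # fraction of steps used for warmup
    lr_scheduler_type = "cosine",   # cosine learning-rate schedule

    # Batch settings
    per_device_train_batch_size = 1,        # per-device batch size (Unsloth raises it to num_generations)
    gradient_accumulation_steps = 1,        # gradient accumulation steps (raise to 4 for smoother training)
    num_generations = 4,                    # candidates generated per prompt (lower this if GPU memory is tight)

    # Sequence lengths
    max_prompt_length = max_prompt_length,                      # maximum prompt length
    max_completion_length = max_seq_length - max_prompt_length, # maximum completion length

    # Training control
    max_steps = 50,                 # maximum training steps (demo value; use more for real training)
    # num_train_epochs = 1,         # alternatively, train by epochs instead of steps
    save_steps = 50,                # checkpoint interval, in steps
    max_grad_norm = 0.1,            # gradient clipping threshold

    # Logging and monitoring
    logging_steps = 1,              # logging interval, in steps
    report_to = "swanlab",          # experiment tracker ("wandb" also works)
    output_dir = "outputs",         # output directory
)

print("GRPO training configuration is ready!")
print(f"- Max training steps: {training_args.max_steps}")
print(f"- Generations per prompt: {training_args.num_generations}")
print(f"- Learning rate: {training_args.learning_rate}")
print("- Monitoring with SwanLab")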

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 4


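## Step 9: Run GRPO training

Now we create the GRPO trainer and start training. The process is:

1. **Initialize the trainer**: pass in the model, reward functions, and training arguments
2. **Train**: repeat the following loop:
   - Generate several candidate answers
   - Score each candidate with the reward functions
   - Update the model parameters according to the reward signal
3. **Monitor training**: follow progress in real time through SwanLab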
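
The "compare candidates" step can be sketched in a few lines. In GRPO, the rewards of the candidates generated for one prompt are compared with each other: each candidate's advantage is its total reward minus the group mean, divided by the group standard deviation. The code below is a simplified illustration of that idea, not the trainer's implementation; the per-function scores are invented, and the sample standard deviation and the small `eps` term are assumptions made for the sketch.

```python
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of candidates for the same prompt."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Invented scores for 4 candidates, one column per reward function:
# [match_format_exactly, match_format_approximately, check_answer, check_numbers]
per_function_rewards = [
    [3.0,  2.0,  3.0, 1.5],   # correct format, correct answer
    [3.0,  2.0, -1.0, 0.0],   # correct format, wrong answer
    [0.0,  1.0,  0.0, 0.0],   # one marker missing
    [0.0, -2.0,  0.0, 0.0],   # no markers at all
]

# The reward functions are summed into one total per candidate.
totals = [sum(scores) for scores in per_function_rewards]
advantages = group_relative_advantages(totals)

for total, adv in zip(totals, advantages):
    print(f"total reward {total:5.1f} -> advantage {adv:+.3f}")
```

Candidates that score above the group mean get a positive advantage and are reinforced; those below the mean are discouraged. If every candidate receives the same reward, all advantages are zero and the step carries no learning signal.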

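### What you will see during training:
- A progress bar and the loss value
- Score statistics for each reward function
- A link to the SwanLab dashboard
- Sample questions and model outputs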


In [22]:
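# Create the GRPO trainer
# The trainer brings together the model, reward functions, training arguments, and dataset
trainer = GRPOTrainer(
    model = model,                  # the model to train
    processing_class = tokenizer,   # the tokenizer (used for text processing)

    # Reward functions: these score the quality of the model output
    reward_funcs = [
        match_format_exactly,           # reward function 1: exact format match
        match_format_approximately,     # reward function 2: approximate format match
        check_answer,                   # reward function 3: answer correctness
        check_numbers,                  # reward function 4: number extraction
    ],

    args = training_args,           # training configuration
    train_dataset = dataset,        # training dataset
)

print("GRPO trainer created!")
print("Reward functions:")
print("1. match_format_exactly - checks the complete format")
print("2. match_format_approximately - checks marker usage")
print("3. check_answer - checks answer correctness")
print("4. check_numbers - checks number extraction")
print("\nStarting training...")

# Start GRPO training
# Note: training prints a lot of debugging output, including questions, answers, and model responses
trainer.train()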

Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 50
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 14,901,248 of 4,314,980,720 (0.35% trained)


[1m[34mswanlab[0m[0m: Tracking run with swanlab version 0.6.4                                   
[1m[34mswanlab[0m[0m: Run data will be saved locally in [35m[1m/opt/tiger/test0/swanlog/run-20250701_195941-0e8cd89d[0m[0m
[1m[34mswanlab[0m[0m: 👋 Hi [1m[39mtwosugar[0m[0m, welcome to swanlab!
[1m[34mswanlab[0m[0m: Syncing run [33moutputs[0m to the cloud
[1m[34mswanlab[0m[0m: 🏠 View project at [34m[4mhttps://swanlab.cn/@twosugar/test0[0m[0m
[1m[34mswanlab[0m[0m: 🚀 View run at [34m[4mhttps://swanlab.cn/@twosugar/test0/runs/cmax5v7at0zpzpqk94cbg[0m[0m


******************** Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
 <start_working_out> 

 <end_working_out> 

 <SOLUTION> 

 **1.** Calculate the number of tickets bought that exceed 10.  Mr. Benson bought 12 tickets, so 12 - 10 = 2 tickets exceed 10.

 **2.** Calculate the discount per ticket bought that exceeds 10. The discount is 5% of $40.  0.05 * $40 = $2.  However, the discount is calculated *only* for those tickets exceeding 10.  Therefore, the discount per ticket is $2.

 **3.**  Calculate the total discount.  Mr. Benson bought 2 tickets that exceed 10, so the total discount is 2 * $2 = $4.

 **4.**  Calculate the total cost before discount. Mr. Benson bought 12 tickets at $40 each, so the total cost is 12 * $40 = $480.

 **5.**  Calculate the total cost after discount.  The total cost after discount is $480 - $4 = $476.

 **6.** 

Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,kl,rewards / match_format_exactly / mean,rewards / match_format_exactly / std,rewards / match_format_approximately / mean,rewards / match_format_approximately / std,rewards / check_answer / mean,rewards / check_answer / std,rewards / check_numbers / mean,rewards / check_numbers / std
1,-0.0,0.5,1.0,768.0,768.0,768.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1.0,0.0,0.0,0.0,0.0
2,-0.0,-0.625,1.108678,768.0,768.0,768.0,1.0,0.0,0.0,0.0,0.0,0.75,1.5,-1.25,0.957427,-0.125,0.25,0.0,0.0
3,0.0,0.625,2.625992,768.0,768.0,768.0,1.0,0.0,0.0,0.0,0.000177,0.75,1.5,0.0,1.414214,-0.125,0.25,0.0,0.0
4,0.0,3.625,3.944933,768.0,768.0,768.0,1.0,0.0,0.0,0.0,0.00062,0.75,1.5,1.75,0.5,0.75,1.5,0.375,0.75
5,0.0,3.75,2.020726,768.0,768.0,768.0,1.0,0.0,0.0,0.0,0.001447,1.5,1.732051,2.0,0.0,0.25,0.288675,0.0,0.0
6,0.0,5.0,1.683251,768.0,768.0,768.0,1.0,0.0,0.0,0.0,0.000681,2.25,1.5,1.75,0.5,-0.125,0.478714,1.125,0.75
7,0.0,0.0,1.154701,768.0,768.0,768.0,1.0,0.0,0.0,0.0,0.000422,0.0,0.0,0.0,1.154701,0.0,0.0,0.0,0.0
8,0.0,0.25,1.5,768.0,768.0,768.0,1.0,0.0,0.0,0.0,0.000513,0.0,0.0,0.25,1.5,0.0,0.0,0.0,0.0
9,0.0,1.375,2.688711,768.0,768.0,768.0,1.0,0.0,0.0,0.0,0.000646,0.75,1.5,0.75,1.892969,-0.125,0.25,0.0,0.0
10,0.0,1.75,2.020726,768.0,768.0,768.0,1.0,0.0,0.0,0.0,0.000739,2.25,1.5,-0.5,1.914854,-0.375,0.25,0.375,0.75


******************** Question:
Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer? 
Answer:
1500 
Response:
Here's the working out, placed between <start_working_out> and <end_working_out>.

<start_working_out>

Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer?

1.  Calculate the monthly payment for the house.
    *   Loan amount: $480,000
    *   Number of months: 20 years * 12 months/year = 240 months
    *   Monthly payment: $480,000 / 240 months = $2,000/month

2.  Calculate the monthly payment for the trailer.
    *   Loan amount: $120,000
    *   Number of months: 20 years * 12 months/

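## Step 10: Test the trained model

With training finished, let's test whether the model has learned to answer in the expected format. We will:

1. **Build the test messages**: use the system prompt together with a new math question
2. **Generate an answer**: have the fine-tuned model answer the question
3. **Inspect the output**: check whether the model follows the format we defined

This test helps verify the effect of GRPO training.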


In [None]:
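# Build the test messages
# Same system prompt as in training, with a new question
messages = [
    {"role": "system", "content": system_prompt},  # same format instructions
    {"role": "user",   "content": "What is the sqrt of 101?"},  # a new math question
]

# Convert the messages into the model's input format
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,  # append the generation prompt so the model starts its answer
    tokenize = False,              # return text for now rather than token IDs
)

print("Test question: What is the sqrt of 101?")
print("Expected output format:")
print("- Reasoning inside <start_working_out> ... <end_working_out>")
print("- Final answer inside <SOLUTION> ... </SOLUTION>")
print("\nModel output:")

# Import the text streamer, which prints tokens as they are generated
from transformers import TextStreamer

# Generate the answer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),  # tokenize into tensors and move them to the GPU
    max_new_tokens = 64,       # limit on output length (increase it to see a complete answer)

    # Generation parameters recommended for Gemma-3
    temperature = 1.0,         # controls the randomness of the output
    top_p = 0.95,              # nucleus sampling parameter
    top_k = 64,                # top-k sampling parameter

    # Stream the output as it is generated
    streamer = TextStreamer(tokenizer, skip_prompt = True),  # skip the prompt and show only the generated text
)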

# <start_working_out>
# We want to find the square root of 101, which is written as √101.

# Since 101 is a prime number, its only factors are 1 and 101. Therefore, its square root is not an integer.

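## Step 11: Save the trained model

Once training is done and the results are verified, we save the model for later use. There are several ways to do this:

1. **Save the LoRA adapter**: store only the trained LoRA weights (small files; recommended)
2. **Save the full model**: merge the LoRA weights into the base model and store the result
3. **Save in GGUF format**: store a quantized GGUF file, which is convenient for deployment

### Save the LoRA adapter (recommended)

This option stores only the LoRA weights added during training. The files are small (typically tens of MB), and the original base model must be loaded alongside them at inference time.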


In [None]:
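# Save the LoRA adapter (recommended)
# This stores only the LoRA weights added during training, so the files are small
print("Saving the LoRA adapter...")

model.save_pretrained("gemma-3")      # save the model (LoRA weights)
tokenizer.save_pretrained("gemma-3")  # save the tokenizer

print("✓ LoRA adapter and tokenizer saved to the 'gemma-3' directory")
print("Saved files:")
print("- adapter_config.json: LoRA configuration")
print("- adapter_model.safetensors: LoRA weights")
print("- tokenizer files")
print("\nTo use the adapter:")
print("1. Load the original Gemma3-4B model")
print("2. Load this LoRA adapter on top of it")
print("3. The result is the fine-tuned model")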
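
After saving, a quick way to confirm what was written is to read `adapter_config.json` back and list the files next to it. The sketch below uses only the standard library. The function name `summarize_adapter` is hypothetical, and the keys it reads (`base_model_name_or_path`, `r`, `lora_alpha`) are the ones PEFT adapter configs normally contain; treat them as assumptions if your file differs.

```python
import json
from pathlib import Path


def summarize_adapter(adapter_dir: str) -> dict:
    """Read adapter_config.json and list the files saved alongside it."""
    path = Path(adapter_dir)
    config = json.loads((path / "adapter_config.json").read_text(encoding="utf-8"))
    return {
        "base_model": config.get("base_model_name_or_path"),
        "r": config.get("r"),
        "lora_alpha": config.get("lora_alpha"),
        "files": sorted(p.name for p in path.iterdir() if p.is_file()),
    }


# Usage, once the directory from the cell above exists:
# print(summarize_adapter("gemma-3"))
```

The `base_model` entry records which base model the adapter expects, which is useful to check before loading the adapter on another machine.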

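### Optional: save the full model

If you want to merge the LoRA weights into the base model and save a complete set of model files, use the following code: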


In [None]:
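# Optional: save the full fine-tuned model
# Merges the LoRA weights into the base model and writes a complete model
if False:  # set to True to run the save
    print("Saving the full fine-tuned model...")
    model.save_pretrained_merged("gemma-3-finetune", tokenizer)
    print("✓ Full model saved to the 'gemma-3-finetune' directory")
    print("Note: the full model is large (several GB), but it can be used without the base model")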

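### Optional: save in GGUF format

GGUF is a model file format that supports quantization, which makes it suitable for deployment in resource-constrained environments: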


In [None]:
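# Optional: save in GGUF format
# GGUF supports quantization, which gives smaller files and often faster inference
if False:  # set to True to run the save
    print("Saving the model in GGUF format...")
    model.save_pretrained_gguf(
        "gemma-3-finetune",
        quantization_type = "Q8_0",  # quantization type: Q8_0, BF16, and F16 are currently supported
    )
    print("✓ GGUF model saved")
    print("Properties:")
    print("- Smaller files (through quantization)")
    print("- Often faster inference")
    print("- Suitable for deployment on edge devices")
    print("- Can be loaded with tools such as llama.cpp")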
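
For a rough idea of how the saving options compare in size, the weights can be estimated from the parameter count and the bits stored per weight. The sketch below assumes about 4.3 billion parameters (in line with the total printed in the training banner) and 8.5 bits per weight for `Q8_0`, which corresponds to blocks of 32 8-bit values plus one 16-bit scale. It ignores file metadata and any tensors kept at a different precision, so the numbers are estimates only.

```python
# Approximate bits stored per weight for each format (estimates, see lead-in).
BITS_PER_WEIGHT = {"F16": 16.0, "BF16": 16.0, "Q8_0": 8.5}


def estimated_size_gb(n_params: int, quantization_type: str) -> float:
    """Estimated weight storage in GB (10^9 bytes) for a given format."""
    return n_params * BITS_PER_WEIGHT[quantization_type] / 8 / 1e9


n_params = 4_300_000_000  # rough parameter count, assumed for illustration
for quantization_type in ("F16", "BF16", "Q8_0"):
    size = estimated_size_gb(n_params, quantization_type)
    print(f"{quantization_type}: about {size:.1f} GB")
```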