# Phi-4 GRPO Reinforcement Learning Training with SwanLab Visualization

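## Overview
This tutorial shows how to train the Phi-4 model with reinforcement learning using the GRPO (Group Relative Policy Optimization) algorithm, and how to monitor the training run with SwanLab.

GRPO is a reward-based policy optimization method: it samples several completions for each prompt, scores them with reward functions, and favors the completions that score above their group's average. Here the rewards push the model toward better reasoning and well-formatted output.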
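
To make the "group relative" idea concrete, here is a minimal sketch of how per-completion advantages could be derived from the rewards of one group. The function name, the example rewards, and the small `eps` term are illustrative assumptions, and real implementations may differ in detail (for instance in which standard deviation they use).

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize the rewards of completions sampled for the same prompt.

    Each advantage is the reward minus the group mean, divided by the group
    standard deviation (population std here; eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for six completions of one prompt.
rewards = [0.0, 0.125, 0.5, 0.0, 0.25, 0.125]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f}  advantage={a:+.3f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; when every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.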

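## Step 1: Set up the environment and install dependencies

First install the required packages, including Unsloth and vLLM.

In [None]:
# Install the required packages
# unsloth: optimized library for fast model training and inference
# vllm: high-performance inference engine for large language models
# swanlab: experiment tracking and visualization (used for the training dashboard)
# pip install unsloth vllm swanlab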

In [1]:
# Import the required libraries
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch

# Model configuration
max_seq_length = 512  # maximum sequence length; increase it to allow longer reasoning chains
lora_rank = 16  # LoRA rank; a larger rank adds capacity but trains more slowly

# Load the pretrained model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-4",  # Phi-4 checkpoint; loaded quantized below to save memory
    max_seq_length = max_seq_length,  # maximum sequence length
    load_in_4bit = True,  # 4-bit quantization to reduce memory use; set to False for 16-bit LoRA
    fast_inference = True,  # enable the vLLM fast inference engine
    max_lora_rank = lora_rank,  # maximum LoRA rank
    gpu_memory_utilization = 0.7,  # fraction of GPU memory to use; lower it if you run out of memory
)

# Configure the PEFT (Parameter-Efficient Fine-Tuning) model
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,  # LoRA rank; suggested values: 8, 16, 32, 64, 128
    target_modules = ["gate_proj", "up_proj", "down_proj"],  # target modules: the MLP projections of each layer
    lora_alpha = lora_rank,  # LoRA alpha, commonly set equal to the rank
    use_gradient_checkpointing = "unsloth",  # gradient checkpointing, to support long-context fine-tuning
    random_state = 3407,  # random seed for reproducibility
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-02 17:05:40 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.6.12: Fast Llama patching. Transformers: 4.52.4. vLLM: 0.8.2.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.151 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading /opt/tiger/test0/Phi-4 with actual GPU utilization = 69.6%
Unsloth: Your GPU has CUDA compute capability 8.0 with VRAM = 79.15 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 512. Num Sequences = 320.
Unsloth: vLLM's KV Cache can use up to 27.66 GB. Also swap spa

Loading safetensors checkpoint shards:   0% Completed | 0/6 [00:00<?, ?it/s]


INFO 07-02 17:08:01 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 07-02 17:08:01 [model_runner.py:1146] Model loading took 8.6253 GB and 7.670299 seconds
INFO 07-02 17:08:07 [worker.py:267] Memory profiling takes 5.35 seconds
INFO 07-02 17:08:07 [worker.py:267] the current vLLM instance can use total_gpu_memory (79.15GiB) x gpu_memory_utilization (0.70) = 55.09GiB
INFO 07-02 17:08:07 [worker.py:267] model weights take 8.63GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.16GiB; the rest of the memory reserved for KV Cache is 45.21GiB.
INFO 07-02 17:08:07 [executor_base.py:111] # cuda blocks: 14815, # CPU blocks: 1966
INFO 07-02 17:08:07 [executor_base.py:116] Maximum concurrency for 512 tokens per request: 462.97x
INFO 07-02 17:08:12 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If

Capturing CUDA graph shapes: 100%|██████████| 43/43 [00:58<00:00,  1.37s/it]

INFO 07-02 17:09:11 [model_runner.py:1570] Graph capturing finished in 59 secs, took 1.12 GiB
INFO 07-02 17:09:11 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 70.01 seconds





Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'k_norm', 'q_norm', 'post_feedforward_layernorm']
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'k_norm', 'q_norm', 'post_feedforward_layernorm']


Not an error, but Unsloth cannot patch Attention layers with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Not an error, but Unsloth cannot patch O projection layer with our manual autograd engine since either LoRA adapters
are not enabled or a bias term (like in Qwen) is used.
Unsloth 2025.6.12 patched 40 layers with 0 QKV layers, 0 O layers and 40 MLP layers.
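
The log above says 40 layers were patched with LoRA on the MLP projections only. As a rough cross-check, the number of trainable LoRA parameters can be worked out by hand. The sketch below assumes Phi-4 uses a hidden size of 5120 and an MLP intermediate size of 17920 (these dimensions are assumptions, not printed in the log); the helper name is illustrative.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter on one linear layer adds A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)

# Assumed Phi-4 dimensions (not shown in the log above).
HIDDEN_SIZE = 5120
INTERMEDIATE_SIZE = 17920
NUM_LAYERS = 40   # matches "patched 40 layers" in the log
LORA_RANK = 16    # lora_rank from the cell above

per_layer = (
    lora_param_count(HIDDEN_SIZE, INTERMEDIATE_SIZE, LORA_RANK)    # gate_proj
    + lora_param_count(HIDDEN_SIZE, INTERMEDIATE_SIZE, LORA_RANK)  # up_proj
    + lora_param_count(INTERMEDIATE_SIZE, HIDDEN_SIZE, LORA_RANK)  # down_proj
)
total = per_layer * NUM_LAYERS

print(f"trainable LoRA parameters: {total:,}")
print(f"approx. adapter size at 2 bytes per parameter: {total * 2 / 1e6:.1f} MB")
```

Under these assumptions the total comes to 44,236,800, which agrees with the "Trainable parameters" figure in the trainer banner later in this notebook.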


## Step 2: Prepare the dataset and reward functions

In [2]:
import re
from datasets import load_dataset, Dataset

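# 1) System prompt that tells the model to answer in the XML chain-of-thought template
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

# 2) XML_COT_FORMAT is the template of the output format that the reward functions check for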
XML_COT_FORMAT = """\
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

# -----------------------------------------------------------------
# 2-1) GSM8K data → chat-message format
# -----------------------------------------------------------------
def extract_hash_answer(text: str) -> str | None:
    """GSM8K answers follow '#### <number>'; extract that part and strip surrounding whitespace."""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

def get_gsm8k_questions(split="train") -> Dataset:
    """Wrap each GSM8K question in the chat-message prompt format."""
    data = load_dataset("openai/gsm8k", "main")[split]

    # Convert with map: every sample gets a system message and a user message
    data = data.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user",   "content": x["question"]},
            ],
            "answer": extract_hash_answer(x["answer"]),
        }
    )
    return data

dataset = get_gsm8k_questions("train")

# -----------------------------------------------------------------
# 2-2) Five reward functions (GRPO sums their scores)
# -----------------------------------------------------------------
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """
    Reward 0.5 if the completion contains <reasoning>…</reasoning> followed by
    <answer>…</answer>; otherwise 0.
    """
    pattern   = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [c[0]["content"] for c in completions]
    # re.DOTALL lets "." match newlines, since the reasoning usually spans several lines;
    # re.search finds the block anywhere in the completion.
    return [0.5 if re.search(pattern, r, flags=re.DOTALL) else 0.0 for r in responses]

def count_xml(text: str) -> float:
    """
    Finer-grained XML tag reward:
      each tag that appears exactly once, on its own line, earns 0.125;
      any text after the closing answer tag is penalized by 0.001 per character.
    """
    count = 0.0
    if text.count("<reasoning>\n")      == 1: count += 0.125
    if text.count("\n</reasoning>\n")   == 1: count += 0.125
    if text.count("\n<answer>\n")       == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>")        == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    """Apply the per-text score to every completion in the batch."""
    contents = [c[0]["content"] for c in completions]
    return [count_xml(c) for c in contents]

#   soft_format_reward_func / int_reward_func / correctness_reward_func can be added in the same way.
#   Their implementations are omitted here; they typically score things such as
#   numeric correctness and whether the answer is an integer.
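
The cell above leaves the answer-based reward functions out. For orientation, here is one possible sketch of `int_reward_func` and `correctness_reward_func`, modeled on commonly used GRPO recipes rather than taken from this notebook. The helper `extract_xml_answer` and the reward values 0.5 and 2.0 are assumptions. Note that when a completion has no `<answer>` tags, the helper returns the whole text, which is consistent with the "Extracted:" lines in the training log further down.

```python
def extract_xml_answer(text: str) -> str:
    """Return the text between the last <answer> tag and the next </answer> tag."""
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def int_reward_func(completions, **kwargs) -> list[float]:
    """Reward 0.5 when the extracted answer consists only of digits."""
    responses = [c[0]["content"] for c in completions]
    return [0.5 if extract_xml_answer(r).isdigit() else 0.0 for r in responses]

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Reward 2.0 when the extracted answer equals the reference answer."""
    responses = [c[0]["content"] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

# Two hypothetical completions for the same question (reference answer "476").
completions = [
    [{"role": "assistant", "content": "<reasoning>\n10 * 40 + 2 * 38\n</reasoning>\n<answer>\n476\n</answer>\n"}],
    [{"role": "assistant", "content": "The answer is 476."}],
]
print(int_reward_func(completions))
print(correctness_reward_func(prompts=None, completions=completions, answer=["476", "476"]))
```

Only the first completion is rewarded: the second contains the right number but not inside `<answer>` tags, so the extracted text is neither a bare integer nor equal to the reference.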

## Step 4: Configure the training arguments and start training

In [3]:
from trl import GRPOConfig, GRPOTrainer

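training_args = GRPOConfig(
    use_vllm                   = True,          # generate rollouts with vLLM if GPU memory allows
    learning_rate              = 5e-6,          # typically between 1e-6 and 5e-5
    adam_beta1                 = 0.9,
    adam_beta2                 = 0.99,
    weight_decay               = 0.1,
    warmup_ratio               = 0.1,           # warm up over the first 10% of steps
    lr_scheduler_type          = "cosine",      # cosine decay
    optim                      = "paged_adamw_8bit",  # paged 8-bit AdamW
    logging_steps              = 1,             # log every step
    per_device_train_batch_size= 1,             # Unsloth raises this to num_generations (see the message below)
    gradient_accumulation_steps= 1,             # e.g. 4 gives smoother gradients
    num_generations            = 6,             # completions sampled per prompt
    max_prompt_length          = 600,           # prompt token budget
    max_completion_length      = 128,           # completion token budget
    beta                       = 0.1,           # KL penalty coefficient; larger values keep the policy closer to the reference model
    epsilon                    = 0.1,           # clipping range for the policy ratio (PPO-style), not a KL term
    max_steps                  = 100,           # matches "Total steps = 100" in the training log below
    report_to                  = "swanlab",     # send metrics to SwanLab; assumes a transformers version with SwanLab support
)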

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6


And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [4]:
trainer = GRPOTrainer(
    model            = model,
    processing_class = tokenizer,
    reward_funcs     = [
        xmlcount_reward_func,
        # soft_format_reward_func,
        # strict_format_reward_func,
        # int_reward_func,
        # correctness_reward_func,
    ],
    args             = training_args,
    train_dataset    = dataset,
)
trainer.train()          # start LoRA + GRPO training

Detected kernel version 5.4.143, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 44,236,800 of 7,888,000,000 (0.56% trained)


[1m[34mswanlab[0m[0m: Tracking run with swanlab version 0.6.4                                   
[1m[34mswanlab[0m[0m: Run data will be saved locally in [35m[1m/opt/tiger/test0/swanlog/run-20250702_170943-0e8cd89d[0m[0m
[1m[34mswanlab[0m[0m: 👋 Hi [1m[39mtwosugar[0m[0m, welcome to swanlab!
[1m[34mswanlab[0m[0m: Syncing run [33moutputs[0m to the cloud
[1m[34mswanlab[0m[0m: 🏠 View project at [34m[4mhttps://swanlab.cn/@twosugar/test0[0m[0m
[1m[34mswanlab[0m[0m: 🚀 View run at [34m[4mhttps://swanlab.cn/@twosugar/test0/runs/rnvceekqy652of27de2o9[0m[0m


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
To determine how much Mr. Benson paid for the concert tickets, we need to break down the cost based on the discount structure:

1. **Basic Ticket Cost:**  
   Each ticket costs $40.

2. **Number of Tickets Bought:**  
   Mr. Benson bought 12 tickets.

3. **Discount Structure:**  
   A 5% discount is applied to each ticket beyond the 10th ticket.

4. **Calculating the Cost:**
   - **First 10 Tickets:**  
     These tickets are at full price.  
     \[
     10 \times 40 = 400
     \]

   - **Remaining 2 Tickets:**  
     These tickets receive a 5% discount.  
     The discount per ticket is 5% of $40:  
     \[
     0.05 \times 40 = 2
     \]  
     Therefore, the discounted price per ticket is:  
     \[
     40 
Extracted:
To determine how much Mr. Benson paid for the conce

| Step | Training Loss | reward | reward_std | completions / mean_length | completions / min_length | completions / max_length | completions / clipped_ratio | completions / mean_terminated_length | completions / min_terminated_length | completions / max_terminated_length | kl | rewards / xmlcount_reward_func / mean | rewards / xmlcount_reward_func / std | rewards / soft_format_reward_func / mean | rewards / soft_format_reward_func / std | rewards / strict_format_reward_func / mean | rewards / strict_format_reward_func / std | rewards / int_reward_func / mean | rewards / int_reward_func / std | rewards / correctness_reward_func / mean | rewards / correctness_reward_func / std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 200.0 | 200.0 | 200.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 167.0 | 2.0 | 200.0 | 0.833333 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 200.0 | 200.0 | 200.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 200.0 | 200.0 | 200.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.104167 | 0.051031 | 200.0 | 200.0 | 200.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.104167 | 0.051031 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.020833 | 0.051031 | 200.0 | 200.0 | 200.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.000168 | 0.020833 | 0.051031 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 200.0 | 200.0 | 200.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.000169 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.020833 | 0.051031 | 200.0 | 200.0 | 200.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.000385 | 0.020833 | 0.051031 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.020833 | 0.051031 | 168.833344 | 13.0 | 200.0 | 0.833333 | 13.0 | 13.0 | 13.0 | 0.00059 | 0.020833 | 0.051031 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 0.0 | 0.0 | 0.0 | 188.666672 | 132.0 | 200.0 | 0.833333 | 132.0 | 132.0 | 132.0 | 0.000202 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
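
Per-step rewards are noisy this early in training, so a smoothed view is often easier to read. The sketch below is illustrative: it copies the `Step`, `reward`, and `kl` values of the ten steps logged above into a small CSV string and computes a trailing moving average. The window size of 5 is an arbitrary choice.

```python
import csv
import io

# Step, reward and kl values copied from the ten logged steps above.
LOG = """Step,reward,kl
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0
5,0.104167,0.0
6,0.020833,0.000168
7,0.0,0.000169
8,0.020833,0.000385
9,0.020833,0.00059
10,0.0,0.000202
"""

def moving_average(values: list[float], window: int) -> list[float]:
    """Trailing mean over at most `window` values ending at each position."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

rows = list(csv.DictReader(io.StringIO(LOG)))
steps = [int(r["Step"]) for r in rows]
rewards = [float(r["reward"]) for r in rows]

smoothed = moving_average(rewards, window=5)
nonzero_steps = [s for s, r in zip(steps, rewards) if r != 0.0]

print("steps with a non-zero reward:", nonzero_steps)
for s, avg in zip(steps, smoothed):
    print(f"step {s:2d}  smoothed reward {avg:.6f}")
```

The same kind of smoothing is what a dashboard such as SwanLab applies when it draws the reward curve, so the chart and this calculation should tell the same story.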


-------------------- Question:
Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer? 
Answer:
1500 
Response:
Anne: To answer userJane's question about the monthly payments for a house versus a trailer, we need to calculate the monthly payments for each using the formula for a fixed-rate mortgage. However, to accurately determine these payments, we need additional information such as the interest rate. If we assume a common interest rate, we can provide a hypothetical comparison.

### Calculations

Assuming an interest rate of 4% per year (a common rate for mortgages), we can calculate the monthly payments using the formula:

\[ M = P \frac{r(1 + r)^n}{(1 + r)^n - 1} \]

Where:
- \( M \) is the monthly payment.
- \( P \) is the principal loan amount.
- \( r \) is the monthly interest rate (a

TrainOutput(global_step=100, training_loss=5.479767559535276e-07, metrics={'train_runtime': 1933.7955, 'train_samples_per_second': 0.31, 'train_steps_per_second': 0.052, 'total_flos': 0.0, 'train_loss': 5.479767559535276e-07})

## Step 5: Verify the training result and save the model

In [None]:
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Which is bigger? 9.11 or 9.9?"}],
    tokenize             = False,
    add_generation_prompt= True,      # append the assistant turn header so the model starts its reply
)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p       = 0.95,
    max_tokens  = 1024,
)

output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request    = None,           # None: generate with the base model, without the trained LoRA adapter
)[0].outputs[0].text

Processed prompts: 100%|██████████| 1/1 [00:16<00:00, 16.59s/it, est. speed input: 1.27 toks/s, output: 9.89 toks/s]


'9.11 is bigger than 9.9. When comparing decimal numbers, you look at the digits from left to right. Both numbers have the same whole number part (9), so you compare the digits in the tenths place next. In 9.11, the tenths place is 1, and in 9.9, the tenths place is 9. Since 1 is less than 9, you might initially think 9.9 is larger, but you also need to consider the hundredths place in 9.11, which is 1. When you express 9.9 as 9.90 for comparison, you see that 9.11 is greater than 9.90. Therefore, 9.11 is bigger than 9.9.'

In [None]:
model.save_lora("grpo_saved_lora")   # save only the LoRA adapter weights (much smaller than the full model)

In [None]:
text = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",   "content": "Which is bigger? 9.11 or 9.9?"},
    ],
    tokenize             = False,
    add_generation_prompt= True,
)

sampling_params = SamplingParams(
    temperature = 0.8,
    top_p       = 0.95,
    max_tokens  = 1024,
)
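output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request    = model.load_lora("grpo_saved_lora"),  # apply the LoRA adapter saved above
)[0].outputs[0].text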

Processed prompts: 100%|██████████| 1/1 [00:27<00:00, 27.72s/it, est. speed input: 1.70 toks/s, output: 10.03 toks/s]


'<reasoning>\nTo determine which number is bigger between 9.11 and 9.9, we should compare the two numbers digit by digit from left to right. \n\n1. First, compare the digits in the units place:\n   - Both numbers have a 9 in the units place.\n\n2. Next, compare the digits in the tenths place:\n   - The number 9.11 has a 1 in the tenths place.\n   - The number 9.9 has a 9 in the tenths place.\n\nSince 1 is less than 9, the number 9.11 is less than 9.9 based on the tenths place comparison.\n\n3. For thoroughness, consider the hundredths place:\n   - The number 9.11 has a 1 in the hundredths place.\n   - The number 9.9 can be written as 9.90, which has a 0 in the hundredths place.\n\nEven if we compare the hundredths place, 1 is greater than 0, but this is irrelevant since the comparison in the tenths place already determines that 9.11 is smaller than 9.9.\n\nTherefore, 9.9 is greater than 9.11.\n</reasoning>\n\n<answer>\n9.9 is bigger than 9.11.\n</answer>'

In [None]:
print(output)

<reasoning>
To determine which number is bigger between 9.11 and 9.9, we should compare the two numbers digit by digit from left to right. 

1. First, compare the digits in the units place:
   - Both numbers have a 9 in the units place.

2. Next, compare the digits in the tenths place:
   - The number 9.11 has a 1 in the tenths place.
   - The number 9.9 has a 9 in the tenths place.

Since 1 is less than 9, the number 9.11 is less than 9.9 based on the tenths place comparison.

3. For thoroughness, consider the hundredths place:
   - The number 9.11 has a 1 in the hundredths place.
   - The number 9.9 can be written as 9.90, which has a 0 in the hundredths place.

Even if we compare the hundredths place, 1 is greater than 0, but this is irrelevant since the comparison in the tenths place already determines that 9.11 is smaller than 9.9.

Therefore, 9.9 is greater than 9.11.
</reasoning>

<answer>
9.9 is bigger than 9.11.
</answer>
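
Once the model follows the template, the reasoning and the final answer can be separated programmatically. The sketch below is one way to do that with a regular expression; the function name is illustrative, and the sample string is a shortened stand-in for the output printed above.

```python
import re

def split_reasoning_and_answer(text: str):
    """Return (reasoning, answer) if the text follows the XML template, else None."""
    match = re.search(
        r"<reasoning>\s*(.*?)\s*</reasoning>\s*<answer>\s*(.*?)\s*</answer>",
        text,
        flags=re.DOTALL,  # let "." span the multi-line reasoning
    )
    if match is None:
        return None
    return match.group(1), match.group(2)

# Shortened stand-in for the model output shown above.
sample = (
    "<reasoning>\nCompare the tenths place: 1 is less than 9.\n</reasoning>\n\n"
    "<answer>\n9.9 is bigger than 9.11.\n</answer>"
)
parsed = split_reasoning_and_answer(sample)
print(parsed[1])

# A reply without the tags cannot be parsed.
print(split_reasoning_and_answer("9.11 is bigger than 9.9."))
```

Returning `None` for untagged text makes it easy to count how often the trained model still ignores the format.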


In [None]:
# To merge the LoRA adapter into the base model and save or upload it, optionally quantized:
# Merge to 16-bit
# model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
# model.push_to_hub_merged("hf/your-repo", tokenizer, save_method="merged_16bit", token="")

# Merge to 4-bit
# model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit")
# model.push_to_hub_merged("hf/your-repo", tokenizer, save_method="merged_4bit", token="")


In [None]:
# Save to 8-bit Q8_0 GGUF
# model.save_pretrained_gguf("model", tokenizer)
# model.push_to_hub_gguf("hf/your-repo", tokenizer, token="")

# Save to 16-bit GGUF
# model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")
# model.push_to_hub_gguf("hf/your-repo", tokenizer, token="")