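# Fine-tuning Models with GRPO and RLVR

This tutorial demonstrates how to fine-tune a large language model (using **Llama-3.1-8B-Instruct** as the example) with the **Group Relative Policy Optimization (GRPO)** algorithm. You will learn how to define a custom reward function for your task within the **Align Anything** framework and combine it with **Reinforcement Learning with Verifiable Rewards (RLVR)** to further improve the model's performance on a specific task.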



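## 1.1 What is GRPO?

**Group Relative Policy Optimization (GRPO)** is a reinforcement learning algorithm that aims to improve a model's reasoning ability through grouped sampling and relative rewards. It was first proposed in the paper *DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models* and was later applied successfully in the post-training stage of DeepSeek-R1.

GRPO optimizes the model's behavior through relative comparison rather than absolute rewards. Specifically, it samples a group of outputs for each prompt and scores each output relative to the others in its group. This mitigates the problem that absolute rewards are hard to define or imprecise in traditional reinforcement learning, which makes GRPO particularly suitable for complex reasoning tasks.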
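
The group-relative idea can be made concrete with a small sketch. Following the formulation in the DeepSeekMath paper, each output's reward is standardized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are illustrative assumptions, not Align Anything's implementation.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against its own group (one group = one prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation, chosen for simplicity
    # eps avoids division by zero when every output in the group gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same prompt; the first and last were judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct answers get positive values, incorrect ones negative
```

When all outputs in a group receive the same reward, every advantage is zero, so that group contributes no learning signal in this sketch.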


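## 1.2 What is RLVR?

**Reinforcement Learning with Verifiable Rewards (RLVR)** is a novel method for training language models, designed for tasks with verifiable outcomes such as mathematical problem solving and instruction following. It reuses the existing reinforcement learning setup (as in RLHF) but replaces the traditional reward model with a verification function.

Unlike traditional methods, RLVR trains the model on a binary signal obtained from answer matching or constraint verification (for example, whether the answer is correct). Applied to mathematics or other verifiable tasks, RLVR can improve performance on targeted benchmarks (such as GSM8K) while maintaining stable performance on other tasks.

RLVR can be viewed as a simplified version of existing approaches, such as RL with execution feedback or bootstrapping methods for language model reasoning. Its core idea is to use a verifiable signal directly as the reward, which avoids the tedious process of building a complex reward model.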
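
A verification function of this kind can be very small. The sketch below assumes, purely for illustration, that the model writes its final answer inside `\boxed{...}` and that exact string matching is an adequate check; the helper names are hypothetical, and a real verifier (for example one built on `math_verify`) would compare answers more robustly.

```python
import re
from typing import Optional


def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, golden_answer: str) -> float:
    """Binary reward: 1.0 when the extracted answer equals the reference, else 0.0."""
    answer = extract_boxed_answer(response)
    return 1.0 if answer is not None and answer == golden_answer.strip() else 0.0


print(verifiable_reward("Since sqrt(400) = 20, the answer is \\boxed{400}.", "400"))  # 1.0
print(verifiable_reward("I think the answer is \\boxed{16}.", "400"))  # 0.0
```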


## 2. Environment setup

Before you begin, make sure the `align-anything` package is installed.

```bash
# Clone the repository
git clone git@github.com:PKU-Alignment/align-anything.git
cd align-anything

# Create a virtual environment with conda
conda create -n align-anything python==3.11
conda activate align-anything
```

- **`[Optional]`** We recommend installing [CUDA](https://anaconda.org/nvidia/cuda) in the conda environment and setting the `CUDA_HOME` environment variable.

```bash
# We tested on an H800 computing cluster, where this CUDA version works well.
# You can adjust the version to match your own cluster.

conda install nvidia/label/cuda-12.2.0::cuda
export CUDA_HOME=$CONDA_PREFIX
```

> If your CUDA is installed in a different location, for example with `nvcc` at `/usr/local/cuda/bin/nvcc`, you can set the environment variable as follows:

```bash
export CUDA_HOME="/usr/local/cuda"
```

Then install `align-anything` with the following commands:

```bash
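# We provide quick installation options for training and evaluation.
# If you only need the training or evaluation module,
# you can install just the corresponding dependencies.
pip install -e .[train] # install the training dependencies
pip install -e .[evaluate] # install the evaluation dependencies

# If you need all dependencies, use the following command:
pip install -e .[all]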
```

Finally, as described in https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/models/remote_rm, you also need to install the following packages:
```bash
pip install Levenshtein flask latex2sympy2_extended math_verify
```
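
Before moving on, it can help to confirm that the extra packages are visible to the active environment. The sketch below uses only the standard library; the helper `missing_modules` is hypothetical, and the import names are assumed to match the package names installed above.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Assumption: these import names correspond to the pip packages installed above.
required = ["Levenshtein", "flask", "latex2sympy2_extended", "math_verify"]
print("Missing modules:", missing_modules(required) or "none")
```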

## 3. Example output of the Llama-3.1-8B-Instruct model

First, let's test the zero-shot ability of the Llama-3.1-8B-Instruct model.

### 3.1 Import the required libraries

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import torch

os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"



### 3.2 Load the original Llama model

In [None]:
device = "cuda"  # set device to "cuda" to use the GPU
model_path = "/PATH/TO/YOUR/Llama-3.1-8B-Instruct"  # replace with the actual model path
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

# Put the model in eval mode
model.eval()

Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.29it/s]


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_

### 3.3 Test the performance of the original model

Let's test the Llama-3.1-8B-Instruct model with an example question.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that answers user queries."},
    {
        "role": "user",
        "content": "How many terms of the sequence $\\sqrt{1},\\sqrt{2},\\sqrt{3},\\sqrt{4},\\ldots$ are less than or equal to $20$?",
    },
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([input_text], return_tensors="pt").to(device)

# generate new tokens with the model
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)
# convert the generated tokens to text
generated_text = tokenizer.decode(
    output[0][len(inputs['input_ids'][0]) :], skip_special_tokens=True
)
print("\nGenerated Text:", generated_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Generated Text: The sequence of square roots of the positive integers is increasing. The largest term of the sequence that is less than or equal to 20 is $\sqrt{19}$, the square root of 16. Therefore, 16 terms of the sequence are less than or equal to 20. The sequence of 16 terms is

$\sqrt{1},\sqrt{2},\sqrt{3},\sqrt{4},\sqrt{5},\sqrt{6},\sqrt{7},\sqrt{8},\sqrt{9},\sqrt{10},\sqrt{11},\sqrt{12},\sqrt{13},\sqrt{14},\sqrt{15},\sqrt{16}$


The correct answer, however, is 400, which shows that Llama 3.1's mathematical ability still has room for improvement.
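
The reference answer can be checked with a few lines of arithmetic. This sketch assumes the question behind the output above asks how many terms of the sequence √1, √2, √3, ... are less than or equal to 20, which is equivalent to counting the positive integers n with n ≤ 20².

```python
import math

# Count the positive integers n whose square root is at most 20.
# The upper bound 1000 is arbitrary; it only needs to exceed 20**2.
count = sum(1 for n in range(1, 1001) if math.sqrt(n) <= 20)
print(count)  # 400
```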

## 4. Aligning the model with GRPO

**Note**: If you cannot access huggingface.co, set the Hugging Face endpoint to hf-mirror.com as follows:

`export HF_ENDPOINT="https://hf-mirror.com"`

Here we use `mathvl_345_example.json`, an example dataset that ships with the **Align Anything** framework. It is a simple math dataset containing 10 math problems and their answers.
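
To see what the dataset looks like before training, a short inspection script can help. The sketch below assumes the file is a JSON list of records (the schema is not described in this tutorial, so no field names are assumed); `summarize_json_dataset` is a hypothetical helper, and the path is relative to the repository root.

```python
import json
import os
from typing import Any, Dict, List, Tuple


def summarize_json_dataset(path: str) -> Tuple[int, List[str]]:
    """Return the number of records and the sorted field names of the first record."""
    with open(path, encoding="utf-8") as f:
        records: List[Dict[str, Any]] = json.load(f)  # assumed: a JSON list of objects
    first_keys = sorted(records[0].keys()) if records else []
    return len(records), first_keys


# Assumed location inside a checkout of the repository; adjust to your setup.
dataset_path = "align_anything/models/remote_rm/math_verify_dataset/mathvl_345_example.json"
if os.path.exists(dataset_path):
    print(summarize_json_dataset(dataset_path))
```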

You can refer to the following training script:

```bash
# NOTE need to start the remote rm server first
bash start_remote_rm.sh

# NOTE need to change the model path
ACTOR_MODEL_NAME_OR_PATH="meta-llama/Llama-3.1-8B-Instruct" # actor model path

TRAIN_DATASETS="../align_anything/models/remote_rm/math_verify_dataset/mathvl_345_example.json" # dataset path
TRAIN_TEMPLATE="Math-Zero-RL" # math zero rlhf dataset template, note that for math zero rl, you are recommended to expand token length to longer length such as 18000
TRAIN_SPLIT="train" # split the input dataset

OUTPUT_DIR="../output/llama_grpo_remote_rm" # output dir
# For wandb online logging
export WANDB_API_KEY=""

export REMOTE_RM_URL="http://127.0.0.1:6000/get_reward"
# Source the setup script
source ./setup.sh

# Execute deepspeed command
deepspeed \
  --master_port ${MASTER_PORT} \
  --module align_anything.trainers.text_to_text.grpo_remote_rm \
  --actor_model_name_or_path ${ACTOR_MODEL_NAME_OR_PATH} \
  --remote_rm_url ${REMOTE_RM_URL} \
  --train_datasets ${TRAIN_DATASETS} \
  --train_split ${TRAIN_SPLIT} \
  --train_template ${TRAIN_TEMPLATE} \
  --output_dir ${OUTPUT_DIR}
```

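After training completes, you can find the trained model weights under `OUTPUT_DIR`.

## 5. Testing the model after GRPO training

After training, we check whether the model's mathematical ability has improved.

### 5.1 Load the new model weights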


In [None]:
model_path = "/PATH/TO/YOUR/TRAINED_MODEL"  # replace with the actual model path
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

# Put the model in eval mode
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128257, 4096, padding_idx=128256)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps

### 5.2 Test the performance of the new model

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that answers user queries."},
    {
        "role": "user",
        "content": "How many terms of the sequence $\\sqrt{1},\\sqrt{2},\\sqrt{3},\\sqrt{4},\\ldots$ are less than or equal to $20$?",
    },
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([input_text], return_tensors="pt").to(device)

# generate new tokens with the model
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)
# convert the generated tokens to text
generated_text = tokenizer.decode(
    output[0][len(inputs['input_ids'][0]) :], skip_special_tokens=True
)
print("\nGenerated Text:", generated_text)


Generated Text: To find out how many terms are less than or equal to $20$, we can find out which term is greater than $20$, and then subtract $1$ to find the answer.

Recognize that $\sqrt{400} = 20$.

The sequence goes by consecutive integers (1, 2, 3, 4, ect), so $\sqrt{400}$ will be the 400th term.

Thus, we can say every term up to the 400th term is less than or equal to $20$, except $\sqrt{400}$.


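As this shows, the trained model does answer the question correctly.

(Strictly speaking, the test question appears in the training dataset, so this is an in-distribution test.)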
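
For a test that is not in-distribution, one option is to hold out some problems before training and evaluate only on those. The following is a minimal sketch of such a split; the helper `train_test_split`, the 20% ratio, and the placeholder problem list are illustrative assumptions and not part of Align Anything.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    items: Sequence[T], test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically and hold out a fraction of the items for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))  # keep at least one test item
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder identifiers standing in for the 10 problems of the example dataset.
problems = [f"problem-{i}" for i in range(10)]
train, test = train_test_split(problems)
print(len(train), len(test))  # 8 2
```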

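## 6. Customizing the reward function

In this section, we show how to define a custom reward function so that you can design a scoring mechanism tailored to your task.

### 6.1 Create the reward function file

First, create a new reward function file in the project's `reward_functions` directory: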
```bash
cd align-anything/align_anything/models/remote_rm/reward_functions/
touch my_verifier.py
```
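You can refer to the examples in `examples.py` to implement your own reward function. In this example, we implement a simple format-verification reward function that only checks whether the response follows the required format and does not yet consider whether the answer is correct.
The implementation is as follows: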
```python
# align_anything/models/remote_rm/reward_functions/my_verifier.py
import random
import re
from typing import List, Optional

from flask import jsonify

format_pattern = r'^<think>(?:(?!</think>).)*</think><answer>(?:(?!</answer>).)*</answer>\Z'


def verify_format(content):
    """
    Verify if the string meets the format requirements:
    - Must start with <think> and end with </answer>
    - Must contain exactly one pair of <think>...</think> and <answer>...</answer> tags
    - No extra characters allowed between </think> and <answer> tags
    """
    think_count = content.count('<think>')
    answer_count = content.count('<answer>')
    return (
        bool(re.match(format_pattern, content, re.DOTALL))
        and think_count == 1
        and answer_count == 1
    )

def my_verifier_reward_function(
    prompts: List[str], responses: List[str], golden_responses: Optional[List[str]] = None
) -> List[float]:
    """
    Format verifier reward function: rewards a response with 1.0 if it follows the
    <think>...</think><answer>...</answer> format and 0.0 otherwise.
    The accuracy of the answer is not checked.

    Args:
        prompts: List of math problems
        responses: List of model answers
        golden_responses: List of golden responses; each must be present,
            but they are only used for logging here
    Returns:
        List of reward scores for each (prompt, response) pair
    """
    if golden_responses is None:
        # zip() cannot iterate over None; pad so the per-item check below reports the error
        golden_responses = [None] * len(prompts)

    rewards = []
    for prompt, response, golden_response in zip(prompts, responses, golden_responses):
        if prompt is None:
            return jsonify({'error': f'problem not found from {prompt}'}), 400
        if golden_response is None:
            return jsonify({'error': f'golden response not found from {prompt}'}), 400
        # TODO: processing the error code 400

        format_reward = float(verify_format(response))
        rewards.append(format_reward)

        # Print roughly one in ten samples so the server log stays readable
        do_print = random.randint(1, 10) == 1
        if do_print:
            info = f'Query: {prompt}\n\nAnswer: {golden_response}\n\nResponse: {response}\n\nFormat Reward: {format_reward}\n\n'
            # Strip special tokens such as <|eot_id|> before printing
            info = re.sub(r'<\|.*?\|>', '', info)
            print(info)
    return rewards
```
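
To get a feel for how strict the format check is, the same pattern can be tried on a few strings. The sketch below repeats the regular expression from `my_verifier.py` so that it runs on its own; `follows_format` is a hypothetical stand-in for `verify_format`, and the example strings are made up for illustration.

```python
import re

# Same pattern as in my_verifier.py above.
format_pattern = r'^<think>(?:(?!</think>).)*</think><answer>(?:(?!</answer>).)*</answer>\Z'


def follows_format(content: str) -> bool:
    """Stand-in for verify_format: one <think> block immediately followed by one <answer> block."""
    return (
        bool(re.match(format_pattern, content, re.DOTALL))
        and content.count('<think>') == 1
        and content.count('<answer>') == 1
    )


examples = {
    "<think>20^2 = 400</think><answer>400</answer>": True,
    "<think>20^2 = 400</think>\n<answer>400</answer>": False,  # newline between the tags
    "<answer>400</answer>": False,  # missing <think> block
    "<think>a</think><answer>400</answer> extra": False,  # trailing text after </answer>
}
for text, expected in examples.items():
    assert follows_format(text) is expected
    print(f"{follows_format(text)!s:5} <- {text!r}")
```

Because `\Z` only matches at the very end of the string, even a trailing newline after `</answer>` fails the check, so you may want to strip whitespace from responses before verifying them.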

### 6.2 Register the custom reward function

After implementing the reward function, register it with the framework:

1. Add the following to `align_anything/models/remote_rm/reward_functions/__init__.py`:
```python
from .my_verifier import *
```

2. Register the function in `align_anything/models/remote_rm/run_reward_server.py`:
```python
reward_functions = {
    'example_math': example_math_reward_function,
    'example_coding': example_coding_reward_function,
    'example_safety': example_safety_reward_function,
    'math_verifier': math_verifier_reward_function,
    'my_verifier': my_verifier_reward_function,
}
```

3. Update the configuration in `scripts/start_remote_rm.sh`:
```bash
export REWARD_TYPE="my_verifier"
```

The custom reward function is now configured.
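
The three steps above amount to a name-based lookup: a dictionary maps a reward type to a function, and a configured name selects one entry. How the server actually reads `REWARD_TYPE` is not shown in this tutorial, so the following is only a sketch of the presumed pattern; the toy reward, the `registry` contents, and `select_reward_function` are hypothetical.

```python
from typing import Callable, Dict, List

RewardFn = Callable[[List[str], List[str]], List[float]]


def non_empty_reward(prompts: List[str], responses: List[str]) -> List[float]:
    """Toy reward used only for this illustration: 1.0 for any non-empty response."""
    return [float(len(r) > 0) for r in responses]


# Hypothetical registry, mirroring the shape of the `reward_functions` dict above.
registry: Dict[str, RewardFn] = {"toy_non_empty": non_empty_reward}


def select_reward_function(registry: Dict[str, RewardFn], reward_type: str) -> RewardFn:
    """Look up a reward function by name and fail loudly on unknown names."""
    if reward_type not in registry:
        raise KeyError(f"Unknown reward type {reward_type!r}; available: {sorted(registry)}")
    return registry[reward_type]


reward_fn = select_reward_function(registry, "toy_non_empty")
print(reward_fn(["q"], ["some answer"]))  # [1.0]
```

Failing loudly on an unknown name makes a typo in the configured reward type visible at startup rather than during training.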

### 6.3 Train with the custom reward function

We train with the same command as before, but the reward is now computed by our custom reward function.

```bash
# NOTE need to start the remote rm server first
bash start_remote_rm.sh

# NOTE need to change the model path
ACTOR_MODEL_NAME_OR_PATH="meta-llama/Llama-3.1-8B-Instruct" # actor model path

TRAIN_DATASETS="../align_anything/models/remote_rm/math_verify_dataset/mathvl_345_example.json" # dataset path
TRAIN_TEMPLATE="Math-Zero-RL" # math zero rlhf dataset template, note that for math zero rl, you are recommended to expand token length to longer length such as 18000
TRAIN_SPLIT="train" # split the input dataset

OUTPUT_DIR="../output/llama_grpo_remote_rm" # output dir
# For wandb online logging
export WANDB_API_KEY=""

export REMOTE_RM_URL="http://127.0.0.1:6000/get_reward"
# Source the setup script
source ./setup.sh

# Execute deepspeed command
deepspeed \
  --master_port ${MASTER_PORT} \
  --module align_anything.trainers.text_to_text.grpo_remote_rm \
  --actor_model_name_or_path ${ACTOR_MODEL_NAME_OR_PATH} \
  --remote_rm_url ${REMOTE_RM_URL} \
  --train_datasets ${TRAIN_DATASETS} \
  --train_split ${TRAIN_SPLIT} \
  --train_template ${TRAIN_TEMPLATE} \
  --output_dir ${OUTPUT_DIR}
```

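### 6.4 Check the reward output

To guard against reward hacking (the model exploiting loopholes in the reward function), check whether the model's behavior matches expectations:

1. Inspect the reward server log: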
```bash
tail -f align-anything/debug_logs/reward_server.log
```

If you find anything abnormal, adjust the reward function's decision logic promptly.
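
One concrete loophole in a format-only reward is that `<think></think><answer></answer>` satisfies the pattern while saying nothing. The sketch below flags such cases; it assumes you have collected responses and their rewards into two lists (for example by parsing the log yourself), and `suspicious_responses` is a hypothetical helper.

```python
import re
from typing import List

answer_pattern = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def suspicious_responses(responses: List[str], rewards: List[float]) -> List[str]:
    """Return rewarded responses whose <answer> block is empty or whitespace-only."""
    flagged = []
    for response, reward in zip(responses, rewards):
        match = answer_pattern.search(response)
        if reward > 0 and match is not None and not match.group(1).strip():
            flagged.append(response)
    return flagged


responses = [
    "<think>20^2 = 400</think><answer>400</answer>",
    "<think></think><answer></answer>",  # passes the format check but contains no answer
]
print(suspicious_responses(responses, [1.0, 1.0]))
# ['<think></think><answer></answer>']
```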

## 7. Acknowledgements

- [Hugging Face Transformers documentation](https://huggingface.co/docs/transformers/index)
- [GRPO paper](https://arxiv.org/pdf/2402.03300)
- [DeepSeek-R1 paper](https://arxiv.org/abs/2501.12948)