# Fine-tuning Models with GRPO and RLVR Algorithms

This tutorial demonstrates how to fine-tune large language models, with **Llama-3.1-8B-Instruct** as the example, using the **Group Relative Policy Optimization (GRPO)** algorithm. You will learn how to customize reward functions for your tasks under the **Align Anything** framework and combine them with **Reinforcement Learning with Verifiable Rewards (RLVR)** to further improve model performance on specific tasks.

## 1.1 What is GRPO?

**Group Relative Policy Optimization (GRPO)** is a reinforcement learning algorithm designed to enhance model reasoning capabilities through grouping and relative reward mechanisms. GRPO was first introduced in the paper *DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models* and was successfully applied in the post-training phase of DeepSeek-R1.

GRPO optimizes model behavior through relative comparisons rather than absolute rewards. Specifically, GRPO samples a group of outputs for the same prompt and scores each output relative to the others in its group. This helps mitigate a common issue in traditional reinforcement learning, where absolute rewards are difficult to define or lack precision, and makes GRPO particularly suitable for complex reasoning tasks.
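The relative scoring can be sketched in a few lines. Following the formulation in the DeepSeekMath paper, each reward is normalized by the mean and standard deviation of its group. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions of this sketch, not part of Align Anything.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one group (outputs sampled for the same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumption of this sketch) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, two of them verified as correct
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, even though the raw rewards carry no absolute scale.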

## 1.2 What is RLVR?

**Reinforcement Learning with Verifiable Rewards (RLVR)** is a language model training method designed for tasks with verifiable outcomes (such as mathematical problem-solving and instruction following). RLVR reuses the existing reinforcement learning pipeline (as in RLHF) but replaces the learned reward model with a verification function.

Unlike traditional methods, RLVR trains models using binary signals through answer matching or constraint verification (e.g., whether an answer is correct). When applied to mathematical domains or other verifiable tasks, RLVR not only improves performance on specific benchmarks (like GSM8K) but also maintains stable performance across other tasks.

RLVR can be viewed as a simplified version of existing methods, such as RL with execution feedback or bootstrapping methods for language model reasoning. Its core idea is to use verifiable signals as direct rewards, avoiding the complex process of building sophisticated reward models.
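As an illustration of a verifiable signal used directly as a reward, the sketch below returns a binary score by exact answer matching. The function name `binary_reward` and the convention that the final line of the response holds the answer are assumptions made for this example only.

```python
from typing import List


def binary_reward(response: str, golden_answer: str) -> float:
    """Return 1.0 if the final line of the response equals the reference answer, else 0.0."""
    lines = response.strip().splitlines()
    final_line = lines[-1].strip() if lines else ""
    return 1.0 if final_line == golden_answer.strip() else 0.0


responses = ["20 * 20 = 400\n400", "I think it is 16\n16"]
rewards: List[float] = [binary_reward(r, "400") for r in responses]
print(rewards)  # [1.0, 0.0]
```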

## 2. Environment Setup

Before starting, please make sure you have installed the `align-anything` package.

```bash
# Clone the repository
git clone git@github.com:PKU-Alignment/align-anything.git
cd align-anything

# Create a virtual environment using conda
conda create -n align-anything python==3.11
conda activate align-anything
```

- **`[Optional]`** We recommend installing [CUDA](https://anaconda.org/nvidia/cuda) in the conda environment and setting the `CUDA_HOME` environment variable.

```bash
# We have tested this version of CUDA on the H800 computing cluster and it worked well.
# You can adjust this version according to your actual computing cluster.

conda install nvidia/label/cuda-12.2.0::cuda
export CUDA_HOME=$CONDA_PREFIX
```

> If your CUDA is installed in a different location, such as `/usr/local/cuda/bin/nvcc`, you can set the environment variable as follows:

```bash
export CUDA_HOME="/usr/local/cuda"
```
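A quick way to confirm the setting is to check that `nvcc` sits under `$CUDA_HOME/bin`, as in the `/usr/local/cuda/bin/nvcc` example above. This is a minimal sketch; the helper name `expected_nvcc` is hypothetical, and the fallback to `/usr/local/cuda` is only an illustrative default.

```python
import os
from pathlib import PurePosixPath


def expected_nvcc(cuda_home: str) -> str:
    """Return the location where nvcc is expected under a given CUDA_HOME."""
    return str(PurePosixPath(cuda_home) / "bin" / "nvcc")


cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
nvcc = expected_nvcc(cuda_home)
print(f"CUDA_HOME={cuda_home}; nvcc expected at {nvcc}; present: {os.path.exists(nvcc)}")
```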

Next, install `align-anything` using the following command:

```bash
# We have prepared a quick installation for training and evaluation.
# If you only need to use the training or evaluation module,
# you can install the corresponding dependencies.
pip install -e .[train] # Install training dependencies
pip install -e .[evaluate] # Install evaluation dependencies

# If you need to install all dependencies, you can use the following command:
pip install -e .[all]
```

Finally, the remote reward model server has additional dependencies, listed at https://github.com/PKU-Alignment/align-anything/tree/main/align_anything/models/remote_rm

Install them with:
```bash
pip install Levenshtein flask latex2sympy2_extended math_verify
```
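To confirm that the extra packages are importable, a small check along these lines can help. It assumes the import names match the package names given to `pip`; the helper `missing_modules` is hypothetical.

```python
from importlib.util import find_spec
from typing import List


def missing_modules(names: List[str]) -> List[str]:
    """Return the top-level modules from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names for the packages installed above
required = ["Levenshtein", "flask", "latex2sympy2_extended", "math_verify"]
missing = missing_modules(required)
print("Missing modules:", missing or "none")
```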

## 3. Llama-3.1-8B-Instruct Model Output Example
First, let's test the zero-shot capability of the Llama-3.1-8B-Instruct model.

### 3.1 Import Required Libraries


In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import torch

os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"



### 3.2 Load the Original Llama Model

In [None]:
device = "cuda"  # Set device to "cuda" to use GPU
model_path = (
    "/PATH/TO/YOUR/Meta-Llama-3.1-8B-Instruct"  # Please replace with your actual model path
)
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

# Set the model to evaluation mode
model.eval()

Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.29it/s]


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_

### 3.3 Test the Performance of the Original Model

Let's test the Llama-3.1-8B-Instruct model with a sample question.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that answers user queries."},
    {
        "role": "user",
        "content": "How many terms of the sequence $\\sqrt{1},\\sqrt{2},\\sqrt{3},\\sqrt{4},\\ldots$ are less than or equal to $20$?",
    },
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([input_text], return_tensors="pt").to(device)

# Generate new tokens without tracking gradients
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens (skip the prompt)
generated_text = tokenizer.decode(
    output[0][len(inputs['input_ids'][0]) :], skip_special_tokens=True
)
print("\nGenerated Text:", generated_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Generated Text: The sequence of square roots of the positive integers is increasing. The largest term of the sequence that is less than or equal to 20 is $\sqrt{19}$, the square root of 16. Therefore, 16 terms of the sequence are less than or equal to 20. The sequence of 16 terms is

$\sqrt{1},\sqrt{2},\sqrt{3},\sqrt{4},\sqrt{5},\sqrt{6},\sqrt{7},\sqrt{8},\sqrt{9},\sqrt{10},\sqrt{11},\sqrt{12},\sqrt{13},\sqrt{14},\sqrt{15},\sqrt{16}$


The correct answer is 400 (since $\sqrt{400} = 20$), whereas the model answered 16. This demonstrates that there is still room for improvement in Llama 3.1's mathematical capabilities.
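The generated text above concerns the terms of the sequence $\sqrt{1},\sqrt{2},\sqrt{3},\ldots$ that are at most 20, so the reference answer can be checked by brute force. The search bound of 1000 is an arbitrary choice for this sketch.

```python
import math

# Count the integers n >= 1 with sqrt(n) <= 20, i.e. n <= 20**2
limit = 20
count = sum(1 for n in range(1, 1000) if math.sqrt(n) <= limit)
print(count)  # 400
```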

## 4. Training the Model Using the GRPO Algorithm

**Note**: If you cannot access huggingface.co, set the Hugging Face endpoint to hf-mirror.com. You can do this with the following command:

`export HF_ENDPOINT="https://hf-mirror.com"`

Here, we use the example math dataset shipped with the repository (`mathvl_345_example.json`, referenced in the script below) together with the `Math-Zero-RL` template. Rewards are computed by a remote reward model server, which must be running before training starts.

You can refer to the training script below:

```bash
# NOTE need to start the remote rm server first
bash start_remote_rm.sh

# NOTE need to change the model path
ACTOR_MODEL_NAME_OR_PATH="meta-llama/Llama-3.1-8B-Instruct" # actor model path

TRAIN_DATASETS="../align_anything/models/remote_rm/math_verify_dataset/mathvl_345_example.json" # dataset path
TRAIN_TEMPLATE="Math-Zero-RL" # Math-Zero-RL dataset template; for this setting, a longer token length (e.g., 18000) is recommended
TRAIN_SPLIT="train" # split the input dataset

OUTPUT_DIR="../output/llama_grpo_remote_rm" # output dir
# For wandb online logging
export WANDB_API_KEY=""

export REMOTE_RM_URL="http://127.0.0.1:6000/get_reward"
# Source the setup script
source ./setup.sh

# Execute deepspeed command
deepspeed \
  --master_port ${MASTER_PORT} \
  --module align_anything.trainers.text_to_text.grpo_remote_rm \
  --actor_model_name_or_path ${ACTOR_MODEL_NAME_OR_PATH} \
  --remote_rm_url ${REMOTE_RM_URL} \
  --train_datasets ${TRAIN_DATASETS} \
  --train_split ${TRAIN_SPLIT} \
  --train_template ${TRAIN_TEMPLATE} \
  --output_dir ${OUTPUT_DIR}
```

After training is completed, you can find the trained model weights under the `OUTPUT_DIR`.

## 5. Test the Performance of the Model After GRPO Training

Once training is complete, we test whether the model's mathematical ability has improved.

### 5.1 Load the New Model Weights


In [None]:
model_path = "/PATH/TO/YOUR/TRAINED_MODEL"  # Please replace with your actual model path
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

# Set the model to evaluation mode
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128257, 4096, padding_idx=128256)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps

### 5.2 Test the Performance of the New Model

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that answers user queries."},
    {
        "role": "user",
        "content": "How many terms of the sequence $\\sqrt{1},\\sqrt{2},\\sqrt{3},\\sqrt{4},\\ldots$ are less than or equal to $20$?",
    },
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([input_text], return_tensors="pt").to(device)

# Generate new tokens without tracking gradients
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens (skip the prompt)
generated_text = tokenizer.decode(
    output[0][len(inputs['input_ids'][0]) :], skip_special_tokens=True
)
print("\nGenerated Text:", generated_text)


Generated Text: To find out how many terms are less than or equal to $20$, we can find out which term is greater than $20$, and then subtract $1$ to find the answer.

Recognize that $\sqrt{400} = 20$.

The sequence goes by consecutive integers (1, 2, 3, 4, ect), so $\sqrt{400}$ will be the 400th term.

Thus, we can say every term up to the 400th term is less than or equal to $20$, except $\sqrt{400}$.


This shows that the fine-tuned model did indeed solve the problem correctly.

(Strictly speaking, the test question comes from the training dataset, so this is an in-distribution test.)

## 6. Customizing Reward Functions
In this section, we will learn how to customize reward functions, allowing you to design specific scoring mechanisms based on your task requirements.

### 6.1 Creating Reward Function Files
First, create a new reward function file in the `reward_functions` directory of the project:
```bash
cd align-anything/align_anything/models/remote_rm/reward_functions/
touch my_verifier.py
```
We can refer to the examples in `examples.py` to implement our own reward function. In this example, we implement a simple format verification reward function that checks only whether the answer format is correct, without considering the accuracy of the answer.
Here is the implementation:
```python
# align_anything/models/remote_rm/reward_functions/my_verifier.py
import random
import re
from typing import List, Optional

from flask import jsonify

format_pattern = r'^<think>(?:(?!</think>).)*</think><answer>(?:(?!</answer>).)*</answer>\Z'


def verify_format(content):
    """
    Verify if the string meets the format requirements:
    - Must start with <think> and end with </answer>
    - Must contain exactly one pair of <think>...</think> and <answer>...</answer> tags
    - No extra characters allowed between </think> and <answer> tags
    """
    think_count = content.count('<think>')
    answer_count = content.count('<answer>')
    return (
        bool(re.match(format_pattern, content, re.DOTALL))
        and think_count == 1
        and answer_count == 1
    )

def my_verifier_reward_function(
    prompts: List[str], responses: List[str], golden_responses: Optional[List[str]] = None
) -> List[float]:
    """
    Format verifier reward function: scores the response format only, not answer accuracy

    Args:
        prompts: List of math problems
        responses: List of model answers
        golden_responses: Optional list of golden responses (only used for logging)
    Returns:
        List of reward scores (1.0 or 0.0) for each (prompt, response) pair
    """
    if golden_responses is None:
        # zip() cannot iterate over None, and the format check needs no golden answers
        golden_responses = [None] * len(prompts)

    rewards = []
    for prompt, response, golden_response in zip(prompts, responses, golden_responses):
        if prompt is None:
            return jsonify({'error': f'problem not found from {prompt}'}), 400
        # TODO: processing the error code 400

        format_reward = float(verify_format(response))
        rewards.append(format_reward)

        # Print roughly one in ten samples for inspection
        do_print = random.randint(1, 10) == 1
        if do_print:
            info = f'Query: {prompt}\n\nAnswer: {golden_response}\n\nResponse: {response}\n\nFormat Reward: {format_reward}\n\n'
            # Strip special tokens such as <|eot_id|> before printing
            info = re.sub(r'<\|.*?\|>', '', info)
            print(info)
    return rewards
```
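Before training, it is worth checking locally which responses the format check accepts. The snippet below repeats the pattern and `verify_format` logic from `my_verifier.py` so that it runs on its own; the sample strings are made up for illustration.

```python
import re

# Same pattern and check as in my_verifier.py, repeated so the snippet runs on its own
format_pattern = r'^<think>(?:(?!</think>).)*</think><answer>(?:(?!</answer>).)*</answer>\Z'


def verify_format(content: str) -> bool:
    return (
        bool(re.match(format_pattern, content, re.DOTALL))
        and content.count('<think>') == 1
        and content.count('<answer>') == 1
    )


samples = {
    'well formed': '<think>20^2 = 400</think><answer>400</answer>',
    'newline between tags': '<think>20^2 = 400</think>\n<answer>400</answer>',
    'missing think tag': '<answer>400</answer>',
    'nested think tag': '<think>a</think><answer><think>b</answer>',
}
results = {name: verify_format(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f'{name}: {ok}')
```

Only the first sample passes. The check is strict: even a single newline between `</think>` and `<answer>` yields a reward of 0.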

### 6.2 Registering Custom Reward Functions

After implementing the reward function, you need to register it in the framework:

1. Add the following to `align_anything/models/remote_rm/reward_functions/__init__.py`:
```python
from .my_verifier import *
```

2. Register the function in `align_anything/models/remote_rm/run_reward_server.py`:
```python
reward_functions = {
    'example_math': example_math_reward_function,
    'example_coding': example_coding_reward_function,
    'example_safety': example_safety_reward_function,
    'math_verifier': math_verifier_reward_function,
    'my_verifier': my_verifier_reward_function,
}
```

3. Modify the configuration in `scripts/start_remote_rm.sh`:
```bash
export REWARD_TYPE="my_verifier"
```

With this, the custom reward function configuration is complete.
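Conceptually, the three steps combine into a name-based lookup: the `REWARD_TYPE` value selects one entry of the `reward_functions` dictionary. The sketch below illustrates that idea only; `select_reward_function` and `constant_reward` are hypothetical names and this is not the actual code of `run_reward_server.py`.

```python
from typing import Callable, Dict, List

RewardFn = Callable[..., List[float]]


def constant_reward(prompts: List[str], responses: List[str], golden_responses=None) -> List[float]:
    """Stand-in reward function used only for this sketch."""
    return [1.0 for _ in responses]


def select_reward_function(registry: Dict[str, RewardFn], reward_type: str) -> RewardFn:
    """Look up a reward function by name and fail loudly on unknown names."""
    if reward_type not in registry:
        raise KeyError(f"Unknown REWARD_TYPE {reward_type!r}; known: {sorted(registry)}")
    return registry[reward_type]


registry = {"my_verifier": constant_reward}
reward_fn = select_reward_function(registry, "my_verifier")
print(reward_fn(["q"], ["a"]))  # [1.0]
```

A misspelled `REWARD_TYPE` then surfaces as an explicit error instead of silently falling back to a different reward.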

### 6.3 Training with Custom Reward Functions
We use the same training command as before, but now the custom reward function computes the rewards behind the scenes:

```bash
# NOTE need to start the remote rm server first
bash start_remote_rm.sh

# NOTE need to change the model path
ACTOR_MODEL_NAME_OR_PATH="meta-llama/Llama-3.1-8B-Instruct" # actor model path

TRAIN_DATASETS="../align_anything/models/remote_rm/math_verify_dataset/mathvl_345_example.json" # dataset path
TRAIN_TEMPLATE="Math-Zero-RL" # Math-Zero-RL dataset template; for this setting, a longer token length (e.g., 18000) is recommended
TRAIN_SPLIT="train" # split the input dataset

OUTPUT_DIR="../output/llama_grpo_remote_rm" # output dir
# For wandb online logging
export WANDB_API_KEY=""

export REMOTE_RM_URL="http://127.0.0.1:6000/get_reward"
# Source the setup script
source ./setup.sh

# Execute deepspeed command
deepspeed \
  --master_port ${MASTER_PORT} \
  --module align_anything.trainers.text_to_text.grpo_remote_rm \
  --actor_model_name_or_path ${ACTOR_MODEL_NAME_OR_PATH} \
  --remote_rm_url ${REMOTE_RM_URL} \
  --train_datasets ${TRAIN_DATASETS} \
  --train_split ${TRAIN_SPLIT} \
  --train_template ${TRAIN_TEMPLATE} \
  --output_dir ${OUTPUT_DIR}

```

### 6.4 Checking Reward Outputs

To prevent reward hacking (where the model exploits loopholes in the reward function), we need to verify if the model's behavior meets expectations:

1. Check the reward server logs:
```bash
tail -f align-anything/debug_logs/reward_server.log
```

If any anomalies are detected, adjust the reward function's evaluation logic promptly.
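The reward function in section 6.1 prints a `Format Reward: ...` line for roughly one in ten samples. Assuming that output ends up in the server log, the values can be aggregated with a short script such as the following; the helper name and the sample log text are illustrative.

```python
import re
from statistics import mean
from typing import List


def format_rewards_from_log(log_text: str) -> List[float]:
    """Extract the values printed after 'Format Reward:' in the log text."""
    return [float(v) for v in re.findall(r'Format Reward: ([0-9]*\.?[0-9]+)', log_text)]


# Made-up excerpt in the format printed by my_verifier_reward_function
sample_log = (
    'Query: q1\n\nAnswer: 400\n\nResponse: r1\n\nFormat Reward: 1.0\n\n'
    'Query: q2\n\nAnswer: 2\n\nResponse: r2\n\nFormat Reward: 0.0\n\n'
)
rewards = format_rewards_from_log(sample_log)
print(f'{len(rewards)} samples, mean format reward {mean(rewards):.2f}')
```

A mean that jumps to 1.0 very early in training, while the printed responses look degenerate, is a hint that the model may be exploiting the format check.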

## 7. Acknowledgements

- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [GRPO Paper](https://arxiv.org/pdf/2402.03300)
- [DeepSeek-R1 Paper](https://arxiv.org/abs/2501.12948)