# 🧠 Post-Training an LLM for Reasoning with GRPO in TRL

In this notebook, we guide you through the process of **post-training a Large Language Model (LLM)** using **Group Relative Policy Optimization (GRPO)**, a method introduced in the [DeepSeekMath paper](https://arxiv.org/abs/2402.03300).

GRPO is particularly effective for **scaling test-time compute for extended reasoning**, making it an ideal approach for tackling complex tasks such as mathematical problem-solving.

---

#### 🧐 What is GRPO?

**Group Relative Policy Optimization (GRPO)** is a reinforcement learning (RL) post-training technique introduced in the DeepSeekMath paper and later used in the training of [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1). It builds on PPO but removes the separate value model and instead uses **group-wise reward normalization** to estimate advantages, which lowers memory requirements and can make learning more stable.

Unlike earlier techniques that relied heavily on search heuristics, GRPO **relies exclusively on RL** to fine-tune the LLM post-SFT (Supervised Fine-Tuning), enhancing its capacity to solve nuanced, multi-step tasks.

> 🔎 **Note**: Unlike DPO, GRPO does **not** use pairwise preference data. Instead, it samples a **group of completions for each prompt**, scores each one with a reward function, and optimizes based on **reward normalization within that group**.
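
As a minimal sketch of group-wise reward normalization, assume a group is the set of completions sampled for a single prompt, and that each completion's advantage is its reward minus the group mean, divided by the group standard deviation. The function name, the small `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not TRL's exact implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 completions sampled from the same prompt
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # above-average completions get positive values, the rest negative
```

When every completion in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.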

The GRPO method is available in the [TRL library](https://huggingface.co/docs/trl/main/en/grpo_trainer#quick-start), and the Hugging Face Science team is actively working to reproduce the full DeepSeek-R1 training process via the [Open-R1 project](https://github.com/huggingface/open-r1).

---

#### 🔄 Comparing GRPO, DPO, and PPO

| Aspect                     | PPO (Proximal Policy Optimization)        | DPO (Direct Preference Optimization)         | GRPO (Group Relative Policy Optimization)           |
|---------------------------|-------------------------------------------|----------------------------------------------|------------------------------------------------------|
| **Type**                  | RL algorithm                              | Supervised objective (no reward model)       | RL-based with group-wise normalization               |
| **Training Signal**       | Uses a learned reward model               | Uses pairwise preference labels              | Uses task-specific rewards, normalized within each group of sampled completions |
| **Stability**             | Prone to instability in large-scale LLMs  | More stable due to no sampling/rollouts      | More stable than PPO via group normalization         |
| **Compute Requirements**  | High (sampling, rollout + reward model)   | Low (no sampling or reward model inference)  | Medium-High (RL training with sampling, no value model) |
| **Alignment Type**        | Reward-based RL                           | Implicit via preference-based supervision    | Reward-based RL over groups of sampled completions   |
| **Strengths**             | Proven RL method, widely used             | Simplicity, fast training, stability         | Better reasoning ability, handles outliers           |
| **Weaknesses**            | Reward model is hard to train well        | Might underperform in reasoning-heavy tasks  | Needs well-designed reward functions and high-quality tasks |

---

#### 💡 Why GRPO?

GRPO was specifically designed to **enhance reasoning ability** by promoting group-aware learning. It:

- Encourages **relative improvement within groups of samples** (e.g., several completions generated for the same math question)
- Promotes generalization across problem types
- Improves robustness to reward outliers by normalizing rewards within each group

These advantages make GRPO especially promising for tasks that require **multi-step reasoning and consistency**, such as math, code generation, or logic-based problem-solving.

---

#### 📘 About This Notebook

We focus specifically on **post-training with GRPO** using Hugging Face's TRL library. This notebook provides:

- A hands-on demonstration of using `GRPOTrainer`
- An overview of how the reward functions that drive GRPO are defined
- A look at how this training compares to other RLHF techniques

> For a deeper dive into the full DeepSeek-R1 training procedure, check out the [Open-R1 repository](https://github.com/huggingface/open-r1) (**HINT** you might need to check this afterwards for the exercise)

---

#### 🧩 GRPO Training Pipeline (Illustrated)

The diagram below highlights the main differences between **PPO** (Proximal Policy Optimization) and **GRPO** (Group Relative Policy Optimization), specifically the removal of the value model in GRPO. For more detailed information on the key differences, you can refer to the [full explanation here](https://www.philschmid.de/deepseek-r1).
![image](https://miro.medium.com/v2/resize:fit:1400/1*84PSf3d1-OGN10y_2H-XdQ.png)

#### 1. Install Dependencies

Let’s start by installing the essential libraries we’ll need for fine-tuning! 🚀


In [1]:
!pip install --q unsloth vllm math_verify pypdf wandb

Import needed packages

In [4]:
# Python standard library
import os
import re
import warnings
from typing import List

# Unsloth is imported before transformers/trl so that its patches are applied
from unsloth import FastLanguageModel, is_bfloat16_supported

# Third-party libraries
import torch
from datasets import load_dataset, Dataset
from math_verify import LatexExtractionConfig, parse, verify
from pypdf import PdfReader
from sentence_transformers import CrossEncoder
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from trl import GRPOConfig, GRPOTrainer

Authenticate with your Hugging Face account to save and share your model directly from this notebook 🗝️.

In [6]:
# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Set the Hugging Face token (get your token from https://huggingface.co/settings/tokens)
os.environ["HF_TOKEN"] = "YOUR_HF_TOKEN"

# Set the wandb token (get your token from https://wandb.ai/authorize)
os.environ["WANDB_API_KEY"] = "YOUR_WANDB_API_KEY"

# Ignore warnings
warnings.filterwarnings("ignore")

General Config

In [8]:
# ----------------------------------------
# ✅ Candidate 4-bit Models (memory-efficient, quantized)
# ----------------------------------------
# Note: These models are optimized with Unsloth's 4-bit quantization,
# providing faster inference and lower memory usage on limited hardware.
fourbit_models = [
    # Qwen series (good performance, various sizes)
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit",     # ✅ Smallest Qwen model, fast & memory-efficient
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",       # ⚖️ Middle ground
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",       # 💪 Better reasoning, more memory needed
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",      # 🚀 High performance, needs more compute
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",      # 🧠 Best accuracy but resource-intensive

    # Other strong general-purpose or task-tuned 4-bit models
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit", # Fine-tuned Gemma 12B
    "unsloth/Phi-4",                           # Microsoft's compact model, good for reasoning
    "unsloth/Llama-3.1-8B",                    # Meta's LLaMA 3 variant
    "unsloth/Llama-3.2-3B",                    # Lightweight LLaMA 3
    "unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit" # [NEW] TTS-capable model
]
# 🔗 More models: https://huggingface.co/unsloth

# ----------------------------------------
# ✅ Model Configuration
# ----------------------------------------
MODEL = "unsloth/Qwen3-1.7B"  # Default model for training/inference (change as needed)
max_seq_length = 2048         # Allows longer reasoning chains (e.g., for math or code tasks)
lora_rank = 32                # LoRA rank: higher values = better adaptation, more VRAM usage
NEW_MODEL = "Qwen3_1.7B-GRPO-math-reasoning"

# ----------------------------------------
# ✅ Prompting Strategy
# ----------------------------------------
SYSTEM_PROMPT = """
Respond in the following format:
/nothink
<reasoning>
Briefly explain your reasoning. Be concise and avoid unnecessary detail.
</reasoning>
<answer>
answer here
</answer>
"""
# ----------------------------------------
# ✅ Dataset
# ----------------------------------------
DATASET = "lighteval/MATH-Hard"  # Benchmark dataset for evaluating math reasoning difficulty

#### 2. Load Dataset 📁

These models excel at tasks that require **complex reasoning**. A prime example is **mathematical problem-solving**, which often demands multi-step reasoning to arrive at a correct solution.

For this project, we'll use the **lighteval/MATH-Hard** dataset on Hugging Face, a curated benchmark designed to evaluate large language models (LLMs) on challenging high school-level mathematics problems. It focuses exclusively on Level 5 questions from the original MATH dataset, which are the most difficult problems sourced from competitions like AMC 10/12 and AIME. These problems require multi-step reasoning and often involve algebra, geometry, number theory, and combinatorics.

In [9]:
def get_math_questions(split="train") -> Dataset:
    # Load the raw dataset from the hub
    data = load_dataset(DATASET, 'default')[split]
    data = data.map(lambda x: {
            'prompt': [
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': x['problem']}
            ],
            'answer': x['solution'],
            'question': x['problem']
        }).remove_columns(['problem', 'solution','level','type'])
    return data

# Build the train and test splits:
train_dataset = get_math_questions(split="train")
test_dataset = get_math_questions(split="test")


In [10]:
print("train dataset",train_dataset)
print("test dataset",test_dataset)
print("train sample",train_dataset[0])

train dataset Dataset({
    features: ['prompt', 'answer', 'question'],
    num_rows: 2304
})
test dataset Dataset({
    features: ['prompt', 'answer', 'question'],
    num_rows: 1324
})
train sample {'prompt': [{'content': '\nRespond in the following format:\n/nothink \n<reasoning>\nBriefly explain your reasoning. Be concise and avoid unnecessary detail.\n</reasoning>\n<answer>\nanswer here\n</answer>\n', 'role': 'system'}, {'content': 'What is the range of the function $y = \\frac{x^2 + 3x + 2}{x+1}$?  (Express your answer using interval notation.)', 'role': 'user'}], 'answer': 'We can factor the numerator to get $y = \\frac{(x+1)(x+2)}{x+1}$. If we exclude the case where $x = -1$, the function is equivalent to $y = x+2$. However, because $x$ cannot equal $-1$, $y$ cannot equal 1. Therefore, the range is all real numbers except for 1, which we may write as $y \\in \\boxed{(-\\infty, 1)\\cup(1, \\infty)}.$', 'question': 'What is the range of the function $y = \\frac{x^2 + 3x + 2}{x+1}

## 3. Post-Training the Base Model Using GRPO



### 3.1 Loading the Baseline Model

To begin, we'll load [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B). With only 1.7 billion parameters, it is lightweight and fits within the available resources. However, for better results, a larger [alternative](https://qwenlm.github.io/blog/qwen3/) should be considered.

[Benchmark](https://dev.to/best_codes/qwen-3-benchmarks-comparisons-model-specifications-and-more-4hoa)



In [11]:
# Load the language model with optional 4-bit quantization and LoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL,              # Name or path of the pretrained model
    max_seq_length=max_seq_length,  # Sets maximum input size handled by the model
    load_in_4bit=True,              # Use 4-bit quantization to save GPU memory and speed up inference
    max_lora_rank=lora_rank,        # Sets the rank for the LoRA adaptation
    full_finetuning=False,          # If True, fine-tunes all weights; if False, only fine-tunes LoRA layers
    # fast_inference=True           # Optional: Enable vLLM-style fast inference (commented out here)
)


==((====))==  Unsloth 2025.4.7: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In 32-bit floating point, this model would require about 6.8 GB of memory for its weights (1.7B parameters × 4 bytes). Because we load it with 4-bit quantization (`load_in_4bit=True`), the weight memory footprint shrinks by roughly 8×:

- From 4 bytes per parameter → 0.5 bytes per parameter
- 1.7B parameters × 0.5 bytes ≈ 0.85 GB
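
The arithmetic above can be reproduced with a small helper. This is a rough sketch: `weight_memory_gb` is a hypothetical function, it uses 1 GB = 1e9 bytes, and it ignores quantization metadata, activations, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 1.7e9  # approximate parameter count of the model

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

The 32-bit and 4-bit rows correspond to the 6.8 GB and 0.85 GB figures quoted above.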

In [12]:
trainable_params = 0
all_param = 0
for _, param in model.named_parameters():
    num_params = param.numel()
    # if using DS Zero 3 and the weights are initialized empty
    if num_params == 0 and hasattr(param, "ds_numel"):
        num_params = param.ds_numel

    # Due to the design of 4bit linear layers from bitsandbytes
    # one needs to multiply the number of parameters by 2 to get
    # the correct number of parameters
    if param.__class__.__name__ == "Params4bit":
        if hasattr(param, "element_size"):
            num_bytes = param.element_size()
        elif not hasattr(param, "quant_storage"):
            num_bytes = 1
        else:
            num_bytes = param.quant_storage.itemsize
        num_params = num_params * 2 * num_bytes

    all_param += num_params
    if param.requires_grad:
        trainable_params += num_params

print(f"trainable params: {trainable_params:,} || all params: {all_param:,} || trainable%: {100 * trainable_params / all_param:.2f}")


trainable params: 37,872,640 || all params: 1,720,574,976 || trainable%: 2.20


### 3.2 Configuring LoRA

Next, we will configure LoRA for model training. This technique will allow us to efficiently fine-tune the model with a reduced number of parameters, enabling faster and more resource-efficient training.

In [13]:
# Apply LoRA (Low-Rank Adaptation) using PEFT to the base model
model = FastLanguageModel.get_peft_model(
    model,

    r=lora_rank,  # LoRA rank: Controls number of trainable parameters. Common values are 8, 16, 32, 64.
                  # Higher rank → more capacity to adapt → better accuracy but slower and more memory usage.

    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",        # Attention projections
        "gate_proj", "up_proj", "down_proj"           # Feed-forward projections
    ],
    # These modules are typically the most sensitive and impactful in transformer adaptation.

    lora_alpha=lora_rank,  # Scaling factor for LoRA updates. Alpha = rank or 2×rank is a common rule of thumb.

    lora_dropout=0,        # Dropout applied to LoRA layers.
                           # 0 is best for most tasks and is optimized in frameworks like Unsloth.

    bias="none",           # Bias configuration: "none", "all", or "lora_only".
                           # "none" is memory-efficient and recommended when biases don’t significantly affect results.

    use_gradient_checkpointing="unsloth",  # Use gradient checkpointing to save memory during backpropagation.
                                           # "unsloth" mode is optimized for long-context tasks (e.g., long reasoning chains).

    random_state=3407,     # Ensures reproducibility of LoRA weight initialization.

    use_rslora=False,      # If True, enables Rank-Stabilized LoRA (scales updates by alpha/sqrt(r) instead of alpha/r).
                           # Off here to keep configuration standard and stable.

    loftq_config=None      # Optional: If using LoftQ quantization-aware LoRA training.
                           # Not used here — defaulting to standard LoRA without LoftQ.
)


Unsloth 2025.4.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [14]:
model.print_trainable_parameters()


trainable params: 34,865,152 || all params: 1,755,440,128 || trainable%: 1.9861


After applying LoRA, the number of trainable parameters goes from 37,872,640 to 34,865,152, which is only about 2% of the model's 1,755,440,128 parameters. LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that dramatically reduces the number of trainable parameters compared with full fine-tuning while largely maintaining performance. Here's how to understand and calculate its parameter efficiency.

How LoRA Works

Instead of updating the entire weight matrix during fine-tuning, LoRA approximates weight updates using a low-rank decomposition:

$$W' = W + \Delta W = W + BA$$

Where:

- W is the original weight matrix (dimensions m×n)
- B is a matrix of dimension m×r
- A is a matrix of dimension r×n
- r is the rank (much smaller than m and n) (32 in this case)
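
The decomposition above can be sketched with plain Python lists on toy sizes. The helper names are hypothetical, and the initialization (small random `A`, all-zero `B`) follows the usual LoRA convention and is assumed here rather than read from the Unsloth code.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

random.seed(0)
m, n, r = 4, 6, 2  # toy sizes for illustration only

W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]     # frozen weights (m x n)
A = [[random.gauss(0, 0.01) for _ in range(n)] for _ in range(r)]  # trainable (r x n)
B = [[0.0] * r for _ in range(m)]                                  # trainable (m x r), zero at start

W_adapted = add(W, matmul(B, A))  # W' = W + BA
print(W_adapted == W)             # with B = 0, the adapted weights equal the original ones

trainable = m * r + r * n         # parameters in B and A
print(trainable, "trainable vs", m * n, "in the full matrix")
```

Only `A` and `B` receive gradient updates during training; `W` stays frozen, which is where the memory savings come from.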

Calculating Parameter Reduction

For each weight matrix W with dimensions m×n:

- Full fine-tuning parameters: m×n
- LoRA parameters: r(m+n) = (m×r) + (r×n)

For example, if you have a weight matrix of size 4096×4096 and r=16:

- Full parameters: 4096×4096 = 16,777,216
- LoRA parameters: 16×(4096+4096) = 131,072
- Reduction: ~99.2%

![image](https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dfbd169-eb7e-41e1-a050-556ccd6fb679_1600x672.png)
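
The same formula can be checked in code and applied to the seven target modules configured above. The layer shapes below are assumptions about the Qwen3-1.7B architecture (hidden size 2048, MLP size 6144, query projection width 2048, key/value projection width 1024); only the 28 layers and rank 32 come from this notebook, so verify the shapes against `model.config` before relying on them.

```python
def lora_params(m: int, n: int, r: int) -> int:
    """Trainable parameters LoRA adds to one m x n weight matrix: r * (m + n)."""
    return r * (m + n)

# Worked example from the text: 4096 x 4096 matrix with r = 16
full = 4096 * 4096
lora = lora_params(4096, 4096, 16)
print(f"full: {full:,} | LoRA: {lora:,} | reduction: {100 * (1 - lora / full):.1f}%")

# Assumed per-layer shapes (input dim, output dim) for the targeted projections
HIDDEN, MLP, Q_DIM, KV_DIM = 2048, 6144, 2048, 1024
LAYERS, RANK = 28, 32
shapes = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}
per_layer = sum(lora_params(m, n, RANK) for m, n in shapes.values())
total = per_layer * LAYERS
print(f"LoRA parameters across {LAYERS} layers: {total:,}")
```

Under these assumed shapes the total comes to 34,865,152, the same figure reported by `model.print_trainable_parameters()` above.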

### 3.3 Loading Reward Functions

In Group Relative Policy Optimization (GRPO), reward functions are essential because they score each generated output, and the policy is then updated by comparing those scores within a group of outputs sampled for the same prompt. Unlike RL setups that act on the absolute reward of a single action or trajectory, GRPO relies on these relative, within-group comparisons, so the rewards can come from simple programmatic checks or heuristics rather than a learned reward model.

In this case, we will use the following reward functions:

1. **Format Enforcement:** Ensures that the generation follows a specific format, using `<reasoning> </reasoning>` tags for the reasoning and `<answer> </answer>` tags for the final answer.

In [22]:
def tag_presence_reward(completions: List[List[dict]], **kwargs) -> List[float]:
    """Reward for presence of <reasoning> and <answer> tags"""

    print(completions[0])  # Debug: show the first completion of each batch
    rewards = []
    for completion in completions:
        content = completion[0]['content']
        has_reasoning = bool(re.search(r'<reasoning>.*?</reasoning>', content, re.DOTALL))
        has_answer = bool(re.search(r'<answer>.*?</answer>', content, re.DOTALL))
        reward = 0.5 * has_reasoning + 0.5 * has_answer
        rewards.append(reward)
    return rewards

2. **Solution Accuracy:** Checks whether each generated completion matches the expected ground-truth solution. If the parsed output is equivalent to the gold solution (e.g., the same mathematical expression), the completion receives a reward of 1.0; otherwise it receives 0.0. When the gold solution itself cannot be parsed, the example is given 1.0 so that it is not penalized. This helps reinforce accurate model behavior in tasks where precision is crucial, such as solving math problems or parsing structured data.

In [25]:
def accuracy_reward(completions: List[List[dict]], **kwargs) -> List[float]:
    """Reward function that checks if the completion is the same as the ground truth."""
    solutions = kwargs['answer']
    completion_contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, solution in zip(completion_contents, solutions):
        # Extract the first LaTeX expression from the gold solution and from the completion
        gold_parsed = parse(solution, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        answer_parsed = parse(content, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        if len(gold_parsed) != 0:
            try:
                # 1.0 if the two expressions are judged equivalent, else 0.0
                rewards.append(float(verify(answer_parsed, gold_parsed)))
            except Exception:
                rewards.append(0.0)
        else:
            # The gold solution could not be parsed: skip the example by rewarding 1.0
            rewards.append(1.0)
    return rewards

3. **Semantic Correctness:** Uses a cross-encoder model (`cross-encoder/stsb-roberta-base`) to score the similarity between each generated response and its reference answer. An empty response receives a reward of -1.0 to signal that no answer was produced.

In [26]:
# Load the cross-encoder once so it is not re-created on every reward call
model_ss = CrossEncoder('cross-encoder/stsb-roberta-base', device=device)

def semantic_correctness(completions: List[List[dict]], **kwargs) -> List[float]:
    """Semantic similarity between each response and its reference answer, scored by a cross-encoder."""
    answers = kwargs['answer']
    responses = [completion[0]["content"] for completion in completions]
    inputs = list(zip(responses, answers))
    with torch.no_grad():
        similarities = model_ss.predict(inputs, show_progress_bar=False)
    similarities = [float(s) for s in similarities]
    # Set similarity to -1 if the response is an empty string
    return [-1.0 if response == "" else similarity for response, similarity in zip(responses, similarities)]
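
Before training, it can help to sanity-check a reward function on hand-written completions. The sketch below repeats the format reward (without the debug print) so that it runs on its own; the toy completions are invented for illustration and follow the conversational format the reward functions expect, a list of message lists.

```python
import re
from typing import List

def tag_presence_reward(completions: List[List[dict]], **kwargs) -> List[float]:
    """Copy of the format reward above, without the debug print."""
    rewards = []
    for completion in completions:
        content = completion[0]["content"]
        has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", content, re.DOTALL))
        has_answer = bool(re.search(r"<answer>.*?</answer>", content, re.DOTALL))
        rewards.append(0.5 * has_reasoning + 0.5 * has_answer)
    return rewards

# Hypothetical completions: both tags, answer tag only, no tags
toy_completions = [
    [{"role": "assistant", "content": "<reasoning>2 + 2 = 4</reasoning>\n<answer>4</answer>"}],
    [{"role": "assistant", "content": "<answer>4</answer>"}],
    [{"role": "assistant", "content": "The answer is 4."}],
]
scores = tag_presence_reward(toy_completions)
print(scores)  # [1.0, 0.5, 0.0]
```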

Run the original model on a sample question to test the generation config:

In [17]:
# text = "Find all values of $x$ that satisfy the equation $|x-3|=2x+4$. Express your answers in simplest fractional form."
messages = [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': train_dataset[5]['question']}
        ]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 2048, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<reasoning>
To find the $y$-coordinate of the point on the $y$-axis equidistant from points $A(-2, 0)$ and $B(-1, 4)$, we set up the equation of equidistance. Let the point be $(0, y)$. The distance from $(0, y)$ to $A(-2, 0)$ is $\sqrt{(-2 - 0)^2 + (0 - y)^2} = \sqrt{4 + y^2}$. The distance from $(0, y)$ to $B(-1, 4)$ is $\sqrt{(-1 - 0)^2 + (4 - y)^2} = \sqrt{1 + (4 - y)^2}$. Setting these equal: $\sqrt{4 + y^2} = \sqrt{1 + (4 - y)^2}$. Squaring both sides: $4 + y^2 = 1 + (4 - y)^2$. Simplifying: $4 + y^2 = 1 + 16 - 8y + y^2$. Cancel $y^2$ and simplify: $4 = 17 - 8y$. Solving: $8y = 13$, so $y = \frac{13}{8}$. The answer is $\frac{13}{8}$.
</reasoning>
<answer>
$\frac{13}{8}$
</answer><|im_end|>


### 3.4 Configuring GRPO Training Parameters

To keep things simple, we’ll train for just 50 steps (`max_steps=50`) and set explicit values for `max_prompt_length` and `max_completion_length`. `num_generations` is not set here, so the trainer's default applies.

In [29]:

# Configuration for the GRPO training setup
training_args = GRPOConfig(
    # use_vllm=True,  # Optional: Use vLLM for fast inference, but not compatible with Qwen 3 models (commented out)

    lr_scheduler_type="cosine",  # Cosine learning rate scheduler for smooth and natural decaying
                                # Good for tasks where we want gradual adjustments of the learning rate.

    per_device_train_batch_size=1,  # Batch size per GPU; 1 for memory efficiency on small GPUs like T4
                                    # Larger batch size will use more memory but may speed up training.

    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients before performing an update.
                                    # Increase this if you want larger effective batch sizes without running out of memory.

    warmup_steps=5,  # Number of steps to warm-up the learning rate. A small value for a quick adjustment.
                    # You may want to adjust this if the model's training is unstable at the start.

    max_steps=50,  # Number of training steps. Typically set based on your available compute and the model’s convergence speed.
                   # For long training runs, this may need to be adjusted for fine-grained control.

    learning_rate=2e-4,  # Base learning rate. A moderate value to start with; can be reduced (e.g., to 2e-5) for longer runs.
                         # Lower values tend to work better for large models or fine-tuning tasks.

    optim="adamw_8bit",  # Use 8-bit AdamW optimizer to save memory and speed up training.
                         # Good choice for larger models where memory and speed are concerns.

    max_grad_norm=0.1,  # Gradient clipping to prevent exploding gradients; 0.1 is a strict threshold (the Trainer default is 1.0).
                       # You can adjust if training becomes unstable or to improve convergence.

    max_prompt_length=500,  # Maximum input length (tokens) for the model’s prompt.
                           # Useful to control memory usage and ensure you don’t exceed model’s input limit.

    max_completion_length=1024,  # Maximum output length (tokens) for generated completions.
                               # Adjust according to the expected complexity or verbosity of the model’s response.

    seed=3407,  # Random seed for reproducibility of experiments.

    report_to="wandb",  # Reporting to Weights & Biases

    output_dir="qwen3_1_7B_grpo_math",  # Directory where model checkpoints and logs are saved.
                             # You can adjust this to store results in a more appropriate location.
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8


Another useful parameter is `reward_weights`, which sets the weight for each reward function. The list must match the number of reward functions. If `None`, all rewards are weighted equally with weight `1.0`.

### 3.5 Training the Model 🏃

Now, let's configure the trainer and start training the model!


In [30]:
# Initialize the GRPOTrainer with the necessary configurations
trainer = GRPOTrainer(
    model=model,  # Pretrained language model that will be fine-tuned during training.

    processing_class=tokenizer,  # Tokenizer used to process input text for the model.
                                 # Ensures proper encoding and decoding of text into model-friendly formats.

    reward_funcs=[               # List of reward functions to evaluate and optimize the model’s outputs.
        tag_presence_reward,     # Reward function focusing on the presence of specific tags or keywords.
        semantic_correctness,    # Reward function evaluating how semantically accurate the model’s response is.
        accuracy_reward          # Reward function for the accuracy of model predictions, based on ground truth.
    ],

    args=training_args,          # Training configurations such as batch size, learning rate, etc. (from previous GRPOConfig).

    train_dataset=train_dataset, # Training dataset used to fine-tune the model.
                                # Should contain relevant examples for the task the model is being adapted for.
)
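
With several reward functions, the trainer needs a single scalar per completion. A minimal sketch of that combination, assuming a weighted sum controlled by `reward_weights` (equal weights of 1.0 when unset), is shown below. `combine_rewards` is a hypothetical helper, and the three input values are the mean per-function rewards logged at step 1 of the training run further down.

```python
def combine_rewards(rewards, weights=None):
    """Weighted sum of per-function rewards; equal weights of 1.0 by default."""
    if weights is None:
        weights = [1.0] * len(rewards)
    if len(weights) != len(rewards):
        raise ValueError("weights must match the number of reward functions")
    return sum(w * r for w, r in zip(weights, rewards))

# Order: tag_presence_reward, semantic_correctness, accuracy_reward (step 1 of the log)
step1 = [0.0, 0.591846, 0.5]

total_equal = combine_rewards(step1)
print(round(total_equal, 6))  # matches the logged total reward for step 1

# Hypothetical weights that emphasize accuracy over similarity
total_weighted = combine_rewards(step1, weights=[0.5, 1.0, 2.0])
print(round(total_weighted, 6))
```

Raising the weight of the accuracy reward is one way to keep the similarity score, which is non-zero for most completions, from dominating the training signal.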

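The sampling settings below (temperature 0.7, top-p 0.8, top-k 20) are the same ones used for the non-thinking inference test above. They are intended to reduce hallucinations while preserving fluency and diversity, which is critical for math-heavy or logic-intensive datasets like lighteval/MATH-Hard.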


In [32]:
# Modify the generation config
trainer.generation_config.temperature = 0.7
trainer.generation_config.top_p = 0.8
trainer.generation_config.top_k = 20
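
To make the three sampling parameters concrete, here is a simplified sketch of how temperature, top-k, and top-p narrow a next-token distribution. The function and the toy logits are hypothetical, and the real generation code differs in its details; this only illustrates the idea.

```python
import math

def filtered_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # 1) Temperature: values below 1 sharpen the distribution
    scaled = [x / temperature for x in logits]
    # 2) Top-k: keep only the k highest-scoring tokens
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability)
    peak = scaled[order[0]]
    exps = {i: math.exp(scaled[i] - peak) for i in order}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}
    # 3) Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for a 4-token vocabulary
dist = filtered_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.8)
print(dist)  # only the two most likely tokens remain
```

GRPO depends on the completions within a group differing from one another, so overly aggressive filtering (very low temperature or top-p) can leave every completion with the same reward and no learning signal.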

In [33]:
# Start the training process
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,304 | Num Epochs = 1 | Total steps = 50
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 34,865,152/7,000,000,000 (0.50% trained)
wandb: Currently logged in as: afaf to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


`generation_config` default values have been modified to match model-specific defaults: {'max_length': 40960, 'bos_token_id': 151643, 'eos_token_id': [151645, 151643]}. If this is not desired, please set these values explicitly.


[{'role': 'assistant', 'content': '<think>\n\n</think>\n\nWe are given the equation:\n\n$$\n\\frac{x}{y} + \\frac{y}{x} = 6\n$$\n\nLet’s rewrite this as:\n\n$$\n\\frac{x^2 + y^2}{xy} = 6\n$$\n\nMultiply both sides by $xy$:\n\n$$\nx^2 + y^2 = 6xy\n$$\n\nRearranging:\n\n$$\nx^2 - 6xy + y^2 = 0\n$$\n\nThis is a quadratic in $x$:\n\n$$\nx^2 - 6y x + y^2 = 0\n$$\n\nUse the quadratic formula:\n\n$$\nx = \\frac{6y \\pm \\sqrt{(6y)^2 - 4(1)(y^2)}}{2}\n= \\frac{6y \\pm \\sqrt{36y^2 - 4y^2}}{2}\n= \\frac{6y \\pm \\sqrt{32y^2}}{2}\n= \\frac{6y \\pm 4y\\sqrt{2}}{2}\n= 3y \\pm 2y\\sqrt{2}\n$$\n\nSo the two solutions are:\n\n$$\nx = y(3 + 2\\sqrt{2}) \\quad \\text{and} \\quad x = y(3 - 2\\sqrt{2})\n$$\n\nSince $y > x > 0$, we take the smaller value of $x$:\n\n$$\nx = y(3 - 2\\sqrt{2})\n$$\n\nNow compute $\\frac{x + y}{x - y}$:\n\n$$\n\\frac{x + y}{x - y} = \\frac{y(3 - 2\\sqrt{2}) + y}{y(3 - 2\\sqrt{2}) - y}\n= \\frac{y(3 - 2\\sqrt{2} + 1)}{y(3 - 2\\sqrt{2} - 1)}\n= \\frac{4 - 2\\sqrt{2}}{2 - 2\\sqr

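| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / tag_presence_reward | rewards / semantic_correctness | rewards / accuracy_reward |
|------|---------------|--------|------------|-------------------|----|-------------------------------|--------------------------------|---------------------------|
| 1 | -0.0 | 1.091846 | 0.547578 | 844.5 | 0.0 | 0.0 | 0.591846 | 0.5 |
| 2 | 0.0 | 0.593945 | 0.04936 | 932.125 | 0.0 | 0.0 | 0.593945 | 0.0 |
| 3 | -0.0 | 0.579495 | 0.049599 | 976.625 | 2.3e-05 | 0.0 | 0.579495 | 0.0 |
| 4 | 0.0 | 1.491293 | 0.174118 | 521.375 | 0.000222 | 0.9375 | 0.553793 | 0.0 |
| 5 | 0.0001 | 0.56844 | 0.023266 | 842.125 | 0.003074 | 0.0 | 0.56844 | 0.0 |
| 6 | 0.0066 | 1.367671 | 0.179944 | 92.625 | 0.16509 | 0.9375 | 0.430171 | 0.0 |
| 7 | 0.0065 | 1.963586 | 0.36567 | 173.0 | 0.161573 | 0.625 | 0.588586 | 0.75 |
| 8 | 0.0001 | 0.581754 | 0.045383 | 1024.0 | 0.003755 | 0.0 | 0.581754 | 0.0 |
| 9 | 0.0193 | 1.65548 | 0.391732 | 174.625 | 0.481988 | 1.0 | 0.53048 | 0.125 |
| 10 | 0.0267 | 1.540992 | 0.017998 | 97.125 | 0.668415 | 1.0 | 0.540992 | 0.0 |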


[{'role': 'assistant', 'content': '<think>\n\n</think>\n\nTo find the matrix $\\mathbf{M}$ satisfying the equation:\n\n$$\n\\mathbf{M}^3 - 4 \\mathbf{M}^2 + 5 \\mathbf{M} = \\begin{pmatrix} 10 & 20 \\\\ 5 & 10 \\end{pmatrix},\n$$\n\nwe recognize that this is a **linear equation** in $\\mathbf{M}$. Let’s denote:\n\n$$\n\\mathbf{A} = \\mathbf{M}.\n$$\n\nThen the equation becomes:\n\n$$\n\\mathbf{A}^3 - 4 \\mathbf{A}^2 + 5 \\mathbf{A} = \\begin{pmatrix} 10 & 20 \\\\ 5 & 10 \\end{pmatrix}.\n$$\n\nThis is a **linear equation** in $\\mathbf{A}$, and the left-hand side is a **linear transformation** of $\\mathbf{A}$. However, solving this equation directly for $\\mathbf{A}$ is non-trivial due to the cubic and quadratic terms.\n\nTo simplify, we can consider the **trace** and **determinant** of $\\mathbf{A}$, but this approach is not straightforward for a general 2x2 matrix.\n\nInstead, we can attempt to **guess** a form of $\\mathbf{M}$ that satisfies the equation. Let’s assume $\\mathbf{M}$ 

ERROR:math_verify.grader:Timeout during comparison


[{'role': 'assistant', 'content': '<think>\n\n</think>\n\n<reasoning>Since 3^2 = 9, 3^{2n} is congruent to 0 modulo 9. Therefore, 3^{2n} + 4 is congruent to 4 modulo 9. The inverse of 4 modulo 9 is a number x such that 4x ≡ 1 mod 9. Testing x=7: 4*7=28≡1 mod 9. So the inverse is 7. Thus, a ≡ 7 mod 9.</reasoning>\n<answer>7</answer>'}]
[{'role': 'assistant', 'content': "<think>\n\n</think>\n\n<reasoning>Since the equation involves $(2+i)$ raised to various powers, we can factor out $(2+i)^2$ from the terms involving $(2+i)^4$ and $(2+i)^3$. Let's rewrite the equation as $a(2+i)^4 + b(2+i)^3 + c(2+i)^2 + b(2+i) + a = 0$. Factoring out $(2+i)^2$, we get $(2+i)^2 [a(2+i)^2 + b(2+i) + c + \\frac{b}{(2+i)} + \\frac{a}{(2+i)}] = 0$. However, this approach may not be straightforward. Instead, let's substitute $x = 2 + i$ and rewrite the equation as $a x^4 + b x^3 + c x^2 + b x + a = 0$. Since $x = 2 + i$, we can compute $x^2 = (2 + i)^2 = 4 + 4i + i^2 = 3 + 4i$, $x^3 = x^2 \\cdot x = (3 + 4i)(

TrainOutput(global_step=50, training_loss=0.013173042598996858, metrics={'train_runtime': 4226.4256, 'train_samples_per_second': 0.095, 'train_steps_per_second': 0.012, 'total_flos': 0.0, 'train_loss': 0.013173042598996858})

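**Normal GRPO Training Behavior: Loss Starting at Zero Then Increasing**

This is normal. Initially, the policy $\pi_\theta$ equals the reference policy $\pi_{\text{ref}}$, so their KL divergence is zero and the loss starts at zero.
During training, as $\pi_\theta$ is optimized toward higher rewards, it diverges from $\pi_{\text{ref}}$, so the KL penalty term $\beta\, D_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]$ grows, which increases the loss.
The simplified math shows this:

- When $\pi_{\theta_{\text{old}}} = \pi_\theta$ (single exploration step), the probability ratio is 1.
- The normalized advantages sum to zero within a group, $\sum_i \hat{A}_i = 0$, so the value of the advantage term vanishes (its gradient does not, which is why the policy still learns).
- The objective reduces to $J_{\text{GRPO}}(\theta) = -\beta\, D_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]$, so the reported loss is $-J_{\text{GRPO}}(\theta) = \beta\, D_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}] \geq 0$.

An increasing loss therefore shows that the policy is moving away from the reference, constrained by the KL penalty. To confirm that it is moving toward higher rewards, check the reward columns rather than the loss.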
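The argument above can be checked numerically with a minimal sketch. It assumes one scalar ratio, advantage, and KL estimate per completion (the real loss works per token and applies clipping), and `beta=0.04` is only an illustrative value, not necessarily the one used in this notebook. The function name `grpo_loss_value` is hypothetical.

```python
def grpo_loss_value(advantages, ratios, kl_values, beta):
    """Value of a simplified, unclipped GRPO loss for one group of completions."""
    # Advantage-weighted surrogate term, averaged over the group
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(advantages)
    # Mean KL estimate between the policy and the reference policy
    kl = sum(kl_values) / len(kl_values)
    # The trainer minimizes the negative of the objective
    return -(surrogate - beta * kl)


# Normalized advantages of one group: they sum to zero by construction
advantages = [1.5, -0.5, -1.0, 0.0]
# pi_theta == pi_theta_old, so every probability ratio equals 1
ratios = [1.0, 1.0, 1.0, 1.0]

# First step: the policy still equals the reference, so KL is zero
loss_start = grpo_loss_value(advantages, ratios, [0.0] * 4, beta=0.04)
# Later step: the policy has drifted from the reference (KL = 0.5 is made up)
loss_later = grpo_loss_value(advantages, ratios, [0.5] * 4, beta=0.04)

print(loss_start, loss_later)
```

In this sketch the first loss is zero and the second equals `beta * 0.5`, which mirrors how the `Training Loss` column tracks the `kl` column in the log.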

### 3.6 Saving the Model in float16 for vLLM
We can save the merged model directly in float16. Set `save_method` to `merged_16bit` for float16 or `merged_4bit` for int4. Saving only the LoRA adapters is also supported as a fallback. Use `push_to_hub_merged` to upload the result to your Hugging Face account!

In [94]:
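user_name = "YOUR_USER_NAME"  # Replace with your Hugging Face username
if True:  # Set to False to skip saving and uploading
    # Merge the LoRA weights into the base model and save locally in float16
    model.save_pretrained_merged(NEW_MODEL, tokenizer, save_method="merged_16bit")
    # Upload the merged float16 model to the Hugging Face Hub
    model.push_to_hub_merged(f"{user_name}/{NEW_MODEL}", tokenizer, save_method="merged_16bit")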



Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 1.4G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 2.01 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:02<00:00, 13.07it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving Qwen3_1.7B-GRPO-math-reasoning/pytorch_model.bin...
Done.


Unsloth: You are pushing to hub, but you passed your HF username = Afaf.
We shall truncate Afaf/Qwen3_1.7B-GRPO-math-reasoning to Qwen3_1.7B-GRPO-math-reasoning


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 1.53 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 52.74it/s]


Unsloth: Saving tokenizer...

  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

 Done.
Unsloth: Saving Qwen3_1.7B-GRPO-math-reasoning/pytorch_model.bin...


README.md:   0%|          | 0.00/587 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

pytorch_model.bin:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/Afaf/Qwen3_1.7B-GRPO-math-reasoning


We observe that the model demonstrates some reasoning capabilities, although these are limited. This can be attributed to several factors: the use of a small model, a limited subset of the dataset, and a short training duration to keep the process simple and practical for a notebook environment.

Despite these constraints, this technique shows great promise. The release of DeepSeek-R1 and the adoption of this training approach could lead to significant breakthroughs in the coming months!

## 4. Test the model

If you did not manage to finish the training, you can load the fine-tuned model from the Hub and run it on the test problems.



In [98]:
# Load the fine-tuned model and its tokenizer from the Hub
model_inf, tokenizer_inf = FastLanguageModel.from_pretrained(
    model_name="Afaf/Qwen3_1.7B-GRPO-math-reasoning",  # Name or path of the pretrained model
    max_seq_length=max_seq_length,  # Sets maximum input size handled by the model
    max_lora_rank=lora_rank,        # Sets the rank for the LoRA adaptation
    full_finetuning=False,          # If True, fine-tunes all weights; if False, only fine-tunes LoRA layers
    fast_inference=True,            # Enables fast inference
)

Let us test it on an exercise from the Math&Maroc competition:

In [36]:
import re
from pypdf import PdfReader  # Assumption: PdfReader comes from pypdf

# Extract text from the first page only (the English version)
reader = PdfReader("MMC_2024_day1.pdf")
all_text = "\n".join(page.extract_text() for page in reader.pages[:1])

# Match each "Problem N:" header and capture the text up to the next header
pattern = r"Problem\s*(\d+)\s*:(.*?)(?=Problem\s*\d+\s*:|$)"
problems = re.findall(pattern, all_text, re.DOTALL)

# problems is a list of tuples: (problem_number, problem_text)
for num, statement in problems:
    print(f"Problem {num}:\n{statement.strip()}\n")

In [149]:
from transformers import TextStreamer

# problems[0] is a (number, statement) tuple, so pass only the statement text
messages = [
    {"role": "user", "content": problems[0][1].strip()}
]
# Use model_inf / tokenizer_inf instead if you loaded the model from the Hub above
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # Must add for generation
    enable_thinking=False,       # Disable thinking
)

_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=2048,  # Increase for longer outputs!
    temperature=0.7, top_p=0.8, top_k=20,  # For non thinking
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

To show that the function $ f: \mathbb{R} \to \mathbb{R} $, which is differentiable and satisfies $ f(x)^2 + (f'(x))^2 \neq 0 $ for all $ x \in \mathbb{R} $, has at most a finite number of zeros in any bounded subset $ E \subseteq \mathbb{R} $, we proceed as follows:

---

### Step 1: Consider the function $ g(x) = f(x)^2 + (f'(x))^2 $

We are given that $ g(x) \neq 0 $ for all $ x \in \mathbb{R} $. Since $ g(x) $ is continuous (as $ f $ is differentiable), and it is never zero, it must be strictly positive or strictly negative everywhere. However, the problem states that $ g(x) \neq 0 $, so it is not zero at any point.

---

### Step 2: Consider the derivative of $ g(x) $

Compute the derivative of $ g(x) $:

$$
g'(x) = 2f(x)f'(x) + 2f'(x) \cdot f''(x) = 2f'(x)(f(x) + f''(x))
$$

But this is not directly useful. Instead, consider the following:

---

### Step 3: Use the Mean Value Theorem

Let’s suppose $ f(x) = 0 $ at some point $ x_0 \in E $. Then $ f'(x_0) \neq 0 $, since $ f(x)^2 

## 🏁 5. Team Exercise: GRPO Understanding Challenge

### Overview
This exercise tests your understanding of Group Relative Policy Optimization (GRPO) without requiring additional model training. Your team's performance will be evaluated on a leaderboard to determine the best understanding of the concepts.

### 🧩 Exercise: Designing a GRPO-Based Fine-Tuning Strategy

#### Task Description
Your team must design an improved GRPO fine-tuning strategy for a math reasoning model by making strategic decisions about various components of the pipeline.

#### Requirements

1. **Reward Function Design (35 points)**
   - Design a comprehensive set of reward functions for math reasoning
   - Explain how each reward function addresses a specific aspect of high-quality math solutions
   - Justify the relative weighting of different reward components

2. **Training Configuration Optimization (35 points)**
   - Recommend specific adjustments to the training hyperparameters
   - Justify each adjustment with clear reasoning
   - Provide a complete `GRPOConfig` code snippet with your optimized values

3. **Evaluation Methodology (30 points)**
   - Design a robust evaluation protocol for your GRPO-trained model
   - Specify metrics to track during and after training
   - Describe how you would determine if the GRPO approach is working better than simpler alternatives
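
As a starting point for the reward function requirement, here is a minimal sketch of two reward functions and a weighted combination. It assumes the conversational completion format seen in the training outputs (a list holding one assistant message) and the `<reasoning>`/`<answer>` tags that appear there. The names `format_reward` and `exact_answer_reward`, the `answer` column, and the weights are hypothetical and are not the functions used earlier in this notebook.

```python
import re

# Assumed output layout: <reasoning>...</reasoning> followed by <answer>...</answer>
FORMAT_PATTERN = re.compile(
    r"<reasoning>.*?</reasoning>\s*<answer>(.*?)</answer>", re.DOTALL
)


def format_reward(completions, **kwargs):
    """Return 1.0 for each completion that contains both tag pairs, else 0.0."""
    scores = []
    for completion in completions:
        text = completion[0]["content"]
        scores.append(1.0 if FORMAT_PATTERN.search(text) else 0.0)
    return scores


def exact_answer_reward(completions, answer, **kwargs):
    """Return 1.0 when the text inside <answer> equals the reference string."""
    scores = []
    for completion, reference in zip(completions, answer):
        match = FORMAT_PATTERN.search(completion[0]["content"])
        predicted = match.group(1).strip() if match else None
        scores.append(1.0 if predicted == str(reference).strip() else 0.0)
    return scores


demo_completions = [
    [{"role": "assistant", "content": "<reasoning>4*7=28</reasoning>\n<answer>7</answer>"}],
    [{"role": "assistant", "content": "The answer is 7."}],
]
format_scores = format_reward(demo_completions)
answer_scores = exact_answer_reward(demo_completions, answer=["7", "7"])

# Hypothetical weighting: correctness counts twice as much as formatting
combined = [0.5 * f + 1.0 * a for f, a in zip(format_scores, answer_scores)]
print(format_scores, answer_scores, combined)
```

Exact string matching is deliberately strict. A submission would likely need a more tolerant comparison (for example, treating `0.5` and `1/2` as equal) and should justify the chosen weights.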

### 📊 Evaluation Criteria

Your submission will be evaluated on:

1. **Technical Correctness:** Proper understanding of GRPO concepts
2. **Innovation:** Novel but practical ideas for improving the training process
3. **Implementation Feasibility:** How feasible your approach is to implement
4. **Justification Quality:** The depth and clarity of your reasoning

### 📝 Submission Format

Submit a Markdown or Python file containing:

```python
# Team Name: [Your Team Name]
# Team Members: [List of team members]

"""
REWARD FUNCTION DESIGN
---------------------
[Your detailed response here]
"""

"""
TRAINING CONFIGURATION OPTIMIZATION
---------------------------------
[Your detailed response here]
"""

"""
EVALUATION METHODOLOGY
--------------------
[Your detailed response here]
"""

# Bonus: Sample code snippet for any one component of your solution
```

### 🏆 Leaderboard Assessment

Your team's submission will be evaluated using a scoring system that assigns points based on:

1. **Correctness Score (50%):** Assessment of technical accuracy and proper GRPO understanding
2. **Innovation Score (30%):** Originality and effectiveness of your proposed strategies
3. **Clarity Score (20%):** Clear articulation and organization of ideas

The total score (100 points maximum) will determine your team's position on the leaderboard.

### 📋 Hints for Success

- Focus on the unique aspects of GRPO compared to other methods like PPO and DPO
- Consider the specific challenges of math reasoning when designing rewards
- Think about scalability and computational efficiency
- Review the implementation details from the notebook carefully
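
To make the first hint concrete, the sketch below computes group-relative advantages as described in the DeepSeekMath paper: each completion's reward is normalized by the mean and standard deviation of the rewards of the completions sampled for the same prompt. The function name and the `eps` value are assumptions chosen for illustration, and the population standard deviation is used here for simplicity.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of completions for a single prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two wrong and two correct completions: clear positive and negative signal
mixed = group_relative_advantages([0.0, 0.0, 1.0, 1.0])
# Every completion gets the same reward: all advantages are zero
uniform = group_relative_advantages([0.5, 0.5, 0.5, 0.5])

print(mixed)
print(uniform)
```

The second case shows why reward design matters: when all completions in a group score the same, the advantages are zero and that prompt contributes no learning signal through the advantage term. A low `reward_std` in the training log points to this situation.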


## ⚠️ Important Disclaimer

**This notebook demonstrates a simplified implementation of GRPO fine-tuning.** For production-level applications, this approach should be significantly expanded and refined. Specifically:

1. The training duration (50 steps) is far too short for meaningful learning
2. The reward functions are basic implementations that would benefit from more sophisticated alternatives
3. Proper group construction requires careful analysis of your specific dataset
4. Larger models (7B+ parameters) typically yield better reasoning capabilities
5. More extensive evaluation on diverse test sets is essential for real-world deployment

For research or production applications, we recommend:
- Increasing training steps by at least 100x
- Using more sophisticated reward modeling techniques
- Implementing proper group balancing and sampling strategies
- Considering multi-stage training approaches (SFT → GRPO)
- Developing robust evaluation suites for mathematical reasoning

The simplicity of this notebook is designed for educational purposes and to fit within Colab's resource constraints.