# Applying GRPO to DeepSeek-R1-Distill-Qwen-1.5B with LIMO Dataset

This notebook provides a step-by-step tutorial for applying **Group Relative Policy Optimization (GRPO)** to the distilled model **DeepSeek-R1-Distill-Qwen-1.5B** using the high-quality LIMO dataset. We will cover:

1. **Setup & Installation** – Installing dependencies and verifying GPU availability.
2. **Model & Dataset Preparation** – Loading the model, tokenizer, and dataset, and formatting prompts.
3. **Reinforcement Learning Fine-Tuning (GRPO)** – Implementing a simplified GRPO training loop, including reward computation and KL regularization.
4. **Evaluation & Performance Metrics** – Demonstrating how to evaluate the fine-tuned model on benchmark tasks.
5. **Hyperparameter Ablations & Future Directions** – Discussion on tuning and potential improvements.

Let's begin!

## 1. Setup & Installation

We first install the necessary libraries: Hugging Face Transformers, TRL (for reinforcement learning), Datasets, Accelerate, and bitsandbytes for the 8-bit optimizer. PyTorch is not part of the `pip` command below and is assumed to be present in the runtime already. Then, we verify that a GPU is available.

In [None]:
!pip install transformers==4.48.2 trl==0.15.0.dev0 datasets bitsandbytes accelerate

In [None]:
import torch
print("Torch version:", torch.__version__)
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    print("GPU detected:", device_name)
    # Enable TF32 for faster matrix multiplication on supported GPUs
    torch.backends.cuda.matmul.allow_tf32 = True
else:
    print("No GPU found. Please enable a GPU runtime for training.")

## 2. Model & Dataset Preparation

We now load the **DeepSeek-R1-Distill-Qwen-1.5B** model and its tokenizer from Hugging Face, and load the LIMO dataset. The dataset consists of high-quality reasoning samples with a `question`, a detailed `solution`, and the final `answer`.

We also define a helper function `format_prompt` that formats the question into a prompt instructing the model to output a reasoning chain and final answer using the tags `<think>` and `<answer>`.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    # Uncomment the following line if the model requires custom code
    # trust_remote_code=True,
)

# Quick test generation
prompt_test = "What is the capital of France?"
inputs_test = tokenizer(prompt_test, return_tensors="pt").to(model.device)
outputs_test = model.generate(**inputs_test, max_new_tokens=10)
print("Test output:", tokenizer.decode(outputs_test[0], skip_special_tokens=True))

In [None]:
from datasets import load_dataset

# Load the LIMO dataset
dataset = load_dataset("GAIR/LIMO")
train_data = dataset["train"]
print("Total training samples:", len(train_data))

# Display a sample
sample = train_data[0]
print("Question:", sample["question"])
print("Solution (excerpt):", sample["solution"][:100] + "...")
print("Answer:", sample["answer"])

In [None]:
def format_prompt(question):
    """
    Format the prompt to instruct the model to output a chain-of-thought and final answer.
    """
    instruction = (
        "Solve the following problem step by step, then give the final answer. "
        "Format your response as: <think>[reasoning]</think><answer>[final answer]</answer>."
    )
    return f"{instruction}\nQuestion: {question}\nSolution:"

# Test the formatting
formatted_prompt = format_prompt(sample["question"])
print(formatted_prompt)

## 3. Reinforcement Learning Fine-Tuning (GRPO)

In this section, we implement a simplified GRPO training loop. The main steps include:

- **Sampling:** For each prompt, we generate multiple outputs (a group) from the model.
- **Reward Scoring:** Compute a reward for each output based on answer accuracy and proper formatting.
- **Advantage Calculation:** Compute the advantage by comparing each reward to the group average.
- **Policy Optimization:** Update the model weights using the advantage-weighted log-likelihood loss along with a KL divergence penalty to keep the model close to the reference (base) policy.
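
The advantage step above can be illustrated on a toy group of rewards. This is a minimal sketch in plain Python: `group_advantages` is a hypothetical helper name, the reward values are made up, and the optional standard-deviation scaling is an assumption taken from common GRPO formulations rather than something the loop in this notebook uses.

```python
from statistics import mean, pstdev

def group_advantages(rewards, normalize=False, eps=1e-8):
    """Centre rewards on the group mean; optionally scale by the group std."""
    baseline = mean(rewards)
    centred = [r - baseline for r in rewards]
    if not normalize:
        return centred
    # Assumption: population std plus a small epsilon to avoid division by zero
    scale = pstdev(rewards) + eps
    return [c / scale for c in centred]

# Made-up rewards for a group of four sampled outputs
rewards = [1.1, 0.1, 0.1, 0.0]
advantages = group_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Outputs that score above the group mean receive a positive advantage and the rest a negative one. When every output in a group earns the same reward, all advantages are zero and the prompt contributes no policy gradient.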

We use a default learning rate of `1e-6`, a group size of 7, and a KL weight `β = 0.04`. For memory efficiency, we use the 8-bit AdamW optimizer from bitsandbytes, which stores the optimizer states (not the model parameters) in 8 bits.
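
To make the role of `β` concrete, here is a small sketch of one per-token KL estimator that is often used for this kind of penalty: `exp(d) - d - 1` with `d = ref_logp - logp`. The function name `kl_k3` and the log-probability values are hypothetical; the sketch only shows how the penalty would be scaled by `β`, not how any particular library computes it.

```python
import math

def kl_k3(logp_tokens, ref_logp_tokens):
    """Mean per-token estimate exp(d) - d - 1, where d = ref_logp - logp."""
    terms = []
    for lp, ref_lp in zip(logp_tokens, ref_logp_tokens):
        d = ref_lp - lp
        terms.append(math.exp(d) - d - 1.0)
    return sum(terms) / len(terms)

beta = 0.04
# Made-up log-probabilities of three sampled tokens under each model
policy_logps = [-0.2, -1.5, -0.7]
reference_logps = [-0.3, -1.2, -0.7]

penalty = beta * kl_k3(policy_logps, reference_logps)
print(f"KL penalty term: {penalty:.6f}")
```

Each term is non-negative and equals zero when the two models assign the same probability to a token, so the penalty grows only as the policy moves away from the reference.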

In [None]:
import math
from torch.optim import AdamW  # Standard AdamW (the Transformers AdamW class is deprecated)

# Hyperparameters
learning_rate = 1e-6
tokens_per_generation = 4096  # Maximum tokens per generation (can be ablated)
group_size = 7
beta = 0.04

# Initialize the 8-bit AdamW optimizer (using bitsandbytes)
import bitsandbytes as bnb
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=learning_rate)

# Optionally, use standard 32-bit AdamW:
# optimizer = AdamW(model.parameters(), lr=learning_rate)

# Clone the initial model to serve as the reference for KL divergence
from transformers import AutoModelForCausalLM
ref_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(model.device)
ref_model.eval()
for param in ref_model.parameters():
    param.requires_grad = False

def reward_function(question, generated_text, true_answer):
    """
    A simple rule-based reward:
      - +0.1 bonus if output contains both <think> and <answer> tags
      - +1.0 if the extracted answer matches the true answer
      - -0.1 penalty if no answer can be extracted (empty output)
    """
    answer = None
    if "<answer>" in generated_text and "</answer>" in generated_text:
        start = generated_text.index("<answer>") + len("<answer>")
        # Search for the closing tag only after the opening tag
        end = generated_text.find("</answer>", start)
        if end != -1:
            answer = generated_text[start:end].strip()
    if answer is None:
        # Fallback: take the last whitespace-separated token, if there is one
        tokens = generated_text.split()
        answer = tokens[-1] if tokens else None

    reward = 0.0
    # Bonus for proper formatting
    required_tags = ("<think>", "</think>", "<answer>", "</answer>")
    if all(tag in generated_text for tag in required_tags):
        reward += 0.1

    # Reward based on answer accuracy
    if answer is not None:
        pred_ans = answer.strip().strip('.')
        true_ans = str(true_answer).strip().strip('.')
        if pred_ans == true_ans:
            reward += 1.0
    else:
        reward -= 0.1

    return reward

print("Optimizer and reward function set up.")

In [None]:
import random

model.train()
max_train_steps = 2  # Demo steps; in practice, use many more steps
grad_accum_steps = 8  # Effective batch: grad_accum_steps * group_size

# Shuffle training indices
indices = list(range(len(train_data)))
random.shuffle(indices)

step = 0
optimizer.zero_grad()

def completion_logprobs(lm, full_ids, prompt_len):
    """Per-token log-probabilities of the completion tokens under `lm`."""
    logits = lm(full_ids).logits[:, :-1, :].float()
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Logit position t predicts token t + 1, so completion tokens start at prompt_len - 1
    return token_logps[:, prompt_len - 1:]

batch_indices = indices[: max_train_steps * grad_accum_steps]
# Count micro-batches with enumerate; `idx` is a shuffled dataset index, not a step counter
for micro_step, idx in enumerate(batch_indices, start=1):
    question = train_data[idx]["question"]
    true_answer = train_data[idx]["answer"]
    prompt = format_prompt(question)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    prompt_len = input_ids.shape[1]

    # Generate a group of outputs, keeping the token ids to avoid re-tokenizing later
    sequences, generated_texts = [], []
    for _ in range(group_size):
        output_ids = model.generate(
            input_ids,
            max_new_tokens=200,  # For demo; in practice, use tokens_per_generation
            do_sample=True,
            temperature=1.0,
            stop_strings=["</answer>"],  # "</answer>" is not a single vocabulary token
            tokenizer=tokenizer,  # Required by generate() when stop_strings is set
        )
        sequences.append(output_ids)
        generated_texts.append(tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True))

    # Compute rewards and advantages
    rewards = [reward_function(question, text, true_answer) for text in generated_texts]
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]

    # Compute the policy loss and the KL penalty from the same forward pass
    policy_loss = 0.0
    kl_loss = 0.0
    for seq, adv in zip(sequences, advantages):
        logp = completion_logprobs(model, seq, prompt_len)  # Keeps gradients
        with torch.no_grad():  # Only the reference model is gradient-free
            ref_logp = completion_logprobs(ref_model, seq, prompt_len)
        # Advantage-weighted negative log-likelihood, summed over output tokens
        policy_loss = policy_loss - adv * logp.sum()
        # Non-negative per-token KL estimate: exp(d) - d - 1 with d = ref_logp - logp
        log_ratio = ref_logp - logp
        kl_loss = kl_loss + (torch.exp(log_ratio) - log_ratio - 1).mean()
    policy_loss = policy_loss / group_size
    kl_loss = kl_loss / group_size

    # Scale so that accumulated gradients average over the micro-batches
    total_loss = (policy_loss + beta * kl_loss) / grad_accum_steps
    total_loss.backward()

    if micro_step % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        print(f"Step {step}: policy_loss={policy_loss.item():.4f}, kl_loss={kl_loss.item():.4f}, rewards={rewards}")
        if step >= max_train_steps:
            break

model.eval()
print("Training demo completed.")

## 4. Evaluation & Performance Metrics

After fine-tuning, the model can be evaluated on reasoning benchmarks (e.g., AIME24, GPQA, MATH-500). This demo runs the evaluation procedure on two placeholder questions that stand in for a benchmark set.

The process involves:

- Formatting the prompt as during training.
- Generating an answer using greedy decoding.
- Extracting the answer using the `<answer>` tags and comparing it with the ground truth.
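
The extraction and comparison steps above can be sketched as two small helpers. This is an illustrative sketch, not the notebook's own code: `extract_answer` and `is_correct` are hypothetical names, and falling back to the last whitespace-separated token mirrors the heuristic used elsewhere in this notebook.

```python
import re

# Matches the contents of the first <answer>...</answer> pair, across newlines
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(text):
    """Return the first tagged answer, else the last token, else None."""
    match = ANSWER_RE.search(text)
    if match:
        return match.group(1).strip()
    tokens = text.split()
    return tokens[-1] if tokens else None

def is_correct(predicted, truth):
    """Exact string match after trimming whitespace and surrounding periods."""
    if predicted is None:
        return False
    return predicted.strip().strip(".") == str(truth).strip().strip(".")

output_text = "<think>x + y = 10 and x - y = 2, so 2x = 12.</think><answer> 6 </answer>"
print(is_correct(extract_answer(output_text), "6"))
```

Exact string matching is strict: `"6.0"` and `"6"` would count as different answers, so numeric benchmarks may need a more forgiving comparison.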

In [None]:
# Example evaluation in the style of a benchmark run (e.g., AIME24)
# The questions below are simple placeholders, not actual AIME24 problems

aime_questions = [
    "If x + y = 10 and x - y = 2, what is the value of x?",
    "Compute the area of a circle with radius 7."
]
aime_answers = [
    "6",  # x = 6
    "153.938"  # Approximate area (could be rounded)
]

model.eval()
correct = 0
for question, true_answer in zip(aime_questions, aime_answers):
    prompt = format_prompt(question)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # Greedy decoding
    output_text = tokenizer.decode(output_ids[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    
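    if "<answer>" in output_text and "</answer>" in output_text:
        ans = output_text.split("<answer>")[1].split("</answer>")[0].strip()
    else:
        # Fallback: last whitespace-separated token, or an empty string for empty output
        tokens = output_text.split()
        ans = tokens[-1] if tokens else ""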
    
    print(f"Question: {question}")
    print(f"Predicted Answer: {ans}")
    print(f"True Answer: {true_answer}\n")
    
    if str(ans).strip().strip('.') == str(true_answer).strip().strip('.'):
        correct += 1

accuracy = correct / len(aime_questions) * 100
print(f"Accuracy on placeholder questions: {accuracy:.1f}%")

## 5. Hyperparameter Ablations & Future Directions

### Hyperparameter Ablations

Key hyperparameters that can be tuned include:

- **Learning Rate:** Our default is `1e-6`, but values like `2e-6`, `4e-6`, or `8e-6` may be experimented with.
- **Group Size:** Number of outputs per prompt (default is 7). Increasing this (e.g., 14, 28, or 56) can provide a more robust reward baseline but at higher computational cost.
- **KL Weight (β):** Default is `0.04`. Lower values (e.g., 0.01 or 0.001) give the model more freedom to explore but increase the risk that the policy drifts far from the reference model or that training becomes unstable.
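
One way to organize such ablations is a one-factor-at-a-time sweep: start from the defaults and vary a single hyperparameter per run. The sketch below only builds the list of configurations from the values mentioned above; `one_at_a_time` and the dictionary keys are hypothetical names, and launching the runs is left out.

```python
defaults = {"learning_rate": 1e-6, "group_size": 7, "beta": 0.04}

# Alternative values taken from the list above
sweeps = {
    "learning_rate": [2e-6, 4e-6, 8e-6],
    "group_size": [14, 28, 56],
    "beta": [0.01, 0.001],
}

def one_at_a_time(defaults, sweeps):
    """Return the default config plus one config per alternative value."""
    configs = [dict(defaults)]
    for name, values in sweeps.items():
        for value in values:
            configs.append({**defaults, name: value})
    return configs

configs = one_at_a_time(defaults, sweeps)
for cfg in configs:
    print(cfg)
```

This yields 9 configurations instead of the 48 a full grid over the same values would need, at the cost of not measuring interactions between hyperparameters.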

### Future Directions

- **Refining the Reward Function:** Improve extraction of the final answer and consider partial rewards for nearly correct outputs.
- **Adaptive KL Penalty:** Use adaptive techniques to adjust β based on the observed KL divergence during training.
- **Scaling Up:** Experiment with larger models or longer generation lengths to fully exploit the model's reasoning capabilities.
- **Distillation vs. Pretrained Models:** Compare training outcomes when starting from a distilled model versus a base pretrained model.
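
As an illustration of the adaptive KL idea, the sketch below implements a simple proportional controller of the kind used in some PPO-based fine-tuning setups: β is nudged up when the observed KL exceeds a target and down when it falls below. The class name `AdaptiveBeta` and the values of `target_kl` and `horizon` are assumptions chosen for illustration, not settings from this notebook.

```python
class AdaptiveBeta:
    """Proportional controller that adjusts the KL weight toward a target KL."""

    def __init__(self, init_beta=0.04, target_kl=0.05, horizon=1000):
        self.beta = init_beta
        self.target_kl = target_kl  # Hypothetical target value
        self.horizon = horizon      # Larger horizon means slower adjustment

    def update(self, observed_kl, n_steps=1):
        # Relative error, clipped so a single noisy estimate cannot move beta much
        error = observed_kl / self.target_kl - 1.0
        error = max(-0.2, min(0.2, error))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

controller = AdaptiveBeta()
for observed_kl in [0.10, 0.10, 0.02]:
    print(f"observed_kl={observed_kl:.2f} -> beta={controller.update(observed_kl):.6f}")
```

In a training loop, `controller.update(kl_loss.item())` would be called after each optimizer step and the returned value used as `beta` for the next one.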

This concludes our step-by-step guide. Happy fine-tuning!