In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and set random seeds for reproducibility

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Understanding GRPO: From PPO to Group-Relative Advantages -- Vizuara

## 1. Why Does This Matter?

Have you ever wondered how models like DeepSeek-R1 learn to reason step by step? The answer involves reinforcement learning, but not the kind you might expect.

Traditional RLHF (Reinforcement Learning from Human Feedback) uses PPO -- an algorithm that requires a **critic network** as large as the language model itself. For a 7B parameter model, this means keeping ~28B parameters in memory across four separate models (policy, reference policy, reward model, and critic). That is expensive.

**Group-Relative Policy Optimization (GRPO)** eliminates the critic entirely by using a beautifully simple idea: instead of learning to predict how good a response is, just generate multiple responses and compare them to each other.

In this notebook, you will:
- Understand why PPO needs a critic and why that is expensive
- Build the core intuition behind group-relative advantages
- Implement advantage computation from scratch
- Visualize how group normalization works compared to a learned critic

Let us start by importing our tools.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple

print("All imports successful!")

## 2. Building Intuition

### The Teacher Grading Essays

Imagine a teacher grading 30 essays. There are two approaches:

**Approach 1 - Absolute scoring:** Read each essay and assign a score from 1-10. This requires a detailed rubric, and it is hard to stay consistent across essays.

**Approach 2 - Relative scoring:** Read ALL essays, then rank them. "This one is above average. This one is below average." No rubric needed -- just compare within the group.

GRPO uses Approach 2. Let us see why this is powerful.

In [None]:
# Simulate absolute scoring (like a critic network)
# The critic tries to predict exact value for each response
np.random.seed(42)

# True quality of 8 responses (unknown in practice)
true_quality = np.array([3.2, 7.1, 2.0, 8.5, 5.0, 6.3, 1.5, 4.8])

# Critic's predictions (noisy -- the critic is imperfect)
critic_noise = np.random.normal(0, 2.0, size=8)
critic_predictions = true_quality + critic_noise

# PPO advantages: quality - critic_prediction
ppo_advantages = true_quality - critic_predictions

print("=== PPO (Critic-Based) Advantages ===")
print(f"True quality:       {true_quality}")
print(f"Critic predictions: {np.round(critic_predictions, 2)}")
print(f"PPO advantages:     {np.round(ppo_advantages, 2)}")
print(f"\nNotice how noisy the critic makes the advantages!")

In [None]:
# Now simulate GRPO: just normalize rewards within the group
# No critic needed!

rewards = true_quality  # Direct rewards (or reward model scores)
mean_r = rewards.mean()
std_r = rewards.std()

grpo_advantages = (rewards - mean_r) / std_r

print("=== GRPO (Group-Relative) Advantages ===")
print(f"Rewards:           {rewards}")
print(f"Group mean:        {mean_r:.2f}")
print(f"Group std:         {std_r:.2f}")
print(f"GRPO advantages:   {np.round(grpo_advantages, 2)}")
print(f"\nNo noise from a critic -- just clean normalization!")

In [None]:
# Visualize the comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

colors_ppo = ['green' if a > 0 else 'red' for a in ppo_advantages]
colors_grpo = ['green' if a > 0 else 'red' for a in grpo_advantages]

axes[0].bar(range(8), ppo_advantages, color=colors_ppo, alpha=0.7, edgecolor='black')
axes[0].set_title("PPO Advantages (Critic-Based)", fontsize=14, fontweight='bold')
axes[0].set_xlabel("Response Index")
axes[0].set_ylabel("Advantage")
axes[0].axhline(y=0, color='black', linestyle='--', alpha=0.3)
axes[0].set_xticks(range(8))

axes[1].bar(range(8), grpo_advantages, color=colors_grpo, alpha=0.7, edgecolor='black')
axes[1].set_title("GRPO Advantages (Group-Relative)", fontsize=14, fontweight='bold')
axes[1].set_xlabel("Response Index")
axes[1].set_ylabel("Advantage")
axes[1].axhline(y=0, color='black', linestyle='--', alpha=0.3)
axes[1].set_xticks(range(8))

plt.tight_layout()
plt.savefig("ppo_vs_grpo_advantages.png", dpi=150, bbox_inches='tight')
plt.show()
print("GRPO advantages cleanly separate good from bad responses!")

## 3. The Mathematics

### PPO's Advantage Estimation

In PPO, the advantage for token $t$ is computed using Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l=0}^{T-t} (\gamma \lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$

This requires a learned value function $V(s)$ -- the critic network.

Let us plug in numbers. Suppose $\gamma = 0.99$, $\lambda = 0.95$, $r_t = 1$, $V(s_t) = 5.0$, $V(s_{t+1}) = 5.5$:

$$\delta_t = 1 + 0.99 \times 5.5 - 5.0 = 1 + 5.445 - 5.0 = 1.445$$

The problem: $V(s)$ must be accurate for this to work. If the critic is wrong, the advantages are wrong.
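
The GAE sum can be evaluated with a short backward recursion. The sketch below is only an illustration: the helper name `compute_gae` and the convention that `values` carries one extra bootstrap entry at the end are assumptions made for this example, not part of any particular PPO library.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """
    Illustrative GAE helper (hypothetical name).
    rewards: array of length T
    values:  array of length T + 1 (last entry is the bootstrap value)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later deltas
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Single-step case: the advantage reduces to delta_t from the worked example
gae_single = compute_gae(np.array([1.0]), np.array([5.0, 5.5]))

# Two-step case: a later (negative) delta is discounted by gamma * lambda
gae_two = compute_gae(np.array([1.0, 0.0]), np.array([5.0, 5.5, 0.0]))

print("single step:", gae_single)
print("two steps:  ", gae_two)
```

In the single-step case the result is just $\delta_t = 1.445$. In the two-step case the first advantage also absorbs the next step's error, which shows how every entry depends on the critic's values.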

### GRPO's Group-Relative Advantage

GRPO replaces all of the above with:

$$\hat{A}_i = \frac{r_i - \mu}{\sigma}$$

where $\mu = \frac{1}{G}\sum_{j=1}^G r_j$ and $\sigma = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_j - \mu)^2}$

Let us verify this with numbers. Suppose $G=5$ responses have rewards $\{2.0, 3.5, 1.0, 4.0, 2.5\}$:

- $\mu = (2.0+3.5+1.0+4.0+2.5)/5 = 2.6$
- $\sigma = \sqrt{(0.36+0.81+2.56+1.96+0.01)/5} = \sqrt{1.14} \approx 1.07$
- $\hat{A}_1 = (2.0-2.6)/1.07 = -0.56$
- $\hat{A}_4 = (4.0-2.6)/1.07 = +1.31$ (best in group)

This is exactly what we want -- the best response gets the highest advantage.

In [None]:
def compute_grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """
    Compute group-relative advantages from a batch of rewards.

    Args:
        rewards: Tensor of shape (G,) -- rewards for G completions
    Returns:
        advantages: Tensor of shape (G,) -- normalized advantages
    """
    mean_reward = rewards.mean()
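    # Population std (divide by G), matching the formula for sigma above.
    # torch's default is the sample std, which divides by G - 1.
    std_reward = rewards.std(unbiased=False)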

    # Avoid division by zero
    if std_reward < 1e-8:
        return torch.zeros_like(rewards)

    advantages = (rewards - mean_reward) / std_reward
    return advantages

# Verify with our example
rewards = torch.tensor([2.0, 3.5, 1.0, 4.0, 2.5])
advantages = compute_grpo_advantages(rewards)

print("Rewards:    ", rewards.numpy())
print("Advantages: ", advantages.numpy().round(2))
print(f"\nBest response (idx {advantages.argmax()}): A = {advantages.max():.2f}")
print(f"Worst response (idx {advantages.argmin()}): A = {advantages.min():.2f}")
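
One detail worth checking when turning the formula into code: $\sigma$ as defined above divides by $G$ (the population standard deviation), whereas `torch.Tensor.std` applies Bessel's correction and divides by $G-1$ unless told otherwise. The sketch below compares the two on the worked example. The variable names are chosen for illustration only.

```python
import torch

rewards_demo = torch.tensor([2.0, 3.5, 1.0, 4.0, 2.5])
centered = rewards_demo - rewards_demo.mean()

std_population = rewards_demo.std(unbiased=False)  # divides by G
std_sample = rewards_demo.std()                    # divides by G - 1 (torch default)

adv_population = centered / std_population
adv_sample = centered / std_sample

print("population std:  ", round(std_population.item(), 3))
print("sample std:      ", round(std_sample.item(), 3))
print("A_4 (population):", round(adv_population[3].item(), 2))
print("A_4 (sample):    ", round(adv_sample[3].item(), 2))
```

Only the population version reproduces the hand-computed $\sigma \approx 1.07$ and $\hat{A}_4 \approx +1.31$. Both versions preserve the ranking of responses and differ only by the constant factor $\sqrt{(G-1)/G}$, so the choice matters most for small groups.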

## 4. Let's Build It -- Component by Component

### Component 1: Group Sampling

The first step in GRPO is generating $G$ different completions for the same prompt.

In [None]:
class SimpleLanguageModel(nn.Module):
    """A tiny 'language model' for demonstration purposes."""

    def __init__(self, vocab_size=100, hidden_dim=64, max_len=20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)
        self.vocab_size = vocab_size
        self.max_len = max_len

    def forward(self, x):
        """Returns logits for each position."""
        emb = self.embedding(x)
        out, _ = self.rnn(emb)
        logits = self.head(out)
        return logits

    def generate(self, prompt_ids, max_new_tokens=10, temperature=1.0):
        """Auto-regressively generate tokens."""
        generated = prompt_ids.clone()

        for _ in range(max_new_tokens):
            logits = self.forward(generated)
            next_logits = logits[:, -1, :] / temperature
            probs = F.softmax(next_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            generated = torch.cat([generated, next_token], dim=1)

        return generated

# Create model
torch.manual_seed(42)
model = SimpleLanguageModel(vocab_size=100, hidden_dim=64)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
def sample_group(model, prompt_ids, G=8, max_new_tokens=10, temperature=1.0):
    """
    Sample G completions for the same prompt.
    This is the first step of GRPO.
    """
    completions = []

    with torch.no_grad():
        for i in range(G):
            output = model.generate(
                prompt_ids,
                max_new_tokens=max_new_tokens,
                temperature=temperature
            )
            completions.append(output)

    return completions

# Sample a group of completions
prompt = torch.tensor([[1, 5, 10]])  # A simple prompt
G = 8
completions = sample_group(model, prompt, G=G, max_new_tokens=8)

print(f"Prompt: {prompt[0].tolist()}")
print(f"\nGenerated {G} completions:")
for i, comp in enumerate(completions):
    tokens = comp[0, prompt.shape[1]:].tolist()  # Only new tokens
    print(f"  Response {i+1}: {tokens}")

print(f"\nNotice: each response is different due to sampling!")
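
Sampling one completion at a time is easy to read, but the same group can also be drawn in a single batched pass by repeating the prompt $G$ times. The sketch below is one possible way to do this. `TinyLM` is a hypothetical stand-in that mirrors the small model above so that the block runs on its own, and `sample_group_batched` is an illustrative name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Hypothetical stand-in model, used only to keep this example self-contained."""
    def __init__(self, vocab_size=100, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        out, _ = self.rnn(self.embedding(x))
        return self.head(out)  # (batch, seq_len, vocab_size)

def sample_group_batched(lm, prompt_ids, G=8, max_new_tokens=8, temperature=1.0):
    """Draw G completions at once by treating the group as a batch."""
    generated = prompt_ids.repeat(G, 1)  # (G, prompt_len)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            next_logits = lm(generated)[:, -1, :] / temperature
            probs = F.softmax(next_logits, dim=-1)
            # multinomial samples independently for each row of the batch
            next_token = torch.multinomial(probs, num_samples=1)  # (G, 1)
            generated = torch.cat([generated, next_token], dim=1)
    return generated  # (G, prompt_len + max_new_tokens)

torch.manual_seed(0)
lm_demo = TinyLM()
prompt_demo = torch.tensor([[1, 5, 10]])
group = sample_group_batched(lm_demo, prompt_demo, G=8, max_new_tokens=8)
print(group.shape)
```

The result is a single tensor with one row per completion rather than a Python list, which is usually more convenient for scoring and for computing log-probabilities later.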

### Component 2: Reward Scoring

In practice, rewards come from a reward model or verifiable rules. Here we will use a simple rule-based reward.

In [None]:
def simple_reward_function(completion_ids, target_token=42, prompt_len=3):
    """
    Simple reward: fraction of generated tokens equal to the target token.
    In real GRPO, this could be:
    - A reward model score
    - Binary correctness (math problems)
    - Code test pass rate
    """
    new_tokens = completion_ids[0, prompt_len:]  # Skip prompt tokens
    count = (new_tokens == target_token).sum().item()
    return float(count) / len(new_tokens)

# Score all completions
rewards = torch.tensor([
    simple_reward_function(comp)
    for comp in completions
])

print("Rewards for each completion:")
for i, (comp, r) in enumerate(zip(completions, rewards)):
    tokens = comp[0, 3:].tolist()
    print(f"  Response {i+1}: reward = {r:.3f}  tokens = {tokens}")
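
To make the "binary correctness" case concrete, here is a minimal sketch of a verifiable reward on text answers. Everything in it is made up for illustration: the function `binary_correctness_reward`, the exact-match rule, and the example answers are assumptions, not a real grader.

```python
import torch

def binary_correctness_reward(answer: str, reference: str) -> float:
    """Hypothetical verifiable reward: 1.0 for an exact match, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

group_answers = ["12", "15", " 14 ", "21"]  # made-up answers from one group
reference_answer = "14"

binary_rewards = torch.tensor(
    [binary_correctness_reward(a, reference_answer) for a in group_answers]
)

# Group-relative advantages using the population std (divide by G)
binary_adv = (binary_rewards - binary_rewards.mean()) / binary_rewards.std(unbiased=False)

print("Rewards:   ", binary_rewards.tolist())
print("Advantages:", [round(a, 3) for a in binary_adv.tolist()])
```

With one correct answer out of four, the correct response receives a large positive advantage and the three incorrect ones share a small negative one, even though the raw rewards are only 0 or 1.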

### Component 3: Putting It Together -- Group-Relative Advantage Pipeline

In [None]:
def grpo_advantage_pipeline(model, prompt_ids, reward_fn, G=8, max_new_tokens=10):
    """
    Complete GRPO advantage computation pipeline.

    1. Sample G completions
    2. Score each with reward function
    3. Normalize rewards within the group
    """
    # Step 1: Sample group
    completions = sample_group(model, prompt_ids, G=G, max_new_tokens=max_new_tokens)

    # Step 2: Score each
    rewards = torch.tensor([reward_fn(comp) for comp in completions])

    # Step 3: Compute group-relative advantages
    advantages = compute_grpo_advantages(rewards)

    return completions, rewards, advantages

# Run the pipeline
completions, rewards, advantages = grpo_advantage_pipeline(
    model, prompt, simple_reward_function, G=8
)

print("=== GRPO Advantage Pipeline Results ===")
print(f"{'Response':<10} {'Reward':<10} {'Advantage':<12} {'Action'}")
print("-" * 50)
for i in range(len(rewards)):
    action = "REINFORCE" if advantages[i] > 0 else "PENALIZE" if advantages[i] < -0.5 else "NEUTRAL"
    print(f"{i+1:<10} {rewards[i]:<10.3f} {advantages[i]:<12.3f} {action}")
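
The pipeline above handles one prompt. With several prompts, each prompt forms its own group, so normalization runs along the group dimension only. The sketch below assumes rewards are stacked into a `(num_prompts, G)` tensor. The function name and the `eps` term in the denominator (used here in place of the explicit zero-std check) are illustrative choices.

```python
import torch

def compute_grpo_advantages_batched(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """
    Illustrative batched variant (hypothetical name).
    rewards: (num_prompts, G) -- one row per prompt, one column per completion
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, unbiased=False, keepdim=True)  # population std per group
    # eps keeps rows with identical rewards at exactly zero instead of NaN
    return (rewards - mean) / (std + eps)

batch_rewards = torch.tensor([
    [2.0, 3.5, 1.0, 4.0],      # ordinary group
    [0.0, 0.0, 0.0, 0.0],      # every completion scored the same
    [10.0, 30.0, 20.0, 40.0],  # same ordering idea, different reward scale
])
batch_adv = compute_grpo_advantages_batched(batch_rewards)
print(batch_adv.round(decimals=3))
```

Each row is normalized independently, so a prompt with large raw rewards does not dominate a prompt with small ones, and a group with identical rewards contributes no learning signal.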

## 5. Your Turn

### TODO 1: Implement Advantage Computation with Clipping

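The basic GRPO advantage is $(r_i - \mu) / \sigma$, but what happens when all rewards are very similar? Dividing by a small $\sigma$ amplifies tiny reward differences (possibly just noise) into advantages of order 1, and a single outlier in a group of size $G$ can receive an advantage as large as $\sqrt{G-1}$. Implement a version that clips the advantages.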

In [None]:
def compute_clipped_advantages(
    rewards: torch.Tensor,
    max_advantage: float = 3.0
) -> torch.Tensor:
    """
    Compute group-relative advantages with clipping.

    Args:
        rewards: Tensor of shape (G,) -- rewards for G completions
        max_advantage: Maximum absolute advantage value
    Returns:
        advantages: Tensor of shape (G,) -- clipped normalized advantages

    TODO: Implement this function.
    Hints:
    1. Compute mean and std of rewards
    2. Normalize: (rewards - mean) / std
    3. Clip advantages to [-max_advantage, max_advantage]
    4. Handle the case where std is very small (< 1e-8)
    """
    # YOUR CODE HERE (reference solution shown below)
    mean_reward = rewards.mean()
    # Population std (divide by G), matching the GRPO formula
    std_reward = rewards.std(unbiased=False)

    # Degenerate group: all rewards (almost) identical, so there is no signal
    if std_reward < 1e-8:
        return torch.zeros_like(rewards)

    advantages = (rewards - mean_reward) / std_reward
    advantages = torch.clamp(advantages, -max_advantage, max_advantage)

    return advantages

# Test your implementation
test_rewards = torch.tensor([1.0, 1.01, 1.02, 1.0, 0.99])  # Very similar rewards
clipped_adv = compute_clipped_advantages(test_rewards, max_advantage=2.0)
print("Similar rewards:", test_rewards.numpy())
print("Clipped advantages:", clipped_adv.numpy().round(3))
assert torch.all(torch.abs(clipped_adv) <= 2.0), "Advantages should be clipped!"
print("Passed!")
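
To see when clipping actually changes something, it helps to look at the most extreme case. With the population standard deviation, a z-score in a group of size $G$ cannot exceed $\sqrt{G-1}$ in magnitude, and that bound is reached when exactly one reward differs from all the others. The sketch below builds such a group. The group size of 17 and the clip value of 3.0 are arbitrary choices for illustration.

```python
import math
import torch

G_demo = 17
outlier_rewards = torch.zeros(G_demo)
outlier_rewards[0] = 1.0  # a single completion succeeds, the rest score zero

z = (outlier_rewards - outlier_rewards.mean()) / outlier_rewards.std(unbiased=False)
z_clipped = torch.clamp(z, -3.0, 3.0)

print("Upper bound sqrt(G - 1):", math.sqrt(G_demo - 1))
print("Largest raw advantage:  ", round(z.max().item(), 3))
print("Largest after clipping: ", round(z_clipped.max().item(), 3))
```

For small groups the bound is already below typical clip values (for $G=5$ it is 2), so clipping only becomes active for larger groups with rare outliers.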

### TODO 2: Experiment with Group Size G

How does the group size G affect the quality of advantage estimates?

In [None]:
def experiment_group_size(model, prompt_ids, reward_fn, G_values=(2, 4, 8, 16, 32)):
    """
    TODO: For each group size G, run the GRPO advantage pipeline
    multiple times and compute the variance of the advantages.

    Hints:
    1. For each G in G_values, run grpo_advantage_pipeline N=20 times
    2. Collect the advantage of the BEST response each time
    3. Compute mean and std of these best advantages
    4. Plot: X=G, Y=mean best advantage, error bars = std

    Expected result: larger G gives more stable advantages.
    """
    results = {}
    N_trials = 20

    for G in G_values:
        best_advantages = []
        for _ in range(N_trials):
            _, rewards, advantages = grpo_advantage_pipeline(
                model, prompt_ids, reward_fn, G=G
            )
            best_advantages.append(advantages.max().item())

        results[G] = {
            'mean': np.mean(best_advantages),
            'std': np.std(best_advantages)
        }
        print(f"G={G:3d}: mean best advantage = {results[G]['mean']:.3f} +/- {results[G]['std']:.3f}")

    # Plot
    G_vals = list(results.keys())
    means = [results[g]['mean'] for g in G_vals]
    stds = [results[g]['std'] for g in G_vals]

    plt.figure(figsize=(8, 5))
    plt.errorbar(G_vals, means, yerr=stds, marker='o', capsize=5, linewidth=2)
    plt.xlabel("Group Size G", fontsize=12)
    plt.ylabel("Best Advantage (mean +/- std)", fontsize=12)
    plt.title("Effect of Group Size on Advantage Stability", fontsize=14)
    plt.grid(True, alpha=0.3)
    plt.savefig("group_size_effect.png", dpi=150, bbox_inches='tight')
    plt.show()

experiment_group_size(model, prompt, simple_reward_function)
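
One way to see why larger groups help is to look at the baseline itself: the group mean $\mu$ is an estimate, and its sampling noise shrinks roughly as $1/\sqrt{G}$. The sketch below checks this on synthetic rewards drawn from a standard normal distribution, which is an assumption made purely for illustration and not the reward distribution of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 2000
baseline_spread = {}

for G_size in (2, 8, 32):
    # Synthetic rewards ~ N(0, 1): a stand-in for real reward scores
    groups = rng.normal(0.0, 1.0, size=(n_trials, G_size))
    # Spread of the group mean across trials = noise in the baseline
    baseline_spread[G_size] = groups.mean(axis=1).std()
    print(f"G={G_size:2d}: std of group mean = {baseline_spread[G_size]:.3f} "
          f"(1/sqrt(G) = {1 / np.sqrt(G_size):.3f})")
```

Going from $G=2$ to $G=8$ halves the noise in the baseline, and going from 8 to 32 halves it again, which is the diminishing-returns pattern: each further halving costs four times as many samples.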

## 6. Putting It All Together

Now let us combine everything into a visualization that shows the full GRPO advantage computation pipeline.

In [None]:
def visualize_grpo_pipeline(rewards, title="GRPO Advantage Computation"):
    """Visualize the complete GRPO pipeline for a single group."""
    advantages = compute_grpo_advantages(rewards)
    mean_r = rewards.mean().item()
    std_r = rewards.std(unbiased=False).item()  # population std, as in the GRPO formula
    G = len(rewards)

    fig, axes = plt.subplots(1, 3, figsize=(16, 5))

    # Panel 1: Raw rewards
    colors = plt.cm.viridis(np.linspace(0.3, 0.9, G))
    axes[0].bar(range(G), rewards.numpy(), color=colors, edgecolor='black')
    axes[0].axhline(y=mean_r, color='red', linestyle='--', linewidth=2, label=f'Mean = {mean_r:.2f}')
    axes[0].set_title("Step 1: Raw Rewards", fontsize=13, fontweight='bold')
    axes[0].set_xlabel("Response Index")
    axes[0].set_ylabel("Reward")
    axes[0].legend()

    # Panel 2: Centered (subtract mean)
    centered = (rewards - mean_r).numpy()
    colors_centered = ['green' if c > 0 else 'red' for c in centered]
    axes[1].bar(range(G), centered, color=colors_centered, alpha=0.7, edgecolor='black')
    axes[1].axhline(y=0, color='black', linestyle='-', alpha=0.3)
    axes[1].set_title(f"Step 2: Subtract Mean ({mean_r:.2f})", fontsize=13, fontweight='bold')
    axes[1].set_xlabel("Response Index")
    axes[1].set_ylabel("Centered Reward")

    # Panel 3: Final advantages (divide by std)
    colors_adv = ['green' if a > 0 else 'red' for a in advantages.numpy()]
    axes[2].bar(range(G), advantages.numpy(), color=colors_adv, alpha=0.7, edgecolor='black')
    axes[2].axhline(y=0, color='black', linestyle='-', alpha=0.3)
    axes[2].set_title(f"Step 3: Divide by Std ({std_r:.2f})", fontsize=13, fontweight='bold')
    axes[2].set_xlabel("Response Index")
    axes[2].set_ylabel("GRPO Advantage")

    plt.suptitle(title, fontsize=15, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.savefig("grpo_pipeline_visualization.png", dpi=150, bbox_inches='tight')
    plt.show()

    return advantages

# Demonstrate with example rewards
example_rewards = torch.tensor([2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 1.5, 3.8])
advantages = visualize_grpo_pipeline(example_rewards)

## 7. Training and Results

Let us now see how group-relative advantages compare to raw, unnormalized rewards in a simplified training simulation.

In [None]:
def simulate_training(n_steps=200, G=8, use_grpo=True):
    """
    Simulate GRPO training on a simple optimization problem.

    The 'policy' adjusts a parameter theta to maximize reward.
    Reward = -|theta - 3.0| (optimal at theta=3.0)
    """
    theta = torch.tensor([0.0], requires_grad=True)
    optimizer = torch.optim.SGD([theta], lr=0.05)

    history = []

    for step in range(n_steps):
        # Sample G 'completions' (here: theta + noise)
        noise = torch.randn(G) * 0.5
        samples = theta.item() + noise

        # Compute rewards
        rewards = -torch.abs(samples - 3.0)

        if use_grpo:
            # GRPO: normalize within group
            advantages = compute_grpo_advantages(rewards)
        else:
            # No normalization (vanilla PG)
            advantages = rewards

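        # Compute surrogate loss
        # Higher advantage = push theta toward that sample.
        # weighted_direction does not depend on theta (samples use theta.item()),
        # so multiply by theta to get a loss whose gradient is -weighted_direction.
        weighted_direction = (advantages * noise).mean()
        loss = -(weighted_direction * theta).sum()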

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        history.append(theta.item())

    return history

# Compare GRPO vs Vanilla
torch.manual_seed(42)
history_grpo = simulate_training(n_steps=200, G=8, use_grpo=True)

torch.manual_seed(42)
history_vanilla = simulate_training(n_steps=200, G=8, use_grpo=False)

plt.figure(figsize=(10, 5))
plt.plot(history_grpo, label='GRPO (Group-Relative)', linewidth=2, color='blue')
plt.plot(history_vanilla, label='Vanilla PG (No Normalization)', linewidth=2, color='orange')
plt.axhline(y=3.0, color='green', linestyle='--', linewidth=1, label='Optimal (theta=3.0)')
plt.xlabel("Training Step", fontsize=12)
plt.ylabel("Parameter Value (theta)", fontsize=12)
plt.title("GRPO vs Vanilla Policy Gradient", fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.savefig("grpo_vs_vanilla_training.png", dpi=150, bbox_inches='tight')
plt.show()

print(f"GRPO final theta:   {history_grpo[-1]:.3f} (target: 3.000)")
print(f"Vanilla final theta: {history_vanilla[-1]:.3f} (target: 3.000)")
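
The update direction `mean(advantage * noise)` used in the simulation has a standard interpretation. If the "policy" is a Gaussian with mean `theta` and fixed standard deviation `sigma`, the score function is $\nabla_\theta \log \pi(x) = (x - \theta)/\sigma^2$, so the policy gradient is the same direction scaled by $1/\sigma^2$. The sketch below checks this with autograd on one group. The Gaussian policy and all variable names are assumptions made for this illustration.

```python
import torch

torch.manual_seed(0)
sigma = 0.5
theta_demo = torch.tensor([0.0], requires_grad=True)

# One group of G = 8 samples from the Gaussian policy N(theta, sigma^2)
noise_demo = torch.randn(8) * sigma
samples_demo = (theta_demo + noise_demo).detach()  # samples are treated as data

rewards_demo = -torch.abs(samples_demo - 3.0)
adv_demo = (rewards_demo - rewards_demo.mean()) / rewards_demo.std(unbiased=False)

# Gaussian log-density up to an additive constant that does not depend on theta
log_prob = -((samples_demo - theta_demo) ** 2) / (2 * sigma ** 2)
surrogate_loss = -(adv_demo * log_prob).mean()
surrogate_loss.backward()

manual_grad = -(adv_demo * noise_demo).mean() / sigma ** 2
print("autograd gradient:", theta_demo.grad.item())
print("manual gradient:  ", manual_grad.item())
```

The two numbers agree, so the shortcut in the simulation is the score-function gradient with the constant $1/\sigma^2$ folded into the learning rate.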

## 8. Final Output

In [None]:
# Summary visualization: The GRPO advantage computation is simple but powerful
print("=" * 60)
print("GRPO Advantage Computation Summary")
print("=" * 60)
print()
print("1. Sample G completions for the same prompt")
print("2. Score each with a reward function")
print("3. Normalize: A_i = (r_i - mean(r)) / std(r)")
print()
print("That's it! No critic network. No GAE. No value function.")
print()
print("Key properties:")
print("  - Self-calibrating (works for any reward scale)")
print("  - Memory efficient (no critic network)")
print("  - Larger G = more stable advantages")
print("  - Typically G = 4 to 64 in practice")
print("=" * 60)
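
The "self-calibrating" property can be checked directly: z-score normalization is unchanged when every reward is multiplied by a positive constant and shifted by an offset. The sketch below uses an arbitrary rescaling (factor 100, offset -7) chosen only for illustration.

```python
import numpy as np

def zscore(r):
    """Group-relative advantage with population std (NumPy's default)."""
    return (r - r.mean()) / r.std()

base_rewards = np.array([2.0, 3.5, 1.0, 4.0, 2.5])
rescaled_rewards = 100.0 * base_rewards - 7.0  # same ordering, very different scale

adv_base = zscore(base_rewards)
adv_rescaled = zscore(rescaled_rewards)

print("Original scale: ", np.round(adv_base, 3))
print("Rescaled scale: ", np.round(adv_rescaled, 3))
```

The invariance holds for positive scale factors only. Multiplying rewards by a negative number flips the sign of every advantage, as it should.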

## 9. Reflection and Next Steps

**Key takeaways from this notebook:**

1. PPO requires a critic network (value function) to compute advantages. The critic is typically as large as the policy, so it roughly doubles the number of trainable parameters held in memory.
2. GRPO eliminates the critic by normalizing rewards within a group of sampled responses.
3. The group-relative advantage formula is $\hat{A}_i = (r_i - \mu) / \sigma$ -- simple z-score normalization.
4. Larger group sizes produce more stable advantage estimates, but with diminishing returns.

**Reflection questions:**
- Why does normalizing by the group standard deviation help compared to just subtracting the mean?
- What happens if all G responses get the same reward? Is the GRPO advantage well-defined?
- How might the choice of G interact with the diversity of the policy's outputs?

**Next notebook:** We will implement the full GRPO loss function including the clipped surrogate objective and KL penalty, and train a small language model using GRPO.