# Group-Relative Policy Optimization (GRPO) -- Notebook Index

Welcome to the GRPO notebook series! These notebooks walk you through understanding, implementing, and training with Group-Relative Policy Optimization, the reinforcement learning algorithm introduced in the DeepSeekMath paper and used to train DeepSeek-R1.

## Learning Path

### Notebook 1: Understanding GRPO
**From PPO to Group-Relative Advantages**

- Why PPO's critic is expensive for LLMs
- The group-relative advantage formula
- Implementing advantage computation from scratch
- Visualizing PPO vs GRPO advantages
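
As a preview of the group-relative advantage idea, here is a minimal, dependency-free sketch. It assumes the common formulation in which each reward is normalized by the mean and standard deviation of its own group of G completions. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions, not necessarily what Notebook 1 uses, and the notebook itself works with PyTorch tensors rather than plain lists.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds the scalar rewards of the G completions sampled for a
    single prompt. `eps` (an assumed stabilizer) avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    # Population std is used here; some implementations use the sample std.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# G = 4 completions for one prompt, scored with binary rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

No critic network appears anywhere in this computation: the group mean plays the role of the baseline, which is what removes the need for PPO's learned value function.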

### Notebook 2: Implementing the GRPO Loss Function
**Clipped Surrogate + KL Penalty**

- The three components: clipping, advantages, KL penalty
- Building each component step by step
- Numerical verification of the KL divergence formula
- Testing with synthetic data
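
To make the three components concrete, the following sketch computes the loss for a single token from scalar log-probabilities. It assumes the KL estimator given in the DeepSeekMath paper, `pi_ref/pi_theta - log(pi_ref/pi_theta) - 1`, which is always non-negative. The function name `grpo_token_loss` and the default values for `epsilon` and `beta` are illustrative placeholders; Notebook 2 may structure and parameterize this differently, and it operates on batched PyTorch tensors.

```python
import math


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    epsilon=0.2, beta=0.04):
    """Per-token GRPO loss: clipped surrogate minus a KL penalty.

    All log-probabilities refer to the same sampled token under the
    current policy (new), the sampling policy (old), and the frozen
    reference policy (ref). Default epsilon/beta are assumed values.
    """
    # Importance ratio between the current and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio to [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Pessimistic (minimum) surrogate, as in PPO.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # KL estimate: r - log(r) - 1 with r = pi_ref / pi_new.
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    # Negate because optimizers minimize while the objective is maximized.
    return -(surrogate - beta * kl)


# Identical policies: ratio is 1 and the KL term vanishes.
loss_identical = grpo_token_loss(-1.0, -1.0, -1.0, advantage=1.0)
# Ratio of 2 with a positive advantage is clipped to 1 + epsilon.
loss_clipped = grpo_token_loss(math.log(0.5), math.log(0.25), math.log(0.5),
                               advantage=1.0)
```

In a full implementation this per-token value would be averaged over the tokens of each completion and then over the group.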

### Notebook 3: Training a Reasoning Model with GRPO
**End-to-End GRPO Training Pipeline**

- Character-level arithmetic as a testbed
- Binary verifiable rewards (no reward model needed)
- Full GRPO training loop
- Training curves and evaluation
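
A binary verifiable reward can be as simple as checking the model's answer against the computed result. The sketch below assumes a hypothetical prompt format such as `"12+7="` and a hypothetical function name `arithmetic_reward`; the task format actually used in Notebook 3 may differ.

```python
def arithmetic_reward(prompt, completion):
    """Return 1.0 if `completion` is the correct sum for `prompt`, else 0.0.

    Assumes prompts of the form "<int>+<int>=" (a hypothetical format).
    The reward is computed by a rule, so no learned reward model is needed.
    """
    left, right = prompt.rstrip("=").split("+")
    expected = str(int(left) + int(right))
    return 1.0 if completion.strip() == expected else 0.0


# Score a group of sampled completions for the same prompt.
group_rewards = [arithmetic_reward("12+7=", c) for c in ["19", "20", " 19", "1 9"]]
print(group_rewards)
```

Rewards like these feed directly into the group-relative advantage computation: completions that answer correctly receive a positive advantage whenever the group contains a mix of right and wrong answers.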

## Prerequisites

- Basic understanding of neural networks and PyTorch
- Familiarity with language models (tokenization, generation)
- Some exposure to reinforcement learning concepts (rewards, policies)

## How to Use These Notebooks

1. Run each notebook in order (1 -> 2 -> 3)
2. Complete the TODO exercises to deepen your understanding
3. Experiment with the hyperparameters (G, epsilon, beta)
4. Each notebook is self-contained and runs on Google Colab with a T4 GPU

```python
print("Ready to start learning GRPO!")
print()
print("Notebook 1: Understanding GRPO")
print("  -> Open 01_understanding_grpo.ipynb")
print()
print("Notebook 2: Implementing the GRPO Loss")
print("  -> Open 02_implementing_grpo_loss.ipynb")
print()
print("Notebook 3: Training with GRPO")
print("  -> Open 03_training_reasoning_with_grpo.ipynb")
```