A tiny single-file implementation of Group Relative Policy Optimization (GRPO) as introduced by the DeepSeekMath paper [1][2][3].
🆕 microGRPO now implements the GRPO improvements introduced by Dr. GRPO [4], Apple's LOOP [5], and Mistral's Magistral [6] (sketched in NumPy after the feature list below):
- 💥 Remove per-group advantage normalization [4]
- ⛳️ Leave-one-out advantage [5] (LOOP only)
- 🔥 Eliminate KL divergence [5]
- 🎢 Normalize loss [5]
- 🏆 Add per-batch advantage normalization [6] (Magistral only)
- 🚦 Relax trust region bounds [5]
- 🌈 Eliminate non-diverse groups [5]
- 🐭 Only ~300 lines of code
- 📦 In pure NumPy, with autograd to compute the gradient
- ✅ Type annotated and linted
- ✂️ Easily swap out the default game and train on any other game or environment
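
Concretely, these tweaks boil down to how advantages are computed and how the clipped surrogate is aggregated. The snippet below is a minimal, illustrative NumPy sketch of that logic, not the code in `microgrpo.py`; the function names, clipping bounds, and example rewards are assumptions:

```python
import numpy as np

def leave_one_out_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantage (LOOP): each rollout's reward minus the mean reward
    of the *other* rollouts in its group, with no per-group std normalization (Dr. GRPO)."""
    group_size = rewards.shape[-1]
    others_mean = (rewards.sum(axis=-1, keepdims=True) - rewards) / (group_size - 1)
    return rewards - others_mean

def batch_normalize(advantages: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-batch advantage normalization (Magistral): standardize across the whole batch."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

def clipped_surrogate(ratio: np.ndarray, advantages: np.ndarray,
                      eps_low: float = 0.2, eps_high: float = 0.3) -> np.ndarray:
    """PPO-style clipped surrogate with relaxed (asymmetric) trust-region bounds
    and no KL penalty; returns the per-step objective to maximize."""
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantages, clipped * advantages)

# Example with 2 groups of 4 rollouts each; the second group is non-diverse.
rewards = np.array([[0.1, 0.5, 0.5, 0.9],
                    [1.0, 1.0, 1.0, 1.0]])
diverse = rewards.std(axis=-1) > 0          # drop groups whose rewards are all equal
advantages = batch_normalize(leave_one_out_advantages(rewards[diverse]))
ratios = np.full_like(advantages, 1.05)     # placeholder for π_θ(a|s) / π_old(a|s)
loss = -clipped_surrogate(ratios, advantages).mean()  # normalize by averaging over all steps
print(loss)
```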
> [!NOTE]
> You'll need to install uv to run the commands below.
To start teaching a policy to play a simplified version of Battleship, run:
```sh
uv run microgrpo.py
```
You should see the policy improve its average score from around 15% to about 50% over 2000 iterations.
The file is structured into five sections:
- 🕹️ Game (~50 lines): An implementation of the Battleship board game
- 🌍 Environment (~60 lines): The API with which an agent can interact with the game (a minimal interface sketch follows this list)
- 🧠 Policy (~30 lines): A model that produces action probabilities given the observed environment state
- 🎯 GRPO (~80 lines): The GRPO objective function and training data generator
- ⚡ Train (~50 lines): The loop that collects training data and optimizes the GRPO objective with AdamW
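
The Environment section is what you would replace to train on a different game. The protocol below is a hypothetical sketch of what such an interface could look like, together with a toy environment that satisfies it; the method names, signatures, and the `CoinFlipEnv` example are assumptions rather than the actual `BattleshipEnv` API:

```python
from typing import Protocol

import numpy as np

class Environment(Protocol):
    """Hypothetical environment interface; the real BattleshipEnv API may differ."""

    def reset(self) -> np.ndarray:
        """Start a new episode and return the initial observation."""
        ...

    def step(self, action: int) -> tuple[np.ndarray, float, bool]:
        """Apply an action and return (next observation, reward, episode done)."""
        ...

class CoinFlipEnv:
    """Toy single-step environment that satisfies the protocol: guess a coin flip."""

    def __init__(self, rng: np.random.Generator | None = None) -> None:
        self.rng = rng or np.random.default_rng()
        self.coin = 0

    def reset(self) -> np.ndarray:
        self.coin = int(self.rng.integers(2))
        return np.zeros(1)  # nothing is observable before guessing

    def step(self, action: int) -> tuple[np.ndarray, float, bool]:
        reward = 1.0 if action == self.coin else 0.0
        return np.zeros(1), reward, True  # episode ends after one guess
```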
Starting a training run only requires defining a `GRPOConfig` with your choice of environment (here, `BattleshipEnv`) and a function that evaluates the policy model given its parameters (here, `neural_battleship_policy`):
```python
# Define the environment and the policy model to optimize.
grpo_config = GRPOConfig(environment=BattleshipEnv, policy=neural_battleship_policy)

# Train the policy model by maximizing the GRPO objective with AdamW.
θ_star, rewards_val = train_grpo(θ_init := neural_battleship_policy_init(), grpo_config)
```
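
Under the hood, training is plain gradient ascent on the GRPO objective, with the autograd package differentiating through the pure-NumPy policy and an AdamW update applied to the parameters. The loop below is a generic illustration with a toy objective standing in for the GRPO surrogate; the hyperparameter values and function names are assumptions, not `train_grpo`'s internals:

```python
import autograd.numpy as anp  # NumPy drop-in that autograd can differentiate through
from autograd import grad

def adamw_step(theta, g, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    """One AdamW update, written for gradient *ascent* on the objective."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    theta = theta + lr * m_hat / (anp.sqrt(v_hat) + eps) - lr * wd * theta
    return theta, m, v

# Toy stand-in for the GRPO objective: maximized at theta ≈ 1.
objective = lambda theta: -anp.sum((theta - 1.0) ** 2)
objective_grad = grad(objective)  # autograd computes ∇_θ objective(θ)

theta = anp.zeros(4)
m, v = anp.zeros_like(theta), anp.zeros_like(theta)
for t in range(1, 501):
    theta, m, v = adamw_step(theta, objective_grad(theta), m, v, t)
print(theta)  # ≈ 1, pulled slightly toward 0 by the decoupled weight decay
```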