microGRPO

A tiny single-file implementation of Group Relative Policy Optimization (GRPO) as introduced by the DeepSeekMath paper [1][2][3].

🆕 microGRPO now implements the GRPO improvements introduced by Dr. GRPO [4], Apple's LOOP [5], and Mistral's Magistral [6]:

  1. 💥 Remove per-group advantage normalization [4]
  2. ⛳️ Leave-one-out advantage [5] (LOOP only; see the sketch after this list)
  3. 🔥 Eliminate KL divergence [5]
  4. 🎢 Normalize loss [5]
  5. 🏆 Add per-batch advantage normalization [6] (Magistral only)
  6. 🚦 Relax trust region bounds [5]
  7. 🌈 Eliminate non-diverse groups [5]
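
To make items 2 and 5 concrete, here is a minimal NumPy sketch of a leave-one-out advantage combined with per-batch normalization. The function names, array shapes, and exact form of the normalization are illustrative assumptions, not the implementation in microgrpo.py:

import numpy as np

def leave_one_out_advantages(rewards: np.ndarray) -> np.ndarray:
    """Baseline each rollout against the mean reward of the *other* rollouts in its group.

    `rewards` has shape (num_groups, group_size); assumes group_size > 1.
    """
    _, group_size = rewards.shape
    group_sums = rewards.sum(axis=1, keepdims=True)
    loo_baseline = (group_sums - rewards) / (group_size - 1)  # Mean of the other group members.
    return rewards - loo_baseline

def batch_normalize(advantages: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-batch advantage normalization: scale by the standard deviation across the whole batch.

    Centering is already handled by the leave-one-out baseline; this is one plausible form of the
    Magistral-style normalization, not necessarily the one used in microgrpo.py.
    """
    return advantages / (advantages.std() + eps)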

Features

  1. 🐭 Only ~300 lines of code
  2. 📦 In pure NumPy, with autograd to compute the gradient (see the sketch after this list)
  3. ✅ Type annotated and linted
  4. ✂️ Easily swap out the default game and train on any other game or environment
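
As a quick illustration of the NumPy + autograd approach, the sketch below differentiates a toy scalar objective; the objective here is a stand-in, not the GRPO objective from microgrpo.py:

import autograd.numpy as np  # autograd's NumPy wrapper records operations for differentiation
from autograd import grad

def toy_objective(theta):
    # Any scalar function composed of autograd.numpy operations can be differentiated.
    return -np.sum((theta - 1.0) ** 2)

toy_objective_grad = grad(toy_objective)  # Returns a function that computes d(objective)/d(theta).
print(toy_objective_grad(np.zeros(3)))    # → [2. 2. 2.]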

Getting started

Note: You'll need to install uv to run the commands below.

To start teaching a policy to play a simplified version of Battleship, run:

uv run microgrpo.py

You should see that the policy learns to improve its average score from around 15% to about 50% over 2000 iterations:

[Figure: Battleship policy trained with GRPO]

Background

File structure

The file is structured into five sections:

  1. 🕹️ Game (~50 lines): An implementation of the Battleship board game
  2. 🌍 Environment (~60 lines): The API with which an agent can interact with the game
  3. 🧠 Policy (~30 lines): A model that produces action probabilities given the observed environment state
  4. 🎯 GRPO (~80 lines): The GRPO objective function and training data generator (a sketch of the objective follows this list)
  5. ⚡ Train (~50 lines): The loop that collects training data and optimizes the GRPO objective with AdamW
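
For intuition about what the GRPO section computes, here is a hedged sketch of the clipped surrogate objective over a batch of sampled actions; the signature, clipping bounds, and loss normalization in microgrpo.py differ in their details:

import numpy as np

def clipped_surrogate(log_probs: np.ndarray, old_log_probs: np.ndarray,
                      advantages: np.ndarray, eps_low: float = 0.2, eps_high: float = 0.2) -> float:
    """PPO-style clipped surrogate: average the pessimistic (clipped) importance-weighted advantage."""
    ratio = np.exp(log_probs - old_log_probs)  # Importance ratio π_θ(a|s) / π_old(a|s).
    clipped_ratio = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return float(np.minimum(ratio * advantages, clipped_ratio * advantages).mean())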

GRPO config

Starting a training run only requires defining a GRPOConfig with your choice of environment (here, BattleshipEnv) and a function that evaluates the policy model given its parameters (here, neural_battleship_policy):

# Define the environment and the policy model to optimize.
grpo_config = GRPOConfig(environment=BattleshipEnv, policy=neural_battleship_policy)

# Train the policy model by maximizing the GRPO objective with AdamW.
θ_star, rewards_val = train_grpo(θ_init := neural_battleship_policy_init(), grpo_config)
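
For a rough picture of the optimizer inside train_grpo, here is a sketch of a single AdamW update written for gradient ascent on the objective; the hyperparameters, state handling, and function name are illustrative assumptions rather than the actual training loop:

import numpy as np

def adamw_ascent_step(theta, grad_theta, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                      eps=1e-8, weight_decay=1e-2):
    """One AdamW step that *ascends* the objective (note the + sign) with decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * grad_theta       # First-moment estimate.
    v = beta2 * v + (1 - beta2) * grad_theta ** 2  # Second-moment estimate.
    m_hat = m / (1 - beta1 ** t)                   # Bias correction.
    v_hat = v / (1 - beta2 ** t)
    theta = theta + lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * theta
    return theta, m, v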

Footnotes

  1. The DeepSeekMath paper

  2. Yuge (Jimmy) Shi's blog post

  3. Nathan Lambert's RLHF book

  4. The Dr. GRPO paper

  5. Apple's LOOP paper

  6. Mistral's Magistral paper
