Flash-GRPO is a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. It addresses two critical challenges. First, iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, which decouples policy performance from timestep difficulty. Second, temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on models from 1.3B to 14B parameters validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
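The two ideas in the summary can be illustrated with a minimal sketch. This is not the repository's implementation: the function names, the group size, and especially `scale_fn` are hypothetical placeholders. The sketch assumes that iso-temporal grouping means every sample generated for a given prompt is trained at the same timestep, and that rectification amounts to dividing out a time-dependent factor whose actual form is defined in the paper and is not reproduced here.

```python
import random
import statistics


def sample_iso_temporal_timesteps(prompts, group_size, num_timesteps, rng):
    """Draw ONE timestep per prompt and share it across that prompt's group.

    Assumption: this is what "prompt-wise temporal consistency" refers to.
    Rewards inside a group are then compared at equal timestep difficulty.
    """
    assignments = []
    for prompt in prompts:
        t = rng.randrange(num_timesteps)  # single draw per prompt
        assignments.extend((prompt, t) for _ in range(group_size))
    return assignments


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalisation of rewards within a single group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def rectification_weight(t, scale_fn):
    """Weight that cancels a time-dependent gradient scale.

    `scale_fn` is a hypothetical stand-in supplied by the caller; the real
    factor used by Flash-GRPO is not specified in this README.
    """
    return 1.0 / scale_fn(t)


if __name__ == "__main__":
    rng = random.Random(0)
    pairs = sample_iso_temporal_timesteps(["a cat", "a dog"], 4, 1000, rng)
    print(pairs)
    print(group_relative_advantages([0.2, 0.5, 0.1, 0.9]))
```

Because all samples of a prompt share one timestep, differences in their rewards reflect the policy rather than how hard that timestep is, which is the property the summary attributes to iso-temporal grouping.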
Ideas and contributions are welcome. Stay tuned!
We have presented a single-step training framework, Flash-GRPO.
- [2026-05-11] We released the code for our paper. An 8-GPU version of Flash-GRPO, which achieves the same performance and needs only ~40 hours, will follow. 🔥🔥🔥
- [2026-05-28] We have released an 8-GPU (~40 hours) version of Flash-GRPO (the reward curve is shown below)!
Download the reward model HPSV3 and base model Wan2.1-1.3B.
cd flow_grpo/reward-server
gunicorn "app_hpsv3:create_app()"
# Flash-GRPO 96GPUs
bash scripts/multi_node/train_wan2_1_flash.sh
# Flash-GRPO 8GPUs
bash scripts/multi_node/train_wan2_1_flash_1node.sh
- For more details, please read our paper.
Flow-GRPO: The first method integrating online reinforcement learning (RL) into flow matching models.





