Algorithm-first post-training framework for large models.
"What I cannot create, I do not understand." — Richard Feynman
FeynRL (pronounced "FineRL") is an algorithm-first framework for post-training and fine-tuning large models. It supports supervised fine-tuning (SFT), preference learning (e.g., DPO), and reinforcement learning (e.g., PPO, GRPO, CISPO, P3O), and is built for researchers and engineers who want to understand, modify, and develop new methods without fighting the infrastructure.
The main goal of FeynRL is simple: make new algorithms easy to implement, easy to debug, and still possible to train at scale. The codebase is designed so that algorithmic logic stays local and systems logic stays explicit, which makes the framework easier to reason about, easier to extend, and more reliable to debug.
FeynRL is a good fit if your goal is not only to run an existing recipe, but to build and test new post-training methods.
-
Algorithm-first design — Most method changes stay local: you can add new objectives, rewards, baselines, or update rules without reshaping the full stack.
-
Clear separation of concerns — Algorithm code stays algorithmic, and systems code stays systems. That keeps the codebase easier to understand, test, and extend.
-
One framework across post-training — SFT, DPO, and RL share the same workflow and configuration system, making comparisons easier and reducing duplicated infrastructure.
-
Scales beyond toy settings — Use the same framework for local single-GPU debugging or large multi-node distributed runs.
FeynRL may not be the best fit if your main priority is the largest built-in feature surface out of the box, or if you mainly want a framework already optimized around a narrow workflow and do not expect to modify it much.
There are already several strong open-source frameworks for post-training large models. Many are powerful and feature-rich, but they are often optimized around a narrower set of methods or execution patterns, and can become hard to modify once you want to try something new.
FeynRL was built to make a different trade-off. Instead of optimizing first for the largest feature surface, it optimizes first for clarity, locality of change, and algorithm development. The codebase is structured so that algorithmic ideas are easy to implement and reason about, while the distributed systems layer remains explicit rather than hidden behind heavy abstractions. In practice, implementing a new algorithm typically means writing a single file with its own loss and update logic, not threading changes through the orchestration, rollout, and data layers.
The framework is designed for scale from the start. It supports large-scale training with DeepSpeed, Ray, and vLLM, including sync and async execution modes, adaptive weight synchronization, and multi-node runs. The goal is to make it possible to do both: move fast on algorithms and still run realistic experiments at scale.
This is the first public release, so expect rough edges. We are open-sourcing FeynRL not just as a library, but as a foundation for building new post-training methods with the community.
For a detailed breakdown of the architecture, see the Architecture Overview.
- 🧪 Training paradigms: RL (PPO, GRPO, CISPO, P3O), preference-based learning (DPO), and supervised fine-tuning (SFT)
- 🖥️ Distributed training: Multi-GPU and multi-node via DeepSpeed (ZeRO Stage 1/2/3)
- 🎲 Rollouts / inference: vLLM-powered rollout engines with tensor parallelism
- 🛰️ Orchestration: Ray for scheduling training and rollout workers across nodes
- 🔀 Training-rollout scheduling: Sync and overlap (async) modes; the overlap engine interleaves chunk-based generation with training, using ESS-driven NCCL weight sync to keep rollout data fresh while significantly reducing GPU idle time
- 🔄 Weight sync: NCCL broadcast (fastest), direct in-memory transfer via Ray object store, and disk-based checkpoint reload, with automatic fallback chain (NCCL to direct to disk)
- 🧷 Parameter-efficient fine-tuning: LoRA via PEFT
- 🔢 Mixed-dataset sampling: Configurable multi-dataset sampling with ratios within a single training run
- 📈 Experiment tracking: MLflow and Weights & Biases support
- 🏅 Evaluation: Standalone eval pipeline with vLLM engines
For RL, Ray orchestrates the full training loop: it schedules DeepSpeed training workers and vLLM rollout workers across nodes, and coordinates weight synchronization between them. In sync mode, each epoch generates all rollouts, trains on them, syncs weights, and repeats. In overlap mode, generation is dispatched one chunk at a time while training runs concurrently on already-collected data, significantly reducing GPU idle time. An ESS (Effective Sample Size) metric monitors policy divergence and triggers weight sync at chunk boundaries when the rollout engines are idle, keeping data fresh without stalling the pipeline. Weight sync uses a three-tier fallback chain: NCCL broadcast, direct transfer via Ray object store, and disk-based checkpoint reload. SFT and DPO are simpler because they only require a single model and no rollout workers, so they run directly on DeepSpeed without Ray. All paradigms support full fine-tuning and LoRA, and plug into mixed-dataset sampling, experiment tracking, and standalone evaluation without changing the overall workflow.
The repository is organized so that algorithmic changes usually stay local:
algs/— Algorithm and optimization logic. Each algorithm (PPO, GRPO, CISPO, P3O, DPO, SFT) has its own module with a README documenting the math and pseudocode.rollouts/— Rollout generation, vLLM engine wrappers, weight sync, and replay buffer.rewards/— Pluggable reward functions (GSM8K, math verification, and custom).data_feeds/— Data loading, sampling, and mixed-dataset support.data_prep/— Dataset preparation scripts.configs/— YAML configs for RL, SFT, DPO, and evaluation, with full parameter reference.unit_tests/— Unit and integration tests.
We're excited to publicly release FeynRL as a preview! Some features and documentation are still evolving. We welcome feedback, bug reports, and contributions as we continue to build this together.
Installation & Setup — Configure your environment and dependencies.
Usage & Examples — Learn how to launch jobs and run experiments.
Configuration Reference — Full parameter guide for RL, SFT, DPO, and evaluation configs.
Contributions are welcome! Please see our Contributing Guidelines for details on how to get involved.
Check out the FAQ for common questions and answers.
