Conversation
added 30 commits
March 31, 2026 19:38
… estimator
TTT-Discover implementation (arXiv:2601.16175) for the Erdős Minimum
Overlap Problem.
New files:
- nemo_rl/algorithms/entropic_advantage_estimator.py
LOO entropic advantage with adaptive β via bisection.
Solves for β such that KL(softmax_β(R) || uniform) = ln(2),
then computes leave-one-out advantages w_i = exp(β·r_i)/Z_{-i} - 1.
- nemo_rl/environments/erdos_discovery_environment.py
Ray remote environment that calls the NeMo Gym Erdős resource
server for sandboxed code execution and reward computation.
Modified:
- nemo_rl/algorithms/grpo.py: add entropic_adaptive_beta to
advantage estimator factory + AdvEstimatorConfig TypedDict.
- nemo_rl/environments/utils.py: register erdos_discovery in
ENV_REGISTRY.
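A minimal sketch of the adaptive-β estimator described above, assuming the scheme from the commit message (function names here are illustrative, not the module's actual API): bisection finds β such that KL(softmax_β(R) || uniform) = ln(2), then leave-one-out advantages w_i = exp(β·r_i)/Z_{-i} - 1 are computed with max-shifted exponentials for numerical stability.

```python
import numpy as np

def kl_to_uniform(beta, r):
    # KL(softmax_beta(r) || uniform) = log(n) - H(p)
    z = beta * (r - r.max())              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    h = -(p * np.log(p + 1e-12)).sum()
    return np.log(len(r)) - h

def solve_beta(r, gamma=np.log(2), lo=0.0, hi=1.0, iters=60):
    # KL is monotone increasing in beta when rewards are not all equal:
    # expand hi until KL exceeds gamma, then bisect.
    while kl_to_uniform(hi, r) < gamma and hi < 1e8:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_to_uniform(mid, r) < gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def entropic_loo_advantages(rewards, gamma=np.log(2)):
    r = np.asarray(rewards, dtype=np.float64)
    if np.allclose(r, r[0]):
        return np.zeros_like(r)           # degenerate group: no signal
    beta = solve_beta(r, gamma)
    e = np.exp(beta * (r - r.max()))      # stable exponentials
    # leave-one-out normalizer: mean of exp(beta*r_j) over j != i
    # (the max-shift cancels in the ratio below)
    z_loo = (e.sum() - e) / (len(r) - 1)
    return e / z_loo - 1.0
```

The high-reward rollout in a group gets a positive advantage and the rest go negative, with β scaled per group so the softmax over rewards sits at a fixed "distance" (ln 2) from uniform.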
…ADME

- nemo_rl/utils/puct_buffer.py: General-purpose PUCT tree buffer for iterative optimization environments. Reusable across any task that needs exploration/exploitation state selection.
- nemo_rl/environments/ERDOS_DISCOVERY.md: Full documentation for the TTT-Discover integration: architecture diagram, component locations, config examples, hyperparameters from the paper.
- README.md: Added Advantage Estimators (including entropic adaptive-β) and PUCT Buffer to the Features section.
Run script (examples/run_discover.py):
- DiscoverDataset: IterableDataset backed by the PUCT buffer via HTTP. Calls /select_state on the Gym resource server each step to get dynamically selected states, and generates DatumSpecs with tokenized prompts and parent_state metadata.
- setup_discover_data(): Wires the dataset + ErdosDiscoveryEnvironment Ray actor and returns the (dataset, env) tuple for grpo_train().
- Follows the sliding_puzzle pattern for custom env integration.

Config (examples/configs/grpo_erdos_discover.yaml):
- entropic_adaptive_beta advantage estimator (gamma=ln(2))
- 8 groups × 64 rollouts = 512 trajectories per step
- LoRA r=32, lr=4e-5, 50 training steps
- KL penalty 0.1, importance sampling loss
- 32K context window for long code generation
- grpo_erdos_discover_debug.yaml: Qwen3-1.7B, single node, 4×8=32 trajectories, 5 steps, 4K seq len, colocated vLLM. For testing the full pipeline before scaling up.
- erdos_debug.slurm: Starts the Gym resource server in the background on the same node, then runs NeMo RL GRPO training.
…aunch

- Environment supports resource_server_url="inline", which runs the code sandbox and reward computation directly in-process, with no Gym server dependency.
- Debug config updated to use inline mode.
- SLURM script simplified: no Gym server startup, just uv run.
…ug iterations

- Use load_config + OmegaConf resolve instead of yaml.safe_load
- Match the setup() and grpo_train() calling convention from sliding_puzzle
- Fix SLURM script: PATH for uv, unbound PYTHONPATH variable
- Debug config uses defaults: grpo_math_1B.yaml for proper field inheritance
…nf_resolvers, add launch scripts and shim for container deployment
Covers everything we learned getting this running on the d2dfac12 B200 cluster: ray.sub patches, container compat, async fixes, config gotchas, and the full working launch pattern.
8 nodes: 2 inference (vLLM TP=8) + 6 training (Megatron TP=4, EP=8).
Uses Dakota's custom super-v3 container with NemotronH vLLM support.
No LoRA initially (Megatron backend); full fine-tune with optimizer CPU offload + activation checkpointing.
Rewrites erdos_discovery_environment.py and run_discover.py to match the reference implementation at github.com/test-time-training/discover:
- C5 = max(np.correlate(h, 1-h, mode="full") * dx) formulation
- Code must define run(seed, budget_s) returning (h_values, c5_bound, n_points)
- scipy/cvxpy allowed in sandbox
- State context shows parent code + improvement tracking (State.to_prompt)
- Full ErdosMinOverlapEnv.get_question() prompt with problem description
- reward = 1 / (1e-8 + c5_bound)
- Initial states: random n in [40,100], perturbed h=0.5
- Removes inline/HTTP mode split (always computes directly)
- DiscoverDataset generates diverse initial states each step
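The C5 evaluation and reward described above can be sketched in a few lines. This is a toy illustration on a uniform grid; the environment's actual grid construction, normalization, and FFT-based evaluation are defined by the reference implementation.

```python
import numpy as np

def c5_bound(h, dx):
    # Max over shifts of the overlap correlation between h and 1-h,
    # scaled by the grid spacing dx (a Riemann-sum approximation of the
    # continuous correlation integral).
    return float(np.max(np.correlate(h, 1.0 - h, mode="full")) * dx)

def reward(c5):
    # Lower C5 is better; reward is its (regularized) reciprocal.
    return 1.0 / (1e-8 + c5)

# Toy example: constant h = 0.5 on a unit grid of 100 points.
n = 100
dx = 1.0 / n
h = np.full(n, 0.5)
c5 = c5_bound(h, dx)   # full overlap: 100 * 0.5 * 0.5 * dx = 0.25
```

With this shape of reward, a drop in the C5 bound from ~0.3815 to ~0.3809 moves the reward only slightly, which is part of why the adaptive-β advantage estimator (rather than raw rewards) drives the learning signal.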
…id rate

- Seq len 4096 -> 8192 (7168 max_new_tokens)
- wandb enabled: project=ttt-discover-erdos
- Environment logs erdos/max_reward, erdos/avg_reward, erdos/valid_rate, erdos/best_c5, erdos/global_best_c5 to both the console and the metrics dict
- Prints a summary line per step for easy monitoring
…n Ray actors)

signal.alarm only works in the main thread. Ray actors run tasks in worker threads, so SIGALRM never fires and sandbox code can block indefinitely. Now uses a ThreadPoolExecutor with a 120s timeout. Also caps the run() budget_s at 60s for faster iteration.
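The caller-side timeout pattern this commit describes might look roughly like the following (a hypothetical helper, not the environment's actual code). Note the caveat that motivates the later SIGKILL change: a timed-out thread cannot be killed and keeps running until the sandboxed call returns.

```python
import concurrent.futures

def run_with_timeout(fn, args=(), timeout_s=120.0):
    # signal.alarm is a no-op outside the main thread, so inside a Ray
    # actor the wall-clock limit is enforced from the caller's side:
    # run the sandboxed call in a worker thread and bound the wait.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    fut = pool.submit(fn, *args)
    try:
        return fut.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None  # caller treats this as an invalid rollout
    finally:
        # Don't block on a hung worker. The thread itself keeps running
        # until fn returns -- the reason later commits move to SIGKILL'd
        # subprocesses for a true hard kill.
        pool.shutdown(wait=False)
```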
…er step

- 10 nodes: 2 inference + 8 training (was 8 total)
- CP=2 enables 16k context (was 8k)
- 15360 max_new_tokens (was 7168)
- Save the first 10 + all valid outputs per step for debugging the valid rate
- ERDOS_LOG_DIR for output files
added 14 commits
April 2, 2026 15:23
…hreads can't be killed)
…re SIGKILL, 1000s timeout
- ErdosRefPUCTSampler: full PUCT tree with state tracking, update_states, record_failed_rollout, flush, sample_states (matches ttt-discover-ref)
- PUCTDiscoverDataset: calls env.puct_sample_states.remote() for prompts (shared state between dataset and env via Ray actor)
- Environment _sync_step updates the PUCT buffer on success/failure
- RandomDiscoverDataset for validation (no PUCT)
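For reference, the selection rule a PUCT buffer is built around can be sketched as follows. This is generic textbook PUCT with hypothetical names, not the ErdosRefPUCTSampler API: each step walks the tree picking the child maximizing Q + c·P·sqrt(N_parent)/(1 + N), balancing exploitation of good constructions against exploration of rarely visited ones.

```python
import math

class Node:
    def __init__(self, state, prior=1.0, parent=None):
        self.state, self.prior, self.parent = state, prior, parent
        self.children = []
        self.visits, self.value_sum = 0, 0.0

    @property
    def q(self):
        # Mean rollout reward from this node (0 if unvisited).
        return self.value_sum / self.visits if self.visits else 0.0

def puct_select(root, c_puct=1.0):
    # Descend the tree, at each level taking the child that maximizes
    # the PUCT score; stop at a leaf. This is the state the next batch
    # of rollouts will try to improve on.
    node = root
    while node.children:
        node = max(
            node.children,
            key=lambda ch: ch.q
            + c_puct * ch.prior * math.sqrt(node.visits) / (1 + ch.visits),
        )
    return node

def backprop(node, reward):
    # Propagate a rollout's reward up to the root.
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent
```

With few visits the score is Q-dominated (exploit the best known construction); as one branch accumulates visits, the sqrt(N_parent)/(1 + N) term pulls selection toward under-explored siblings.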
…bmodule

Removed:
- shim/ (old copy of code, unused)
- Debug configs and launch scripts (grpo_erdos_debug_16k, launch_erdos_debug*)
- grpo_superv3.yaml (base container config, not needed in PR)
- Scratch files (example.txt, keep.py, message (3).md, erdos_debug*.slurm)
- scripts/convert_ds_to_nemorl_format.py
- test_gptoss_vllm.sh
- 3rdparty/Gym-workspace/Gym submodule (training runs inline, no Gym server needed)
bsigala
approved these changes
Apr 12, 2026
bsigala
left a comment
TTT-Discover: Erdős Minimum Overlap Problem via GRPO
Summary
Implements the TTT-Discover algorithm for the Erdős Minimum Overlap Problem using NeMo RL's GRPO training loop. The system trains a 120B MoE model (Nemotron-3-Super) to generate Python programs that construct step functions minimizing the C5 overlap correlation — a longstanding open problem in combinatorics.
Result: C5 = 0.380918, surpassing the published SOTA of 0.380920.
What's included
Core components:
- `nemo_rl/environments/erdos_discovery_environment.py`: Custom NeMo RL environment that executes LLM-generated code in sandboxed subprocesses, computes the C5 correlation via FFT, and returns `reward = 1 / (1e-8 + c5_bound)`. Batched parallel sandbox execution (8 concurrent processes) with a hard-kill timeout for scipy/Fortran code that ignores signals.
- `nemo_rl/environments/erdos_ref_puct_sampler.py`: Stateful PUCT (Polynomial Upper Confidence Trees) buffer that tracks the tree of discovered constructions across training steps. Supports save/load for checkpoint resume.
- `nemo_rl/algorithms/entropic_advantage_estimator.py`: Adaptive-β leave-one-out advantage estimator from the TTT-Discover paper, with entropy-based temperature scaling for GRPO.

Training infrastructure:

- `examples/run_discover.py`: Main entry point. Wires `PUCTDiscoverDataset` (pulls PUCT-selected states from the Ray env actor) to the GRPO training loop. Supports `--resume`/`--resume-from` for checkpoint recovery.
- `examples/configs/grpo_erdos_discover.yaml`: Production config: 8 nodes (2 inference + 6 training), 16k context, CP=2, Nemotron-3-Super-120B-A12B (instruct), entropic adaptive-β advantages, 30s sandbox timeout.
- `launch_scripts/launch_erdos_120b.sh`: SLURM launch script for the Together AI cluster (pyxis/enroot, super-v3 container). Copies custom files to `/opt/nemo-rl` and monkey-patches `grpo.py`/`utils.py` at runtime.

Debug/development configs:

- `examples/configs/grpo_erdos_debug_16k.yaml`: 1-node Qwen2.5-1.5B debug config
- `examples/configs/grpo_erdos_discover_debug.yaml`: Lightweight debug config
- `launch_scripts/launch_erdos_debug*.sh`: Debug launch scripts

Documentation:

- `nemo_rl/environments/ERDOS_DISCOVERY.md`: Problem description, reward formulation, architecture
- `nemo_rl/environments/LESSONS_LEARNED.md`: Cluster-specific fixes and debugging notes

Key design decisions
- Subprocess sandbox with hard kill: LLM-generated code runs in `multiprocessing.Process` with `signal.alarm` for a cooperative timeout + `p.kill()` (SIGKILL) for a hard timeout. Required because scipy's Fortran extensions ignore Python signals. 8 processes run concurrently in batches.
- Stateful PUCT via Ray actor: The dataset and environment share state through a single Ray actor. The dataset calls `env.puct_sample_states.remote()` for prompts; the environment updates the PUCT tree in `_sync_step`. PUCT state persists to JSON for checkpoint resume.
- Monkey-patching at runtime: Custom files are copied into the container at launch to avoid rebuilding the 27GB image. `grpo.py` and `utils.py` are patched with sed/python to register the entropic advantage estimator and the erdos environment.
- No LoRA: Full-weight GRPO training on the 120B model. TP=4, EP=8, CP=2 for 16k context on 6 training nodes (48 GPUs).
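A simplified sketch of the subprocess-sandbox pattern, assuming the `run(seed, budget_s)` contract described in this PR (the restricted-builtins setup and the cooperative `signal.alarm` layer are omitted; names are illustrative):

```python
import multiprocessing as mp

def _worker(code, conn):
    # Child process: execute the generated code and send back the result.
    try:
        ns = {}
        exec(code, ns)
        conn.send(("ok", ns["run"](seed=0, budget_s=30)))
    except Exception as e:
        conn.send(("error", repr(e)))
    finally:
        conn.close()

def run_sandboxed(code, timeout_s=30.0):
    parent, child = mp.Pipe(duplex=False)
    p = mp.Process(target=_worker, args=(code, child), daemon=True)
    p.start()
    p.join(timeout_s)
    if p.is_alive():
        # SIGKILL: scipy's Fortran extensions can ignore Python-level
        # signals, so a hard kill is the only reliable stop.
        p.kill()
        p.join()
        return ("timeout", None)
    return parent.recv() if parent.poll() else ("error", "no result")
```

Running a batch of these concurrently (the 8-wide batches mentioned above) is then a matter of starting several processes before joining, rather than one at a time.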
Training progression
| Step | Best C5 | Valid Rate | Avg Reward |
| --- | --- | --- | --- |
| 1 | 0.381488 | 4.4% | 0.099 |
| 5 | 0.381254 | 22.2% | 0.575 |
| 10 | 0.381125 | 44.8% | 1.144 |
| 15 | 0.380939 | 53.0% | 1.345 |
| 22 | 0.380918 | 41.5% | 1.075 |
| SOTA | 0.380920 | – | – |

Dependencies
- NeMo RL v0.5.0+ (super-v3 container for Nemotron-3-Super support)
- `scipy`, `numpy` (available in sandbox for LLM-generated optimization code)
- Cluster with pyxis/enroot, B200 GPUs