Conversation
added 30 commits
March 31, 2026 19:38
… estimator
TTT-Discover implementation (arXiv:2601.16175) for the Erdős Minimum
Overlap Problem.
New files:
- nemo_rl/algorithms/entropic_advantage_estimator.py
LOO entropic advantage with adaptive β via bisection.
Solves for β such that KL(softmax_β(R) || uniform) = ln(2),
then computes leave-one-out advantages w_i = exp(β·r_i)/Z_{-i} - 1.
- nemo_rl/environments/erdos_discovery_environment.py
Ray remote environment that calls the NeMo Gym Erdős resource
server for sandboxed code execution and reward computation.
Modified:
- nemo_rl/algorithms/grpo.py: add entropic_adaptive_beta to
advantage estimator factory + AdvEstimatorConfig TypedDict.
- nemo_rl/environments/utils.py: register erdos_discovery in
ENV_REGISTRY.
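A minimal sketch of the adaptive-β estimator described above, assuming the scheme from the commit message (function names here are illustrative, not the module's actual API): bisection finds β such that KL(softmax_β(R) || uniform) = ln(2), then leave-one-out advantages w_i = exp(β·r_i)/Z_{-i} - 1 are computed with max-shifted exponentials for numerical stability.

```python
import numpy as np

def kl_to_uniform(beta, r):
    # KL(softmax_beta(r) || uniform) = log(n) - H(p)
    z = beta * (r - r.max())              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    h = -(p * np.log(p + 1e-12)).sum()
    return np.log(len(r)) - h

def solve_beta(r, gamma=np.log(2), lo=0.0, hi=1.0, iters=60):
    # KL is monotone increasing in beta when rewards are not all equal:
    # expand hi until KL exceeds gamma, then bisect.
    while kl_to_uniform(hi, r) < gamma and hi < 1e8:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_to_uniform(mid, r) < gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def entropic_loo_advantages(rewards, gamma=np.log(2)):
    r = np.asarray(rewards, dtype=np.float64)
    if np.allclose(r, r[0]):
        return np.zeros_like(r)           # degenerate group: no signal
    beta = solve_beta(r, gamma)
    e = np.exp(beta * (r - r.max()))      # stable exponentials
    # leave-one-out normalizer: mean of exp(beta*r_j) over j != i
    # (the max-shift cancels in the ratio below)
    z_loo = (e.sum() - e) / (len(r) - 1)
    return e / z_loo - 1.0
```

The high-reward rollout in a group gets a positive advantage and the rest go negative, with β scaled per group so the softmax over rewards sits at a fixed "distance" (ln 2) from uniform.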
…ADME

- nemo_rl/utils/puct_buffer.py: General-purpose PUCT tree buffer for iterative optimization environments. Reusable across any task that needs exploration/exploitation state selection.
- nemo_rl/environments/ERDOS_DISCOVERY.md: Full documentation for the TTT-Discover integration: architecture diagram, component locations, config examples, hyperparameters from the paper.
- README.md: Added Advantage Estimators (including entropic adaptive-β) and PUCT Buffer to the Features section.
Run script (examples/run_discover.py):
- DiscoverDataset: IterableDataset backed by the PUCT buffer via HTTP. Calls /select_state on the Gym resource server each step to get dynamically selected states, and generates DatumSpecs with tokenized prompts and parent_state metadata.
- setup_discover_data(): Wires the dataset + ErdosDiscoveryEnvironment Ray actor and returns the (dataset, env) tuple for grpo_train().
- Follows the sliding_puzzle pattern for custom env integration.

Config (examples/configs/grpo_erdos_discover.yaml):
- entropic_adaptive_beta advantage estimator (gamma=ln(2))
- 8 groups × 64 rollouts = 512 trajectories per step
- LoRA r=32, lr=4e-5, 50 training steps
- KL penalty 0.1, importance sampling loss
- 32K context window for long code generation
- grpo_erdos_discover_debug.yaml: Qwen3-1.7B, single node, 4×8=32 trajectories, 5 steps, 4K seq len, colocated vLLM. For testing the full pipeline before scaling up.
- erdos_debug.slurm: Starts the Gym resource server in the background on the same node, then runs NeMo RL GRPO training.
…aunch

- Environment supports resource_server_url="inline", which runs the code sandbox and reward computation directly in-process, with no Gym server dependency.
- Debug config updated to use inline mode.
- SLURM script simplified: no Gym server startup, just uv run.
…ug iterations

- Use load_config + OmegaConf resolve instead of yaml.safe_load
- Match the setup() and grpo_train() calling convention from sliding_puzzle
- Fix SLURM script: PATH for uv, unbound PYTHONPATH variable
- Debug config uses defaults: grpo_math_1B.yaml for proper field inheritance
…nf_resolvers, add launch scripts and shim for container deployment
Covers everything we learned getting this running on the d2dfac12 B200 cluster: ray.sub patches, container compat, async fixes, config gotchas, and the full working launch pattern.
8 nodes: 2 inference (vLLM TP=8) + 6 training (Megatron TP=4, EP=8).
Uses Dakota's custom super-v3 container with NemotronH vLLM support.
No LoRA initially (Megatron backend); full fine-tune with optimizer CPU offload + activation checkpointing.
Rewrites erdos_discovery_environment.py and run_discover.py to match the reference implementation at github.com/test-time-training/discover:
- C5 = max(np.correlate(h, 1-h, mode="full") * dx) formulation
- Code must define run(seed, budget_s) returning (h_values, c5_bound, n_points)
- scipy/cvxpy allowed in sandbox
- State context shows parent code + improvement tracking (State.to_prompt)
- Full ErdosMinOverlapEnv.get_question() prompt with problem description
- reward = 1 / (1e-8 + c5_bound)
- Initial states: random n in [40,100], perturbed h=0.5
- Removes inline/HTTP mode split (always computes directly)
- DiscoverDataset generates diverse initial states each step
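The C5 evaluation and reward described above can be sketched in a few lines. This is a toy illustration on a uniform grid; the environment's actual grid construction, normalization, and FFT-based evaluation are defined by the reference implementation.

```python
import numpy as np

def c5_bound(h, dx):
    # Max over shifts of the overlap correlation between h and 1-h,
    # scaled by the grid spacing dx (a Riemann-sum approximation of the
    # continuous correlation integral).
    return float(np.max(np.correlate(h, 1.0 - h, mode="full")) * dx)

def reward(c5):
    # Lower C5 is better; reward is its (regularized) reciprocal.
    return 1.0 / (1e-8 + c5)

# Toy example: constant h = 0.5 on a unit grid of 100 points.
n = 100
dx = 1.0 / n
h = np.full(n, 0.5)
c5 = c5_bound(h, dx)   # full overlap: 100 * 0.5 * 0.5 * dx = 0.25
```

With this shape of reward, a drop in the C5 bound from ~0.3815 to ~0.3809 moves the reward only slightly, which is part of why the adaptive-β advantage estimator (rather than raw rewards) drives the learning signal.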
…id rate

- Seq len 4096 -> 8192 (7168 max_new_tokens)
- wandb enabled: project=ttt-discover-erdos
- Environment logs erdos/max_reward, erdos/avg_reward, erdos/valid_rate, erdos/best_c5, erdos/global_best_c5 to both the console and the metrics dict
- Prints a summary line per step for easy monitoring
…n Ray actors)

signal.alarm only works in the main thread. Ray actors run tasks in worker threads, so SIGALRM never fires and sandbox code can block indefinitely. Now uses a ThreadPoolExecutor with a 120s timeout. Also caps the run() budget_s at 60s for faster iteration.
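The caller-side timeout pattern this commit describes might look roughly like the following (a hypothetical helper, not the environment's actual code). Note the caveat that motivates the later SIGKILL change: a timed-out thread cannot be killed and keeps running until the sandboxed call returns.

```python
import concurrent.futures

def run_with_timeout(fn, args=(), timeout_s=120.0):
    # signal.alarm is a no-op outside the main thread, so inside a Ray
    # actor the wall-clock limit is enforced from the caller's side:
    # run the sandboxed call in a worker thread and bound the wait.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    fut = pool.submit(fn, *args)
    try:
        return fut.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None  # caller treats this as an invalid rollout
    finally:
        # Don't block on a hung worker. The thread itself keeps running
        # until fn returns -- the reason later commits move to SIGKILL'd
        # subprocesses for a true hard kill.
        pool.shutdown(wait=False)
```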
…er step

- 10 nodes: 2 inference + 8 training (was 8 total)
- CP=2 enables 16k context (was 8k)
- 15360 max_new_tokens (was 7168)
- Save the first 10 + all valid outputs per step for debugging the valid rate
- ERDOS_LOG_DIR for output files
added 14 commits
April 2, 2026 15:23
…hreads can't be killed)
…re SIGKILL, 1000s timeout
- ErdosRefPUCTSampler: full PUCT tree with state tracking, update_states, record_failed_rollout, flush, sample_states (matches ttt-discover-ref)
- PUCTDiscoverDataset: calls env.puct_sample_states.remote() for prompts (shared state between dataset and env via Ray actor)
- Environment _sync_step updates the PUCT buffer on success/failure
- RandomDiscoverDataset for validation (no PUCT)
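For reference, the selection rule a PUCT buffer is built around can be sketched as follows. This is generic textbook PUCT with hypothetical names, not the ErdosRefPUCTSampler API: each step walks the tree picking the child maximizing Q + c·P·sqrt(N_parent)/(1 + N), balancing exploitation of good constructions against exploration of rarely visited ones.

```python
import math

class Node:
    def __init__(self, state, prior=1.0, parent=None):
        self.state, self.prior, self.parent = state, prior, parent
        self.children = []
        self.visits, self.value_sum = 0, 0.0

    @property
    def q(self):
        # Mean rollout reward from this node (0 if unvisited).
        return self.value_sum / self.visits if self.visits else 0.0

def puct_select(root, c_puct=1.0):
    # Descend the tree, at each level taking the child that maximizes
    # the PUCT score; stop at a leaf. This is the state the next batch
    # of rollouts will try to improve on.
    node = root
    while node.children:
        node = max(
            node.children,
            key=lambda ch: ch.q
            + c_puct * ch.prior * math.sqrt(node.visits) / (1 + ch.visits),
        )
    return node

def backprop(node, reward):
    # Propagate a rollout's reward up to the root.
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent
```

With few visits the score is Q-dominated (exploit the best known construction); as one branch accumulates visits, the sqrt(N_parent)/(1 + N) term pulls selection toward under-explored siblings.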
…bmodule

Removed:
- shim/ (old copy of code, unused)
- Debug configs and launch scripts (grpo_erdos_debug_16k, launch_erdos_debug*)
- grpo_superv3.yaml (base container config, not needed in PR)
- Scratch files (example.txt, keep.py, message (3).md, erdos_debug*.slurm)
- scripts/convert_ds_to_nemorl_format.py
- test_gptoss_vllm.sh
- 3rdparty/Gym-workspace/Gym submodule (training runs inline, no Gym server needed)
bsigala
approved these changes
Apr 12, 2026
bsigala
left a comment
TTT-Discover: Erdős Minimum Overlap Problem via GRPO
Summary
Implements the TTT-Discover algorithm for the Erdős Minimum Overlap Problem using NeMo RL's GRPO training loop. The system trains a 120B MoE model (Nemotron-3-Super) to generate Python programs that construct step functions minimizing the C5 overlap correlation — a longstanding open problem in combinatorics.
Result: C5 = 0.380918, surpassing the published SOTA of 0.380920.
What's included
Core components:
- `nemo_rl/environments/erdos_discovery_environment.py`: Custom NeMo RL environment that executes LLM-generated code in sandboxed subprocesses, computes the C5 correlation via FFT, and returns `reward = 1 / (1e-8 + c5_bound)`. Batched parallel sandbox execution (8 concurrent processes) with a hard-kill timeout for scipy/Fortran code that ignores signals.
- `nemo_rl/environments/erdos_ref_puct_sampler.py`: Stateful PUCT (Polynomial Upper Confidence Trees) buffer that tracks the tree of discovered constructions across training steps. Supports save/load for checkpoint resume.
- `nemo_rl/algorithms/entropic_advantage_estimator.py`: Adaptive-β leave-one-out advantage estimator from the TTT-Discover paper, with entropy-based temperature scaling for GRPO.

Training infrastructure:

- `examples/run_discover.py`: Main entry point. Wires `PUCTDiscoverDataset` (pulls PUCT-selected states from the Ray env actor) to the GRPO training loop. Supports `--resume`/`--resume-from` for checkpoint recovery.
- `examples/configs/grpo_erdos_discover.yaml`: Production config: 8 nodes (2 inference + 6 training), 16k context, CP=2, Nemotron-3-Super-120B-A12B (instruct), entropic adaptive-β advantages, 30s sandbox timeout.
- `launch_scripts/launch_erdos_120b.sh`: SLURM launch script for the Together AI cluster (pyxis/enroot, super-v3 container). Copies custom files to `/opt/nemo-rl` and monkey-patches `grpo.py`/`utils.py` at runtime.

Debug/development configs:

- `examples/configs/grpo_erdos_debug_16k.yaml`: 1-node Qwen2.5-1.5B debug config
- `examples/configs/grpo_erdos_discover_debug.yaml`: Lightweight debug config
- `launch_scripts/launch_erdos_debug*.sh`: Debug launch scripts

Documentation:

- `nemo_rl/environments/ERDOS_DISCOVERY.md`: Problem description, reward formulation, architecture
- `nemo_rl/environments/LESSONS_LEARNED.md`: Cluster-specific fixes and debugging notes

Key design decisions
- Subprocess sandbox with hard kill: LLM-generated code runs in `multiprocessing.Process` with `signal.alarm` for a cooperative timeout + `p.kill()` (SIGKILL) for a hard timeout. Required because scipy's Fortran extensions ignore Python signals. 8 processes run concurrently in batches.
- Stateful PUCT via Ray actor: The dataset and environment share state through a single Ray actor. The dataset calls `env.puct_sample_states.remote()` for prompts; the environment updates the PUCT tree in `_sync_step`. PUCT state persists to JSON for checkpoint resume.
- Monkey-patching at runtime: Custom files are copied into the container at launch to avoid rebuilding the 27GB image. `grpo.py` and `utils.py` are patched with sed/python to register the entropic advantage estimator and the erdos environment.
- No LoRA: Full-weight GRPO training on the 120B model. TP=4, EP=8, CP=2 for 16k context on 6 training nodes (48 GPUs).
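A simplified sketch of the subprocess-sandbox pattern, assuming the `run(seed, budget_s)` contract described in this PR (the restricted-builtins setup and the cooperative `signal.alarm` layer are omitted; names are illustrative):

```python
import multiprocessing as mp

def _worker(code, conn):
    # Child process: execute the generated code and send back the result.
    try:
        ns = {}
        exec(code, ns)
        conn.send(("ok", ns["run"](seed=0, budget_s=30)))
    except Exception as e:
        conn.send(("error", repr(e)))
    finally:
        conn.close()

def run_sandboxed(code, timeout_s=30.0):
    parent, child = mp.Pipe(duplex=False)
    p = mp.Process(target=_worker, args=(code, child), daemon=True)
    p.start()
    p.join(timeout_s)
    if p.is_alive():
        # SIGKILL: scipy's Fortran extensions can ignore Python-level
        # signals, so a hard kill is the only reliable stop.
        p.kill()
        p.join()
        return ("timeout", None)
    return parent.recv() if parent.poll() else ("error", "no result")
```

Running a batch of these concurrently (the 8-wide batches mentioned above) is then a matter of starting several processes before joining, rather than one at a time.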
Training progression
| Step | Best C5 | Valid Rate | Avg Reward |
| --- | --- | --- | --- |
| 1 | 0.381488 | 4.4% | 0.099 |
| 5 | 0.381254 | 22.2% | 0.575 |
| 10 | 0.381125 | 44.8% | 1.144 |
| 15 | 0.380939 | 53.0% | 1.345 |
| 22 | 0.380918 | 41.5% | 1.075 |
| SOTA | 0.380920 | – | – |

Dependencies
- NeMo RL v0.5.0+ (super-v3 container for Nemotron-3-Super support)
- `scipy`, `numpy` (available in sandbox for LLM-generated optimization code)
- Cluster with pyxis/enroot, B200 GPUs