Implement "learning to discover at test time" Erdos env + training code #1

Open
mormio wants to merge 48 commits into main from mm/tttd

Conversation

Collaborator

@mormio mormio commented Apr 7, 2026

TTT-Discover: Erdős Minimum Overlap Problem via GRPO

Summary

Implements the TTT-Discover algorithm for the Erdős Minimum Overlap Problem using NeMo RL's GRPO training loop. The system trains a 120B MoE model (Nemotron-3-Super) to generate Python programs that construct step functions minimizing the C5 overlap correlation — a longstanding open problem in combinatorics.

Result: C5 = 0.380918, improving on the published SOTA of 0.380920 (lower is better).

What's included

Core components:

  • nemo_rl/environments/erdos_discovery_environment.py — Custom NeMo RL environment that executes LLM-generated code in sandboxed subprocesses, computes C5 correlation via FFT, and returns reward = 1 / (1e-8 + c5_bound). Batched parallel sandbox execution (8 concurrent processes) with hard-kill timeout for scipy/Fortran code that ignores signals.
  • nemo_rl/environments/erdos_ref_puct_sampler.py — Stateful PUCT (Polynomial Upper Confidence Trees) buffer that tracks the tree of discovered constructions across training steps. Supports save/load for checkpoint resume.
  • nemo_rl/algorithms/entropic_advantage_estimator.py — Adaptive-β leave-one-out advantage estimator from the TTT-Discover paper, with entropy-based temperature scaling for GRPO.
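The reward mapping in the first bullet is simple enough to state directly. A minimal sketch (the function name is illustrative, not the environment's actual symbol):

```python
def c5_reward(c5_bound: float) -> float:
    # Inverse of the bound: a smaller C5 yields a larger reward.
    # The 1e-8 epsilon guards against division by zero on degenerate outputs.
    return 1.0 / (1e-8 + c5_bound)
```

Because the interesting C5 values cluster tightly around 0.38, the inverse mapping turns tiny absolute improvements in the bound into measurable reward differences.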

Training infrastructure:

  • examples/run_discover.py — Main entry point. Wires PUCTDiscoverDataset (pulls PUCT-selected states from the Ray env actor) to the GRPO training loop. Supports --resume / --resume-from for checkpoint recovery.
  • examples/configs/grpo_erdos_discover.yaml — Production config: 8 nodes (2 inference + 6 training), 16k context, CP=2, Nemotron-3-Super-120B-A12B (instruct), entropic adaptive-β advantages, 30s sandbox timeout.
  • launch_scripts/launch_erdos_120b.sh — SLURM launch script for the Together AI cluster (pyxis/enroot, super-v3 container). Copies custom files to /opt/nemo-rl and monkey-patches grpo.py/utils.py at runtime.

Debug/development configs:

  • examples/configs/grpo_erdos_debug_16k.yaml — 1-node Qwen2.5-1.5B debug config
  • examples/configs/grpo_erdos_discover_debug.yaml — Lightweight debug config
  • launch_scripts/launch_erdos_debug*.sh — Debug launch scripts

Documentation:

  • nemo_rl/environments/ERDOS_DISCOVERY.md — Problem description, reward formulation, architecture
  • nemo_rl/environments/LESSONS_LEARNED.md — Cluster-specific fixes and debugging notes

Key design decisions

  • Subprocess sandbox with hard kill: LLM-generated code runs in multiprocessing.Process with signal.alarm for cooperative timeout + p.kill() (SIGKILL) for hard timeout. Required because scipy's Fortran extensions ignore Python signals. 8 processes run concurrently in batches.
  • Stateful PUCT via Ray actor: The dataset and environment share state through a single Ray actor. The dataset calls env.puct_sample_states.remote() for prompts; the environment updates the PUCT tree in _sync_step. PUCT state persists to JSON for checkpoint resume.
  • Monkey-patching at runtime: Custom files are copied into the container at launch to avoid rebuilding the 27GB image. grpo.py and utils.py are patched with sed/python to register the entropic advantage estimator and erdos environment.
  • No LoRA: Full-weight GRPO training on the 120B model. TP=4, EP=8, CP=2 for 16k context on 6 training nodes (48 GPUs).
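The hard-kill sandbox in the first bullet can be sketched as follows. This is a stripped-down illustration, assuming a POSIX host; names are hypothetical, and the real environment additionally batches 8 of these in parallel and arms signal.alarm inside the child:

```python
import multiprocessing as mp

def _worker(code_str, q):
    # Execute untrusted code in a fresh interpreter process.
    ns = {}
    try:
        exec(code_str, ns)
        q.put(("ok", ns.get("result")))
    except Exception as e:
        q.put(("error", repr(e)))

def run_sandboxed(code_str, timeout_s=30.0):
    ctx = mp.get_context("fork")  # POSIX-only; avoids re-importing __main__
    q = ctx.Queue()
    p = ctx.Process(target=_worker, args=(code_str, q))
    p.start()
    p.join(timeout_s)
    if p.is_alive():
        p.kill()  # SIGKILL cannot be ignored, unlike SIGALRM/SIGTERM
        p.join()
        return ("timeout", None)
    try:
        return q.get(timeout=1.0)
    except Exception:
        return ("error", "no output")
```

The key point is that `p.kill()` delivers SIGKILL to a separate process, which reliably terminates Fortran extensions that never return to the Python signal handler.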

Training progression

Step   Best C5    Valid Rate   Avg Reward
1      0.381488   4.4%         0.099
5      0.381254   22.2%        0.575
10     0.381125   44.8%        1.144
15     0.380939   53.0%        1.345
22     0.380918   41.5%        1.075
SOTA   0.380920   -            -

Dependencies

  • NeMo RL v0.5.0+ (super-v3 container for Nemotron-3-Super support)
  • scipy, numpy (available in sandbox for LLM-generated optimization code)
  • Cluster with pyxis/enroot, B200 GPUs

Morgane Moss added 30 commits March 31, 2026 19:38
… estimator

TTT-Discover implementation (arXiv:2601.16175) for the Erdős Minimum
Overlap Problem.

New files:
- nemo_rl/algorithms/entropic_advantage_estimator.py
  LOO entropic advantage with adaptive β via bisection.
  Solves for β such that KL(softmax_β(R) || uniform) = ln(2),
  then computes leave-one-out advantages w_i = exp(β·r_i)/Z_{-i} - 1.

- nemo_rl/environments/erdos_discovery_environment.py
  Ray remote environment that calls the NeMo Gym Erdős resource
  server for sandboxed code execution and reward computation.

Modified:
- nemo_rl/algorithms/grpo.py: add entropic_adaptive_beta to
  advantage estimator factory + AdvEstimatorConfig TypedDict.
- nemo_rl/environments/utils.py: register erdos_discovery in
  ENV_REGISTRY.
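A self-contained sketch of that estimator, under the assumption that β is found by bisection (monotonicity of the KL in β) and that Z_{-i} is the leave-one-out mean of exp(β·r_j); helper names are hypothetical:

```python
import numpy as np

def kl_to_uniform(beta, rewards):
    # KL(softmax_beta(R) || uniform) = log(n) - H(softmax_beta(R))
    z = np.exp(beta * (rewards - rewards.max()))  # shift for numerical stability
    p = z / z.sum()
    return float(np.sum(p * np.log(p * len(rewards) + 1e-12)))

def solve_beta(rewards, target=np.log(2.0), iters=60):
    # KL is 0 at beta=0 and grows with beta, so expand the upper
    # bracket until the target is exceeded, then bisect.
    lo, hi = 0.0, 1.0
    while kl_to_uniform(hi, rewards) < target and hi < 1e6:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl_to_uniform(mid, rewards) < target else (lo, mid)
    return 0.5 * (lo + hi)

def entropic_loo_advantages(rewards):
    r = np.asarray(rewards, dtype=np.float64)
    beta = solve_beta(r)
    e = np.exp(beta * (r - r.max()))      # the shift cancels in the ratio below
    z_loo = (e.sum() - e) / (len(r) - 1)  # leave-one-out mean of exp(beta * r_j)
    return e / z_loo - 1.0, beta
```

Rollouts above the leave-one-out baseline get positive advantages; the adaptive β keeps the softmax exactly ln(2) nats sharper than uniform regardless of the reward scale.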
…ADME

- nemo_rl/utils/puct_buffer.py: General-purpose PUCT tree buffer for
  iterative optimization environments. Reusable across any task that
  needs exploration/exploitation state selection.

- nemo_rl/environments/ERDOS_DISCOVERY.md: Full documentation for the
  TTT-Discover integration — architecture diagram, component locations,
  config examples, hyperparameters from the paper.

- README.md: Added Advantage Estimators (including entropic adaptive-β)
  and PUCT Buffer to the Features section.
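For illustration, the exploration/exploitation selection such a buffer performs follows the standard PUCT rule; the exact constant and field names below are assumptions, not the file's API:

```python
import math

def puct_score(mean_reward, prior, visits, parent_visits, c_puct=1.0):
    # Exploitation term plus a prior-weighted exploration bonus that
    # decays as a node accumulates visits.
    return mean_reward + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)

def select_state(children, c_puct=1.0):
    # Pick the child construction with the highest PUCT score.
    parent_visits = sum(c["visits"] for c in children) + 1
    return max(children, key=lambda c: puct_score(
        c["mean_reward"], c["prior"], c["visits"], parent_visits, c_puct))
```

Unvisited states with reasonable priors get large bonuses, so the buffer keeps branching from fresh constructions instead of collapsing onto the current best.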
Run script (examples/run_discover.py):
- DiscoverDataset: IterableDataset backed by PUCT buffer via HTTP.
  Calls /select_state on the Gym resource server each step to get
  dynamically selected states, generates DatumSpecs with tokenized
  prompts and parent_state metadata.
- setup_discover_data(): Wires dataset + ErdosDiscoveryEnvironment
  Ray actor, returns the (dataset, env) tuple for grpo_train().
- Follows the sliding_puzzle pattern for custom env integration.

Config (examples/configs/grpo_erdos_discover.yaml):
- entropic_adaptive_beta advantage estimator (gamma=ln(2))
- 8 groups × 64 rollouts = 512 trajectories per step
- LoRA r=32, lr=4e-5, 50 training steps
- KL penalty 0.1, importance sampling loss
- 32K context window for long code generation
- grpo_erdos_discover_debug.yaml: Qwen3-1.7B, single node, 4×8=32
  trajectories, 5 steps, 4K seq len, colocated vLLM. For testing
  the full pipeline before scaling up.

- erdos_debug.slurm: Starts Gym resource server in background on
  the same node, then runs NeMo RL GRPO training.
…aunch

- Environment supports resource_server_url="inline" which runs the
  code sandbox and reward computation directly in-process, no Gym
  server dependency needed.
- Debug config updated to use inline mode.
- SLURM script simplified: no Gym server startup, just uv run.
…ug iterations

- Use load_config + OmegaConf resolve instead of yaml.safe_load
- Match setup() and grpo_train() calling convention from sliding_puzzle
- Fix SLURM script: PATH for uv, PYTHONPATH unbound var
- Debug config uses defaults: grpo_math_1B.yaml for proper field inheritance
…nf_resolvers, add launch scripts and shim for container deployment
Covers everything we learned getting this running on the d2dfac12
B200 cluster: ray.sub patches, container compat, async fixes,
config gotchas, and the full working launch pattern.
8 nodes: 2 inference (vLLM TP=8) + 6 training (Megatron TP=4, EP=8)
Uses Dakota custom super-v3 container with NemotronH vLLM support.
No LoRA initially (Megatron backend), full fine-tune with optimizer
CPU offload + activation checkpointing.
Rewrites erdos_discovery_environment.py and run_discover.py to match
the reference implementation at github.com/test-time-training/discover:

- C5 = max(np.correlate(h, 1-h, mode="full") * dx) formulation
- Code must define run(seed, budget_s) returning (h_values, c5_bound, n_points)
- scipy/cvxpy allowed in sandbox
- State context shows parent code + improvement tracking (State.to_prompt)
- Full ErdosMinOverlapEnv.get_question() prompt with problem description
- reward = 1 / (1e-8 + c5_bound)
- Initial states: random n in [40,100], perturbed h=0.5
- Removes inline/HTTP mode split (always computes directly)
- DiscoverDataset generates diverse initial states each step
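A trivially valid program under that contract might look like the sketch below: it mirrors the perturbed-h=0.5 initial states and evaluates the stated C5 formula. The grid spacing and domain normalization here are assumptions, not taken from the reference implementation:

```python
import numpy as np

def run(seed, budget_s):
    # Baseline construction: a lightly perturbed h = 0.5 step function.
    rng = np.random.default_rng(seed)
    n_points = 64
    h = np.clip(0.5 + 0.01 * rng.standard_normal(n_points), 0.0, 1.0)
    dx = 1.0 / n_points  # assumed spacing; the real normalization may differ
    c5_bound = float(np.max(np.correlate(h, 1.0 - h, mode="full")) * dx)
    return h, c5_bound, n_points

h_values, c5_bound, n_points = run(seed=0, budget_s=1.0)
reward = 1.0 / (1e-8 + c5_bound)
```

Anything the model emits that defines a `run(seed, budget_s)` returning this triple can be scored the same way; the budget_s argument caps how long the construction search inside run() may take.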
…id rate

- Seq len 4096 -> 8192 (7168 max_new_tokens)
- wandb enabled: project=ttt-discover-erdos
- Environment logs erdos/max_reward, erdos/avg_reward, erdos/valid_rate,
  erdos/best_c5, erdos/global_best_c5 to both console and metrics dict
- Print summary line per step for easy monitoring
…n Ray actors)

signal.alarm only works in the main thread. Ray actors run in worker threads, so SIGALRM never fires and sandbox code blocks indefinitely. Now uses a ThreadPoolExecutor with a 120s timeout. Also caps run() budget_s at 60s for faster iteration.
…er step

- 10 nodes: 2 inference + 8 training (was 8 total)
- CP=2 enables 16k context (was 8k)
- 15360 max_new_tokens (was 7168)
- Save first 10 + all valid outputs per step for debugging valid rate
- ERDOS_LOG_DIR for output files
Morgane Moss added 14 commits April 2, 2026 15:23
- ErdosRefPUCTSampler: full PUCT tree with state tracking, update_states,
  record_failed_rollout, flush, sample_states (matches ttt-discover-ref)
- PUCTDiscoverDataset: calls env.puct_sample_states.remote() for prompts
  (shared state between dataset and env via Ray actor)
- Environment _sync_step updates PUCT buffer on success/failure
- RandomDiscoverDataset for validation (no PUCT)
…bmodule

Removed:
- shim/ (old copy of code, unused)
- Debug configs and launch scripts (grpo_erdos_debug_16k, launch_erdos_debug*)
- grpo_superv3.yaml (base container config, not needed in PR)
- Scratch files (example.txt, keep.py, message (3).md, erdos_debug*.slurm)
- scripts/convert_ds_to_nemorl_format.py
- test_gptoss_vllm.sh
- 3rdparty/Gym-workspace/Gym submodule (training runs inline, no Gym server needed)
@mormio mormio marked this pull request as draft April 7, 2026 19:53
@mormio mormio marked this pull request as ready for review April 7, 2026 19:57
@mormio mormio changed the title [DRAFT] Implement "learning to discover at test time" Erdos env + training code Implement "learning to discover at test time" Erdos env + training code Apr 7, 2026

@bsigala bsigala left a comment

