Changes from all commits
48 commits
46e64c6
feat: add Erdős Discovery environment + entropic adaptive-β advantage…
Mar 31, 2026
bb43f23
docs: add PUCT buffer utility, Erdős discovery README, update main RE…
Mar 31, 2026
224d76e
feat: add TTT-Discover run script and config for Erdős GRPO training
Mar 31, 2026
c2f9e75
feat: add debug config and SLURM launcher for Erdős TTT-Discover
Mar 31, 2026
a6c6246
feat: add inline mode for ErdosDiscoveryEnvironment, simplify debug l…
Mar 31, 2026
05c7e95
fix: run_discover.py config loading, setup/grpo_train signatures, deb…
Mar 31, 2026
b634fec
fix: v0.5.0 container compatibility - handle missing register_omegaco…
Apr 1, 2026
6c58bd6
fix: register mul/div OmegaConf resolvers for v0.5.0 compat
Apr 1, 2026
760fa9f
fix: sync step method for Ray actor event loop compatibility
Apr 1, 2026
7e0d70f
fix: observations must include content key for rollout engine
Apr 1, 2026
c37caa6
fix: disable CPU offload for debug (1.5B fits on GPU)
Apr 1, 2026
fb2dab5
docs: add LESSONS_LEARNED.md for Erdős TTT-Discover on NeMo RL
Apr 1, 2026
7daed16
feat: add 120B Nemotron Super launch config for Erdős TTT-Discover
Apr 1, 2026
2dd29f2
fix: inherit from grpo_superv3.yaml for correct Megatron+NemotronH co…
Apr 1, 2026
2cbb9be
fix: set async_engine false for 120B to avoid engine core crash
Apr 1, 2026
d265e72
fix: batch size 504 (8×63) divisible by DP=12
Apr 1, 2026
dcb07dc
fix: version-agnostic setup() unpacking for super-v3 container compat
Apr 1, 2026
081d5b5
fix: explicit setup→grpo_train wiring for super-v3 container (11 retu…
Apr 1, 2026
def9089
fix: reduce seq length to 4096 to avoid OOM during training
Apr 1, 2026
c3b0971
feat: port exact reference TTT-Discover env + prompts
Apr 1, 2026
eca88d8
feat: 8k seq len, wandb logging, erdos/ metrics with max reward + val…
Apr 1, 2026
a777cbc
fix: use ThreadPoolExecutor timeout instead of signal.alarm (broken i…
Apr 1, 2026
8527a86
feat: add timestamped progress logging to reward computation
Apr 1, 2026
c48826b
fix: remove stale _Timeout reference that crashed env actor on bad mo…
Apr 1, 2026
2b07c29
fix: use print() instead of logger.info() for Ray actor visibility
Apr 1, 2026
5904671
feat: prominent step-level logging with max_reward, best_C5, global_b…
Apr 1, 2026
da2ce18
fix: disable validation to prevent max_val_samples None crash at step 5
Apr 1, 2026
24e9aa0
fix: disable checkpointing (async writer crashes at step 10)
Apr 2, 2026
b9afea8
feat: scale to 10 nodes + 16k seq len (CP=2), save outputs to JSONL p…
Apr 2, 2026
a4c5ea0
fix: fully disable checkpointing with null checkpoint_must_save_by
Apr 2, 2026
8ece7e1
feat: debug config for step 10 hang repro (Qwen 1.5B, 1 node, 16k, 15…
Apr 2, 2026
745468e
fix: ensure checkpointing config exists for both v0.5.0 and super-v3 …
Apr 2, 2026
d1b79b5
fix: inject checkpointing into master_config returned by setup() (not…
Apr 2, 2026
f699f02
debug: print setup() return types to fix unpacking order
Apr 2, 2026
f6899a8
fix: correct v0.5.0 setup() unpacking order (clusters at [2], not dat…
Apr 2, 2026
f5590fd
fix: debug config back to 4k (16k OOMs on 1 node with 1.5B + LoRA)
Apr 2, 2026
7a7bd71
120B at 4k context, 8 nodes, 50 steps, checkpointing fully disabled
Apr 2, 2026
7de56f3
fix: use multiprocessing.Process + kill() for hard sandbox timeout (t…
Apr 2, 2026
ab974aa
fix: sandbox timeout 1000s matching paper, not 120s
Apr 2, 2026
1cf501f
fix: clean subprocess sandbox - BaseException for alarm, SIGTERM befo…
Apr 2, 2026
ee698b9
feat: stateful PUCT sampler integrated into RL env + dataset
Apr 3, 2026
bbed0f0
config: 8 nodes, 16k seq, CP=2, copy erdos_ref_puct_sampler to container
Apr 3, 2026
0adfa3a
cleanup: config naming, PUCT log dir, remove stale puct_buffer copy
Apr 3, 2026
67051f2
fix: point config to instruct model (no Base)
Apr 3, 2026
edbbca8
script to convert ds to nemo rl/sft format
Apr 3, 2026
3c5c16b
cleanup: remove debug configs, shim copies, scratch files, and Gym su…
Apr 7, 2026
47d4b0b
cleanup: remove unused puct_buffer.py and ray.sub.bak
Apr 7, 2026
a0de921
restore Gym submodule to match main (avoid merge conflict)
Apr 7, 2026
2 changes: 1 addition & 1 deletion 3rdparty/Gym-workspace/Gym
Submodule Gym updated 791 files
2 changes: 2 additions & 0 deletions README.md
@@ -104,6 +104,8 @@ For detailed information on backend selection, configuration, and examples, see
- ✅ **Environment Support and Isolation** - Support for multi-environment training and dependency isolation between components.
- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state).
- ✅ **Learning Algorithms** - GRPO/GSPO/DAPO, SFT(with LoRA), DPO, and On-policy distillation.
- ✅ **Advantage Estimators** - Group-relative (GRPO), multi-reward (GDPO), Reinforce++, and [Entropic Adaptive-β](nemo_rl/algorithms/entropic_advantage_estimator.py) (LOO entropic weighting from [TTT-Discover](https://arxiv.org/abs/2601.16175)).
- ✅ **PUCT Buffer** - [Tree-structured state selection](nemo_rl/utils/puct_buffer.py) for iterative optimization environments (exploration/exploitation via Upper Confidence bounds).
- ✅ **Multi-Turn RL** - Multi-turn generation and training for RL with tool use, games, etc.
- ✅ **Advanced Parallelism with DTensor** - PyTorch FSDP2, TP, CP, and SP for efficient training (through NeMo AutoModel).
- ✅ **Larger Model Support with Longer Sequences** - Performant parallelisms with Megatron Core (TP/PP/CP/SP/EP/FSDP) (through NeMo Megatron Bridge).
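The Entropic Adaptive-β bullet above describes a leave-one-out baseline with entropic reward weighting. A minimal sketch under assumed semantics (β found by bisection so that the softmax weight entropy hits a target such as ln 2; function names are hypothetical, and the actual implementation is `nemo_rl/algorithms/entropic_advantage_estimator.py`):

```python
import numpy as np

def entropic_weights(rewards, target_entropy=np.log(2), iters=50):
    """Bisect beta so that softmax(beta * r) has (approximately) the target entropy.

    Hypothetical sketch of the adaptive-beta idea; larger beta gives sharper
    weights and therefore lower entropy.
    """
    r = np.asarray(rewards, dtype=np.float64)
    r = r - r.max()  # numerical stability
    lo, hi = 0.0, 1e6
    w = np.full_like(r, 1.0 / len(r))
    for _ in range(iters):
        beta = 0.5 * (lo + hi)
        w = np.exp(beta * r)
        w /= w.sum()
        ent = -(w * np.log(w + 1e-12)).sum()
        if ent > target_entropy:
            lo = beta  # weights too flat: sharpen
        else:
            hi = beta  # weights too sharp: soften
    return w

def loo_entropic_advantages(rewards, target_entropy=np.log(2)):
    """Leave-one-out: each sample's advantage is its reward minus the
    entropically weighted mean of the other samples' rewards."""
    r = np.asarray(rewards, dtype=np.float64)
    adv = np.empty_like(r)
    for i in range(len(r)):
        others = np.delete(r, i)
        w = entropic_weights(others, target_entropy)
        adv[i] = r[i] - float(w @ others)
    return adv
```

With this weighting, high-reward rollouts in a group contribute more to the baseline than in plain GRPO mean-centering, which is the behavior the config's `gamma: 0.6931471805599453` (ln 2) would control.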
98 changes: 98 additions & 0 deletions examples/configs/grpo_erdos_discover.yaml
@@ -0,0 +1,98 @@
# TTT-Discover Erdős — Nemotron-3-Super-120B, 16k seq, 8 nodes, CP=2
defaults: "grpo_superv3.yaml"

grpo:
  num_prompts_per_step: 8
  num_generations_per_prompt: 63
  max_num_steps: 50
  max_rollout_turns: 1
  remove_constant_reward_groups: true
  val_period: 0
  val_at_start: false
  val_at_end: false
  adv_estimator:
    name: entropic_adaptive_beta
    gamma: 0.6931471805599453  # ln(2)

loss_fn:
  kl_penalty_coef: 0.1
  ratio_clip: 0.2
  token_level_loss: false

policy:
  model_name: "/home/shared/models/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"
  tokenizer:
    name: "/home/shared/models/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"
    chat_template_kwargs: null
  max_total_sequence_length: 16384
  train_global_batch_size: 504
  train_micro_batch_size: 1
  logprob_batch_size: 1

  generation:
    colocated:
      enabled: false
      resources:
        num_nodes: 2
        gpus_per_node: 8
    max_new_tokens: 15360
    vllm_cfg:
      async_engine: false
      tensor_parallel_size: 8
      gpu_memory_utilization: 0.85
      max_model_len: 16384

  megatron_cfg:
    tensor_model_parallel_size: 4
    pipeline_model_parallel_size: 1
    context_parallel_size: 2
    expert_model_parallel_size: 8
    sequence_parallel: true
    activation_checkpointing: true
    empty_unused_memory_level: 2
    optimizer_cpu_offload: true
    optimizer:
      optimizer_cpu_offload: true
      optimizer_offload_fraction: 1.0

  dynamic_batching:
    enabled: false

  lora_cfg:
    enabled: false

  optimizer:
    lr: 4.0e-5

data:
  shuffle: false
  max_input_seq_length: 16384

env:
  erdos_discovery:
    num_initial_states: 8  # matches num_prompts_per_step
    puct_seed_batch_size: 8  # matches num_prompts_per_step
    sandbox_timeout: 1000
    should_use_nemo_gym: false

cluster:
  gpus_per_node: 8
  num_nodes: 8

logger:
  log_dir: "results/erdos-120b-16k"
  wandb_enabled: true
  wandb:
    project: "ttt-discover-erdos"
    name: "nemotron-120b-16k-8node-puct"
  tensorboard_enabled: false
  mlflow_enabled: false
  swanlab_enabled: false

checkpointing:
  enabled: false
  checkpoint_dir: "results/erdos-120b-16k"
  save_period: 999999
  checkpoint_must_save_by: null
  model_save_format: "safetensors"
  save_consolidated: false
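The `num_initial_states` and `puct_seed_batch_size` knobs above feed the stateful PUCT sampler added in commit ee698b9, which selects which previously seen states to re-expand each step. A minimal sketch of PUCT-style selection with hypothetical field names (the real implementation is `nemo_rl/utils/puct_buffer.py`):

```python
import math

class PuctBuffer:
    """Pick the next state to expand, trading off exploitation (mean reward)
    against exploration (an upper-confidence bonus on visit counts)."""

    def __init__(self, c_puct=1.0):
        self.c_puct = c_puct
        self.entries = []  # each: {"state", "prior", "visits", "value_sum"}

    def add(self, state, prior=1.0):
        self.entries.append({"state": state, "prior": prior,
                             "visits": 0, "value_sum": 0.0})

    def select(self):
        total = sum(e["visits"] for e in self.entries) + 1

        def score(e):
            # Q-term: mean reward observed from expanding this state.
            q = e["value_sum"] / e["visits"] if e["visits"] else 0.0
            # U-term: PUCT exploration bonus, large for rarely visited states.
            u = self.c_puct * e["prior"] * math.sqrt(total) / (1 + e["visits"])
            return q + u

        return max(self.entries, key=score)

    def update(self, entry, reward):
        entry["visits"] += 1
        entry["value_sum"] += reward
```

In this config the buffer would be seeded with 8 initial states and queried 8 times per step to supply prompts, with rollout rewards fed back through `update` so promising constructions are revisited more often.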