Expose hardcoded Megatron infrastructure parameters to user config#2230

Open
nic-nvidia wants to merge 2 commits into NVIDIA-NeMo:main from nic-nvidia:expose-infra-params

Conversation

@nic-nvidia

Summary

  • Read checkpoint, timeout, and diagnostic settings from megatron_cfg with backward-compatible defaults instead of hardcoding them in setup.py
  • New optional megatron_cfg fields: async_save, fully_parallel_save, fully_parallel_load, load_rng, distributed_timeout_minutes, logging_level
  • New optional distributed_data_parallel_config field: check_for_nan_in_grad
  • All defaults match the existing hardcoded values — no behavior change without explicit config
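The backward-compatible read pattern described above can be sketched as follows. This is a minimal illustration only: the field names come from this PR, but the helper name `read_checkpoint_settings` and the specific default values are placeholders, not the repository's actual code or hardcoded values.

```python
# Illustrative sketch: read optional megatron_cfg fields with
# backward-compatible fallbacks. The default values below are
# placeholders, not necessarily the ones hardcoded in setup.py.
def read_checkpoint_settings(config: dict) -> dict:
    megatron_cfg = config.get("megatron_cfg", {})
    ddp_cfg = config.get("distributed_data_parallel_config", {})
    return {
        "async_save": megatron_cfg.get("async_save", False),
        "fully_parallel_save": megatron_cfg.get("fully_parallel_save", True),
        "fully_parallel_load": megatron_cfg.get("fully_parallel_load", False),
        "load_rng": megatron_cfg.get("load_rng", True),
        "distributed_timeout_minutes": megatron_cfg.get("distributed_timeout_minutes", 30),
        "logging_level": megatron_cfg.get("logging_level", "INFO"),
        "check_for_nan_in_grad": ddp_cfg.get("check_for_nan_in_grad", True),
    }
```

With an empty config every field falls back to its default, so existing setups see no behavior change; any field the user sets in config overrides the fallback.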

Closes #2229

Test plan

  • Existing test_basic_checkpoint_config passes (backward compat, no config arg)
  • New test_checkpoint_config_overrides validates all 4 checkpoint fields
  • CI passes with no config changes (defaults preserved)

Read checkpoint, timeout, and diagnostic settings from megatron_cfg
with backward-compatible defaults instead of hardcoding them.

New megatron_cfg fields (all optional, existing defaults preserved):
  - async_save, fully_parallel_save, fully_parallel_load, load_rng
  - distributed_timeout_minutes
  - logging_level

New distributed_data_parallel_config field:
  - check_for_nan_in_grad

Closes NVIDIA-NeMo#2229
@nic-nvidia nic-nvidia requested review from a team as code owners April 8, 2026 06:03
@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@yuki-97 yuki-97 left a comment


Hi @nic-nvidia, thanks for the enhancement! LGTM except for where the default config values live. Could you help update that?

Also @yaoyu-33 @cuichenx, could you help check whether the params in this PR are well supported in MBridge?

```diff
 # Step 1: Setup distributed
-setup_distributed()
+setup_distributed(
+    timeout_minutes=config.get("megatron_cfg", {}).get("distributed_timeout_minutes"),
```
Contributor


We encourage setting default values in config.yaml instead of in code, so that people can see what features exist and their default behavior without looking into the code.

Can you help to:

  1. Update to the suggestion below, and do the same for the other configs
  2. Add the param (set to the default value) to these base configs? (other configs will inherit from the base ones, so they don't need changes)
    1. examples/configs/distillation_math.yaml
    2. examples/configs/dpo.yaml
    3. examples/configs/grpo_math_1B.yaml
    4. examples/configs/rm.yaml
    5. examples/configs/sft.yaml
    6. examples/nemo_gym/grpo_nanov3.yaml
    7. examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml
    8. research/template_project/configs/grpo_math_1B.yaml
Suggested change
```diff
-timeout_minutes=config.get("megatron_cfg", {}).get("distributed_timeout_minutes"),
+timeout_minutes=config["megatron_cfg"]["distributed_timeout_minutes"],
```
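The difference between the two access styles in the suggestion can be illustrated with a small sketch (the config dicts below are hypothetical, not the repository's actual config): chained `.get()` silently falls back to `None` when a key is absent, while direct indexing raises a `KeyError`, which is what makes "defaults live in config.yaml" enforceable.

```python
# Hypothetical configs: one where config.yaml supplies the default,
# one where the key was never added.
config_with_default = {"megatron_cfg": {"distributed_timeout_minutes": 30}}
config_missing_key = {"megatron_cfg": {}}

# Chained .get(): silently yields None when the key is absent.
assert config_missing_key.get("megatron_cfg", {}).get("distributed_timeout_minutes") is None

# Direct indexing: works when config.yaml supplies the value...
assert config_with_default["megatron_cfg"]["distributed_timeout_minutes"] == 30

# ...and fails loudly when it does not, pointing users at the config file.
try:
    config_missing_key["megatron_cfg"]["distributed_timeout_minutes"]
except KeyError:
    print("missing default surfaces as KeyError")
```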

@yuki-97 yuki-97 requested review from cuichenx and yaoyu-33 April 11, 2026 04:28
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Apr 11, 2026