E2E heterogenous non colocated MiMo training by yashaswikarnati · Pull Request #5602 · NVIDIA/Megatron-LM

yashaswikarnati · 2026-07-01T16:04:32Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do?

Factor the post-build distributed lifecycle (TP attrs, GPU placement, mixed-precision wrap, meta materialize, DDP/FSDP wrap) out of unimodal_build_distributed_models into a reusable function callable on already-built chunks, for per-submodule reuse by the MIMO builder. Byte-identical for stock callers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

…sjoint grids

Four default-off, stock-byte-identical changes so a non-language rank under skip_model_parallel_init never touches an uninitialized global mpu:

- train(): only default finalize_model_grads_func when unset (keep builder hook)
- setup_model_and_optimizer(): dp world-size falls back to args when mpu uninit
- setup_model_and_optimizer(): thread the model's pg_collection groups + rng prefix into load_checkpoint (mirrors the existing save path)
- save_checkpoint_and_time(): pass dp_group to report_memory

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
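The first two changes in this commit follow a common guard pattern: keep a value that is already set, and consult global state only when it exists. The sketch below illustrates that pattern in plain Python. It is not the code from training.py; the helper names (default_if_unset, dp_world_size, builder_hook, stock_hook, uninitialized_mpu) are hypothetical, and the args field data_parallel_size is assumed.

```python
from types import SimpleNamespace
from typing import Callable


def default_if_unset(config: object, name: str, default: Callable) -> Callable:
    """Install `default` on `config.<name>` only when nothing is set yet."""
    if getattr(config, name, None) is None:
        setattr(config, name, default)
    return getattr(config, name)


def dp_world_size(mpu_initialized: bool, mpu_query: Callable[[], int], args) -> int:
    """Query global parallel state only when it exists; otherwise use parsed args."""
    if mpu_initialized:
        return mpu_query()
    # Assumed field name; stands in for whatever the parsed args carry.
    return args.data_parallel_size


def builder_hook(*_):
    """Hypothetical hook installed earlier by a model builder."""
    return "builder"


def stock_hook(*_):
    """Hypothetical stand-in for the stock default."""
    return "stock"


def uninitialized_mpu() -> int:
    """Mimics a global query that asserts when state was never initialized."""
    raise AssertionError("model parallel state is not initialized")


# A hook set by the builder survives; an unset hook receives the stock default.
with_hook = SimpleNamespace(finalize_model_grads_func=builder_hook)
without_hook = SimpleNamespace(finalize_model_grads_func=None)
kept = default_if_unset(with_hook, "finalize_model_grads_func", stock_hook)
filled = default_if_unset(without_hook, "finalize_model_grads_func", stock_hook)

# With global state uninitialized, the args value is used and the query never runs.
fallback = dp_world_size(False, uninitialized_mpu, SimpleNamespace(data_parallel_size=4))
print(kept.__name__, filled.__name__, fallback)
```

The point of the guard is that the asserting query is never reached on ranks where the global state was skipped, while stock runs take the same branch as before.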

…t pick Rename the schedule_pg_collection carrier param to pg_collection on pretrain/train/train_step and collapse train_step's dual params to one carrier (reductions now derive per-rank groups from the model). Remove the language-else-first heuristic: LM-global concerns use the language collection or None; per-module seeding is the builder's job. schedules.py untouched; stock is byte-identical (carrier stays None). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

… bug Delete _get_pg_collection_for_optimizer, which queried expert groups on the default grid view and raised on dense (encoder) grids; read each active module's pg_collection off the built model instead. Add the base-view intra_dist_opt group in topology, and make MimoOptimizer.count_zeros world-consistent (per-module all_reduce MAX then sum) so num_zeros agrees across disjoint grids. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
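The "per-module all_reduce MAX then sum" rule for count_zeros can be illustrated without a distributed runtime. The sketch below simulates the world as a list of per-rank dictionaries and replaces the collective with a plain max over ranks. It assumes each module's count is already consistent within its own grid and that ranks outside a module's grid contribute 0; the function name world_consistent_num_zeros and the module names are hypothetical, and this is not the MimoOptimizer implementation.

```python
from typing import Dict, List


def world_consistent_num_zeros(per_rank_counts: List[Dict[str, int]]) -> int:
    """Combine per-rank, per-module zero counts into one world-consistent total.

    Each dict maps module name -> zero-gradient count as seen by one rank.
    A rank that does not host a module omits it, which is treated as 0.
    """
    module_names = sorted({name for counts in per_rank_counts for name in counts})
    total = 0
    for name in module_names:
        # Stand-in for an all_reduce(MAX) across the world: ranks off this
        # module's grid contribute 0, so MAX recovers the owning grid's count.
        total += max(counts.get(name, 0) for counts in per_rank_counts)
    return total


# Four simulated ranks: two on an encoder grid, two on a language grid.
ranks = [
    {"encoder": 7},
    {"encoder": 7},
    {"language": 12},
    {"language": 12},
]
print(world_consistent_num_zeros(ranks))  # 19
```

Taking MAX per module before summing avoids double counting replicas within a grid, and every rank arrives at the same total even though no rank hosts every module.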

configure_module_rng now delegates to the stock _set_random_seed with explicit per-module groups (adds data_parallel_random_init); wrap_active_modules_with_ddp delegates to prepare_existing_model_chunks_for_distributed_training per active submodule (language Float16Module; encoders _EncoderFloat16Module with overlap off), dropping the hand-rolled DDP/Float16/bucket/stream code. Function names unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

New examples/mimo/training/builder.py: MimoBuildConfig(ModelConfig) carries the live topology/args as non-serialized underscore fields; MimoModelBuilder builds the rank-local MimoModel, seeds the active role before build, wraps it via the shared distributed lifecycle, installs grad-sync, and sets model.pg_collection + rng_state_key_prefix (read by the training loop for checkpoint save/load, per-module reductions, and optimizer construction). Thread data_parallel_random_init into the DDP wrap so DP replicas broadcast-sync when the flag is set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

New examples/mimo/training/data.py: per-role external DataLoaders for non-colocated MIMO (language at PP first/last stage, encoder at PP first stage, others None); encoder micro-batch = mbs*llm_dp//encoder_dp; per-role/DP/split seeding; provider tagged is_distributed. Text-token pool excludes image_token_id and pad_token_id (collision fix); labels remap image + pad to -100. Dense and dynamic-resolution (packed thd) encoder inputs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
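The arithmetic and token handling described in this commit message can be sketched in a few lines. The formula mbs*llm_dp//encoder_dp and the -100 label value come from the message; the function names, the small vocabulary, and the token ids below are illustrative assumptions, not the contents of data.py.

```python
from typing import List

IGNORE_INDEX = -100  # label value the commit message remaps image and pad tokens to


def encoder_micro_batch_size(mbs: int, llm_dp: int, encoder_dp: int) -> int:
    """Encoder micro-batch size as stated in the message: mbs * llm_dp // encoder_dp."""
    return mbs * llm_dp // encoder_dp


def text_token_pool(vocab_size: int, image_token_id: int, pad_token_id: int) -> List[int]:
    """Token ids that mock text may draw from, excluding the two reserved ids."""
    reserved = {image_token_id, pad_token_id}
    return [tok for tok in range(vocab_size) if tok not in reserved]


def remap_labels(tokens: List[int], image_token_id: int, pad_token_id: int) -> List[int]:
    """Replace image and pad positions with IGNORE_INDEX so they carry no loss."""
    reserved = (image_token_id, pad_token_id)
    return [IGNORE_INDEX if tok in reserved else tok for tok in tokens]


# Illustrative values only: language DP of 4, encoder DP of 2, micro-batch size 2.
print(encoder_micro_batch_size(mbs=2, llm_dp=4, encoder_dp=2))       # 4
print(text_token_pool(vocab_size=8, image_token_id=5, pad_token_id=0))  # [1, 2, 3, 4, 6, 7]
print(remap_labels([5, 5, 3, 4, 0], image_token_id=5, pad_token_id=0))  # [-100, -100, 3, 4, -100]
```

Excluding the reserved ids from the text pool is what prevents a randomly drawn text token from being mistaken for an image placeholder or padding.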

New examples/mimo/pretrain_mimo.py wires the topology, cross-grid P2P communicator, role-aware mock data, and MimoBuildConfig into the stock pretrain() loop (model_provider=None, skip_model_parallel_init=True, pg_collection=topology.schedule_pg_collection); validates the disjoint grids by standing the language grid in for stock validate_args. Adds the 8-GPU 20L nemotron VLM mock launcher matching the proven cog config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

…s seeding) initialize_megatron gains a skip_random_seed flag; pretrain sets it for a MultiModuleProcessGroupCollection so the per-module builder is the sole seeder. Fixes an encoder-rank crash: _set_random_seed with a None pp_group falls back to mpu.get_pipeline_model_parallel_rank(), which asserts under skip_model_parallel_init. Removes the language-vs-first seed resolution entirely. Byte-identical for stock (skip_random_seed=False, groups from mpu as before). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

…ollection The distributed optimizer requires pg_collection.intra_dist_opt; production threads it via pg_collection_from_grid into the submodule spec. The test helper get_pg_collection omitted it, so use_distributed_optimizer tests raised. Add mp/tp_ep_pp/intra_dist_opt (create the base intra_dist_opt group collectively) and drop the dead groups from the removed _get_pg_collection_for_optimizer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

test_builder_seeds asserts the wrap call including the data_parallel_random_init arg the builder now threads; test_active_module_is_ddp uses bf16 (production precision) so the encoder wrap resolves config via Float16Module (a bare fp32 modality container exposes no config to get_model_config). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

The Nemotron6 hybrid model has Mamba layers; the launcher must use uv run --extra ssm so mamba-ssm is available (matching the proven cog e2e config). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

Match the proven cog e2e config (--tee 3) so every rank's output is captured, not just rank 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

…joint grids Under skip_model_parallel_init the mpu is uninitialized, so ProcessGroupCollection.use_mpu_process_groups() asserts; only call it when model-parallel is initialized (None otherwise). Gate the builder path on having a model config rather than a non-None pg_collection, so a disjoint-grid run stays on the builder (its per-module builder ignores the passed collection). Byte-identical for stock (mpu initialized, pg_collection always non-None). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

…action Restore the pre-existing import blank lines and the two build_virtual_pipeline_stages call sites to match main; the extraction diff now shows only the new function and the delegation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

…ly diff) Regenerate dist_utils.py = main + only the extraction (new prepare_existing_model_chunks_for_distributed_training + the unimodal delegation), so the unchanged helper functions keep main's exact formatting instead of the whole-file black reflow from the initial commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>

copy-pr-bot · 2026-07-01T16:04:37Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-07-01T16:04:43Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

yashaswikarnati · 2026-07-01T16:39:23Z

+
+    builder: ClassVar[str] = "examples.mimo.training.builder.MimoModelBuilder"
+    _topology: Optional[HeteroTopology] = field(default=None)
+    _args: Optional[argparse.Namespace] = field(default=None)


Why is args part of the build config?

yashaswikarnati · 2026-07-01T16:40:11Z

+    _args: Optional[argparse.Namespace] = field(default=None)
+
+
+def _encoder_module_name(topology: HeteroTopology) -> Optional[str]:


How would this generalize to multiple encoders?

yashaswikarnati · 2026-07-01T16:40:49Z

+    """
+    if wrap_with_ddp and not ddp_config:
+        raise ValueError("ddp_config is required when wrap_with_ddp is True")
+    if transformer_config.init_model_with_meta_device != built_with_meta_device:


Why do we need both init_model_with_meta_device and built_with_meta_device?

yashaswikarnati · 2026-07-01T16:41:18Z


    # Materialize tensors on meta device (GPU allocation) if not using FSDP2 and not using Megatron FSDP.
-    if init_model_with_meta_device and not use_torch_fsdp2 and not use_megatron_fsdp:
+    if built_with_meta_device and not use_torch_fsdp2 and not use_megatron_fsdp:


Why rename init_model_with_meta_device to built_with_meta_device?

yashaswikarnati · 2026-07-01T16:54:35Z

+    """Register model-provider, heterogeneous-grid, and mock-data arguments."""
+    parser = add_model_provider_args(parser)
+    parser = add_hetero_grid_args(parser)
+    data = parser.add_argument_group("mimo mock data")


Why not maintain the mock data args separately, in the mock data file?

yashaswikarnati · 2026-07-01T16:56:24Z

+        language_config = language_model_spec(args, None, language_grid).params["config"]
+        communicator = MultiModulePipelineCommunicator(
+            topology.grids,
+            {RADIO_ENCODER_MODULE_NAME: [MIMO_LANGUAGE_MODULE_KEY], MIMO_LANGUAGE_MODULE_KEY: []},


This is very specific to the current radio encoder and language model. We want this to be extensible to different models, since this is the common entry point for pretrain mimo.
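One possible direction for this comment, sketched under assumptions: derive the module-to-module mapping from a list of encoder names instead of hard-coding a single pair. The helper name encoder_to_language_edges and the string keys below are hypothetical placeholders. Only the shape of the resulting dictionary (each encoder feeding the language module, which has no downstream module) is taken from the snippet above.

```python
from typing import Dict, List


def encoder_to_language_edges(encoder_names: List[str], language_key: str) -> Dict[str, List[str]]:
    """Build a mapping where every encoder feeds the language module.

    The language module maps to an empty list because nothing consumes its output.
    """
    if language_key in encoder_names:
        raise ValueError("language key must not also be listed as an encoder")
    edges: Dict[str, List[str]] = {name: [language_key] for name in encoder_names}
    edges[language_key] = []
    return edges


# Placeholder names; the real module keys are constants defined elsewhere in the PR.
single = encoder_to_language_edges(["radio_encoder"], "language")
multi = encoder_to_language_edges(["radio_encoder", "audio_encoder"], "language")
print(single)
print(multi)
```

With one encoder this reproduces the two-entry shape passed to the communicator above; with several, each encoder gets its own entry without changes to the entry point.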

yashaswikarnati · 2026-07-01T17:02:35Z

+
+        wrap_active_modules_with_ddp(args, mimo_model, topology, data_parallel_random_init)
+        configure_grad_sync(args, mimo_model, topology)
+        # Load-bearing contract read by training.py for checkpoint save/load, per-module


Remove the comment "# Load-bearing contract read by training.py for checkpoint save/load, per-module reductions, and optimizer construction (see Increments 2 and 4)."

… grids

Running evaluate() cross-grid (skip_model_parallel_init) surfaced two spots that train_step already guards but evaluate did not:

- the modelopt distillation shape-adjust reads global parallel_state; gate it on modelopt_enabled, matching train_step.
- the loss reduction indexes loss_dicts[0]; skip when empty (encoder ranks produce no loss), matching train_step's 'and losses_reduced' guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
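The second guard can be shown in isolation: a reduction that reads loss_dicts[0] must first check that the list is non-empty, because encoder ranks produce no loss. The sketch below is a minimal stand-in, not the evaluate() code. The function name reduce_losses, the "lm loss" key, and the choice of a plain mean over micro-batches are assumptions for illustration.

```python
from typing import Dict, List


def reduce_losses(loss_dicts: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each loss key over micro-batches, tolerating ranks with no losses."""
    if not loss_dicts:
        # Ranks that produce no loss have nothing to index into, so return early
        # instead of reading loss_dicts[0].
        return {}
    keys = loss_dicts[0].keys()
    return {key: sum(d[key] for d in loss_dicts) / len(loss_dicts) for key in keys}


print(reduce_losses([]))                                    # {}
print(reduce_losses([{"lm loss": 2.0}, {"lm loss": 4.0}]))  # {'lm loss': 3.0}
```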

yashaswikarnati · 2026-07-02T01:47:11Z

/ok to test 3af3084

svcnvidia-nemo-ci · 2026-07-02T03:44:23Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28563811161

svcnvidia-nemo-ci · 2026-07-02T06:21:30Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28569887360

svcnvidia-nemo-ci · 2026-07-02T13:50:19Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28595216236

yashaswikarnati and others added 16 commits June 30, 2026 20:35

mimo: tee all ranks in the e2e launcher for per-rank diagnosability

7d8d15e

yashaswikarnati requested review from a team as code owners July 1, 2026 16:04

svcnvidia-nemo-ci marked this pull request as draft July 1, 2026 16:04

yashaswikarnati enabled auto-merge July 2, 2026 01:58

yashaswikarnati added this pull request to the merge queue Jul 2, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 2, 2026

yashaswikarnati added this pull request to the merge queue Jul 2, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 2, 2026

yashaswikarnati added this pull request to the merge queue Jul 2, 2026

Merged via the queue into NVIDIA:main with commit 0522099 Jul 2, 2026
87 checks passed

yashaswikarnati deleted the ykarnati/mimo-noncolocated-e2e branch July 2, 2026 14:39

5 participants
