E2E heterogenous non colocated MiMo training#5602

Merged
yashaswikarnati merged 35 commits into NVIDIA:main from yashaswikarnati:ykarnati/mimo-noncolocated-e2e
Jul 2, 2026

Conversation

@yashaswikarnati

@yashaswikarnati yashaswikarnati commented Jul 1, 2026

Contributor
  • I, the PR author, have personally reviewed every line of this PR.

What does this PR do?

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

yashaswikarnati and others added 16 commits June 30, 2026 20:35
Factor the post-build distributed lifecycle (TP attrs, GPU placement,
mixed-precision wrap, meta materialize, DDP/FSDP wrap) out of
unimodal_build_distributed_models into a reusable function callable on
already-built chunks, for per-submodule reuse by the MIMO builder.
Byte-identical for stock callers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
…sjoint grids

Four default-off, stock-byte-identical changes so a non-language rank under
skip_model_parallel_init never touches an uninitialized global mpu:
- train(): only default finalize_model_grads_func when unset (keep builder hook)
- setup_model_and_optimizer(): dp world-size falls back to args when mpu uninit
- setup_model_and_optimizer(): thread the model's pg_collection groups + rng
  prefix into load_checkpoint (mirrors the existing save path)
- save_checkpoint_and_time(): pass dp_group to report_memory

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
…t pick

Rename the schedule_pg_collection carrier param to pg_collection on
pretrain/train/train_step and collapse train_step's dual params to one carrier
(reductions now derive per-rank groups from the model). Remove the
language-else-first heuristic: LM-global concerns use the language collection or
None; per-module seeding is the builder's job. schedules.py untouched; stock is
byte-identical (carrier stays None).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
… bug

Delete _get_pg_collection_for_optimizer, which queried expert groups on the
default grid view and raised on dense (encoder) grids; read each active module's
pg_collection off the built model instead. Add the base-view intra_dist_opt
group in topology, and make MimoOptimizer.count_zeros world-consistent
(per-module all_reduce MAX then sum) so num_zeros agrees across disjoint grids.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
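
The count_zeros change described in the commit above (per-module all_reduce MAX, then sum) can be illustrated without a distributed backend. The following is a minimal sketch in plain Python, not the actual MimoOptimizer code: the function name world_consistent_num_zeros and the rank layout are hypothetical. It assumes each rank reports a count only for the module it hosts (0 for the others) and that ranks hosting the same module already agree on that module's count.

```python
from typing import Dict, List


def world_consistent_num_zeros(per_rank_counts: List[Dict[str, int]]) -> List[int]:
    """Simulate a per-module all_reduce(MAX) followed by a local sum.

    per_rank_counts[r][m] is the zero-gradient count rank r holds for
    module m; ranks that do not host m are treated as contributing 0.
    """
    modules = sorted({name for counts in per_rank_counts for name in counts})
    # all_reduce(MAX) over the whole world: one agreed value per module.
    reduced = {m: max(counts.get(m, 0) for counts in per_rank_counts) for m in modules}
    # Every rank sums the same reduced values, so every rank gets the same total.
    total = sum(reduced.values())
    return [total for _ in per_rank_counts]


# Hypothetical layout: ranks 0-1 host the encoder, ranks 2-3 host the language model.
counts = [
    {"encoder": 3},
    {"encoder": 3},
    {"language": 5},
    {"language": 5},
]
result = world_consistent_num_zeros(counts)
```

Under these assumptions, MAX propagates each module's count to ranks that do not host it without counting replicas twice, which a SUM over the world would do.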
configure_module_rng now delegates to the stock _set_random_seed with explicit
per-module groups (adds data_parallel_random_init); wrap_active_modules_with_ddp
delegates to prepare_existing_model_chunks_for_distributed_training per active
submodule (language Float16Module; encoders _EncoderFloat16Module with overlap
off), dropping the hand-rolled DDP/Float16/bucket/stream code. Function names
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
New examples/mimo/training/builder.py: MimoBuildConfig(ModelConfig) carries the
live topology/args as non-serialized underscore fields; MimoModelBuilder builds
the rank-local MimoModel, seeds the active role before build, wraps it via the
shared distributed lifecycle, installs grad-sync, and sets model.pg_collection +
rng_state_key_prefix (read by the training loop for checkpoint save/load,
per-module reductions, and optimizer construction). Thread data_parallel_random_init
into the DDP wrap so DP replicas broadcast-sync when the flag is set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
New examples/mimo/training/data.py: per-role external DataLoaders for
non-colocated MIMO (language at PP first/last stage, encoder at PP first stage,
others None); encoder micro-batch = mbs*llm_dp//encoder_dp; per-role/DP/split
seeding; provider tagged is_distributed. Text-token pool excludes image_token_id
and pad_token_id (collision fix); labels remap image + pad to -100. Dense and
dynamic-resolution (packed thd) encoder inputs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
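
The data rules in the commit above (encoder micro-batch = mbs*llm_dp//encoder_dp, a text-token pool that excludes image_token_id and pad_token_id, and labels that remap image and pad positions to -100) can be sketched as follows. This is an illustrative sketch only, not the contents of examples/mimo/training/data.py: all function names and the example values are hypothetical, and the next-token shift of real labels is deliberately omitted.

```python
import random
from typing import List

IGNORE_INDEX = -100  # value the commit message says image and pad labels map to


def encoder_micro_batch_size(mbs: int, llm_dp: int, encoder_dp: int) -> int:
    """Encoder micro-batch as stated in the commit: mbs * llm_dp // encoder_dp."""
    return mbs * llm_dp // encoder_dp


def text_token_pool(vocab_size: int, image_token_id: int, pad_token_id: int) -> List[int]:
    """Token ids mock text may draw from, excluding the two special ids."""
    return [t for t in range(vocab_size) if t not in (image_token_id, pad_token_id)]


def remap_labels(tokens: List[int], image_token_id: int, pad_token_id: int) -> List[int]:
    """Replace image and pad positions with IGNORE_INDEX so they carry no loss."""
    return [IGNORE_INDEX if t in (image_token_id, pad_token_id) else t for t in tokens]


# Hypothetical values, chosen only for illustration.
enc_mbs = encoder_micro_batch_size(mbs=2, llm_dp=4, encoder_dp=2)
pool = text_token_pool(vocab_size=10, image_token_id=7, pad_token_id=0)
rng = random.Random(0)
text = [rng.choice(pool) for _ in range(4)]
labels = remap_labels([7, 7] + text + [0], image_token_id=7, pad_token_id=0)
```

Because the pool never contains the image or pad id, a randomly drawn text token cannot be mistaken for a placeholder and masked out of the loss, which appears to be the collision the commit refers to.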
New examples/mimo/pretrain_mimo.py wires the topology, cross-grid P2P
communicator, role-aware mock data, and MimoBuildConfig into the stock
pretrain() loop (model_provider=None, skip_model_parallel_init=True,
pg_collection=topology.schedule_pg_collection); validates the disjoint grids by
standing the language grid in for stock validate_args. Adds the 8-GPU 20L
nemotron VLM mock launcher matching the proven cog config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
…s seeding)

initialize_megatron gains a skip_random_seed flag; pretrain sets it for a
MultiModuleProcessGroupCollection so the per-module builder is the sole seeder.
Fixes an encoder-rank crash: _set_random_seed with a None pp_group falls back to
mpu.get_pipeline_model_parallel_rank(), which asserts under
skip_model_parallel_init. Removes the language-vs-first seed resolution entirely.
Byte-identical for stock (skip_random_seed=False, groups from mpu as before).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
…ollection

The distributed optimizer requires pg_collection.intra_dist_opt; production
threads it via pg_collection_from_grid into the submodule spec. The test helper
get_pg_collection omitted it, so use_distributed_optimizer tests raised. Add
mp/tp_ep_pp/intra_dist_opt (create the base intra_dist_opt group collectively)
and drop the dead groups from the removed _get_pg_collection_for_optimizer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
test_builder_seeds asserts the wrap call including the data_parallel_random_init
arg the builder now threads; test_active_module_is_ddp uses bf16 (production
precision) so the encoder wrap resolves config via Float16Module (a bare fp32
modality container exposes no config to get_model_config).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
The Nemotron6 hybrid model has Mamba layers; the launcher must use uv run --extra
ssm so mamba-ssm is available (matching the proven cog e2e config).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Match the proven cog e2e config (--tee 3) so every rank's output is captured,
not just rank 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
…joint grids

Under skip_model_parallel_init the mpu is uninitialized, so
ProcessGroupCollection.use_mpu_process_groups() asserts; only call it when
model-parallel is initialized (None otherwise). Gate the builder path on having a
model config rather than a non-None pg_collection, so a disjoint-grid run stays
on the builder (its per-module builder ignores the passed collection).
Byte-identical for stock (mpu initialized, pg_collection always non-None).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
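
The guard described in the commit above (only build the mpu-backed collection when model-parallel state is initialized, otherwise use None) follows a simple pattern. The sketch below uses hypothetical stand-in callables rather than the real Megatron APIs, so it shows the control flow only.

```python
from typing import Callable, Optional


def resolve_pg_collection(
    is_initialized: Callable[[], bool],
    build_from_mpu: Callable[[], object],
) -> Optional[object]:
    """Return the mpu-backed collection only if mpu is initialized, else None."""
    if is_initialized():
        return build_from_mpu()
    return None


def _asserting_builder() -> object:
    # Stand-in for the reported behaviour: building from an uninitialized mpu asserts.
    raise AssertionError("mpu is not initialized")


# Disjoint-grid run: mpu is uninitialized, so the builder must never be called.
disjoint = resolve_pg_collection(lambda: False, _asserting_builder)
# Stock run: mpu is initialized, so the collection is built as before.
stock = resolve_pg_collection(lambda: True, lambda: "mpu-groups")
```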
…action

Restore the pre-existing import blank lines and the two build_virtual_pipeline_stages
call sites to match main; the extraction diff now shows only the new function and
the delegation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
…ly diff)

Regenerate dist_utils.py = main + only the extraction (new
prepare_existing_model_chunks_for_distributed_training + the unimodal delegation),
so the unchanged helper functions keep main's exact formatting instead of the
whole-file black reflow from the initial commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
@yashaswikarnati yashaswikarnati requested review from a team as code owners July 1, 2026 16:04
@copy-pr-bot

copy-pr-bot Bot commented Jul 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft July 1, 2026 16:04
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

Comment thread examples/mimo/training/builder.py Outdated

builder: ClassVar[str] = "examples.mimo.training.builder.MimoModelBuilder"
_topology: Optional[HeteroTopology] = field(default=None)
_args: Optional[argparse.Namespace] = field(default=None)

Contributor Author

Why is args part of the build config?

Comment thread examples/mimo/training/builder.py Outdated
_args: Optional[argparse.Namespace] = field(default=None)


def _encoder_module_name(topology: HeteroTopology) -> Optional[str]:

Contributor Author

How would this generalize to multiple encoders?

Comment thread megatron/training/models/dist_utils.py Outdated
    """
    if wrap_with_ddp and not ddp_config:
        raise ValueError("ddp_config is required when wrap_with_ddp is True")
    if transformer_config.init_model_with_meta_device != built_with_meta_device:

Contributor Author

Why do we need both init_model_with_meta_device and built_with_meta_device?

Comment thread megatron/training/models/dist_utils.py Outdated

  # Materialize tensors on meta device (GPU allocation) if not using FSDP2 and not using Megatron FSDP.
- if init_model_with_meta_device and not use_torch_fsdp2 and not use_megatron_fsdp:
+ if built_with_meta_device and not use_torch_fsdp2 and not use_megatron_fsdp:

Contributor Author

Why rename init_model_with_meta_device to built_with_meta_device?

Comment thread examples/mimo/pretrain_mimo.py Outdated
"""Register model-provider, heterogeneous-grid, and mock-data arguments."""
parser = add_model_provider_args(parser)
parser = add_hetero_grid_args(parser)
data = parser.add_argument_group("mimo mock data")

Contributor Author

Why not maintain the mock data args separately, in the mock data file?

Comment thread examples/mimo/pretrain_mimo.py Outdated
language_config = language_model_spec(args, None, language_grid).params["config"]
communicator = MultiModulePipelineCommunicator(
    topology.grids,
    {RADIO_ENCODER_MODULE_NAME: [MIMO_LANGUAGE_MODULE_KEY], MIMO_LANGUAGE_MODULE_KEY: []},

Contributor Author

This is very specific to the current radio encoder and language model. We want this to extend to different models, since this is the common entry point for MIMO pretraining.

Comment thread examples/mimo/training/builder.py Outdated

wrap_active_modules_with_ddp(args, mimo_model, topology, data_parallel_random_init)
configure_grad_sync(args, mimo_model, topology)
# Load-bearing contract read by training.py for checkpoint save/load, per-module

Contributor Author

Remove this comment:
# Load-bearing contract read by training.py for checkpoint save/load, per-module
# reductions, and optimizer construction (see Increments 2 and 4).

… grids

Running evaluate() cross-grid (skip_model_parallel_init) surfaced two spots
that train_step already guards but evaluate did not:
- the modelopt distillation shape-adjust reads global parallel_state; gate it
  on modelopt_enabled, matching train_step.
- the loss reduction indexes loss_dicts[0]; skip when empty (encoder ranks
  produce no loss), matching train_step's 'and losses_reduced' guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
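
The second guard in the commit above (skip the loss reduction when loss_dicts is empty, because encoder ranks produce no loss) can be sketched as below. This is a simplified illustration, not the reduction in evaluate(): the function name reduce_eval_losses is hypothetical, and a plain mean over micro-batches stands in for the real reduction.

```python
from typing import Dict, List, Optional


def reduce_eval_losses(loss_dicts: List[Dict[str, float]]) -> Optional[Dict[str, float]]:
    """Average each loss key over micro-batches; return None if there is nothing to reduce."""
    if not loss_dicts:
        # Encoder ranks collect no losses, so indexing loss_dicts[0] would raise IndexError.
        return None
    keys = loss_dicts[0].keys()
    return {k: sum(d[k] for d in loss_dicts) / len(loss_dicts) for k in keys}


encoder_rank_result = reduce_eval_losses([])
language_rank_result = reduce_eval_losses([{"lm loss": 2.0}, {"lm loss": 4.0}])
```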
@yashaswikarnati

Contributor Author

/ok to test 3af3084

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28563811161

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 2, 2026
@yashaswikarnati yashaswikarnati added this pull request to the merge queue Jul 2, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28569887360

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 2, 2026
@yashaswikarnati yashaswikarnati added this pull request to the merge queue Jul 2, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28595216236

Merged via the queue into NVIDIA:main with commit 0522099 Jul 2, 2026
87 checks passed
@yashaswikarnati yashaswikarnati deleted the ykarnati/mimo-noncolocated-e2e branch July 2, 2026 14:39

Labels

Approved (All necessary approvals have been made)
complexity: high

5 participants