E2E heterogeneous non-colocated MiMo training #5602
Conversation
Factor the post-build distributed lifecycle (TP attrs, GPU placement, mixed-precision wrap, meta materialize, DDP/FSDP wrap) out of unimodal_build_distributed_models into a reusable function callable on already-built chunks, for per-submodule reuse by the MIMO builder. Byte-identical for stock callers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
…sjoint grids
Four default-off, stock-byte-identical changes so a non-language rank under skip_model_parallel_init never touches an uninitialized global mpu:
- train(): only default finalize_model_grads_func when unset (keep builder hook)
- setup_model_and_optimizer(): dp world-size falls back to args when mpu uninit
- setup_model_and_optimizer(): thread the model's pg_collection groups + rng prefix into load_checkpoint (mirrors the existing save path)
- save_checkpoint_and_time(): pass dp_group to report_memory
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
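The first item in the commit above (default the hook only when unset) can be sketched in isolation. This is a minimal illustration, not the code in train(): `ensure_finalize_hook`, `stock_finalize_model_grads`, and the `SimpleNamespace` configs are hypothetical stand-ins; only the attribute name finalize_model_grads_func comes from the commit message.

```python
from types import SimpleNamespace
from typing import Callable


def stock_finalize_model_grads(*args, **kwargs) -> str:
    """Hypothetical stand-in for the stock hook; returns a tag so the choice is visible."""
    return "stock"


def ensure_finalize_hook(config: SimpleNamespace, default: Callable) -> Callable:
    """Install `default` only when no hook is set, so a builder-installed hook survives."""
    if getattr(config, "finalize_model_grads_func", None) is None:
        config.finalize_model_grads_func = default
    return config.finalize_model_grads_func


# Stock path: nothing installed yet, so the default is used.
stock_config = SimpleNamespace(finalize_model_grads_func=None)
ensure_finalize_hook(stock_config, stock_finalize_model_grads)

# Builder path: a hook is already present and must not be overwritten.
builder_config = SimpleNamespace(finalize_model_grads_func=lambda *a, **k: "builder")
ensure_finalize_hook(builder_config, stock_finalize_model_grads)
```

Because the stock path always starts with the hook unset, this guard leaves stock behavior unchanged while keeping whatever the builder installed.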
…t pick Rename the schedule_pg_collection carrier param to pg_collection on pretrain/train/train_step and collapse train_step's dual params to one carrier (reductions now derive per-rank groups from the model). Remove the language-else-first heuristic: LM-global concerns use the language collection or None; per-module seeding is the builder's job. schedules.py untouched; stock is byte-identical (carrier stays None). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
… bug Delete _get_pg_collection_for_optimizer, which queried expert groups on the default grid view and raised on dense (encoder) grids; read each active module's pg_collection off the built model instead. Add the base-view intra_dist_opt group in topology, and make MimoOptimizer.count_zeros world-consistent (per-module all_reduce MAX then sum) so num_zeros agrees across disjoint grids. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
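The "per-module all_reduce MAX then sum" rule can be illustrated without a distributed runtime. The sketch below simulates it in plain Python under two assumptions that are not stated in the commit: every rank hosting a module already agrees on that module's zero count, and ranks that do not host a module contribute 0. The function name and the dict layout are hypothetical.

```python
from typing import Dict, List


def world_consistent_num_zeros(
    per_rank_counts: List[Dict[str, int]], module_names: List[str]
) -> int:
    """Simulate a per-module all_reduce(MAX) over the world, then sum over modules.

    per_rank_counts[r][name] is the zero count rank r holds for module `name`;
    a rank that does not host the module omits the key (treated as 0).
    """
    total = 0
    for name in module_names:
        # MAX over all ranks: non-hosting ranks contribute 0, so the hosting
        # ranks' value wins and every rank ends up with the same number.
        total += max(counts.get(name, 0) for counts in per_rank_counts)
    return total


# Two encoder ranks and two language ranks on disjoint grids.
ranks = [{"encoder": 3}, {"encoder": 3}, {"language": 5}, {"language": 5}]
num_zeros = world_consistent_num_zeros(ranks, ["encoder", "language"])
```

MAX rather than SUM is what keeps the per-module value from being multiplied by the number of hosting ranks.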
configure_module_rng now delegates to the stock _set_random_seed with explicit per-module groups (adds data_parallel_random_init); wrap_active_modules_with_ddp delegates to prepare_existing_model_chunks_for_distributed_training per active submodule (language Float16Module; encoders _EncoderFloat16Module with overlap off), dropping the hand-rolled DDP/Float16/bucket/stream code. Function names unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
New examples/mimo/training/builder.py: MimoBuildConfig(ModelConfig) carries the live topology/args as non-serialized underscore fields; MimoModelBuilder builds the rank-local MimoModel, seeds the active role before build, wraps it via the shared distributed lifecycle, installs grad-sync, and sets model.pg_collection + rng_state_key_prefix (read by the training loop for checkpoint save/load, per-module reductions, and optimizer construction). Thread data_parallel_random_init into the DDP wrap so DP replicas broadcast-sync when the flag is set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
New examples/mimo/training/data.py: per-role external DataLoaders for non-colocated MIMO (language at PP first/last stage, encoder at PP first stage, others None); encoder micro-batch = mbs*llm_dp//encoder_dp; per-role/DP/split seeding; provider tagged is_distributed. Text-token pool excludes image_token_id and pad_token_id (collision fix); labels remap image + pad to -100. Dense and dynamic-resolution (packed thd) encoder inputs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
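Two of the rules in this commit are simple enough to show as arithmetic. The following is a sketch under stated assumptions, not the code in data.py: the helper names and the example token ids are made up, and label shifting is deliberately left out so only the remap is shown.

```python
from typing import List

IGNORE_INDEX = -100  # value the commit says image and pad labels are remapped to


def encoder_micro_batch_size(mbs: int, llm_dp: int, encoder_dp: int) -> int:
    """Encoder micro-batch = mbs * llm_dp // encoder_dp, as stated in the commit."""
    return mbs * llm_dp // encoder_dp


def text_token_pool(vocab_size: int, image_token_id: int, pad_token_id: int) -> List[int]:
    """Candidate ids for mock text tokens, excluding the two reserved ids."""
    reserved = {image_token_id, pad_token_id}
    return [t for t in range(vocab_size) if t not in reserved]


def remap_labels(tokens: List[int], image_token_id: int, pad_token_id: int) -> List[int]:
    """Replace image and pad positions with IGNORE_INDEX so they carry no loss."""
    reserved = {image_token_id, pad_token_id}
    return [IGNORE_INDEX if t in reserved else t for t in tokens]


# Hypothetical sizes: language DP of 4, encoder DP of 2, micro-batch size 2.
enc_mbs = encoder_micro_batch_size(mbs=2, llm_dp=4, encoder_dp=2)
pool = text_token_pool(vocab_size=6, image_token_id=4, pad_token_id=0)
labels = remap_labels([5, 4, 0, 3], image_token_id=4, pad_token_id=0)
```

Excluding the reserved ids from the pool is what prevents a randomly drawn mock text token from being mistaken for an image placeholder or padding.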
New examples/mimo/pretrain_mimo.py wires the topology, cross-grid P2P communicator, role-aware mock data, and MimoBuildConfig into the stock pretrain() loop (model_provider=None, skip_model_parallel_init=True, pg_collection=topology.schedule_pg_collection); validates the disjoint grids by standing the language grid in for stock validate_args. Adds the 8-GPU 20L nemotron VLM mock launcher matching the proven cog config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
…s seeding) initialize_megatron gains a skip_random_seed flag; pretrain sets it for a MultiModuleProcessGroupCollection so the per-module builder is the sole seeder. Fixes an encoder-rank crash: _set_random_seed with a None pp_group falls back to mpu.get_pipeline_model_parallel_rank(), which asserts under skip_model_parallel_init. Removes the language-vs-first seed resolution entirely. Byte-identical for stock (skip_random_seed=False, groups from mpu as before). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
…ollection The distributed optimizer requires pg_collection.intra_dist_opt; production threads it via pg_collection_from_grid into the submodule spec. The test helper get_pg_collection omitted it, so use_distributed_optimizer tests raised. Add mp/tp_ep_pp/intra_dist_opt (create the base intra_dist_opt group collectively) and drop the dead groups from the removed _get_pg_collection_for_optimizer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
test_builder_seeds asserts the wrap call including the data_parallel_random_init arg the builder now threads; test_active_module_is_ddp uses bf16 (production precision) so the encoder wrap resolves config via Float16Module (a bare fp32 modality container exposes no config to get_model_config). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
The Nemotron6 hybrid model has Mamba layers; the launcher must use uv run --extra ssm so mamba-ssm is available (matching the proven cog e2e config). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
Match the proven cog e2e config (--tee 3) so every rank's output is captured, not just rank 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
…joint grids Under skip_model_parallel_init the mpu is uninitialized, so ProcessGroupCollection.use_mpu_process_groups() asserts; only call it when model-parallel is initialized (None otherwise). Gate the builder path on having a model config rather than a non-None pg_collection, so a disjoint-grid run stays on the builder (its per-module builder ignores the passed collection). Byte-identical for stock (mpu initialized, pg_collection always non-None). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
…action Restore the pre-existing import blank lines and the two build_virtual_pipeline_stages call sites to match main; the extraction diff now shows only the new function and the delegation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
…ly diff) Regenerate dist_utils.py = main + only the extraction (new prepare_existing_model_chunks_for_distributed_training + the unimodal delegation), so the unchanged helper functions keep main's exact formatting instead of the whole-file black reflow from the initial commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: ykarnati <ykarnati@nvidia.com>
    builder: ClassVar[str] = "examples.mimo.training.builder.MimoModelBuilder"
    _topology: Optional[HeteroTopology] = field(default=None)
    _args: Optional[argparse.Namespace] = field(default=None)
Why is args part of the build config?
def _encoder_module_name(topology: HeteroTopology) -> Optional[str]:
How would this generalize to multiple encoders?
    """
    if wrap_with_ddp and not ddp_config:
        raise ValueError("ddp_config is required when wrap_with_ddp is True")
    if transformer_config.init_model_with_meta_device != built_with_meta_device:
Why do we need both init_model_with_meta_device and built_with_meta_device?
  # Materialize tensors on meta device (GPU allocation) if not using FSDP2 and not using Megatron FSDP.
- if init_model_with_meta_device and not use_torch_fsdp2 and not use_megatron_fsdp:
+ if built_with_meta_device and not use_torch_fsdp2 and not use_megatron_fsdp:
Why rename init_model_with_meta_device to built_with_meta_device?
    """Register model-provider, heterogeneous-grid, and mock-data arguments."""
    parser = add_model_provider_args(parser)
    parser = add_hetero_grid_args(parser)
    data = parser.add_argument_group("mimo mock data")
Why not maintain the mock data args separately, in the mock data file?
    language_config = language_model_spec(args, None, language_grid).params["config"]
    communicator = MultiModulePipelineCommunicator(
        topology.grids,
        {RADIO_ENCODER_MODULE_NAME: [MIMO_LANGUAGE_MODULE_KEY], MIMO_LANGUAGE_MODULE_KEY: []},
This is very specific to the current radio encoder and language model. We want this to extend to different models, since this is the common entry point for pretrain mimo.
    wrap_active_modules_with_ddp(args, mimo_model, topology, data_parallel_random_init)
    configure_grad_sync(args, mimo_model, topology)
    # Load-bearing contract read by training.py for checkpoint save/load, per-module
remove # Load-bearing contract read by training.py for checkpoint save/load, per-module
# reductions, and optimizer construction (see Increments 2 and 4).
… grids
Running evaluate() cross-grid (skip_model_parallel_init) surfaced two spots that train_step already guards but evaluate did not:
- the modelopt distillation shape-adjust reads global parallel_state; gate it on modelopt_enabled, matching train_step.
- the loss reduction indexes loss_dicts[0]; skip when empty (encoder ranks produce no loss), matching train_step's 'and losses_reduced' guard.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
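The second guard in this commit (do not index loss_dicts[0] when the list is empty) can be shown with a small stand-alone reduction. This is only a sketch: `reduce_losses` is a hypothetical helper that averages per-key values locally, and it does not reproduce the cross-rank reduction that evaluate() performs.

```python
from typing import Dict, List


def reduce_losses(loss_dicts: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each loss key over micro-batches; return {} when there is nothing to reduce."""
    if not loss_dicts:
        # Encoder ranks produce no loss, so indexing loss_dicts[0] would raise IndexError.
        return {}
    keys = loss_dicts[0].keys()
    return {key: sum(d[key] for d in loss_dicts) / len(loss_dicts) for key in keys}


# A language rank with two micro-batches, and an encoder rank with none.
language_rank = reduce_losses([{"lm loss": 2.0}, {"lm loss": 4.0}])
encoder_rank = reduce_losses([])
```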
/ok to test 3af3084
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28563811161
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28569887360
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28595216236