Draft(M-FSDP): Demonstrate the fix for Megatron-Bridge#3161

Draft
shjwudp wants to merge 13 commits into NVIDIA:main from shjwudp:mfsdp_fix_mcore_tp

Conversation

@shjwudp
Contributor

@shjwudp shjwudp commented Jan 30, 2026

What does this PR do ?

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see the Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
The Final Review may be declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@copy-pr-bot

copy-pr-bot bot commented Jan 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shjwudp shjwudp changed the title from "[M-FSDP] fix mcore tp type detect" to "Draft(M-FSDP): Demonstrate the fix for Megatron-Bridge" on Feb 2, 2026
@shjwudp
Contributor Author

shjwudp commented Mar 13, 2026

/claude review

Comment on lines +354 to +359
debug_slices = []
for chunk_info, local_chunk in zip(local_chunks_info, all_local_chunks):
    offset = chunk_info["offset"]
    slices = tuple(slice(o, o + s) for o, s in zip(offset, local_chunk.shape))
    debug_slices.append(slices)


Leftover debug code: debug_slices is computed but never used. Should be removed.

Suggested change
-debug_slices = []
-for chunk_info, local_chunk in zip(local_chunks_info, all_local_chunks):
-    offset = chunk_info["offset"]
-    slices = tuple(slice(o, o + s) for o, s in zip(offset, local_chunk.shape))
-    debug_slices.append(slices)
 # Reconstruct the full tensor by placing chunks at their correct offsets
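To make the offset-based reconstruction the comment refers to concrete, here is a minimal pure-Python sketch; the helper name `reconstruct_full` and the 1-D flattening are illustrative only, not the PR's actual code:

```python
# Illustrative sketch: rebuild a 1-D "full tensor" from uneven chunks by
# placing each chunk at its recorded offset (mirroring the chunk_info["offset"]
# bookkeeping in the snippet above; all names here are hypothetical).

def reconstruct_full(full_size, chunks):
    """chunks: list of (offset, values) pairs; chunk sizes may be uneven."""
    full = [0] * full_size
    for offset, values in chunks:
        # Slice assignment places the chunk at its offset in the full tensor.
        full[offset:offset + len(values)] = values
    return full

# Uneven sharding of a length-10 tensor across 3 ranks: sizes 4, 3, 3.
chunks = [(0, [1, 2, 3, 4]), (4, [5, 6, 7]), (7, [8, 9, 10])]
print(reconstruct_full(10, chunks))  # → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```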

>>> mesh = DeviceMesh("cuda", [0, 1, 2, 3])
>>> # Create a DTensor with uneven sharding
>>> dtensor = DTensor(..., placements=[Shard(0)])
>>> full_tensor = gather_uneven_dtensor_to_full_tensor(dtensor)

Docstring example still references the old function name gather_uneven_dtensor_to_full_tensor — should be uneven_dtensor_to_full_tensor.

Suggested change
->>> full_tensor = gather_uneven_dtensor_to_full_tensor(dtensor)
+>>> full_tensor = uneven_dtensor_to_full_tensor(dtensor)

Comment on lines 820 to +824
def is_mcore_tensor_parallel_duplicated(param: torch.Tensor) -> bool:
    """
    Check if the given parameter is Megatron-Core tensor model parallel and duplicated.
    """
-    return getattr(param, "_tp_duplicated", False)
+    return get_mcore_tensor_parallel_partition_dim(param) is None

Bug: is_mcore_tensor_parallel_duplicated now returns True for any parameter without _tensor_parallel_mode set (including completely unannotated parameters). Combined with the change in param_and_grad_buffer.py where the guard switched from is_mcore_tensor_model_parallel(param) (per-param check) to using_tensor_parallel(dist_index) (global mesh check), any unannotated parameter will now be wrapped as Replicate() DTensor when TP is enabled.

If _detect_parallelism_type returns None for a module (e.g. a custom user module), its params won't get _tensor_parallel_mode set, but they'll still enter the TP DTensor path and be treated as duplicated. This may be the intended behavior, but the docstring is misleading — this function no longer checks if the param "is tensor model parallel and duplicated"; it returns True for anything that isn't column/row partitioned.

At minimum, consider guarding on hasattr(param, "_tensor_parallel_mode"):

def is_mcore_tensor_parallel_duplicated(param: torch.Tensor) -> bool:
    """
    Check if the given parameter is Megatron-Core tensor model parallel and duplicated.
    """
    return hasattr(param, "_tensor_parallel_mode") and get_mcore_tensor_parallel_partition_dim(param) is None
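To illustrate the difference the suggested guard makes, here is a self-contained sketch using stand-ins; `get_partition_dim`, `FakeParam`, and the mode strings are hypothetical, not Megatron-Core's actual API:

```python
# Sketch of the unguarded vs. guarded check. A param that was never annotated
# with _tensor_parallel_mode should not be classified as TP-duplicated.

def get_partition_dim(param):
    # Stand-in for get_mcore_tensor_parallel_partition_dim: returns the
    # partition dim for column/row sharding, None otherwise (hypothetical).
    mode = getattr(param, "_tensor_parallel_mode", None)
    return {"column": 0, "row": 1}.get(mode)

def is_tp_duplicated_unguarded(param):
    # Current PR behavior: anything without a partition dim counts as duplicated.
    return get_partition_dim(param) is None

def is_tp_duplicated_guarded(param):
    # Suggested behavior: only explicitly annotated params can be duplicated.
    return hasattr(param, "_tensor_parallel_mode") and get_partition_dim(param) is None

class FakeParam:
    pass

plain = FakeParam()                                      # never annotated
dup = FakeParam(); dup._tensor_parallel_mode = "duplicate"
col = FakeParam(); col._tensor_parallel_mode = "column"

print(is_tp_duplicated_unguarded(plain))  # → True (unannotated param misclassified)
print(is_tp_duplicated_guarded(plain))    # → False
print(is_tp_duplicated_guarded(dup))      # → True
print(is_tp_duplicated_guarded(col))      # → False
```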

Comment on lines +47 to +80
    else:
        device_type = "cpu"
        device = torch.device("cpu")
        backend = "gloo"

    # Initialize process group if not already initialized
    if not dist.is_initialized():
        dist.init_process_group(
            backend=backend,
            rank=rank,
            world_size=world_size,
        )

    yield {
        "rank": rank,
        "world_size": world_size,
        "local_rank": local_rank,
        "device_type": device_type,
        "device": device,
    }

    # Cleanup
    if dist.is_initialized():
        dist.destroy_process_group()


# ---------- Helper: distributed setup ----------

@pytest.fixture(scope="module")
def distributed_setup():
    """Setup torch.distributed and CUDA device for torchrun + pytest."""
    if "RANK" not in os.environ or "WORLD_SIZE" not in os.environ:
        pytest.skip("Not running under torchrun. Use torchrun to run this test file.")


The distributed_setup fixture is defined twice (first at line 31, again here). The second definition silently shadows the first. Remove the duplicate.
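The shadowing here is plain Python name rebinding, which a minimal sketch makes obvious (illustrative only, not the actual test file):

```python
# A second `def` with the same name at module scope simply rebinds the name,
# with no error or warning, so pytest only ever collects the last definition.

def distributed_setup():
    return "first definition"

def distributed_setup():  # rebinds the module-level name; the first is lost
    return "second definition"

print(distributed_setup())  # → second definition
```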

self.layer_norm_bias = nn.Parameter(torch.empty(8))

module = TELayerNormColumnParallelLinear()


Typo in test name: teliner → telinear

Suggested change
def test_detect_parallelism_telinear_parallel_mode_variants():
