Skip to content

fix: handle list-typed process groups in ProcessGroupCollection.__repr__#3753

Merged
ericharper merged 4 commits intoNVIDIA:mainfrom
cluster2600:fix/process-group-repr-list-groups
Apr 13, 2026
Merged

fix: handle list-typed process groups in ProcessGroupCollection.__repr__#3753
ericharper merged 4 commits intoNVIDIA:mainfrom
cluster2600:fix/process-group-repr-list-groups

Conversation

@cluster2600
Copy link
Copy Markdown
Contributor

Fixes #3723.

Problem

ProcessGroupCollection.__repr__ crashes with AttributeError: 'list' object has no attribute 'size' when hierarchical_context_parallel_sizes is configured. The hcp field stores a List[ProcessGroup] rather than a single ProcessGroup, but __repr__ unconditionally called pg.size() on every field value.

This surfaces during checkpoint saving when modelopt calls str() on the config object, and in any logging or debugging context that triggers repr().

Solution

Added an isinstance(pg, list) check in __repr__ to handle list-typed fields. When a field contains a list of process groups, it collects the individual sizes into a list (e.g. hcp([2, 4])) instead of calling .size() directly.

Changes

  • megatron/core/process_groups_config.py — added list handling in __repr__
  • tests/unit_tests/test_process_groups_config.py — added test_repr_with_list_process_groups covering the hcp list case

Test plan

  • New unit test verifies repr() output with list-typed hcp field
  • Existing test_repr continues to pass for single process groups

@cluster2600 cluster2600 requested review from a team as code owners March 9, 2026 10:29
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Mar 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft March 9, 2026 10:30
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 9, 2026

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@cluster2600 cluster2600 marked this pull request as ready for review March 10, 2026 21:27
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 10, 2026 21:28
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review PR is in the "final review" stage label Mar 10, 2026
@cluster2600 cluster2600 force-pushed the fix/process-group-repr-list-groups branch from 31781da to cc2c6e1 Compare March 11, 2026 10:24
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Mar 11, 2026
@cluster2600 cluster2600 force-pushed the fix/process-group-repr-list-groups branch from cc2c6e1 to 7988e2d Compare March 11, 2026 18:58
Fixes NVIDIA#3723.

When hierarchical_context_parallel_sizes is configured, the hcp field
stores a list of ProcessGroup objects rather than a single one. The
__repr__ method assumed every field has a .size() method, causing an
AttributeError when encountering a list.

Add an isinstance check to format list-typed groups by collecting
their individual sizes.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
@cluster2600 cluster2600 force-pushed the fix/process-group-repr-list-groups branch from 7988e2d to 23bb1ce Compare March 12, 2026 06:21
@dimapihtar dimapihtar added Expert Review [deprecated] Apply this label to indicate that your PR is ready for expert review. and removed Final Review PR is in the "final review" stage labels Mar 20, 2026
@dimapihtar
Copy link
Copy Markdown
Contributor

/ok to test a313b8e

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review PR is in the "final review" stage label Mar 20, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 20, 2026
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Mar 20, 2026
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 1, 2026
@gautham-kollu
Copy link
Copy Markdown
Contributor

/ok to test ab56132

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Apr 8, 2026
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 10, 2026
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Apr 13, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added Approved All necessary approvals have been made and removed Final Review PR is in the "final review" stage labels Apr 13, 2026
@ericharper ericharper added this pull request to the merge queue Apr 13, 2026
@svcnvidia-nemo-ci
Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24343641742

Merged via the queue into NVIDIA:main with commit dda6901 Apr 13, 2026
63 of 65 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Approved All necessary approvals have been made community-request Expert Review [deprecated] Apply this label to indicate that your PR is ready for expert review.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ProcessGroupCollection.__repr__ crashes with list-typed hierarchical CP groups

8 participants