fix: handle list-typed process groups in ProcessGroupCollection.__repr__ #3753
Merged
ericharper merged 4 commits into NVIDIA:main on Apr 13, 2026
Force-pushed from 31781da to cc2c6e1
Force-pushed from cc2c6e1 to 7988e2d
Fixes NVIDIA#3723.

When hierarchical_context_parallel_sizes is configured, the hcp field stores a list of ProcessGroup objects rather than a single one. The __repr__ method assumed every field has a .size() method, causing an AttributeError when encountering a list. Add an isinstance check to format list-typed groups by collecting their individual sizes.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
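The failure and the fix described in this commit message can be illustrated with a minimal, self-contained sketch. It does not reproduce the Megatron-LM source: `FakeGroup`, `describe_group`, and `CollectionSketch` are hypothetical stand-ins, and only the `isinstance` branch and the `hcp([2, 4])` output format come from the description.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


class FakeGroup:
    """Hypothetical stand-in for a process group; only size() is modeled."""

    def __init__(self, size: int) -> None:
        self._size = size

    def size(self) -> int:
        return self._size


def describe_group(name: str, pg) -> str:
    """Format one field, accepting a single group or a list of groups."""
    if isinstance(pg, list):
        # A list has no .size(); collect the size of each member instead.
        return f"{name}({[g.size() for g in pg]})"
    return f"{name}({pg.size()})"


@dataclass(repr=False)
class CollectionSketch:
    """Assumed shape: one single-group field and one list-typed field."""

    tp: Optional[FakeGroup] = None
    hcp: Optional[List[FakeGroup]] = None

    def __repr__(self) -> str:
        parts = [
            describe_group(f.name, getattr(self, f.name))
            for f in fields(self)
            if getattr(self, f.name) is not None
        ]
        return f"CollectionSketch({', '.join(parts)})"


groups = CollectionSketch(tp=FakeGroup(8), hcp=[FakeGroup(2), FakeGroup(4)])
print(groups)  # CollectionSketch(tp(8), hcp([2, 4]))
```

Without the `isinstance` branch, `describe_group` would call `.size()` on the list itself, which raises the `AttributeError` quoted in the report.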
Force-pushed from 7988e2d to 23bb1ce
/ok to test a313b8e
jaredcasper approved these changes Mar 30, 2026
gautham-kollu approved these changes Apr 8, 2026
/ok to test ab56132
ericharper approved these changes Apr 12, 2026
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24343641742
Fixes #3723.

Problem

`ProcessGroupCollection.__repr__` crashes with `AttributeError: 'list' object has no attribute 'size'` when `hierarchical_context_parallel_sizes` is configured. The `hcp` field stores a `List[ProcessGroup]` rather than a single `ProcessGroup`, but `__repr__` unconditionally called `pg.size()` on every field value.

This surfaces during checkpoint saving when `modelopt` calls `str()` on the config object, and in any logging or debugging context that triggers `repr()`.

Solution

Added an `isinstance(pg, list)` check in `__repr__` to handle list-typed fields. When a field contains a list of process groups, it collects the individual sizes into a list (e.g. `hcp([2, 4])`) instead of calling `.size()` directly.

Changes

- `megatron/core/process_groups_config.py`: added list handling in `__repr__`
- `tests/unit_tests/test_process_groups_config.py`: added `test_repr_with_list_process_groups` covering the `hcp` list case

Test plan

- `repr()` output with list-typed `hcp` field
- `test_repr` continues to pass for single process groups
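
The two checks in the test plan can be sketched without a distributed backend by mocking the groups. This is an illustration, not the test added in the PR: `format_field` and `make_group` are hypothetical helpers, and the test bodies are assumptions; only the test name `test_repr_with_list_process_groups` and the expected `hcp([2, 4])` format come from the description.

```python
from unittest.mock import MagicMock


def format_field(name, pg):
    """Hypothetical helper with the list-aware branch under test."""
    if isinstance(pg, list):
        return f"{name}({[g.size() for g in pg]})"
    return f"{name}({pg.size()})"


def make_group(size):
    """Build a mock group whose size() returns a fixed value."""
    group = MagicMock()
    group.size.return_value = size
    return group


def test_repr_with_list_process_groups():
    # List-typed field: one size per group, in order.
    hcp = [make_group(2), make_group(4)]
    assert format_field("hcp", hcp) == "hcp([2, 4])"


def test_repr_single_group():
    # Single-group field: formatting is unchanged.
    assert format_field("tp", make_group(8)) == "tp(8)"
```

Mocks keep the sketch independent of `torch.distributed` initialization; a test against the real class would additionally need to construct or patch a `ProcessGroupCollection`.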