Non-powers of 2 support for Primus preflight v2 #707
Merged
yeandy merged 2 commits into dev/preflight-direct-test on Apr 29, 2026
amd-ama10002-2
approved these changes
Apr 29, 2026
Collaborator
LGTM 👍 I'm approving.
Do we need to manually test it again? If the command to execute the preflight test is the same as it was in PR #699, then I don't think we need to manually test again.
Summary
Preflight inter-node benchmarks assumed power-of-2 node counts, causing two bugs:
AssertionError: peer_rank >= WORLD_SIZE, because the unpaired trailing node tried to compute a peer beyond world size
This PR fixes both so preflight runs correctly on any node count.
What changed
inter_node_comm.py: Form a remainder group when num_nodes % adjacent_nodes >= 2. Use dist.get_world_size() for actual group size. Report via explicit group_leaders list instead of modular arithmetic.
inter_node_comm_p2p.py: Guard peer_rank with RANK < num_paired_ranks. Unpaired ranks get peer_rank = -1 and skip gracefully. Same guard in reporting loop.
How it works
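As a rough illustration of the grouping rule, the sketch below splits nodes into consecutive adjacency groups and keeps a trailing remainder group only when it has at least 2 nodes. The function names (build_node_groups, group_leader_ranks) are hypothetical and this is not the code in inter_node_comm.py. The default of 8 GPUs per node is an assumption, inferred from rank 32 leading the group that starts at node 4.

```python
from typing import List


def build_node_groups(num_nodes: int, adjacent_nodes: int) -> List[List[int]]:
    """Split node indices into consecutive groups of `adjacent_nodes`.

    A trailing remainder group is kept only if it has >= 2 nodes, because a
    single node cannot run an inter-node benchmark by itself.
    """
    num_full = num_nodes // adjacent_nodes
    groups = [
        list(range(g * adjacent_nodes, (g + 1) * adjacent_nodes))
        for g in range(num_full)
    ]
    remainder = num_nodes % adjacent_nodes
    if remainder >= 2:
        groups.append(list(range(num_full * adjacent_nodes, num_nodes)))
    return groups


def group_leader_ranks(groups: List[List[int]], gpus_per_node: int = 8) -> List[int]:
    """Global rank of the first GPU on the first node of each group.

    gpus_per_node=8 is an assumption for illustration.
    """
    return [group[0] * gpus_per_node for group in groups]


if __name__ == "__main__":
    for n in (5, 6, 7):
        groups = build_node_groups(n, 4)
        print(n, groups, group_leader_ranks(groups))
```

Listing the leaders explicitly, rather than deriving them with modular arithmetic on the rank, means a remainder group of a different size still gets exactly one reporting rank.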
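Example -- 6 nodes, allreduce-4nodes: nodes 0-3 form the main 4-node group (leader rank 0) and nodes 4-5 form the 2-node remainder group (leader rank 32).
Test plan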
Ran preflight on 1-9 nodes with NCCL_DEBUG=INFO enabled. All passed cleanly.
6N: allreduce-4nodes reports 2 group leaders: rank 0 (4-node main) + rank 32 (2-node remainder)
7N: allreduce-4nodes reports 2 group leaders: rank 0 (4-node main) + rank 32 (3-node remainder)
NCCL-level verification (6N example)
Confirmed via ncclCommSplit_impl logs that two separate sub-communicators are created. Different colors = separate sub-communicators. Remainder nodes do NOT appear in the main group's split.
Caveat
When remainder = 1 (e.g., 5N with 4-node adjacency), the single leftover node is excluded from that sub-group test -- inter-node communication requires >= 2 nodes. It's still covered by the full N-node allreduce/alltoall and ring P2P.
Test results
All jobs ran on partition amd-tw with NCCL_DEBUG=INFO. Node g57 excluded due to faulty RDMA NIC.
Remainder group output (6N, slurm-482.out line 54476)
Two group leaders reported: rank 0 (main 4-node group, nodes 0-3) and rank 32 (remainder 2-node group, nodes 4-5).
Remainder group output (7N, slurm-483.out line 68237)
Two group leaders: rank 0 (main 4-node group, nodes 0-3) and rank 32 (remainder 3-node group, nodes 4-6).
P2P on odd counts (no crash)
Ring P2P includes all nodes (odd counts)
All 5 nodes participate in the ring (including node 4 which was excluded from P2P pairing).
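To make the difference between pairwise and ring P2P concrete, here is a minimal sketch. The PR does not spell out the pairing scheme, so everything here is an assumption for illustration: node 2k pairs with node 2k+1 on the same local GPU index, the ring sends to the same local GPU on the next node, and there are 8 GPUs per node. The helper names pair_peer_rank and ring_next_rank are hypothetical.

```python
def pair_peer_rank(rank: int, num_nodes: int, gpus_per_node: int = 8) -> int:
    """Peer rank for pairwise P2P, or -1 if this rank's node has no partner.

    Assumed pairing: node 2k <-> node 2k+1, same local GPU index.
    """
    # Only ranks on nodes that have a partner take part in pairing.
    num_paired_ranks = (num_nodes // 2) * 2 * gpus_per_node
    if rank >= num_paired_ranks:
        return -1  # unpaired trailing node: skip instead of asserting
    node, local = divmod(rank, gpus_per_node)
    peer_node = node + 1 if node % 2 == 0 else node - 1
    return peer_node * gpus_per_node + local


def ring_next_rank(rank: int, num_nodes: int, gpus_per_node: int = 8) -> int:
    """Next hop in a node-level ring; the wrap-around includes every node."""
    node, local = divmod(rank, gpus_per_node)
    return ((node + 1) % num_nodes) * gpus_per_node + local


if __name__ == "__main__":
    num_nodes, gpus = 5, 8
    for rank in (0, 8, 31, 32, 39):
        print(rank, pair_peer_rank(rank, num_nodes, gpus),
              ring_next_rank(rank, num_nodes, gpus))
```

Under these assumptions, ranks 32-39 on node 4 of a 5-node job get a pairing peer of -1, but still have a valid ring neighbour on node 0.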
NCCL warnings
All NCCL WARN messages across all jobs are benign IB port-state notifications:
No NCCL ERROR, ncclSystemError, or SIGTERM in any completed job.
Out of scope
This PR addresses preflight benchmark correctness only. Early-stopping / RCCL hang diagnosis is in progress in #689.