Skip to content

Non-powers of 2 support for Primus preflight v2#707

Merged
yeandy merged 2 commits intodev/preflight-direct-testfrom
yeandy/preflight-non-powers-of-2-v2
Apr 29, 2026
Merged

Non-powers of 2 support for Primus preflight v2#707
yeandy merged 2 commits intodev/preflight-direct-testfrom
yeandy/preflight-non-powers-of-2-v2

Conversation

@yeandy
Copy link
Copy Markdown

@yeandy yeandy commented Apr 29, 2026

Summary

Preflight inter-node benchmarks assumed power-of-2 node counts, causing two bugs:

  1. Silent exclusion -- remainder nodes never benchmarked for 2-node/4-node adjacency tests (e.g., on 6 nodes with 4-node adjacency, nodes 4-5 were silently dropped -- 33% of nodes never tested)
  2. P2P crash -- odd node counts triggered AssertionError: peer_rank >= WORLD_SIZE because the unpaired trailing node tried to compute a peer beyond world size

This PR fixes both so preflight runs correctly on any node count.

What changed

File Change
inter_node_comm.py Create remainder group when num_nodes % adjacent_nodes >= 2. Use dist.get_world_size() for actual group size. Report via explicit group_leaders list instead of modular arithmetic.
inter_node_comm_p2p.py Guard peer_rank with RANK < num_paired_ranks. Unpaired ranks get peer_rank = -1 and skip gracefully. Same guard in reporting loop.

How it works

remainder = num_nodes % adjacent_nodes

remainder = 0  -> all nodes in equal groups (unchanged)
remainder = 1  -> excluded (correct -- can't do inter-node comm with 1 node)
remainder >= 2 -> OLD: silently excluded. NEW: remainder group created & benchmarked

Example -- 6 nodes, allreduce-4nodes:

  • Old: only nodes 0-3 benchmarked (1 group of 32 ranks). Nodes 4-5 silently excluded.
  • New: nodes 0-3 form the main group (32 ranks), nodes 4-5 form a remainder group (16 ranks). Both reported.

Test plan

Ran preflight on 1-9 nodes with NCCL_DEBUG=INFO enabled. All passed cleanly.

Nodes Verification
2, 4, 8 POW2 regression -- identical to main
3, 5, 9 Odd counts -- P2P no crash, unpaired node skipped gracefully
6 allreduce-4nodes reports 2 group leaders: rank 0 (4-node main) + rank 32 (2-node remainder)
7 allreduce-4nodes reports 2 group leaders: rank 0 (4-node main) + rank 32 (3-node remainder)
1 All inter-node tests skip cleanly

NCCL-level verification (6N example)

Confirmed via ncclCommSplit_impl logs that two separate sub-communicators are created:

splitCount 87: g52+g53 (nodes 4-5) -> nranks=16, color 1721561492  (remainder group)
splitCount 88: g2+g14+g50+g51 (nodes 0-3) -> nranks=32, color 2008404567  (main group)

Different colors = separate sub-communicators. Remainder nodes do NOT appear in the main group's split.

Caveat

When remainder = 1 (e.g., 5N with 4-node adjacency), the single leftover node is excluded from that sub-group test -- inter-node communication requires >= 2 nodes. It's still covered by the full N-node allreduce/alltoall and ring P2P.

Test results

All jobs ran on partition amd-tw with NCCL_DEBUG=INFO. Node g57 excluded due to faulty RDMA NIC.

Nodes Ranks Slurm Job Nodes Used Exit
1 8 477 g59 0
2 16 478 g2, g14 0
3 24 479 g25, g26, g27 0
4 32 480 g50, g51, g52, g53 0
5 40 481 g15, g29, g54, g55, g59 0
6 48 482 g2, g14, g50, g51, g52, g53 0
7 56 483 g15, g25, g26, g27, g29, g54, g55 0
8 64 484 g2, g14, g15, g25, g26, g27, g29, g50 0
9 72 485 g2, g14, g15, g50, g51, g52, g53, g54, g55 0

Remainder group output (6N, slurm-482.out line 54476)

=======InterNodeComm - allreduce-4nodes (GB/s)=======
Hostname      Node  Rank  2MB    4MB    8MB    16MB   32MB   64MB   128MB  256MB  512MB  1024MB
tus1-p3-g2    0     0     3.73   5.31   5.03   5.84   10.96  7.55   7.41   8.01   15.01  29.27
tus1-p3-g52   4     32    32.46  53.65  77.70  96.48  135.12 175.10 215.54 324.20 336.77 344.32

Two group leaders reported: rank 0 (main 4-node group, nodes 0-3) and rank 32 (remainder 2-node group, nodes 4-5).

Remainder group output (7N, slurm-483.out line 68237)

=======InterNodeComm - allreduce-4nodes (GB/s)=======
Hostname      Node  Rank  2MB    4MB    8MB    16MB   32MB   64MB   128MB  256MB  512MB  1024MB
tus1-p3-g15   0     0     4.28   5.64   5.49   6.03   11.37  7.55   7.40   8.11   18.00  30.02
tus1-p3-g29   4     32    9.94   14.09  14.52  15.25  30.92  20.28  19.93  20.25  40.16  77.46

Two group leaders: rank 0 (main 4-node group, nodes 0-3) and rank 32 (remainder 3-node group, nodes 4-6).

P2P on odd counts (no crash)

=======InterNodeComm - p2p-2nodes (GB/s)=======  [3N, slurm-479.out]
Hostname      Node  Rank  2MB    ...    1024MB
tus1-p3-g25   0     0     12.25  ...    20.40    <- paired (node 0 <-> node 1)
                                                    node 2 absent (unpaired, skipped)

=======InterNodeComm - p2p-2nodes (GB/s)=======  [5N, slurm-481.out]
Hostname      Node  Rank  2MB    ...    1024MB
tus1-p3-g15   0     0     10.83  ...    10.12    <- pair 1 (node 0 <-> node 1)
tus1-p3-g15   0     1     11.00  ...    9.98
...
                                                    node 4 absent (unpaired, skipped)

Ring P2P includes all nodes (odd counts)

=======InterNodeRingP2P bidirectional bandwidth - ringsize=5 - (GB/s)=======
ring 10MB   20MB   40MB   80MB   160MB
0    49.79  52.22  53.95  54.93  54.87
1    49.57  53.02  53.80  54.93  55.19
2    49.71  52.39  53.84  54.78  54.98
3    49.37  52.21  53.94  54.69  55.21
4    48.15  53.20  53.84  54.94  55.24
5    47.59  52.54  53.73  54.70  55.07

All 5 nodes participate in the ring (including node 4 which was excluded from P2P pairing).

NCCL warnings

All NCCL WARN messages across all jobs are benign IB port-state notifications:

NCCL WARN NET/IB : rdmaN:1 Got async error event: port active

No NCCL ERROR, ncclSystemError, or SIGTERM in any completed job.

Out of scope

This PR addresses preflight benchmark correctness only. Early-stopping / RCCL hang diagnosis is in progress in #689.

@yeandy yeandy marked this pull request as ready for review April 29, 2026 21:18
Copy link
Copy Markdown
Collaborator

@amd-ama10002-2 amd-ama10002-2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍I'm approving.

Do we need to manually test it again? If the command to execute the preflight test is the same as it was in PR #699 then I don't think we need to manually test again.

@yeandy yeandy merged commit b285352 into dev/preflight-direct-test Apr 29, 2026
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants