Enable P2P transport for AMD systems with >2 GPUs at PHB level #2080
Open
voipmonitor wants to merge 1 commit into NVIDIA:master from
Conversation
On AMD multi-socket systems, GPUs on the same NUMA node connect through separate PCIe root complexes under the same PCIe Host Bridge (PATH_PHB). The default P2P level (PATH_PXB) disables P2P for these paths, forcing shared-memory transport with a 24-42% bandwidth loss. Extend the existing AMD P2P exception to allow PHB-level P2P for configurations with more than 2 GPUs. The original SYS-level P2P for ≤2 GPU configurations is preserved.

Benchmarked on dual-socket AMD EPYC 9575F (Turin) with 4x RTX PRO 6000 on the same socket (NCCL 2.29.7+cuda13.2):

- Transport change: SHM/direct/direct -> P2P/direct pointer
- Throughput: +24-42% across 256K-128M message sizes
- Latency: up to 19% lower at 128K

Signed-off-by: Martin Vit <martin@voipmonitor.org>
Force-pushed from df78025 to bcd0da0
Collaborator

/mirror v2.30u1
Collaborator

Mirroring to the internal repository failed. The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch and is rebased to include recent changes.
Collaborator

/mirror
Collaborator

Mirroring to the internal repository failed. The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch and is rebased to include recent changes.
Summary
On AMD multi-socket systems, GPUs on the same NUMA node connect through separate PCIe root complexes under the same PCIe Host Bridge (PATH_PHB). The default P2P level (PATH_PXB) disables P2P for these paths, forcing NCCL to use shared memory (SHM) transport instead of direct P2P, with a 24-42% bandwidth loss.

The existing AMD exception (paths.cc:328) enables PATH_SYS-level P2P but only for ≤2 GPUs. This patch extends it to allow PATH_PHB-level P2P for >2 GPUs, enabling same-socket P2P while remaining conservative (no cross-socket P2P change).

Code change: a 1-line logic change in src/graph/paths.cc.

Benchmark results
System: Dual-socket AMD EPYC 9575F (Turin, Zen 5), 4x NVIDIA RTX PRO 6000 (Blackwell) on same NUMA node
Both stock and patched use NCCL 2.29.7+cuda13.2 built from the same master commit (3619159), with the only difference being this patch. nccl-tests rebuilt against 2.29.7.
Transport change: SHM/direct/direct → P2P/direct pointer

Throughput (all_reduce_perf, Ring, -g 4, -n 500, bus bandwidth in GB/s)
Latency (all_reduce_perf, -g 4, -n 1000)
Why this matters
NCCL_P2P_LEVEL=SYS: this is the only workaround, but it's undiscoverable

How to verify
Test plan
- Verified transport selection with NCCL_DEBUG=INFO
- Compared against the NCCL_P2P_LEVEL=SYS workaround

🤖 Generated with Claude Code