
Enable P2P transport for AMD systems with >2 GPUs at PHB level #2080

Open
voipmonitor wants to merge 1 commit into NVIDIA:master from voipmonitor:fix/amd-p2p-phb-level

Conversation

@voipmonitor voipmonitor commented Mar 31, 2026

Summary

On AMD multi-socket systems, GPUs on the same NUMA node connect through separate PCIe root complexes under the same PCIe Host Bridge (PATH_PHB). The default P2P level (PATH_PXB) disables P2P for these paths, forcing NCCL to use shared memory (SHM) transport instead of direct P2P — with 24-42% bandwidth loss.

The existing AMD exception (paths.cc:328) enables PATH_SYS-level P2P but only for ≤2 GPUs. This patch extends it to allow PATH_PHB-level P2P for >2 GPUs, enabling same-socket P2P while remaining conservative (no cross-socket P2P change).

Code change (1 line logic change in src/graph/paths.cc):

// Before:
if ((arch == X86 && vendor == AMD) && GPU.count <= 2) p2pLevel = PATH_SYS;

// After:
if (arch == X86 && vendor == AMD) {
    p2pLevel = (GPU.count <= 2) ? PATH_SYS : PATH_PHB;
}

Benchmark results

System: Dual-socket AMD EPYC 9575F (Turin, Zen 5), 4x NVIDIA RTX PRO 6000 (Blackwell) on same NUMA node

Both stock and patched use NCCL 2.29.7+cuda13.2 built from the same master commit (3619159), with the only difference being this patch. nccl-tests rebuilt against 2.29.7.

Transport change: SHM/direct/direct → P2P/direct pointer

Throughput (all_reduce_perf, Ring, -g 4, -n 500, bus bandwidth in GB/s)

| Size | Stock (SHM) | Patched (P2P) | Improvement |
|------|-------------|---------------|-------------|
| 256K | 11.32 GB/s | 15.32 GB/s | +35% |
| 512K | 13.12 GB/s | 18.55 GB/s | +41% |
| 1M | 20.01 GB/s | 27.88 GB/s | +39% |
| 2M | 24.62 GB/s | 34.89 GB/s | +42% |
| 4M | 30.77 GB/s | 39.67 GB/s | +29% |
| 16M | 36.18 GB/s | 44.92 GB/s | +24% |
| 128M | 37.64 GB/s | 46.60 GB/s | +24% |

Latency (all_reduce_perf, -g 4, -n 1000)

| Size | Stock (SHM) | Patched (P2P) | Improvement |
|------|-------------|---------------|-------------|
| 4K | 12.47 µs | 11.86 µs | 5% lower |
| 32K | 12.94 µs | 11.95 µs | 8% lower |
| 128K | 21.61 µs | 17.40 µs | 19% lower |

Why this matters

  • Most users don't know about NCCL_P2P_LEVEL=SYS — this is the only workaround, but it's undiscoverable
  • LLM inference is latency-sensitive — AllReduce messages during inference are typically 256K–4M (tensor parallel sharding), exactly the range with the largest regression (35-42%)
  • The fix is conservative — only enables PHB-level P2P (same socket), doesn't change cross-socket behavior for >2 GPUs
  • Preserves existing behavior — ≤2 GPU AMD systems keep PATH_SYS level unchanged

How to verify

# Stock NCCL (SHM transport):
CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=INFO all_reduce_perf -g 4
# → "via SHM/direct/direct"

# Patched NCCL (P2P transport):
CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=INFO all_reduce_perf -g 4
# → "via P2P/direct pointer"

# Manual workaround (equivalent result):
NCCL_P2P_LEVEL=SYS CUDA_VISIBLE_DEVICES=0,1,2,3 all_reduce_perf -g 4

Test plan

  • Verified transport changes from SHM to P2P via NCCL_DEBUG=INFO
  • Benchmarked throughput across message sizes (256K–128MB), same NCCL version
  • Benchmarked latency at small message sizes (8B–128KB), same NCCL version
  • Confirmed patched results match NCCL_P2P_LEVEL=SYS workaround
  • Confirmed ≤2 GPU path unchanged (still PATH_SYS)
  • Still to do: test on other AMD platforms (Zen 3/4, Milan/Genoa)

🤖 Generated with Claude Code

On AMD multi-socket systems, GPUs on the same NUMA node connect through
separate PCIe root complexes under the same PCIe Host Bridge (PATH_PHB).
The default P2P level (PATH_PXB) disables P2P for these paths, forcing
shared memory transport with 24-42% bandwidth loss.

Extend the existing AMD P2P exception to allow PHB-level P2P for
configurations with more than 2 GPUs. The original SYS-level P2P
for ≤2 GPU configurations is preserved.

Benchmarked on dual-socket AMD EPYC 9575F (Turin) with 4x RTX PRO 6000
on the same socket (NCCL 2.29.7+cuda13.2):

  Transport change: SHM/direct/direct -> P2P/direct pointer
  Throughput: +24-42% across 256K-128M message sizes
  Latency: up to 19% lower at 128K

Signed-off-by: Martin Vit <martin@voipmonitor.org>
@marksantesson
Collaborator

/mirror v2.30u1

@jskrobola
Collaborator

Mirroring to the internal repository failed.

The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.

@marksantesson
Collaborator

/mirror

@jskrobola
Collaborator

Mirroring to the internal repository failed.

The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.
