
Enable P2P transport for AMD systems with >2 GPUs at PHB level #2080

Open
voipmonitor wants to merge 1 commit into NVIDIA:master from voipmonitor:fix/amd-p2p-phb-level

Conversation

@voipmonitor voipmonitor commented Mar 31, 2026

Summary

On AMD multi-socket systems, GPUs on the same NUMA node connect through separate PCIe root complexes under the same PCIe Host Bridge (PATH_PHB). The default P2P level (PATH_PXB) disables P2P for these paths, forcing NCCL to use shared memory (SHM) transport instead of direct P2P — with 24-42% bandwidth loss.

The existing AMD exception (paths.cc:328) enables PATH_SYS-level P2P but only for ≤2 GPUs. This patch extends it to allow PATH_PHB-level P2P for >2 GPUs, enabling same-socket P2P while remaining conservative (no cross-socket P2P change).

Code change (1 line logic change in src/graph/paths.cc):

// Before:
if ((arch == X86 && vendor == AMD) && GPU.count <= 2) p2pLevel = PATH_SYS;

// After:
if (arch == X86 && vendor == AMD) {
    p2pLevel = (GPU.count <= 2) ? PATH_SYS : PATH_PHB;
}

Benchmark results

System: Dual-socket AMD EPYC 9575F (Turin, Zen 5), 4x NVIDIA RTX PRO 6000 (Blackwell) on same NUMA node

Both stock and patched use NCCL 2.29.7+cuda13.2 built from the same master commit (3619159), with the only difference being this patch. nccl-tests rebuilt against 2.29.7.

Transport change: SHM/direct/direct → P2P/direct pointer

Throughput (all_reduce_perf, Ring, -g 4, -n 500, bus bandwidth in GB/s)

| Size | Stock (SHM) | Patched (P2P) | Improvement |
|------|-------------|---------------|-------------|
| 256K | 11.32 GB/s | 15.32 GB/s | +35% |
| 512K | 13.12 GB/s | 18.55 GB/s | +41% |
| 1M | 20.01 GB/s | 27.88 GB/s | +39% |
| 2M | 24.62 GB/s | 34.89 GB/s | +42% |
| 4M | 30.77 GB/s | 39.67 GB/s | +29% |
| 16M | 36.18 GB/s | 44.92 GB/s | +24% |
| 128M | 37.64 GB/s | 46.60 GB/s | +24% |

Latency (all_reduce_perf, -g 4, -n 1000)

| Size | Stock (SHM) | Patched (P2P) | Improvement |
|------|-------------|---------------|-------------|
| 4K | 12.47 µs | 11.86 µs | 5% lower |
| 32K | 12.94 µs | 11.95 µs | 8% lower |
| 128K | 21.61 µs | 17.40 µs | 19% lower |

Why this matters

  • Most users don't know about NCCL_P2P_LEVEL=SYS — this is the only workaround, but it's undiscoverable
  • LLM inference is latency-sensitive — AllReduce messages during inference are typically 256K–4M (tensor parallel sharding), exactly the range with the largest regression (35-42%)
  • The fix is conservative — only enables PHB-level P2P (same socket), doesn't change cross-socket behavior for >2 GPUs
  • Preserves existing behavior — ≤2 GPU AMD systems keep PATH_SYS level unchanged

How to verify

# Stock NCCL (SHM transport):
CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=INFO all_reduce_perf -g 4
# → "via SHM/direct/direct"

# Patched NCCL (P2P transport):
CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_DEBUG=INFO all_reduce_perf -g 4
# → "via P2P/direct pointer"

# Manual workaround (equivalent result):
NCCL_P2P_LEVEL=SYS CUDA_VISIBLE_DEVICES=0,1,2,3 all_reduce_perf -g 4

Test plan

  • Verified transport changes from SHM to P2P via NCCL_DEBUG=INFO
  • Benchmarked throughput across message sizes (256K–128MB), same NCCL version
  • Benchmarked latency at small message sizes (8B–128KB), same NCCL version
  • Confirmed patched results match NCCL_P2P_LEVEL=SYS workaround
  • Confirmed ≤2 GPU path unchanged (still PATH_SYS)
  • Still to do: test on other AMD platforms (Zen 3/4, Milan/Genoa)

🤖 Generated with Claude Code

On AMD multi-socket systems, GPUs on the same NUMA node connect through
separate PCIe root complexes under the same PCIe Host Bridge (PATH_PHB).
The default P2P level (PATH_PXB) disables P2P for these paths, forcing
shared memory transport with 24-42% bandwidth loss.

Extend the existing AMD P2P exception to allow PHB-level P2P for
configurations with more than 2 GPUs. The original SYS-level P2P
for ≤2 GPU configurations is preserved.

Benchmarked on dual-socket AMD EPYC 9575F (Turin) with 4x RTX PRO 6000
on the same socket (NCCL 2.29.7+cuda13.2):

  Transport change: SHM/direct/direct -> P2P/direct pointer
  Throughput: +24-42% across 256K-128M message sizes
  Latency: up to 19% lower at 128K

Signed-off-by: Martin Vit <martin@voipmonitor.org>
@marksantesson
Collaborator

/mirror v2.30u1

@jskrobola
Collaborator

Mirroring to the internal repository failed.

The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.

@marksantesson
Collaborator

/mirror

@jskrobola
Collaborator

Mirroring to the internal repository failed.

The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.
