NCCL_NET_GDR_READ's performance impact on a PCIe platform #1295

Open
cold2stone opened this issue May 23, 2024 · 3 comments

cold2stone commented May 23, 2024

Hello,

NVIDIA's official documentation mentions that NCCL_NET_GDR_READ is set to 1 by default only on NVLink-based platforms. Additionally, it notes, "Reading directly from GPU memory when sending data is known to be slightly slower than reading from CPU memory on some platforms, such as PCI-E."
Indeed, my experiments on a PCIe platform show better performance with NCCL_NET_GDR_READ=0.
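For context, this kind of comparison can be reproduced with nccl-tests by toggling the variable and comparing the reported bus bandwidth. The sketch below is illustrative only: the mpirun host list, process counts, and the path to all_reduce_perf are placeholders to be adapted to the actual cluster.

```python
# Illustrative sketch: run nccl-tests' all_reduce_perf twice, toggling
# NCCL_NET_GDR_READ, and compare the reported bus bandwidth.
# Host names, process counts, and the binary path are placeholders.
import os
import subprocess

def run_allreduce(gdr_read: int) -> str:
    env = os.environ.copy()
    env["NCCL_NET_GDR_READ"] = str(gdr_read)  # 1 = read send data directly from GPU memory
    env["NCCL_DEBUG"] = "INFO"                # log transport selection for verification
    cmd = [
        "mpirun", "-np", "2", "-H", "node1:1,node2:1",   # placeholder hosts
        "./all_reduce_perf", "-b", "8", "-e", "1G", "-f", "2", "-g", "1",
    ]
    return subprocess.run(cmd, env=env, capture_output=True, text=True).stdout

for setting in (0, 1):
    print(f"--- NCCL_NET_GDR_READ={setting} ---")
    print(run_allreduce(setting))
```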

My question is this:
Even on NVLink-based platforms (e.g. DGX), the RNIC and the GPU are connected via PCIe, not NVLink. So why does the behavior of NCCL_NET_GDR_READ differ between NVLink-based platforms and PCIe platforms, given that the RNIC and the GPU are not connected via NVLink in either case?
Isn't NVLink involved only in data transfers between GPUs?

This leads to a further question:
What exactly makes GDR perform better than not using GDR?
I suspect that the difference in GPU memory reads between these two platforms is more about latency than bandwidth. I also believe that P2P communication does not change the PCIe data transfer bandwidth between devices.
So does the improvement in collective communication bandwidth brought by GDR come solely from the reduced communication latency of P2P transfers?

@shanleo2024

I have the same question: #1181
I don't think NCCL_NET_GDR_READ is what enables GDR itself; GDR is controlled by NCCL_NET_GDR_LEVEL, and NCCL_NET_GDR_READ only affects GDR on the sending side.
Did you test whether NCCL_NET_GDR_READ=1 performs better than NCCL_NET_GDR_READ=0 on a DGX platform?
I am also not sure whether NCCL_NET_GDR_READ is related to PXN.
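To make the distinction concrete, here is a minimal sketch of the two variables as described in the NCCL environment-variable documentation; the PXB value is just one example of the documented levels (LOC, PIX, PXB, PHB, SYS), not a recommendation.

```python
# Sketch of the two knobs discussed above (semantics per the NCCL env-var docs):
#   NCCL_NET_GDR_LEVEL - maximum GPU/NIC topology distance at which GPUDirect RDMA is used at all
#   NCCL_NET_GDR_READ  - whether the *send* side reads directly from GPU memory
import os

# Must be set before NCCL is initialized by the application.
os.environ["NCCL_NET_GDR_LEVEL"] = "PXB"  # example: allow GDR as long as the GPU-NIC path stays below the CPU host bridge
os.environ["NCCL_NET_GDR_READ"] = "0"     # sends are staged through a host buffer; receive-side GDR is still governed by the level above
```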

cold2stone commented Jun 4, 2024

NCCL_NET_GDR_READ only determines whether the send side uses GDR or not.
I am not using a DGX platform.

My question is: even if a PCIe Gen5 platform does not use GDR, PCIe bandwidth should not be the bottleneck of the system. Given that the original advantage of GDR was to relieve the PCIe bottleneck near the CPU, I wonder why GDR still makes a performance difference when PCIe bandwidth is not the bottleneck.

My guess is that GDR itself does not increase the network bandwidth.
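As a rough illustration of that point, here is a back-of-the-envelope comparison (assuming, purely as an example, a 400 Gb/s NIC in a Gen5 x16 slot; the actual NIC speed is not stated in this thread):

```python
# Rough bandwidth check (the 400 Gb/s NIC is an assumed example).
# PCIe Gen5: 32 GT/s per lane with 128b/130b encoding.
pcie_gen5_x16_GBps = 32e9 * (128 / 130) * 16 / 8 / 1e9   # ~63 GB/s per direction, theoretical
nic_400g_GBps = 400e9 / 8 / 1e9                          # 50 GB/s line rate

print(f"PCIe Gen5 x16: ~{pcie_gen5_x16_GBps:.0f} GB/s per direction")
print(f"400G NIC     :  {nic_400g_GBps:.0f} GB/s")
# The PCIe link itself has headroom over the NIC. Without GDR the send data is staged
# in host memory (GPU -> host memory, then host memory -> NIC), which adds a copy through
# the CPU's host bridge/memory subsystem and extra latency; the raw link rate is not the
# limiting factor in this comparison.
```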

@shanleo2024

Do you have any further insight into NCCL_NET_GDR_READ=1?
On my setup, NCCL_NET_GDR_READ=1 performs worse than NCCL_NET_GDR_READ=0 when running allreduce, allgather, and reducescatter, while several other tests perform better with NCCL_NET_GDR_READ=1.
I cannot explain this; do you have any idea?
