NCCL_NET_GDR_READ's performance impact on a PCIe platform #1295

Open
cold2stone opened this issue May 23, 2024 · 3 comments

cold2stone commented May 23, 2024

Hello,

NVIDIA's official documentation mentions that NCCL_NET_GDR_READ is set to 1 by default only on NVLink-based platforms. Additionally, it notes, "Reading directly from GPU memory when sending data is known to be slightly slower than reading from CPU memory on some platforms, such as PCI-E."
Indeed, my experiments on a PCIe platform show better performance with NCCL_NET_GDR_READ=0.
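For context, this kind of comparison can be reproduced with nccl-tests by toggling the variable and comparing the reported bus bandwidth. The sketch below is illustrative only: the mpirun host list, process counts, and the path to all_reduce_perf are placeholders to be adapted to the actual cluster.

```python
# Illustrative sketch: run nccl-tests' all_reduce_perf twice, toggling
# NCCL_NET_GDR_READ, and compare the reported bus bandwidth.
# Host names, process counts, and the binary path are placeholders.
import os
import subprocess

def run_allreduce(gdr_read: int) -> str:
    env = os.environ.copy()
    env["NCCL_NET_GDR_READ"] = str(gdr_read)  # 1 = read send data directly from GPU memory
    env["NCCL_DEBUG"] = "INFO"                # log transport selection for verification
    cmd = [
        "mpirun", "-np", "2", "-H", "node1:1,node2:1",   # placeholder hosts
        "./all_reduce_perf", "-b", "8", "-e", "1G", "-f", "2", "-g", "1",
    ]
    return subprocess.run(cmd, env=env, capture_output=True, text=True).stdout

for setting in (0, 1):
    print(f"--- NCCL_NET_GDR_READ={setting} ---")
    print(run_allreduce(setting))
```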

My question is this:
Even on NVLink-based platforms (e.g. DGX), the RNIC and the GPU are connected via PCIe, not NVLink. So why does the behavior of NCCL_NET_GDR_READ differ between NVLink-based platforms and PCIe platforms, given that the RNIC and the GPU are not connected via NVLink in either case?
Isn't NVLink involved only in data transfers between GPUs?

This leads to a further question:
What exactly makes GDR perform better than not using GDR?
I suspect that the difference in GPU memory reads between these two platforms is more about latency than bandwidth. I also believe that P2P communication does not change the PCIe data transfer bandwidth between devices.
So does the improvement in collective communication bandwidth brought by GDR come solely from the reduced communication latency of P2P transfers?

@shanleo2024

I have the same question: #1181
I don't think NCCL_NET_GDR_READ is what enables GDR itself; GDR is controlled by NCCL_NET_GDR_LEVEL, and NCCL_NET_GDR_READ only affects GDR on the sending side.
Did you test whether NCCL_NET_GDR_READ=1 performs better than NCCL_NET_GDR_READ=0 on a DGX platform?
I am also not sure whether NCCL_NET_GDR_READ is related to PXN.
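To make the distinction concrete, here is a minimal sketch of the two variables as described in the NCCL environment-variable documentation; the PXB value is just one example of the documented levels (LOC, PIX, PXB, PHB, SYS), not a recommendation.

```python
# Sketch of the two knobs discussed above (semantics per the NCCL env-var docs):
#   NCCL_NET_GDR_LEVEL - maximum GPU/NIC topology distance at which GPUDirect RDMA is used at all
#   NCCL_NET_GDR_READ  - whether the *send* side reads directly from GPU memory
import os

# Must be set before NCCL is initialized by the application.
os.environ["NCCL_NET_GDR_LEVEL"] = "PXB"  # example: allow GDR as long as the GPU-NIC path stays below the CPU host bridge
os.environ["NCCL_NET_GDR_READ"] = "0"     # sends are staged through a host buffer; receive-side GDR is still governed by the level above
```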

cold2stone commented Jun 4, 2024

NCCL_NET_GDR_READ only determines whether the send side uses GDR or not.
I am not using a DGX platform.

My question is: even if a PCIe Gen5 platform does not use GDR, PCIe bandwidth should not be the bottleneck of the system. Given that the original advantage of GDR was to relieve the PCIe bottleneck near the CPU, I wonder why GDR still makes a performance difference when PCIe bandwidth is not the bottleneck.

My guess is that GDR itself does not increase the network bandwidth.
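As a rough illustration of that point, here is a back-of-the-envelope comparison (assuming, purely as an example, a 400 Gb/s NIC in a Gen5 x16 slot; the actual NIC speed is not stated in this thread):

```python
# Rough bandwidth check (the 400 Gb/s NIC is an assumed example).
# PCIe Gen5: 32 GT/s per lane with 128b/130b encoding.
pcie_gen5_x16_GBps = 32e9 * (128 / 130) * 16 / 8 / 1e9   # ~63 GB/s per direction, theoretical
nic_400g_GBps = 400e9 / 8 / 1e9                          # 50 GB/s line rate

print(f"PCIe Gen5 x16: ~{pcie_gen5_x16_GBps:.0f} GB/s per direction")
print(f"400G NIC     :  {nic_400g_GBps:.0f} GB/s")
# The PCIe link itself has headroom over the NIC. Without GDR the send data is staged
# in host memory (GPU -> host memory, then host memory -> NIC), which adds a copy through
# the CPU's host bridge/memory subsystem and extra latency; the raw link rate is not the
# limiting factor in this comparison.
```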

@shanleo2024

Do you have any further insight into NCCL_NET_GDR_READ=1?
On my setup, NCCL_NET_GDR_READ=1 performs worse than NCCL_NET_GDR_READ=0 when running allreduce, allgather, and reducescatter, while several other tests perform better with NCCL_NET_GDR_READ=1.
I cannot explain this; do you have any idea?
