Why use RDMA write for default IB traffic? #609

I saw code here: nccl/src/transport/net_ib.cc, line 26 in c5790b3.
Comments
The define was there to switch between RDMA_WRITE and SEND, not RDMA_WRITE and RDMA_READ. To use RDMA_READ we'd need to revert the way the code works.
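To make that switch concrete, here is a minimal verbs-level sketch, not NCCL's actual code (the helper name `postData` and its parameters are made up for illustration), showing how a compile-time define like this can pick between a one-sided RDMA write and a two-sided send for the same payload:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define USE_RDMA_WRITE 1  /* illustrative switch, mirroring the idea of the define */

/* Post one outgoing buffer either as a one-sided RDMA write (the receiver's
 * buffer address and rkey must already be known) or as a two-sided send (the
 * receiver must have posted a matching receive). Hypothetical helper. */
int postData(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len,
             uint64_t remote_addr, uint32_t rkey) {
  struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey };
  struct ibv_send_wr wr, *bad_wr = NULL;
  memset(&wr, 0, sizeof(wr));
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;
#if USE_RDMA_WRITE
  wr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM;  /* data lands directly at remote_addr; the
                                              immediate generates a completion on the
                                              receiver side */
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey = rkey;
#else
  wr.opcode = IBV_WR_SEND;                 /* matched against a receive posted on the peer */
  (void)remote_addr; (void)rkey;
#endif
  return ibv_post_send(qp, &wr, &bad_wr);
}
```

The write variant only works because the remote address and rkey are already known on the sender side, which is exactly the "clear to send" exchange discussed below.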
@sjeaugey thank you for the quick reply! Yes, I understand what this switch does here; my bad for not being clear. I meant: once USE_RDMA_WRITE is set (line 690 in c5790b3), is there any reason for choosing RDMA_WRITE instead of RDMA_READ in the first place? AFAIK, an RDMA read based protocol is the better option for rendezvous communication, according to this. Or does NCCL not use rendezvous at all? Hope you can help me clear up my doubts!
NCCL only does rendezvous. It does not have any eager protocol because there should not be any "unexpected message". Using RDMA_READ would mean adding a round trip of latency, i.e. multiplying the network latency by 3 on our critical path.

When a GPU is ready to send data, the sender CPU is supposed to have already received the information from the receiver about where to put the data. So it only needs to trigger the RDMA_WRITE and the data lands directly in the next GPU's memory. If we were to revert that protocol to use RDMA_READ, the sender CPU would have to notify the receiver that data is ready (1x the latency), then the receiver would initiate an RDMA_READ to get the data, which is itself a round trip (another 2x the latency). So READ always costs one extra latency compared to WRITE, or even two in our case, where we can make sure receives are posted in advance most of the time.

Regarding the paper about READ vs WRITE performance: it dates from 2006, so it was probably run on the first generations of InfiniBand NICs, at SDR speed (10 Gbps) or at best DDR (20 Gbps). There were a few issues that MPI libraries were trying to work around at that time. Fortunately, this is no longer the case and we no longer see a performance difference between READ and WRITE.
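Here is a minimal sketch of that flow, assuming hypothetical helpers `ctrlSend`/`ctrlRecv` for the small control messages and the `postData` helper from the sketch above (none of this is NCCL's actual code): the receiver advertises its registered buffer early, so when the GPU data becomes ready the only critical-path step left is the single RDMA write.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical control-message helpers; NCCL has its own proxy machinery. */
extern void ctrlSend(const void *msg, size_t len);
extern void ctrlRecv(void *msg, size_t len);
/* Hypothetical data-posting helper from the previous sketch. */
extern int postData(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey);

struct clearToSend {      /* advertised by the receiver ahead of time */
  uint64_t remote_addr;   /* where the sender may write the data */
  uint32_t rkey;          /* remote key of the registered receive buffer */
};

/* Receiver side: advertise the buffer as soon as it is registered, so this
 * message is normally off the critical path. */
void receiverAdvertise(void *recvBuf, struct ibv_mr *recvMr) {
  struct clearToSend cts = { (uintptr_t)recvBuf, recvMr->rkey };
  ctrlSend(&cts, sizeof(cts));
}

/* Sender side: by the time the GPU data is ready, the clear-to-send has
 * usually already arrived, so the data movement costs a single one-way trip.
 * With RDMA_READ it would instead be: notify the receiver (1x latency), the
 * receiver issues the read request (2x), the read data comes back (3x). */
void senderPush(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len) {
  struct clearToSend cts;
  ctrlRecv(&cts, sizeof(cts));
  postData(qp, mr, buf, len, cts.remote_addr, cts.rkey);
}
```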
@sjeaugey great explanation! many thanks.
Is this achieved by having the receiver prepare (e.g. register) the dedicated recv buffer and then advertise it to the senders before the rendezvous starts?
The Receive CPU Proxy will start around the same time as (or before) the CUDA kernel is launched. So by the time the GPU starts sending data, it should have had enough time to send the "Clear to send" message to the sender CPU proxy.
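As a rough illustration of that ordering, assuming the hypothetical `receiverAdvertise` helper from the sketch above and a placeholder for the kernel launch (again, not NCCL's proxy code): the receive proxy thread is created around launch time and its first action is to send the clear-to-send, so the message is typically at the sender before the kernel produces any data.

```c
#include <infiniband/verbs.h>
#include <pthread.h>

extern void receiverAdvertise(void *recvBuf, struct ibv_mr *recvMr); /* sketch above */
extern void launchCollectiveKernel(void);  /* placeholder for the CUDA kernel launch */

struct proxyArgs { void *recvBuf; struct ibv_mr *recvMr; };

/* Receive proxy thread: advertise the buffer first, then service completions. */
static void *recvProxyMain(void *p) {
  struct proxyArgs *a = (struct proxyArgs *)p;
  receiverAdvertise(a->recvBuf, a->recvMr);  /* clear-to-send goes out immediately */
  /* ... then poll the completion queue for the incoming RDMA writes ... */
  return NULL;
}

void startReceive(void *recvBuf, struct ibv_mr *recvMr) {
  static struct proxyArgs args;
  pthread_t proxy;
  args.recvBuf = recvBuf;
  args.recvMr = recvMr;
  pthread_create(&proxy, NULL, recvProxyMain, &args);  /* proxy starts first */
  launchCollectiveKernel();  /* kernel launch overlaps with the clear-to-send */
}
```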
@sjeaugey All doubts clear! Thanks again |