
Why use RDMA write for default IB traffic? #609

Closed
zegao96 opened this issue Dec 9, 2021 · 6 comments


zegao96 commented Dec 9, 2021

I saw the code here:

#define USE_RDMA_WRITE 1

which makes IB traffic default to the WRITE operation instead of READ. Is there any rationale behind this?


sjeaugey commented Dec 9, 2021

The define was there to switch between RDMA_WRITE and SEND, not between RDMA_WRITE and RDMA_READ. To use RDMA_READ we'd need to invert the way the code works.
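For illustration, here is a minimal sketch of what such a compile-time switch could look like in libibverbs terms. This is not NCCL's actual code; `set_send_opcode` is a hypothetical helper, and only the two opcodes come from the description above.

```c
#include <infiniband/verbs.h>

#define USE_RDMA_WRITE 1

/* Hypothetical helper: pick the opcode for an outgoing work request.
 * The remote address and rkey are only meaningful on the WRITE path;
 * a SEND targets whatever receive buffer the peer has posted. */
static void set_send_opcode(struct ibv_send_wr *wr,
                            uint64_t remote_addr, uint32_t rkey)
{
#if USE_RDMA_WRITE
    /* One-sided write; the _WITH_IMM variant still consumes a receive WR
     * on the peer, so the receiver gets a completion when data lands. */
    wr->opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr->wr.rdma.remote_addr = remote_addr;
    wr->wr.rdma.rkey = rkey;
#else
    /* Two-sided send; the receiver must have posted a matching buffer. */
    wr->opcode = IBV_WR_SEND;
    (void)remote_addr;
    (void)rkey;
#endif
}
```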


zegao96 commented Dec 10, 2021

@sjeaugey thank you for the quick reply! Yes, I understand what the switch does; my bad for not being clear. What I meant is that once USE_RDMA_WRITE is set, ncclIbIsend ends up using RDMA_WRITE here:

wr[0].opcode = IBV_WR_RDMA_WRITE_WITH_IMM;

Is there any reason for choosing RDMA_WRITE instead of RDMA_READ in the first place? AFAIK, an RDMA-read-based protocol is the better option for rendezvous communication, according to this. Or is NCCL not involved with rendezvous at all? I hope you can help clear up my doubts!

sjeaugey commented

NCCL only does rendezvous. It does not have any eager protocol, because there should not be any "unexpected messages". Using RDMA_READ would mean adding a round trip of latency, i.e., multiplying the network latency by 3 on our critical path.

When a GPU is ready to send data, the sender CPU is supposed to have already received information from the receiver about where to put the data. So it only needs to trigger the RDMA_WRITE, and the data lands directly in the next GPU's memory.

If we were to invert that protocol to use RDMA_READ, the sender CPU would need to notify the receiver that data is ready (1x the latency), and the receiver would then initiate an RDMA_READ to get the data (2x the latency).

So READ always adds one extra latency over WRITE, or even two in our case, where we can make sure receives are posted in advance most of the time.
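To make the write path concrete, here is a hedged sketch of the sender side of that rendezvous using plain libibverbs. `struct cts_msg`, `post_rendezvous_write`, and the use of `imm_data` to carry the size are illustrative assumptions, not NCCL internals:

```c
#include <arpa/inet.h>          /* htonl */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* "Clear to send": the receiver tells the sender where it may write. */
struct cts_msg {
    uint64_t remote_addr;       /* destination buffer on the receiver */
    uint32_t rkey;              /* remote key for that registration */
};

/* Once the CTS has arrived (usually before the GPU has even produced the
 * data), a single one-sided write moves the payload: the data crosses the
 * network exactly once. A READ-based design would first need a "data
 * ready" notification and then the READ round trip. */
static int post_rendezvous_write(struct ibv_qp *qp, struct ibv_mr *mr,
                                 void *buf, uint32_t len,
                                 const struct cts_msg *cts)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.imm_data   = htonl(len);  /* e.g. tell the receiver how much landed */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.wr.rdma.remote_addr = cts->remote_addr;  /* taken from the CTS */
    wr.wr.rdma.rkey        = cts->rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```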

Regarding the paper about READ vs. WRITE performance: it dates from 2006, so it was probably run on the first generations of InfiniBand NICs, at SDR speed (10 Gb/s) or at best DDR (20 Gb/s). There were a few issues that MPI libraries were trying to work around at the time. Fortunately, that is no longer the case, and we no longer see a performance difference between READ and WRITE.


zegao96 commented Dec 13, 2021

@sjeaugey great explanation! Many thanks.

> When a GPU is ready to send data, the sender CPU is supposed to have already received information from the receiver about where to put the data.

Is this achieved by having the receiver prepare (e.g., register) a dedicated receive buffer and then advertise it to the senders before the rendezvous starts?

sjeaugey commented

The receive CPU proxy starts around the same time as (or before) the CUDA kernel is launched. So by the time the GPU starts sending data, the proxy should have had enough time to send the "clear to send" message to the sender's CPU proxy.
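A rough sketch of the receiver-proxy side of that exchange, again with plain libibverbs; `advertise_recv_buffer` and `send_ctrl_msg` are hypothetical stand-ins for NCCL's proxy machinery, and real code would register buffers once and reuse them rather than registering per message:

```c
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

struct cts_msg {
    uint64_t remote_addr;
    uint32_t rkey;
};

/* Hypothetical control-path helper, e.g. a small SEND on a control QP. */
int send_ctrl_msg(struct ibv_qp *ctrl_qp, const void *msg, size_t len);

/* Runs on the receive CPU proxy around (or before) CUDA kernel launch, so
 * the CTS is already at the sender by the time the GPU has data to push. */
static int advertise_recv_buffer(struct ibv_qp *ctrl_qp, struct ibv_pd *pd,
                                 void *buf, size_t len)
{
    /* Make the destination buffer remotely writable. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return -1;

    struct cts_msg cts = {
        .remote_addr = (uintptr_t)buf,
        .rkey        = mr->rkey,
    };
    return send_ctrl_msg(ctrl_qp, &cts, sizeof(cts));
}
```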


zegao96 commented Dec 14, 2021

@sjeaugey All doubts cleared! Thanks again.

zegao96 closed this as completed on Dec 14, 2021.