
Why use RDMA write for default IB traffic? #609

Closed
zegao96 opened this issue Dec 9, 2021 · 6 comments


zegao96 commented Dec 9, 2021

I saw the code here:

#define USE_RDMA_WRITE 1

which makes IB traffic default to the WRITE operation instead of READ. Is there any rationale behind this?


sjeaugey commented Dec 9, 2021

The define was there to switch between RDMA_WRITE and SEND, not between RDMA_WRITE and RDMA_READ. To use RDMA_READ we'd need to invert the way the code works.
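For illustration, here is a minimal sketch of what such a compile-time switch could look like in libibverbs terms. This is not NCCL's actual code; `set_send_opcode` is a hypothetical helper, and only the two opcodes come from the description above.

```c
#include <infiniband/verbs.h>

#define USE_RDMA_WRITE 1

/* Hypothetical helper: pick the opcode for an outgoing work request.
 * The remote address and rkey are only meaningful on the WRITE path;
 * a SEND targets whatever receive buffer the peer has posted. */
static void set_send_opcode(struct ibv_send_wr *wr,
                            uint64_t remote_addr, uint32_t rkey)
{
#if USE_RDMA_WRITE
    /* One-sided write; the _WITH_IMM variant still consumes a receive WR
     * on the peer, so the receiver gets a completion when data lands. */
    wr->opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr->wr.rdma.remote_addr = remote_addr;
    wr->wr.rdma.rkey = rkey;
#else
    /* Two-sided send; the receiver must have posted a matching buffer. */
    wr->opcode = IBV_WR_SEND;
    (void)remote_addr;
    (void)rkey;
#endif
}
```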


zegao96 commented Dec 10, 2021

@sjeaugey thank you for the quick reply! Yes, I understand what the switch does; my bad for not being clear. What I meant is that once USE_RDMA_WRITE is set, ncclIbIsend ends up using RDMA_WRITE here:

wr[0].opcode = IBV_WR_RDMA_WRITE_WITH_IMM;

Is there any reason for choosing RDMA_WRITE instead of RDMA_READ in the first place? AFAIK, an RDMA-read-based protocol is the better option for rendezvous communication, according to this. Or is NCCL not involved with rendezvous at all? I hope you can help clear up my doubts!

sjeaugey commented

NCCL only does rendezvous. It does not have any eager protocol, because there should not be any "unexpected messages". Using RDMA_READ would mean adding a round trip of latency, i.e., multiplying the network latency by 3 on our critical path.

When a GPU is ready to send data, the sender CPU is supposed to have already received information from the receiver about where to put the data. So it only needs to trigger the RDMA_WRITE, and the data lands directly in the next GPU's memory.

If we were to invert that protocol to use RDMA_READ, the sender CPU would need to notify the receiver that data is ready (1x the latency), and the receiver would then initiate an RDMA_READ to get the data (2x the latency).

So READ always adds one extra latency over WRITE, or even two in our case, where we can make sure receives are posted in advance most of the time.
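To make the write path concrete, here is a hedged sketch of the sender side of that rendezvous using plain libibverbs. `struct cts_msg`, `post_rendezvous_write`, and the use of `imm_data` to carry the size are illustrative assumptions, not NCCL internals:

```c
#include <arpa/inet.h>          /* htonl */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* "Clear to send": the receiver tells the sender where it may write. */
struct cts_msg {
    uint64_t remote_addr;       /* destination buffer on the receiver */
    uint32_t rkey;              /* remote key for that registration */
};

/* Once the CTS has arrived (usually before the GPU has even produced the
 * data), a single one-sided write moves the payload: the data crosses the
 * network exactly once. A READ-based design would first need a "data
 * ready" notification and then the READ round trip. */
static int post_rendezvous_write(struct ibv_qp *qp, struct ibv_mr *mr,
                                 void *buf, uint32_t len,
                                 const struct cts_msg *cts)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.imm_data   = htonl(len);  /* e.g. tell the receiver how much landed */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.wr.rdma.remote_addr = cts->remote_addr;  /* taken from the CTS */
    wr.wr.rdma.rkey        = cts->rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```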

Regarding the paper about READ vs. WRITE performance: it dates from 2006, so it was probably run on the first generations of InfiniBand NICs, at SDR speed (10 Gb/s) or at best DDR (20 Gb/s). There were a few issues that MPI libraries were trying to work around at the time. Fortunately, that is no longer the case, and we no longer see a performance difference between READ and WRITE.


zegao96 commented Dec 13, 2021

@sjeaugey great explanation! Many thanks.

> When a GPU is ready to send data, the sender CPU is supposed to have already received information from the receiver about where to put the data.

Is this achieved by having the receiver prepare (e.g., register) a dedicated receive buffer and then advertise it to the senders before the rendezvous starts?

sjeaugey commented

The receive CPU proxy starts around the same time as (or before) the CUDA kernel is launched. So by the time the GPU starts sending data, the proxy should have had enough time to send the "clear to send" message to the sender's CPU proxy.
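A rough sketch of the receiver-proxy side of that exchange, again with plain libibverbs; `advertise_recv_buffer` and `send_ctrl_msg` are hypothetical stand-ins for NCCL's proxy machinery, and real code would register buffers once and reuse them rather than registering per message:

```c
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

struct cts_msg {
    uint64_t remote_addr;
    uint32_t rkey;
};

/* Hypothetical control-path helper, e.g. a small SEND on a control QP. */
int send_ctrl_msg(struct ibv_qp *ctrl_qp, const void *msg, size_t len);

/* Runs on the receive CPU proxy around (or before) CUDA kernel launch, so
 * the CTS is already at the sender by the time the GPU has data to push. */
static int advertise_recv_buffer(struct ibv_qp *ctrl_qp, struct ibv_pd *pd,
                                 void *buf, size_t len)
{
    /* Make the destination buffer remotely writable. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return -1;

    struct cts_msg cts = {
        .remote_addr = (uintptr_t)buf,
        .rkey        = mr->rkey,
    };
    return send_ctrl_msg(ctrl_qp, &cts, sizeof(cts));
}
```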


zegao96 commented Dec 14, 2021

@sjeaugey All doubts cleared! Thanks again.

zegao96 closed this as completed on Dec 14, 2021.