Point-to-point communications in NCCL? #212

Open
ktnyt opened this issue Apr 24, 2019 · 13 comments

Comments

@ktnyt

ktnyt commented Apr 24, 2019

At GTC last month I attended the session by Mr. J. Kraus on multi-GPU programming, and I heard from him that there were plans for point-to-point communication support in NCCL, and that nudging the development team with an issue might help get their attention.
While I did feel the idea was somewhat controversial (since this is a collective communication library), it would be great if point-to-point communication were indeed supported. However, I also feel this is a niche request, so I wouldn't expect it to roll out anytime soon.
Has there been any discussion on supporting point-to-point communication? And if so, is there a roadmap for it? Any response would be very helpful.
Thanks in advance.

@sjeaugey
Member

We've had plans to implement point-to-point communication for some time now, in the form of two new primitives: ncclSend and ncclRecv. Then, by combining ncclSend and ncclRecv with ncclGroupStart and ncclGroupEnd, users could implement any Alltoall, Scatter, Gather, or neighbor collective.

So it would look like point-to-point, but with the idea of implementing a collective alltoallv operation within the NCCL communicator -- which would follow the same rules as the current collective operations, i.e. operations are serialized on the communicator.

The main difference with MPI is that this is still a blocking call on the GPU side; there is no Isend/Irecv.

We are indeed interested in hearing about use cases, to determine precisely whether users need blocking send/receive operations (alltoallv) or send/receive that is asynchronous with respect to collective operations (which NCCL cannot provide due to CUDA kernel semantics).
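
To make that concrete, here is a minimal sketch of an alltoall composed from the proposed primitives, assuming ncclSend/ncclRecv end up taking the same (buffer, count, datatype, peer, comm, stream) arguments as the existing collectives:

```c
// Minimal sketch of an alltoall composed from the proposed primitives.
// Assumption: ncclSend/ncclRecv take (buffer, count, datatype, peer, comm, stream)
// like the existing collectives, and grouping fuses them into one blocking op.
#include <nccl.h>
#include <cuda_runtime.h>

void alltoall(const float *sendbuf, float *recvbuf, size_t count,
              int nranks, ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  for (int peer = 0; peer < nranks; peer++) {
    // Send the slice destined for `peer` and receive the slice coming from it.
    ncclSend(sendbuf + peer * count, count, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuf + peer * count, count, ncclFloat, peer, comm, stream);
  }
  ncclGroupEnd();  // all sends/recvs complete as one operation on the stream
}
```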

@ktnyt
Author

ktnyt commented May 24, 2019

Thank you for the feedback! And apologies for not replying sooner.

The current plan sounds feasible for our use case, since we do not need non-blocking operations for the time being.
We have been developing an algorithm that uses MPI's blocking point-to-point communication and broadcast functionality, and we wanted to see whether we could transition it to multiple GPUs.

@Tixxx

Tixxx commented Jun 14, 2019

Hi, I'm not sure whether the work has already started, since this thread was opened more than a month ago. But we are also looking for ways to do direct point-to-point communication using the NCCL library, to support one of our distributed training algorithms. The algorithm is a pair-wise binary tree reduction, which we have already implemented using blocking MPI send/recv; we believe NCCL would deliver an even better performance boost, and blocking ncclSend and ncclRecv would be sufficient for our use case. So I'm really looking forward to hearing about the roadmap for this feature. Please let me know whether this has been planned. Thanks in advance!
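
For reference, here is a rough sketch of the kind of pair-wise binary tree reduction described above, written with blocking MPI calls; the power-of-two rank count and the buffer handling are simplifying assumptions for illustration:

```c
// Rough sketch of a pair-wise binary tree reduction using blocking MPI calls.
// Assumes the number of ranks is a power of two; the buffer length `n` and
// the use of summation are placeholders.
#include <mpi.h>

void tree_reduce(float *buf, float *tmp, int n, int rank, int nranks) {
  for (int step = 1; step < nranks; step <<= 1) {
    if (rank & step) {
      // Hand the partial result to the partner and drop out of the tree.
      MPI_Send(buf, n, MPI_FLOAT, rank - step, 0, MPI_COMM_WORLD);
      return;
    } else if (rank + step < nranks) {
      // Receive the partner's partial result and accumulate it locally.
      MPI_Recv(tmp, n, MPI_FLOAT, rank + step, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      for (int i = 0; i < n; i++) buf[i] += tmp[i];
    }
  }
  // Rank 0 ends up holding the fully reduced result.
}
```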

@Tixxx

Tixxx commented Jun 25, 2019

Hi @sjeaugey Could you provide any insight on the plan to support blocking NCCL send and recv? Looking forward to having some collaboration with NCCL devs. Thanks!

@sjeaugey
Member

Hi @Tixxx. I don't see Send/Recv coming in the near future, as we are still focusing on allreduce and its variants, and this is a large feature that needs a significant amount of work, with a lot of preparation and refactoring to be done first. For example (and among other things), we are trying to rewrite the topology detection and ring/tree creation to make it less ring-focused and more general, which is one of the steps needed before we can start on point-to-point.

@nevion

nevion commented Feb 24, 2020

hi @sjeaugey - the lack of point-to-point communication in a high-level library like NCCL affects applications I work on that don't need collectives, just integrated, efficient data transfer (glorified memcpy) across GPUs, from the inter-thread to the inter-node case (with GPUDirect support). Messaging semantics would be nice at times as well, but the need there is lesser than for an RMA-like operation.

Is there any sort of timeline, or actively worked-on items, in support of the point-to-point communication pattern? Your messages here and in #270 indicate significant redesigns are needed first, which makes me think it's a good year or more out - and that doesn't work for me. Is there anything that can be done to make it work in the next few months? The worst part is that I really want to give NCCL a try, but none of the existing operations seem like a workable fit for my problem: simulation halo exchanges, a fairly common application.
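
For example, a 1D halo exchange along those lines could look something like the sketch below, assuming grouped ncclSend/ncclRecv with collective-style signatures; the neighbor ranks and the field layout are placeholders:

```c
// Illustrative 1D halo exchange using the hoped-for grouped send/recv.
// Assumptions: ncclSend/ncclRecv with collective-style signatures; `left` and
// `right` are neighbor ranks (negative means no neighbor); the field layout is
// [left ghost | interior | right ghost], all sizes in elements (placeholders).
#include <nccl.h>
#include <cuda_runtime.h>

void halo_exchange(float *field, size_t interior, size_t halo,
                   int left, int right, ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  if (left >= 0) {
    ncclSend(field + halo, halo, ncclFloat, left, comm, stream);      // first interior slab
    ncclRecv(field, halo, ncclFloat, left, comm, stream);             // fill left ghost cells
  }
  if (right >= 0) {
    ncclSend(field + interior, halo, ncclFloat, right, comm, stream); // last interior slab
    ncclRecv(field + interior + halo, halo, ncclFloat, right,
             comm, stream);                                           // fill right ghost cells
  }
  ncclGroupEnd();  // the whole exchange runs as one blocking operation
}
```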

@sjeaugey
Member

Hi @nevion, hopefully this will arrive sooner than that. We are actively working on it now; the goal is to post a preview branch late next month, so that users can give it a try and provide feedback. Would that work for you?

@nevion

nevion commented Feb 24, 2020

@sjeaugey yes, that does indeed work for me.

@victoryang00

Looking forward to using the P2P functionality to make my project more capable!

@2sin18

2sin18 commented Mar 16, 2020

Any progress on this issue?

gather/scatter/alltoall are important to recommendation models (e.g. https://github.com/facebookresearch/dlrm/blob/master/dlrm_s_pytorch.py#L426), which still cannot utilize the GPU very well today.

@sjeaugey
Member

The p2p preview has been posted to the "p2p" branch. And PR #316 has been created for discussion / feedback.

@victoryang00

e "p2p" branch. And PR #316 has been created for discussion / feedback.

Thanks, that helps a lot.

sjeaugey linked a pull request Mar 31, 2020 that will close this issue
@2sin18

2sin18 commented Apr 3, 2020

The p2p preview has been posted to the "p2p" branch. And PR #316 has been created for discussion / feedback.

Great job!
