Concurrent initialization of communicators? #239

Per pytorch/pytorch#18300 it looks like concurrent initialization of multiple NCCL communicators is not possible. The communicators are completely isolated from one another and use independent values from ncclGetUniqueId; those values are all generated on the same rank, though. Is there anything in NCCL that prevents concurrent initialization? Thanks!
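For concreteness, here is a minimal sketch of the pattern in question (not code from the issue itself): one GPU per process, with several helper threads each initializing their own communicator over the same set of ranks. MPI is assumed only as a convenient way to broadcast the unique IDs; error handling is elided.

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>

#include <thread>
#include <vector>

int main(int argc, char** argv) {
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  int rank = 0, nranks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  cudaSetDevice(0);  // one GPU per process in this scenario

  const int kNumComms = 7;

  // Rank 0 generates one unique ID per communicator; every other rank
  // receives all of them before any initialization starts.
  std::vector<ncclUniqueId> ids(kNumComms);
  if (rank == 0) {
    for (auto& id : ids) ncclGetUniqueId(&id);
  }
  MPI_Bcast(ids.data(), kNumComms * (int)sizeof(ncclUniqueId), MPI_BYTE, 0,
            MPI_COMM_WORLD);

  // One helper thread per communicator; nothing orders the init calls
  // across threads, which is the "concurrent initialization" in question.
  std::vector<ncclComm_t> comms(kNumComms);
  std::vector<std::thread> threads;
  for (int i = 0; i < kNumComms; ++i) {
    threads.emplace_back(
        [&, i] { ncclCommInitRank(&comms[i], nranks, ids[i], rank); });
  }
  for (auto& t : threads) t.join();

  for (auto& c : comms) ncclCommDestroy(c);
  MPI_Finalize();
  return 0;
}
```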
Comments
NCCL does not prevent concurrent initialization of communicators. Looking at the NCCL log in pytorch/pytorch#18300, it seems multiple threads are claiming to be rank 0. How many communicators does the user actually want to create? Is it indeed 7, as shown in the log?
Thanks, Ken. They want to create 7 communicators, and the same process is rank 0 for all of them.
Thanks @pietern for the confirmation.
That's taken care of by each communicator using a different prefix in the key/value store that shares its unique ID with the other ranks. The locking mechanism in that reply doesn't fix the issue; I confirmed the issue exists without that patch applied.
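A minimal sketch of that prefixing scheme, with a process-local map standing in for whatever distributed key/value store the framework actually provides (e.g. PyTorch's TCPStore); the names here are illustrative, and the point is only the per-communicator key prefix:

```cpp
#include <nccl.h>

#include <algorithm>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Stand-in for a real distributed store, just to show the key layout.
static std::map<std::string, std::vector<char>> gStore;
static std::mutex gStoreMutex;

// Each communicator publishes/fetches its unique ID under its own prefix,
// so concurrent initializations never read each other's IDs.
ncclUniqueId bootstrapId(int commIndex, int rank) {
  const std::string key = "comm" + std::to_string(commIndex) + "/uniqueId";
  ncclUniqueId id;
  std::lock_guard<std::mutex> lock(gStoreMutex);
  if (rank == 0) {
    ncclGetUniqueId(&id);
    gStore[key] = std::vector<char>((char*)&id, (char*)&id + sizeof(id));
  } else {
    // A real store would block here until rank 0 has published the key.
    auto& blob = gStore.at(key);
    std::copy(blob.begin(), blob.end(), (char*)&id);
  }
  return id;
}
```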
Hi Pieter, is there any ordering guarantee for the operations issued by those helper threads? To use multiple communicators, one needs to at least make sure collective calls are executed in the same order on the different GPUs (processes, in this case); that's why we don't recommend this practice in general. NCCL comm init is a collective call too. If PyTorch initializes NCCL comms lazily, the init will be followed immediately by a collective operation. If multiple helper threads issue these two operations without ordering, the operations can interleave differently on the two GPUs (processes), causing a hang.
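For illustration, a minimal sketch of the ordering constraint being described, assuming two communicators commA/commB over the same ranks (the names and setup are hypothetical, e.g. as in the first sketch above):

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

// Every rank calls this same function, so the commA collective is always
// enqueued before the commB one; the separate streams keep the two
// communicators independent on the device. If one rank issued these in the
// opposite order, the two sets of kernels could block each other and hang.
void stepAllRanks(float* bufA, float* bufB, size_t count,
                  ncclComm_t commA, ncclComm_t commB,
                  cudaStream_t streamA, cudaStream_t streamB) {
  ncclAllReduce(bufA, bufA, count, ncclFloat, ncclSum, commA, streamA);
  ncclAllReduce(bufB, bufB, count, ncclFloat, ncclSum, commB, streamB);
}
```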
There is no ordering guarantee, only an isolation guarantee. The unique ID will always be generated once per communicator and propagated to the right counterparts on the other processes. It's as if the communicators were created by different processes on the same set of GPUs. Re: your comment on lazy initialization, even though the init is followed by a collective operation, it will be a collective against a particular communicator. We've established that this is fine as long as they are different communicators, use different streams, etc. If the helper threads issue these operations out of order, shouldn't the fact that they use different NCCL unique IDs to initialize ensure that they don't interfere?
"We've established that this is fine, as long as they are different communicators, use different streams, etc." In practice, the cudaEvent operations and the cudaLaunchKernel we do during ncclAllReduce do not currently block (that could change), so launching multiple NCCL kernels in parallel does seem to work, provided you use multiple streams and everything fits on the GPU. But that is not guaranteed to work. However, the operations we do during ncclCommInit, and in particular cudaMalloc, effectively wait for all kernels to complete, so this is a guaranteed hang if threads on one process launch a ncclAllReduce while threads on another process are still in the init phase.
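Concretely, one schedule of the kind described above (two processes P0/P1, one GPU each; comm1 already initialized, comm2 initializing lazily). This is a sketch of the mechanism, not an exact trace:

```
P0: ncclAllReduce(comm1) kernel launched; it spins waiting for P1 to join
P0: ncclCommInitRank(comm2) -> cudaMalloc waits for the spinning kernel
P1: ncclCommInitRank(comm2) -> waits for P0's init to reach the rendezvous
P1: ncclAllReduce(comm1) is queued behind init(comm2) and is never issued

Cycle: P1's allreduce waits on P1's init, which waits on P0's init,
which waits on P0's allreduce kernel, which waits on P1's allreduce.
```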
Thanks for the reminder @sjeaugey, I conveniently forgot about that. So if it works, good for you, but it may stop working in a future version, on different hardware (e.g. with fewer SMs), etc.
This perfectly explains the hang described in the PyTorch issue. Thanks for walking through the underlying problem with me here.
@ckmufeng I guess they are talking about the "implicit synchronization" that cudaMalloc performs, as described above.