
Is it acceptable to use 2 ncclComm_t / 2 streams in 1 process with 1 GPU? #208

Closed
amazingyyc opened this issue Apr 19, 2019 · 16 comments

Comments

@amazingyyc

I want to know whether NCCL supports multiple ncclComm_t / multiple streams in 1 process with 1 GPU.
For example, I use MPI to run a distributed job where every process operates 1 GPU. I want to use 2 threads in every process, and every thread has its own ncclComm_t / CUDA stream to communicate with the corresponding thread of the other process:
process1-thread1 talks with process2-thread1 (via a unique ncclComm),
process1-thread2 with process2-thread2.

In MPI I can use MPI_Comm_split_type to get a sub-communicator. How can I do the same thing in NCCL?

@kwen2501
Contributor

Yes, you can reuse a GPU in multiple communicators.
But you would need to make sure that you post NCCL operations in the same order on all GPUs.
Regarding communicator partitioning, we don't have an API for that yet; you would need to create two communicators for now.
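
As an illustration of "create two communicators for now", here is a minimal sketch (not code from this thread) of how each MPI rank could create two NCCL communicators over the same GPU, assuming MPI is used to broadcast two separate unique IDs; all names are illustrative:

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  cudaSetDevice(0);  /* one GPU per process in this sketch */

  /* Each communicator needs its own unique ID, created on rank 0
     and broadcast to all ranks. */
  ncclUniqueId id_a, id_b;
  if (rank == 0) { ncclGetUniqueId(&id_a); ncclGetUniqueId(&id_b); }
  MPI_Bcast(&id_a, sizeof(id_a), MPI_BYTE, 0, MPI_COMM_WORLD);
  MPI_Bcast(&id_b, sizeof(id_b), MPI_BYTE, 0, MPI_COMM_WORLD);

  /* Two communicators over the same GPU: comm_a for thread 1, comm_b for thread 2. */
  ncclComm_t comm_a, comm_b;
  ncclCommInitRank(&comm_a, nranks, id_a, rank);
  ncclCommInitRank(&comm_b, nranks, id_b, rank);

  /* ... launch threads and post NCCL operations in the same order on all ranks ... */

  ncclCommDestroy(comm_a);
  ncclCommDestroy(comm_b);
  MPI_Finalize();
  return 0;
}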

@amazingyyc
Author

Thanks kwen2501. Yes, I create 2 communicators for the 2 threads in 1 process separately: communicator-1 of process-1 works in thread-1 and only communicates with process-2's communicator-1. But I got an error like:

next-a-gpu-0006:27806:27906 [0] enqueue.cu:197 NCCL WARN Cuda failure 'invalid resource handle'
next-a-gpu-0006:27806:27906 [0] NCCL INFO enqueue.cu:438 -> 1

So I want to know the right way to create 2 communicators in 1 process. I read the NCCL examples and googled it, but found nothing about this.

@amazingyyc
Author

If I remove the NCCL calls from the second thread and only use them in thread 1, my program works well.

@kwen2501
Contributor

Sorry for the late reply. Do you still have a problem with the two-communicator case? If you do, do you mind posting your code here?

@amazingyyc
Author

Thanks kwen2501, I fixed it. The error was not caused by ncclComm; it was about the GPU memory. I fixed it by calling cudaSetDevice before using the GPU memory in each thread. Thanks again.
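
For anyone hitting the same "invalid resource handle" error, a minimal sketch of the per-thread pattern described above, assuming pthreads and one GPU per process (all names are illustrative, not the poster's code):

#include <cuda_runtime.h>
#include <nccl.h>
#include <pthread.h>
#include <stddef.h>

typedef struct {
  ncclComm_t comm;      /* communicator dedicated to this thread */
  cudaStream_t stream;  /* stream dedicated to this thread */
  float* devbuf;        /* device buffer this thread works on */
  size_t count;         /* number of elements in devbuf */
} thread_ctx_t;

static void* worker(void* arg) {
  thread_ctx_t* ctx = (thread_ctx_t*)arg;
  /* The fix described above: call cudaSetDevice in each thread before
     touching GPU memory or issuing CUDA/NCCL calls. */
  cudaSetDevice(0);
  ncclAllReduce(ctx->devbuf, ctx->devbuf, ctx->count, ncclFloat, ncclSum,
                ctx->comm, ctx->stream);
  cudaStreamSynchronize(ctx->stream);
  return NULL;
}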

@kwen2501
Contributor

kwen2501 commented Apr 21, 2019

Also (not related to the problem you were seeing): while using two communicators on one GPU is okay, using two streams for those two communicators may sometimes lead to a hang.

For example, the following code cannot guarantee the two all-reduce operations are scheduled in the same order on the two GPUs:
On GPU 0:
ncclAllReduce(......, comm_a, stream_a); ncclAllReduce(......, comm_b, stream_b);
On GPU 1:
ncclAllReduce(......, comm_a, stream_a); ncclAllReduce(......, comm_b, stream_b);

@amazingyyc
Author

If I do the following, may it also hang?
On process 0, GPU 0:
thread 0: ncclAllReduce(......, comm_a, stream_a); ncclReduce(comm_a, stream_a); cudaMemcpyAsync(stream_a)
thread 1: cudaMemcpyAsync(stream_b); ncclGather(......, comm_b, stream_b); ncclBcast(comm_b, stream_b)

On process 1, GPU 1:
thread 0: ncclAllReduce(......, comm_a, stream_a); ncclReduce(comm_a, stream_a); cudaMemcpyAsync(stream_a)
thread 1: cudaMemcpyAsync(stream_b); ncclGather(......, comm_b, stream_b); ncclBcast(comm_b, stream_b)

@amazingyyc
Author

Also (not related to the problem you were seeing): while using two communicators on one GPU is okay, using two streams for those two communicators may sometimes lead to a hang.

For example, the following code cannot guarantee the two all-reduce operations are scheduled in the same order on the two GPUs:
On GPU 0:
ncclAllReduce(......, comm_a, stream_a); ncclAllReduce(......, comm_b, stream_b);
On GPU 1:
ncclAllReduce(......, comm_a, stream_a); ncclAllReduce(......, comm_b, stream_b);

And I do not understand why this situation would hang. stream_a and stream_b are parallel, and comm_a on GPU 0 always talks with comm_a on GPU 1, right? So in which scenario would this hang?

@kwen2501
Contributor

If I do the following, may it also hang?
On process 0, GPU 0:
thread 0: ncclAllReduce(......, comm_a, stream_a); ncclReduce(comm_a, stream_a); cudaMemcpyAsync(stream_a)
thread 1: cudaMemcpyAsync(stream_b); ncclGather(......, comm_b, stream_b); ncclBcast(comm_b, stream_b)

On process 1, GPU 1:
thread 0: ncclAllReduce(......, comm_a, stream_a); ncclReduce(comm_a, stream_a); cudaMemcpyAsync(stream_a)
thread 1: cudaMemcpyAsync(stream_b); ncclGather(......, comm_b, stream_b); ncclBcast(comm_b, stream_b)

It may.

@kwen2501
Contributor

kwen2501 commented Apr 21, 2019

And I do not understand why this situation would hang. stream_a and stream_b are parallel, and comm_a on GPU 0 always talks with comm_a on GPU 1, right? So in which scenario would this hang?

The CUDA processing model makes no guarantee about the order of execution of operations issued to independent streams. See for example here: https://devtalk.nvidia.com/default/topic/940657/processing-order-with-cuda-streams-in-7-5/
What could happen is that GPU 0 launches AllReduce_a first, then finds that there is no more free compute resource on the device and has to queue AllReduce_b; whereas GPU 1 somehow launches AllReduce_b first, finds no free resource, and queues AllReduce_a. Then there may be a deadlock where each GPU waits for the other to launch the operation that it has already launched itself.

@amazingyyc
Author

Thanks kwen2501. It looks like I have to arrange a specific order to avoid a hang.
Thanks so much.

@kwen2501
Contributor

The easiest way is to use a single stream for those two communicators.
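
A sketch of that suggestion (illustrative names, not code from this thread): issuing both communicators' collectives on one shared stream serializes them, so every rank enqueues them in the same order:

#include <cuda_runtime.h>
#include <nccl.h>
#include <stddef.h>

/* Both communicators share one stream, so the AllReduce on comm_a is always
   enqueued before the one on comm_b on every rank. */
static void allreduce_both(ncclComm_t comm_a, ncclComm_t comm_b,
                           float* buf_a, float* buf_b,
                           size_t count_a, size_t count_b,
                           cudaStream_t stream) {
  ncclAllReduce(buf_a, buf_a, count_a, ncclFloat, ncclSum, comm_a, stream);
  ncclAllReduce(buf_b, buf_b, count_b, ncclFloat, ncclSum, comm_b, stream);
  cudaStreamSynchronize(stream);  /* wait for both collectives to finish */
}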

@amazingyyc
Author

The easiest way is to use a single stream for those two communicators.

It's a good idea. But in my scenario it's a little harder to use one stream.

@amazingyyc
Author

And I do not understand why this situation would hang. stream_a and stream_b are parallel, and comm_a on GPU 0 always talks with comm_a on GPU 1, right? So in which scenario would this hang?

The CUDA processing model makes no guarantee about the order of execution of operations issued to independent streams. See for example here: https://devtalk.nvidia.com/default/topic/940657/processing-order-with-cuda-streams-in-7-5/
What could happen is that GPU 0 launches AllReduce_a first, then finds that there is no more free compute resource on the device and has to queue AllReduce_b; whereas GPU 1 somehow launches AllReduce_b first, finds no free resource, and queues AllReduce_a. Then there may be a deadlock where each GPU waits for the other to launch the operation that it has already launched itself.

By the way, if I can guarantee that the resources used by NCCL (like the thread count per block) are smaller than the GPU's, will that avoid the hang? I checked the docs and found:

NCCL_NTHREADS
The NCCL_NTHREADS variable sets the number of CUDA threads per CUDA block. NCCL will launch one block per communication ring.

Use this variable if you think your GPU clocks are low and you want to increase the number of threads.

You can also use this variable to reduce the number of threads to decrease the GPU workload.

Values accepted
Default is 256.

The values allowed are 64, 128 and 256.

By default NCCL will use 1 block per communication ring, and the maximum number of threads used by NCCL is 256. So 256 is much smaller than the GPU's thread count, right? Then in this case, can the hang be avoided?
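
For reference, NCCL_NTHREADS is an environment variable and is usually set in the launch environment (for example when invoking mpirun). A sketch of setting it from code instead, under the assumption that NCCL reads the variable while the communicator is being initialized, so it must be set before ncclCommInitRank (names are illustrative):

#include <stdlib.h>
#include <nccl.h>

/* Assumption: NCCL picks up NCCL_NTHREADS during communicator initialization,
   so the variable is set before ncclCommInitRank. Allowed values: 64, 128, 256. */
static ncclResult_t init_comm_with_fewer_threads(ncclComm_t* comm, int nranks,
                                                 ncclUniqueId id, int rank) {
  setenv("NCCL_NTHREADS", "64", 1);
  return ncclCommInitRank(comm, nranks, id, rank);
}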

@kwen2501
Contributor

What you said is correct, but that's an "if". If your application has other kernels running on the GPU, they will take up resources as well. Even if there are only NCCL operations, if they are launched in a loop, SM resource constraints can still occur.

@amazingyyc
Author

Thanks @kwen2501, it's really helpful for me. With your help my problem has been fixed.
Thank you again.
