
Is it acceptable to use 2 ncclComm_t / 2 streams in 1 process with 1 GPU? #208

Closed
amazingyyc opened this issue Apr 19, 2019 · 16 comments

Comments

@amazingyyc

I want to know whether NCCL supports multiple ncclComm_t / multiple streams in 1 process with 1 GPU.
For example, I use MPI to run a distributed job where every process operates 1 GPU. I want to use 2 threads in every process, and every thread has its own ncclComm_t / CUDA stream to communicate with the corresponding thread of the other process:
process1-thread1 talks with process2-thread1 (via a unique ncclComm),
process1-thread2 with process2-thread2.

In MPI I can use MPI_Comm_split_type to get a sub-communicator. How can I do the same thing in NCCL?

@kwen2501
Contributor

Yes, you can reuse a GPU in multiple communicators.
But you would need to make sure that you post NCCL operations in the same order on all GPUs.
Regarding communicator partitioning, we don't have an API for that yet; you would need to create two communicators for now.
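
As an illustration of "create two communicators for now", here is a minimal sketch (not code from this thread) of how each MPI rank could create two NCCL communicators over the same GPU, assuming MPI is used to broadcast two separate unique IDs; all names are illustrative:

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  cudaSetDevice(0);  /* one GPU per process in this sketch */

  /* Each communicator needs its own unique ID, created on rank 0
     and broadcast to all ranks. */
  ncclUniqueId id_a, id_b;
  if (rank == 0) { ncclGetUniqueId(&id_a); ncclGetUniqueId(&id_b); }
  MPI_Bcast(&id_a, sizeof(id_a), MPI_BYTE, 0, MPI_COMM_WORLD);
  MPI_Bcast(&id_b, sizeof(id_b), MPI_BYTE, 0, MPI_COMM_WORLD);

  /* Two communicators over the same GPU: comm_a for thread 1, comm_b for thread 2. */
  ncclComm_t comm_a, comm_b;
  ncclCommInitRank(&comm_a, nranks, id_a, rank);
  ncclCommInitRank(&comm_b, nranks, id_b, rank);

  /* ... launch threads and post NCCL operations in the same order on all ranks ... */

  ncclCommDestroy(comm_a);
  ncclCommDestroy(comm_b);
  MPI_Finalize();
  return 0;
}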

@amazingyyc
Author

Thanks kwen2501. Yes, I create 2 communicators for the 2 threads in 1 process separately: communicator-1 of process-1 works in thread-1 and only communicates with process-2's communicator-1. But I got an error like:

next-a-gpu-0006:27806:27906 [0] enqueue.cu:197 NCCL WARN Cuda failure 'invalid resource handle'
next-a-gpu-0006:27806:27906 [0] NCCL INFO enqueue.cu:438 -> 1

So I want to know the right way to create 2 communicators in 1 process. I read the NCCL examples and googled it, but found nothing about this.

@amazingyyc
Author

If I remove the NCCL calls from the second thread and only use them in thread 1, my program works well.

@kwen2501
Contributor

Sorry for the late reply. Do you still have a problem with the two-communicator case? If you do, do you mind posting your code here?

@amazingyyc
Author

Thanks kwen2501, I fixed it. The error was not caused by ncclComm; it was about the GPU memory. I fixed it by calling cudaSetDevice before using the GPU memory in each thread. Thanks again.
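
For anyone hitting the same "invalid resource handle" error, a minimal sketch of the per-thread pattern described above, assuming pthreads and one GPU per process (all names are illustrative, not the poster's code):

#include <cuda_runtime.h>
#include <nccl.h>
#include <pthread.h>
#include <stddef.h>

typedef struct {
  ncclComm_t comm;      /* communicator dedicated to this thread */
  cudaStream_t stream;  /* stream dedicated to this thread */
  float* devbuf;        /* device buffer this thread works on */
  size_t count;         /* number of elements in devbuf */
} thread_ctx_t;

static void* worker(void* arg) {
  thread_ctx_t* ctx = (thread_ctx_t*)arg;
  /* The fix described above: call cudaSetDevice in each thread before
     touching GPU memory or issuing CUDA/NCCL calls. */
  cudaSetDevice(0);
  ncclAllReduce(ctx->devbuf, ctx->devbuf, ctx->count, ncclFloat, ncclSum,
                ctx->comm, ctx->stream);
  cudaStreamSynchronize(ctx->stream);
  return NULL;
}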

@kwen2501
Contributor

kwen2501 commented Apr 21, 2019

Also (not related to the problem you were seeing): while using two communicators on one GPU is okay, using two streams for those two communicators may sometimes lead to a hang.

For example, the following code cannot guarantee the two all-reduce operations are scheduled in the same order on the two GPUs:
On GPU 0:
ncclAllReduce(......, comm_a, stream_a); ncclAllReduce(......, comm_b, stream_b);
On GPU 1:
ncclAllReduce(......, comm_a, stream_a); ncclAllReduce(......, comm_b, stream_b);

@amazingyyc
Author

If I do the following, may it also hang?
On process 0, GPU 0:
thread 0: ncclAllReduce(......, comm_a, stream_a); ncclReduce(comm_a, stream_a); cudaMemcpyAsync(stream_a)
thread 1: cudaMemcpyAsync(stream_b); ncclGather(......, comm_b, stream_b); ncclBcast(comm_b, stream_b)

On process 1, GPU 1:
thread 0: ncclAllReduce(......, comm_a, stream_a); ncclReduce(comm_a, stream_a); cudaMemcpyAsync(stream_a)
thread 1: cudaMemcpyAsync(stream_b); ncclGather(......, comm_b, stream_b); ncclBcast(comm_b, stream_b)

@amazingyyc
Author

Also (not related to the problem you were seeing): while using two communicators on one GPU is okay, using two streams for those two communicators may sometimes lead to a hang.

For example, the following code cannot guarantee the two all-reduce operations are scheduled in the same order on the two GPUs:
On GPU 0:
ncclAllReduce(......, comm_a, stream_a); ncclAllReduce(......, comm_b, stream_b);
On GPU 1:
ncclAllReduce(......, comm_a, stream_a); ncclAllReduce(......, comm_b, stream_b);

And I do not understand why this situation would hang. stream_a and stream_b are parallel, and comm_a on GPU 0 always talks with comm_a on GPU 1, right? So in which scenario would this hang?

@kwen2501
Contributor

If I do the following, may it also hang?
On process 0, GPU 0:
thread 0: ncclAllReduce(......, comm_a, stream_a); ncclReduce(comm_a, stream_a); cudaMemcpyAsync(stream_a)
thread 1: cudaMemcpyAsync(stream_b); ncclGather(......, comm_b, stream_b); ncclBcast(comm_b, stream_b)

On process 1, GPU 1:
thread 0: ncclAllReduce(......, comm_a, stream_a); ncclReduce(comm_a, stream_a); cudaMemcpyAsync(stream_a)
thread 1: cudaMemcpyAsync(stream_b); ncclGather(......, comm_b, stream_b); ncclBcast(comm_b, stream_b)

It may.

@kwen2501
Contributor

kwen2501 commented Apr 21, 2019

And I do not understand why this situation would hang. stream_a and stream_b are parallel, and comm_a on GPU 0 always talks with comm_a on GPU 1, right? So in which scenario would this hang?

The CUDA processing model makes no guarantee about the order of execution of operations issued to independent streams. See for example here: https://devtalk.nvidia.com/default/topic/940657/processing-order-with-cuda-streams-in-7-5/
What could happen is that GPU 0 launches AllReduce_a first, then finds that there is no more free compute resource on the device and has to queue AllReduce_b; whereas GPU 1 somehow launches AllReduce_b first, finds no free resource, and queues AllReduce_a. Then there may be a deadlock where each GPU waits for the other to launch the operation that it has already launched itself.

@amazingyyc
Author

Thanks kwen2501. It looks like I have to arrange a specific order to avoid a hang.
Thanks so much.

@kwen2501
Contributor

The easiest way is to use a single stream for those two communicators.
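
A sketch of that suggestion (illustrative names, not code from this thread): issuing both communicators' collectives on one shared stream serializes them, so every rank enqueues them in the same order:

#include <cuda_runtime.h>
#include <nccl.h>
#include <stddef.h>

/* Both communicators share one stream, so the AllReduce on comm_a is always
   enqueued before the one on comm_b on every rank. */
static void allreduce_both(ncclComm_t comm_a, ncclComm_t comm_b,
                           float* buf_a, float* buf_b,
                           size_t count_a, size_t count_b,
                           cudaStream_t stream) {
  ncclAllReduce(buf_a, buf_a, count_a, ncclFloat, ncclSum, comm_a, stream);
  ncclAllReduce(buf_b, buf_b, count_b, ncclFloat, ncclSum, comm_b, stream);
  cudaStreamSynchronize(stream);  /* wait for both collectives to finish */
}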

@amazingyyc
Author

The easiest way is to use a single stream for those two communicators.

It's a good idea. But in my scenario it's a little harder to use one stream.

@amazingyyc
Author

And I do not understand why this situation would hang. stream_a and stream_b are parallel, and comm_a on GPU 0 always talks with comm_a on GPU 1, right? So in which scenario would this hang?

The CUDA processing model makes no guarantee about the order of execution of operations issued to independent streams. See for example here: https://devtalk.nvidia.com/default/topic/940657/processing-order-with-cuda-streams-in-7-5/
What could happen is that GPU 0 launches AllReduce_a first, then finds that there is no more free compute resource on the device and has to queue AllReduce_b; whereas GPU 1 somehow launches AllReduce_b first, finds no free resource, and queues AllReduce_a. Then there may be a deadlock where each GPU waits for the other to launch the operation that it has already launched itself.

By the way, if I can guarantee that the resources used by NCCL (like the thread count per block) are smaller than the GPU's, will that avoid the hang? I checked the docs and found:

NCCL_NTHREADS
The NCCL_NTHREADS variable sets the number of CUDA threads per CUDA block. NCCL will launch one block per communication ring.

Use this variable if you think your GPU clocks are low and you want to increase the number of threads.

You can also use this variable to reduce the number of threads to decrease the GPU workload.

Values accepted
Default is 256.

The values allowed are 64, 128 and 256.

By default NCCL will use 1 block per communication ring, and the maximum number of threads used by NCCL is 256. So 256 is much smaller than the GPU's thread count, right? Then in this case, can the hang be avoided?
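
For reference, NCCL_NTHREADS is an environment variable and is usually set in the launch environment (for example when invoking mpirun). A sketch of setting it from code instead, under the assumption that NCCL reads the variable while the communicator is being initialized, so it must be set before ncclCommInitRank (names are illustrative):

#include <stdlib.h>
#include <nccl.h>

/* Assumption: NCCL picks up NCCL_NTHREADS during communicator initialization,
   so the variable is set before ncclCommInitRank. Allowed values: 64, 128, 256. */
static ncclResult_t init_comm_with_fewer_threads(ncclComm_t* comm, int nranks,
                                                 ncclUniqueId id, int rank) {
  setenv("NCCL_NTHREADS", "64", 1);
  return ncclCommInitRank(comm, nranks, id, rank);
}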

@kwen2501
Contributor

What you said is correct, but that's an "if". If your application has other kernels running on the GPU, they will take up resources as well. Even if there are only NCCL operations, if they are launched in a loop, SM resource constraints can still occur.

@amazingyyc
Author

Thanks @kwen2501, it's really helpful for me. With your help my problem has been fixed.
Thank you again.
