does NCCL all reduce on two streams block each other? #217

Closed
kingder opened this issue May 7, 2019 · 9 comments


kingder commented May 7, 2019

My setup is 16 GPUs on 2 nodes with 8 GPUs each: 16 processes, each using 1 GPU and two streams (the default stream and one non-blocking stream). I run all-reduce on the two streams concurrently; the reduce order within each stream is fixed across all the processes, but operations may overlap between the two streams. I expected that NCCL kernels wouldn't take many GPU resources and could run concurrently; however, I observe hangs with this setup and ncclAllReduce blocks. Is that normal?
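Roughly, each process does something like the following (a sketch; buffer names and counts are illustrative):

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Per process: one GPU, the default stream plus one non-blocking stream.
    cudaStream_t stream_nb;
    cudaStreamCreateWithFlags(&stream_nb, cudaStreamNonBlocking);

    // The order within each stream is the same on every process:
    ncclAllReduce(sendA, recvA, countA, ncclFloat, ncclSum, comm, 0);          // default stream
    ncclAllReduce(sendB, recvB, countB, ncclFloat, ncclSum, comm, stream_nb);  // non-blocking stream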

kwen2501 (Contributor) commented May 7, 2019

Without the per-thread default stream option, the default stream is a special stream which implicitly synchronizes with all other streams on the device.

Please refer to "The Default Stream" section here:
https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

The Default Stream

The default stream is useful where concurrency is not crucial to performance. Before CUDA 7, each device has a single default stream used for all host threads, which causes implicit synchronization. As the section “Implicit Synchronization” in the CUDA C Programming Guide explains, two commands from different streams cannot run concurrently if the host thread issues any CUDA command to the default stream between them.

CUDA 7 introduces a new option, the per-thread default stream, that has two effects. First, it gives each host thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently. Second, these default streams are regular streams. This means that commands in the default stream may run concurrently with commands in non-default streams.

To enable per-thread default streams in CUDA 7 and later, you can either compile with the nvcc command-line option --default-stream per-thread, or #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before including CUDA headers (cuda.h or cuda_runtime.h). It is important to note: you cannot use #define CUDA_API_PER_THREAD_DEFAULT_STREAM to enable this behavior in a .cu file when the code is compiled by nvcc because nvcc implicitly includes cuda_runtime.h at the top of the translation unit.
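Concretely, a sketch of the two options described above:

    /* Option 1: compile with nvcc --default-stream per-thread */

    /* Option 2: in a file compiled by the host compiler (not by nvcc),
       define the macro before including the CUDA headers: */
    #define CUDA_API_PER_THREAD_DEFAULT_STREAM
    #include <cuda_runtime.h>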

kwen2501 (Contributor) commented May 7, 2019

Also, are you sure all the ncclAllReduce calls -- when those on the default stream and those on the non-blocking stream are viewed as a whole -- are in the same order across all the processes? You only mentioned that "the reduce order on each stream is fixed." The inter-stream order is also important here.
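To illustrate (a sketch; buffer and op names are made up), "in the same order when viewed as a whole" means every rank issues the same global sequence:

    // Every rank issues op A, then op B:
    ncclAllReduce(sendA, recvA, countA, ncclFloat, ncclSum, comm, 0);          // op A, default stream
    ncclAllReduce(sendB, recvB, countB, ncclFloat, ncclSum, comm, stream_nb);  // op B, non-blocking stream
    // If some rank issues op B before op A, the inter-stream order
    // differs across ranks, which can cause a hang.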

sjeaugey (Member) commented May 7, 2019

Also please refer to my comment here: #195 (comment).

There is no guarantee that two NCCL operations will be able to make progress concurrently. So you need to make sure things will still work even if one blocks the other.
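For instance, one way to stay safe (a sketch, assuming two streams stream_a and stream_b and illustrative buffers) is to order the two operations explicitly with an event, so correctness never depends on concurrent progress:

    cudaEvent_t done_a;
    cudaEventCreateWithFlags(&done_a, cudaEventDisableTiming);

    ncclAllReduce(sendA, recvA, countA, ncclFloat, ncclSum, comm, stream_a);
    cudaEventRecord(done_a, stream_a);
    // Make stream_b wait until the first all-reduce has completed:
    cudaStreamWaitEvent(stream_b, done_a, 0);
    ncclAllReduce(sendB, recvB, countB, ncclFloat, ncclSum, comm, stream_b);

This serializes the two all-reduces, trading potential overlap for a guarantee of progress.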

kingder (Author) commented May 8, 2019

Thanks for the reply.

@kwen2501 I thought the non-blocking stream would be an exception here, right? From the More Tips section in your link:

You can create non-blocking streams which do not synchronize with the legacy default stream by passing the cudaStreamNonBlocking flag to cudaStreamCreateWithFlags().

And the inter-stream order is not the same; since the two streams don't block each other, I thought they would make progress independently.

@sjeaugey that helps, though I'm still confused about why it doesn't work. Could you elaborate on what the "no guarantee" means?

  • Does it mean that even if there are enough resources for two NCCL kernels on two different streams, and there are no other implicit sync points in between, there is still a chance of a hang?
  • Or does it mean NCCL makes no guarantee because a hang might be caused by improper use, say, a stream/device sync somewhere in between?

kwen2501 (Contributor) commented May 8, 2019

You are right. I missed that the other stream is created with the cudaStreamNonBlocking flag.

In that case (the two streams are non-blocking to each other), CUDA makes no guarantee about the order of execution of operations issued to those independent streams. Then, even if all the NCCL calls (across both streams) are coded in the same order on all the processes, there is still no guarantee against a hang.

For example, GPU 0 may launch AllReduce_a on stream_a first, and then find that there is no free compute resource left on the device to launch AllReduce_b on stream_b; whereas GPU 1 somehow launches AllReduce_b first, and finds no free resource to launch AllReduce_a. There may then be a deadlock where the two GPUs each wait for the other to join the operation they launched first.
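Schematically (an illustrative sketch of that scenario):

    // GPU 0: AllReduce_a launches on stream_a and occupies the SMs;
    //        AllReduce_b on stream_b cannot start until _a finishes.
    // GPU 1: AllReduce_b launches on stream_b first;
    //        AllReduce_a on stream_a cannot start until _b finishes.
    //
    // AllReduce_a on GPU 0 waits for GPU 1 to join AllReduce_a;
    // AllReduce_b on GPU 1 waits for GPU 0 to join AllReduce_b.
    // Neither can finish: deadlock.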

You may refer to my comments 1 & 2 in #208.

kingder (Author) commented May 8, 2019

Well, I understand the case you gave in the example, but that depends on an AllReduce occupying the whole GPU's resources, which seems quite odd to me; how many resources does an NCCL kernel need? BTW, what do you mean by the following in #208 (comment)?

Even if there are only NCCL operations, if they are launched in a loop, SM resource constraints can still occur.

sjeaugey (Member) commented May 8, 2019

The number of blocks NCCL launches depends on the platform and the bandwidth we're trying to achieve. Currently, it is 1-16 blocks of 64-256 threads.

I agree that if there are enough resources on the GPU to have two NCCL operations execute concurrently, one would expect things to work. But the CUDA programming model does not guarantee that it will work, which means it is not supported: it might stop working at any moment, or work on only some GPU types.

kwen2501 (Contributor) commented May 8, 2019

You can see how many rings NCCL creates by setting NCCL_DEBUG=INFO. Each ring is a block/CTA.

Even if there are only NCCL operations, if they are launched in a loop, SM resource constraints can still occur.

I was just giving an extreme example to show that it may not take many kernels of other types to occupy a GPU. One can also launch many NCCL operations (on concurrent streams) to achieve the same effect.

kingder (Author) commented May 9, 2019

Thank you for the detailed explanation.
