does NCCL all reduce on two streams block each other? #217

Closed
kingder opened this issue May 7, 2019 · 9 comments


kingder commented May 7, 2019

My setup is 16 GPUs on 2 nodes with 8 GPUs each: 16 processes, each using 1 GPU and two streams (the default stream and one non-blocking stream). I run all-reduce on the two streams concurrently; the reduce order within each stream is fixed across all the processes, but operations may overlap between the two streams. I expected that NCCL kernels wouldn't take many GPU resources and could run concurrently; however, I observe hangs with this setup and ncclAllReduce blocks. Is that normal?
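Roughly, each process does something like the following (a sketch; buffer names and counts are illustrative):

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Per process: one GPU, the default stream plus one non-blocking stream.
    cudaStream_t stream_nb;
    cudaStreamCreateWithFlags(&stream_nb, cudaStreamNonBlocking);

    // The order within each stream is the same on every process:
    ncclAllReduce(sendA, recvA, countA, ncclFloat, ncclSum, comm, 0);          // default stream
    ncclAllReduce(sendB, recvB, countB, ncclFloat, ncclSum, comm, stream_nb);  // non-blocking stream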

kwen2501 (Contributor) commented May 7, 2019

Without the per-thread default stream option, the default stream is a special stream which implicitly synchronizes with all other streams on the device.

Please refer to "The Default Stream" section here:
https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

The Default Stream

The default stream is useful where concurrency is not crucial to performance. Before CUDA 7, each device has a single default stream used for all host threads, which causes implicit synchronization. As the section “Implicit Synchronization” in the CUDA C Programming Guide explains, two commands from different streams cannot run concurrently if the host thread issues any CUDA command to the default stream between them.

CUDA 7 introduces a new option, the per-thread default stream, that has two effects. First, it gives each host thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently. Second, these default streams are regular streams. This means that commands in the default stream may run concurrently with commands in non-default streams.

To enable per-thread default streams in CUDA 7 and later, you can either compile with the nvcc command-line option --default-stream per-thread, or #define the CUDA_API_PER_THREAD_DEFAULT_STREAM preprocessor macro before including CUDA headers (cuda.h or cuda_runtime.h). It is important to note: you cannot use #define CUDA_API_PER_THREAD_DEFAULT_STREAM to enable this behavior in a .cu file when the code is compiled by nvcc because nvcc implicitly includes cuda_runtime.h at the top of the translation unit.
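Concretely, a sketch of the two options described above:

    /* Option 1: compile with nvcc --default-stream per-thread */

    /* Option 2: in a file compiled by the host compiler (not by nvcc),
       define the macro before including the CUDA headers: */
    #define CUDA_API_PER_THREAD_DEFAULT_STREAM
    #include <cuda_runtime.h>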

kwen2501 (Contributor) commented May 7, 2019

Also, are you sure all the ncclAllReduce calls -- when those on the default stream and those on the non-blocking stream are viewed as a whole -- are in the same order across all the processes? You only mentioned that "the reduce order on each stream is fixed." The inter-stream order is also important here.
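To illustrate (a sketch; buffer and op names are made up), "in the same order when viewed as a whole" means every rank issues the same global sequence:

    // Every rank issues op A, then op B:
    ncclAllReduce(sendA, recvA, countA, ncclFloat, ncclSum, comm, 0);          // op A, default stream
    ncclAllReduce(sendB, recvB, countB, ncclFloat, ncclSum, comm, stream_nb);  // op B, non-blocking stream
    // If some rank issues op B before op A, the inter-stream order
    // differs across ranks, which can cause a hang.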

sjeaugey (Member) commented May 7, 2019

Also please refer to my comment here: #195 (comment).

There is no guarantee that two NCCL operations will be able to make progress concurrently. So you need to make sure things will still work even if one blocks the other.
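For instance, one way to stay safe (a sketch, assuming two streams stream_a and stream_b and illustrative buffers) is to order the two operations explicitly with an event, so correctness never depends on concurrent progress:

    cudaEvent_t done_a;
    cudaEventCreateWithFlags(&done_a, cudaEventDisableTiming);

    ncclAllReduce(sendA, recvA, countA, ncclFloat, ncclSum, comm, stream_a);
    cudaEventRecord(done_a, stream_a);
    // Make stream_b wait until the first all-reduce has completed:
    cudaStreamWaitEvent(stream_b, done_a, 0);
    ncclAllReduce(sendB, recvB, countB, ncclFloat, ncclSum, comm, stream_b);

This serializes the two all-reduces, trading potential overlap for a guarantee of progress.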

kingder (Author) commented May 8, 2019

Thanks for the reply.

@kwen2501 I thought the non-blocking stream would be an exception here, right? From the More Tips section in your link:

You can create non-blocking streams which do not synchronize with the legacy default stream by passing the cudaStreamNonBlocking flag to cudaStreamCreateWithFlags().

And the inter-stream order is not the same; since the two streams don't block each other, I thought they would make progress independently.

@sjeaugey that helps, though I'm still confused about why it doesn't work. Could you elaborate on what the "no guarantee" means?

  • Does it mean that even if there are enough resources for two NCCL kernels on two different streams, and there are no other implicit sync points in between, there is still a chance of a hang?
  • Or does it mean NCCL makes no guarantee because a hang might be caused by improper use, say, a stream/device sync somewhere in between?

kwen2501 (Contributor) commented May 8, 2019

You are right. I missed that the other stream is created with the cudaStreamNonBlocking flag.

In that case (the two streams are non-blocking to each other), CUDA makes no guarantee about the order of execution of operations issued to those independent streams. Then, even if all the NCCL calls (across both streams) are coded in the same order on all the processes, there is still no guarantee against a hang.

For example, GPU 0 may launch AllReduce_a on stream_a first, and then find that there is no free compute resource left on the device to launch AllReduce_b on stream_b; whereas GPU 1 somehow launches AllReduce_b first, and finds no free resource to launch AllReduce_a. There may then be a deadlock where the two GPUs each wait for the other to join the operation they launched first.
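Schematically (an illustrative sketch of that scenario):

    // GPU 0: AllReduce_a launches on stream_a and occupies the SMs;
    //        AllReduce_b on stream_b cannot start until _a finishes.
    // GPU 1: AllReduce_b launches on stream_b first;
    //        AllReduce_a on stream_a cannot start until _b finishes.
    //
    // AllReduce_a on GPU 0 waits for GPU 1 to join AllReduce_a;
    // AllReduce_b on GPU 1 waits for GPU 0 to join AllReduce_b.
    // Neither can finish: deadlock.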

You may refer to my comments 1 & 2 in #208.

kingder (Author) commented May 8, 2019

Well, I understand the case you gave in the example, but that depends on an AllReduce occupying the whole GPU's resources, which seems quite odd to me; how many resources does an NCCL kernel need? BTW, what do you mean by the following in #208 (comment)?

Even if there are only NCCL operations, if they are launched in a loop, SM resource constraints can still occur.

sjeaugey (Member) commented May 8, 2019

The number of blocks NCCL launches depends on the platform and the bandwidth we're trying to achieve. Currently, it is 1-16 blocks of 64-256 threads.

I agree that if there are enough resources on the GPU to have two NCCL operations execute concurrently, one would expect things to work. But the CUDA programming model does not guarantee that it will work, which means it is not supported: it might stop working at any moment, or work on only some GPU types.

kwen2501 (Contributor) commented May 8, 2019

You can see how many rings NCCL creates by setting NCCL_DEBUG=INFO. Each ring is a block/CTA.

Even if there are only NCCL operations, if they are launched in a loop, SM resource constraints can still occur.

I was just giving an extreme example to show that it may not take many kernels of other types to occupy a GPU. One can also launch many NCCL operations (on concurrent streams) to achieve the same effect.

kingder (Author) commented May 9, 2019

Thank you for the detailed explanation.
