performance degradation of ncclSend&ncclRecv in training #481

Closed
zarzen opened this issue Mar 16, 2021 · 14 comments
zarzen commented Mar 16, 2021

Hi,

I am using the ncclSend/ncclRecv primitives over a 10 Gbps TCP network. In a microbenchmark of ncclSend/ncclRecv with 2 MB-4 MB messages, NCCL can easily saturate the full bandwidth.

However, when I use ncclSend/ncclRecv during training (specifically in the backward phase), sending and receiving messages of around 4 MB only saturates about 8 Gbps.

What could be the reason for such performance degradation?
One reason I can think of is that the backward computation also uses SMs, while NCCL needs some SMs for data movement at the same time.
If so, would using cudaMemcpyAsync for the data movement (which uses the copy engine and therefore no SMs) and pipelining it with TCP perform better?

(Besides, on the TCP network I also notice that ncclSend/ncclRecv performance is sensitive to CPU usage. If my program has several busy while-loops for event/condition checks, the achieved bandwidth is 10%-15% worse, even though not all CPU cores are saturated. I cannot explain this phenomenon either, and I'm not sure whether it is relevant.)

zarzen changed the title from "What could be the reason for performance degradation of ncclSend&ncclRecv in training?" to "performance degradation of ncclSend&ncclRecv in training" on Mar 16, 2021
sjeaugey commented Mar 16, 2021

How did you measure 8Gbps? Did you time the NCCL operation and find it was slower? Or did you monitor the NIC bandwidth?

Monitoring the NIC bandwidth can be misleading, since the NCCL operation might run at 10Gbps, but there could be phases when nothing is happening on the network (e.g. during the forward phase), causing the average to appear lower.

Timing NCCL calls can also be tricky and you'd need to make sure the sender/receiver are well synchronized (which is the case in microbenchmarks, but not necessarily in real applications), so you'd need to measure the time between the last rank entering NCCL and when ranks exit.
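For concreteness, here is a minimal sketch of what timing the call on one rank typically looks like; `comm`, `stream`, `peer`, `sendbuf`, `recvbuf` and `count` are placeholders, not taken from this thread:

```c
/* Minimal timing sketch (placeholder names: comm, stream, peer, sendbuf,
 * recvbuf, count). Measures the interval on this rank only. */
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
ncclGroupStart();
ncclSend(sendbuf, count, ncclFloat, peer, comm, stream);
ncclRecv(recvbuf, count, ncclFloat, peer, comm, stream);
ncclGroupEnd();
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);            // wait for the NCCL kernel to complete
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
/* If the peer entered its call late, this interval includes the wait for
 * it, which is exactly the pitfall described above. */
```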


zarzen commented Mar 16, 2021

Hi @sjeaugey,
I monitored the time of the ncclSend/ncclRecv operations. For each group of ncclSend/ncclRecv I have a synchronization to make sure the timing of the NCCL operation is correct. I have also monitored the NIC bandwidth, which shows a similar result.

I timed the NCCL calls on each node; even in this case, does NCCL require all nodes to exit an operation before it is considered finished?

(If so, would pair-wise NCCL communicators for the send/recv ops help? I mean, each pair of ranks that want to communicate would have their own communicator, so that they don't need to wait for other nodes to exit an NCCL operation.)

sjeaugey commented Mar 16, 2021

Creating pair-wise communicators would prevent you from grouping operations together, while using the same communicator gives you the ability to group them or not, depending on your needs.

So you can put the different send/recv operations in different groups, or not in groups at all, but then there will be no concurrency and you need to be careful not to create deadlocks due to interdependency of operations.
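To make the two options concrete, a rough sketch (buffer, peer and communicator names are placeholders):

```c
/* Grouped: send and recv are fused into one launch on the same
 * communicator, so they progress concurrently. */
ncclGroupStart();
ncclSend(sendbuf, sendCount, ncclFloat, sendPeer, comm, stream);
ncclRecv(recvbuf, recvCount, ncclFloat, recvPeer, comm, stream);
ncclGroupEnd();

/* Ungrouped: each call runs on its own, one after the other. Safe only if
 * the send/recv ordering across ranks cannot form a circular wait. */
ncclSend(sendbuf, sendCount, ncclFloat, sendPeer, comm, stream);
ncclRecv(recvbuf, recvCount, ncclFloat, recvPeer, comm, stream);
```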

zarzen commented Mar 16, 2021

I am thinking of putting the send with comm1 on stream1 and the recv with comm2 on stream2, so that they can be executed in parallel.

As long as the peers I am sending to and receiving from have the matching communicators, it should be fine?
At least it seems to avoid the misalignment issue when all ranks are involved in the peer-to-peer operation.
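(For illustration, the pattern being proposed is roughly the sketch below; comm1/comm2, stream1/stream2 and the buffers are placeholders.)

```c
/* Proposed pattern: send and recv on separate communicators and streams.
 * The following comments explain why this is not deadlock-safe. */
ncclSend(sendbuf, count, ncclFloat, peer, comm1, stream1);
ncclRecv(recvbuf, count, ncclFloat, peer, comm2, stream2);
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
```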

sjeaugey commented:

This may work when you try it, but it is not safe from deadlocks. CUDA provides no guarantee that comm1 will not block comm2, or vice versa. That's why, if you need both operations to progress in parallel, you need to group them into a single call.

zarzen commented Mar 16, 2021

Do you mean that CUDA launches both send ops on the two ranks, as rank0-send on comm1 and rank1-send on comm2, but neither of them can finish?
If so, why would CUDA not be able to launch rank0-recv on comm2 and rank1-recv on comm1 at the same time as it launches the sends?

zarzen commented Mar 16, 2021

Thanks for your explanations. I found several comments you made here #239 (comment) and here #195 (comment).

But I still don't understand why CUDA calls would block NCCL and cause hangs.
May I ask why it blocks, so that I can stop asking dumb questions like this? Thanks.
Is it because of some low-level PTX assembly code?

zarzen commented Mar 16, 2021

Found another explanation of the deadlock here: #231 (comment)

sjeaugey commented:

CUDA does not guarantee that two asynchronous streams will not block each other. So if on one GPU stream1 is executed before stream2 and on the other GPU it is the opposite, and on both GPUs the streams block each other, then we end up in a deadlock: the stream1 send on GPU1 will be executed, the stream2 send on GPU2 will be executed as well, but neither receive operation will ever start. Does that make sense?
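For illustration, the scenario above could look like the following sketch (rank, communicators, streams and buffers are placeholders):

```c
/* Two ranks, two communicators, two streams (all placeholder names). */
if (rank == 0) {
  ncclSend(sendbuf, count, ncclFloat, /*peer=*/1, comm1, stream1);
  ncclRecv(recvbuf, count, ncclFloat, /*peer=*/1, comm2, stream2);
} else {  /* rank == 1 */
  ncclSend(sendbuf, count, ncclFloat, /*peer=*/0, comm2, stream2);
  ncclRecv(recvbuf, count, ncclFloat, /*peer=*/0, comm1, stream1);
}
/* If the driver happens to run stream1 before stream2 on rank 0 but
 * stream2 before stream1 on rank 1, both send kernels run and wait for
 * matching receives that are queued behind them: a deadlock. Grouping
 * the send and recv on one communicator avoids this, since they are
 * fused into a single launch. */
```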

zarzen commented Mar 17, 2021

> CUDA does not guarantee that two asynchronous streams will not block each other. So if on one GPU stream1 is executed before stream2 and on the other GPU it is the opposite, and on both GPUs the streams block each other, then we end up in a deadlock: the stream1 send on GPU1 will be executed, the stream2 send on GPU2 will be executed as well, but neither receive operation will ever start. Does that make sense?

Yes, that makes sense. But how could stream1 and stream2 block each other when they only run NCCL kernels?
One case I can think of is when there are no free SMs for launching the receive on stream2. But NCCL only uses 1-2 SMs, and a GPU like the V100 has 80 SMs, so once other programs exit there should be some room for launching the task on stream2.

Could you give an example showing how stream1 and stream2 could block each other with only ncclSend/ncclRecv kernels and default CUDA APIs? Thanks!

sjeaugey commented:

There could be other parts of the GPU where stream1 and stream2 use the same resources and therefore create a dependency, so it's not only about SM usage. And for performance reasons, the CUDA driver will try to avoid that case, which is why it works OK most of the time.

zarzen commented Mar 17, 2021

Many thanks for the further explanation!

> Timing NCCL calls can also be tricky and you'd need to make sure the sender/receiver are well synchronized (which is the case in microbenchmarks, but not necessarily in real applications), so you'd need to measure the time between the last rank entering NCCL and when ranks exit.

Back to the performance issue of ncclSend/ncclRecv: you said I should time from when the last rank enters NCCL. Is there a way to do so in NCCL? (Different ranks have different clocks, which makes it hard to align the timelines based on system clock time.)

sjeaugey commented:

> is there a way to do so in NCCL

Not in NCCL. That needs to be done by the profiling tool.
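One common workaround in application code (an assumption here, not a NCCL feature) is to align the ranks with a host-side barrier, e.g. MPI, right before the timed region, so the locally measured interval starts close to when the last rank enters NCCL:

```c
#include <mpi.h>

/* Host-side alignment before the timed region (MPI is an assumption;
 * comm, stream, peer and buffers are placeholders). */
MPI_Barrier(MPI_COMM_WORLD);           // all ranks reach this point first

cudaEventRecord(start, stream);        // then time as in the earlier sketch
ncclGroupStart();
ncclSend(sendbuf, count, ncclFloat, peer, comm, stream);
ncclRecv(recvbuf, count, ncclFloat, peer, comm, stream);
ncclGroupEnd();
cudaEventRecord(stop, stream);
```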

zarzen commented Mar 17, 2021

> Not in NCCL. That needs to be done by the profiling tool.

What specific profiling tools are you referring to? When I use Nsight Systems for profiling, it provides timeline information for each rank; can those timelines be properly aligned? Thanks
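As an aside, one way to make this region easier to locate and compare across the per-rank timelines is to wrap it in an NVTX range (a sketch; buffers, comm and stream are placeholders):

```c
#include <nvToolsExt.h>

nvtxRangePushA("backward p2p send/recv");   // shows up as a named range in Nsight Systems
ncclGroupStart();
ncclSend(sendbuf, count, ncclFloat, peer, comm, stream);
ncclRecv(recvbuf, count, ncclFloat, peer, comm, stream);
ncclGroupEnd();
nvtxRangePop();
```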

zarzen closed this as completed on Mar 18, 2021