performance degradation of ncclSend&ncclRecv in training #481

Closed
zarzen opened this issue Mar 16, 2021 · 14 comments
zarzen commented Mar 16, 2021

Hi,

I am using the ncclSend/ncclRecv primitives over a 10 Gbps TCP network. In a microbenchmark of ncclSend/ncclRecv with 2 MB-4 MB messages, NCCL can easily saturate the full bandwidth.

However, when I use ncclSend/ncclRecv during training (specifically in the backward phase), sending and receiving messages of around 4 MB only saturates about 8 Gbps.

What could be the reason for such performance degradation?
One reason I can think of is that the backward computation also uses SMs, while NCCL needs some SMs for data movement at the same time.
If so, would using cudaMemcpyAsync for the data movement (which uses the copy engine and therefore no SMs) and pipelining it with TCP perform better?

(Besides, on the TCP network I also notice that ncclSend/ncclRecv performance is sensitive to CPU usage. If my program has several busy while-loops for event/condition checks, the achieved bandwidth is 10%-15% worse, even though not all CPU cores are saturated. I cannot explain this phenomenon either, and I'm not sure whether it is relevant.)

zarzen changed the title from "What could be the reason for performance degradation of ncclSend&ncclRecv in training?" to "performance degradation of ncclSend&ncclRecv in training" on Mar 16, 2021
sjeaugey commented Mar 16, 2021

How did you measure 8Gbps? Did you time the NCCL operation and find it was slower? Or did you monitor the NIC bandwidth?

Monitoring the NIC bandwidth can be misleading, since the NCCL operation might run at 10Gbps, but there could be phases when nothing is happening on the network (e.g. during the forward phase), causing the average to appear lower.

Timing NCCL calls can also be tricky and you'd need to make sure the sender/receiver are well synchronized (which is the case in microbenchmarks, but not necessarily in real applications), so you'd need to measure the time between the last rank entering NCCL and when ranks exit.
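For concreteness, here is a minimal sketch of what timing the call on one rank typically looks like; `comm`, `stream`, `peer`, `sendbuf`, `recvbuf` and `count` are placeholders, not taken from this thread:

```c
/* Minimal timing sketch (placeholder names: comm, stream, peer, sendbuf,
 * recvbuf, count). Measures the interval on this rank only. */
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
ncclGroupStart();
ncclSend(sendbuf, count, ncclFloat, peer, comm, stream);
ncclRecv(recvbuf, count, ncclFloat, peer, comm, stream);
ncclGroupEnd();
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);            // wait for the NCCL kernel to complete
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
/* If the peer entered its call late, this interval includes the wait for
 * it, which is exactly the pitfall described above. */
```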


zarzen commented Mar 16, 2021

Hi @sjeaugey,
I monitored the time of the ncclSend/ncclRecv operations. For each group of ncclSend/ncclRecv I have a synchronization to make sure the timing of the NCCL operation is correct. I have also monitored the NIC bandwidth, which shows a similar result.

I timed the NCCL calls on each node; even in this case, does NCCL require all nodes to exit an operation before it is considered finished?

(If so, would pair-wise NCCL communicators for the send/recv ops help? I mean, each pair of ranks that want to communicate would have their own communicator, so that they don't need to wait for other nodes to exit an NCCL operation.)

sjeaugey commented Mar 16, 2021

Creating pair-wise communicators would prevent you from grouping operations together, while using the same communicator gives you the ability to group them or not, depending on your needs.

So you can put the different send/recv operations in different groups, or not in groups at all, but then there will be no concurrency and you need to be careful not to create deadlocks due to interdependency of operations.
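To make the two options concrete, a rough sketch (buffer, peer and communicator names are placeholders):

```c
/* Grouped: send and recv are fused into one launch on the same
 * communicator, so they progress concurrently. */
ncclGroupStart();
ncclSend(sendbuf, sendCount, ncclFloat, sendPeer, comm, stream);
ncclRecv(recvbuf, recvCount, ncclFloat, recvPeer, comm, stream);
ncclGroupEnd();

/* Ungrouped: each call runs on its own, one after the other. Safe only if
 * the send/recv ordering across ranks cannot form a circular wait. */
ncclSend(sendbuf, sendCount, ncclFloat, sendPeer, comm, stream);
ncclRecv(recvbuf, recvCount, ncclFloat, recvPeer, comm, stream);
```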

zarzen commented Mar 16, 2021

I am thinking of putting the send with comm1 on stream1 and the recv with comm2 on stream2, so that they can be executed in parallel.

As long as the peers I am sending to and receiving from have the matching communicators, it should be fine?
At least it seems to avoid the misalignment issue when all ranks are involved in the peer-to-peer operation.
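(For illustration, the pattern being proposed is roughly the sketch below; comm1/comm2, stream1/stream2 and the buffers are placeholders.)

```c
/* Proposed pattern: send and recv on separate communicators and streams.
 * The following comments explain why this is not deadlock-safe. */
ncclSend(sendbuf, count, ncclFloat, peer, comm1, stream1);
ncclRecv(recvbuf, count, ncclFloat, peer, comm2, stream2);
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
```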

sjeaugey commented:

This may work when you try it, but it is not safe from deadlocks. CUDA provides no guarantee that comm1 will not block comm2, or vice versa. That's why, if you need both operations to progress in parallel, you need to group them into a single call.

zarzen commented Mar 16, 2021

Do you mean that CUDA launches both send ops on the two ranks, as rank0-send on comm1 and rank1-send on comm2, but neither of them can finish?
If so, why would CUDA not be able to launch rank0-recv on comm2 and rank1-recv on comm1 at the same time as it launches the sends?

zarzen commented Mar 16, 2021

Thanks for your explanations. I found several comments you made here #239 (comment) and here #195 (comment).

But I still don't understand why CUDA calls would block NCCL and cause hangs.
May I ask why it blocks, so that I can stop asking dumb questions like this? Thanks.
Is it because of some low-level PTX assembly code?

zarzen commented Mar 16, 2021

Found another explanation of the deadlock here: #231 (comment)

sjeaugey commented:

CUDA does not guarantee that two asynchronous streams will not block each other. So if on one GPU stream1 is executed before stream2 and on the other GPU it is the opposite, and on both GPUs the streams block each other, then we end up in a deadlock: the stream1 send on GPU1 will be executed, the stream2 send on GPU2 will be executed as well, but neither receive operation will ever start. Does that make sense?
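For illustration, the scenario above could look like the following sketch (rank, communicators, streams and buffers are placeholders):

```c
/* Two ranks, two communicators, two streams (all placeholder names). */
if (rank == 0) {
  ncclSend(sendbuf, count, ncclFloat, /*peer=*/1, comm1, stream1);
  ncclRecv(recvbuf, count, ncclFloat, /*peer=*/1, comm2, stream2);
} else {  /* rank == 1 */
  ncclSend(sendbuf, count, ncclFloat, /*peer=*/0, comm2, stream2);
  ncclRecv(recvbuf, count, ncclFloat, /*peer=*/0, comm1, stream1);
}
/* If the driver happens to run stream1 before stream2 on rank 0 but
 * stream2 before stream1 on rank 1, both send kernels run and wait for
 * matching receives that are queued behind them: a deadlock. Grouping
 * the send and recv on one communicator avoids this, since they are
 * fused into a single launch. */
```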

zarzen commented Mar 17, 2021

> CUDA does not guarantee that two asynchronous streams will not block each other. So if on one GPU stream1 is executed before stream2 and on the other GPU it is the opposite, and on both GPUs the streams block each other, then we end up in a deadlock: the stream1 send on GPU1 will be executed, the stream2 send on GPU2 will be executed as well, but neither receive operation will ever start. Does that make sense?

Yes, that makes sense. But how could stream1 and stream2 block each other when they only run NCCL kernels?
One case I can think of is when there are no free SMs for launching the receive on stream2. But NCCL only uses 1-2 SMs, and a GPU like the V100 has 80 SMs, so once other programs exit there should be some room for launching the task on stream2.

Could you give an example showing how stream1 and stream2 could block each other with only ncclSend/ncclRecv kernels and default CUDA APIs? Thanks!

sjeaugey commented:

There could be other parts of the GPU where stream1 and stream2 use the same resources and therefore create a dependency, so it's not only about SM usage. And for performance reasons, the CUDA driver will try to avoid that case, which is why it works OK most of the time.

zarzen commented Mar 17, 2021

Many thanks for the further explanation!

> Timing NCCL calls can also be tricky and you'd need to make sure the sender/receiver are well synchronized (which is the case in microbenchmarks, but not necessarily in real applications), so you'd need to measure the time between the last rank entering NCCL and when ranks exit.

Back to the performance issue of ncclSend/ncclRecv: you said I should time from when the last rank enters NCCL. Is there a way to do so in NCCL? (Different ranks have different clocks, which makes it hard to align the timelines based on system clock time.)

sjeaugey commented:

> is there a way to do so in NCCL

Not in NCCL. That needs to be done by the profiling tool.
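One common workaround in application code (an assumption here, not a NCCL feature) is to align the ranks with a host-side barrier, e.g. MPI, right before the timed region, so the locally measured interval starts close to when the last rank enters NCCL:

```c
#include <mpi.h>

/* Host-side alignment before the timed region (MPI is an assumption;
 * comm, stream, peer and buffers are placeholders). */
MPI_Barrier(MPI_COMM_WORLD);           // all ranks reach this point first

cudaEventRecord(start, stream);        // then time as in the earlier sketch
ncclGroupStart();
ncclSend(sendbuf, count, ncclFloat, peer, comm, stream);
ncclRecv(recvbuf, count, ncclFloat, peer, comm, stream);
ncclGroupEnd();
cudaEventRecord(stop, stream);
```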

zarzen commented Mar 17, 2021

> Not in NCCL. That needs to be done by the profiling tool.

What specific profiling tools are you referring to? When I use Nsight Systems for profiling, it provides timeline information for each rank; can those timelines be properly aligned? Thanks
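As an aside, one way to make this region easier to locate and compare across the per-rank timelines is to wrap it in an NVTX range (a sketch; buffers, comm and stream are placeholders):

```c
#include <nvToolsExt.h>

nvtxRangePushA("backward p2p send/recv");   // shows up as a named range in Nsight Systems
ncclGroupStart();
ncclSend(sendbuf, count, ncclFloat, peer, comm, stream);
ncclRecv(recvbuf, count, ncclFloat, peer, comm, stream);
ncclGroupEnd();
nvtxRangePop();
```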

zarzen closed this as completed on Mar 18, 2021