performance degradation of ncclSend&ncclRecv in training #481
How did you measure 8Gbps? Did you time the NCCL operation and find it was slower? Or did you monitor the NIC bandwidth? Monitoring the NIC bandwidth can be misleading: the NCCL operation might run at 10Gbps, but there could be phases when nothing is happening on the network (e.g. during the forward phase), making the average appear lower. Timing NCCL calls can also be tricky, and you'd need to make sure the sender and receiver are well synchronized (which is the case in microbenchmarks, but not necessarily in real applications), so you'd need to measure the time between the last rank entering NCCL and the ranks exiting.
Hi @sjeaugey, I timed the NCCL calls on each node. Even in this case, does NCCL require all nodes to exit an operation as the end point? (If so, would pair-wise NCCL communicators for send/recv ops help? I mean, a pair of ranks that want to communicate would have their own NCCL communicator, so they don't need to wait for other nodes to exit an NCCL operation.)
Creating pair-wise communicators would prevent you from grouping operations together, while using the same communicator gives you the ability to group them or not, depending on your needs. So you can put the different send/recv operations in different groups, or not in groups at all, but then there will be no concurrency and you need to be careful not to create deadlocks due to interdependency of operations.
I am thinking to put the send and the receive on two separate communicators. And once the peers I am sending to and receiving from have the matching communicators, it should be fine?
This may work when you try it, but it would not be safe from deadlocks: CUDA provides no guarantee that comm1 will not block comm2 or vice versa. That's why, if you need both operations to progress in parallel, you need to group them into a single call.
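As a concrete illustration of the single-call grouping, here is a hedged C sketch (it assumes an already initialized `ncclComm_t comm`, a `cudaStream_t stream`, device buffers, and a peer rank; error checking is elided, so this is not a complete program):

```c
#include <nccl.h>

/* Hedged sketch: `comm`, `stream`, `sendbuf`, `recvbuf`, `count`, and `peer`
 * are assumed to be set up elsewhere; NCCL error checking is elided. */
static void exchange_grouped(const float *sendbuf, float *recvbuf, size_t count,
                             int peer, ncclComm_t comm, cudaStream_t stream)
{
    /* Inside one group, NCCL fuses the send and the receive so both progress
     * concurrently, instead of the second call waiting on the first. */
    ncclGroupStart();
    ncclSend(sendbuf, count, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuf, count, ncclFloat, peer, comm, stream);
    ncclGroupEnd();
}
```

Without the ncclGroupStart/ncclGroupEnd pair, the same two calls would be independent operations whose relative progress depends on the driver, which is the deadlock risk discussed above.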
Do you mean CUDA launches both operations as one kernel?
Thanks for your explanations. I found several comments you made in a couple of places, here #239 (comment) and here #195 (comment), but I didn't get the reason why CUDA calls would block NCCL, causing hangs.
Found a further explanation of the deadlock in #231 (comment).
CUDA does not guarantee that two asynchronous streams will not block each other. So if on one GPU stream1 is executed before stream2 and on the other GPU it is the opposite, and on both GPUs the streams block each other, then we end up in a deadlock: the stream1 send on GPU1 will be executed, the stream2 send on GPU2 will be executed as well, but neither receive operation will ever start. Does that make sense?
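To make that ordering concrete, here is a hedged sketch in C-style pseudocode of the scenario (hypothetical buffers and communicators; not a complete program):

```c
/* Hypothetical scenario: two streams per GPU which, on this run, happen to
 * block each other. CUDA gives no ordering guarantee between them. */

/* On GPU0, the driver schedules stream1 first: */
ncclSend(buf_a, count, ncclFloat, /*peer=*/1, comm1, stream1);  /* runs    */
ncclRecv(buf_b, count, ncclFloat, /*peer=*/1, comm2, stream2);  /* blocked */

/* On GPU1, the driver happens to pick the opposite order: */
ncclSend(buf_c, count, ncclFloat, /*peer=*/0, comm2, stream2);  /* runs    */
ncclRecv(buf_d, count, ncclFloat, /*peer=*/0, comm1, stream1);  /* blocked */

/* Each send kernel waits for its matching receive, but each receive sits in
 * the stream that is blocked behind the send: a circular wait, i.e. deadlock. */
```

The same four calls inside a single ncclGroupStart/ncclGroupEnd on each GPU would not have this problem, since NCCL then progresses them together.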
Yes, that makes sense. But how could stream1 and stream2 block each other with NCCL kernels? Could you give an example showing stream1 and stream2 blocking each other with only ncclSend/ncclRecv kernels and default CUDA APIs? Thanks!
There could be other parts of the GPU where stream1 and stream2 use the same resources and therefore create a dependency, so it's not only about SM usage. For performance reasons, the CUDA driver will try to avoid that case, which is why it works OK most of the time.
Many thanks for the further explanation!
Back to the performance issue of ncclSend/ncclRecv: you said I should time from the last rank entering NCCL. Is there a way to do that in NCCL? (Different ranks have different clock times, which makes it hard to align timelines based on the system clock.)
Not in NCCL. That needs to be done by the profiling tool. |
What specific profiling tools are you referring to? When I use Nsight Systems for profiling, it provides timeline information for each rank; can those timelines be properly aligned? Thanks
Hi,
I am using ncclSend/ncclRecv primitives on a 10Gbps TCP network. In a microbenchmark of ncclSend/ncclRecv with 2MB-4MB messages, NCCL can easily saturate the full bandwidth. However, when I use ncclSend/ncclRecv during training (specifically in the backward phase), sending and receiving messages of around 4MB only saturates 8Gbps. What could be the reason for such performance degradation?
One reason I can think of is that the backward computation also uses SMs, while at the same time NCCL needs some SMs for data movement. If so, would using the cudaMemcpyAsync function to do the data movement with the CUDA copy engine, pipelined with TCP, perform better? (Because it does not need SMs.)
(Besides, on a TCP network, I also notice that ncclSend/ncclRecv performance is sensitive to CPU usage. If my program has several busy while-loops for event/condition checks, bandwidth saturation can be 10%-15% worse, even though not all CPU cores are saturated. I cannot explain this phenomenon either, and I'm not sure whether it is relevant.)
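The cudaMemcpyAsync-plus-TCP pipeline mentioned in the question could be sketched roughly as below. This is a hedged, hypothetical double-buffering sketch, not NCCL's implementation: `dev` is assumed to be a device buffer, `sock` a connected TCP socket, and error checking plus partial-send handling are elided.

```c
#include <cuda_runtime.h>
#include <sys/socket.h>

/* Hedged sketch: stage chunks through pinned host buffers with the copy
 * engine (no SMs), overlapping the device-to-host copy of the next chunk
 * with the TCP send of the current one. Error handling elided. */
static void pipelined_send(const char *dev, size_t total, int sock,
                           cudaStream_t stream, size_t chunk)
{
    char *host[2];
    cudaMallocHost((void **)&host[0], chunk);  /* pinned staging buffers */
    cudaMallocHost((void **)&host[1], chunk);

    size_t off = 0;
    int cur = 0;
    size_t first = (total < chunk) ? total : chunk;
    cudaMemcpyAsync(host[cur], dev, first, cudaMemcpyDeviceToHost, stream);

    while (off < total) {
        size_t n = (total - off < chunk) ? total - off : chunk;
        cudaStreamSynchronize(stream);         /* chunk `cur` is now ready */
        size_t next = off + n;
        if (next < total) {                    /* kick off the next copy... */
            size_t m = (total - next < chunk) ? total - next : chunk;
            cudaMemcpyAsync(host[1 - cur], dev + next, m,
                            cudaMemcpyDeviceToHost, stream);
        }
        send(sock, host[cur], n, 0);           /* ...while TCP sends this one */
        off = next;
        cur = 1 - cur;
    }
    cudaFreeHost(host[0]);
    cudaFreeHost(host[1]);
}
```

Whether this beats NCCL in practice would depend on chunk size, PCIe bandwidth, and CPU availability for the socket send, so it would need to be measured rather than assumed.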