Can two AllReduce operations run concurrently using different communicators and different streams? #315
Comments
Do `cudaEventRecord` or `cudaStreamWaitEvent` have something to do with this behavior? Not running concurrently?
Hi, you are getting 2*T most likely because the two AllReduces are sharing (competing for) the same bandwidth. 128 MB is a large enough message size that each AllReduce could have consumed the full bandwidth if it ran alone.
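A back-of-the-envelope sketch of why sharing gives 2*T (assuming both operations split a single bottleneck link of bandwidth B for a message of size S):

```
T_single = S / B           // one AllReduce saturating the link
T_shared = S / (B / 2)     // each of two concurrent AllReduces gets B/2
         = 2 * T_single    // so both finish at ~2*T
```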
Thanks for your quick reply, @kwen2501! I just picked 128 MB for testing because the gradients of a simple network like ResNet50 add up to nearly 100 MB, if I'm not mistaken.
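For what it's worth, that estimate checks out (assuming FP32 gradients and ResNet50's roughly 25.6 M parameters):

```
25.6e6 parameters * 4 bytes/parameter ≈ 102 MB of gradients
```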
And here's my GPU topology: (topology screenshot omitted)
Actually, when I use two AllReduces the time cost is as below: (timing screenshot omitted) After I delete one AllReduce, the time cost is: (timing screenshot omitted)
You have a DGX-1-like system, so all intra-node communication would be via NVLink. However, even with NVLink's large bandwidth, 128 MB is still big enough to achieve almost the peak. I don't see how you record the start time in the code. Do you have a barrier before it (to sync all the processes)?
I record the start and end time right before and after the two AllReduce calls.

You need to record the start time before calling the first `ncclAllReduce`, after a barrier so all processes start together.
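For example, a minimal sketch of that pattern, as applied in the next comment (names are illustrative; `MPI_Wtime` is one option for the host-side clock):

```c
MPI_Barrier(MPI_COMM_WORLD);            /* sync all processes first       */
double start = MPI_Wtime();             /* start before the 1st AllReduce */

ncclAllReduce(sbuf1, rbuf1, count, ncclFloat, ncclSum, comm1, stream1);
ncclAllReduce(sbuf2, rbuf2, count, ncclFloat, ncclSum, comm2, stream2);

cudaStreamSynchronize(stream1);         /* wait for both to complete      */
cudaStreamSynchronize(stream2);
double end = MPI_Wtime();
```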
Thanks for the tip! Now I've moved the time-recording code before the first AllReduce and added `MPI_Barrier()` before it. The time costs do look more accurate. I thought about your explanation of the bandwidth, so I reduced the buffer size to 40 MB, but the time is still near 2*T ms. Before that, I noticed you mentioned that all 8 GPUs communicate via NVLink, which can reach up to 62 GB/s according to this benchmark. But the speed is not that fast according to my test. Does this mean some GPUs are not using NVLink to communicate, like GPU0 and GPU7 via SYS as shown in the topology? I'm new to the hardware so I'm confused. It would help a lot if you could explain. (^__^)
The bandwidth reported on that web page is what we call Bus Bandwidth. You can find the difference between the definitions of Bus Bandwidth and Algorithm Bandwidth here. Simply put, when you are doing an AllReduce with a large enough number of ranks, BusBw is about 2x AlgoBw. It looks like you are using 8x V100 GPUs; in that case, you should see a peak BusBw of ~130 GB/s. Of course, that's achieved when the message size is large enough (e.g. >= 128 MB). If you suspect your system has a performance issue, I would recommend running the NCCL perf tests. You may also turn on `NCCL_DEBUG=INFO` for more detail.
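For reference, the NCCL perf tests compute bus bandwidth for AllReduce as busBw = algoBw * 2(n-1)/n, so with n = 8 ranks the factor is 1.75 (hence roughly 2x for large n). A typical single-node run over 8 GPUs might look like this (flags per the nccl-tests README; the build may need CUDA_HOME/NCCL_HOME set):

```sh
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make
# sweep message sizes from 8 MB to 256 MB, doubling each step, on 8 GPUs
./build/all_reduce_perf -b 8M -e 256M -f 2 -g 8
```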
Issue closed.
Hi NCCL developers, lately I've been working with the `ncclAllReduce` operation.
I found that two AllReduce operations can't run concurrently even though I pass different communicators and different streams.
I've read similar issues like #217 and #195, but none of them matches my scenario exactly.
So here's my code (a slight modification of the example on your website):
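(The code block itself was not preserved in this copy of the thread. Below is a reconstruction sketch based only on the description in this issue: two communicators, two streams, 128 MB buffers, timing around both calls. It is not the author's original code; error checking and the kernel/event ordering described below are elided, and the barrier/timing placement follows the advice given earlier in this thread.)

```c
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  cudaSetDevice(rank);                      /* one GPU per process */

  /* Two independent communicators, one per AllReduce. */
  ncclUniqueId id1, id2;
  if (rank == 0) { ncclGetUniqueId(&id1); ncclGetUniqueId(&id2); }
  MPI_Bcast(&id1, sizeof(id1), MPI_BYTE, 0, MPI_COMM_WORLD);
  MPI_Bcast(&id2, sizeof(id2), MPI_BYTE, 0, MPI_COMM_WORLD);
  ncclComm_t comm1, comm2;
  ncclCommInitRank(&comm1, nranks, id1, rank);
  ncclCommInitRank(&comm2, nranks, id2, rank);

  /* Two streams and two 128 MB buffers (32M floats each). */
  const size_t count = 32UL * 1024 * 1024;
  float *buf1, *buf2;
  cudaMalloc((void**)&buf1, count * sizeof(float));
  cudaMalloc((void**)&buf2, count * sizeof(float));
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  MPI_Barrier(MPI_COMM_WORLD);              /* line up all 8 processes */
  double start = MPI_Wtime();

  /* Both AllReduces issued back to back on different streams/comms. */
  ncclAllReduce(buf1, buf1, count, ncclFloat, ncclSum, comm1, s1);
  ncclAllReduce(buf2, buf2, count, ncclFloat, ncclSum, comm2, s2);

  cudaStreamSynchronize(s1);                /* wait for both to finish */
  cudaStreamSynchronize(s2);
  double end = MPI_Wtime();
  if (rank == 0) printf("elapsed: %.3f ms\n", (end - start) * 1e3);

  ncclCommDestroy(comm1); ncclCommDestroy(comm2);
  cudaStreamDestroy(s1);  cudaStreamDestroy(s2);
  cudaFree(buf1); cudaFree(buf2);
  MPI_Finalize();
  return 0;
}
```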
This is how I run the executable file: `mpirun -n 8 ./file`. I run 8 identical processes that communicate with each other. Each AllReduce operation processes a 128 MB buffer, and I assume each AllReduce will cost T milliseconds. I ensure the execution order of my CUDA kernels and NCCL operations is identical in each process using `cudaEventCreate` and `cudaStreamWaitEvent`. So I think the output of `print(end_time - start_time);` should be quite close to T ms, since my CUDA kernels cost little time and the two AllReduces would be running concurrently. But the result turned out to be 2*T ms.

So here's my question: can two AllReduce operations run concurrently using different communicators and different streams in the same process? I've found many of @sjeaugey's comments very helpful, especially #239 (comment), but I still can't figure out why these two AllReduce operations can't run concurrently while the execution order is fixed.
Please correct me if my code has mistakes or my thinking is wrong. Thank you very much.
My NCCL + CUDA version: NCCL 2.4.8 + CUDA 10.1
Operating system: Ubuntu 16.04
GPU: Tesla V100-SXM2
OpenMPI version: 3.1.5
Best regards,
zpcalan