NCCL v2.16.2 is slower than v2.14.3 or v2.15.5 by a factor of 10 #770
Comments
NCCL 2.15.5 is faster than 2.16.2 as well:
This is with 2x AMD EPYC 7313 CPUs. I've just seen this line in the release commit message 28189e2:

Could that affect us negatively?
Indeed, the rings we create are no longer what they're supposed to be. Could you run with …? Thanks!
Thanks for looking into this! Here's the output with …:
Ok, I can confirm this is due to that change indeed. The solution 2.16 finds is interesting, and actually a good one in theory based on the topology we detect; it just turns out that performance is bad in practice. So we need to either find a way to better reflect the performance constraints, or find a trick to nudge the search in the right direction.
The performance regression is resolved with NCCL v2.16.5-1. Thanks @sjeaugey!
I've built NCCL 2.16.2-1 from source on Ubuntu 20.04. `nvidia-smi` reports `Driver Version: 470.161.03` and `CUDA Version: 11.4`. The CUDA toolkit is version 11.2 (`/usr/local/cuda-11.2`).

On a system with 8x A40 GPUs and 4x NVLink bridges, `reduce_scatter_perf`, `all_gather_perf`, and `all_reduce_perf` from https://github.com/NVIDIA/nccl-tests are ~10x slower than with NCCL 2.14.3, which I've been using up to now. I've also seen considerable slowdowns in a Horovod training.

I've built NCCL and nccl-tests like this:
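The exact build commands were not preserved in this thread. As a rough sketch, NCCL's documented from-source build is `make src.build`, and nccl-tests links against a local build via `NCCL_HOME`; the paths below are assumptions based on the report, and the commands are printed rather than executed so the sketch runs anywhere:

```shell
# Assumed paths (the toolkit location is from the report; the NCCL build
# directory matches the LD_LIBRARY_PATH used later in this issue):
CUDA_HOME=/usr/local/cuda-11.2
NCCL_HOME=$PWD/nccl/v2.16/nccl-build

# In a checkout of github.com/NVIDIA/nccl:
echo "make -j src.build CUDA_HOME=$CUDA_HOME BUILDDIR=$NCCL_HOME"

# In a checkout of github.com/NVIDIA/nccl-tests:
echo "make MPI=1 CUDA_HOME=$CUDA_HOME NCCL_HOME=$NCCL_HOME"
```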
This is how I've run `reduce_scatter_perf`:

Performance was much better with NCCL 2.14.3, built similarly:
Detailed logs with `NCCL_DEBUG=INFO`:

Those logs look quite similar to me.
Things that I've tried, but that do not improve the bandwidth:

- `-x NCCL_P2P_LEVEL=NVL` to disable peer-to-peer communication over PCIe
- building `nccl-tests` without MPI and running it like `nccl-tests/build$ LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 8 -c 0`
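For comparison with the first item, an MPI launch of the same test would look roughly like this; the `mpirun` flags are assumptions (Open MPI style, one rank per GPU), while the size and iteration arguments are the ones from the non-MPI command. The command is printed rather than executed so the sketch stays self-contained:

```shell
# Hypothetical single-node launch. With Open MPI, -x exports an environment
# variable to all ranks; NCCL_P2P_LEVEL=NVL restricts peer-to-peer copies to
# NVLink-connected GPU pairs, i.e. it disables P2P over PCIe.
NP=8  # one rank per GPU, replacing the -g 8 of the single-process run
echo "mpirun -np $NP \
  -x LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib \
  -x NCCL_P2P_LEVEL=NVL \
  ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -c 0"
```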
Could my observation be related to the old version of CUDA that's installed here?