NCCL v2.16.2 is slower than v2.14.3 or v2.15.5 by a factor of 10 #770

Closed
maxhgerlach opened this issue Jan 6, 2023 · 6 comments

maxhgerlach commented Jan 6, 2023

I've built NCCL 2.16.2-1 from source on Ubuntu 20.04. nvidia-smi reports Driver Version: 470.161.03 and CUDA Version: 11.4. CUDA toolkit is version 11.2 (/usr/local/cuda-11.2).

On a system with 8x A40 GPUs and 4x NVLink bridges, reduce_scatter_perf, all_gather_perf, and all_reduce_perf from https://github.com/NVIDIA/nccl-tests are ~10x slower than with NCCL 2.14.3, which I've been using up to now. I've also seen considerable slowdowns in a Horovod training run.

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      NV4     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     12-15,44-47     3
GPU1    NV4      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     8-11,40-43      2
GPU2    SYS     SYS      X      NV4     SYS     SYS     SYS     SYS     SYS     SYS     4-7,36-39       1
GPU3    SYS     SYS     NV4      X      SYS     SYS     SYS     SYS     SYS     SYS     0-3,32-35       0
GPU4    SYS     SYS     SYS     SYS      X      NV4     SYS     SYS     SYS     SYS     28-31,60-63     7
GPU5    SYS     SYS     SYS     SYS     NV4      X      SYS     SYS     SYS     SYS     24-27,56-59     6
GPU6    SYS     SYS     SYS     SYS     SYS     SYS      X      NV4     PHB     PHB     20-23,52-55     5
GPU7    SYS     SYS     SYS     SYS     SYS     SYS     NV4      X      SYS     SYS     16-19,48-51     4
mlx5_0  SYS     SYS     SYS     SYS     SYS     SYS     PHB     SYS      X      PIX
mlx5_1  SYS     SYS     SYS     SYS     SYS     SYS     PHB     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

I've built NCCL and nccl-tests like this:

# Tarball obtained from https://github.com/NVIDIA/nccl/archive/refs/tags/v2.16.2-1.tar.gz
nccl/v2.16$ tar xvf nccl-2.16.2-1.tar.gz
nccl/v2.16/nccl-2.16.2-1$ make -j src.build BUILDDIR=/learndata4/maxDev/nccl/v2.16/nccl-build

nccl/v2.16$ git clone https://github.com/NVIDIA/nccl-tests.git
nccl/v2.16/nccl-tests$ make -j BUILDDIR=build_mpi MPI=1 NCCL_HOME=nccl/v2.16/nccl-build/ MPI_HOME=/opt/openmpi  
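
To double-check that each benchmark binary really picks up the intended libnccl build, a quick sanity check with ldd and the same LD_LIBRARY_PATH can be used, e.g.:

nccl/v2.16/nccl-tests/build_mpi$ LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib ldd ./reduce_scatter_perf | grep nccl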

This is how I've run reduce_scatter_perf:

nccl/v2.16/nccl-tests/build_mpi$ mpirun -x LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib -x NCCL_DEBUG= -x NCCL_DEBUG_FILE= -H localhost:8 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.20.0.0/16 -np 8 ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 1 -c 0
# Avg bus bandwidth    : 2.40877

Performance was much better with NCCL 2.14.3, built similarly:

nccl/v2.14/nccl-tests/build_mpi$ mpirun -x LD_LIBRARY_PATH=nccl/v2.14/nccl-build/lib -x NCCL_DEBUG=info -x NCCL_DEBUG_FILE= -H localhost:8 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.20.0.0/16 -np 8 ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 1 -c 0
# Avg bus bandwidth    : 26.0137

Detailed logs with NCCL_DEBUG=INFO are attached; those logs look quite similar to me.

Things that I've tried, but that do not improve the bandwidth:

  • Set -x NCCL_P2P_LEVEL=NVL to disable peer-to-peer communication over PCIe
  • Build nccl-tests without MPI and run like nccl-tests/build$ LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 8 -c 0

Could my observation be related to the old version of CUDA that's installed here?


maxhgerlach commented Jan 6, 2023

NCCL 2.15.5 is faster than 2.16.2 as well:

nccl/v2.15$ tar xvf nccl-2.15.5-1.tar.gz
nccl/v2.15/nccl-2.15.5-1$ make -j src.build BUILDDIR=nccl/v2.15/nccl-build

nccl/v2.15$ git clone https://github.com/NVIDIA/nccl-tests.git
nccl/v2.15/nccl-tests$ make -j BUILDDIR=build_mpi MPI=1 NCCL_HOME=nccl/v2.15/nccl-build/ MPI_HOME=/opt/openmpi

nccl/v2.15/nccl-tests/build_mpi$ mpirun -x LD_LIBRARY_PATH="nccl/v2.15/nccl-build/lib" -x NCCL_DEBUG= -x NCCL_DEBUG_FILE= -H localhost:8 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.20.0.0/16 -np 8 ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 1 -c 0
# Avg bus bandwidth    : 25.9255

maxhgerlach changed the title from "NCCL v2.16.2 is slower than v2.14.3 by a factor of 10" to "NCCL v2.16.2 is slower than v2.14.3 or v2.15.5 by a factor of 10" on Jan 6, 2023.

maxhgerlach commented:

This is with 2x AMD EPYC 7313 CPUs. I've just seen this line in the commit message of release commit 28189e2:

Reduce inter-socket bandwidth on AMD CPUs to favor better paths.

Could that affect us negatively?


sjeaugey commented Jan 6, 2023

Indeed the rings we create are no longer what they're supposed to be.

Could you run with NCCL_TOPO_DUMP_FILE=system.txt to dump the node topology so that I can reproduce internally and understand why they changed?

Thanks!

maxhgerlach commented:

Thanks for looking into this!

Here's the output with NCCL_TOPO_DUMP_FILE=system.txt: system.txt (attached; the file is identical for v2.15 and v2.16).
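
For reference, the variable can be passed to mpirun like the other NCCL settings, e.g. (a sketch based on the earlier command):

nccl/v2.16/nccl-tests/build_mpi$ mpirun -x LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib -x NCCL_TOPO_DUMP_FILE=system.txt -H localhost:8 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.20.0.0/16 -np 8 ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 1 -c 0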


sjeaugey commented Jan 6, 2023

OK, I can confirm this is indeed due to that change. The solution 2.16 finds is interesting, and actually a good one in theory based on the topology we detect; it just turns out that performance is bad in practice.

So we need to either find a way to better reflect the performance constraints, or find a trick to nudge the search in the right direction.
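
As a way to compare the rings between versions, the graphs NCCL settles on can also be dumped to XML via NCCL_GRAPH_DUMP_FILE (listed among NCCL's environment variables); a sketch using the single-process variant from above, with a placeholder output name:

nccl/v2.16/nccl-tests/build$ LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib NCCL_GRAPH_DUMP_FILE=graph-2.16.xml ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 8 -c 0

Diffing that file against the output from 2.14/2.15 should show how the ring ordering changed.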

maxhgerlach commented:

The performance regression is resolved with NCCL v2.16.5-1. Thanks @sjeaugey!
