NCCL v2.16.2 is slower than v2.14.3 or v2.15.5 by a factor of 10 #770

Closed
maxhgerlach opened this issue Jan 6, 2023 · 6 comments

maxhgerlach commented Jan 6, 2023

I've built NCCL 2.16.2-1 from source on Ubuntu 20.04. nvidia-smi reports Driver Version: 470.161.03 and CUDA Version: 11.4. CUDA toolkit is version 11.2 (/usr/local/cuda-11.2).

On a system with 8x A40 GPUs and 4x NVLink bridges, reduce_scatter_perf, all_gather_perf, and all_reduce_perf from https://github.com/NVIDIA/nccl-tests are ~10x slower than with NCCL 2.14.3, which I've been using up to now. I've also seen considerable slowdowns in a Horovod training run.

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      NV4     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     12-15,44-47     3
GPU1    NV4      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     8-11,40-43      2
GPU2    SYS     SYS      X      NV4     SYS     SYS     SYS     SYS     SYS     SYS     4-7,36-39       1
GPU3    SYS     SYS     NV4      X      SYS     SYS     SYS     SYS     SYS     SYS     0-3,32-35       0
GPU4    SYS     SYS     SYS     SYS      X      NV4     SYS     SYS     SYS     SYS     28-31,60-63     7
GPU5    SYS     SYS     SYS     SYS     NV4      X      SYS     SYS     SYS     SYS     24-27,56-59     6
GPU6    SYS     SYS     SYS     SYS     SYS     SYS      X      NV4     PHB     PHB     20-23,52-55     5
GPU7    SYS     SYS     SYS     SYS     SYS     SYS     NV4      X      SYS     SYS     16-19,48-51     4
mlx5_0  SYS     SYS     SYS     SYS     SYS     SYS     PHB     SYS      X      PIX
mlx5_1  SYS     SYS     SYS     SYS     SYS     SYS     PHB     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

I've built NCCL and nccl-tests like this:

# Tarball obtained from https://github.com/NVIDIA/nccl/archive/refs/tags/v2.16.2-1.tar.gz
nccl/v2.16$ tar xvf nccl-2.16.2-1.tar.gz
nccl/v2.16/nccl-2.16.2-1$ make -j src.build BUILDDIR=/learndata4/maxDev/nccl/v2.16/nccl-build

nccl/v2.16$ git clone https://github.com/NVIDIA/nccl-tests.git
nccl/v2.16/nccl-tests$ make -j BUILDDIR=build_mpi MPI=1 NCCL_HOME=nccl/v2.16/nccl-build/ MPI_HOME=/opt/openmpi  
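
To double-check that each benchmark binary really picks up the intended libnccl build, a quick sanity check with ldd and the same LD_LIBRARY_PATH can be used, e.g.:

nccl/v2.16/nccl-tests/build_mpi$ LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib ldd ./reduce_scatter_perf | grep nccl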

This is how I've run reduce_scatter_perf:

nccl/v2.16/nccl-tests/build_mpi$ mpirun -x LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib -x NCCL_DEBUG= -x NCCL_DEBUG_FILE= -H localhost:8 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.20.0.0/16 -np 8 ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 1 -c 0
# Avg bus bandwidth    : 2.40877

Performance was much better with NCCL 2.14.3, built similarly:

nccl/v2.14/nccl-tests/build_mpi$ mpirun -x LD_LIBRARY_PATH=nccl/v2.14/nccl-build/lib -x NCCL_DEBUG=info -x NCCL_DEBUG_FILE= -H localhost:8 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.20.0.0/16 -np 8 ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 1 -c 0
# Avg bus bandwidth    : 26.0137

Detailed logs with NCCL_DEBUG=INFO are attached; those logs look quite similar to me.

Things that I've tried, but that do not improve the bandwidth:

  • Set -x NCCL_P2P_LEVEL=NVL to disable peer-to-peer communication over PCIe
  • Build nccl-tests without MPI and run like nccl-tests/build$ LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 8 -c 0

Could my observation be related to the old version of CUDA that's installed here?


maxhgerlach commented Jan 6, 2023

NCCL 2.15.5 is faster than 2.16.2 as well:

nccl/v2.15$ tar xvf nccl-2.15.5-1.tar.gz
nccl/v2.15/nccl-2.15.5-1$ make -j src.build BUILDDIR=nccl/v2.15/nccl-build

nccl/v2.15$ git clone https://github.com/NVIDIA/nccl-tests.git
nccl/v2.15/nccl-tests$ make -j BUILDDIR=build_mpi MPI=1 NCCL_HOME=nccl/v2.15/nccl-build/ MPI_HOME=/opt/openmpi

nccl/v2.15/nccl-tests/build_mpi$ mpirun -x LD_LIBRARY_PATH="nccl/v2.15/nccl-build/lib" -x NCCL_DEBUG= -x NCCL_DEBUG_FILE= -H localhost:8 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.20.0.0/16 -np 8 ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 1 -c 0
# Avg bus bandwidth    : 25.9255

maxhgerlach changed the title from "NCCL v2.16.2 is slower than v2.14.3 by a factor of 10" to "NCCL v2.16.2 is slower than v2.14.3 or v2.15.5 by a factor of 10" on Jan 6, 2023.

maxhgerlach commented:

This is with 2x AMD EPYC 7313 CPUs. I've just seen this line in the commit message of release commit 28189e2:

Reduce inter-socket bandwidth on AMD CPUs to favor better paths.

Could that affect us negatively?


sjeaugey commented Jan 6, 2023

Indeed the rings we create are no longer what they're supposed to be.

Could you run with NCCL_TOPO_DUMP_FILE=system.txt to dump the node topology so that I can reproduce internally and understand why they changed?

Thanks!

maxhgerlach commented:

Thanks for looking into this!

Here's the output with NCCL_TOPO_DUMP_FILE=system.txt: system.txt (attached; the file is identical for v2.15 and v2.16).
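
For reference, the variable can be passed to mpirun like the other NCCL settings, e.g. (a sketch based on the earlier command):

nccl/v2.16/nccl-tests/build_mpi$ mpirun -x LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib -x NCCL_TOPO_DUMP_FILE=system.txt -H localhost:8 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include 10.20.0.0/16 -np 8 ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 1 -c 0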


sjeaugey commented Jan 6, 2023

OK, I can confirm this is indeed due to that change. The solution 2.16 finds is interesting, and actually a good one in theory based on the topology we detect; it just turns out that performance is bad in practice.

So we need to either find a way to better reflect the performance constraints, or find a trick to nudge the search in the right direction.
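
As a way to compare the rings between versions, the graphs NCCL settles on can also be dumped to XML via NCCL_GRAPH_DUMP_FILE (listed among NCCL's environment variables); a sketch using the single-process variant from above, with a placeholder output name:

nccl/v2.16/nccl-tests/build$ LD_LIBRARY_PATH=nccl/v2.16/nccl-build/lib NCCL_GRAPH_DUMP_FILE=graph-2.16.xml ./reduce_scatter_perf -b 1417M -e 1417M -f 1 -g 8 -c 0

Diffing that file against the output from 2.14/2.15 should show how the ring ordering changed.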

maxhgerlach commented:

The performance regression is resolved with NCCL v2.16.5-1. Thanks @sjeaugey!
