MSCCL

Microsoft Collective Communication Library (MSCCL) is a platform for executing custom collective communication algorithms on heterogeneous accelerators supported by Microsoft Azure. MSCCL currently supports NVIDIA and AMD GPUs. The research prototype of this project is microsoft/msccl.

Introduction

The MSCCL vision is to provide a unified, efficient, and scalable framework for executing collective communication algorithms on heterogeneous accelerators. To achieve this, MSCCL has multiple components:

  • MSCCL toolkit: Interconnects among accelerators have different latencies and bandwidths, so a generic collective communication algorithm does not necessarily perform well for all topologies and buffer sizes. To provide this flexibility, the MSCCL toolkit lets a user write a hyper-optimized collective communication algorithm for a given topology and buffer size. The toolkit contains a high-level DSL (MSCCLang) and a compiler that generates an IR for the MSCCL executor to run on the backend. The Example section shows how the MSCCL toolkit works with the runtime. Please refer to MSCCL toolkit for more information.

  • MSCCL scheduler: MSCCL scheduler provides an example design and implementation of how to select optimal MSCCL algorithms for MSCCL executors.

  • MSCCL executor: The MSCCL executor is a set of libraries responsible for running custom-written collective communication algorithms on heterogeneous accelerators. Each kind of accelerator has a corresponding executor library that is specifically optimized for it. The executor libraries share the same interface for running MSCCL algorithm IR from the MSCCL toolkit and for communicating with the MSCCL scheduler. For NVIDIA GPUs, the executor is msccl-executor-nccl, which is built on top of NCCL. For AMD GPUs, it is RCCL, which already integrates all MSCCL executor features.

  • MSCCL test toolkit (msccl-tests-nccl): These tests check both the performance and the correctness of MSCCL operations.

Performance

For reference, FP16 All-Reduce and All-Gather algorithms were tested and compared on an ND H100 v5 VM using msccl-tests-nccl.

FP16 All-Reduce Latency (us)

Message Size  NCCL      MSCCL     MSCCL Speedup
1KB           13.12     7.50      1.80x
2KB           14.39     7.48      1.92x
4KB           15.28     7.49      2.04x
8KB           15.69     7.67      2.04x
16KB          16.64     8.03      2.07x
32KB          19.3      9.08      2.13x
64KB          20        10.36     1.93x
128KB         20.42     11.06     1.85x
256KB         20.5      12.86     1.60x
512KB         29.89     19.14     1.56x
1MB           31.94     22.31     1.43x
2MB           37.95     33.43     1.14x
4MB           49.28     43.97     1.12x
8MB           77.01     68.16     1.13x
16MB          116       115.7     1.00x
32MB          187.2     186.5     1.00x
64MB          317.4     315.7     1.01x
128MB         572.5     570.4     1.00x
256MB         1079      1075.6    1.00x
512MB         2071.1    2067.9    1.00x
1GB           4028.7    4026.8    1.00x

FP16 All-Gather Latency (us)

Message Size  NCCL      MSCCL     MSCCL Speedup
1KB           9.54      5.65      1.69x
2KB           9.8       5.7       1.72x
4KB           9.78      5.43      1.80x
8KB           9.78      5.47      1.81x
16KB          10.29     5.53      1.86x
32KB          12.49     5.75      2.17x
64KB          12.87     5.95      2.16x
128KB         13.16     6.38      2.06x
256KB         13.23     7.26      1.82x
512KB         13.39     8.71      1.54x
1MB           18.33     12.3      1.49x
2MB           23.18     17.75     1.31x
4MB           33.66     23.37     1.44x
8MB           44.7      38.54     1.16x
16MB          67.19     67.16     1.00x
32MB          104.7     98.4      1.06x
64MB          192.4     181.9     1.06x
128MB         368.3     348.4     1.06x
256MB         699.5     680.7     1.03x
512MB         1358.6    1339.3    1.01x
1GB           2663.8    2633      1.01x
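
As a reading aid for the speedup column, here is a minimal sketch that assumes the speedup is simply the NCCL latency divided by the MSCCL latency. The helper name `speedup` is hypothetical, and the three rows are copied from the All-Reduce figures above. Because the published latencies are rounded, recomputed ratios can differ slightly from the printed speedups for some rows.

```python
def speedup(nccl_us: float, msccl_us: float) -> float:
    """Ratio of baseline (NCCL) latency to MSCCL latency; above 1.0 means MSCCL is faster."""
    if msccl_us <= 0:
        raise ValueError("latency must be positive")
    return nccl_us / msccl_us


# (message size, NCCL latency in us, MSCCL latency in us) from the All-Reduce table.
rows = [
    ("2KB", 14.39, 7.48),
    ("32KB", 19.3, 9.08),
    ("1MB", 31.94, 22.31),
]

for size, nccl_us, msccl_us in rows:
    print(f"{size}: {speedup(nccl_us, msccl_us):.2f}x")
```

For these three rows the recomputed ratios round to the values printed in the table (1.92x, 2.13x, 1.43x).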

Example

The following steps show how to use two different MSCCL algorithms for AllReduce on Azure NDv4, which has 8xA100 GPUs:

1. Download the source code of msccl and related submodules

$ git clone https://github.com/Azure/msccl.git --recurse-submodules

2. Below are the steps to install the MSCCL executor, using the source tree cloned in step 1:

$ cd msccl/executor/msccl-executor-nccl
$ make -j src.build
$ cd ../
$ cd ../

3. Below are the steps to install msccl-tests-nccl for performance evaluation:

$ cd tests/msccl-tests-nccl/
$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=$HOME/msccl/executor/msccl-executor-nccl/build/ -j
$ cd ../
$ cd ../

4. Apply the MSCCL algorithm using the MSCCL external scheduler

  • For NDv4, an optimized algorithm is already available. You can use the MSCCL scheduler to apply it directly to the executor. Below are the steps to build and install the scheduler:
$ sudo apt-get install libcurl4-openssl-dev nlohmann-json3-dev
$ cd scheduler/msccl-scheduler

For NCCL:
$ CXX=/path/to/nvcc BIN_HOME=/path/to/nccl/binary SRC_HOME=/path/to/nccl/source make
For RCCL:
$ CXX=/path/to/nvcc BIN_HOME=/path/to/nccl/binary SRC_HOME=/path/to/nccl/source make PLATFORM=RCCL

$ make install 
  • To customize the MSCCL algorithm for your system, you can install the MSCCL toolkit and compile a few custom algorithms:
$ git clone https://github.com/microsoft/msccl-tools.git
$ cd msccl-tools/
$ pip install .
$ cd ../
$ python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml
$ cd ../
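
The last toolkit command above redirects the compiler output into test.xml, which then has to be placed in the executor's msccl-algorithms directory. The sketch below is one possible sanity check before copying: it only verifies that the file is well-formed XML and reports the root element, and it makes no assumption about the MSCCL schema. The function name `check_and_install` and the script name in the usage note are hypothetical.

```python
import shutil
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def check_and_install(xml_path, algo_dir):
    """Parse xml_path to confirm it is well-formed XML, then copy it into algo_dir.

    Returns (root tag, root attributes, destination path).
    """
    xml_path = Path(xml_path)
    algo_dir = Path(algo_dir)
    # ET.parse raises xml.etree.ElementTree.ParseError on malformed input,
    # for example an empty file left behind by a failed compiler run.
    root = ET.parse(xml_path).getroot()
    if not algo_dir.is_dir():
        raise FileNotFoundError(f"algorithm directory not found: {algo_dir}")
    target = algo_dir / xml_path.name
    shutil.copy(xml_path, target)
    return root.tag, dict(root.attrib), target


if __name__ == "__main__" and len(sys.argv) == 3:
    tag, attrs, target = check_and_install(sys.argv[1], sys.argv[2])
    print(f"root element <{tag}> with attributes {attrs}")
    print(f"copied to {target}")
```

Saved as, say, check_algo.py, it could be invoked as `python check_algo.py test.xml msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/`, adjusting both paths to your checkout.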

The compiler's output is an XML file (test.xml) that is fed to the MSCCL runtime. To evaluate its performance, copy test.xml to msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/ and run the command in the next step on an Azure NDv4 node or any 8xA100 system.

5. Below is the command to run the test using msccl-executor-nccl:

$ mpirun -np 8 -x LD_LIBRARY_PATH=msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0

6. If everything is installed correctly, you should see the following output in the log:

[0] NCCL INFO Connected 1 MSCCL algorithms
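
When the log is long, the check can be automated. The following is a minimal sketch that searches captured log text for the line shown above and extracts the reported algorithm count. The function name `count_connected_algorithms` is hypothetical, and the pattern assumes the message wording is exactly as printed above; real log lines may carry an additional host or process prefix, which the search tolerates.

```python
import re

# Matches the message shown above anywhere in a line, so a prefix
# before "NCCL INFO" does not prevent a match.
_CONNECTED = re.compile(r"NCCL INFO Connected (\d+) MSCCL algorithms")


def count_connected_algorithms(log_text: str) -> int:
    """Return the largest algorithm count reported in the log, or 0 if the line is absent."""
    counts = [int(m.group(1)) for m in _CONNECTED.finditer(log_text)]
    return max(counts, default=0)


sample_log = "[0] NCCL INFO Connected 1 MSCCL algorithms\n"
print(count_connected_algorithms(sample_log))  # prints 1
```

A result of 0 suggests that no algorithm file was picked up, in which case the location of test.xml and the LD_LIBRARY_PATH setting are the first things to re-check.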

You may evaluate the performance of test.xml by comparing the in-place results (the new algorithm) with the out-of-place results (the default ring algorithm). The in-place run should be up to 2-3x faster on 8xA100 NVLink-interconnected GPUs. The MSCCL toolkit has a rich set of algorithms for different Azure SKUs and collective operations, with significant speedups over vanilla NCCL.
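
The in-place versus out-of-place comparison can be scripted. The sketch below assumes the test binary prints nccl-tests-style rows in which the last eight columns are time, algbw, busbw, and #wrong for out-of-place, followed by the same four for in-place; please confirm this against the header of your own output. The function names are hypothetical, and the sample rows contain made-up placeholder numbers used only to exercise the parser, not measurements.

```python
def parse_perf_line(line):
    """Return (size_bytes, out_of_place_us, in_place_us), or None for non-data lines."""
    fields = line.split()
    # Data rows start with the message size in bytes; headers start with '#'
    # and NCCL debug lines start with other text.
    if len(fields) < 12 or not fields[0].isdigit():
        return None
    try:
        # Assumed layout: ... time algbw busbw #wrong | time algbw busbw #wrong
        return int(fields[0]), float(fields[-8]), float(fields[-4])
    except ValueError:
        return None


def inplace_speedups(output_text):
    """Yield (size_bytes, out_of_place_us / in_place_us) for each data row."""
    for line in output_text.splitlines():
        parsed = parse_perf_line(line)
        if parsed is None:
            continue
        size, oop_us, ip_us = parsed
        if ip_us > 0:
            yield size, oop_us / ip_us


# Placeholder rows with made-up numbers (NOT measurements).
sample_output = """\
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
         128            32     float     sum      -1    20.00    0.01    0.01      0    10.00    0.01    0.02      0
         256            64     float     sum      -1    30.00    0.01    0.01      0    20.00    0.01    0.02      0
"""

for size, ratio in inplace_speedups(sample_output):
    print(f"{size} B: in-place is {ratio:.2f}x faster than out-of-place")
```

Indexing the columns from the end of the row keeps the parser independent of how many descriptive columns precede the timings.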

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit CLA.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.