MSCCL

Microsoft Collective Communication Library (MSCCL) is a platform for executing custom collective communication algorithms on the heterogeneous accelerators supported by Microsoft Azure. MSCCL currently supports NVIDIA and AMD GPUs. The research prototype of this project is microsoft/msccl.

Introduction

The MSCCL vision is to provide a unified, efficient, and scalable framework for executing collective communication algorithms on heterogeneous accelerators. To achieve this, MSCCL has multiple components:

MSCCL toolkit: Interconnects among accelerators have different latencies and bandwidths, so a generic collective communication algorithm does not necessarily perform well for all topologies and buffer sizes. To provide this flexibility, the MSCCL toolkit lets a user write a hyper-optimized collective communication algorithm for a given topology and buffer size. The toolkit contains a high-level DSL (MSCCLang) and a compiler that generates an IR for the MSCCL executor to run on the backend. The Example section shows how the MSCCL toolkit works with the runtime. Please refer to MSCCL toolkit for more information.
MSCCL scheduler: The MSCCL scheduler provides an example design and implementation of how to select optimal MSCCL algorithms for MSCCL executors.
MSCCL executor: The MSCCL executor is a set of libraries responsible for running custom-written collective communication algorithms on heterogeneous accelerators. Each kind of accelerator has a corresponding executor library that is specifically optimized for it. The executor libraries share the same interface for running MSCCL algorithm IR from the MSCCL toolkit and for talking to the MSCCL scheduler. For NVIDIA GPUs, the executor is msccl-executor-nccl, which is built on top of NCCL. For AMD GPUs, it is RCCL, which has already integrated all MSCCL executor features.
MSCCL test toolkit (msccl-tests-nccl): These tests check both the performance and the correctness of MSCCL operations.

Performance

For reference, FP16 All-Reduce and All-Gather algorithms were tested and compared on an ND H100 v5 VM using msccl-tests-nccl.

FP16 All-Reduce Latency (us)				All-Gather Latency (us)
Message Size	NCCL	MSCCL	MSCCL Speedup	Message Size	NCCL	MSCCL	MSCCL Speedup
1KB	13.12	7.50	1.80x	1KB	9.54	5.65	1.69x
2KB	14.39	7.48	1.92x	2KB	9.8	5.7	1.72x
4KB	15.28	7.49	2.04x	4KB	9.78	5.43	1.80x
8KB	15.69	7.67	2.04x	8KB	9.78	5.47	1.81x
16KB	16.64	8.03	2.07x	16KB	10.29	5.53	1.86x
32KB	19.3	9.08	2.13x	32KB	12.49	5.75	2.17x
64KB	20	10.36	1.93x	64KB	12.87	5.95	2.16x
128KB	20.42	11.06	1.85x	128KB	13.16	6.38	2.06x
256KB	20.5	12.86	1.60x	256KB	13.23	7.26	1.82x
512KB	29.89	19.14	1.56x	512KB	13.39	8.71	1.54x
1MB	31.94	22.31	1.43x	1MB	18.33	12.3	1.49x
2MB	37.95	33.43	1.14x	2MB	23.18	17.75	1.31x
4MB	49.28	43.97	1.12x	4MB	33.66	23.37	1.44x
8MB	77.01	68.16	1.13x	8MB	44.7	38.54	1.16x
16MB	116	115.7	1.00x	16MB	67.19	67.16	1.00x
32MB	187.2	186.5	1.00x	32MB	104.7	98.4	1.06x
64MB	317.4	315.7	1.01x	64MB	192.4	181.9	1.06x
128MB	572.5	570.4	1.00x	128MB	368.3	348.4	1.06x
256MB	1079	1075.6	1.00x	256MB	699.5	680.7	1.03x
512MB	2071.1	2067.9	1.00x	512MB	1358.6	1339.3	1.01x
1GB	4028.7	4026.8	1.00x	1GB	2663.8	2633	1.01x
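
The speedup column in the table above appears to be the NCCL latency divided by the MSCCL latency, rounded to two decimals. The sketch below recomputes it for a few All-Reduce rows under that assumption. The function name `speedup` and the choice of rows are illustrative only. The published figures may have been derived from unrounded measurements, so individual rows can differ slightly from a recomputation.

```python
def speedup(nccl_us, msccl_us):
    """Ratio of NCCL latency to MSCCL latency, rounded to two decimals."""
    return round(nccl_us / msccl_us, 2)

# (NCCL latency, MSCCL latency) in microseconds, copied from the All-Reduce columns.
all_reduce_rows = {
    "2KB": (14.39, 7.48),
    "4KB": (15.28, 7.49),
    "16MB": (116.0, 115.7),
}

for size, (nccl_us, msccl_us) in all_reduce_rows.items():
    print(f"{size}: {speedup(nccl_us, msccl_us):.2f}x")
```

A ratio close to 1.00x, as in the 16MB row, means the two implementations take about the same time for that message size.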

Example

The following steps show how to use two different MSCCL algorithms for AllReduce on Azure NDv4, which has 8xA100 GPUs:

1. Download the source code of msccl and related submodules

$ git clone https://github.com/Azure/msccl.git --recurse-submodules

2. Build the MSCCL executor from the repository cloned in step 1:

$ cd msccl/executor/msccl-executor-nccl
$ make -j src.build
$ cd ../
$ cd ../

3. Build msccl-tests-nccl for performance evaluation:

$ cd tests/msccl-tests-nccl/
$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=$HOME/msccl/executor/msccl-executor-nccl/build/ -j
$ cd ../
$ cd ../

4. Apply an MSCCL algorithm using the MSCCL external scheduler

For NDv4, an optimized algorithm is already available, and the MSCCL scheduler can apply it directly to the executor. Build and install the scheduler as follows:

$ sudo apt-get install libcurl4-openssl-dev nlohmann-json3-dev
$ cd scheduler/msccl-scheduler

For NCCL:
$ CXX=/path/to/nvcc BIN_HOME=/path/to/nccl/binary SRC_HOME=/path/to/nccl/source make
For RCCL:
$ CXX=/path/to/nvcc BIN_HOME=/path/to/nccl/binary SRC_HOME=/path/to/nccl/source make PLATFORM=RCCL

$ make install

To customize MSCCL algorithms for your system, install the MSCCL toolkit and compile a few custom algorithms:

$ git clone https://github.com/microsoft/msccl-tools.git
$ cd msccl-tools/
$ pip install .
$ cd ../
$ python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml
$ cd ../

The compiler generates an XML file (test.xml) that is fed to the MSCCL runtime. To evaluate its performance, copy test.xml to msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/ and run the command in step 5 on an Azure NDv4 node or any 8xA100 system.

5. Run the test using msccl-executor-nccl:

$ mpirun -np 8 -x LD_LIBRARY_PATH=msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0
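
If you want to launch the same test from a script (for example, to sweep message sizes), the command above can be assembled as an argument list. This is a minimal sketch, not part of MSCCL: the helper name `build_all_reduce_cmd` and its parameters are hypothetical, and the flags are copied unchanged from the command above. The paths assume the executor and tests were built as in steps 2 and 3.

```python
import os
import shlex

def build_all_reduce_cmd(lib_dir, test_bin, ranks=8, min_size="128", max_size="32MB"):
    """Assemble the mpirun invocation from step 5 as an argument list (hypothetical helper)."""
    # Prepend the executor's lib directory so the MSCCL-enabled NCCL build is loaded first.
    existing = os.environ.get("LD_LIBRARY_PATH", "")
    ld_path = f"{lib_dir}:{existing}" if existing else lib_dir
    return [
        "mpirun", "-np", str(ranks),
        "-x", f"LD_LIBRARY_PATH={ld_path}",
        "-x", "NCCL_DEBUG=INFO",
        "-x", "NCCL_DEBUG_SUBSYS=INIT,ENV",
        test_bin,
        "-b", min_size, "-e", max_size, "-f", "2",
        "-g", "1", "-c", "1", "-n", "100", "-w", "100", "-G", "100", "-z", "0",
    ]

cmd = build_all_reduce_cmd(
    "msccl/executor/msccl-executor-nccl/build/lib/",
    "tests/msccl-tests-nccl/build/all_reduce_perf",
)
# Print the command for inspection; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell quoting problems when the paths contain spaces.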

6. If everything is installed correctly, you should see the following output in the log:

[0] NCCL INFO Connected 1 MSCCL algorithms
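
When the log is long, it can be easier to check for this line programmatically. The sketch below scans captured log text for the message shown above and returns the reported algorithm count. The function name `count_msccl_algorithms` is hypothetical, and the pattern assumes the log line has exactly the form shown above, which may vary between versions.

```python
import re

# Pattern modeled on the log line shown above; the exact wording is an assumption.
MSCCL_LINE = re.compile(r"NCCL INFO Connected (\d+) MSCCL algorithms")

def count_msccl_algorithms(log_text):
    """Return the largest algorithm count reported in the log, or 0 if no line matches."""
    counts = [int(m.group(1)) for m in MSCCL_LINE.finditer(log_text)]
    return max(counts, default=0)

sample_log = "[0] NCCL INFO Connected 1 MSCCL algorithms\n"
print(count_msccl_algorithms(sample_log))
```

A result of 0 suggests that no algorithm file was picked up, for example because test.xml was not copied into the msccl-algorithms directory.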

You can evaluate the performance of test.xml by comparing the in-place results (the new algorithm) with the out-of-place results (the default ring algorithm). The new algorithm should be up to 2-3x faster on 8xA100 NVLink-interconnected GPUs. The MSCCL toolkit has a rich set of algorithms for different Azure SKUs and collective operations, with significant speedups over vanilla NCCL.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit CLA.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
