Aluminum provides a generic interface to high-performance communication libraries, with a focus on allreduce algorithms. Blocking and non-blocking algorithms and GPU-aware algorithms are supported. Aluminum also contains custom implementations of select algorithms to optimize for certain situations.
- Blocking and non-blocking algorithms
- GPU-aware algorithms
- MPI: MPI and custom algorithms implemented on top of MPI
- NCCL: Interface to Nvidia's NCCL 2 library
- MPI-CUDA: Custom GPU-aware algorithms
- A compiler that supports at least C++11
- MPI (at least MPI 3.0)
- CUDA (at least 9.0, optional if no GPU support is needed)
- NCCL2 (optional if no NCCL support is needed)
CMake 3.9 or newer is required. An out-of-source build is required:
mkdir build && cd build && cmake <options> /path/to/aluminum/source
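Because an in-source build is not supported, a wrapper script may want to check the directories before configuring. The following is a minimal sketch of such a check; is_out_of_source is a hypothetical helper and is not part of Aluminum.

```shell
# Hypothetical helper (not part of Aluminum): succeeds only when the build
# directory resolves to a different location than the source directory.
# Usage: is_out_of_source <source_dir> <build_dir>
is_out_of_source() {
    # Resolve both paths so that "." or symlinks do not hide an in-source build.
    src=$(cd "$1" && pwd -P) || return 2
    bld=$(cd "$2" && pwd -P) || return 2
    [ "$src" != "$bld" ]
}
```

It could be called from inside the build directory, for instance as is_out_of_source /path/to/aluminum/source . before invoking cmake.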
The required packages are MPI, HWLOC, and OpenMP. MPI and OpenMP use the standard CMake packages and can be located in the standard CMake way (for example, via CMAKE_PREFIX_PATH). HWLOC, if installed in a nonstandard location, may require HWLOC_DIR to be set to the appropriate installation prefix.
The CUDA-based backends assume CUDA is a first-class language in CMake. An alternative CUDA compiler can be selected using the CMAKE_CUDA_COMPILER variable. If the NCCL backend is used, the NCCL_DIR variable may be used to point CMake to a nonstandard installation prefix.
The NCCL and MPI-CUDA backends can be combined.
Here is a complete example:
CMAKE_PREFIX_PATH=/path/to/your/MPI:$CMAKE_PREFIX_PATH cmake -D ALUMINUM_ENABLE_NCCL=YES -D ALUMINUM_ENABLE_MPI_CUDA=YES -D NCCL_DIR=/path/to/NCCL ..
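When the set of backends varies between machines, the configure line can be assembled from a few switches. The sketch below only prints the command rather than running it. The function name al_configure_cmd and the shell variables WITH_NCCL, WITH_MPI_CUDA, and CUDA_COMPILER are hypothetical; the CMake options themselves are the ones named above. Paths containing spaces are not handled.

```shell
# Hypothetical helper: print a cmake configure command for Aluminum.
# Usage: al_configure_cmd <source_dir>
# Reads WITH_NCCL, NCCL_DIR, WITH_MPI_CUDA and CUDA_COMPILER from the shell.
al_configure_cmd() {
    src_dir=$1
    cmd="cmake"
    if [ "${WITH_NCCL:-NO}" = "YES" ]; then
        cmd="$cmd -D ALUMINUM_ENABLE_NCCL=YES"
        # Only pass NCCL_DIR when a nonstandard prefix was given.
        if [ -n "${NCCL_DIR:-}" ]; then
            cmd="$cmd -D NCCL_DIR=$NCCL_DIR"
        fi
    fi
    if [ "${WITH_MPI_CUDA:-NO}" = "YES" ]; then
        cmd="$cmd -D ALUMINUM_ENABLE_MPI_CUDA=YES"
    fi
    # Optional alternative CUDA compiler.
    if [ -n "${CUDA_COMPILER:-}" ]; then
        cmd="$cmd -D CMAKE_CUDA_COMPILER=$CUDA_COMPILER"
    fi
    printf '%s %s\n' "$cmd" "$src_dir"
}
```

Printing the command first makes it easy to inspect before running it from the build directory.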
Tests and benchmarks
The test_correctness binary checks the correctness of Aluminum's allreduce implementations. Its usage is:
test_correctness [Al backend: MPI, NCCL, MPI-CUDA]
For example, to test the MPI backend:
mpirun -n 128 ./test_correctness
To test the NCCL backend, instead:
mpirun -n 128 ./test_correctness NCCL
The benchmark_allreduce benchmark can be run similarly and reports runtimes for different allreduce algorithms.
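To exercise several backends in one go, the invocations above can be generated in a loop. This is a minimal sketch that prints one mpirun command per backend; correctness_cmds is a hypothetical helper name, and the accepted backend names are taken from the usage line above.

```shell
# Hypothetical helper: print one test_correctness command per backend.
# Usage: correctness_cmds <num_procs> <backend>...
correctness_cmds() {
    nprocs=$1
    shift
    for backend in "$@"; do
        case $backend in
            MPI|NCCL|MPI-CUDA)
                printf 'mpirun -n %s ./test_correctness %s\n' "$nprocs" "$backend"
                ;;
            *)
                # Reject names that the usage line does not list.
                printf 'unknown backend: %s\n' "$backend" >&2
                return 1
                ;;
        esac
    done
}
```

The printed commands can be reviewed and then piped to a shell, for example correctness_cmds 128 MPI NCCL | sh, on a system where the corresponding backends were enabled at build time.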
See also contributors.
Aluminum is licensed under the Apache License 2.0. See LICENSE for details.