Aluminum GPU-aware Communication Library


Aluminum provides a generic interface to high-performance communication libraries, with a focus on allreduce algorithms. It supports blocking, non-blocking, and GPU-aware algorithms, and includes custom implementations of select algorithms optimized for particular situations.


  • Blocking and non-blocking algorithms
  • GPU-aware algorithms
  • Implementations/interfaces:
    • MPI: MPI and custom algorithms implemented on top of MPI
    • NCCL: Interface to Nvidia's NCCL 2 library
    • MPI-CUDA: Custom GPU-aware algorithms

Getting started


Aluminum requires the following:

  • A compiler that supports at least C++11
  • MPI (at least MPI 3.0)
  • CUDA (at least 9.0, optional if no GPU support is needed)
  • NCCL2 (optional if no NCCL support is needed)


Building requires CMake 3.9 or newer and must be done out of source:

mkdir build && cd build
cmake <options> /path/to/aluminum/source

The required packages are MPI, OpenMP, and HWLOC. MPI and OpenMP are found through the standard CMake packages and can be configured in the usual way. If HWLOC is installed in a nonstandard location, you may need to set HWLOC_DIR to its installation prefix.
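As a hedged illustration of the HWLOC note, the sketch below prints a configure line that points CMake at a nonstandard HWLOC installation. It assumes HWLOC_DIR is passed as a CMake cache variable with -D (if the build reads it from the environment instead, export it before running cmake). The prefix /opt/hwloc and the variable names HWLOC_PREFIX and HWLOC_CONFIGURE are hypothetical.

```shell
# Hypothetical HWLOC installation prefix; replace with the real location.
HWLOC_PREFIX=/opt/hwloc

# Assumption: HWLOC_DIR is accepted as a CMake cache variable via -D.
HWLOC_CONFIGURE="cmake -D HWLOC_DIR=$HWLOC_PREFIX /path/to/aluminum/source"

# Dry run: print the command for inspection instead of executing it.
echo "$HWLOC_CONFIGURE"
```

Once the paths are correct, run the printed command from the build directory.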

The CUDA-based backends assume CUDA is a first-class language in CMake. An alternative CUDA compiler can be selected using CMake's standard mechanism for choosing a language compiler.
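A minimal sketch of selecting a CUDA compiler, assuming the standard CMake variable CMAKE_CUDA_COMPILER is what is intended here. The nvcc path and the names CUDA_COMPILER and CUDA_CONFIGURE are hypothetical placeholders.

```shell
# Hypothetical path to an alternative nvcc; replace with the real one.
CUDA_COMPILER=/usr/local/cuda-9.0/bin/nvcc

# CMAKE_CUDA_COMPILER is CMake's standard variable for the CUDA compiler.
CUDA_CONFIGURE="cmake -D CMAKE_CUDA_COMPILER=$CUDA_COMPILER /path/to/aluminum/source"

# Dry run: print the command for inspection instead of executing it.
echo "$CUDA_CONFIGURE"
```

CMake fixes the compiler on the first configure, so this is best done in a fresh build directory.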


If the NCCL backend is used, the NCCL_DIR variable may be used to point CMake to a nonstandard installation prefix.

For the NCCL backend:


For the MPI-CUDA backend:


The NCCL and MPI-CUDA backends can be combined.

Here is a complete example:
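One possible configure line that enables both GPU backends is sketched below. The option names ALUMINUM_ENABLE_NCCL and ALUMINUM_ENABLE_MPI_CUDA are assumptions that do not appear in this text and should be checked against the project's CMakeLists.txt; NCCL_DIR is the variable named above. The prefix and the names NCCL_PREFIX and FULL_CONFIGURE are hypothetical.

```shell
# Hypothetical NCCL installation prefix; replace with the real location.
NCCL_PREFIX=/path/to/nccl

# Assumption: these two option names are not given in this text;
# confirm them in CMakeLists.txt before relying on them.
FULL_CONFIGURE="cmake -D ALUMINUM_ENABLE_NCCL=YES -D ALUMINUM_ENABLE_MPI_CUDA=YES"

# NCCL_DIR points CMake at a nonstandard NCCL installation.
FULL_CONFIGURE="$FULL_CONFIGURE -D NCCL_DIR=$NCCL_PREFIX"
FULL_CONFIGURE="$FULL_CONFIGURE /path/to/aluminum/source"

# Dry run: print the command for inspection instead of executing it.
echo "$FULL_CONFIGURE"
```

Run the printed command from the build directory after replacing the placeholder paths.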


Tests and benchmarks

The test_correctness binary will check the correctness of Aluminum's allreduce implementations. The usage is

test_correctness [Al backend: MPI, NCCL, MPI-CUDA]

For example, to test the MPI backend:

mpirun -n 128 ./test_correctness

To test the NCCL backend instead:

mpirun -n 128 ./test_correctness NCCL

The benchmark_allreduce benchmark can be run similarly, and will report runtimes for different allreduce algorithms.
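To exercise every backend in one pass, the launch lines can be generated in a loop. The sketch below only prints the commands (a dry run). It assumes benchmark_allreduce takes the backend name in the same position as test_correctness, which the text implies but does not state. The helper launch_cmd and the variable NPROCS are hypothetical names.

```shell
# Number of MPI ranks; 128 matches the examples above.
NPROCS=128

# Print the launch line for one binary ($1) and one backend ($2).
launch_cmd() {
  printf 'mpirun -n %s ./%s %s\n' "$NPROCS" "$1" "$2"
}

# Dry run over the three backends named in the usage line.
# Assumption: benchmark_allreduce accepts the backend the same way.
for backend in MPI NCCL MPI-CUDA; do
  launch_cmd test_correctness "$backend"
  launch_cmd benchmark_allreduce "$backend"
done
```

The NCCL and MPI-CUDA lines are only meaningful when Aluminum was built with those backends enabled.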

API overview

Coming soon...


See also the repository's list of contributors.


Aluminum is licensed under the Apache License 2.0. See LICENSE for details.