- Host-transfer implementations of standard collectives in the MPI-CUDA backend: AllGather, AllToAll, Broadcast, Gather, Reduce, ReduceScatter, and Scatter.
- The progress engine is now aware of separate compute streams, enabling better scheduling of non-interfering operations.
- Experimental RMA Put/Get operations.
- Improved Aluminum algorithm specification.
- Non-blocking point-to-point operations.
- Improved testing and benchmarks.
- Bugfixes and performance improvements.
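Non-blocking operations generally follow a request/wait pattern: the call returns immediately and the caller synchronizes later, so independent compute can overlap with communication. Below is a minimal local sketch of that pattern, using `std::future` as a stand-in for a communication request. This is purely illustrative and is not Aluminum's actual API; the function names are invented for the example.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Stand-in for starting a non-blocking reduction: the work runs
// asynchronously and the caller holds a "request" (here, a future).
std::future<double> nonblocking_reduce(std::vector<double> data) {
  return std::async(std::launch::async, [data = std::move(data)] {
    return std::accumulate(data.begin(), data.end(), 0.0);
  });
}

double overlapped_sum(std::vector<double> data) {
  auto req = nonblocking_reduce(std::move(data));  // start the operation
  // ... independent computation could proceed here while it runs ...
  return req.get();  // the wait: block only when the result is needed
}
```

The point of the pattern is that the time between starting the operation and waiting on it is available for unrelated work, which is also what stream-aware scheduling in the progress engine exploits.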
Aluminum provides a generic interface to high-performance communication libraries, with a focus on allreduce algorithms. Blocking, non-blocking, and GPU-aware algorithms are supported. Aluminum also contains custom implementations of select algorithms, optimized for specific situations.
- Blocking and non-blocking algorithms
- GPU-aware algorithms
- MPI: MPI and custom algorithms implemented on top of MPI
- NCCL: Interface to NVIDIA's NCCL 2 library
- MPI-CUDA: Custom GPU-aware algorithms
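As background on the allreduce operation these backends provide: every rank contributes a buffer, and every rank receives the elementwise reduction (e.g. the sum) of all contributions. The sketch below simulates the classic ring allreduce (a reduce-scatter phase followed by an allgather phase) entirely in local memory; it models the per-step chunk exchanges rather than using a real transport, and makes no claim about which algorithm any particular Aluminum backend uses.

```cpp
#include <vector>

using Buf = std::vector<double>;

// Simulated ring allreduce (sum) over bufs.size() "ranks". Each buffer
// is split into one chunk per rank, so its length must be divisible by
// the rank count. Returns the buffers after the allreduce completes.
std::vector<Buf> ring_allreduce(std::vector<Buf> bufs) {
  const int n = static_cast<int>(bufs.size());
  const int chunk = static_cast<int>(bufs[0].size()) / n;
  // Phase 1, reduce-scatter: after n-1 steps, rank r holds the fully
  // reduced chunk (r + 1) % n.
  for (int s = 0; s < n - 1; ++s) {
    auto prev = bufs;  // snapshot: all ranks exchange simultaneously
    for (int r = 0; r < n; ++r) {
      const int src = (r - 1 + n) % n;  // ring neighbor sending to r
      const int c = (src - s + n) % n;  // chunk index in flight
      for (int i = 0; i < chunk; ++i)
        bufs[r][c * chunk + i] += prev[src][c * chunk + i];
    }
  }
  // Phase 2, allgather: circulate the reduced chunks around the ring
  // until every rank has all of them.
  for (int s = 0; s < n - 1; ++s) {
    auto prev = bufs;
    for (int r = 0; r < n; ++r) {
      const int src = (r - 1 + n) % n;
      const int c = (src + 1 - s + n) % n;
      for (int i = 0; i < chunk; ++i)
        bufs[r][c * chunk + i] = prev[src][c * chunk + i];
    }
  }
  return bufs;
}
```

For inputs `{1,2,3}`, `{4,5,6}`, `{7,8,9}`, every simulated rank ends with the elementwise sum `{12,15,18}`. The ring structure is what makes this class of algorithm bandwidth-optimal: each rank sends roughly `2*(n-1)/n` times the buffer size regardless of the number of ranks.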