- Host-transfer implementations of standard collectives in the MPI-CUDA backend: AllGather, AllToAll, Broadcast, Gather, Reduce, ReduceScatter, and Scatter.
- The progress engine is now aware of separate compute streams, which enables better scheduling of non-interfering operations.
- Experimental RMA Put/Get operations.
- Improved Aluminum algorithm specification.
- Non-blocking point-to-point operations.
- Improved testing and benchmarks.
- Bug fixes and performance improvements.