v0.2
New features/changes:
- Host-transfer implementations of standard collectives in the MPI-CUDA backend: AllGather, AllToAll, Broadcast, Gather, Reduce, ReduceScatter, and Scatter.
- Progress engine is now aware of separate compute streams. This enables better scheduling of non-interfering operations.
- Experimental RMA Put/Get operations.
- Improved Aluminum algorithm specification.
- Non-blocking point-to-point operations.
- Improved testing and benchmarks.
- Bugfixes and performance improvements.