Release v0.5.0 · NVIDIA/cuDecomp

What's Changed

This release includes a number of major updates to cuDecomp. It adds new features that make cuDecomp more flexible for users: more customizable memory orderings by pencil axis via the new transpose_mem_order configuration option, and support for input/output buffer padding in the transpose and halo update APIs. It also improves support for multi-node NVLink (MNNVL) equipped clusters with opt-in support for fabric-allocated cuDecomp workspace memory. Beyond this, the release includes expanded autotuning options and general improvements.

Breaking changes

#60 adds a new padding argument to several cuDecomp APIs: cudecompGetPencilInfo and the cudecompTranspose* and cudecompHaloUpdate* functions. Existing C++ and Fortran code that calls these functions may need to be updated, depending on usage. See #60 and the documentation for more details.
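
To give a rough sense of what per-dimension padding means for buffer sizing, here is a minimal C++ sketch. It does not call cuDecomp. The helper name paddedPencilSize is hypothetical. The sketch assumes halo cells sit on both sides of each dimension and that padding adds extra trailing elements per dimension. Check the cuDecomp documentation and #60 for the exact semantics and for the updated function signatures.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper, NOT part of the cuDecomp API.
// Assumption: each dimension holds the interior extent, plus halo cells on
// both sides, plus trailing padding elements.
// The product is accumulated in 64 bits so large grids do not overflow int.
inline std::int64_t paddedPencilSize(const std::array<std::int32_t, 3>& interior,
                                     const std::array<std::int32_t, 3>& halo_extents,
                                     const std::array<std::int32_t, 3>& padding) {
  std::int64_t size = 1;
  for (int i = 0; i < 3; ++i) {
    const std::int64_t extent = static_cast<std::int64_t>(interior[i]) +
                                2 * static_cast<std::int64_t>(halo_extents[i]) +
                                static_cast<std::int64_t>(padding[i]);
    size *= extent;
  }
  return size;
}
```

With zero halo extents and zero padding the result is the plain product of the interior extents, so code that did not use padding before would pass zeros under this model.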

Deprecations

The Makefile-based build has been removed.

PRs included in this release

Made it possible to include library header from pure C program (#40)
Adding Fortran version of Taylor Green example (#41)
Fix integer overflow issue with C++ TG example for large problems. (#42)
Benchmark updates (#43)
Use unique ID based NVSHMEM initialization method for newer NVSHMEM versions (#44)
Removing Makefile build support and related files. (#45)
Add missing preprocessor guards to fix compilation without NVSHMEM enabled. (#46)
Add small MPI_Alltoall after autotuning to work around MPI memory registration delaying cudaFree. (#47)
Address narrowing conversion errors/warnings. (#48)
Add new transpose_mem_order configuration argument to enable more flexible pencil memory layouts. (#49)
Add opt-in support for fabric-registered workspace allocations via cuMem* APIs. (#50)
Dynamically load CUDA driver functions at runtime. (#51)
Increase buffer size used in post-autotuning MPI_Alltoall. (#52)
Fix integer overflow issue in Fortran poisson example. (#53)
Extend transpose shortcut handling to cases with halos. (#54)
Fix bug in handling of NVSHMEM halo backends from recent change. (#55)
Improve multi-node NVLink topology detection and communication ordering using NVML utilities. (#56)
Fix CUDART_VERSION guard for nvmlDeviceGetGpuFabricInfoV to restrict usage to CUDA >= 12.4. (#57)
Silence messages about NVML symbols failing to load. (#58)
Improve tests (#59)
Preserve original user transpose_mem_order settings after grid descriptor creation. (#61)
Add support for padded input/output buffers in transpose and halo communication routines (#60)
Improvements to batched memcpy kernel implementation. (#62)
Remove redundant axis-contiguous/transpose_mem_order configurations from halo tests. Update axis-contiguous test configurations to not supply transpose_mem_order argument. (#63)
Add Blackwell (cc100) support to default builds when using CUDA 12.8 or newer. (#64)
C++ Taylor Green example updates. (#65)
Add new autotuning options to set per operation halo extent and padding arguments. (#66)

Full Changelog: v0.4.2...v0.5.0
