v1.16.0-release
cuDNN Frontend v1.16.0 Release Notes
cuDNN Frontend v1.16.0 is the recommended version for cuDNN 9.15.0 and later releases.
New Features 🚀
Open-Source Kernels
This release introduces open-source implementations of commonly requested fused kernels for select architectures (Blackwell). These experimental kernels may require additional dependencies such as CuteDSL. The initial release includes:
These optional dependencies can be installed with pip install nvidia-cudnn-frontend[cutedsl]. Usage examples and detailed documentation are available in the test/python/fe_api directory.
Please open an issue to request additional kernels or to report bugs.
Enhancements ✨
Scaled Dot-Product Attention (SDPA)
- Block Mask Support: Starting with cuDNN 9.14.0, SDPA attributes now support block masks to exclude tiles that do not require computation. Refer to the sample implementation for usage details.
- Bug Fix: Resolved an invalid memory access (IMA) issue in SDPA backward propagation (fixed in cuDNN backend version 9.15.1 and later) that occurred when s_kv is not a multiple of 128, padding mask is disabled, and operations are performed in CUDA graph replay mode.
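To illustrate the idea behind block masks, the following is a minimal conceptual sketch in plain Python. It does not use the cuDNN Frontend API: the function name `build_block_mask`, the tile sizes, and the boolean list-of-lists layout are all assumptions made for illustration, and the layout cuDNN actually expects should be taken from the sample implementation. The sketch marks a tile as active only if at least one (query, key) pair inside it must be computed, so fully masked tiles can be skipped.

```python
def build_block_mask(s_q, s_kv, tile_q, tile_kv, keep):
    """Return one boolean per (query tile, key/value tile).

    keep(i, j) says whether query position i may attend to key position j.
    A tile is True if any element inside it needs computation.
    NOTE: hypothetical helper for illustration, not a cuDNN API.
    """
    n_q = -(-s_q // tile_q)      # ceiling division: number of query tiles
    n_kv = -(-s_kv // tile_kv)   # ceiling division: number of key/value tiles
    mask = []
    for bq in range(n_q):
        q_range = range(bq * tile_q, min((bq + 1) * tile_q, s_q))
        row = []
        for bk in range(n_kv):
            k_range = range(bk * tile_kv, min((bk + 1) * tile_kv, s_kv))
            row.append(any(keep(i, j) for i in q_range for j in k_range))
        mask.append(row)
    return mask


def causal(i, j):
    # Causal attention: a query may only attend to keys at or before it.
    return j <= i


block_mask = build_block_mask(s_q=8, s_kv=8, tile_q=4, tile_kv=4, keep=causal)
skipped = sum(not active for row in block_mask for active in row)
print(block_mask, "tiles skipped:", skipped)
```

With a causal pattern and two tiles per dimension, only the upper-right tile is fully masked, so one of the four tiles can be excluded from computation.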
Matrix Multiplication
- CUDA Graph Compatibility: Added BehaviorNote_t::CUDNN_BEHAVIOR_NOTE_CUBLASLT_DEPENDENCY as a behavior note. This enables filtering of engine configurations (execution plans) that use cuBLAS as a backend, available starting with cuDNN version 9.15.0.
Additional Improvements
- Block Scale Quantization: Added Python bindings for block scale quantize operations (#173). Refer to the sample implementation for usage details.
- Dependency Optimization: PyTorch is no longer a required dependency for cuDNN Frontend (#177).
- Tensor Alignment: Enhanced tensor descriptor API to accept alignment as an attribute (#153).
- Plan Generation Control: Updated cudnnGetPlan API to accept an optional maximum plan count parameter, enabling users to limit the number of plans built and autotuned.
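Because PyTorch is now optional, scripts that previously assumed it was installed alongside cuDNN Frontend may want to guard their own PyTorch usage. The following is a small sketch of one way to do that with the standard library only; the helper name `has_module` and the `HAS_TORCH` flag are illustrative assumptions, not part of cuDNN Frontend.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical flag: user code can branch on this instead of importing
# torch unconditionally.
HAS_TORCH = has_module("torch")

if HAS_TORCH:
    print("PyTorch found: torch tensors can be used as buffers.")
else:
    print("PyTorch not found: fall back to another array library.")
```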
Benchmarking 📊
- Updated benchmark/sdpa_benchmark_training/benchmark_single_sdpa.py to use the correct parameter names, and fixed its FLOPS calculations so that performance measurements are accurate.
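For context on what an SDPA FLOPS calculation typically involves, here is a hedged sketch of a commonly used counting convention. It is not taken from benchmark_single_sdpa.py, whose exact formula is not stated in these notes. The function names and parameters are assumptions, softmax and masking arithmetic are ignored, and the backward count includes recomputing the score matrix.

```python
def sdpa_flops(batch, heads, s_q, s_kv, d_qk, d_v, causal=False, backward=False):
    """Estimate matmul FLOPs for scaled dot-product attention.

    Each multiply-accumulate is counted as 2 FLOPs.
    Forward:  Q @ K^T (d_qk) and P @ V (d_v).
    Backward: recompute Q @ K^T, dQ and dK (d_qk each), dV and dP (d_v each).
    NOTE: illustrative convention, not the benchmark script's formula.
    """
    per_pair = (3 * d_qk + 2 * d_v) if backward else (d_qk + d_v)
    flops = 2 * batch * heads * s_q * s_kv * per_pair
    if causal:
        flops //= 2  # approximation: about half of the score matrix is masked
    return flops


def tflops_per_second(flops, seconds):
    """Convert a FLOP count and a measured runtime into TFLOP/s."""
    return flops / seconds / 1e12


fwd = sdpa_flops(batch=1, heads=1, s_q=128, s_kv=128, d_qk=64, d_v=64)
bwd = sdpa_flops(batch=1, heads=1, s_q=128, s_kv=128, d_qk=64, d_v=64, backward=True)
print(fwd, bwd, bwd / fwd)
```

When d_qk equals d_v, this convention gives a backward pass that costs 2.5 times the forward pass, which matches the ratio often used in attention benchmarks.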