cuDNN Frontend v1.16.0 Release Notes

cuDNN Frontend v1.16.0 is the recommended version for cuDNN 9.15.0 and later releases.

New Features 🚀

Open-Source Kernels

This release introduces open-source implementations of commonly requested fused kernels for select architectures (Blackwell). These experimental kernels may require additional dependencies such as CuteDSL. The initial release includes:

These optional dependencies can be installed with pip install nvidia-cudnn-frontend[cutedsl]. Usage examples and detailed documentation are available in the test/python/fe_api directory.
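
Because the CuteDSL dependencies are optional, code that uses the experimental kernels may want to check for them before importing. The following is a minimal sketch using only the Python standard library. The helper has_module is hypothetical, and the module name "cutlass" is an assumption about what the cutedsl extra installs, not something these notes state.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# "cutlass" is an assumed import name for the CuteDSL package; adjust as needed.
if not has_module("cutlass"):
    print("CuteDSL not found; install it with: "
          "pip install 'nvidia-cudnn-frontend[cutedsl]'")
```

The quotes around the package name in the printed command keep shells such as zsh from interpreting the square brackets.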

Please file an issue to request additional kernels or to report bugs.

Enhancements ✨

Scaled Dot-Product Attention (SDPA)

Block Mask Support: Starting with cuDNN 9.14.0, SDPA attributes support block masks, which exclude tiles that do not require computation. Refer to the sample implementation for usage details.
Bug Fix: Resolved an invalid memory access (IMA) in SDPA backward propagation that occurred when s_kv is not a multiple of 128, the padding mask is disabled, and the operation runs in CUDA graph replay mode. The fix is in cuDNN backend version 9.15.1 and later.
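
The three conditions of the bug fix above can be turned into a guard for code that must still run on older backends. This is a sketch, not part of the frontend API: the function name and its parameters are hypothetical, and it conservatively treats every backend older than 9.15.1 as potentially affected, which the notes do not confirm.

```python
def may_hit_sdpa_bwd_ima(backend_version, s_kv, padding_mask, cuda_graph_replay):
    """Return True if a configuration matches the conditions of the fixed IMA.

    backend_version is a (major, minor, patch) tuple, e.g. (9, 15, 0).
    Assumption: all backends before 9.15.1 are treated as affected.
    """
    return (
        tuple(backend_version) < (9, 15, 1)
        and s_kv % 128 != 0          # s_kv is not a multiple of 128
        and not padding_mask         # padding mask is disabled
        and cuda_graph_replay        # running in CUDA graph replay mode
    )
```

A caller could use such a check to fall back to non-graph execution or to enable the padding mask on older backends.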

Matrix Multiplication

CUDA Graph Compatibility: Added the behavior note BehaviorNote_t::CUDNN_BEHAVIOR_NOTE_CUBLASLT_DEPENDENCY, available starting with cuDNN version 9.15.0. It enables filtering of engine configurations (execution plans) that use cuBLAS as a backend.
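
The filtering idea can be illustrated with a small pure-Python model. It does not call cuDNN: the BehaviorNote enum, the PlanInfo record, and the deselect function are hypothetical stand-ins that only mirror the concept of dropping plans which report an unwanted behavior note.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class BehaviorNote(Enum):
    # Hypothetical stand-in for the frontend's BehaviorNote_t values.
    RUNTIME_COMPILATION = auto()
    CUBLASLT_DEPENDENCY = auto()


@dataclass
class PlanInfo:
    # Hypothetical record of an execution plan and the notes it reports.
    name: str
    notes: set = field(default_factory=set)


def deselect(plans, unwanted):
    """Keep only the plans that report none of the unwanted behavior notes."""
    unwanted = set(unwanted)
    return [p for p in plans if not (p.notes & unwanted)]


plans = [
    PlanInfo("engine_a", {BehaviorNote.CUBLASLT_DEPENDENCY}),
    PlanInfo("engine_b"),
    PlanInfo("engine_c", {BehaviorNote.RUNTIME_COMPILATION}),
]
kept = deselect(plans, [BehaviorNote.CUBLASLT_DEPENDENCY])
print([p.name for p in kept])  # prints ['engine_b', 'engine_c']
```

Given the heading, the intended use is presumably to exclude cuBLAS-backed plans when a graph must be captured into a CUDA graph; the notes themselves do not spell this out.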

Additional Improvements

Block Scale Quantization: Added Python bindings for block scale quantize operations (#173). Refer to the sample implementation for usage details.
Dependency Optimization: PyTorch is no longer a required dependency for cuDNN Frontend (#177).
Tensor Alignment: Enhanced the tensor descriptor API to accept alignment as an attribute (#153).
Plan Generation Control: Updated the cudnnGetPlan API to accept an optional maximum plan count parameter, which lets users limit the number of plans that are built and autotuned.

Benchmarking 📊

Updated benchmark/sdpa_benchmark_training/benchmark_single_sdpa.py to use the correct parameter names, and fixed its FLOPS calculations so that the reported performance is accurate.
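
For context, a common convention for counting SDPA FLOPs is sketched below. It is not taken from the benchmark script, whose exact formula these notes do not give. The function names and parameters are hypothetical. The convention counts 2 FLOPs per multiply-add, two matrix multiplications in the forward pass, and five in the backward pass (including the recomputation of Q·K^T). With a causal mask it halves the count, which is an approximation that assumes s_q equals s_kv.

```python
def sdpa_flops(batch, heads, s_q, s_kv, d_qk, d_v=None, causal=False, backward=False):
    """Estimate matmul FLOPs for one SDPA call under a common convention."""
    d_v = d_qk if d_v is None else d_v
    base = 2 * batch * heads * s_q * s_kv  # 2 FLOPs per multiply-add
    if backward:
        # S = Q K^T, dQ = dS K, dK = dS^T Q use d_qk; dV = P^T dO, dP = dO V^T use d_v.
        flops = base * (3 * d_qk + 2 * d_v)
    else:
        # S = Q K^T uses d_qk; O = P V uses d_v.
        flops = base * (d_qk + d_v)
    if causal:
        flops //= 2  # roughly half of the score matrix is masked out
    return flops


def tflops_per_second(flops, seconds):
    """Convert a FLOP count and a measured runtime into TFLOP/s."""
    return flops / seconds / 1e12
```

For example, sdpa_flops(1, 1, 128, 128, 64) returns 4194304, and the same call with backward=True returns 10485760, which is 2.5 times the forward count.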

Resolved Issues 🔧

#153 - Tensor descriptor alignment support
#173 - Block scale quantize Python bindings
#177 - PyTorch dependency removal
