v1.19.0-release
cuDNN Frontend v1.19.0 Release Notes
cuDNN Frontend v1.19.0 is the recommended version for cuDNN 9.19.1 and later releases.
Open-Source Kernels 🚀 🚀
- Blackwell and Hopper SDPA Fprop Kernels: cuDNN's SDPA Fprop implementation is now open source. This kernel supports causal masking and outputs stats for use in bprop. Additional kernels will be added in future releases.
- Grouped GEMM + dSwiGLU Fusion: A contiguous grouped block-scaled GEMM fused with a dSwiGLU backward epilogue on NVIDIA Blackwell GPUs (SM100+), designed for MoE (Mixture of Experts) workloads.
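As a rough illustration of the math behind the Grouped GEMM + dSwiGLU fusion, the following is a minimal unfused reference in pure Python. It is a sketch under stated assumptions, not the cuDNN kernel: block scaling is omitted, SwiGLU is taken to be `silu(gate) * up`, and every function name and the data layout here are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swiglu(gate, up):
    """Forward: silu(gate) * up for a single element."""
    return gate * sigmoid(gate) * up

def dswiglu(gate, up, grad_out):
    """Backward of swiglu for a single element: returns (d_gate, d_up)."""
    s = sigmoid(gate)
    d_silu = s * (1.0 + gate * (1.0 - s))  # derivative of gate * sigmoid(gate)
    return grad_out * up * d_silu, grad_out * gate * s

def matmul(a, b):
    """Plain matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def grouped_gemm_dswiglu(grad_rows, group_sizes, weights, gate, up):
    """Unfused reference: per-expert GEMM followed by a dSwiGLU epilogue.

    grad_rows   : [tokens][k] upstream gradients, rows sorted by expert (assumed layout)
    group_sizes : number of contiguous rows owned by each expert
    weights     : one [k][n] matrix per expert
    gate, up    : [tokens][n] activations saved from the forward pass
    """
    d_gate, d_up, start = [], [], 0
    for size, w in zip(group_sizes, weights):
        rows = slice(start, start + size)
        grad_act = matmul(grad_rows[rows], w)  # gradient w.r.t. the SwiGLU output
        for g_row, u_row, d_row in zip(gate[rows], up[rows], grad_act):
            pairs = [dswiglu(g, u, d) for g, u, d in zip(g_row, u_row, d_row)]
            d_gate.append([p[0] for p in pairs])
            d_up.append([p[1] for p in pairs])
        start += size
    return d_gate, d_up
```

A fused kernel would apply the epilogue while the GEMM result is still on chip instead of materializing `grad_act`; the reference above only pins down the expected values.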
General Improvements 🚀
- Removed multiple device queries for SM version during graph validation and replaced them with a single query that can be skipped by setting sm_version on the cuDNN graph.
- Fixed an issue where enabling logging with CUDA graphs in certain scenarios would cause a crash.
- Significantly reduced the CPU overhead of the cuDNN OSS API by using tvm-ffi.
- Added a new cudnn-repro tool that generates a standalone reproducer from cuDNN frontend logs.
Enhancements ✨
Scaled Dot-Product Attention (SDPA)
- Support Checks: Improved support checks for cleaner support surface queries.
- New API: Added Python bindings for the score-mod bprop function to enable the score bprop API.
- Stats: Support independent generation of SDPA stats (LSE, SE, Max) in SDPA Fprop (requires cuDNN 9.20.0 or later).
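To make the stats concrete, here is a minimal pure-Python reference for a single attention head. It is only a sketch: it assumes SE means the per-row sum of exponentials taken relative to the row max, that LSE is the log-sum-exp of the scaled scores, and that the causal mask is top-left aligned with equal query and key lengths. The function name and the returned dictionary are hypothetical and are not part of the cuDNN API.

```python
import math

def sdpa_fprop_with_stats(q, k, v, causal=True):
    """Reference SDPA forward for one head; q, k, v are lists of row vectors.

    Returns (out, stats), where stats holds the per-query-row max,
    sum of exponentials, and log-sum-exp of the scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out, row_max, sum_exp, lse = [], [], [], []
    for i, q_row in enumerate(q):
        # Causal masking: query i only attends to keys 0..i.
        n_keys = i + 1 if causal else len(k)
        scores = [scale * sum(a * b for a, b in zip(q_row, k[j])) for j in range(n_keys)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # shifted by the max for stability
        total = sum(exps)
        out.append([sum(e * v[j][c] for j, e in enumerate(exps)) / total
                    for c in range(len(v[0]))])
        row_max.append(m)
        sum_exp.append(total)
        lse.append(m + math.log(total))
    return out, {"max": row_max, "sum_exp": sum_exp, "lse": lse}
```

With the LSE saved, a backward pass can recompute each softmax probability as `exp(score - lse)` without storing the full attention matrix, which is why forward kernels typically emit these stats.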
Normalization
- More Benchmarks: New normalization benchmark results posted for GB200, GB300, and H200.
Benchmarking 📊
- Updated the benchmark results for the SDPA improvements added in cuDNN 9.19.1.