v1.24.1 release

@Anerudhan Anerudhan released this 08 Jun 19:04
de41880

cuDNN Frontend v1.24.1 Release Notes

cuDNN Frontend v1.24.1 is the recommended version for cuDNN 9.23.0 and later releases.

General Improvements 🚀 🚀

Updates to Graph API

  • Rotary Position Embedding (RoPE) is now available as a cuDNN operation, usable both standalone and as a preprocessing stage for the SDPA engine. See the sample for usage. RoPE fusion with SDPA requires cuDNN 9.24.0.
  • SDPA backward now supports hidden dimension d=256. Requires cuDNN 9.23.0 or later.
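
For readers unfamiliar with RoPE, here is a minimal pure-Python sketch of the rotation applied to a single vector. It is not the cuDNN implementation or its API: the function name `rope`, the `base` value of 10000, and the half-split pairing of elements (element i is rotated together with element i + d/2) are assumptions made for illustration. Other implementations pair adjacent elements instead.

```python
import math


def rope(x, position, base=10000.0):
    """Apply a rotary position embedding to one vector (illustrative only).

    Assumptions: half-split pairing (x[i] rotates with x[i + d/2]) and the
    common frequency schedule base ** (-2i / d).
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        # Rotation angle grows with the token position, shrinks with i.
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# Attention logits depend only on the relative offset (2 in both cases).
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
```

Because each pair is rotated by an angle proportional to the position, the dot product of a rotated query and a rotated key depends only on the distance between the two positions, which is why RoPE is applied to Q and K ahead of the attention scores.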

Open-Source Kernels 🚀 🚀

  • Introduced a DSA module featuring the following DSA/CSA kernels for DsV4:
    • Indexer Forward: CuTe-DSL score kernel (Q @ Kᵗ, ReLU, head reduce, ratio causal mask). Non-fused; pair with Indexer Top-K for the top-K stage.
    • Indexer Top-K: SM100 CuTe-DSL radix top-K kernel with per-row seq_lens.
    • Sparse Attention Backward: DSA backward (FlashMLA-shape, SM90/SM100).
    • Sparse Indexer / Attention Score Recompute: Sparse (top-K) recomputation of indexer and attention scores for training loss.
    • Dense Indexer / Attention Score Recompute: Dense (full-KV) analogues of the above.
    • Indexer Backward: Three-stage pipeline (score-grad, three GEMMs, dtype cast) for sparse top-K score tensors.
    • Dense Indexer Backward: Full-KV counterpart of Indexer Backward.
  • Grouped GEMM GLU forward kernel with fused Hadamard transform.
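
As a rough reference for what the Indexer Forward and Indexer Top-K stages compute, the following pure-Python sketch mirrors the steps named above (Q @ Kᵗ, ReLU, head reduce, ratio causal mask, then per-row top-K). It is not the CuTe-DSL kernel code. The function names, the tensor layouts, the unweighted sum used for the head reduce, and the exact masking rule are all assumptions, and a heap-based selection stands in for the radix top-K used by the real kernel.

```python
import heapq

NEG_INF = float("-inf")


def indexer_scores(q, k, ratio=1):
    """Score every (query, key) pair. Hypothetical reference, not the kernel.

    q: [T][H][D] per-head queries, k: [S][D] keys shared across heads.
    Returns a [T][S] list of scores; masked entries are -inf.
    """
    scores = []
    for t, q_heads in enumerate(q):
        row = []
        for s, k_vec in enumerate(k):
            # Assumed "ratio causal mask": key s summarizes `ratio` source
            # tokens and is visible only once that block ends at or before t.
            if (s + 1) * ratio > t + 1:
                row.append(NEG_INF)
                continue
            total = 0.0
            for q_vec in q_heads:
                logit = sum(a * b for a, b in zip(q_vec, k_vec))
                total += max(logit, 0.0)  # ReLU, then unweighted head reduce
            row.append(total)
        scores.append(row)
    return scores


def top_k_indices(scores, k_top, seq_lens):
    """Pick the k_top best key indices per row, limited to seq_lens[row]."""
    picked = []
    for row, n in zip(scores, seq_lens):
        valid = range(min(n, len(row)))
        picked.append(heapq.nlargest(k_top, valid, key=row.__getitem__))
    return picked


q = [[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]]]  # T=3, H=1, D=2
k = [[2.0, 0.0], [-1.0, 3.0], [0.5, 0.5]]       # S=3, D=2
scores = indexer_scores(q, k, ratio=1)
selected = top_k_indices(scores, k_top=2, seq_lens=[1, 2, 3])
```

Keeping the score and top-K stages separate matches the non-fused design described above: the score tensor is materialized first and the selection runs over it afterwards. Note that in this sketch a row with fewer visible keys than `k_top` can still return masked indices, so a caller would need to filter on the score value.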

Skills

  • Added a new Claude skill for converting CuTe-DSL kernels into experimental cuDNN APIs.

Enhancements

  • Noisy logging messages are now emitted only once per process.
  • Convolution problems are now rejected when total filter size exceeds INT32_MAX.
  • Support for ragged input order has been added for grouped GEMM weight gradients.
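
The convolution size check can be pictured with the short sketch below. It assumes that "total filter size" means the element count of the filter tensor (the product of its dimensions); the release note does not say whether elements or bytes are counted, and the helper name `filter_fits_int32` is hypothetical rather than part of the cuDNN Frontend API.

```python
import math

INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def filter_fits_int32(filter_dims):
    """Return True if the filter's element count fits in a signed int32.

    Assumption: size is measured in elements, i.e. the product of the dims.
    """
    return math.prod(filter_dims) <= INT32_MAX


small = filter_fits_int32([64, 64, 3, 3])        # 36,864 elements
large = filter_fits_int32([65536, 32768, 1, 1])  # 2**31 elements
```

A filter like the second one overflows 32-bit index arithmetic, so rejecting the problem up front is safer than letting a kernel compute wrapped offsets.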

Bug Fixes

  • Fixed an issue in the reshape operator when called with 1D tensors.
  • Fixed missing square_alpha scaling in dgeglu and dswiglu.
  • Fixed a race condition in lazy variant-pack-template preparation observed in some multi-threaded scenarios.

New Samples

Acknowledgements

The Native Sparse Attention forward-prop kernels, supporting head dim = 128 and optimized for the Blackwell architecture, were implemented in CuTe-DSL.

These kernels were a collaborative effort, jointly developed by: Jie Feng, Akash Mehra, Vincent Zhang, Dominik Ernst, Xinbo Zhao, Aditya Vavre, Vedaanta Agarwalla, Mingyang Wang, Anerudhan Gopal, Paul Springer, Yang Xu, and Nima Tajbakhsh.