Release v0.2.1
Highlights
This release expands FMHA compatibility, improves FP8 output support, adds MODEL1(DS V4) support for DSA, optimizes GDN performance, and introduces new mHC and DeepGEMM MQA Logits capabilities.
Starting from this release, wrapper versions are strictly aligned with the MATE version to prevent incompatible package combinations.
What's Changed
FMHA Updates
Added compatibility support for additional FMHA features:
- RoPE
- `k_leftpad`
- `kv_batch_idx`
FP8 SageAttention & FP8 DenseGEMM
Added support for FP8 output and quantization scale outputs.
- Added FP8 output support.
- Added quant scale output support.
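As a rough illustration of what an FP8 output paired with a quantization scale means, the sketch below scales a list of floats with a single per-tensor scale. It assumes the E4M3 format (largest finite magnitude 448) and per-tensor scaling; the release does not state which FP8 format or scaling granularity is used, and `quantize_per_tensor` and `dequantize` are hypothetical helpers, not MATE APIs. Only scaling and clamping are modeled, not rounding to the FP8 grid.

```python
# Illustrative only: per-tensor scaling as commonly used for FP8 (E4M3) outputs.
# Not MATE code; the helper names and the choice of E4M3 are assumptions.
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def quantize_per_tensor(values):
    """Return (scaled_values, scale) such that scaled_values * scale ~= values."""
    amax = max((abs(v) for v in values), default=0.0)
    # Fall back to a scale of 1.0 for all-zero (or empty) input.
    scale = amax / E4M3_MAX if amax > 0.0 else 1.0
    # Clamp into the representable range; rounding to FP8 is not modeled here.
    scaled = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]
    return scaled, scale


def dequantize(scaled, scale):
    """Recover approximate original values from scaled values and their scale."""
    return [q * scale for q in scaled]
```

A consumer of an FP8 output needs the scale to interpret it, which is why the two outputs travel together.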
DeepSeek Sparse Attention
Added MODEL1(DS V4) support for DeepSeek Sparse Attention.
- Added DSA Prefill support for MODEL1(DS V4).
- Added DSA Decode support for MODEL1(DS V4).
GDN Updates
Improved GDN performance and expanded Decode capability.
- Optimized Prefill performance.
- Optimized Decode performance.
- Added MTP support for Decode.
mHC
Added support for TF32 mHC pre-norm.
DeepGEMM MQA Logits
Improved paged MQA Logits support for larger batch sizes.
- Paged MQA Logits now supports larger batch sizes.
- The maximum supported batch size is limited only by shared memory capacity.
Wrapper Updates
Starting from v0.2.1, wrapper versions are strictly aligned with the MATE version to avoid incompatible package combinations.
- Use `mate check` to verify wrapper consistency, including version and commit information.
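The alignment rule can be pictured as an equality check between installed package versions. The sketch below is not how `mate check` is implemented. It compares version strings only (the real command also covers commit information), and `find_mismatched` and `installed_version` are hypothetical helpers introduced for illustration.

```python
from importlib import metadata


def find_mismatched(core_version, wrapper_versions):
    """Return {name: version} for wrappers whose version differs from the core."""
    return {
        name: version
        for name, version in wrapper_versions.items()
        if version != core_version
    }


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

For example, `find_mismatched("0.2.1", {"wrapper-a": "0.2.1", "wrapper-b": "0.2.0"})` returns only the `wrapper-b` entry, where both wrapper names are placeholders.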
DeepGEMM Wrapper
Added new interfaces:
- mHC
- `bf16_gemm_nt`
FlashMLA Wrapper
Added support for MODEL1(DS V4) related input arguments.
Bug Fixes
Fixed the following issues:
- Fixed NaN outputs in Fused MoE Gate under certain scenarios.
- Fixed illegal memory access (IMA) issues in MQA Logits.
- Fixed incorrect FA3 backend selection for Softcap scenarios.