Highlights
This release improves documentation, expands FMHA Forward with QV and FP8 support, introduces a TileLang-based FMHA Backward implementation for large head dimensions, and adds the Guard Allocator for memory debugging and the FlashKDA wrapper for KDA Prefill. It also improves the performance of Paged MQA Logits and GDN Decode.
What's Changed
Documentation
Improved MATE documentation with clearer usage guides and tutorials.
- Enhanced documentation structure and usability.
- Added more comprehensive tutorials and examples.
FMHA Updates
Expanded FMHA functionality and improved runtime performance.
FMHA Forward:
- Added QV support.
- Added FP8 support.
  - FP8 performance optimizations require an upcoming compiler release.
- Improved workload balancing and partitioning in selected scenarios.
FMHA Backward:
- Added a TileLang-based implementation for HeadDim 256-256.
DeepGEMM Updates
Added new DeepGEMM implementations and improved performance.
- Added MUTLASS-based FP8 DeepGEMM implementation.
- Added MUTLASS-based BF16 DeepGEMM implementation.
- Improved Paged MQA Logits performance.
GDN Updates
- Improved GDN Decode performance.
Memory Debugging
Added Guard Allocator for debugging memory-related issues.
- Helps identify and diagnose illegal memory access problems.
- Intended for debugging and validation workflows.
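For readers unfamiliar with the general idea, the sketch below illustrates the guard-region technique that debugging allocators of this kind commonly use: the payload is padded on both sides with a known byte pattern, and a later check reports any guard byte that was overwritten. This is a conceptual illustration in plain Python, not MATE's Guard Allocator; the names guarded_alloc, check_guards, GUARD, and GUARD_BYTES are hypothetical, and the actual implementation may work differently.

```python
# Conceptual sketch of guard regions; all names here are hypothetical
# and are not part of the MATE API.
GUARD = 0xA5        # canary byte written into the guard regions
GUARD_BYTES = 16    # size of each guard region, in bytes


def guarded_alloc(nbytes, guard_bytes=GUARD_BYTES):
    """Return a buffer laid out as [front guard | payload | back guard]."""
    guard = bytearray([GUARD]) * guard_bytes
    return guard + bytearray(nbytes) + guard


def check_guards(buf, nbytes, guard_bytes=GUARD_BYTES):
    """Return payload-relative offsets of guard bytes that were overwritten.

    Negative offsets are writes before the payload; offsets >= nbytes
    are writes past its end.
    """
    corrupted = []
    for i, value in enumerate(buf):
        in_payload = guard_bytes <= i < guard_bytes + nbytes
        if not in_payload and value != GUARD:
            corrupted.append(i - guard_bytes)
    return corrupted


buf = guarded_alloc(8)
buf[GUARD_BYTES + 8] = 0          # simulate a write one byte past the payload
print(check_guards(buf, 8))       # reports the out-of-bounds offset
```

Checking the guards only after the fact is what makes this style of tool suited to debugging and validation runs rather than production use.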
KDA Support
Added KDA Prefill support.
- Introduced the KDA Prefill interface.
- Added the FlashKDA wrapper for easier integration and adoption.
Bug Fixes
Fixed the following issues:
- Fixed an inconsistency between DeepGEMM's default get_alignment behavior and API input parameters.
- Fixed incorrect robust descriptor configuration in the FA assembly backend.
- Fixed stride overflow issues in the FA assembly backend.
- Fixed performance regressions in DSA under certain scenarios.