Release v0.2.0
Highlights
This release introduces major architecture and runtime improvements, including Torch decoupling via the TVM-FFI ABI and full JIT compilation support across all MATE operators. It also adds new attention implementations, expands FMHA capabilities, integrates FP8 SageAttention, and updates multiple compatibility wrappers.
What's Changed
Torch Decoupling
MATE is now decoupled from a single Torch version through the TVM-FFI ABI.
- Starting from `v0.2.0`, MATE can support multiple Torch versions at the same time.
- This improves package compatibility and reduces dependency constraints for downstream users.
Full JIT Compilation Support
All MATE operators now support JIT compilation.
- Enables more flexible runtime compilation.
- Improves compatibility across different deployment environments.
FMHA Updates
FMHA support has been enhanced with new functionality and bug fixes.
Added support for:
- AppendKV functionality.
Fixed issues in:
- JIT compilation failure with `HeadDim 192-192`.
- Incorrect SWA kernel selection in some scenarios.
- Scheduler metadata kernel JIT errors when `batch-size > 992`.
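For readers new to the AppendKV feature listed under FMHA Updates: in attention kernels the term usually means writing newly produced key/value rows into a preallocated KV cache, at the current sequence length, before attention runs over the cache. The sketch below is a toy pure-Python illustration of that bookkeeping only. `KVCache` and `append_kv` are hypothetical names chosen for this example and are not MATE's API; the release notes do not describe the actual interface.

```python
class KVCache:
    """Toy single-sequence KV cache with fixed capacity (hypothetical, not MATE's API)."""

    def __init__(self, capacity, head_dim):
        self.capacity = capacity
        self.head_dim = head_dim
        self.seqlen = 0  # number of valid tokens currently stored
        # Preallocate storage so that appends write in place.
        self.k = [[0.0] * head_dim for _ in range(capacity)]
        self.v = [[0.0] * head_dim for _ in range(capacity)]


def append_kv(cache, k_new, v_new):
    """Write new K/V rows at positions [seqlen, seqlen + n) and return the new length."""
    n = len(k_new)
    if len(v_new) != n:
        raise ValueError("k_new and v_new must contain the same number of tokens")
    if cache.seqlen + n > cache.capacity:
        raise ValueError("KV cache overflow")
    for row in list(k_new) + list(v_new):
        if len(row) != cache.head_dim:
            raise ValueError("row length must equal head_dim")
    for i in range(n):
        cache.k[cache.seqlen + i] = list(k_new[i])
        cache.v[cache.seqlen + i] = list(v_new[i])
    cache.seqlen += n
    return cache.seqlen


cache = KVCache(capacity=4, head_dim=2)
append_kv(cache, [[1.0, 2.0]], [[3.0, 4.0]])  # first token
append_kv(cache, [[5.0, 6.0], [7.0, 8.0]], [[9.0, 1.0], [2.0, 3.0]])  # two more tokens
print(cache.seqlen, cache.k[:cache.seqlen])
```

A real kernel would perform the append on device and fuse it with the attention computation; the sketch only shows the indexing.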
FP8 SageAttention Integration
Integrated the assembly-based SageAttention implementation.
Supported quantization modes:
- QK INT8 + PV FP8
- QK FP8 + PV FP8
Supported capabilities:
- Multiple quantization granularities.
- Configurable quantization precision and granularity through the wrapper interface.
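To make "quantization granularity" concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization with one scale per block of rows. It is an illustration under stated assumptions, not the integrated kernel: the release does not list which granularities are supported, so the per-tensor and per-row choices below are examples only, FP8 is not modelled, and `quantize_rows` and `dequantize_rows` are hypothetical helper names.

```python
def quantize_rows(rows, block_rows):
    """Symmetric INT8 quantization with one scale per block of `block_rows` rows."""
    q_rows, scales = [], []
    for start in range(0, len(rows), block_rows):
        block = rows[start:start + block_rows]
        amax = max(abs(x) for row in block for x in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        for row in block:
            # Round to the nearest integer and clamp to the symmetric INT8 range.
            q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
    return q_rows, scales


def dequantize_rows(q_rows, scales, block_rows):
    """Map quantized rows back to floats using the scale of each row's block."""
    return [[q * scales[i // block_rows] for q in row] for i, row in enumerate(q_rows)]


# Two rows with very different magnitudes.
rows = [[0.02, -0.01], [10.0, -5.0]]

# Coarse granularity: a single scale for the whole tensor.
q_tensor, s_tensor = quantize_rows(rows, block_rows=len(rows))
# Finer granularity: one scale per row.
q_block, s_block = quantize_rows(rows, block_rows=1)

deq_tensor = dequantize_rows(q_tensor, s_tensor, block_rows=len(rows))
deq_block = dequantize_rows(q_block, s_block, block_rows=1)

# Reconstruction error on the small-magnitude row under each granularity.
err_tensor = max(abs(a - b) for a, b in zip(rows[0], deq_tensor[0]))
err_block = max(abs(a - b) for a, b in zip(rows[0], deq_block[0]))
print("per-tensor error:", err_tensor, "per-row error:", err_block)
```

With a single scale, the small-magnitude row collapses to zeros because the scale is set by the large row; a finer granularity keeps its values at the cost of storing more scales.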
DeepSeek Sparse Attention
Added DeepSeek Sparse Attention, also known as DSA.
- Added TileLangMUSA-based DSA Prefill implementation.
- Added TileLangMUSA-based DSA Decode implementation.
GDN Support
Added GDN support with a unified and stable API.
- Added TileLangMUSA-based GDN Prefill implementation.
- Added TileLangMUSA-based GDN Decode implementation.
Wrapper Updates
Updated and added multiple wrappers to improve compatibility with upstream projects and common usage patterns.
FlashAttention 3 Wrapper
Refactored the FlashAttention 3 wrapper.
- Strictly compatible with the FA3 package name and import style.
- Added export for the `flash_attn_func` interface.
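Because the wrapper follows the FA3 package name and import style, code that already imports `flash_attn_func` should in principle work unchanged. The sketch below shows a guarded import that degrades gracefully when no compatible package is installed. It assumes the module is named `flash_attn_interface`, which is the name used by upstream FlashAttention 3; the release notes do not spell out the module name, so treat it as an assumption. `load_flash_attn_func` is a hypothetical helper written for this example.

```python
import importlib


def load_flash_attn_func(module_name="flash_attn_interface"):
    """Return flash_attn_func from the named module, or None if it is unavailable.

    The default module name is an assumption based on upstream FlashAttention 3.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "flash_attn_func", None)


flash_attn_func = load_flash_attn_func()
if flash_attn_func is None:
    print("flash_attn_func is not available; fall back to another attention path")
else:
    print("flash_attn_func loaded from an FA3-compatible package")
```

Resolving the function once at startup keeps the calling code identical whether the function comes from upstream FA3 or from a compatible wrapper.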
FlashMLA Wrapper
Added a new FlashMLA wrapper compatible with the official FlashMLA repository.
Supported computation modes:
- Dense
- Sparse
Supported model scenarios:
- DS V1
- DS R1
- DS V3.2
- GLM5
Known limitation:
`MODEL1` is not supported yet.
SageAttention Wrapper
Added a new SageAttention wrapper compatible with a subset of the official SageAttention repository's capabilities.
- Provides the `sageattn` interface.
- Uses QK INT8 + PV FP8 quantization by default.
- Supports specifying other quantization precisions and quantization granularities.