Release v0.2.0 · MooreThreads/mate

Highlights

This release introduces major architecture and runtime improvements, including Torch decoupling via the TVM-FFI ABI and full JIT compilation support across all MATE operators. It also adds new attention implementations, expands FMHA capabilities, integrates FP8 SageAttention, and updates multiple compatibility wrappers.

What's Changed

Torch Decoupling

MATE is now decoupled from a single Torch version through the TVM-FFI ABI.

Starting from v0.2.0, MATE can support multiple Torch versions at the same time.
This improves package compatibility and reduces dependency constraints for downstream users.

Full JIT Compilation Support

All MATE operators now support JIT compilation.

Enables more flexible runtime compilation.
Improves compatibility across different deployment environments.

FMHA Updates

FMHA support has been enhanced with new functionality and bug fixes.

Added support for:

AppendKV functionality.

Fixed issues in:

JIT compilation failure with HeadDim 192-192.
Incorrect SWA kernel selection in some scenarios.
Scheduler metadata kernel JIT errors when the batch size exceeds 992.
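To illustrate what AppendKV generally means in attention kernels, here is a minimal pure-Python sketch. It assumes the usual pattern of writing new key/value rows into a preallocated cache in place and then attending over the filled prefix. The names `append_kv` and `attend` are hypothetical and are not MATE APIs; the real FMHA kernels operate on device tensors.

```python
import math

def append_kv(k_cache, v_cache, cache_len, k_new, v_new):
    """Write new K/V rows into preallocated caches in place; return the new length."""
    if len(k_new) != len(v_new):
        raise ValueError("k_new and v_new must have the same number of rows")
    end = cache_len + len(k_new)
    if end > len(k_cache):
        raise ValueError("KV cache capacity exceeded")
    # Same-length slice assignment keeps the cache size fixed.
    k_cache[cache_len:end] = k_new
    v_cache[cache_len:end] = v_new
    return end

def attend(q, k_cache, v_cache, cache_len):
    """Single-query softmax attention over the first cache_len cached rows."""
    scale = 1.0 / math.sqrt(len(q))
    keys, vals = k_cache[:cache_len], v_cache[:cache_len]
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(vals[0])
    return [sum(w * v[d] for w, v in zip(weights, vals)) / total for d in range(dim)]
```

A decode loop would call `append_kv` once per generated token and pass the returned length to `attend`, so rows beyond the filled prefix are never read.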

FP8 SageAttention Integration

Integrated the assembly-based SageAttention implementation.

Supported quantization modes:

QK INT8 + PV FP8
QK FP8 + PV FP8

Supported capabilities:

Multiple quantization granularities.
Configurable quantization precision and granularity through the wrapper interface.
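As a rough illustration of why quantization granularity matters, the sketch below compares one scale for a whole tensor with one scale per block, using symmetric INT8 quantization in plain Python. This is an assumption-laden toy and not the assembly-based SageAttention implementation: the function names are hypothetical, the FP8 path is not modeled, and the granularities actually offered by the wrapper are not specified here.

```python
def quantize_int8(values):
    """Symmetric INT8 quantization with a single scale for all values."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def quantize_int8_per_block(values, block_size):
    """One scale per contiguous block: finer granularity than per-tensor."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [quantize_int8(block) for block in blocks]

def dequantize(q, scale):
    """Map quantized integers back to approximate real values."""
    return [x * scale for x in q]
```

With a single scale, one large outlier forces small values to round to zero; with per-block scales, a block that contains only small values keeps most of its precision.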

DeepSeek Sparse Attention

Added DeepSeek Sparse Attention (DSA).

Added TileLangMUSA-based DSA Prefill implementation.
Added TileLangMUSA-based DSA Decode implementation.

GDN Support

Added GDN support with a unified and stable API.

Added TileLangMUSA-based GDN Prefill implementation.
Added TileLangMUSA-based GDN Decode implementation.

Wrapper Updates

Updated and added multiple wrappers to improve compatibility with upstream projects and common usage patterns.

FlashAttention 3 Wrapper

Refactored the FlashAttention 3 wrapper.

Strictly compatible with the FA3 package name and import style.
Added export for the flash_attn_func interface.
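Because the wrapper follows the FA3 package name and import style, downstream code can probe for it the same way it would for upstream FA3. The sketch below assumes the upstream FA3 module name `flash_attn_interface`; whether the MATE wrapper installs under exactly that name should be checked against the installed package. The helper names are hypothetical.

```python
import importlib
import importlib.util

# Upstream FA3 module name; assumed here to match the wrapper's package name.
FA3_MODULE = "flash_attn_interface"

def module_available(name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None

def load_flash_attn_func():
    """Return flash_attn_func if an FA3-style module is installed, else None."""
    if not module_available(FA3_MODULE):
        return None
    module = importlib.import_module(FA3_MODULE)
    return getattr(module, "flash_attn_func", None)
```

Calling `load_flash_attn_func()` at startup lets an application fall back to another attention path when no FA3-compatible module is present.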

FlashMLA Wrapper

Added a new FlashMLA wrapper compatible with the official FlashMLA repository.

Supported computation modes:

Dense
Sparse

Supported model scenarios:

DS V1
DS R1
DS V3.2
GLM5

Known limitation:

MODEL1 is not supported yet.

SageAttention Wrapper

Added a new SageAttention wrapper that is compatible with a subset of the official SageAttention repository's capabilities.

Provides the sageattn interface.
Uses QK INT8 + PV FP8 quantization by default.
Supports specifying other quantization precisions and quantization granularities.
