Release v0.2.0

@zhe-pang zhe-pang released this 23 May 05:31
· 4 commits to main since this release

Highlights

This release introduces major architecture and runtime improvements, including Torch decoupling via the TVM-FFI ABI and full JIT compilation support across all MATE operators. It also adds new attention implementations, expands FMHA capabilities, integrates FP8 SageAttention, and updates multiple compatibility wrappers.

What's Changed

Torch Decoupling

MATE is now decoupled from a single Torch version through the TVM-FFI ABI.

  • Starting with v0.2.0, MATE supports multiple Torch versions at the same time.
  • This improves package compatibility and reduces dependency constraints for downstream users.

Full JIT Compilation Support

All MATE operators now support JIT compilation.

  • Enables more flexible runtime compilation.
  • Improves compatibility across different deployment environments.

FMHA Updates

FMHA support has been enhanced with new functionality and bug fixes.

Added support for:

  • AppendKV functionality.

Fixed issues in:

  • JIT compilation failure with HeadDim 192-192.
  • Incorrect SWA kernel selection in some scenarios.
  • Scheduler metadata kernel JIT errors when the batch size exceeds 992.
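
For readers unfamiliar with the term, AppendKV in attention libraries usually means writing the key/value vectors of newly generated tokens into an existing KV cache so that later attention calls can read them. The sketch below is a toy pure-Python illustration of that idea only. The class name ToyKVCache and the method append_kv are hypothetical and are not part of MATE's API; the real kernels operate on GPU tensors.

```python
# Toy illustration of the AppendKV idea. ToyKVCache and append_kv are
# hypothetical names for this sketch, not MATE's interface.
class ToyKVCache:
    def __init__(self, max_len):
        self.max_len = max_len  # capacity of the cache in tokens
        self.keys = []          # one key vector per cached token
        self.values = []        # one value vector per cached token

    def append_kv(self, new_keys, new_values):
        """Append new key/value vectors and return the updated sequence length."""
        if len(new_keys) != len(new_values):
            raise ValueError("keys and values must cover the same tokens")
        if len(self.keys) + len(new_keys) > self.max_len:
            raise ValueError("KV cache capacity exceeded")
        self.keys.extend(new_keys)
        self.values.extend(new_values)
        return len(self.keys)


cache = ToyKVCache(max_len=4)
# Prefill: two prompt tokens are written into the cache at once.
seqlen_after_prefill = cache.append_kv(
    [[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.2, 0.8]]
)
# Decode: one new token is appended per step.
seqlen_after_decode = cache.append_kv([[1.0, 1.0]], [[0.9, 0.1]])
```

In this toy version the append is a separate step; fused kernels typically perform the write and the attention computation in one call.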

FP8 SageAttention Integration

Integrated the assembly-based SageAttention implementation.

Supported quantization modes:

  • QK INT8 + PV FP8
  • QK FP8 + PV FP8

Supported capabilities:

  • Multiple quantization granularities.
  • Configurable quantization precision and granularity through the wrapper interface.
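
To illustrate why quantization granularity matters, the sketch below compares symmetric INT8 quantization with one scale for a whole matrix (per-tensor) against one scale per row (per-token). It is a minimal pure-Python illustration under stated assumptions: all function names are hypothetical, the granularities MATE actually offers are not listed in these notes, and the real implementation is assembly-based rather than Python.

```python
# Illustrative sketch only; none of these helpers belong to MATE.

def _scale_for(values):
    """Symmetric INT8 scale: map the largest magnitude to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def _quantize_row(row, scale):
    """Round to the nearest integer step and clamp to the INT8 range."""
    return [max(-127, min(127, round(v / scale))) for v in row]


def quantize_per_tensor(matrix):
    """One scale shared by every element of the matrix."""
    scale = _scale_for([v for row in matrix for v in row])
    return [_quantize_row(row, scale) for row in matrix], [scale] * len(matrix)


def quantize_per_token(matrix):
    """One scale per row, so small rows keep their resolution."""
    scales = [_scale_for(row) for row in matrix]
    return [_quantize_row(r, s) for r, s in zip(matrix, scales)], scales


def dequantize(q_rows, scales):
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def row_error(original_row, restored_row):
    """Largest absolute reconstruction error within one row."""
    return max(abs(a - b) for a, b in zip(original_row, restored_row))


# Row 0 has tiny values, row 1 has large ones.
matrix = [[0.01, -0.02, 0.03], [10.0, -20.0, 30.0]]

q_tensor, scales_tensor = quantize_per_tensor(matrix)
q_token, scales_token = quantize_per_token(matrix)

err_tensor_row0 = row_error(matrix[0], dequantize(q_tensor, scales_tensor)[0])
err_token_row0 = row_error(matrix[0], dequantize(q_token, scales_token)[0])
```

With a single shared scale, the large values in row 1 dominate and the small values in row 0 all round to zero; a per-row scale preserves them. Finer granularity costs extra scale storage in exchange for lower quantization error.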

DeepSeek Sparse Attention

Added DeepSeek Sparse Attention (DSA).

  • Added TileLangMUSA-based DSA Prefill implementation.
  • Added TileLangMUSA-based DSA Decode implementation.
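
As background, sparse attention of this kind restricts each query to a small set of selected key/value tokens instead of the full context. The sketch below shows the general top-k pattern in pure Python. It is an assumption-laden illustration, not the TileLangMUSA kernels: the function names are hypothetical, and the index scores are passed in as plain numbers rather than computed by a learned indexer.

```python
import math

# Illustrative sketch of top-k sparse attention; not MATE's implementation.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_indices(scores, k):
    """Positions of the k highest scores, returned in ascending position order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def sparse_attention(query, keys, values, index_scores, k):
    """Softmax attention computed over only the k tokens with the best index scores."""
    selected = topk_indices(index_scores, k)
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]
    return selected, output


query = [1.0, 0.0]
keys = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [2.0, 0.0]]
values = [[9.0, 9.0], [1.0, 2.0], [7.0, 7.0], [1.0, 2.0]]
index_scores = [0.1, 0.9, 0.3, 0.8]  # assumed to come from a separate scoring step

selected, output = sparse_attention(query, keys, values, index_scores, k=2)
```

Only tokens 1 and 3 take part in the softmax here, so the cost of the attention step scales with k rather than with the full sequence length.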

GDN Support

Added GDN support with a unified and stable API.

  • Added TileLangMUSA-based GDN Prefill implementation.
  • Added TileLangMUSA-based GDN Decode implementation.

Wrapper Updates

Updated and added multiple wrappers to improve compatibility with upstream projects and common usage patterns.

FlashAttention 3 Wrapper

Refactored the FlashAttention 3 wrapper.

  • Strictly compatible with the FA3 package name and import style.
  • Exported the flash_attn_func interface.

FlashMLA Wrapper

Added a new FlashMLA wrapper compatible with the official FlashMLA repository.

Supported computation modes:

  • Dense
  • Sparse

Supported model scenarios:

  • DS V1
  • DS R1
  • DS V3.2
  • GLM5

Known limitation:

  • MODEL1 is not supported yet.

SageAttention Wrapper

Added a new SageAttention wrapper compatible with a subset of the official SageAttention repository's capabilities.

  • Provides the sageattn interface.
  • Uses QK INT8 + PV FP8 quantization by default.
  • Supports specifying other quantization precisions and quantization granularities.