New Features
- Distributed Roll API for MTP (#307)
  - Added a P2P-based `roll` API that cyclically shifts dispatched local tensors along the sequence dimension with an O(N/P) memory footprint, primarily targeting Multi-Token Prediction (MTP) workflows where labels are shifted relative to inputs.
  - Exposed as `magi_attention.api.roll`; supports variable chunk sizes and autograd.
- Uneven Shard Support (#307)
  - The dispatch/undispatch pipeline now handles non-uniform chunk sizes where the last chunk can be smaller than the rest, removing the previous virtual-padding workaround.
- FA4 with Attention Sink in Extensions (#282)
  - Added `fa4_func_with_sink` and `fa4_varlen_func_with_sink` to `magi_attn_extensions`, offering Flash-Attention 4 variants with learnable attention sink support using the native `flash_attn.cute` interface.
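To make the roll and uneven-shard entries above concrete, here is a minimal single-process sketch in plain Python. It does not call `magi_attention.api.roll`, and it is not how the library implements the operation: `split_uneven` and `chunked_roll_left` are hypothetical helpers, list slicing stands in for the P2P exchange between ranks, and the shift is assumed to be no larger than the smallest chunk so that each rank only needs data from its next neighbor.

```python
def split_uneven(seq, chunk_size):
    """Split seq into chunks of chunk_size; the last chunk may be shorter."""
    return [seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size)]


def chunked_roll_left(chunks, k):
    """Cyclically shift the concatenated sequence left by k, chunk by chunk.

    Each simulated rank keeps local[k:] and appends the first k elements of
    the next rank's chunk, so it touches only its own chunk plus k boundary
    elements. Chunk sizes are preserved.
    """
    if not 0 <= k <= min(len(c) for c in chunks):
        raise ValueError("this sketch only supports shifts within one neighbor chunk")
    num_ranks = len(chunks)
    rolled = []
    for rank, local in enumerate(chunks):
        received = chunks[(rank + 1) % num_ranks][:k]  # stand-in for a P2P recv
        rolled.append(local[k:] + received)
    return rolled


tokens = list(range(10))
chunks = split_uneven(tokens, 3)        # sizes 3, 3, 3, 1 (uneven last chunk)
labels = chunked_roll_left(chunks, 1)   # next-token labels, still sharded
flat_labels = [x for chunk in labels for x in chunk]
```

In this sketch `flat_labels` equals `tokens[1:] + tokens[:1]`, which is the next-token label layout that MTP-style training needs, and each shard keeps its original length.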
Enhancements
- Ampere (sm80) Support for FFA_FA4 (#287)
  - Extended the cutlass-based `FFA_FA4` backend to support Ampere (sm80) in addition to Hopper (sm90) and Blackwell (sm100).
- Multi-Arch CUDA Builds (#307)
  - The build system now accepts comma-separated compute capabilities via `MAGI_ATTENTION_BUILD_COMPUTE_CAPABILITY` (e.g., `"90,100"`) to produce a single wheel supporting multiple GPU generations simultaneously.
- `global_window_size` in `infer_attn_mask_from_cu_seqlens` (#307)
  - Added a `global_window_size` parameter so that every query in a sample can always attend to the first N key tokens (useful for prefix or global sink tokens in conjunction with `SlidingWindow` masks).
- Performance: Caching for Key Hotpaths (#307)
  - `DistAttnRuntimeKey` hash is now cached to avoid repeated rehashing on every dict lookup.
  - `infer_attn_mask_from_cu_seqlens` is wrapped with `lru_cache` to skip redundant mask inference for repeated `cu_seqlens` patterns.
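The two caching techniques named in the last entry can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `RuntimeKey` is a hypothetical stand-in for `DistAttnRuntimeKey`, and `infer_ranges` is a hypothetical stand-in for `infer_attn_mask_from_cu_seqlens` that only converts cumulative lengths into per-sample ranges.

```python
from functools import lru_cache


class RuntimeKey:
    """Hypothetical composite dict key whose hash is computed once."""

    def __init__(self, q_ranges, k_ranges, mask_types):
        self._fields = (tuple(q_ranges), tuple(k_ranges), tuple(mask_types))
        self._hash = hash(self._fields)  # cached at construction time

    def __hash__(self):
        return self._hash  # no rehashing of the nested tuples per lookup

    def __eq__(self, other):
        return isinstance(other, RuntimeKey) and self._fields == other._fields


@lru_cache(maxsize=128)
def infer_ranges(cu_seqlens):
    """Hypothetical helper: cumulative lengths -> per-sample (start, end) ranges.

    lru_cache requires hashable arguments, so cu_seqlens must be a tuple.
    """
    return tuple(zip(cu_seqlens[:-1], cu_seqlens[1:]))


key_a = RuntimeKey([(0, 4), (4, 10)], [(0, 4), (4, 10)], ["causal", "causal"])
key_b = RuntimeKey([(0, 4), (4, 10)], [(0, 4), (4, 10)], ["causal", "causal"])
plans = {key_a: "cached plan"}
plan = plans[key_b]                  # equal keys hit the same dict entry

ranges = infer_ranges((0, 4, 10))    # first call computes the ranges
ranges_again = infer_ranges((0, 4, 10))  # repeated pattern is served from the cache
```

One design point worth noting: `lru_cache` keys on the arguments themselves, so list-valued `cu_seqlens` would need to be converted to a tuple before the call.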
Experimental/WIP Features
- IndexAttn: Direct-Index Sparse Attention Forward Path (#313)
  - Added an `IndexAttn` path to the FFA kernel that uses direct token-level KV indices (via `cp.async`) instead of range-based iteration, enabling more flexible block-sparse attention patterns.
  - Split the DSA FFA backend into `ffa_sparse_load` (range + `SparseLoad`) and `ffa_index_attn` (direct index) paths; updated DSA backend options to `flex / ffa_sparse_load / ffa_index_attn / sdpa`.
  - Merged the previously duplicated `sparse_mma` into `mma` using `if constexpr (SparseLoad)`, eliminating ~400 lines of redundant code.
  - Also enables simultaneous use of `SwapAB` and `SparseLoad` (previously mutually exclusive).
- DSA Attention Interface in Extensions (#283)
  - Added `dsa_attn_func` to `magi_attn_extensions`, providing a drop-in interface for the DSA (dynamic sparse attention) kernel backed by FFA.
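The difference between range-based iteration and direct token-level indices can be illustrated on the CPU with plain Python. This sketch says nothing about how the FFA kernel is written; `gather_by_ranges`, `gather_by_indices`, and `indices_to_ranges` are hypothetical helpers that only show why a scattered selection is awkward to express as ranges.

```python
def gather_by_ranges(kv, ranges):
    """Collect KV entries from half-open (start, end) ranges."""
    selected = []
    for start, end in ranges:
        selected.extend(kv[start:end])
    return selected


def gather_by_indices(kv, indices):
    """Collect KV entries from explicit token indices."""
    return [kv[i] for i in indices]


def indices_to_ranges(indices):
    """Merge token indices into the fewest half-open (start, end) ranges."""
    ranges = []
    for i in sorted(set(indices)):
        if ranges and ranges[-1][1] == i:
            ranges[-1] = (ranges[-1][0], i + 1)  # extend the current run
        else:
            ranges.append((i, i + 1))            # start a new run
    return ranges


kv = list("abcdefgh")
scattered = [0, 3, 4, 7]
as_ranges = indices_to_ranges(scattered)       # [(0, 1), (3, 5), (7, 8)]
via_indices = gather_by_indices(kv, scattered)
via_ranges = gather_by_ranges(kv, as_ranges)   # same tokens, three ranges
```

Both gathers return the same tokens here, but the range form needs one entry per contiguous run, so a fully scattered selection degenerates to one range per token.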
Bug Fixes
- Fix TVM FFI Registry Collision in Precompiled Kernels (#315)
  - Multiple precompiled `.so` files previously all exported the symbol `cached_kernel_func`, causing TVM's global packed function registry to be overwritten on each load, silently corrupting attention output and triggering gradient explosion. Each `.so` now exports a unique symbol name derived from its compile-key hash.
- Fix `ffa_fa4` Installation Progress Display (#288)
  - Fixed a display issue in the `ffa_fa4` installation progress reporting.
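The registry collision described above is a general failure mode of any name-keyed global table, and it can be reproduced without TVM. The sketch below uses a plain dict as a stand-in for the registry; `register`, `unique_symbol`, `make_kernel`, and the compile-key strings are all hypothetical, and the hashing scheme (truncated SHA-256) is an assumption rather than the one the project uses.

```python
import hashlib


def register(registry, symbol, func):
    """Stand-in for a global name -> function table: last writer wins."""
    registry[symbol] = func


def unique_symbol(compile_key, prefix="cached_kernel_func"):
    """Derive a per-kernel symbol name from a hash of its compile key."""
    digest = hashlib.sha256(compile_key.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}_{digest}"


def make_kernel(tag):
    """Return a dummy kernel that reports which compile key built it."""
    return lambda: f"kernel[{tag}]"


compile_keys = ["head_dim=64,dtype=bf16", "head_dim=128,dtype=bf16"]

# Before the fix: every module exports the same symbol, so each load
# silently replaces the previous entry.
shared = {}
for key in compile_keys:
    register(shared, "cached_kernel_func", make_kernel(key))

# After the fix: each module exports a name derived from its compile key.
fixed = {}
for key in compile_keys:
    register(fixed, unique_symbol(key), make_kernel(key))
```

In the `shared` table only the last-loaded kernel survives, so a caller expecting the `head_dim=64` kernel would run the `head_dim=128` one without any error. The `fixed` table keeps both.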
Documentation
- Fix Chinese Docs Site 404 on Language Switch (#318)
  - The Sphinx CI deploy workflow previously built only English with no language subdirectory, causing the language switcher to produce broken URLs (`…/index.html/zh_CN/`). Fixed by building English at the URL root (`build/html/`) and Chinese under `build/html/zh_CN/`, then deploying the whole tree to `docs/main/`.
  - Rewrote the language-switcher JavaScript to handle the asymmetric URL structure: English lives at the version root (e.g. `docs/main/`), Chinese lives under `docs/main/zh_CN/`.
  - Updated `docs/README.md` and `docs/README_zh.md` to document the corrected build commands and an HTTP-server based local testing workflow that accurately simulates the production URL layout.
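The asymmetric URL layout can be expressed as a small path-mapping function. The real switcher is JavaScript and its code is not shown in these notes, so the following is only a Python sketch of the mapping: `switch_language`, the `"en"` language code, the leading-slash path form, and the `guide/install.html` page are assumptions made for illustration.

```python
def switch_language(path, target, version_root="/docs/main/", zh_dir="zh_CN/"):
    """Map a docs page path to the same page in the target language.

    English pages live directly under version_root; Chinese pages live
    under version_root + zh_dir.
    """
    if not path.startswith(version_root):
        raise ValueError(f"path is outside {version_root!r}: {path!r}")
    page = path[len(version_root):]
    if page.startswith(zh_dir):
        page = page[len(zh_dir):]  # reduce to the language-neutral page path
    if target == "zh_CN":
        return version_root + zh_dir + page
    if target == "en":
        return version_root + page
    raise ValueError(f"unsupported language: {target!r}")


to_zh = switch_language("/docs/main/index.html", "zh_CN")
to_en = switch_language("/docs/main/zh_CN/guide/install.html", "en")
```

Stripping the language directory before re-adding it makes the mapping idempotent, which avoids appending a language segment after the page name as in the broken `…/index.html/zh_CN/` URLs.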