Release MagiAttention V1.1.1 · SandAI-org/MagiAttention

New Features

Distributed Roll API for MTP (#307)
- Added a P2P-based roll API that cyclically shifts dispatched local tensors along the sequence dimension with an O(N/P) memory footprint. It primarily targets Multi-Token Prediction (MTP) workflows, where labels are shifted relative to inputs.
- Exposed as magi_attention.api.roll; supports variable chunk sizes and autograd.
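
To make the roll semantics concrete, here is a minimal pure-Python sketch of a cyclic shift over a sequence that is stored as per-rank chunks of possibly different sizes. It is not the library's implementation and does not call magi_attention.api.roll; the function name roll_sharded, the list-of-lists layout, and the torch.roll-style sign convention (out[i] = in[(i - shift) % N]) are assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def roll_sharded(chunks, shift):
    """Cyclically shift a sequence stored as per-rank chunks (hypothetical helper).

    Each output chunk keeps the size of the input chunk on the same rank.
    A real P2P version would exchange contiguous slices between ranks; this
    sketch only works out which source rank owns each destination element.
    """
    sizes = [len(c) for c in chunks]
    # starts[r] is the global offset of rank r; starts[-1] is the total length N.
    starts = [0] + list(accumulate(sizes))
    total = starts[-1]
    out = []
    for rank, size in enumerate(sizes):
        local = []
        for pos in range(starts[rank], starts[rank] + size):
            src = (pos - shift) % total                # global source position
            src_rank = bisect_right(starts, src) - 1   # rank that owns it
            local.append(chunks[src_rank][src - starts[src_rank]])
        out.append(local)
    return out


# Three ranks with uneven chunks; shift=-1 moves every token one step left,
# which is the shape of a next-token label shift.
inputs = [[0, 1, 2], [3, 4, 5], [6, 7]]
labels = roll_sharded(inputs, shift=-1)
```

In this sketch no rank ever materializes the full sequence: each destination element is read from exactly one source chunk, which is the property the O(N/P) footprint relies on.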
Uneven Shard Support (#307)
- The dispatch/undispatch pipeline now handles non-uniform chunk sizes where the last chunk can be smaller than the rest, removing the previous virtual-padding workaround.
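
The following sketch illustrates what an uneven split and its round trip look like without padding. The helper names and the round-robin chunk-to-rank assignment are hypothetical; the real dispatch step chooses its own assignment, which these notes do not describe.

```python
def split_uneven(seq, chunk_size):
    """Split seq into chunks of chunk_size; the last chunk may be shorter."""
    return [seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size)]


def dispatch(seq, chunk_size, world_size):
    """Assign chunks to ranks round-robin (placeholder policy), keeping chunk ids."""
    per_rank = [[] for _ in range(world_size)]
    for idx, chunk in enumerate(split_uneven(seq, chunk_size)):
        per_rank[idx % world_size].append((idx, chunk))
    return per_rank


def undispatch(per_rank):
    """Restore the original order by sorting chunks on their global chunk id."""
    tagged = sorted((item for rank in per_rank for item in rank), key=lambda t: t[0])
    return [x for _, chunk in tagged for x in chunk]


tokens = list(range(10))                 # 10 tokens, chunk size 4 -> sizes 4, 4, 2
shards = dispatch(tokens, chunk_size=4, world_size=2)
restored = undispatch(shards)
```

With virtual padding, the same input would have been extended to 12 tokens before splitting; here the last chunk simply carries 2 tokens.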
FA4 with Attention Sink in Extensions (#282)
- Added fa4_func_with_sink and fa4_varlen_func_with_sink to magi_attn_extensions, offering Flash-Attention 4 variants with learnable attention sink support using the native flash_attn.cute interface.

Ampere (sm80) Support for FFA_FA4 (#287)
- Extended the cutlass-based FFA_FA4 backend to support Ampere (sm80) in addition to Hopper (sm90) and Blackwell (sm100).
Multi-Arch CUDA Builds (#307)
- The build system now accepts comma-separated compute capabilities via MAGI_ATTENTION_BUILD_COMPUTE_CAPABILITY (e.g., "90,100"), so a single wheel can support multiple GPU generations.
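
As a rough illustration of how a comma-separated capability list could be consumed, the sketch below parses the environment variable and maps each entry to an nvcc -gencode flag. Only the variable name comes from these notes; the helper functions and the exact flag mapping are assumptions, and the real build may use architecture-specific targets (for example an "a" suffix) that this sketch does not model.

```python
import os

ENV_VAR = "MAGI_ATTENTION_BUILD_COMPUTE_CAPABILITY"


def parse_compute_capabilities(raw):
    """Turn "90,100" into ["90", "100"], dropping blanks and duplicates."""
    caps = []
    for item in raw.split(","):
        cap = item.strip()
        if not cap:
            continue
        if not cap.isdigit():
            raise ValueError(f"invalid compute capability: {cap!r}")
        if cap not in caps:
            caps.append(cap)
    return caps


def gencode_flags(caps):
    """Hypothetical mapping from capabilities to nvcc -gencode flags."""
    return [f"-gencode=arch=compute_{c},code=sm_{c}" for c in caps]


def flags_from_env():
    """Read the variable from the environment; empty if it is unset."""
    return gencode_flags(parse_compute_capabilities(os.environ.get(ENV_VAR, "")))
```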
global_window_size in infer_attn_mask_from_cu_seqlens (#307)
- Added a global_window_size parameter so that every query in a sample can always attend to the first N key tokens (useful for prefix or global sink tokens in conjunction with SlidingWindow masks).
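
The effect of a global window on top of a sliding window can be sketched as a dense boolean mask for one sample. This is an illustration only: build_mask is a hypothetical helper, the sliding window is assumed to be causal (query q sees keys q - window + 1 through q), and the prefix rule is taken literally from the description above, so the first N keys are visible to every query. Whether the real function clips the prefix to preserve causality is not stated in these notes.

```python
def build_mask(seq_len, window, global_window_size):
    """Return mask[q][k] = True where query q may attend to key k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            in_window = 0 <= q - k < window      # causal sliding window
            in_prefix = k < global_window_size   # always-visible first N keys
            row.append(in_window or in_prefix)
        mask.append(row)
    return mask


mask = build_mask(seq_len=6, window=2, global_window_size=1)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Each row keeps its two-token window and additionally sees key 0, which is how a sink or prefix token stays reachable after it has slid out of the window.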
Performance: Caching for Key Hotpaths (#307)
- The DistAttnRuntimeKey hash is now cached, so dict lookups no longer rehash the key each time.
- infer_attn_mask_from_cu_seqlens is wrapped with lru_cache to skip redundant mask inference for repeated cu_seqlens patterns.
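
Both caching ideas can be sketched with the standard library. RuntimeKey and infer_ranges below are hypothetical stand-ins, not the library's DistAttnRuntimeKey or infer_attn_mask_from_cu_seqlens; the sketch assumes the key is immutable after construction and that cu_seqlens is passed as a hashable tuple, which functools.lru_cache requires.

```python
from functools import lru_cache


class RuntimeKey:
    """Hypothetical immutable key whose hash is computed once and reused."""

    __slots__ = ("fields", "_hash")

    def __init__(self, *fields):
        self.fields = tuple(fields)
        self._hash = None

    def __eq__(self, other):
        return isinstance(other, RuntimeKey) and self.fields == other.fields

    def __hash__(self):
        if self._hash is None:          # first lookup pays for hashing
            self._hash = hash(self.fields)
        return self._hash               # later lookups reuse the cached value


@lru_cache(maxsize=128)
def infer_ranges(cu_seqlens):
    """Stand-in for mask inference: one (start, end) range per sample."""
    return tuple(
        (cu_seqlens[i], cu_seqlens[i + 1]) for i in range(len(cu_seqlens) - 1)
    )
```

Calling infer_ranges twice with the same tuple runs the body once; passing a list instead raises TypeError because lists are unhashable.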

IndexAttn: Direct-Index Sparse Attention Forward Path (#313)
- Added an IndexAttn path to the FFA kernel that uses direct token-level KV indices (via cp.async) instead of range-based iteration, enabling more flexible block-sparse attention patterns.
- Split the DSA FFA backend into ffa_sparse_load (range + SparseLoad) and ffa_index_attn (direct index) paths; updated DSA backend options to flex / ffa_sparse_load / ffa_index_attn / sdpa.
- Merged the previously duplicated sparse_mma into mma using if constexpr (SparseLoad), eliminating ~400 lines of redundant code.
- SwapAB and SparseLoad can now be used together; they were previously mutually exclusive.
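
The difference between range-based and direct-index KV selection can be shown with a toy single-query attention in pure Python. This is a conceptual sketch, not the kernel: expand_ranges and attend are hypothetical helpers, and nothing here models cp.async, SwapAB, or the FFA tiling.

```python
import math


def expand_ranges(ranges):
    """Turn half-open (start, end) KV ranges into token-level indices."""
    return [i for start, end in ranges for i in range(start, end)]


def attend(q, keys, values, kv_indices):
    """Softmax attention for one query over only the selected KV tokens."""
    scores = [sum(a * b for a, b in zip(q, keys[i])) for i in kv_indices]
    top = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, kv_indices)) / norm
        for d in range(dim)
    ]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0], [5.0]]
q = [1.0, 0.0]

from_ranges = attend(q, keys, values, expand_ranges([(0, 2), (4, 5)]))
from_indices = attend(q, keys, values, [0, 1, 4])
```

Any range list can be expressed as an index list, but a scattered selection such as [0, 3, 4] would need one range per token, which is where a direct-index path is the more natural fit.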
DSA Attention Interface in Extensions (#283)
- Added dsa_attn_func to magi_attn_extensions, providing a drop-in interface for the DSA (dynamic sparse attention) kernel backed by FFA.

Fix TVM FFI Registry Collision in Precompiled Kernels (#315)
- Multiple precompiled .so files previously all exported the same symbol, cached_kernel_func, so each load overwrote the entry in TVM's global packed function registry. This silently corrupted attention output and triggered gradient explosion. Each .so now exports a unique symbol name derived from its compile-key hash.
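
The failure mode and the fix can be mimicked with a plain dictionary standing in for a global function registry. This sketch does not use TVM; the registry, the register helper, and the exact naming scheme (a SHA-256 prefix appended to the old symbol name) are assumptions chosen for illustration.

```python
import hashlib

registry = {}  # stand-in for a process-wide name -> function registry


def register(symbol, func):
    """Register func under symbol; a repeated symbol silently replaces the old entry."""
    registry[symbol] = func


def unique_symbol(compile_key):
    """Hypothetical scheme: derive a per-kernel symbol from the compile key."""
    digest = hashlib.sha256(compile_key.encode("utf-8")).hexdigest()[:16]
    return f"cached_kernel_func_{digest}"


# Before the fix: two kernels share one symbol, so the first is lost.
register("cached_kernel_func", lambda: "kernel for head_dim=64")
register("cached_kernel_func", lambda: "kernel for head_dim=128")
collided = registry["cached_kernel_func"]()

# After the fix: each compile key gets its own symbol.
registry.clear()
register(unique_symbol("head_dim=64"), lambda: "kernel for head_dim=64")
register(unique_symbol("head_dim=128"), lambda: "kernel for head_dim=128")
```

In the collided case, a caller that expected the head_dim=64 kernel receives the head_dim=128 one without any error, which matches the silent corruption described above.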
Fix ffa_fa4 Installation Progress Display (#288)
- Fixed a display issue in the ffa_fa4 installation progress reporting.

Fix Chinese Docs Site 404 on Language Switch (#318)
- The Sphinx CI deploy workflow previously built only the English docs, with no language subdirectory, so the language switcher produced broken URLs (…/index.html/zh_CN/). The workflow now builds English at the URL root (build/html/) and Chinese under build/html/zh_CN/, then deploys the whole tree to docs/main/.
- Rewrote the language-switcher JavaScript to handle the asymmetric URL structure: English lives at the version root (e.g. docs/main/), Chinese lives under docs/main/zh_CN/.
- Updated docs/README.md and docs/README_zh.md to document the corrected build commands and an HTTP-server based local testing workflow that accurately simulates the production URL layout.
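
The asymmetric URL mapping described above can be sketched as a small path-rewriting function. The actual switcher is JavaScript and is not shown in these notes; this Python version, including the function name switch_language, the language codes "en" and "zh_CN", and the leading-slash root, is an assumption for illustration.

```python
def switch_language(path, target, root="/docs/main/", zh_dir="zh_CN/"):
    """Rewrite a docs path so it points at the same page in the target language.

    English pages live directly under root; Chinese pages live under root + zh_dir.
    """
    if not path.startswith(root):
        raise ValueError(f"path is outside the docs root: {path!r}")
    rest = path[len(root):]
    if rest.startswith(zh_dir):          # strip the language directory if present
        rest = rest[len(zh_dir):]
    if target == "zh_CN":
        return root + zh_dir + rest
    if target == "en":
        return root + rest
    raise ValueError(f"unsupported language: {target!r}")


print(switch_language("/docs/main/index.html", "zh_CN"))
print(switch_language("/docs/main/zh_CN/api/index.html", "en"))
```

Inserting the language directory right after the version root, rather than appending it to the current page URL, is what avoids paths of the broken …/index.html/zh_CN/ form.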

Updated nvidia-cutlass-dsl to 4.4.2 and quack-kernels to 0.4.1 (#318)
Updated the flash-attention submodule to incorporate cutlass-dsl 4.4.2 (#315)
- Fixed TCGen05Mma assert via PipelineTmaUmmaOg._make_sync_object.
- Added loc/ip params to pipeline producer_acquire/consumer_release.