MagiAttention V1.1.1

Strivin0311 released this 21 May 04:11
· 7 commits to main since this release
d7ea8af

New Features

  1. Distributed Roll API for MTP (#307)

    • Added a P2P-based roll API that cyclically shifts dispatched local tensors along the sequence dimension with O(N/P) memory footprint, primarily targeting Multi-Token Prediction (MTP) workflows where labels are shifted relative to inputs.
    • Exposed as magi_attention.api.roll; supports variable chunk sizes and autograd.
  2. Uneven Shard Support (#307)

    • The dispatch/undispatch pipeline now handles non-uniform chunk sizes where the last chunk can be smaller than the rest, removing the previous virtual-padding workaround.
  3. FA4 with Attention Sink in Extensions (#282)

    • Added fa4_func_with_sink and fa4_varlen_func_with_sink to magi_attn_extensions, offering Flash-Attention 4 variants with learnable attention sink support using the native flash_attn.cute interface.
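
The roll semantics described above, combined with uneven shards, can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it does not call magi_attention.api.roll, the helper names split_uneven and distributed_roll are hypothetical, and each rank is assumed to own one contiguous chunk of the sequence (the real dispatcher may assign chunks differently). A list lookup stands in for each P2P receive, and the shift convention follows torch.roll.

```python
from typing import List, Tuple


def split_uneven(seq: List[int], chunk_size: int) -> List[List[int]]:
    """Split seq into chunks of chunk_size; the last chunk may be shorter."""
    return [seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size)]


def distributed_roll(shards: List[List[int]], shift: int) -> List[List[int]]:
    """Cyclically shift the logical sequence held across shards.

    Convention (same as torch.roll): out[i] = in[(i - shift) % N].
    Each rank only builds its own output chunk, so its working set
    stays proportional to its local chunk size.
    """
    sizes = [len(s) for s in shards]
    total = sum(sizes)
    # Global start offset of each rank's chunk (assumed contiguous ownership).
    offsets = [sum(sizes[:r]) for r in range(len(sizes))]

    def owner(global_idx: int) -> Tuple[int, int]:
        """Map a global position to (rank, local position)."""
        for r in reversed(range(len(shards))):
            if global_idx >= offsets[r]:
                return r, global_idx - offsets[r]
        raise IndexError(global_idx)

    out: List[List[int]] = []
    for r, size in enumerate(sizes):
        local = []
        for j in range(size):
            src = (offsets[r] + j - shift) % total
            src_rank, src_local = owner(src)
            # In a real system this read would be a P2P receive from src_rank.
            local.append(shards[src_rank][src_local])
        out.append(local)
    return out
```

For example, with ten tokens split into chunks of four (so the last chunk holds two), a shift of -1 gives each position the next token, which is the label layout MTP needs: distributed_roll(split_uneven(list(range(10)), 4), -1) returns [[1, 2, 3, 4], [5, 6, 7, 8], [9, 0]].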

Enhancements

  1. Ampere (sm80) Support for FFA_FA4 (#287)

    • Extended the cutlass-based FFA_FA4 backend to support Ampere (sm80) in addition to Hopper (sm90) and Blackwell (sm100).
  2. Multi-Arch CUDA Builds (#307)

    • Build system now accepts comma-separated compute capabilities via MAGI_ATTENTION_BUILD_COMPUTE_CAPABILITY (e.g., "90,100") to produce a single wheel supporting multiple GPU generations simultaneously.
  3. global_window_size in infer_attn_mask_from_cu_seqlens (#307)

    • Added a global_window_size parameter so that every query in a sample can always attend to the first N key tokens (useful for prefix or global sink tokens in conjunction with SlidingWindow masks).
  4. Performance: Caching for Key Hotpaths (#307)

    • The DistAttnRuntimeKey hash is now cached to avoid rehashing on every dict lookup.
    • infer_attn_mask_from_cu_seqlens is wrapped with lru_cache to skip redundant mask inference for repeated cu_seqlens patterns.
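
The two caching techniques named above are standard Python patterns, sketched here in isolation. The class RuntimeKey and the function infer_ranges are hypothetical stand-ins, not the library's DistAttnRuntimeKey or infer_attn_mask_from_cu_seqlens, and their fields and return values are assumptions chosen only to keep the example small. One general point worth noting: functools.lru_cache requires hashable arguments, so a cu_seqlens value has to be passed as a tuple rather than a list.

```python
from functools import lru_cache
from typing import Tuple

Ranges = Tuple[Tuple[int, int], ...]


class RuntimeKey:
    """Hypothetical immutable key whose hash is computed once at construction."""

    def __init__(self, ranges: Ranges, world_size: int) -> None:
        self.ranges = ranges
        self.world_size = world_size
        # Cache the hash so dict lookups do not rehash the nested tuples.
        self._hash = hash((ranges, world_size))

    def __hash__(self) -> int:
        return self._hash

    def __eq__(self, other: object) -> bool:
        return (
            isinstance(other, RuntimeKey)
            and self.ranges == other.ranges
            and self.world_size == other.world_size
        )


@lru_cache(maxsize=128)
def infer_ranges(cu_seqlens: Tuple[int, ...]) -> Ranges:
    """Toy stand-in for mask inference: one (start, end) range per sample."""
    return tuple(zip(cu_seqlens[:-1], cu_seqlens[1:]))
```

Calling infer_ranges((0, 4, 10)) twice computes the result once and serves the second call from the cache, which infer_ranges.cache_info() will report as one miss and one hit.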

Experimental/WIP Features

  1. IndexAttn: Direct-Index Sparse Attention Forward Path (#313)

    • Added an IndexAttn path to the FFA kernel that uses direct token-level KV indices (via cp.async) instead of range-based iteration, enabling more flexible block-sparse attention patterns.
    • Split the DSA FFA backend into ffa_sparse_load (range + SparseLoad) and ffa_index_attn (direct index) paths; updated DSA backend options to flex / ffa_sparse_load / ffa_index_attn / sdpa.
    • Merged the previously duplicated sparse_mma into mma using if constexpr (SparseLoad), eliminating ~400 lines of redundant code.
    • Also enables simultaneous use of SwapAB and SparseLoad (previously mutually exclusive).
  2. DSA Attention Interface in Extensions (#283)

    • Added dsa_attn_func to magi_attn_extensions, providing a drop-in interface for the DSA (dynamic sparse attention) kernel backed by FFA.
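
To make the distinction between range-based iteration and direct token-level indices concrete, the toy sketch below converts between the two representations. It is not the FFA kernel and says nothing about its memory access pattern; the helper names are hypothetical. It only shows that a scattered selection of KV tokens needs many short ranges, whereas an index list describes it directly.

```python
from typing import Iterable, List, Tuple


def ranges_to_indices(ranges: Iterable[Tuple[int, int]]) -> List[int]:
    """Expand half-open (start, end) KV ranges into explicit token indices."""
    return [i for start, end in ranges for i in range(start, end)]


def indices_to_ranges(indices: Iterable[int]) -> List[Tuple[int, int]]:
    """Compress token indices into the minimal list of half-open ranges."""
    ranges: List[Tuple[int, int]] = []
    for i in sorted(set(indices)):
        if ranges and ranges[-1][1] == i:
            # Index extends the current run.
            ranges[-1] = (ranges[-1][0], i + 1)
        else:
            # Index starts a new run.
            ranges.append((i, i + 1))
    return ranges
```

A contiguous selection such as tokens 3 through 6 collapses to a single range, while a scattered selection like [0, 3, 4, 9] needs three ranges: [(0, 1), (3, 5), (9, 10)].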

Bug Fixes

  1. Fix TVM FFI Registry Collision in Precompiled Kernels (#315)

    • Multiple precompiled .so files previously all exported the symbol cached_kernel_func, causing TVM's global packed function registry to be overwritten on each load, silently corrupting attention output and triggering gradient explosion. Each .so now exports a unique symbol name derived from its compile-key hash.
  2. Fix ffa_fa4 Installation Progress Display (#288)

    • Fixed a display issue in the ffa_fa4 installation progress reporting.
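
The registry collision in the first fix can be reproduced in miniature. In this sketch a plain dict stands in for TVM's global packed function registry, and the naming scheme in unique_symbol (a SHA-256 prefix appended to cached_kernel_func) is an assumption for illustration; the release note states only that the unique name is derived from the compile-key hash.

```python
import hashlib
from typing import Callable, Dict

Registry = Dict[str, Callable[[], str]]


def register(registry: Registry, name: str, func: Callable[[], str]) -> None:
    """Register func under name; a later registration silently replaces it."""
    registry[name] = func


def unique_symbol(compile_key: str) -> str:
    """Hypothetical scheme: derive a per-kernel symbol from the compile key."""
    digest = hashlib.sha256(compile_key.encode("utf-8")).hexdigest()[:16]
    return f"cached_kernel_func_{digest}"


def load_kernels(use_unique_names: bool) -> Registry:
    """Simulate loading two precompiled kernels into one global registry."""
    registry: Registry = {}
    for compile_key in ("head_dim=64", "head_dim=128"):
        name = unique_symbol(compile_key) if use_unique_names else "cached_kernel_func"
        # Bind compile_key per iteration so each kernel reports its own key.
        register(registry, name, lambda key=compile_key: key)
    return registry
```

With a shared symbol name the registry ends up with a single entry that resolves to the last kernel loaded, so callers of the first kernel silently run the wrong one. With unique names both entries survive.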

Documentation

  1. Fix Chinese Docs Site 404 on Language Switch (#318)
    • The Sphinx CI deploy workflow previously built only English with no language subdirectory, causing the language switcher to produce broken URLs (…/index.html/zh_CN/). Fixed by building English at the URL root (build/html/) and Chinese under build/html/zh_CN/, then deploying the whole tree to docs/main/.
    • Rewrote the language-switcher JavaScript to handle the asymmetric URL structure: English lives at the version root (e.g. docs/main/), Chinese lives under docs/main/zh_CN/.
    • Updated docs/README.md and docs/README_zh.md to document the corrected build commands and an HTTP-server based local testing workflow that accurately simulates the production URL layout.
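
The asymmetric URL layout described above (English at the version root, Chinese under zh_CN/) implies a simple path mapping. The real switcher is JavaScript; the sketch below uses Python and a hypothetical function name, switch_language, purely to illustrate the mapping, and it assumes the version root is known to the caller.

```python
ZH_PREFIX = "zh_CN/"


def switch_language(path: str, version_root: str, target: str) -> str:
    """Map a docs path to the same page in the target language.

    English pages live directly under version_root; Chinese pages live
    under version_root + "zh_CN/". target must be "en" or "zh_CN".
    """
    if not version_root.endswith("/"):
        version_root += "/"
    if not path.startswith(version_root):
        raise ValueError(f"{path!r} is not under {version_root!r}")

    # Page path relative to the version root, with any language prefix removed.
    rel = path[len(version_root):]
    if rel.startswith(ZH_PREFIX):
        rel = rel[len(ZH_PREFIX):]

    if target == "zh_CN":
        return version_root + ZH_PREFIX + rel
    if target == "en":
        return version_root + rel
    raise ValueError(f"unsupported language: {target!r}")
```

Under this mapping, /docs/main/api/index.html switches to /docs/main/zh_CN/api/index.html and back, and switching to the language already displayed leaves the path unchanged.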

Dependency Updates

  1. Updated nvidia-cutlass-dsl to 4.4.2 and quack-kernels to 0.4.1 (#318)
  2. Updated the flash-attention submodule to incorporate cutlass-dsl 4.4.2 (#315)
    • Fixed TCGen05Mma assert via PipelineTmaUmmaOg._make_sync_object.
    • Added loc/ip params to pipeline producer_acquire/consumer_release.