# FlexAttention Performance & Feature Tracking

## Overview
FlexAttention currently has significant performance bottlenecks and missing features that limit its adoption. This tracking issue provides an overview of the main categories of work needed.
## 🚨 Critical Performance Issues
Primary bottleneck: a custom op prevents CUDA graph capture ("cudagraphing"), causing a ~10x throughput regression. Additional issues include unnecessary recompilations and metadata operations.
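For context, here is a minimal sketch of the path being blocked, assuming `torch.compile`'s `"reduce-overhead"` mode as the cudagraph entry point; `attention_step` is a hypothetical stand-in for the backend's attention call, not the actual code:

```python
# Hypothetical sketch: "reduce-overhead" mode wraps the compiled region in
# CUDA graphs, so kernel launches are captured once and then replayed. A
# custom op the compiler cannot trace forces a graph break, and the region
# falls back to per-call kernel launches -- the source of the ~10x gap.
import torch

def attention_step(q, k, v):
    # Stand-in for the FlexAttention call; in the real backend this is
    # where the custom op sits and blocks cudagraph capture.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

compiled = torch.compile(attention_step, mode="reduce-overhead")

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = compiled(q, k, v)  # early calls capture/warm up; later calls replay
```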
## 🔧 Missing Features
FlexAttention currently supports only basic causal attention. Many common attention patterns are not yet implemented (see the sketch after this list for how two of them are typically expressed):
- ALiBi slopes
- Sliding window attention
- Block sparse attention
- Quantized KV cache
- Encoder/cross-attention support
- Speculative decoding
- And more...
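Several of these patterns are conventionally expressed through PyTorch's `flex_attention` `score_mod`/`mask_mod` hooks. The sketch below shows ALiBi and sliding-window attention as an illustration, assuming that API is the integration surface here; the slope tensor and window size are hypothetical example values:

```python
# Sketch (under the assumptions above) of two missing patterns expressed
# with PyTorch's flex_attention API.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# ALiBi: add a per-head linear bias proportional to query/key distance.
# Example geometric slopes 2^-1 .. 2^-8 for a hypothetical 8-head model.
alibi_slopes = torch.exp2(-torch.arange(1, 9, device="cuda").float())

def alibi_score_mod(score, b, h, q_idx, kv_idx):
    # kv_idx - q_idx is non-positive under a causal mask, so the bias
    # penalizes distant keys.
    return score + alibi_slopes[h] * (kv_idx - q_idx)

# Sliding window: causal attention restricted to the last WINDOW tokens.
WINDOW = 256  # hypothetical window size

def sliding_window_mask(b, h, q_idx, kv_idx):
    return (q_idx >= kv_idx) & (q_idx - kv_idx < WINDOW)

block_mask = create_block_mask(sliding_window_mask, B=None, H=None,
                               Q_LEN=1024, KV_LEN=1024)

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = flex_attention(q, k, v, score_mod=alibi_score_mod, block_mask=block_mask)
```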
## 📋 Detailed Work Items & Contributing
All specific issues, performance optimizations, and feature implementations are tracked in the project board:
👉 [FlexAttention Project Board] 👈
The project board contains:
- Individual issues for each performance bottleneck
- Feature implementation tasks with detailed specifications
- Priority labels and status tracking
- Technical implementation notes
## 📊 Current Status
- Performance: ~10x slower than optimal due to cudagraph blocking
- Features: Basic causal attention only, many common patterns missing
- Priority: Focus on performance fixes first, then high-impact features
For technical details and implementation notes, see the full breakdown in the project board issues.