
Vllm + FlexAttention Work Tracking #19765

Open
@drisspg

Description

FlexAttention Performance & Feature Tracking

Overview

FlexAttention currently has significant performance bottlenecks and missing features that limit its adoption. This tracking issue provides an overview of the main categories of work needed.

🚨 Critical Performance Issues

Primary bottleneck: the custom-op wrapper prevents CUDA graph capture, causing a ~10x throughput regression. Additional issues include unnecessary recompilations and redundant metadata operations.

🔧 Missing Features

FlexAttention currently supports only basic causal attention. Many common attention patterns are not yet implemented:

  • ALiBi slopes
  • Sliding window attention
  • Block sparse attention
  • Quantized KV cache
  • Encoder/cross-attention support
  • Speculative decoding
  • And more...
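Several of the patterns above map naturally onto FlexAttention's `score_mod`/`mask_mod` callables. As a rough sketch (not the vLLM implementation): the signatures below match the FlexAttention API, but the bodies use plain Python scalars so the logic is easy to check; in real use they receive tensor indices and are passed to `torch.nn.attention.flex_attention.flex_attention`. The slope and window values are placeholders.

```python
ALIBI_SLOPES = [0.5, 0.25]   # hypothetical per-head ALiBi slopes
WINDOW_SIZE = 4              # hypothetical sliding-window width

def causal_mask(b, h, q_idx, kv_idx):
    # Standard causal masking: a query attends only to itself and earlier positions.
    return q_idx >= kv_idx

def sliding_window_mask(b, h, q_idx, kv_idx):
    # Causal attention restricted to the most recent WINDOW_SIZE tokens.
    return (q_idx >= kv_idx) and (q_idx - kv_idx < WINDOW_SIZE)

def alibi_score_mod(score, b, h, q_idx, kv_idx):
    # ALiBi: add a linear, head-specific distance penalty to the raw score.
    return score + ALIBI_SLOPES[h] * (kv_idx - q_idx)
```

Mask mods compose with FlexAttention's block-mask machinery, which is why sliding-window and block-sparse variants fall out of the same mechanism.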

📋 Detailed Work Items & Contributing

All specific issues, performance optimizations, and feature implementations are tracked in the project board:

👉 [FlexAttention Project Board] 👈

The project board contains:

  • Individual issues for each performance bottleneck
  • Feature implementation tasks with detailed specifications
  • Priority labels and status tracking
  • Technical implementation notes

📊 Current Status

  • Performance: ~10x slower than optimal because the custom op blocks CUDA graph capture
  • Features: Basic causal attention only, many common patterns missing
  • Priority: Focus on performance fixes first, then high-impact features

For technical details and implementation notes, see the full breakdown in the project board issues.
