You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A prefill acceleration method for LLMs that compresses the key matrix $K$ along the sequence dimension before the $QK$ multiplication, reducing FLOPs proportionally to the compression ratio. Custom Triton kernels realize the speedup in practice, demonstrating wall-clock acceleration on long contexts.
About
DeltaAttention: accelerating LLM prefill by compressing the key matrix before QK multiplication — custom Triton kernels, wall-clock speedup on long contexts