Skip to content

Emruur/DeltaAttention

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

328 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DeltaAttention

Read the draft paper

A prefill acceleration method for LLMs that compresses the key matrix $K$ along the sequence dimension before the $QK$ multiplication, reducing FLOPs proportionally to the compression ratio. Custom Triton kernels realize the speedup in practice, demonstrating wall-clock acceleration on long contexts.

About

DeltaAttention: accelerating LLM prefill by compressing the key matrix before QK multiplication — custom Triton kernels, wall-clock speedup on long contexts

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors