DeepWok/a3

# [ICML'2026] $A^3$: an Analytical Low-Rank Approximation Framework for Attention


## Env setup

```shell
# create the conda environment
conda env create -f environment.yaml
conda activate a3
# install the remaining requirements
pip install -r requirements.txt
```

## LLMs

Functionalities:

- Collect the Rxx statistics
- Approximate QK, VO, and FFN using Rxx
- Evaluate the approximated model's perplexity or downstream-task accuracy (via lm-eval-harness)
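Rxx conventionally denotes the input autocorrelation matrix E[x xᵀ]; the exact statistics the `collect` step gathers may differ, but a minimal NumPy sketch of accumulating such a matrix over calibration batches looks like:

```python
import numpy as np

def accumulate_rxx(batches):
    """Accumulate Rxx = E[x x^T] over calibration batches.

    Each batch is an array of shape (num_tokens, hidden_dim); the running
    sum of x^T x is divided by the total token count at the end.
    """
    rxx, total = None, 0
    for x in batches:
        x = np.asarray(x, dtype=np.float64)
        contrib = x.T @ x                      # (hidden_dim, hidden_dim)
        rxx = contrib if rxx is None else rxx + contrib
        total += x.shape[0]
    return rxx / total
```

The result is symmetric positive semi-definite, which is what the SVD- and CR-based approximations below rely on.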

Command-line interface:

```shell
cd experiments/llm
python run.py collect -h
python run.py approx -h
python run.py eval ppl -h
python run.py eval harness -h
```

For a full example, see this tutorial.

## Approximation modes

- The attention type (`attn_type`) is parsed from the model config and implemented in the `A3ModelHelpers.get_model_arch_meta` function at `src/a3/models/__init__.py`.
- The SVD-based A3-QK solution cannot be applied to multi-head attention with RoPE (`mha-rope`) or grouped-query attention (`gqa-*`); we use CR approximation instead.
- The SVD-based A3-VO solution cannot be applied to grouped-query attention (`gqa-*`); we use joint SVD instead.
- 🟢 denotes the A3 method and its variants; 🟡 denotes baselines for ablation studies and debugging.

### A3-QK

| Attn type | Available approx mode | Class | Description |
|---|---|---|---|
| mha | qk | 🟢 | SVD using the Rxx of both Q and K |
| mha | q-only | 🟡 | SVD using the Rxx of Q only |
| mha | k-only | 🟡 | SVD using the Rxx of K only |
| mha-rope | rxx-w-rxx | 🟢 | CR; each pair of (Qi, Ki) heads has its own index set of columns to drop |
| mha-rope | rxx-w-rxx-uniform | 🟡 | CR; all (Qi, Ki) head pairs in one attention layer share the same index set of columns to drop |
| gqa | Not implemented yet | - | Most GQA models use RoPE; due to time constraints, only gqa-rope is implemented |
| gqa-rope | rxx-w-rxx | 🟢 | CR; each group of (Q1, Q2, ..., K) heads has its own index set of columns to drop |
| gqa-rope | rxx-w-rxx-uniform | 🟡 | CR; all (Q1, Q2, ..., K) head groups in one decoder layer share the same index set of columns to drop |
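The CR (column-selection) modes drop whole columns of the Q/K projections rather than rotating them, which is what keeps them compatible with RoPE. As a hedged illustration (the actual `rxx-w-rxx` scoring in the repo may weight terms differently), a generic activation-aware importance score for the columns of a projection W is diag(Wᵀ Rxx W):

```python
import numpy as np

def cr_keep_indices(w, rxx, num_keep):
    """Pick the columns of W to keep under a CR-style approximation.

    score_j = w_j^T Rxx w_j is the expected energy of output column j
    when the input x has autocorrelation Rxx; the lowest-scoring
    columns are the cheapest to drop.
    """
    scores = np.einsum('ij,ik,kj->j', w, rxx, w)   # diag(W^T Rxx W)
    keep = np.argsort(scores)[::-1][:num_keep]     # top-scoring columns
    return np.sort(keep)
```

In the modes above, one such index set would be shared by every head in a (Qi, Ki) pair or GQA group, so that queries and keys stay aligned column-by-column.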

### A3-VO

| Attn type | Available approx mode | Class | Description |
|---|---|---|---|
| mha, mha-rope | axkv | 🟢 | SVD using the Rxx of A Xkv |
| mha, mha-rope | xkv | 🟡 | SVD using the Rxx of Xkv |
| mha, mha-rope | identity | 🟡 | SVD on fused weights without using activation information |
| gqa | Not implemented yet | - | Most GQA models use RoPE; due to time constraints, only gqa-rope is implemented |
| gqa-rope | xkv | 🟢 | SVD using the Rxx of Xkv |
| gqa-rope | identity | 🟡 | SVD on fused weights without using activation information |
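The SVD modes in this table are variants of activation-aware (whitened) SVD on the fused value-output map. A minimal sketch of the `xkv` flavor, assuming `w_v @ w_o` is the fused map and `rxx_x` is the autocorrelation of its input (this is the generic whitened-SVD recipe, not necessarily A3's exact closed form):

```python
import numpy as np

def lowrank_vo(w_v, w_o, rxx_x, rank):
    """Rank-r factorization of the fused map W_v W_o, weighted by Rxx.

    Whitening by a Cholesky factor L (with L L^T = Rxx) makes the
    truncated SVD optimal in expected output error rather than in
    plain weight error, then the factor is mapped back out.
    """
    fused = w_v @ w_o                       # (d_model, d_model)
    L = np.linalg.cholesky(rxx_x)
    u, s, vt = np.linalg.svd(L.T @ fused, full_matrices=False)
    w_down = np.linalg.solve(L.T, u[:, :rank] * s[:rank])  # (d_model, r)
    w_up = vt[:rank]                                       # (r, d_model)
    return w_down, w_up
```

The `identity` baseline corresponds to skipping the whitening (L = I), i.e. a plain truncated SVD of the fused weights.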

### A3-FFN

For the 3-layer case, rxx may achieve better performance than rxx-w.

| FFN type | Available approx mode | Class | Description |
|---|---|---|---|
| 2-layer, 3-layer | rxx | 🟡 | CR using only the information of Rxx |
| 2-layer, 3-layer | rxx-w | 🟢 | CR using the information of both Rxx and the weights |
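For a 2-layer FFN y = W2 σ(W1 x), CR amounts to dropping hidden neurons. One plausible reading of the two modes (the exact A3 scores are defined in the paper and code, so treat these formulas as hypothetical): `rxx` ranks neuron j by its pre-activation energy w1_jᵀ Rxx w1_j alone, while `rxx-w` also folds in the output-weight energy ‖w2_j‖²:

```python
import numpy as np

def ffn_neuron_scores(w1, w2, rxx, use_weights=True):
    """Score hidden neurons of a 2-layer FFN for CR-style pruning.

    w1: (d_in, d_hidden) first-layer weights; w2: (d_hidden, d_out).
    Without use_weights, only the input statistics Rxx matter ('rxx');
    with it, output weight norms are folded in as well ('rxx-w').
    """
    in_energy = np.einsum('ij,ik,kj->j', w1, rxx, w1)  # diag(W1^T Rxx W1)
    if not use_weights:
        return in_energy
    out_energy = np.sum(w2 ** 2, axis=1)               # ||w2_j||^2 per neuron
    return in_energy * out_energy
```

The lowest-scoring neurons are dropped from both W1's columns and W2's rows, shrinking the hidden dimension without any retraining.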

## About

Official implementation of the ICML 2026 paper: "A^3: an Analytical Low-Rank Approximation Framework for Attention"
