[ICML'2026] $A^3$: an Analytical Low-Rank Approximation Framework for Attention

Env setup

# create conda env
conda env create -f environment.yaml
conda activate a3
# install requirements.txt
pip install -r requirements.txt

LLMs

Functionalities:

Collect rxx.
Approximate QK, VO, and FFN using rxx.
Evaluate the approximated model's perplexity or downstream-task performance (via lm-eval-harness).
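
As a rough illustration of the collection step, the sketch below accumulates an autocorrelation matrix from batches of activations. It assumes that rxx is an estimate of E[x xᵀ] over the tokens fed to a layer. The class name RxxAccumulator and its methods are hypothetical and are not part of this repository's API.

```python
import numpy as np


class RxxAccumulator:
    """Running estimate of Rxx = E[x x^T] over all tokens seen so far (hypothetical helper)."""

    def __init__(self, hidden_size: int):
        self.sum_outer = np.zeros((hidden_size, hidden_size), dtype=np.float64)
        self.num_tokens = 0

    def update(self, x: np.ndarray) -> None:
        # x: (..., hidden_size); flatten batch and sequence dims into one token axis
        tokens = x.reshape(-1, x.shape[-1]).astype(np.float64)
        self.sum_outer += tokens.T @ tokens
        self.num_tokens += tokens.shape[0]

    def compute(self) -> np.ndarray:
        if self.num_tokens == 0:
            raise ValueError("no activations collected")
        return self.sum_outer / self.num_tokens
```

Accumulating the sum of outer products in float64 and dividing once at the end keeps the estimate independent of how the calibration data is split into batches.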

Command line interface:

cd experiments/llm
python run.py collect -h
python run.py approx -h
python run.py eval ppl -h
python run.py eval harness -h

For a full example, see this tutorial.

Approximation modes

The attention type (attn_type) is parsed from the model config; the parsing is implemented in the A3ModelHelpers.get_model_arch_meta function in src/a3/models/__init__.py.
The SVD-based A3-QK solution cannot be applied to multi-head attention with RoPE (mha-rope) or to grouped query attention (gqa-*). For these cases we use a CR approximation instead.
The SVD-based A3-VO solution cannot be applied to grouped query attention (gqa-*). For this case we use a joint SVD instead.
🟢 denotes the A3 method and its variants; 🟡 denotes baselines for ablation studies and debugging.
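
To make the attention-type labels concrete, the sketch below shows one way such a label could be derived from Hugging Face-style config values. It is not the repository's get_model_arch_meta: the function name infer_attn_type, the parameter names num_attention_heads and num_key_value_heads, and the uses_rope flag are all assumptions made for illustration.

```python
def infer_attn_type(num_attention_heads: int, num_key_value_heads: int, uses_rope: bool) -> str:
    """Hypothetical helper: map head counts and a RoPE flag to an attn_type label."""
    if num_key_value_heads <= 0 or num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be a positive multiple of num_key_value_heads")
    # MHA: every query head has its own key/value head; GQA: several query heads share one
    base = "mha" if num_key_value_heads == num_attention_heads else "gqa"
    return f"{base}-rope" if uses_rope else base
```

For instance, a model with 32 query heads, 8 key/value heads, and rotary embeddings would be labeled gqa-rope under this sketch, which selects the CR rows of the A3-QK table and the gqa-rope rows of the A3-VO table.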

A3-QK

Attn Type	Available approx mode	Class	Description
`mha`	`qk`	🟢	SVD using the `Rxx` of both Q and K
`mha`	`q-only`	🟡	SVD using the `Rxx` of Q only
`mha`	`k-only`	🟡	SVD using the `Rxx` of K only
`mha-rope`	`rxx-w-rxx`	🟢	CR, each pair of `(Qi,Ki)` heads has its own index to drop cols
`mha-rope`	`rxx-w-rxx-uniform`	🟡	CR, all `(Qi,Ki)` head pairs in one attn layer share the same index to drop cols
`gqa`	Not implemented yet	-	Most GQA models also use RoPE, so only `gqa-rope` is implemented for now
`gqa-rope`	`rxx-w-rxx`	🟢	CR, each group of `(Q1,Q2,...,K)` heads has its own index to drop cols
`gqa-rope`	`rxx-w-rxx-uniform`	🟡	CR, all `(Q1,Q2,...,K)` heads in one decoder layer share the same index to drop cols
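
For the mha rows above, the idea of an SVD weighted by the `Rxx` of both Q and K can be sketched as follows. This is a generic activation-weighted truncated SVD for a single head, written under the assumption that the quantity being approximated is the fused product of the query and key projections; it is not taken from the repository. The names psd_sqrt_and_inv and approx_qk_head are hypothetical.

```python
import numpy as np


def psd_sqrt_and_inv(r: np.ndarray, eps: float = 1e-8):
    """Symmetric square root of a PSD matrix and its inverse, via eigendecomposition."""
    vals, vecs = np.linalg.eigh(r)
    vals = np.clip(vals, eps, None)  # guard against tiny or negative eigenvalues
    root = (vecs * np.sqrt(vals)) @ vecs.T
    inv_root = (vecs / np.sqrt(vals)) @ vecs.T
    return root, inv_root


def approx_qk_head(w_q, w_k, rxx_q, rxx_k, rank):
    """Rank-`rank` factors (a, b) with a @ b.T ~= w_q @ w_k.T, weighted by input statistics.

    w_q, w_k: (hidden, head_dim). rxx_q, rxx_k: (hidden, hidden).
    Passing an identity matrix for one side ignores that side's statistics.
    """
    sq, sq_inv = psd_sqrt_and_inv(rxx_q)
    sk, sk_inv = psd_sqrt_and_inv(rxx_k)
    # Truncated SVD in the whitened space, then map the factors back
    u, s, vt = np.linalg.svd(sq @ w_q @ w_k.T @ sk, full_matrices=False)
    a = sq_inv @ (u[:, :rank] * s[:rank])  # (hidden, rank)
    b = sk_inv @ vt[:rank].T               # (hidden, rank)
    return a, b
```

The factors a and b would take the place of the per-head query and key projections, with the head dimension reduced to rank.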

A3-VO

Attn Type	Available approx mode	Class	Description
`mha` `mha-rope`	`axkv`	🟢	SVD using the `Rxx` of `A Xkv`
`mha` `mha-rope`	`xkv`	🟡	SVD using the `Rxx` of `Xkv`
`mha` `mha-rope`	`identity`	🟡	SVD on fused weights without using activation information
`gqa`	Not implemented yet	-	Most GQA models also use RoPE, so only `gqa-rope` is implemented for now
`gqa-rope`	`xkv`	🟢	SVD using the `Rxx` of `Xkv`
`gqa-rope`	`identity`	🟡	SVD on fused weights without using activation information
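
The difference between the `identity` and `xkv` rows can be illustrated with a single-head sketch: with no statistics, the fused value and output weights are truncated by plain SVD, and with statistics the approximation error is weighted by the inputs. This is a generic one-sided weighted SVD under the assumption that inputs are row vectors multiplying the fused weights. The name approx_vo_head is hypothetical, and the joint SVD used for gqa-rope is not shown.

```python
import numpy as np


def approx_vo_head(w_v, w_o, rank, rxx=None, eps=1e-8):
    """Rank-`rank` factors (a, b) with a @ b ~= w_v @ w_o for one head.

    w_v: (hidden, head_dim), w_o: (head_dim, hidden).
    rxx=None applies plain SVD to the fused weights (no activation information);
    otherwise the error is weighted by the (hidden, hidden) input statistics rxx.
    """
    fused = w_v @ w_o
    if rxx is None:
        u, s, vt = np.linalg.svd(fused, full_matrices=False)
        return u[:, :rank] * s[:rank], vt[:rank]
    vals, vecs = np.linalg.eigh(rxx)
    vals = np.clip(vals, eps, None)  # guard against tiny or negative eigenvalues
    root = (vecs * np.sqrt(vals)) @ vecs.T
    inv_root = (vecs / np.sqrt(vals)) @ vecs.T
    # Truncate in the whitened space, then undo the whitening on the left factor
    u, s, vt = np.linalg.svd(root @ fused, full_matrices=False)
    return inv_root @ (u[:, :rank] * s[:rank]), vt[:rank]
```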

A3-FFN

For the 3-layer case, rxx may achieve better performance than rxx-w.

FFN Type	Available approx mode	Class	Description
`2-layer` `3-layer`	`rxx`	🟡	CR using only the information of `Rxx`.
`2-layer` `3-layer`	`rxx-w`	🟢	CR using the information of both `Rxx` and the weights
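
To show what a CR approximation of an FFN amounts to, the sketch below keeps a subset of intermediate neurons, that is, a subset of columns of the up projection and the matching rows of the down projection. The scoring rule is an assumption chosen for illustration (mean squared activation, optionally multiplied by the squared norm of the neuron's output row); the repository's actual criteria for `rxx` and `rxx-w` may differ. The names select_ffn_neurons and prune_ffn are hypothetical.

```python
import numpy as np


def select_ffn_neurons(rxx_mid, w_down, keep, use_weights=True):
    """Indices of the `keep` highest-scoring intermediate neurons (assumed scoring rule).

    rxx_mid: (inter, inter) autocorrelation of the intermediate activations.
    w_down: (inter, hidden) down-projection weights.
    """
    score = np.diag(rxx_mid).copy()  # mean squared activation of each neuron
    if not 0 < keep <= score.shape[0]:
        raise ValueError("keep must be between 1 and the number of neurons")
    if use_weights:
        score *= np.sum(w_down ** 2, axis=1)  # times squared norm of its output row
    return np.sort(np.argsort(score)[-keep:])


def prune_ffn(w_up, w_down, idx):
    """Keep only the selected columns of w_up and the matching rows of w_down."""
    return w_up[:, idx], w_down[idx, :]
```

With use_weights=False the selection depends only on the activation statistics, which loosely mirrors the `rxx` row; with use_weights=True it also accounts for how strongly each neuron feeds the output, loosely mirroring `rxx-w`.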
