Published as a conference paper at ICLR 2026
Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song
University of Sydney, KAUST, Together AI, UT Austin
Converting pretrained GQA/MHA models into Multi-Head Latent Attention (MLA) can dramatically reduce KV-cache cost without retraining from scratch. However, naive SVD initialization minimizes weight error rather than activation error and enforces uniform rank across layers — causing activation drift and degraded attention fidelity.
CARE addresses both shortcomings:
- Activation-preserving factorization — SVD on the whitened operator sqrt(C)W, then unwhitening via sqrt(C)^{-1}, so the approximation aligns with actual input activations rather than just weights.
- Adjusted-rank scheduling — a singular-value-guided greedy allocation that distributes a fixed KV budget unevenly across layers, giving more capacity to spectrally complex layers.
- KV-parity mapping — reparameterizes converted K and V into the MLA format while keeping the KV-cache size unchanged.
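The activation-preserving factorization above can be sketched with NumPy. This is an illustrative toy, not the repo's code: the sizes, the synthetic "calibration" activations used to estimate `C`, and the helper names are all made up for the demo. The key point it demonstrates is that truncating the whitened operator `sqrt(C)W` and then unwhitening gives a strictly lower *activation* error than truncating `W` directly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, r = 64, 48, 8  # illustrative dimensions and target rank

# Toy weight matrix; A stands in for calibration activations used to estimate C.
W = rng.standard_normal((d, m))
A = rng.standard_normal((d, 4 * d))
C = A @ A.T / A.shape[1]          # SPD activation-covariance estimate

# Symmetric square root of C (and its inverse) via eigendecomposition.
evals, evecs = np.linalg.eigh(C)
C_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
C_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

def truncate(M, r):
    """Best rank-r approximation of M (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

# Plain SVD: best rank-r fit to W itself (minimizes weight error).
W_plain = truncate(W, r)
# Whitened SVD: best rank-r fit to sqrt(C) W, then unwhitened.
W_white = C_half_inv @ truncate(C_half @ W, r)

def act_err(W_hat):
    """Expected squared activation error E||(W - W_hat)^T x||^2 = tr(E^T C E)."""
    E = W - W_hat
    return np.trace(E.T @ C @ E)

# Whitened truncation is optimal for the activation objective.
assert act_err(W_white) <= act_err(W_plain) + 1e-8
```

Because `sqrt(C)` is invertible, the unwhitened factor is still rank `r`, so the cache/compute footprint matches the plain-SVD baseline while the activation error is provably no larger.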
CARE outperforms uniform-rank SVD baselines on Qwen3-4B/30B-A3B and Llama-3.1-8B/70B, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets.
(a) Naive (Joint) SVD factorizes W_K and W_V directly and truncates to a uniform per-layer rank, optimizing weight error while ignoring layerwise heterogeneity. (b) CARE estimates activation covariance C from calibration data, factorizes sqrt(C)W, and unwhitens via sqrt(C)^{-1} to initialize MLA factors. The singular spectrum of sqrt(C)W drives a global dynamic rank scheduler under KV parity, preserving activation geometry and yielding a stronger one-shot initialization.
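The singular-value-guided scheduler can be sketched as a greedy allocator: under a fixed total rank budget, repeatedly grant one rank unit to the layer whose next unused singular value is largest. This is a hypothetical sketch of the idea only; `allocate_ranks`, the synthetic spectra, and the budget are illustrative names and values, not the repo's actual API.

```python
import heapq
import numpy as np

def allocate_ranks(spectra, budget, min_rank=1):
    """Greedily assign rank units to the layer with the largest next
    (not yet covered) singular value, until the total budget is spent."""
    L = len(spectra)
    ranks = [min_rank] * L
    # Max-heap keyed on the singular value each layer would gain next.
    heap = [(-s[min_rank], i) for i, s in enumerate(spectra) if len(s) > min_rank]
    heapq.heapify(heap)
    for _ in range(budget - min_rank * L):
        if not heap:
            break
        _, i = heapq.heappop(heap)
        ranks[i] += 1
        if ranks[i] < len(spectra[i]):
            heapq.heappush(heap, (-spectra[i][ranks[i]], i))
    return ranks

rng = np.random.default_rng(1)
# Hypothetical descending singular spectra for 4 layers with different decay.
spectra = [np.sort(rng.exponential(scale, 32))[::-1]
           for scale in (4.0, 2.0, 1.0, 0.5)]
ranks = allocate_ranks(spectra, budget=48)
assert sum(ranks) == 48  # KV budget is exactly preserved
```

Spectrally rich layers (slowly decaying singular values) absorb more of the budget, while layers whose spectrum collapses quickly are truncated harder, all while the summed rank, and hence the KV-cache size, stays fixed.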
| Model | Rank | SVD (Palu) | ASVD | MHA2MLA | CARE-U | CARE-E |
|---|---|---|---|---|---|---|
| Llama-3.1-8B | 128 | 7.5e4 | 2525 | 2.8e5 | 89.9 | 72.0 |
| Llama-3.1-8B | 256 | 2561 | 115 | 5236 | 68.5 | 64.8 |
| Qwen3-4B | 128 | 5.7e4 | 6684 | 1.2e5 | 102 | 53.3 |
| Qwen3-4B | 256 | 1093 | 267 | 4965 | 47.7 | 37.2 |
With a brief post-SVD "healing" fine-tune, CARE fully recovers the original model's accuracy while maintaining the MLA KV-cache reduction.
```bash
conda create -n care python=3.12 -y
conda activate care
pip install torch transformers datasets accelerate tqdm lm-eval gpustat
```

Set your HuggingFace token for gated models:

```bash
export HF_TOKEN=hf_xxx
```

All shell launchers source `scripts/lib/common.sh`, which adds `src` to `PYTHONPATH` and optionally activates `CONDA_ENV`.
Evaluate decomposition quality without building full MLA modules:

```bash
PYTHONPATH=src python -m zeroshot.convert \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --method care \
    --rank 256 \
    --cal-dataset alpaca
```

Add `--dynamic-rank` for adjusted-rank scheduling.
Run all methods/ranks in parallel across GPUs:

```bash
bash scripts/zeroshot/run_parallel_llama3.1_8B.sh
bash scripts/zeroshot/run_parallel_qwen3_4B.sh
```

For large models (30B+/70B) that require multi-GPU:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/zeroshot/run_parallel_llama3.1_70B_rank_only.sh
```

To convert a model and save the MLA checkpoint:

```bash
PYTHONPATH=src python -m cli.convert \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --save-path outputs/llama-3.1-8B-mla \
    --kv-lora-rank 256 \
    --kv-decomp-method transmla-care
```

Available `--kv-decomp-method` options:
- `transmla` — PCA-based projection (TransMLA baseline)
- `transmla-care` — CARE with covariance-aware eigenbasis projection
- `care` — sqrt-covariance weighted SVD decomposition
- `no-sqrt-care` — covariance-weighted SVD (no sqrt)
Additional options:
- `--cal-mode full|layerwise|auto` — calibration strategy (default: `auto`)
- `--cal-dataset wikitext2|alpaca|c4|ptb` — calibration data (default: `wikitext2`)
- `--dynamic-rank` — enable adjusted-rank scheduling across layers
If you find this work useful, please cite:
```bibtex
@inproceedings{zhou2026care,
  title     = {{CARE}: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention},
  author    = {Zhou, Zhongzhu and Bie, Fengxiang and Chen, Ziyan and Zhang, Zhenyu and Yang, Yibo and Wang, Junxiong and Athiwaratkun, Ben and Wu, Xiaoxia and Song, Shuaiwen Leon},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
```

This project is licensed under the Apache License 2.0. See LICENSE for details.
This codebase builds upon TransMLA (MIT License) by Meng et al. We thank the TransMLA authors for open-sourcing their MLA conversion framework, which served as the foundation for our implementation.
