This is the up-to-date official implementation of KVCOMM, described in the paper KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems. KVCOMM is a novel adaptive serving paradigm for LLM-based multi-agent systems that enables online cross-context KV-cache reuse for text shared among dependent LLM agents. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over
KVCOMM offers the following insights:
(Figure: KV proximity (left) and offset proximity (right).)
- KV Proximity: Tokens that are close in embedding space have close KV vectors across layers (left).
- Offset Proximity: Under the same prefix-context change, the offsets of similar tokens remain close (right).
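The KV-proximity insight can be illustrated with a toy example. The sketch below assumes fixed linear key/value projections, which is a simplification: in a real transformer, cached KV vectors also depend on context, which is exactly why KVCOMM must estimate offsets rather than reuse caches verbatim.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 16

# Toy key/value projections standing in for one attention layer's weights.
W_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

def kv(e):
    # Concatenated key/value vector for a token embedding e.
    return np.concatenate([e @ W_k, e @ W_v])

# One token embedding, a slightly perturbed (similar) one, and an unrelated one.
x = rng.standard_normal(d_model)
x_similar = x + 0.01 * rng.standard_normal(d_model)
x_far = rng.standard_normal(d_model)

# KV proximity: embeddings that are close map to close KV vectors.
near_gap = np.linalg.norm(kv(x) - kv(x_similar))
far_gap = np.linalg.norm(kv(x) - kv(x_far))
```

Here `near_gap` is far smaller than `far_gap`, mirroring the observation that embedding-space proximity carries over to KV space.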
Building on these insights, KVCOMM proposes an anchor-based KV-cache communication module, which contains:
- Anchor Matching: locates the nearest anchors for the requested segment (See kvcomm_engine.py:KVCOMMEngine._select_anchor_indices).
- Offset Approximation: predicts the offset by weighting the anchors' stored deviations (See kvcomm_engine.py:KVCOMMEngine.offset_kv_cache_pair).
- Anchor Prediction: determines whether newly generated KV-caches, e.g., responses, user inputs, etc., could be shared or should be treated as new anchors (See kvcomm_engine.py:KVCOMMEngine.predict_as_anchor).
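The matching and offset-approximation steps above can be sketched as follows. This is a minimal, hypothetical illustration under the assumption that each anchor stores a token embedding together with its observed KV-cache deviation; the class and method names are illustrative only, not the actual `KVCOMMEngine` API (see `kvcomm_engine.py` for the real implementation).

```python
import numpy as np

class AnchorPool:
    """Illustrative anchor pool: (embedding, observed KV offset) pairs."""

    def __init__(self, max_anchors=20):
        self.max_anchors = max_anchors
        self.embeddings = []  # anchor token embeddings
        self.offsets = []     # KV-cache deviations observed under a new prefix

    def add(self, emb, offset):
        # Called once anchor prediction decides this cache is worth storing.
        if len(self.embeddings) < self.max_anchors:
            self.embeddings.append(emb)
            self.offsets.append(offset)

    def approximate_offset(self, emb, k=3):
        # Anchor matching: find the nearest anchors in embedding space.
        E = np.stack(self.embeddings)
        dists = np.linalg.norm(E - emb, axis=1)
        idx = np.argsort(dists)[:k]
        # Offset approximation: distance-weighted average of stored deviations,
        # so closer anchors contribute more to the predicted offset.
        w = 1.0 / (dists[idx] + 1e-8)
        w /= w.sum()
        return sum(wi * self.offsets[i] for wi, i in zip(w, idx))

pool = AnchorPool()
pool.add(np.array([1.0, 0.0]), np.array([1.0]))
pool.add(np.array([0.0, 1.0]), np.array([5.0]))
# A query close to the first anchor inherits mostly that anchor's offset.
est = pool.approximate_offset(np.array([0.99, 0.01]), k=2)
```

The inverse-distance weighting is one simple realization of offset proximity: since similar tokens have similar offsets, blending the deviations of the nearest anchors gives a cheap estimate of the offset for an unseen prefix.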
```shell
$ git clone https://github.com/HankYe/KVCOMM
$ cd KVCOMM
$ conda create -n kvcomm python==3.10
$ conda activate kvcomm
$ pip install -r requirements.txt
```

Several examples are provided to run the main experiments in the paper, covering three benchmarks, i.e., MMLU, GSM8K, and HumanEval. The instructions are as follows, where
- `--mode` is the connection mode between agents,
- `--agent_names` assigns an agent class to each agent, as defined in `KVCOMM/agents`,
- `--agent_nums` defines the number of agents of each unique agent class, and should have the same length as `--agent_names`,
- `--llm_name` gives the specific model that serves all agents,
- `--execution_mode` determines whether the experiment runs in `default` (dense prefill) or `allow_kv_reuse` (KVCOMM) mode,
- `--kv-threshold` is the entropy threshold $\gamma$ used to estimate KV-cache sharability (default = 0.3),
- `--kv-max-anchor-num` is the maximum number of stored anchors $\mathcal{V}$ for each placeholder id (default = 20),
- `--kv-window-size` is the number of anchors regarded as dropping candidates when updating the anchor pool (default = 5).
The results are saved in `--output_dir`. In the experiments reported in the paper, we tuned the agent number and assigned the same agent class to all agents in each benchmark.
MMLU
To run the baseline on MMLU under the 3-agent setting with Llama-3.1-8B-Instruct, the script is:
```shell
python experiments/run_mmlu.py --mode FullConnected --agent_names AnalyzeAgent --agent_nums 3 --llm_name meta-llama/Llama-3.1-8B-Instruct --output_dir ./runs/mmlu_default --execution_mode default
```

To run the KVCOMM method on MMLU under the 3-agent setting with Llama-3.1-8B-Instruct, the script is:

```shell
python experiments/run_mmlu.py --mode FullConnected --agent_names AnalyzeAgent --agent_nums 3 --llm_name meta-llama/Llama-3.1-8B-Instruct --output_dir ./runs/mmlu_kvcomm --execution_mode allow_kv_reuse --kv-threshold 0.3 --kv-max-anchor-num 20 --kv-window-size 5
```

GSM8K
To run the baseline on GSM8K under the 3-agent setting with Llama-3.1-8B-Instruct, the script is:
```shell
python experiments/run_gsm8k.py --mode FullConnected --agent_names MathSolver --agent_nums 3 --llm_name meta-llama/Llama-3.1-8B-Instruct --output_dir ./runs/gsm8k_default --execution_mode default
```

To run the KVCOMM method on GSM8K under the 3-agent setting with Llama-3.1-8B-Instruct, the script is:

```shell
python experiments/run_gsm8k.py --mode FullConnected --agent_names MathSolver --agent_nums 3 --llm_name meta-llama/Llama-3.1-8B-Instruct --output_dir ./runs/gsm8k_kvcomm --execution_mode allow_kv_reuse --kv-threshold 0.3 --kv-max-anchor-num 20 --kv-window-size 5
```

HumanEval
To run the baseline on HumanEval under the 3-agent setting with Qwen2.5-Coder-7B-Instruct, the script is:
```shell
python experiments/run_humaneval.py --mode FullConnected --agent_names CodeWriting --agent_nums 3 --llm_name Qwen/Qwen2.5-Coder-7B-Instruct --output_dir ./runs/humaneval_default --execution_mode default
```

To run the KVCOMM method on HumanEval under the 3-agent setting with Qwen2.5-Coder-7B-Instruct, the script is:

```shell
python experiments/run_humaneval.py --mode FullConnected --agent_names CodeWriting --agent_nums 3 --llm_name Qwen/Qwen2.5-Coder-7B-Instruct --output_dir ./runs/humaneval_kvcomm --execution_mode allow_kv_reuse --kv-threshold 0.3 --kv-max-anchor-num 20 --kv-window-size 5
```

Citation

```bibtex
@inproceedings{kvcomm25ye,
    author    = {Hancheng Ye and Zhengqi Gao and Mingyuan Ma and Qinsi Wang and Yuzhe Fu and Ming-Yu Chung and Yueqian Lin and Zhijian Liu and Jianyi Zhang and Danyang Zhuo and Yiran Chen},
    title     = {KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems},
    year      = {2025},
    booktitle = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems}
}
```

We greatly acknowledge the authors of AgentPrune and GPTSwarm for their open-source code. Visit the following links to access more of their contributions.


