KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

Demo of TTFT speedup under a two-agent setting (kvcomm_ttft.mp4)



Introduction

This is the up-to-date official implementation of KVCOMM, introduced in the paper KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems. KVCOMM is an adaptive serving paradigm for LLM-based multi-agent systems that enables online cross-context KV-cache reuse for text shared among dependent LLM agents. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, without quality degradation. In particular, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to $7.8\times$ speedup over the standard prefill pipeline, reducing TTFT from $\sim430$ ms to $\sim55$ ms on a single H100 GPU.

KVCOMM offers the following insights:

Figure: KV proximity (left) and offset proximity (right).
  1. KV Proximity: Tokens that are closer in embedding space have closer KV vectors across layers (left).
  2. Offset Proximity: Under the same prefix-context change, the offsets of similar tokens also stay close (right).



Building on these insights, KVCOMM proposes an anchor-based KV-cache communication module that estimates per-token KV offsets from stored anchors and applies them to reuse caches across contexts. A toy sketch of the idea is shown below.
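The following is a minimal, hypothetical Python sketch of the anchor idea; the class and method names are illustrative, not the repository's API. Each anchor pairs a token embedding with the KV offset observed under a prefix change, and a new shared token borrows the offset of its nearest anchor, per the offset-proximity insight.

import numpy as np

# Toy sketch (not the actual KVCOMM implementation): an anchor pairs a token
# embedding with the KV deviation ("offset") observed under a prefix change.
class AnchorPool:
    def __init__(self, max_anchors=20):
        self.max_anchors = max_anchors
        self.embeddings = []  # anchor token embeddings
        self.offsets = []     # observed KV offsets

    def add(self, emb, offset):
        if len(self.embeddings) < self.max_anchors:
            self.embeddings.append(emb)
            self.offsets.append(offset)

    def estimate_offset(self, emb):
        # Offset proximity: similar tokens see similar KV offsets under the
        # same prefix change, so borrow the nearest anchor's offset.
        sims = [emb @ e / (np.linalg.norm(emb) * np.linalg.norm(e))
                for e in self.embeddings]
        return self.offsets[int(np.argmax(sims))]

# Usage: adjust a sender agent's cached KV for reuse under a new prefix.
pool = AnchorPool()
pool.add(np.array([1.0, 0.0]), offset=np.array([0.2, -0.1]))
token_emb = np.array([0.9, 0.1])
adjusted_kv = np.array([0.5, 0.5]) + pool.estimate_offset(token_emb)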

Installation

$ git clone https://github.com/HankYe/KVCOMM
$ cd KVCOMM
$ conda create -n kvcomm python==3.10
$ conda activate kvcomm
$ pip install -r requirements.txt

Experiments

Several examples are provided to run the main experiments in the paper, covering three benchmarks, i.e., MMLU, GSM8K, and HumanEval. The instructions are as follows, where

  • --mode is the connection mode between agents,
  • --agent_names assigns an agent class to each agent, as defined in KVCOMM/agents,
  • --agent_nums defines the number of agents for each unique agent class and must have the same length as --agent_names,
  • --llm_name gives the specific model that serves all agents,
  • --execution_mode determines whether the experiment runs in default (dense prefill) or allow_kv_reuse (KVCOMM) mode,
  • --kv-threshold is the entropy threshold $\gamma$ used to estimate KV-cache sharability (default = 0.3),
  • --kv-max-anchor-num is the maximum number of stored anchors $\mathcal{V}$ for each placeholder id (default = 20),
  • --kv-window-size is the number of anchors regarded as dropping candidates when updating the anchor pool (default = 5).

The results are saved in --output_dir. In the experiments reported in the paper, we varied the number of agents and assigned the same agent class to all agents within each benchmark. A sketch of the corresponding argument parser is shown below.
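For reference, here is a minimal, hypothetical argparse sketch mirroring the flags documented above; the actual parsers live in the experiments/ scripts and may differ in detail.

import argparse

# Hypothetical sketch of the documented experiment flags (not the repo's
# exact parser); defaults follow the flag descriptions above.
parser = argparse.ArgumentParser(description="KVCOMM experiment flags")
parser.add_argument("--mode", default="FullConnected",
                    help="connection mode between agents")
parser.add_argument("--agent_names", nargs="+", default=["AnalyzeAgent"],
                    help="agent class per agent, as defined in KVCOMM/agents")
parser.add_argument("--agent_nums", nargs="+", type=int, default=[3],
                    help="number of agents per class; same length as --agent_names")
parser.add_argument("--llm_name", default="meta-llama/Llama-3.1-8B-Instruct",
                    help="model that serves all agents")
parser.add_argument("--execution_mode", choices=["default", "allow_kv_reuse"],
                    default="default",
                    help="dense prefill (default) or KVCOMM (allow_kv_reuse)")
parser.add_argument("--kv-threshold", type=float, default=0.3,
                    help="entropy threshold gamma for KV-cache sharability")
parser.add_argument("--kv-max-anchor-num", type=int, default=20,
                    help="maximum stored anchors per placeholder id")
parser.add_argument("--kv-window-size", type=int, default=5,
                    help="anchors treated as dropping candidates on pool update")
parser.add_argument("--output_dir", default="./runs",
                    help="directory where results are saved")
args = parser.parse_args()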

MMLU

To run the baseline on MMLU under the 3-agent setting with Llama-3.1-8B-Instruct, the script is:

python experiments/run_mmlu.py --mode FullConnected --agent_names AnalyzeAgent --agent_nums 3 --llm_name meta-llama/Llama-3.1-8B-Instruct --output_dir ./runs/mmlu_default --execution_mode default

To run the KVCOMM method on MMLU under the 3-agent setting with Llama-3.1-8B-Instruct, the script is:

python experiments/run_mmlu.py --mode FullConnected --agent_names AnalyzeAgent --agent_nums 3 --llm_name meta-llama/Llama-3.1-8B-Instruct --output_dir ./runs/mmlu_kvcomm --execution_mode allow_kv_reuse --kv-threshold 0.3 --kv-max-anchor-num 20 --kv-window-size 5

GSM8K

To run the baseline on GSM8K under the 3-agent setting with Llama-3.1-8B-Instruct, the script is:

python experiments/run_gsm8k.py --mode FullConnected --agent_names MathSolver --agent_nums 3 --llm_name meta-llama/Llama-3.1-8B-Instruct --output_dir ./runs/gsm8k_default --execution_mode default

To run the KVCOMM method on GSM8K under the 3-agent setting with Llama-3.1-8B-Instruct, the script is:

python experiments/run_gsm8k.py --mode FullConnected --agent_names MathSolver --agent_nums 3 --llm_name meta-llama/Llama-3.1-8B-Instruct --output_dir ./runs/gsm8k_kvcomm --execution_mode allow_kv_reuse --kv-threshold 0.3 --kv-max-anchor-num 20 --kv-window-size 5

HumanEval

To run the baseline on HumanEval under the 3-agent setting with Qwen2.5-Coder-7B-Instruct, the script is:

python experiments/run_humaneval.py --mode FullConnected --agent_names CodeWriting --agent_nums 3 --llm_name Qwen/Qwen2.5-Coder-7B-Instruct --output_dir ./runs/humaneval_default --execution_mode default

To run the KVCOMM method on HumanEval under the 3-agent setting with Qwen2.5-Coder-7B-Instruct, the script is:

python experiments/run_humaneval.py --mode FullConnected --agent_names CodeWriting --agent_nums 3 --llm_name Qwen/Qwen2.5-Coder-7B-Instruct --output_dir ./runs/humaneval_kvcomm --execution_mode allow_kv_reuse --kv-threshold 0.3 --kv-max-anchor-num 20 --kv-window-size 5

Citation

@inproceedings{kvcomm25ye,
  author    = {Hancheng Ye and Zhengqi Gao and Mingyuan Ma and Qinsi Wang and Yuzhe Fu and Ming-Yu Chung and Yueqian Lin and Zhijian Liu and Jianyi Zhang and Danyang Zhuo and Yiran Chen},
  title     = {KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems},
  booktitle = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems},
  year      = {2025}
}

Acknowledgements

We gratefully acknowledge the authors of AgentPrune and GPTSwarm for open-sourcing their code; please visit their repositories for more of their contributions.
