---
layout: post
title: "CacheGen: Store Your KV Cache on Disk or S3—Load Blazingly Fast!"
thumbnail-img: /assets/img/cachegen.png
share-img: /assets/img/cachegen.png
author: Kuntai Du, Kobe
image: /assets/img/cachegen.png
---

**TL;DR:** 🚀 [CacheGen](https://arxiv.org/abs/2310.07240) lets you store KV caches on disk or AWS S3 and load them *way* faster than recomputing them from text! It compresses the KV cache to be up to **3× smaller than quantization alone**, so you can load it blazingly fast while keeping response quality high. Stop wasting compute: use CacheGen to fully utilize your storage and get an instant first-token speedup!

<div align="center">
<img src="/assets/img/cachegen.png" alt="comparison" style="width: 97%; vertical-align:middle;">
<p><em>CacheGen reduces KV cache loading time from disk.</em></p>
</div>

---

## Why CacheGen?

Modern LLMs rely on long contexts, but reprocessing them on every request is slow and resource-intensive.
Engines like vLLM (and LMCache) can cache contexts in GPU and CPU memory, but that is not enough for many chat or agent workloads: **there are far more hot contexts than GPU and CPU memory can hold**, so we need disk and even S3 to store all the KV caches.

However, storing and loading KV caches from disk or S3 is usually even slower than recomputing them from text!
**CacheGen fixes this**: you can persist KV caches to any storage (S3, disk, etc.) and reload them *much* faster than a fresh prefill. Perfect for keeping valuable context for all your users and agents—without the cold-start penalty.

---

## Key Results 📊

| System | Mean TTFT (ms) | Mean TPOT (ms) |
|-----------------------|:--------------:|:--------------:|
| **LMCache + CacheGen**| **737** | **47.7** |
| Naive vLLM | 4,355 | 247.6 |
| Fireworks | 2,353 | 664.7 |
| DeepInfra | 2,949 | 79.0 |
| Baseten | 113,239 | 174.9 |

**Takeaway:** CacheGen cuts Time-To-First-Token (TTFT) by **at least 3×** compared to the other baselines, and reduces per-token latency (TPOT) as well.

---

## How Does It Work?

- **Compress:** CacheGen encodes the KV cache with custom quantization and residual coding, producing files up to 3× smaller than plain quantized tensors (see the toy sketch below).
- **Decompress:** Fast CUDA kernels restore the cache in milliseconds, directly into GPU memory.
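
To make the intuition concrete, here is a toy NumPy sketch, **not** CacheGen's actual codec: KV values at adjacent token positions tend to be similar, so encoding per-token deltas and quantizing those deltas shrinks the payload much more than quantizing the raw tensor. The real pipeline described in the paper adds layer-aware quantization and entropy (arithmetic) coding on top of this idea.

```python
# Toy delta-quantization sketch (illustrative only; not CacheGen's real codec).
import numpy as np

def compress_kv(kv: np.ndarray, num_bits: int = 4):
    """kv: [num_tokens, hidden]. Keeps the first token in full precision and
    quantizes the token-to-token deltas, which have a smaller dynamic range."""
    first = kv[:1].copy()
    deltas = np.diff(kv, axis=0)
    scale = np.abs(deltas).max() / (2 ** (num_bits - 1) - 1) + 1e-8
    q = np.round(deltas / scale).astype(np.int8)   # would be bit-packed / entropy-coded in practice
    return first, q, scale

def decompress_kv(first: np.ndarray, q: np.ndarray, scale: float) -> np.ndarray:
    deltas = q.astype(np.float32) * scale
    return np.concatenate([first, first + np.cumsum(deltas, axis=0)], axis=0)

kv = np.random.randn(256, 128).astype(np.float32)  # stand-in for one layer's K or V
first, q, scale = compress_kv(kv)
restored = decompress_kv(first, q, scale)
print(q.nbytes / kv.nbytes)            # ~0.25 of the fp32 size, even before bit-packing
print(np.abs(restored - kv).max())     # reconstruction error introduced by the lossy quantization
```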

---

## Quick Start 🛠️

```bash
# Install vLLM and LMCache
uv pip install vllm
uv pip install lmcache

# Start the LMCache cache server
lmcache_server localhost 65434

# Start the vLLM + LMCache server (with CacheGen as the serializer)
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=2 \
  vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.8 \
  --port 8020 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```


Here is `example.yaml`, which points LMCache at the cache server above and selects the CacheGen serializer:
```yaml
chunk_size: 2048
local_cpu: False
remote_url: "lm://localhost:65434"
remote_serde: "cachegen"
```
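
Once both servers are running, any OpenAI-compatible client can talk to the vLLM endpoint. Below is a minimal, illustrative example (the file `my_long_document.txt` is a placeholder): the first request pays the full prefill cost, while later requests that reuse the same long prefix load its CacheGen-compressed KV cache from the cache server instead of recomputing it.

```python
# Minimal client sketch against the vLLM server started above (port 8020).
# The long shared prefix is what benefits from CacheGen: its KV cache is stored
# once and reloaded, instead of re-prefilled, on subsequent requests.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8020/v1", api_key="dummy")

long_context = open("my_long_document.txt").read()  # placeholder long context

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": long_context},
        {"role": "user", "content": "Summarize the document in three bullet points."},
    ],
)
print(resp.choices[0].message.content)
```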

## Citation

If you use CacheGen in your research, please cite our paper:

```bibtex
@misc{liu2024cachegenkvcachecompression,
      title={CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving},
      author={Yuhan Liu and Hanchen Li and Yihua Cheng and Siddhant Ray and Yuyang Huang and Qizheng Zhang and Kuntai Du and Jiayi Yao and Shan Lu and Ganesh Ananthanarayanan and Michael Maire and Henry Hoffmann and Ari Holtzman and Junchen Jiang},
      year={2024},
      eprint={2310.07240},
      archivePrefix={arXiv},
      primaryClass={cs.NI},
      url={https://arxiv.org/abs/2310.07240},
}
```

**Paper:** [CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving](https://arxiv.org/abs/2310.07240)

## Contact

- **LMCache GitHub:** [https://github.com/LMCache/LMCache](https://github.com/LMCache/LMCache)
- **Chat with the developers:** [Interest Form](https://forms.gle/mQfQDUXbKfp2St1z7)
- **LMCache Slack:** [join here](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ)
- **vLLM Production-Stack Slack:** [channel](https://vllm-dev.slack.com/archives/C089SMEAKRA)


**CacheGen: persistent, streaming context for fast, scalable LLMs—the LMCache Lab way!** 🚀