[Feature] Option to also use host memory for the KV cache

### Motivation

KV cache hit rate is probably the biggest factor in inference performance for my workload, and I recently read this post from Character.AI:

https://research.character.ai/optimizing-inference/

> To solve this problem, we developed an inter-turn caching system. For every prefilled prefix and generated message, **we cache the KV values on host memory** and retrieve them for future queries. Similar to RadixAttention ([Zheng et al., 2023](https://arxiv.org/abs/2312.07104?ref=research.character.ai)), we organize cached KV tensors in a LRU cache with a tree structure. The cached KV values are indexed by a rolling hash of prefix tokens. For each new query, a rolling hash is calculated for each prefix of the context, and the cache is retrieved for the longest match. This allows reusing the cache even for partially matched messages.

![image](https://github.com/InternLM/lmdeploy/assets/1167575/3e4bfe2a-edcd-4c14-b736-fdc86ad4f1b9)
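To make the quoted scheme concrete, here is a rough Python sketch of how I understand it: hash every prefix of the token sequence with a rolling hash, keep the cached KV values in an LRU structure, and on a new query return the longest prefix that is cached. This is not Character.AI's or LMDeploy's code. `PrefixKVCache`, `block_size`, and the hash constants are made up for illustration, a flat `OrderedDict` stands in for the tree structure, and plain Python lists stand in for KV tensors in host memory.

```python
from collections import OrderedDict


class PrefixKVCache:
    """Toy prefix cache: rolling hash of a token prefix -> cached KV payload."""

    def __init__(self, capacity, block_size=2):
        self.capacity = capacity      # max number of cached prefixes
        self.block_size = block_size  # only block-aligned prefixes are cached (assumption)
        self._store = OrderedDict()   # hash -> (prefix tokens, kv), oldest entry first

    def _prefix_hashes(self, tokens):
        """Yield (prefix_len, rolling_hash) at every block boundary."""
        h = 0
        for i, tok in enumerate(tokens, start=1):
            # Polynomial rolling hash: extending the prefix by one token is O(1).
            h = (h * 1_000_003 + tok + 1) % (2**61 - 1)
            if i % self.block_size == 0:
                yield i, h

    def put(self, tokens, kv):
        """Cache kv[:n] for each block-aligned prefix of length n."""
        for n, h in self._prefix_hashes(tokens):
            # The tokens are stored too, so a hash collision cannot return wrong KV.
            self._store[h] = (tuple(tokens[:n]), kv[:n])
            self._store.move_to_end(h)
            while len(self._store) > self.capacity:
                self._store.popitem(last=False)  # evict least recently used

    def longest_match(self, tokens):
        """Return (matched_len, kv) for the longest cached prefix, or (0, None)."""
        best = (0, None)
        for n, h in self._prefix_hashes(tokens):
            entry = self._store.get(h)
            if entry is not None and entry[0] == tuple(tokens[:n]):
                self._store.move_to_end(h)  # refresh LRU position on a hit
                best = (n, entry[1])
        return best


cache = PrefixKVCache(capacity=8, block_size=2)
cache.put([1, 2, 3, 4, 5], ["kv1", "kv2", "kv3", "kv4", "kv5"])
matched_len, kv = cache.longest_match([1, 2, 3, 4, 9, 9])
print(matched_len, kv)
```

In the usage example the new query shares its first four tokens with the cached sequence, so only the remaining tokens would need to be prefilled.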

In my case, ideally LMDeploy would be able to use **both** host memory and spare GPU memory, since I run a heterogeneous cluster: some GPUs have lots of free VRAM after loading model weights and others have only a little, and some machines have lots of system memory (>200GB) whereas others have only a little (20GB).
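As a sketch of what I mean by using both, here is a toy two-tier LRU in Python where each node gets its own GPU and host budgets. Blocks evicted from the GPU tier are demoted to host memory instead of being dropped, and a host hit promotes the block back. `TwoTierCache`, `gpu_blocks`, and `host_blocks` are hypothetical names, not LMDeploy options, and the dictionary moves stand in for real device-to-host and host-to-device copies.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy two-tier LRU: a small GPU tier backed by a larger host-memory tier."""

    def __init__(self, gpu_blocks, host_blocks):
        self.gpu_blocks = gpu_blocks    # per-node GPU budget, in KV blocks
        self.host_blocks = host_blocks  # per-node host budget, in KV blocks
        self.gpu = OrderedDict()        # key -> block, oldest entry first
        self.host = OrderedDict()

    def put(self, key, block):
        self.host.pop(key, None)        # a block lives in at most one tier
        self.gpu[key] = block
        self.gpu.move_to_end(key)
        self._spill()

    def _spill(self):
        # Demote the least recently used GPU blocks to host memory.
        while len(self.gpu) > self.gpu_blocks:
            key, block = self.gpu.popitem(last=False)
            self.host[key] = block      # stands in for a device-to-host copy
            self.host.move_to_end(key)
        # Only when the host tier is full as well is a block actually lost.
        while len(self.host) > self.host_blocks:
            self.host.popitem(last=False)

    def get(self, key):
        """Return (block, tier), where tier is 'gpu', 'host', or 'miss'."""
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key], "gpu"
        if key in self.host:
            block = self.host.pop(key)
            self.put(key, block)        # promote; stands in for a host-to-device copy
            return block, "host"
        return None, "miss"


# A node with little spare VRAM but plenty of system memory.
node = TwoTierCache(gpu_blocks=1, host_blocks=2)
node.put("a", "A")
node.put("b", "B")        # "a" is demoted to host instead of being dropped
print(node.get("a"))      # served from the host tier, then promoted
```

A node with lots of free VRAM and little system memory would just flip the two budgets, which is why I would want them to be configurable per machine.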

The awesome thing about LMDeploy in my opinion is the 4-bit KV cache combined with prefix caching, which allows lots of prefixes to be stored. This feature would take it to a completely new level - it would be a *massive* advantage over other frameworks, since most cloud GPU machines have a lot of system memory, and it's rarely used for anything.

So it seems like a "free" cache hit rate improvement just waiting to be implemented, since CharacterAI has apparently proven that it's possible & practical.

### Related resources

_No response_

### Additional context

I understand from [this comment](https://github.com/InternLM/lmdeploy/issues/1737#issuecomment-2156918312) that host memory KV storage is a low priority at the moment:

> The overhead from using CPU offloading outweighs the benefits. None of the mainstream frameworks have successfully implemented high-performance and effective CPU offloading. It is a low priority at the moment.

But I figured I'd open this as a tracking issue anyway so I can hear about any progress/updates when it eventually becomes higher priority.

[Feature] Option to also use host memory for the KV cache #1817
