[Feature] Option to also use host memory for the KV cache #1817

@josephrocca

Description

Motivation

KV cache hit rate is probably the biggest factor in performance for me, and I recently read this:

https://research.character.ai/optimizing-inference/

To solve this problem, we developed an inter-turn caching system. For every prefilled prefix and generated message, we cache the KV values on host memory and retrieve them for future queries. Similar to RadixAttention (Zheng et al., 2023), we organize cached KV tensors in a LRU cache with a tree structure. The cached KV values are indexed by a rolling hash of prefix tokens. For each new query, a rolling hash is calculated for each prefix of the context, and the cache is retrieved for the longest match. This allows reusing the cache even for partially matched messages.
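A minimal sketch of the lookup described in that quote, under some loud assumptions: `HostPrefixCache`, `prefix_hashes`, `BASE` and `MOD` are hypothetical names, not anything from CharacterAI or LMDeploy; the "KV values" are opaque placeholders; and the tree structure is flattened into a single hash-keyed LRU map for brevity. It only shows how a rolling hash over token prefixes lets you find the longest cached prefix of a new query.

```python
from collections import OrderedDict

# Hypothetical rolling-hash parameters, chosen only for illustration.
MOD = (1 << 61) - 1
BASE = 1_000_003


def prefix_hashes(tokens):
    """Yield (prefix_length, rolling_hash) for every non-empty prefix."""
    h = 0
    for length, tok in enumerate(tokens, start=1):
        h = (h * BASE + tok + 1) % MOD
        yield length, h


class HostPrefixCache:
    """LRU map from hashed token prefixes to (placeholder) KV values."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # (length, hash) -> (tokens, kv)

    def put(self, tokens, kv):
        tokens = tuple(tokens)
        if not tokens:
            return
        key = list(prefix_hashes(tokens))[-1]  # hash of the full sequence
        self._entries[key] = (tokens, kv)
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def longest_match(self, tokens):
        """Return (matched_length, kv) for the longest cached prefix."""
        tokens = tuple(tokens)
        for key in reversed(list(prefix_hashes(tokens))):
            hit = self._entries.get(key)
            # Compare the stored tokens too, to guard against hash collisions.
            if hit is not None and hit[0] == tokens[:key[0]]:
                self._entries.move_to_end(key)
                return key[0], hit[1]
        return 0, None


cache = HostPrefixCache(capacity=2)
cache.put([1, 2, 3], "kv-a")
cache.put([1, 2, 3, 4, 5], "kv-b")
print(cache.longest_match([1, 2, 3, 4, 5, 6]))  # (5, 'kv-b')
print(cache.longest_match([1, 2, 3, 9]))        # (3, 'kv-a')
```

In a real engine only the tokens after the matched length would need prefill, and the cached values would be KV blocks copied between host and GPU memory rather than strings.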

In my case, ideally LMDeploy would be able to use both host memory and spare GPU memory, since I use a 'heterogeneous' cluster, meaning that some GPUs have lots of free VRAM (after loading model weights), and others have only a little. Likewise, some machines have lots of system memory (>200GB), whereas others have only a little (20GB).
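To make the request concrete, here is a rough sketch of the two-tier behaviour I have in mind. Everything in it is hypothetical: `TwoTierKVCache`, `gpu_blocks` and `host_blocks` are made-up names, not LMDeploy options, and the blocks are plain Python objects rather than tensors. The idea is that blocks evicted from the GPU tier are demoted to host memory instead of being dropped, and a host hit promotes the block back to the GPU tier.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Hypothetical GPU tier backed by a host-memory tier, both LRU."""

    def __init__(self, gpu_blocks, host_blocks):
        self.gpu_blocks = gpu_blocks    # per-node capacity of the GPU tier
        self.host_blocks = host_blocks  # per-node capacity of the host tier
        self.gpu = OrderedDict()
        self.host = OrderedDict()

    def put(self, key, block):
        self.host.pop(key, None)  # a block lives in at most one tier
        self.gpu[key] = block
        self.gpu.move_to_end(key)
        self._spill()

    def get(self, key):
        """Return (block, tier) where tier is 'gpu', 'host' or 'miss'."""
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key], "gpu"
        if key in self.host:
            block = self.host.pop(key)
            self.put(key, block)  # promote back to the GPU tier
            return block, "host"
        return None, "miss"

    def _spill(self):
        # Demote least recently used GPU blocks to host memory.
        while len(self.gpu) > self.gpu_blocks:
            key, block = self.gpu.popitem(last=False)
            self.host[key] = block
        # Drop least recently used host blocks once that tier is full too.
        while len(self.host) > self.host_blocks:
            self.host.popitem(last=False)
```

On a heterogeneous cluster each node would pick its own sizes, for example a large `host_blocks` on a machine with >200GB of system memory and a small one on a machine with 20GB.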

The awesome thing about LMDeploy in my opinion is the 4-bit KV cache combined with prefix caching, which allows lots of prefixes to be stored. This feature would take it to a completely new level - it would be a massive advantage over other frameworks, since most cloud GPU machines have a lot of system memory, and it's rarely used for anything.

So it seems like a "free" cache hit rate improvement just waiting to be implemented, since CharacterAI has apparently proven that it's possible & practical.

Related resources

No response

Additional context

I understand from this comment that host memory KV storage is a low priority at the moment:

The overhead from using CPU offloading outweighs the benefits. None of the mainstream frameworks have successfully implemented high-performance and effective CPU offloading. It is a low priority at the moment.

But I figured I'd open this as a tracking issue anyway so I can hear about any progress/updates when it eventually becomes higher priority.
