Motivation
KV cache hit rate is probably the biggest factor in performance for my workload, and I recently read this in Character.AI's post on inference optimization:
https://research.character.ai/optimizing-inference/
To solve this problem, we developed an inter-turn caching system. For every prefilled prefix and generated message, we cache the KV values on host memory and retrieve them for future queries. Similar to RadixAttention (Zheng et al., 2023), we organize cached KV tensors in a LRU cache with a tree structure. The cached KV values are indexed by a rolling hash of prefix tokens. For each new query, a rolling hash is calculated for each prefix of the context, and the cache is retrieved for the longest match. This allows reusing the cache even for partially matched messages.
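To make the lookup described in that quote concrete, here is a minimal sketch of longest-prefix matching with a rolling hash and LRU eviction. It is not Character.AI's or LMDeploy's implementation: the names `prefix_hashes` and `PrefixKVCache` are hypothetical, the cache is a flat `OrderedDict` rather than a tree, the hash constants are arbitrary, and the `kv` payload is a stand-in for real KV tensors held in host memory.

```python
from collections import OrderedDict

MASK = (1 << 64) - 1   # keep hashes in 64 bits (arbitrary choice)
BASE = 1_000_003       # arbitrary multiplier for the polynomial hash


def prefix_hashes(tokens):
    """Return the rolling hash of every prefix tokens[:1], tokens[:2], ..."""
    h = 0
    out = []
    for t in tokens:
        # Extending the prefix by one token costs O(1), so hashing all
        # prefixes of a context is a single pass over the tokens.
        h = (h * BASE + t + 1) & MASK
        out.append(h)
    return out


class PrefixKVCache:
    """Hypothetical LRU cache mapping prefix hash -> (prefix length, KV data)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def put(self, tokens, kv):
        if not tokens:
            return
        key = prefix_hashes(tokens)[-1]
        self._entries[key] = (len(tokens), kv)
        self._entries.move_to_end(key)          # mark as most recently used
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict least recently used

    def longest_match(self, tokens):
        """Return (matched_length, kv) for the longest cached prefix."""
        hashes = prefix_hashes(tokens)
        for n in range(len(tokens), 0, -1):
            entry = self._entries.get(hashes[n - 1])
            # A real system would also compare the tokens themselves to
            # rule out hash collisions; this sketch only checks the length.
            if entry is not None and entry[0] == n:
                self._entries.move_to_end(hashes[n - 1])
                return n, entry[1]
        return 0, None


cache = PrefixKVCache(capacity=2)
cache.put([1, 2, 3], "kv_a")
cache.put([1, 2, 3, 4, 5], "kv_b")
matched, kv = cache.longest_match([1, 2, 3, 4, 5, 6])
```

With a hit on the first `matched` tokens, only the remaining tokens of the new query need to be prefilled, which is where the hit-rate benefit comes from.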

In my case, ideally LMDeploy would be able to use both host memory and spare GPU memory, since I run a heterogeneous cluster: some GPUs have lots of free VRAM after loading model weights and others have only a little, and some machines have lots of system memory (>200GB) whereas others have only a little (20GB).
The awesome thing about LMDeploy, in my opinion, is the 4-bit KV cache combined with prefix caching, which allows lots of prefixes to be stored. Host-memory KV caching would take it to a completely new level and would be a massive advantage over other frameworks, since most cloud GPU machines have a lot of system memory that is rarely used for anything.
So it seems like a "free" cache hit rate improvement just waiting to be implemented, since CharacterAI has apparently proven that it's possible & practical.
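For a rough sense of scale, here is a back-of-the-envelope sketch of how many cached tokens would fit in host memory. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example, not a measurement of any specific deployment, and the estimate ignores the extra space that quantization scales and zero-points take, so real capacity would be somewhat lower.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits):
    """Bytes of KV cache per token: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bits // 8


def tokens_that_fit(memory_bytes, bytes_per_token):
    """How many tokens' worth of KV cache fit in the given memory."""
    return memory_bytes // bytes_per_token


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

fp16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits=16)
int4_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits=4)

for label, host_bytes in [("200GB host", 200 * 10**9), ("20GB host", 20 * 10**9)]:
    print(label,
          "fp16 tokens:", tokens_that_fit(host_bytes, fp16_per_token),
          "4-bit tokens:", tokens_that_fit(host_bytes, int4_per_token))
```

Under these assumptions the 4-bit cache holds four times as many tokens as fp16 in the same memory, and a 200GB host could hold on the order of millions of cached tokens.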
Related resources
No response
Additional context
I understand from this comment that host memory KV storage is a low priority at the moment:
The overhead from using CPU offloading outweighs the benefits. None of the mainstream frameworks have successfully implemented high-performance and effective CPU offloading. It is a low priority at the moment.
But I figured I'd open this as a tracking issue anyway so I can hear about any progress/updates when it eventually becomes higher priority.