51 changes: 42 additions & 9 deletions README.md
@@ -21,18 +21,49 @@ enables more straightforward and flexible management of heterogeneous computing
UCM achieves a 3-10x reduction in inference latency across various scenarios, including multi-turn dialogue and
long-context reasoning tasks.

![architecture.png](./docs/source/_static/images/architecture.png)
### Motivation

As model sizes grow, the KV cache becomes larger and sparser, especially for long-sequence requests. To reduce GPU
memory usage, offloading the full KV cache to external storage and keeping only partial or compressed KV in GPU memory
has become a popular direction. This also reduces GPU computation and allows longer sequences and larger batch sizes
during decoding.

There are many choices for sparse KV cache. Recent papers point out that no single method fits all scenarios and all
models, so it is better to build a common framework into which different sparse algorithms can be plugged, just as the
KV connector does for prefix cache (PC).

![idea.png](./docs/source/_static/images/idea.png)

All gray boxes are classes that already exist in 0.9.2. Green boxes are proposed additions, and light green ones show
future subclasses built on this framework.

SparseKVBase is the base class for the different algorithms. Just like the KV connector design, it hooks a few places
in the scheduler and in layer.py so that sparse algorithms can perform their extra loading, dumping, and computation of
sparse KV blocks.
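
As a rough sketch of that shape (the class name comes from the diagram; every method name and signature below is an
assumption for illustration, not the real UCM API):

```python
# Hypothetical sketch of the plugin surface described above.
from abc import ABC, abstractmethod


class SparseKVBase(ABC):
    """Base class that each sparse-attention algorithm subclasses."""

    @abstractmethod
    def allocate_blocks(self, request, num_tokens: int) -> list[int]:
        """Scheduler hook: reserve KV blocks for this request."""

    @abstractmethod
    def load_blocks(self, block_ids: list[str]) -> None:
        """Fetch the KV blocks a step needs from external storage."""

    @abstractmethod
    def dump_blocks(self, block_ids: list[str]) -> None:
        """Write evicted or newly produced KV blocks to storage."""

    @abstractmethod
    def compute_sparse_attention(self, layer, query):
        """layer.py hook: attend over only the retained sparse blocks."""
```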

SparseKVManager provides different KV block allocation methods for different algorithms. To keep every implementation
under SparseKVBase, it calls into SparseKVBase, and the actual allocation is implemented in each sparse algorithm's
subclass.
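
Read that way, the manager is mostly a dispatcher. A minimal sketch under the same hypothetical names:

```python
# Hypothetical sketch: allocation policy lives in the algorithm's
# subclass, so the manager only forwards to SparseKVBase.
class SparseKVManager:
    def __init__(self, algorithm: SparseKVBase):
        self.algorithm = algorithm  # a sparse-algorithm subclass

    def allocate_blocks(self, request, num_tokens: int) -> list[int]:
        # No allocation policy here; the real implementation is in
        # the sparse algorithm's subclass of SparseKVBase.
        return self.algorithm.allocate_blocks(request, num_tokens)
```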

KVStoreBase decouples sparse algorithms from external storage. It defines the methods for talking to external storage,
so any sparse algorithm can work with any storage backend. The core concept is blocks identified by ID plus an offset,
which suits not only sparse attention but also, quite naturally, prefix cache. KVStoreConnector connects it to the
current KVConnectorBase_V1 to provide the PC function.
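
A sketch of what that interface could look like (block-ID-plus-offset is from the text; the method names and
signatures are assumptions):

```python
from abc import ABC, abstractmethod


class KVStoreBase(ABC):
    """Hypothetical storage interface: blocks addressed by ID plus
    offset, serving sparse attention and prefix cache alike."""

    @abstractmethod
    def lookup(self, block_ids: list[str]) -> list[bool]:
        """Report which blocks already exist in external storage."""

    @abstractmethod
    def load(self, block_id: str, offset: int, dst: bytearray) -> None:
        """Read a block (or a slice starting at `offset`) into `dst`."""

    @abstractmethod
    def dump(self, block_id: str, offset: int, src: bytes) -> None:
        """Persist a block (or a slice starting at `offset`) from `src`."""
```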

NFSStore is a sample implementation that stores blocks in the local file system, or on an NFS mount point in the
multi-server case.
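
Assuming one file per block ID under a shared directory, the core of such a store might look like this (illustrative
only):

```python
import pathlib


class NFSStore(KVStoreBase):
    """Blocks as files under a root that may be a local directory
    or an NFS mount shared across servers."""

    def __init__(self, root: str):
        self.root = pathlib.Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def lookup(self, block_ids: list[str]) -> list[bool]:
        return [(self.root / bid).exists() for bid in block_ids]

    def load(self, block_id: str, offset: int, dst: bytearray) -> None:
        with open(self.root / block_id, "rb") as f:
            f.seek(offset)
            dst[:] = f.read(len(dst))

    def dump(self, block_id: str, offset: int, src: bytes) -> None:
        path = self.root / block_id
        with open(path, "r+b" if path.exists() else "wb") as f:
            f.seek(offset)
            f.write(src)
```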

LocalCachedStore can wrap any store to add a local DRAM read-cache layer.
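
A minimal sketch of that layering (hypothetical; a real cache would also need an eviction policy and concurrency
control):

```python
class LocalCachedStore(KVStoreBase):
    """DRAM read-through cache in front of any KVStoreBase."""

    def __init__(self, backend: KVStoreBase):
        self.backend = backend
        self.cache: dict[tuple[str, int], bytes] = {}

    def lookup(self, block_ids: list[str]) -> list[bool]:
        return self.backend.lookup(block_ids)

    def load(self, block_id: str, offset: int, dst: bytearray) -> None:
        key = (block_id, offset)
        if key not in self.cache:                 # miss: fall through
            buf = bytearray(len(dst))
            self.backend.load(block_id, offset, buf)
            self.cache[key] = bytes(buf)
        dst[:] = self.cache[key]                  # hit: served from DRAM

    def dump(self, block_id: str, offset: int, src: bytes) -> None:
        self.cache.pop((block_id, offset), None)  # invalidate stale copy
        self.backend.dump(block_id, offset, src)
```

For example, `LocalCachedStore(NFSStore("/mnt/kv"))` would read each NFS-resident block once and serve repeated reads
from local memory.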

---

## Supported Features

- Prefix Cache
- Cache Blend
- Model Window Extrapolation
- Prefill Offload
- Sparse Attention
- Sparse Attention Offload
- Heterogeneous PD Disaggregation

---

@@ -52,7 +83,9 @@ please refer to [Quick Start](./docs/source/getting-started/quick_start.md).
---

## Contact Us

For technical questions and feature requests, please use
GitHub [Issues](https://github.com/ModelEngine-Group/unified-cache-management/issues).

## License

Binary file removed docs/source/_static/images/branch.png