1 change: 1 addition & 0 deletions docs/source/developer-guide/pd_disaggregation.md
@@ -0,0 +1 @@
# PD Disaggregation by Storage
1 change: 1 addition & 0 deletions docs/source/developer-guide/sparse_attention.md
@@ -0,0 +1 @@
# Sparse Attention by Storage
1 change: 0 additions & 1 deletion docs/source/developer_guide/design/add_connector.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/source/developer_guide/design/architecture.md

This file was deleted.

12 changes: 0 additions & 12 deletions docs/source/developer_guide/design/index.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/source/developer_guide/design/nfs_connector.md

This file was deleted.

19 changes: 0 additions & 19 deletions docs/source/developer_guide/design/vllm_institution.md

This file was deleted.

8 changes: 0 additions & 8 deletions docs/source/developer_guide/performance/index.md

This file was deleted.

This file was deleted.

10 changes: 0 additions & 10 deletions docs/source/getting-started/installation/index.md

This file was deleted.

@@ -1,4 +1,4 @@
# GPU
# Installation-GPU
This document describes how to install unified-cache-management.

## Requirements
@@ -1,4 +1,4 @@
# NPU
# Installation-NPU
This document describes how to manually install unified-cache-management when using an Ascend NPU.

## Requirements
17 changes: 9 additions & 8 deletions docs/source/index.md
@@ -29,24 +29,25 @@ Make KVCache Great Again!
:caption: Getting Started
:maxdepth: 1
getting-started/quick_start
getting-started/installation/index
getting-started/installation_gpu
getting-started/installation_npu
:::

:::{toctree}
:caption: User Guide
:maxdepth: 1
user_guide/support_matrix/index
user_guide/features/index
user_guide/connector_guide/index
user_guide/engine_guide/index
user-guide/prefix-cache/index
user-guide/sparse-attention/index
user-guide/pd-disaggregation/index
user-guide/engine-integration/index
:::

:::{toctree}
:caption: Developer Guide
:maxdepth: 1
developer_guide/design/index
developer_guide/contributing
developer_guide/performance/index
developer-guide/sparse_attention
developer-guide/pd_disaggregation
developer-guide/contribute
:::

:::{toctree}
@@ -1,2 +1,2 @@
# Engine Guide
# Engine Integration
This section provides a guide to the serving engines currently supported by UCM and how to use them with UCM.
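
To make the integration concrete, here is a minimal sketch of how a serving engine such as vLLM could be pointed at a UCM-style KV connector through its KV-transfer configuration. The connector name `UCMConnector` and the extra-config keys are hypothetical placeholders for illustration, not the identifiers actually exposed by UCM.

```python
# Hypothetical sketch: wiring a UCM-style KV connector into vLLM.
# The connector name and the extra-config keys are placeholders, not the
# identifiers actually exposed by unified-cache-management.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

kv_config = KVTransferConfig(
    kv_connector="UCMConnector",              # placeholder connector name
    kv_role="kv_both",                        # this instance both saves and loads KV cache
    kv_connector_extra_config={"storage_backend": "dram"},  # assumed option
)

llm = LLM(model="facebook/opt-125m", kv_transfer_config=kv_config)
out = llm.generate(["Hello, UCM!"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```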
@@ -1,4 +1,4 @@
# Disaggregated Prefill
# PD Disaggregation

:::{toctree}
:maxdepth: 2
@@ -1,4 +1,4 @@
# DRAM Connector
# DRAM Store

This document provides a usage example and configuration guide for the **DRAM Connector**. This connector enables offloading of KV cache from GPU HBM to CPU DRAM, helping reduce memory pressure and support larger models or batch sizes.
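
To illustrate the offload path, below is a minimal, self-contained sketch of the idea behind a DRAM-backed KV store: blocks are copied out of GPU HBM into host tensors and copied back only when needed. The class and method names are illustrative assumptions, not the connector's real interface.

```python
import torch
from typing import Dict, Optional

class DramKVStore:
    """Toy DRAM store: KV-cache blocks kept as CPU tensors, keyed by block hash.
    Illustrative only; the real DRAM connector exposes a different interface."""

    def __init__(self) -> None:
        self._blocks: Dict[str, torch.Tensor] = {}

    def dump(self, block_hash: str, kv_block: torch.Tensor) -> None:
        # Copy the block out of HBM into host memory; pin it when CUDA is present
        # so the copy back to the device can be asynchronous.
        block = kv_block.detach().to("cpu")
        if torch.cuda.is_available():
            block = block.pin_memory()
        self._blocks[block_hash] = block

    def load(self, block_hash: str, device: str = "cpu") -> Optional[torch.Tensor]:
        # Copy the block back to the requested device only when it is needed.
        block = self._blocks.get(block_hash)
        return None if block is None else block.to(device, non_blocking=True)

store = DramKVStore()
store.dump("blk-0", torch.randn(2, 16, 8, 64))  # arbitrary [kv, tokens, heads, dim] shape
restored = store.load("blk-0")                  # pass device="cuda" on a GPU machine
```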

@@ -1,13 +1,9 @@
# Store
# Prefix Cache

:::{toctree}
:maxdepth: 1
:caption: Index
base
3fs_store
dram_store
vfs_store
nfs_store
nds_store
mooncake_store
:::
@@ -1,4 +1,4 @@
# NFS Connector
# NFS Store

This document provides a usage example and configuration guide for the **NFS Connector**. This connector enables offloading of KV cache from GPU HBM to SSD or Local Disk, helping reduce memory pressure and support larger models or batch sizes.
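
To illustrate the disk-backed path, here is a similarly hedged sketch in which each KV block is serialized to a file under the shared mount point and read back on a cache hit. The mount path, file layout, and method names are assumptions for illustration, not the connector's real configuration.

```python
import os
import torch
from typing import Optional

class FileKVStore:
    """Toy file-backed store: one file per KV block under an NFS or local-SSD mount.
    Illustrative only; the real NFS connector uses its own layout and API."""

    def __init__(self, root: str = "/mnt/ucm_kv_cache") -> None:  # assumed mount point
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, block_hash: str) -> str:
        return os.path.join(self.root, f"{block_hash}.pt")

    def dump(self, block_hash: str, kv_block: torch.Tensor) -> None:
        # Persist the block so it survives beyond HBM capacity (and across restarts).
        torch.save(kv_block.detach().cpu(), self._path(block_hash))

    def load(self, block_hash: str, device: str = "cpu") -> Optional[torch.Tensor]:
        path = self._path(block_hash)
        if not os.path.exists(path):
            return None  # cache miss: the engine recomputes this block
        return torch.load(path, map_location=device)
```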

@@ -3,12 +3,12 @@
Attention mechanisms, especially in LLMs, are often the bottleneck in terms of latency during inference due to their computational complexity. Despite their importance in capturing contextual relationships, traditional attention requires processing all token interactions, leading to significant delays.

<p align="center">
<img alt="UCM" src="../../../images/attention_overhead.png" width="80%">
<img alt="UCM" src="../../images/attention_overhead.png" width="80%">
</p>

Researchers have found that attention in LLMs is highly sparse, with only a small fraction of tokens receiving most of the attention weight:
<p align="center">
<img alt="UCM" src="../../../images/attention_sparsity.png" width="80%">
<img alt="UCM" src="../../images/attention_sparsity.png" width="80%">
</p>

This motivates them to actively develop sparse attention algorithms that address the latency issue. These algorithms reduce the number of token interactions by focusing only on the most relevant parts of the input, thereby lowering computation and memory requirements.
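
As a toy illustration of the top-k flavour of such algorithms (not any specific algorithm shipped with UCM), the snippet below scores every cached key against the current query and runs attention only over the highest-scoring tokens.

```python
import torch

def topk_sparse_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          k_keep: int = 64) -> torch.Tensor:
    """Single-query toy example: attend only over the k_keep highest-scoring keys.
    q: [heads, dim]; k, v: [tokens, heads, dim]. Not a UCM algorithm, just the general idea."""
    scores = torch.einsum("hd,thd->ht", q, k) / q.shape[-1] ** 0.5   # [heads, tokens]
    k_keep = min(k_keep, k.shape[0])
    top_idx = scores.topk(k_keep, dim=-1).indices                    # per-head most relevant tokens
    out = torch.empty_like(q)
    for h in range(q.shape[0]):
        sel = top_idx[h]
        weights = torch.softmax(scores[h, sel], dim=-1)              # softmax over kept tokens only
        out[h] = weights @ v[sel, h, :]                              # weighted sum of kept values
    return out

# Example: 1024 cached tokens, but each head only attends to its 64 most relevant ones.
q, k, v = torch.randn(8, 64), torch.randn(1024, 8, 64), torch.randn(1024, 8, 64)
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([8, 64])
```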
@@ -24,7 +24,7 @@ By utilizing UCM, researchers can efficiently implement rapid prototyping and te
### Overview
The core concept of our UCMSparse attention framework is to offload the complete Key-Value (KV) cache to a dedicated KV cache storage. We then identify the crucial KV pairs relevant to the current context, as determined by our sparse attention algorithms, and selectively load only the necessary portions of the KV cache from storage into High Bandwidth Memory (HBM). This design significantly reduces the HBM footprint while accelerating generation speed.
<p align="center">
<img alt="UCM" src="../../../images/sparse_attn_arch.png" width="80%">
<img alt="UCM" src="../../images/sparse_attn_arch.png" width="80%">
</p>
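
As a hedged sketch of this flow (all names, shapes, and the block-level scoring rule are invented for illustration and are not the real UCMSparse interfaces), a single decode step might look like this: score every offloaded block cheaply, pull only the top few back into fast memory, and run attention over that subset.

```python
import torch

def sparse_decode_step(query: torch.Tensor, kv_store: dict, budget: int = 8, device: str = "cpu"):
    """Illustrative flow only. `kv_store` maps block_hash -> CPU tensor of shape
    [2, block_tokens, heads, dim]; only the blocks judged most relevant to `query`
    ([heads, dim]) are brought back to `device` for the attention computation."""
    # 1. Cheap per-block relevance score (here: query against the block's mean key).
    scores = {h: torch.einsum("hd,hd->", query, blk[0].mean(dim=0)).item()
              for h, blk in kv_store.items()}
    # 2. Fetch only the top-`budget` blocks from storage into fast memory.
    hot = sorted(scores, key=scores.get, reverse=True)[:budget]
    keys = torch.cat([kv_store[h][0] for h in hot]).to(device)      # [tokens, heads, dim]
    values = torch.cat([kv_store[h][1] for h in hot]).to(device)
    # 3. Attention over the retrieved subset only, so the HBM footprint stays bounded
    #    by `budget` blocks regardless of the full context length.
    attn = torch.softmax(
        torch.einsum("hd,thd->ht", query.to(device), keys) / query.shape[-1] ** 0.5, dim=-1)
    return torch.einsum("ht,thd->hd", attn, values)
```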


1 change: 1 addition & 0 deletions docs/source/user-guide/sparse-attention/esa.md
@@ -0,0 +1 @@
# ESA
1 change: 1 addition & 0 deletions docs/source/user-guide/sparse-attention/gsa.md
@@ -0,0 +1 @@
# GSA
@@ -1,4 +1,4 @@
# Sparse
# Sparse Attention

:::{toctree}
:maxdepth: 1
@@ -8,6 +8,4 @@ esa
gsa
kvcomp
kvstar
prefill_offload
cacheblend
:::
1 change: 1 addition & 0 deletions docs/source/user-guide/sparse-attention/kvcomp.md
@@ -0,0 +1 @@
# KVComp
2 changes: 0 additions & 2 deletions docs/source/user_guide/connector_guide/index.md

This file was deleted.

10 changes: 0 additions & 10 deletions docs/source/user_guide/examples/index.md

This file was deleted.

180 changes: 0 additions & 180 deletions docs/source/user_guide/examples/mooncake_conn.md

This file was deleted.
