1 change: 1 addition & 0 deletions docs/source/developer-guide/pd_disaggregation.md
@@ -0,0 +1 @@
# PD Disaggregation by Storage
1 change: 1 addition & 0 deletions docs/source/developer-guide/sparse_attention.md
@@ -0,0 +1 @@
# Sparse Attention by Storage
1 change: 0 additions & 1 deletion docs/source/developer_guide/design/add_connector.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/source/developer_guide/design/architecture.md

This file was deleted.

12 changes: 0 additions & 12 deletions docs/source/developer_guide/design/index.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/source/developer_guide/design/nfs_connector.md

This file was deleted.

19 changes: 0 additions & 19 deletions docs/source/developer_guide/design/vllm_institution.md

This file was deleted.

8 changes: 0 additions & 8 deletions docs/source/developer_guide/performance/index.md

This file was deleted.

This file was deleted.

10 changes: 0 additions & 10 deletions docs/source/getting-started/installation/index.md

This file was deleted.

@@ -1,4 +1,4 @@
# GPU
# Installation-GPU
This document describes how to install unified-cache-management.

## Requirements
@@ -1,4 +1,4 @@
# NPU
# Installation-NPU
This document describes how to manually install unified-cache-management when using an Ascend NPU.

## Requirements
17 changes: 9 additions & 8 deletions docs/source/index.md
@@ -29,24 +29,25 @@ Make KVCache Great Again!
:caption: Getting Started
:maxdepth: 1
getting-started/quick_start
getting-started/installation/index
getting-started/installation_gpu
getting-started/installation_npu
:::

:::{toctree}
:caption: User Guide
:maxdepth: 1
user_guide/support_matrix/index
user_guide/features/index
user_guide/connector_guide/index
user_guide/engine_guide/index
user-guide/prefix-cache/index
user-guide/sparse-attention/index
user-guide/pd-disaggregation/index
user-guide/engine-integration/index
:::

:::{toctree}
:caption: Developer Guide
:maxdepth: 1
developer_guide/design/index
developer_guide/contributing
developer_guide/performance/index
developer-guide/sparse_attention
developer-guide/pd_disaggregation
developer-guide/contribute
:::

:::{toctree}
@@ -1,2 +1,2 @@
# Engine Guide
# Engine Integration
This section provides a guide to the serving engines currently supported by UCM and how to use them with UCM.
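
To make the integration concrete, here is a minimal sketch of how a serving engine such as vLLM could be pointed at a UCM-style KV connector through its KV-transfer configuration. The connector name `UCMConnector` and the extra-config keys are hypothetical placeholders for illustration, not the identifiers actually exposed by UCM.

```python
# Hypothetical sketch: wiring a UCM-style KV connector into vLLM.
# The connector name and the extra-config keys are placeholders, not the
# identifiers actually exposed by unified-cache-management.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

kv_config = KVTransferConfig(
    kv_connector="UCMConnector",              # placeholder connector name
    kv_role="kv_both",                        # this instance both saves and loads KV cache
    kv_connector_extra_config={"storage_backend": "dram"},  # assumed option
)

llm = LLM(model="facebook/opt-125m", kv_transfer_config=kv_config)
out = llm.generate(["Hello, UCM!"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```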
@@ -1,4 +1,4 @@
# Disaggregated Prefill
# PD Disaggregation

:::{toctree}
:maxdepth: 2
@@ -1,4 +1,4 @@
# DRAM Connector
# DRAM Store

This document provides a usage example and configuration guide for the **DRAM Connector**. This connector enables offloading of KV cache from GPU HBM to CPU DRAM, helping reduce memory pressure and support larger models or batch sizes.
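
To illustrate the offload path, below is a minimal, self-contained sketch of the idea behind a DRAM-backed KV store: blocks are copied out of GPU HBM into host tensors and copied back only when needed. The class and method names are illustrative assumptions, not the connector's real interface.

```python
import torch
from typing import Dict, Optional

class DramKVStore:
    """Toy DRAM store: KV-cache blocks kept as CPU tensors, keyed by block hash.
    Illustrative only; the real DRAM connector exposes a different interface."""

    def __init__(self) -> None:
        self._blocks: Dict[str, torch.Tensor] = {}

    def dump(self, block_hash: str, kv_block: torch.Tensor) -> None:
        # Copy the block out of HBM into host memory; pin it when CUDA is present
        # so the copy back to the device can be asynchronous.
        block = kv_block.detach().to("cpu")
        if torch.cuda.is_available():
            block = block.pin_memory()
        self._blocks[block_hash] = block

    def load(self, block_hash: str, device: str = "cpu") -> Optional[torch.Tensor]:
        # Copy the block back to the requested device only when it is needed.
        block = self._blocks.get(block_hash)
        return None if block is None else block.to(device, non_blocking=True)

store = DramKVStore()
store.dump("blk-0", torch.randn(2, 16, 8, 64))  # arbitrary [kv, tokens, heads, dim] shape
restored = store.load("blk-0")                  # pass device="cuda" on a GPU machine
```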

@@ -1,13 +1,9 @@
# Store
# Prefix Cache

:::{toctree}
:maxdepth: 1
:caption: Index
base
3fs_store
dram_store
vfs_store
nfs_store
nds_store
mooncake_store
:::
@@ -1,4 +1,4 @@
# NFS Connector
# NFS Store

This document provides a usage example and configuration guide for the **NFS Connector**. This connector enables offloading of KV cache from GPU HBM to SSD or Local Disk, helping reduce memory pressure and support larger models or batch sizes.
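
To illustrate the disk-backed path, here is a similarly hedged sketch in which each KV block is serialized to a file under the shared mount point and read back on a cache hit. The mount path, file layout, and method names are assumptions for illustration, not the connector's real configuration.

```python
import os
import torch
from typing import Optional

class FileKVStore:
    """Toy file-backed store: one file per KV block under an NFS or local-SSD mount.
    Illustrative only; the real NFS connector uses its own layout and API."""

    def __init__(self, root: str = "/mnt/ucm_kv_cache") -> None:  # assumed mount point
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, block_hash: str) -> str:
        return os.path.join(self.root, f"{block_hash}.pt")

    def dump(self, block_hash: str, kv_block: torch.Tensor) -> None:
        # Persist the block so it survives beyond HBM capacity (and across restarts).
        torch.save(kv_block.detach().cpu(), self._path(block_hash))

    def load(self, block_hash: str, device: str = "cpu") -> Optional[torch.Tensor]:
        path = self._path(block_hash)
        if not os.path.exists(path):
            return None  # cache miss: the engine recomputes this block
        return torch.load(path, map_location=device)
```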

@@ -3,12 +3,12 @@
Attention mechanisms, especially in LLMs, are often the bottleneck in terms of latency during inference due to their computational complexity. Despite their importance in capturing contextual relationships, traditional attention requires processing all token interactions, leading to significant delays.

<p align="center">
<img alt="UCM" src="../../../images/attention_overhead.png" width="80%">
<img alt="UCM" src="../../images/attention_overhead.png" width="80%">
</p>

Researchers have found that attention in LLMs is highly sparse, with only a small fraction of tokens receiving most of the attention weight:
<p align="center">
<img alt="UCM" src="../../../images/attention_sparsity.png" width="80%">
<img alt="UCM" src="../../images/attention_sparsity.png" width="80%">
</p>

This motivates them to actively develop sparse attention algorithms that address the latency issue. These algorithms reduce the number of token interactions by focusing only on the most relevant parts of the input, thereby lowering computation and memory requirements.
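
As a toy illustration of the top-k flavour of such algorithms (not any specific algorithm shipped with UCM), the snippet below scores every cached key against the current query and runs attention only over the highest-scoring tokens.

```python
import torch

def topk_sparse_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          k_keep: int = 64) -> torch.Tensor:
    """Single-query toy example: attend only over the k_keep highest-scoring keys.
    q: [heads, dim]; k, v: [tokens, heads, dim]. Not a UCM algorithm, just the general idea."""
    scores = torch.einsum("hd,thd->ht", q, k) / q.shape[-1] ** 0.5   # [heads, tokens]
    k_keep = min(k_keep, k.shape[0])
    top_idx = scores.topk(k_keep, dim=-1).indices                    # per-head most relevant tokens
    out = torch.empty_like(q)
    for h in range(q.shape[0]):
        sel = top_idx[h]
        weights = torch.softmax(scores[h, sel], dim=-1)              # softmax over kept tokens only
        out[h] = weights @ v[sel, h, :]                              # weighted sum of kept values
    return out

# Example: 1024 cached tokens, but each head only attends to its 64 most relevant ones.
q, k, v = torch.randn(8, 64), torch.randn(1024, 8, 64), torch.randn(1024, 8, 64)
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([8, 64])
```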
@@ -24,7 +24,7 @@ By utilizing UCM, researchers can efficiently implement rapid prototyping and te
### Overview
The core concept of our UCMSparse attention framework is to offload the complete Key-Value (KV) cache to a dedicated KV cache storage. We then identify the crucial KV pairs relevant to the current context, as determined by our sparse attention algorithms, and selectively load only the necessary portions of the KV cache from storage into High Bandwidth Memory (HBM). This design significantly reduces the HBM footprint while accelerating generation speed.
<p align="center">
<img alt="UCM" src="../../../images/sparse_attn_arch.png" width="80%">
<img alt="UCM" src="../../images/sparse_attn_arch.png" width="80%">
</p>
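
As a hedged sketch of this flow (all names, shapes, and the block-level scoring rule are invented for illustration and are not the real UCMSparse interfaces), a single decode step might look like this: score every offloaded block cheaply, pull only the top few back into fast memory, and run attention over that subset.

```python
import torch

def sparse_decode_step(query: torch.Tensor, kv_store: dict, budget: int = 8, device: str = "cpu"):
    """Illustrative flow only. `kv_store` maps block_hash -> CPU tensor of shape
    [2, block_tokens, heads, dim]; only the blocks judged most relevant to `query`
    ([heads, dim]) are brought back to `device` for the attention computation."""
    # 1. Cheap per-block relevance score (here: query against the block's mean key).
    scores = {h: torch.einsum("hd,hd->", query, blk[0].mean(dim=0)).item()
              for h, blk in kv_store.items()}
    # 2. Fetch only the top-`budget` blocks from storage into fast memory.
    hot = sorted(scores, key=scores.get, reverse=True)[:budget]
    keys = torch.cat([kv_store[h][0] for h in hot]).to(device)      # [tokens, heads, dim]
    values = torch.cat([kv_store[h][1] for h in hot]).to(device)
    # 3. Attention over the retrieved subset only, so the HBM footprint stays bounded
    #    by `budget` blocks regardless of the full context length.
    attn = torch.softmax(
        torch.einsum("hd,thd->ht", query.to(device), keys) / query.shape[-1] ** 0.5, dim=-1)
    return torch.einsum("ht,thd->hd", attn, values)
```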


1 change: 1 addition & 0 deletions docs/source/user-guide/sparse-attention/esa.md
@@ -0,0 +1 @@
# ESA
1 change: 1 addition & 0 deletions docs/source/user-guide/sparse-attention/gsa.md
@@ -0,0 +1 @@
# GSA
@@ -1,4 +1,4 @@
# Sparse
# Sparse Attention

:::{toctree}
:maxdepth: 1
@@ -8,6 +8,4 @@ esa
gsa
kvcomp
kvstar
prefill_offload
cacheblend
:::
1 change: 1 addition & 0 deletions docs/source/user-guide/sparse-attention/kvcomp.md
@@ -0,0 +1 @@
# KVComp
2 changes: 0 additions & 2 deletions docs/source/user_guide/connector_guide/index.md

This file was deleted.

10 changes: 0 additions & 10 deletions docs/source/user_guide/examples/index.md

This file was deleted.

180 changes: 0 additions & 180 deletions docs/source/user_guide/examples/mooncake_conn.md

This file was deleted.
