10 changes: 9 additions & 1 deletion docs/source/about.md
@@ -1 +1,9 @@
# About Us
# About Us

UCM is rooted in KV Cache, with the goal of reducing inference costs and building commercially viable inference
solutions. It enhances throughput through methods such as Prefix Cache, sparsification, and PD Disaggregation.

The UCM team consists of a group of "lazy" people who love simple things and also enjoy "borrowing" the excellent
experiences of others. Adhering to the principle of full openness, we hope everyone will generously share their
insights. We also welcome everyone to learn from these experiences together, engage in discussions, and help us make
progress.
1 change: 0 additions & 1 deletion docs/source/index.md
@@ -39,7 +39,6 @@ getting-started/installation_npu
user-guide/prefix-cache/index
user-guide/sparse-attention/index
user-guide/pd-disaggregation/index
user-guide/engine-integration/index
:::

:::{toctree}
2 changes: 0 additions & 2 deletions docs/source/user-guide/engine-integration/index.md

This file was deleted.

114 changes: 114 additions & 0 deletions docs/source/user-guide/pd-disaggregation/index.md
@@ -1,5 +1,119 @@
# PD Disaggregation

The disaggregation of Prefill and Decode (PD Disaggregation) has essentially become the consensus solution for
deploying large-scale inference clusters, and its advantages are especially pronounced for Mixture of Experts (MoE)
models. PD Disaggregation comprises three core components: independent deployment strategies for Prefill and Decode,
KV cache storage and transmission strategies, and scheduling strategies. Notably, the scheduling strategy depends on
the KV cache storage and transmission strategy. The PD Disaggregation design in UCM therefore focuses primarily on
optimizing KV cache storage and transmission, which in turn enables more rational scheduling strategies.

Prefix Cache has become a standard component of inference systems. With the expanding application scope of large
models, growing sequence lengths, and the increasing adoption of Agent-based applications, its performance benefits
will become even more significant. PD Disaggregation in UCM takes Prefix Cache as a foundational assumption and is
inherently dependent on its functionality.

## Transmission Modes of KV Cache Between Prefill and Decode Nodes

There are roughly three transmission modes for KV cache between Prefill (P) and Decode (D) nodes, each with distinct
characteristics and application scenarios:

1. **Direct Transmission**. KV cache is transmitted directly from the High-Bandwidth Memory (HBM) of the Prefill node
   to the HBM of the Decode node, typically via a high-speed inter-HBM network or a direct pass-through protocol. This
   approach is straightforward and efficient, making it well suited to scenarios with a 1:1 Prefill-to-Decode ratio
   (1P1D) and homogeneous P/D nodes. On the scheduling side, coordination is usually required: Prefill and Decode nodes
   are allocated when a request is initiated so that KV cache can be transmitted layer by layer during the Prefill phase.
2. **Indirect Transmission via DRAM**. The KV cache generated during the Prefill phase is first offloaded to Dynamic
   Random-Access Memory (DRAM). It is then transferred from the Prefill node's DRAM to the Decode node's DRAM, and
   finally loaded from the Decode node's DRAM into its HBM for inference. In this mode, DRAM acts as a logical cache,
   which fits scheduling logic more naturally. Critically, HBM is occupied for the shortest possible duration in the
   entire process, effectively reducing HBM resource consumption.
3. **Transmission via a Unified Storage Pool (Leveraging Prefix Cache Logic)**. This mode fully reuses the Prefix Cache
   logic, with a unified storage pool serving as the intermediate medium for KV cache transmission. Specifically, the
   Prefill node offloads KV cache into the Prefix Cache, while the Decode node runs inference with high hit rates
   against the Prefix Cache. Compared with the first two modes, this approach is the "simplest" in logic and
   implementation and achieves the highest degree of "decoupling" in the entire system, even eliminating the need for a
   strict distinction between Prefill and Decode nodes (see the configuration sketch below).
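
The snippet below is a minimal sketch of how mode 3 can be wired up, reusing the configuration shape visible in
`examples/offline_inference.py` in this PR. It assumes a vLLM build whose `LLM()` accepts `kv_transfer_config`; the
connector name `UnifiedCacheConnectorV1`, the `kv_role` value, and the `storage_backends` key are illustrative
assumptions rather than the authoritative UCM schema.

```python
# Sketch of transmission mode 3: Prefill and Decode instances share one external
# store, so the Decode side simply "hits" KV that the Prefill side offloaded.
# Assumptions: a vLLM version where LLM() accepts kv_transfer_config, a UCM
# connector registered under the hypothetical name "UnifiedCacheConnectorV1",
# and an NFS-backed store configured via a hypothetical "storage_backends" key.
from vllm import LLM
from vllm.config import KVTransferConfig


def build_instance(model: str, shared_store: str) -> LLM:
    ktc = KVTransferConfig(
        kv_connector="UnifiedCacheConnectorV1",  # assumed connector name
        kv_role="kv_both",  # no hard Prefill/Decode distinction is required
        kv_connector_extra_config={
            "ucm_connector_name": "UcmNfsStore",  # shared storage backend
            "ucm_connector_config": {
                "storage_backends": shared_store,  # hypothetical key
            },
        },
    )
    return LLM(model=model, kv_transfer_config=ktc)


# Prefill and Decode pools can be built from the same helper; which phase an
# instance serves becomes purely a scheduling decision.
```

Because both pools read and write the same store, role switching or failover needs no KV hand-off protocol: an instance
that disappears simply stops writing, and its KV blocks remain available to every other instance.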

### Rationale for UCM’s Adoption of the Third Transmission Mode

The "simplicity" and "decoupling" of the third mode are sufficient to make it the preferred choice for UCM. In practical
implementations, additional advantages have been identified:

1. **Complete Decoupling of Prefill and Decode**: This not only simplifies scheduling logic but also greatly streamlines
exception handling.
2. **Full Reuse of Prefix Cache Code**: It serves as a "time-saver" for developers, as no additional PD
Disaggregation-specific logic needs to be added. Consequently, there is no need to address the cumbersome exception
handling issues associated with custom logic.
3. **Unified Storage as Inference Instance State**: This design renders inference instances completely stateless,
significantly enhancing the robustness of the entire system.
4. **Near-Zero-Cost Heterogeneous Inference**: Because Prefill and Decode tasks differ inherently, cost can be
   optimized by selecting different graphics processing units (GPUs), precision levels, instance launch methods, and
   optimization algorithms for each. While direct inter-GPU transmission becomes more complex in heterogeneous
   environments, disaggregating KV cache from computation either natively supports such scenarios or requires only the
   addition of a fully decoupled module. Over time, large inference clusters composed of new and old GPUs with diverse
   architectures will naturally become mainstream.

In large-scale clusters, the direct transmission mode (Mode 1) requires either a full connection between Prefill and
Decode nodes or further division of nodes into smaller groups. This not only increases the complexity of network design
and scheduling but also limits the maximum scalable size of the cluster. In contrast, larger and more unified clusters
are more conducive to improving overall throughput.

## Enhanced Scheduling Flexibility Enabled by PD Disaggregation

The decoupling introduced by PD Disaggregation gives the scheduler greater room for optimization. Key application
scenarios include the following:

**1. Reducing GPU Compute Idle Time and Maximizing Compute Utilization**

- Under data parallelism (DP), the scheduler merges sequences of different lengths to reduce DP-induced idle time,
  migrating tasks midway if necessary.
- The scheduler leverages Chunk Prefill to utilize residual compute resources on Decode instances.
- By default, the scheduler stores KV cache generated from each inference task in a unified external memory. This not
only avoids recomputation in case of exceptions but also maximizes system-wide compute utilization through mid-task
migration.
- The scheduler automatically switches the roles of Prefill and Decode nodes to further exploit underutilized compute
resources.
- When system bandwidth is insufficient, the scheduler triggers additional recomputation to avoid bandwidth bottlenecks
  (see the sketch after this list).
- The scheduler balances the load across all instances, maximizing compute utilization while improving user experience.
- During high-priority task preemption, the scheduler enables seamless migration of existing tasks to new instances.
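
As a concrete illustration of the recomputation trade-off mentioned above, the sketch below compares an estimated
transfer time against an estimated prefill time. The function, parameters, and numbers are illustrative only and are
not part of UCM's actual scheduler.

```python
# Hedged sketch of a "recompute vs. fetch" decision: when the KV-transfer link
# is congested, recomputing a prefix on idle compute may finish sooner than
# waiting for the transfer. All names and numbers are illustrative.
def should_recompute(prefix_tokens: int,
                     kv_bytes_per_token: int,
                     link_bytes_per_s: float,
                     prefill_tokens_per_s: float,
                     queued_transfer_bytes: float) -> bool:
    payload = prefix_tokens * kv_bytes_per_token
    transfer_time = (queued_transfer_bytes + payload) / link_bytes_per_s
    recompute_time = prefix_tokens / prefill_tokens_per_s
    return recompute_time < transfer_time


# Example: a 4k-token prefix behind a congested 10 GB/s link with 5 GB queued.
print(should_recompute(4096, 160 * 1024, 10e9, 20_000, 5e9))  # True
```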

**2. Improving User Experience**

- The scheduler prevents long sequences from delaying short sequences (which would degrade the experience of
short-sequence tasks), thereby improving average Time to First Token (TTFT) and Time per Output Token (TPOT).
- The scheduler uses simple hashing to map requests from the same user to the same instance whenever possible,
  increasing the local KV cache hit rate and reducing both TTFT and bandwidth consumption (a minimal sketch follows
  this list).
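
Below is a minimal sketch of such user-to-instance affinity hashing; the helper name and instance list are illustrative
and not UCM APIs.

```python
# Route requests from the same user to the same instance so locally cached KV
# blocks are more likely to be reused. Uses a stable hash (unlike Python's
# built-in hash(), which is salted per process).
import hashlib


def pick_instance(user_id: str, instances: list[str]) -> str:
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]


instances = ["decode-0:8000", "decode-1:8000", "decode-2:8000"]
print(pick_instance("user-42", instances))  # always the same instance
```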

**3. Enhancing Exception Handling**

- The scheduler implements mechanisms such as retries and checkpoint-based resumption to handle exceptions, preventing
task errors and failures.
- The scheduler itself is designed to hold minimal state and to run with multi-instance redundancy, eliminating single
  points of failure and reducing system-level risk.

## Evolution and Future Outlook of PD Disaggregation

Since its initial proposal, PD Disaggregation has evolved toward greater complexity with the widespread adoption of
DeepSeek's MLA-based MoE models. This evolution has prompted discussions about more granular disaggregation strategies,
such as Attention-FFN (AF) disaggregation and layer-wise disaggregation. It is undeniable that these more complex
approaches can further reduce compute idle time (e.g., idle time caused by DP) and exploit compute resources more fully.

However, it is important to recognize that large-model inference is still in its early stages, and PD Disaggregation
represents only the starting point for the transition toward large-scale distributed inference deployment. As more
application scenarios emerge, there will be an inevitable demand for stricter Service-Level Agreements (SLAs) and more
robust handling of extreme edge cases. Currently, simpler architectural designs (such as the third KV transmission mode
adopted by UCM) can provide greater design margin for more complex and effective solutions in the future. For example,
when implementing checkpoint-based resumption and offline inference, we found that these features could be integrated
into the simple architecture with very little effort.

UCM’s understanding of PD Disaggregation remains rooted in the principles of "simplicity" and "decoupling"—to the extent
that it may even sacrifice a certain degree of performance to preserve these core advantages.

:::{toctree}
:maxdepth: 2
1p1d.md
1 change: 0 additions & 1 deletion docs/source/user-guide/prefix-cache/base.md

This file was deleted.

79 changes: 77 additions & 2 deletions docs/source/user-guide/prefix-cache/index.md
@@ -1,9 +1,84 @@
# Prefix Cache

## Prefix Cache: A Fundamental Acceleration Component for KVCache and Its Architectural Considerations in Large Language Model Inference

As the simplest and most fundamental acceleration feature built on KVCache, Prefix Cache has achieved industry-wide
consensus. With the expanding application scope of large language models (LLMs), the growth of sequence lengths, and
the proliferation of Agent-based applications, its performance gains become even more pronounced. Serving as the
default capability for KVCache applications, Prefix Cache also lays the foundation for UCM's PD Disaggregation. It
likewise imposes the requirement that sparse attention algorithms must support Prefix Cache.

The hit rate of Prefix Cache is its core performance metric, and cache capacity correlates directly and positively with
the hit rate. Taking the publicly released data from DeepSeek and Kimi as examples, a relatively large cache capacity is
required to reach the "hit-rate sweet spot". In terms of input/output (IO) characteristics, Prefix Cache primarily
demands bandwidth-intensive IO, making it well suited to storage on Solid-State Drives (SSDs).
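
For intuition, here is a hedged back-of-the-envelope calculation using a hypothetical 32-layer model with 8 KV heads of
head dimension 128 in FP16; these are illustrative numbers, not the configuration of any specific model.

```python
# KV cache footprint per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(kv_bytes_per_token)                   # 131072 bytes, i.e. 128 KiB per token
print(kv_bytes_per_token * 32_000 / 2**30)  # ~3.9 GiB for a 32k-token prefix
# Reading a few GiB sequentially is a bandwidth-bound workload that commodity
# NVMe SSDs handle well, which is why SSDs are a natural Prefix Cache tier.
```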

Prefix Cache can leverage diverse storage media, including High-Bandwidth Memory (HBM), Dynamic Random-Access Memory
(DRAM), SSDs, and dedicated storage systems (e.g., DeepSeek's 3FS, which DeepSeek uses for KVCache storage). The
fundamental design philosophy is to construct a **multi-level cache** hierarchy spanning HBM, DRAM, local SSDs, and
remote storage; a minimal lookup sketch follows the list below. In practice, implementations of this hierarchy fall
roughly into two architectural directions:

- **Decentralized Architecture**: KVCache is deployed in an isolated manner for each inference instance, with each
  KVCache partition belonging to a distinct inference instance (or server). This distributed KVCache deployment is
  typically paired with upper-layer KVCache-aware affinity scheduling, whose goal is to route inference requests to
  instances with higher KVCache hit rates, thereby maximizing performance.
- **Centralized Architecture**: KVCache is stored in a centralized external storage system and shared across all
  computing nodes. This architecture features inherent simplicity; DeepSeek's 3FS adopts this design paradigm, and the
  Prefix Cache module in UCM also tends to prioritize this centralized approach.
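
The sketch below illustrates the multi-level lookup idea (HBM, then DRAM, then local SSD, then remote storage). The
class and method names are illustrative and do not correspond to UCM's actual connector interface.

```python
# Hedged sketch of a tiered Prefix Cache lookup with promotion on hit.
from typing import Optional


class TieredPrefixCache:
    def __init__(self, tiers: list[dict]):
        # tiers are ordered fastest-first, e.g. [hbm, dram, local_ssd, remote].
        self.tiers = tiers

    def lookup(self, block_hash: str) -> Optional[bytes]:
        for level, tier in enumerate(self.tiers):
            blob = tier.get(block_hash)
            if blob is not None:
                # Promote the hit into faster tiers so later lookups are cheaper.
                for faster in self.tiers[:level]:
                    faster[block_hash] = blob
                return blob
        return None  # full miss: the Prefill phase must recompute this block


cache = TieredPrefixCache([{}, {}, {}, {"abc123": b"kv-block"}])
print(cache.lookup("abc123"))  # fetched from the remote tier and promoted
```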

## Rationale for Adopting DeepSeek’s Centralized Approach Over Dynamo’s Decentralized Design

The decision to adopt DeepSeek’s centralized architecture (rather than Dynamo’s decentralized scheme) is driven by the
following key considerations, which align with UCM’s core design principles:

1. **Adherence to UCM’s First Foundational Principle: "Simplicity"**. A core tenet guiding UCM’s design is "avoiding
   unnecessary investment in features that do not yield substantial benefits". Affinity scheduling, however, is not a
   trivial module to implement. Most decentralized implementations require each inference instance to feed its KVCache
   management status back to the scheduler so that the scheduler can predict hit rates for requests routed to different
   instances. The scheduler must additionally balance these hit rates against the load of each instance, introducing
   significant complexity.

2. **Compliance with UCM’s First Derived Principle: "Decoupling"**. In decentralized architectures, inference instances
   are required to report KVCache status to the scheduler. This breaks the independence of individual instances and
   introduces coupling between upper-layer scheduling and lower-layer inference components, an outcome explicitly
   avoided in UCM’s design. It is worth emphasizing that UCM’s design is governed by only two principles: "Simplicity"
   serves as the only axiom, while "Decoupling" is regarded as the first derived theorem.

3. **Cost-Benefit Analysis: Insufficient Gains to Justify Principle Violations**. UCM’s evaluation indicates that
   decentralized KVCache does not deliver benefits significant enough to offset the cost of violating the "Simplicity"
   and "Decoupling" principles. Its primary purported advantages are reduced KVCache network bandwidth consumption and
   lower latency; yet under the PD-disaggregated architecture this KVCache traffic is unavoidable in any case, and
   compared with the improvement in Time-to-First-Token (TTFT), the latency reduction from decentralization is marginal.

4. **Facilitation of Commercial-Grade Inference Solutions**. Decentralized KVCache introduces additional complexity to
fault tolerance and multi-instance deployment. To advance toward a "commercially viable inference solution", UCM
prioritizes architectures that are structurally simple and robust to anomalies.

5. **Mitigation of Data Silos**. Decentralized KVCache inherently creates data silos: redundant KVCache data accumulates
   across isolated instances, and the limited capacity of each silo constrains the overall Prefix Cache hit rate,
   undermining a key performance objective.

6. **Enhanced Compatibility with PD Separation and Large-Scale Deployment**. The centralized architecture exhibits
   superior compatibility with the PD-disaggregated paradigm and scales better for large inference deployments, a
   critical requirement for industrial-grade LLM applications.

It is important to note that the distinction between decentralized and centralized architectures is not absolute. For
instance, some decentralized implementations integrate remote storage to augment capacity, and UCM similarly leverages
DRAM as a high-speed cache tier. The core difference lies in architectural priority: in decentralized designs, affinity
scheduling is a high-priority requirement (as it directly impacts KVCache hit rates); in centralized designs, however,
affinity scheduling is demoted to a low-priority consideration, affecting only TTFT rather than core hit rate
performance.

:::{toctree}
:maxdepth: 1
:caption: Index
base
dram_store
nfs_store
:::
4 changes: 2 additions & 2 deletions examples/offline_inference.py
@@ -32,8 +32,8 @@ def build_llm_with_uc(module_path: str, name: str, model: str):
kv_connector_extra_config={
"ucm_connector_name": "UcmDramStore",
"ucm_connector_config": {
"max_cache_size": 5368709120,
"kv_block_size": 262144
"max_cache_size": 5368709120,
"kv_block_size": 262144,
},
"ucm_sparse_config": {
"ESA": {
1 change: 1 addition & 0 deletions ucm/store/connector/nfsstore_connector.py
@@ -26,6 +26,7 @@
from typing import Dict, List, Tuple

import torch

from ucm.store.connector import ucmnfsstore
from ucm.store.connector.ucmstore import Task, UcmKVStoreBase

2 changes: 1 addition & 1 deletion ucm/ucm_sparse/esa.py
@@ -22,7 +22,7 @@
UcmSparseMetadata,
UcmSparseRole,
)
from ucm.store.connector.ucmstore import Task, UcmKVStoreBase
from ucm.store.connector.ucmstore import Task, UcmKVStoreBase
from ucm.ucm_sparse.retrieval import retrieval_backend
from ucm.ucm_sparse.retrieval.retrieval_worker import RetrievalWorker

Expand Down