27 changes: 12 additions & 15 deletions docs/source/user-guide/pd-disaggregation/index.md
@@ -3,25 +3,22 @@
The Disaggregation of Prefill and Decode (PD Disaggregation) has basically become a consensus solution for the
deployment of
large-scale inference clusters, and its advantages are even more prominent, especially for Mixture of Experts (MOE)
models. PD Disaggregation mainly includes three core components: independent deployment strategies for Prefill and
Decode,
KV cache storage and transmission strategies, and scheduling strategies. Notably, the scheduling strategy is dependent
on the KV cache storage and transmission strategy. The PD Disaggregation design in the Unified Computing Model (UCM)
focuses
models. PD Disaggregation mainly includes three core components: **independent deployment strategies for Prefill and Decode**,
**KV cache storage and transmission strategies**, and **scheduling strategies**. Notably, the scheduling strategy is dependent
on the KV cache storage and transmission strategy. The PD Disaggregation design in UCM focuses
primarily on optimizing KV cache storage and transmission, thereby enabling more rational scheduling strategies.

Prefix Cache has become a standard component in inference systems. With the expanding application scope of large models,
Prefix Cache has become a standard component in inference systems. With the expanding application scope of large language models (LLMs),
the increase in sequence lengths, and the growing adoption of Agent-based applications, the performance benefits of
Prefix Cache will become even more significant. The PD Disaggregation in UCM takes Prefix Cache as a foundational
assumption
and is inherently dependent on its functionality.
assumption and is inherently dependent on its functionality.

## Transmission Modes of KV Cache Between Prefill and Decode Nodes

There are roughly three transmission modes for KV cache between Prefill (P) and Decode (D) nodes, each with distinct
characteristics and application scenarios:

1. **Direct Transmission**.KV cache is transmitted directly from the High-Bandwidth Memory (HBM) of the Prefill node to
1. **Direct Transmission**. KV cache is transmitted directly from the High-Bandwidth Memory (HBM) of the Prefill node to
the HBM of the Decode node, typically via a high-speed inter-HBM network or a direct pass-through protocol. This
approach is straightforward and efficient, making it highly suitable for scenarios with a 1:1 Prefill-to-Decode
ratio (1P1D) and homogeneous P/D nodes. On the scheduling side, coordination is usually required: Prefill and Decode
@@ -33,9 +30,9 @@ characteristics and application scenarios:
duration in the entire process, effectively reducing HBM resource consumption.
3. **Transmission via Unified Storage Pool (Leveraging Prefix Cache Logic)**. This mode fully utilizes Prefix Cache
logic, with a unified storage pool serving as the intermediate medium for KV cache transmission. Specifically, the
Prefill node offloads KV cache to the Prefix Cache, while the Decode node performs inference with high hit rates on
Prefill node offloads KV cache to the storage, while the Decode node performs inference with high hit rates on
the Prefix Cache. Compared with the first two modes, this approach is the "simplest" in terms of logic and
implementation, and achieves the highest degree of "decoupling" in the entire systemeven eliminating the need for a
implementation, and achieves the highest degree of decoupling in the entire system, even eliminating the need for a
strict distinction between Prefill and Decode nodes.
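
To make this third mode concrete, the following minimal sketch (a toy illustration, not UCM code) uses a plain dictionary as the unified storage pool and hashes of the token prefix as Prefix Cache keys; the Prefill and Decode sides interact only through that store:

```python
# Conceptual sketch only: Prefill and Decode exchange KV cache purely through a
# shared store, never directly with each other. All names are illustrative.
from hashlib import sha256

shared_store: dict[str, bytes] = {}  # stands in for the unified storage pool


def block_hash(prefix_tokens: list[int]) -> str:
    # Prefix-Cache-style key: hash of the token prefix up to this block.
    return sha256(repr(prefix_tokens).encode()).hexdigest()


def prefill_offload(token_ids: list[int], block_size: int = 4) -> None:
    # The Prefill node writes each full KV block to the shared store.
    for end in range(block_size, len(token_ids) + 1, block_size):
        shared_store[block_hash(token_ids[:end])] = f"kv-for-{end}-tokens".encode()


def decode_prefix_hits(token_ids: list[int], block_size: int = 4) -> int:
    # The Decode node only queries the shared store; because Prefill has just
    # written these blocks, the Prefix Cache lookup hits at a high rate.
    hits = 0
    for end in range(block_size, len(token_ids) + 1, block_size):
        hits += block_hash(token_ids[:end]) in shared_store
    return hits


prompt = list(range(16))
prefill_offload(prompt)
print(decode_prefix_hits(prompt))  # 4: every prefix block is served from the pool
```

In a real deployment the dictionary would be a pooled DRAM/SSD store and the values real KV tensors, but the decoupling property is the same: neither side needs to know the other's identity or topology.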

### Rationale for UCM’s Adoption of the Third Transmission Mode
@@ -70,9 +67,9 @@ scenarios include the following:

**1. Reducing GPU Compute Idle Time and Maximizing Compute Utilization**

- Under Dynamic Batching (DP), the scheduler merges sequences of different lengths to reduce idle time caused by DP,
- Under Data Parallelism (DP) in Dynamic Batching, the scheduler merges sequences of different lengths to reduce idle time caused by DP,
with task migration performed midway if necessary.
- The scheduler leverages Chunk Prefill to utilize residual compute resources on Decode instances.
- The scheduler leverages Chunked Prefill to utilize residual compute resources on Decode instances.
- By default, the scheduler stores KV cache generated from each inference task in a unified external memory. This not
only avoids recomputation in case of exceptions but also maximizes system-wide compute utilization through mid-task
migration.
@@ -105,13 +102,13 @@ can further reduce compute idle time (e.g., idle time caused by DP) and fully ex

However, it is important to recognize that large-model inference is still in its early stages, and PD Disaggregation
represents only the starting point for the transition toward large-scale distributed inference deployment. As more
application scenarios emerge, there will be an inevitable demand for stricter Service-Level Agreements (SLAs) and more
application scenarios emerge, there will be an inevitable demand for increasingly strict Service-Level Agreements (SLAs) and more
robust handling of extreme edge cases. Currently, simpler architectural designs (such as the third KV transmission mode
adopted by UCM) can provide greater design redundancy for more complex and effective solutions in the future. For
example, when implementing checkpoint-based resumption and offline inference, it has been found that these
functionalities can be extremely easily integrated into a simple architecture.

UCM’s understanding of PD Disaggregation remains rooted in the principles of "simplicity" and "decoupling"—to the extent
UCM’s understanding of PD Disaggregation remains rooted in the principles of "**simplicity**" and "**decoupling**", to the extent
that it may even sacrifice a certain degree of performance to preserve these core advantages.

:::{toctree}
8 changes: 4 additions & 4 deletions docs/source/user-guide/prefix-cache/dram_store.md
@@ -1,14 +1,14 @@
# DRAM Store

This document provides a usage example and configuration guide for the **DRAM Connector**. This connector enables offloading of KV cache from GPU HBM to CPU DRAM, helping reduce memory pressure and support larger models or batch sizes.
This document provides a usage example and configuration guide for the **DRAM Connector**. This connector enables offloading of KV cache from GPU HBM to CPU DRAM, helping reduce memory pressure and supporting larger models or batch sizes.

## Performance

### Overview
The following are the multi-concurrency performance test results of UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements of UCM on two different models.
During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from DRAM.

In the QwQ-32B model, the test used one H20 server with two GPUs.
In the QwQ-32B model, the test used one H20 server with 2 GPUs.

Here, Full Compute refers to pure VLLM inference, while DRAM80% indicates that after UCM pooling, the DRAM hit rate of the KV cache is 80%.

@@ -42,10 +42,10 @@ To use the DRAM connector, you need to configure the `connector_config` dictiona
### Required Parameters

- `max_cache_size` *(optional)*:
Specifies the maximum allowed DRAM memory usage (in **byte**) for caching in `kv_connector_extra_config["ucm_connector_config"]`.
Specifies the maximum allowed DRAM memory usage (in **bytes**) for caching in `kv_connector_extra_config["ucm_connector_config"]`.
If not provided, it defaults to **5 GB**.
- `kv_block_size` *(optional)*:
Specifies the memory size (in bytes) of a single key or value cache block used in vLLM’s paged attention mechanism, which is calculated as : `block_size * head_size * total_num_kv_heads * element_size`.
Specifies the memory size (in **bytes**) of a single key or value cache block used in vLLM’s paged attention mechanism, which is calculated as: `block_size * head_size * total_num_kv_heads * element_size`.
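
As a worked illustration of this formula (the model dimensions below are assumptions chosen for the arithmetic, not values from any particular model), a sketch of how the two parameters might be computed and passed:

```python
# Hypothetical dimensions, chosen only to show how kv_block_size is derived.
block_size = 16          # tokens per paged-attention block
head_size = 128          # per-head hidden size
total_num_kv_heads = 8   # number of KV heads (e.g. after GQA grouping)
element_size = 2         # bytes per element for fp16/bf16

kv_block_size = block_size * head_size * total_num_kv_heads * element_size
print(kv_block_size)     # 32768 bytes per key (or value) cache block

# Both optional parameters then sit under ucm_connector_config, for example:
ucm_connector_config = {
    "max_cache_size": 5 * 1024**3,  # an illustrative 5 GiB DRAM budget
    "kv_block_size": kv_block_size,
}
```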

### Example:

27 changes: 11 additions & 16 deletions docs/source/user-guide/prefix-cache/index.md
@@ -8,26 +8,22 @@ proliferation of Agent-based applications, the performance gains of Prefix Cache
the default capability for KVCache applications, Prefix Cache also lays the foundation for the PD disaggregation by UCM.
Concurrently, it imposes a requirement that sparse algorithms must support Prefix Cache.

The hit rate of Prefix Cache is its core performance metric, and there exists a direct positive correlation between
The core performance metric of Prefix Cache is the hit rate, and there exists a direct positive correlation between
cache capacity and hit rate. Taking the publicly released data from DeepSeek and Kimi as examples, a relatively large
cache capacity is required to reach the "hit rate sweet spot". In terms of input/output (IO) characteristics, Prefix
Cache primarily demands bandwidth-intensive IO, making it well-suited for storage on Solid-State Drives (SSDs).

Prefix Cache can leverage diverse storage media, including High-Bandwidth Memory (HBM), Dynamic Random-Access Memory (
DRAM), SSDs, and dedicated storage systems (e.g., DeepSeek’s 3fs, a storage system specifically developed for KVCache).
The fundamental design philosophy involves constructing a **multi-level cache** hierarchy using HBM, DRAM, local SSDs,
and
remote storage. In practice, the implementation of this hierarchy can be roughly categorized into two architectural
and remote storage. In practice, the implementation of this hierarchy can be roughly categorized into two architectural
directions:

- **Decentralized Architecture**: KVCache is deployed in an isolated manner for each inference instance, with each
KVCache
partition belonging to a distinct inference instance (or server). This distributed KVCache deployment is typically
paired with upper-layer KVCache-aware affinity scheduling. The goal of such scheduling is to route inference requests
to instances with higher KVCache hit rates, thereby maximizing performance.
Centralized Architecture: KVCache is stored in a centralized external storage system and shared across all computing
nodes. This architecture features inherent simplicity; DeepSeek’s 3fs adopts this design paradigm, and the Prefix
Cache module in UCM also tends to prioritize this centralized approach.
to instances with higher KVCache hit rates, thereby maximizing overall system performance.
- **Centralized Architecture**: KVCache is stored in a centralized external storage system and shared across all
computing
nodes. This architecture features inherent simplicity; DeepSeek’s 3fs adopts this design paradigm, and the Prefix
@@ -38,10 +34,10 @@ directions:
The decision to adopt DeepSeek’s centralized architecture (rather than Dynamo’s decentralized scheme) is driven by the
following key considerations, which align with UCM’s core design principles:

1. **Adherence to UCM’s First Foundational Principle: "Simplicity"**.A core tenet guiding UCM’s design is "avoiding
1. **Adherence to UCM’s First Foundational Principle: "Simplicity"**. A core tenet guiding UCM’s design is "avoiding
unnecessary investments in features that do not yield substantial benefits". Affinity scheduling, however, is not a
trivial module to implement. Most decentralized implementations require each inference instance to feed back KVCache
management status to the scheduler—enabling the scheduler to predict hit rates for requests routed to different
management status to the scheduler so that it can predict hit rates for requests routed to different
instances. Additionally, the scheduler must balance these hit rates against the load of each instance, introducing
significant complexity.

@@ -54,20 +50,19 @@ following key considerations, which align with UCM’s core design principles:
3. **Cost-Benefit Analysis: Insufficient Gains to Justify Principle Violations**. UCM’s evaluation indicates that
decentralized KVCache does not deliver benefits significant enough to offset the trade-offs of violating the "
Simplicity" and "Decoupling" principles. The primary purported advantages of decentralized KVCache—reduced KVCache
network bandwidth consumption and lower latency—are unavoidable under the PD-separated architecture. Moreover, when
network bandwidth consumption and lower latency—are hard to achieve in practice under the PD-disaggregated architecture. Moreover, when
compared to improvements in Time-to-First-Token (TTFT), the latency reduction benefits of decentralization are
marginal.

4. **Facilitation of Commercial-Grade Inference Solutions**. Decentralized KVCache introduces additional complexity to
fault tolerance and multi-instance deployment. To advance toward a "commercially viable inference solution", UCM
4. **Facilitation of Commercial-Grade Inference Solutions**. Decentralized KVCache introduces additional complexity in achieving fault tolerance and supporting multi-instance deployment. To advance toward a "commercially viable inference solution", UCM
prioritizes architectures that are structurally simple and robust to anomalies.

5. **Mitigation of Data Silos**.Decentralized KVCache inherently creates data silos: redundant KVCache data accumulates
5. **Mitigation of Data Silos**. Decentralized KVCache inherently creates data silos: redundant KVCache data accumulates
across isolated instances, and the limited capacity of individual silos constrains the overall Prefix Cache hit
rateundermining a key performance objective.
rate, undermining a key performance objective.

6. **Enhanced Compatibility with PD Separation and Large-Scale Deployment**.The centralized architecture exhibits
superior compatibility with the PD-separated paradigm and is more scalable for large-scale inference deployments, a
6. **Enhanced Compatibility with PD Disaggregation and Large-Scale Deployment**. The centralized architecture exhibits
superior compatibility with the PD-disaggregated paradigm and is more scalable for large-scale inference deployments, a
critical requirement for industrial-grade LLM applications.

It is important to note that the distinction between decentralized and centralized architectures is not absolute. For
8 changes: 4 additions & 4 deletions docs/source/user-guide/prefix-cache/nfs_store.md
@@ -8,8 +8,8 @@ This document provides a usage example and configuration guide for the **NFS Con
The following are the multi-concurrency performance test results of UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements of UCM on two different models.
During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from SSD.

In the QwQ-32B model, the test used one H20 server with two GPUs.
In the DeepSeek-V3 model, the test used two H20 servers with sixteen GPUs.
In the QwQ-32B model, the test used one H20 server with 2 GPUs.
In the DeepSeek-V3 model, the test used two H20 servers with 16 GPUs.

Here, Full Compute refers to pure VLLM inference, while Disk80% indicates that after UCM pooling, the SSD hit rate of the KV cache is 80%.

@@ -176,6 +176,6 @@ To quickly experience the NFS Connector's effect:
|--------------|-----------------------------------------------------------------------------|
| `task_id` | Unique identifier for the task |
| `direction` | `D2S`: Dump to Storage (Device → SSD)<br>`S2D`: Load from Storage (SSD → Device) |
| `task_count` | Number of tasks executed in this operation |
| `size` | Total size of data transferred in bytes (across all tasks) |
| `task_count` | Number of tasks executed in this operation |
| `size` | Total size of data transferred in **bytes** (across all tasks) |
| `time` | Time taken for the complete operation in seconds |
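
Because `size` is reported in bytes and `time` in seconds, these fields are enough to derive the effective bandwidth of a dump or load. The record below is invented purely for illustration; it is not actual NFS Connector output:

```python
# Derive transfer bandwidth from the reported fields (made-up sample record).
record = {
    "task_id": "sample-task",
    "direction": "D2S",   # Device -> SSD dump
    "task_count": 4,
    "size": 2 * 1024**3,  # bytes transferred across all tasks
    "time": 1.6,          # seconds for the complete operation
}

gib_per_s = record["size"] / record["time"] / 1024**3
print(f'{record["direction"]} bandwidth: {gib_per_s:.2f} GiB/s')  # 1.25 GiB/s
```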
6 changes: 3 additions & 3 deletions docs/source/user-guide/sparse-attention/index.md
@@ -27,9 +27,9 @@ The core concept of our UCMSparse attention framework is to offload the complete
- UCMSparse in model_runner: this instance lives in the same process as the `Worker`.
A typical sparse attention algorithm works like this:
1. In prefill, it dumps full KV Cache from HBM to storage.
2. In decode, it retrieves the most relevant blocks based on the context and loads the blocks from store to HBM.
3. In decoode, it also dumps new generated blocks to keep the latest context accessible.
- By fine-grained task scheduling, retrieval and loading can be executed asynchronously and overlap with the model execution. Therefore no overhead is introduced by UCMSparse and generation speed is boosted benefitted by less computational load and fewer memory accesses.
2. In decode, it retrieves the most relevant blocks based on the context and loads the blocks from storage to HBM.
3. In decode, it also dumps new generated blocks to keep the latest context accessible.
- By fine-grained task scheduling, retrieval and loading can be executed asynchronously and overlap with the model execution. Therefore, no overhead is introduced by UCMSparse, and generation speed is boosted by the lower computational load and fewer memory accesses.
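
A minimal sketch of the dump/retrieve/load cycle described above; the class and method names are illustrative assumptions, not the actual UCMSparse interfaces:

```python
# Illustrative sketch only: not the real UCMSparse API.
from typing import Any, Callable


class SparseKVSketch:
    def __init__(self, score: Callable[[Any, Any], float]):
        self.store: dict[tuple[str, int], Any] = {}  # external KV-cache storage
        self.score = score  # relevance of a stored block to the current query

    def on_prefill(self, seq_id: str, kv_blocks: list) -> None:
        # 1. Prefill: dump the full KV cache from HBM to storage.
        for idx, blk in enumerate(kv_blocks):
            self.store[(seq_id, idx)] = blk

    def on_decode_step(self, seq_id: str, query: Any, top_k: int, new_block: Any = None) -> list:
        # 3. Decode: dump the newly generated block so the latest context
        #    stays accessible for later retrieval steps.
        if new_block is not None:
            next_idx = sum(1 for key in self.store if key[0] == seq_id)
            self.store[(seq_id, next_idx)] = new_block
        # 2. Decode: retrieve the most relevant blocks for this context and
        #    load them from storage back to HBM (here, simply return them).
        keys = [key for key in self.store if key[0] == seq_id]
        keys.sort(key=lambda key: self.score(query, self.store[key]), reverse=True)
        return [self.store[key] for key in keys[:top_k]]
```

In the real integration these steps run as fine-grained asynchronous tasks so that retrieval and loading overlap with model execution, which is what keeps the extra storage traffic off the critical path.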


See `ESA` for more details.
14 changes: 10 additions & 4 deletions examples/offline_inference_kvstar.py
@@ -23,6 +23,7 @@ def setup_environment_variables():
os.environ["PYTHONHASHSEED"] = "123456"
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"


@contextlib.contextmanager
def build_llm_with_uc(module_path: str, name: str, model: str):
ktc = KVTransferConfig(
@@ -41,8 +42,8 @@ def build_llm_with_uc(module_path: str, name: str, model: str):
"local_window_sz": 2,
"sparse_ratio": 0.25,
"retrieval_stride": 8,
"blk_repre_dim_prune_ratio": 0.25, # 块表征维度裁剪
"blk_repre_inner_token_merge": 2 # 块内几个token融合成一个表征
"blk_repre_dim_prune_ratio": 0.25, # 块表征维度裁剪
"blk_repre_inner_token_merge": 2, # 块内几个token融合成一个表征
}
},
},
@@ -162,8 +163,13 @@ def main():

sampling_params = SamplingParams(temperature=0, top_k=1, max_tokens=300)

print_output(llm, prompts_prefill_more_than_2_full_blk, sampling_params, "first")
print_output(llm, prompts_prefill_more_than_2_full_blk, sampling_params, "second")
print_output(
llm, prompts_prefill_more_than_2_full_blk, sampling_params, "first"
)
print_output(
llm, prompts_prefill_more_than_2_full_blk, sampling_params, "second"
)


if __name__ == "__main__":
main()
4 changes: 3 additions & 1 deletion ucm/integration/vllm/ucm_sparse/factory.py
@@ -49,4 +49,6 @@ def create_sparse_method(
"KvComp", "ucm.sandbox.sparse.kvcomp.kvcomp", "KvComp"
)
UcmSparseFactory.register_sparse_method("GSA", "ucm.ucm_sparse.gsa", "GSA")
UcmSparseFactory.register_sparse_method("KVStarMultiStep", "ucm.ucm_sparse.kvstar.multistep", "KVStarMultiStep")
UcmSparseFactory.register_sparse_method(
"KVStarMultiStep", "ucm.ucm_sparse.kvstar.multistep", "KVStarMultiStep"
)