10 changes: 9 additions & 1 deletion docs/source/about.md
@@ -1 +1,9 @@
# About Us
# About Us

UCM is rooted in KV Cache, with the goal of reducing inference costs and building commercially viable inference
solutions. It enhances throughput through methods such as Prefix Cache, sparsification, and PD Disaggregation.

The UCM team consists of a group of "lazy" people who love simple things and also enjoy "borrowing" the excellent
experiences of others. Adhering to the principle of full openness, we hope everyone will generously share their
insights. We also welcome everyone to learn from these experiences together, engage in discussions, and help us make
progress.
1 change: 0 additions & 1 deletion docs/source/index.md
@@ -39,7 +39,6 @@ getting-started/installation_npu
user-guide/prefix-cache/index
user-guide/sparse-attention/index
user-guide/pd-disaggregation/index
user-guide/engine-integration/index
:::

:::{toctree}
2 changes: 0 additions & 2 deletions docs/source/user-guide/engine-integration/index.md

This file was deleted.

114 changes: 114 additions & 0 deletions docs/source/user-guide/pd-disaggregation/index.md
@@ -1,5 +1,119 @@
# PD Disaggregation

The disaggregation of Prefill and Decode (PD Disaggregation) has essentially become the consensus solution for
deploying large-scale inference clusters, and its advantages are especially pronounced for Mixture of Experts (MoE)
models. PD Disaggregation comprises three core components: independent deployment strategies for Prefill and Decode,
KV cache storage and transmission strategies, and scheduling strategies. Notably, the scheduling strategy depends on
the KV cache storage and transmission strategy. The PD Disaggregation design in UCM therefore focuses primarily on
optimizing KV cache storage and transmission, which in turn enables more rational scheduling strategies.

Prefix Cache has become a standard component of inference systems. With the expanding application scope of large
models, growing sequence lengths, and the increasing adoption of Agent-based applications, its performance benefits
will become even more significant. PD Disaggregation in UCM takes Prefix Cache as a foundational assumption and is
inherently dependent on its functionality.

## Transmission Modes of KV Cache Between Prefill and Decode Nodes

There are roughly three transmission modes for KV cache between Prefill (P) and Decode (D) nodes, each with distinct
characteristics and application scenarios:

1. **Direct Transmission**. KV cache is transmitted directly from the High-Bandwidth Memory (HBM) of the Prefill node
   to the HBM of the Decode node, typically via a high-speed inter-HBM network or a direct pass-through protocol. This
   approach is straightforward and efficient, making it well suited to scenarios with a 1:1 Prefill-to-Decode ratio
   (1P1D) and homogeneous P/D nodes. On the scheduling side, coordination is usually required: Prefill and Decode nodes
   are allocated when a request is initiated so that KV cache can be transmitted layer by layer during the Prefill phase.
2. **Indirect Transmission via DRAM**. The KV cache generated during the Prefill phase is first offloaded to Dynamic
   Random-Access Memory (DRAM). It is then transferred from the Prefill node's DRAM to the Decode node's DRAM, and
   finally loaded from the Decode node's DRAM into its HBM for inference. In this mode, DRAM acts as a logical cache,
   which fits scheduling logic more naturally. Critically, HBM is occupied for the shortest possible duration in the
   entire process, effectively reducing HBM resource consumption.
3. **Transmission via a Unified Storage Pool (Leveraging Prefix Cache Logic)**. This mode fully reuses the Prefix Cache
   logic, with a unified storage pool serving as the intermediate medium for KV cache transmission. Specifically, the
   Prefill node offloads KV cache into the Prefix Cache, while the Decode node runs inference with high hit rates
   against the Prefix Cache. Compared with the first two modes, this approach is the "simplest" in logic and
   implementation and achieves the highest degree of "decoupling" in the entire system, even eliminating the need for a
   strict distinction between Prefill and Decode nodes (see the configuration sketch below).
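
The snippet below is a minimal sketch of how mode 3 can be wired up, reusing the configuration shape visible in
`examples/offline_inference.py` in this PR. It assumes a vLLM build whose `LLM()` accepts `kv_transfer_config`; the
connector name `UnifiedCacheConnectorV1`, the `kv_role` value, and the `storage_backends` key are illustrative
assumptions rather than the authoritative UCM schema.

```python
# Sketch of transmission mode 3: Prefill and Decode instances share one external
# store, so the Decode side simply "hits" KV that the Prefill side offloaded.
# Assumptions: a vLLM version where LLM() accepts kv_transfer_config, a UCM
# connector registered under the hypothetical name "UnifiedCacheConnectorV1",
# and an NFS-backed store configured via a hypothetical "storage_backends" key.
from vllm import LLM
from vllm.config import KVTransferConfig


def build_instance(model: str, shared_store: str) -> LLM:
    ktc = KVTransferConfig(
        kv_connector="UnifiedCacheConnectorV1",  # assumed connector name
        kv_role="kv_both",  # no hard Prefill/Decode distinction is required
        kv_connector_extra_config={
            "ucm_connector_name": "UcmNfsStore",  # shared storage backend
            "ucm_connector_config": {
                "storage_backends": shared_store,  # hypothetical key
            },
        },
    )
    return LLM(model=model, kv_transfer_config=ktc)


# Prefill and Decode pools can be built from the same helper; which phase an
# instance serves becomes purely a scheduling decision.
```

Because both pools read and write the same store, role switching or failover needs no KV hand-off protocol: an instance
that disappears simply stops writing, and its KV blocks remain available to every other instance.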

### Rationale for UCM’s Adoption of the Third Transmission Mode

The "simplicity" and "decoupling" of the third mode are sufficient to make it the preferred choice for UCM. In practical
implementations, additional advantages have been identified:

1. **Complete Decoupling of Prefill and Decode**: This not only simplifies scheduling logic but also greatly streamlines
exception handling.
2. **Full Reuse of Prefix Cache Code**: It serves as a "time-saver" for developers, as no additional PD
Disaggregation-specific logic needs to be added. Consequently, there is no need to address the cumbersome exception
handling issues associated with custom logic.
3. **Unified Storage as Inference Instance State**: This design renders inference instances completely stateless,
significantly enhancing the robustness of the entire system.
4. **Near-Zero-Cost Heterogeneous Inference**: Because Prefill and Decode tasks differ inherently, cost can be
   optimized by selecting different graphics processing units (GPUs), precision levels, instance launch methods, and
   optimization algorithms for each. While direct inter-GPU transmission becomes more complex in heterogeneous
   environments, disaggregating KV cache from computation either natively supports such scenarios or requires only the
   addition of a fully decoupled module. Over time, large inference clusters composed of new and old GPUs with diverse
   architectures will naturally become mainstream.

In large-scale clusters, the direct transmission mode (Mode 1) requires either a full connection between Prefill and
Decode nodes or further division of nodes into smaller groups. This not only increases the complexity of network design
and scheduling but also limits the maximum scalable size of the cluster. In contrast, larger and more unified clusters
are more conducive to improving overall throughput.

## Enhanced Scheduling Flexibility Enabled by PD Disaggregation

The decoupling introduced by PD Disaggregation gives the scheduler greater room for optimization. Key application
scenarios include the following:

**1. Reducing GPU Compute Idle Time and Maximizing Compute Utilization**

- Under data parallelism (DP), the scheduler merges sequences of different lengths to reduce DP-induced idle time,
  migrating tasks midway if necessary.
- The scheduler leverages Chunk Prefill to utilize residual compute resources on Decode instances.
- By default, the scheduler stores KV cache generated from each inference task in a unified external memory. This not
only avoids recomputation in case of exceptions but also maximizes system-wide compute utilization through mid-task
migration.
- The scheduler automatically switches the roles of Prefill and Decode nodes to further exploit underutilized compute
resources.
- When system bandwidth is insufficient, the scheduler triggers additional recomputation to avoid bandwidth bottlenecks
  (see the sketch after this list).
- The scheduler balances the load across all instances, maximizing compute utilization while improving user experience.
- During high-priority task preemption, the scheduler enables seamless migration of existing tasks to new instances.
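
As a concrete illustration of the recomputation trade-off mentioned above, the sketch below compares an estimated
transfer time against an estimated prefill time. The function, parameters, and numbers are illustrative only and are
not part of UCM's actual scheduler.

```python
# Hedged sketch of a "recompute vs. fetch" decision: when the KV-transfer link
# is congested, recomputing a prefix on idle compute may finish sooner than
# waiting for the transfer. All names and numbers are illustrative.
def should_recompute(prefix_tokens: int,
                     kv_bytes_per_token: int,
                     link_bytes_per_s: float,
                     prefill_tokens_per_s: float,
                     queued_transfer_bytes: float) -> bool:
    payload = prefix_tokens * kv_bytes_per_token
    transfer_time = (queued_transfer_bytes + payload) / link_bytes_per_s
    recompute_time = prefix_tokens / prefill_tokens_per_s
    return recompute_time < transfer_time


# Example: a 4k-token prefix behind a congested 10 GB/s link with 5 GB queued.
print(should_recompute(4096, 160 * 1024, 10e9, 20_000, 5e9))  # True
```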

**2. Improving User Experience**

- The scheduler prevents long sequences from delaying short sequences (which would degrade the experience of
short-sequence tasks), thereby improving average Time to First Token (TTFT) and Time per Output Token (TPOT).
- The scheduler uses simple hashing to map requests from the same user to the same instance whenever possible,
  increasing the local KV cache hit rate and reducing both TTFT and bandwidth consumption (a minimal sketch follows
  this list).
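
Below is a minimal sketch of such user-to-instance affinity hashing; the helper name and instance list are illustrative
and not UCM APIs.

```python
# Route requests from the same user to the same instance so locally cached KV
# blocks are more likely to be reused. Uses a stable hash (unlike Python's
# built-in hash(), which is salted per process).
import hashlib


def pick_instance(user_id: str, instances: list[str]) -> str:
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]


instances = ["decode-0:8000", "decode-1:8000", "decode-2:8000"]
print(pick_instance("user-42", instances))  # always the same instance
```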

**3. Enhancing Exception Handling**

- The scheduler implements mechanisms such as retries and checkpoint-based resumption to handle exceptions, preventing
task errors and failures.
- The scheduler itself is designed to hold minimal state and to run with multi-instance redundancy, eliminating single
  points of failure and reducing system-level risk.

## Evolution and Future Outlook of PD Disaggregation

Since its initial proposal, PD Disaggregation has evolved toward greater complexity with the widespread adoption of
DeepSeek's MLA-based MoE models. This evolution has prompted discussions about more granular disaggregation strategies,
such as Attention-FFN (AF) disaggregation and layer-wise disaggregation. It is undeniable that these more complex
approaches can further reduce compute idle time (e.g., idle time caused by DP) and exploit compute resources more fully.

However, it is important to recognize that large-model inference is still in its early stages, and PD Disaggregation
represents only the starting point for the transition toward large-scale distributed inference deployment. As more
application scenarios emerge, there will be an inevitable demand for stricter Service-Level Agreements (SLAs) and more
robust handling of extreme edge cases. Currently, simpler architectural designs (such as the third KV transmission mode
adopted by UCM) can provide greater design margin for more complex and effective solutions in the future. For example,
when implementing checkpoint-based resumption and offline inference, we found that these features could be integrated
into the simple architecture with very little effort.

UCM’s understanding of PD Disaggregation remains rooted in the principles of "simplicity" and "decoupling"—to the extent
that it may even sacrifice a certain degree of performance to preserve these core advantages.

:::{toctree}
:maxdepth: 2
1p1d.md
1 change: 0 additions & 1 deletion docs/source/user-guide/prefix-cache/base.md

This file was deleted.

79 changes: 77 additions & 2 deletions docs/source/user-guide/prefix-cache/index.md
@@ -1,9 +1,84 @@
# Prefix Cache

## Prefix Cache: A Fundamental Acceleration Component for KVCache and Its Architectural Considerations in Large Language Model Inference

As the simplest and most fundamental acceleration feature built on KVCache, Prefix Cache has achieved industry-wide
consensus. With the expanding application scope of large language models (LLMs), the growth of sequence lengths, and
the proliferation of Agent-based applications, its performance gains become even more pronounced. Serving as the
default capability for KVCache applications, Prefix Cache also lays the foundation for UCM's PD Disaggregation. It
likewise imposes the requirement that sparse attention algorithms must support Prefix Cache.

The hit rate of Prefix Cache is its core performance metric, and cache capacity correlates directly and positively with
the hit rate. Taking the publicly released data from DeepSeek and Kimi as examples, a relatively large cache capacity is
required to reach the "hit-rate sweet spot". In terms of input/output (IO) characteristics, Prefix Cache primarily
demands bandwidth-intensive IO, making it well suited to storage on Solid-State Drives (SSDs).
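
For intuition, here is a hedged back-of-the-envelope calculation using a hypothetical 32-layer model with 8 KV heads of
head dimension 128 in FP16; these are illustrative numbers, not the configuration of any specific model.

```python
# KV cache footprint per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(kv_bytes_per_token)                   # 131072 bytes, i.e. 128 KiB per token
print(kv_bytes_per_token * 32_000 / 2**30)  # ~3.9 GiB for a 32k-token prefix
# Reading a few GiB sequentially is a bandwidth-bound workload that commodity
# NVMe SSDs handle well, which is why SSDs are a natural Prefix Cache tier.
```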

Prefix Cache can leverage diverse storage media, including High-Bandwidth Memory (HBM), Dynamic Random-Access Memory
(DRAM), SSDs, and dedicated storage systems (e.g., DeepSeek's 3FS, which DeepSeek uses for KVCache storage). The
fundamental design philosophy is to construct a **multi-level cache** hierarchy spanning HBM, DRAM, local SSDs, and
remote storage; a minimal lookup sketch follows the list below. In practice, implementations of this hierarchy fall
roughly into two architectural directions:

- **Decentralized Architecture**: KVCache is deployed in an isolated manner for each inference instance, with each
  KVCache partition belonging to a distinct inference instance (or server). This distributed KVCache deployment is
  typically paired with upper-layer KVCache-aware affinity scheduling, whose goal is to route inference requests to
  instances with higher KVCache hit rates, thereby maximizing performance.
- **Centralized Architecture**: KVCache is stored in a centralized external storage system and shared across all
  computing nodes. This architecture features inherent simplicity; DeepSeek's 3FS adopts this design paradigm, and the
  Prefix Cache module in UCM also tends to prioritize this centralized approach.
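
The sketch below illustrates the multi-level lookup idea (HBM, then DRAM, then local SSD, then remote storage). The
class and method names are illustrative and do not correspond to UCM's actual connector interface.

```python
# Hedged sketch of a tiered Prefix Cache lookup with promotion on hit.
from typing import Optional


class TieredPrefixCache:
    def __init__(self, tiers: list[dict]):
        # tiers are ordered fastest-first, e.g. [hbm, dram, local_ssd, remote].
        self.tiers = tiers

    def lookup(self, block_hash: str) -> Optional[bytes]:
        for level, tier in enumerate(self.tiers):
            blob = tier.get(block_hash)
            if blob is not None:
                # Promote the hit into faster tiers so later lookups are cheaper.
                for faster in self.tiers[:level]:
                    faster[block_hash] = blob
                return blob
        return None  # full miss: the Prefill phase must recompute this block


cache = TieredPrefixCache([{}, {}, {}, {"abc123": b"kv-block"}])
print(cache.lookup("abc123"))  # fetched from the remote tier and promoted
```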

## Rationale for Adopting DeepSeek’s Centralized Approach Over Dynamo’s Decentralized Design

The decision to adopt DeepSeek’s centralized architecture (rather than Dynamo’s decentralized scheme) is driven by the
following key considerations, which align with UCM’s core design principles:

1. **Adherence to UCM’s First Foundational Principle: "Simplicity"**. A core tenet guiding UCM’s design is "avoiding
   unnecessary investment in features that do not yield substantial benefits". Affinity scheduling, however, is not a
   trivial module to implement. Most decentralized implementations require each inference instance to feed its KVCache
   management status back to the scheduler so that the scheduler can predict hit rates for requests routed to different
   instances. The scheduler must additionally balance these hit rates against the load of each instance, introducing
   significant complexity.

2. **Compliance with UCM’s First Derived Principle: "Decoupling"**. In decentralized architectures, inference instances
   are required to report KVCache status to the scheduler. This breaks the independence of individual instances and
   introduces coupling between upper-layer scheduling and lower-layer inference components, an outcome explicitly
   avoided in UCM’s design. It is worth emphasizing that UCM’s design is governed by only two principles: "Simplicity"
   serves as the only axiom, while "Decoupling" is regarded as the first derived theorem.

3. **Cost-Benefit Analysis: Insufficient Gains to Justify Principle Violations**. UCM’s evaluation indicates that
   decentralized KVCache does not deliver benefits significant enough to offset the cost of violating the "Simplicity"
   and "Decoupling" principles. Its primary purported advantages are reduced KVCache network bandwidth consumption and
   lower latency; yet under the PD-disaggregated architecture this KVCache traffic is unavoidable in any case, and
   compared with the improvement in Time-to-First-Token (TTFT), the latency reduction from decentralization is marginal.

4. **Facilitation of Commercial-Grade Inference Solutions**. Decentralized KVCache introduces additional complexity to
fault tolerance and multi-instance deployment. To advance toward a "commercially viable inference solution", UCM
prioritizes architectures that are structurally simple and robust to anomalies.

5. **Mitigation of Data Silos**. Decentralized KVCache inherently creates data silos: redundant KVCache data accumulates
   across isolated instances, and the limited capacity of each silo constrains the overall Prefix Cache hit rate,
   undermining a key performance objective.

6. **Enhanced Compatibility with PD Separation and Large-Scale Deployment**. The centralized architecture exhibits
   superior compatibility with the PD-disaggregated paradigm and scales better for large inference deployments, a
   critical requirement for industrial-grade LLM applications.

It is important to note that the distinction between decentralized and centralized architectures is not absolute. For
instance, some decentralized implementations integrate remote storage to augment capacity, and UCM similarly leverages
DRAM as a high-speed cache tier. The core difference lies in architectural priority: in decentralized designs, affinity
scheduling is a high-priority requirement (as it directly impacts KVCache hit rates); in centralized designs, however,
affinity scheduling is demoted to a low-priority consideration, affecting only TTFT rather than core hit rate
performance.

:::{toctree}
:maxdepth: 1
:caption: Index
base
dram_store
nfs_store
:::
4 changes: 2 additions & 2 deletions examples/offline_inference.py
@@ -32,8 +32,8 @@ def build_llm_with_uc(module_path: str, name: str, model: str):
kv_connector_extra_config={
"ucm_connector_name": "UcmDramStore",
"ucm_connector_config": {
"max_cache_size": 5368709120,
"kv_block_size": 262144
"max_cache_size": 5368709120,
"kv_block_size": 262144,
},
"ucm_sparse_config": {
"ESA": {
1 change: 1 addition & 0 deletions ucm/store/connector/nfsstore_connector.py
@@ -26,6 +26,7 @@
from typing import Dict, List, Tuple

import torch

from ucm.store.connector import ucmnfsstore
from ucm.store.connector.ucmstore import Task, UcmKVStoreBase

2 changes: 1 addition & 1 deletion ucm/ucm_sparse/esa.py
@@ -22,7 +22,7 @@
UcmSparseMetadata,
UcmSparseRole,
)
from ucm.store.connector.ucmstore import Task, UcmKVStoreBase
from ucm.store.connector.ucmstore import Task, UcmKVStoreBase
from ucm.ucm_sparse.retrieval import retrieval_backend
from ucm.ucm_sparse.retrieval.retrieval_worker import RetrievalWorker

Expand Down