27 changes: 12 additions & 15 deletions docs/source/user-guide/pd-disaggregation/index.md
@@ -3,25 +3,22 @@
The Disaggregation of Prefill and Decode (PD Disaggregation) has basically become a consensus solution for the
deployment of
large-scale inference clusters, and its advantages are even more prominent, especially for Mixture of Experts (MOE)
models. PD Disaggregation mainly includes three core components: independent deployment strategies for Prefill and
Decode,
KV cache storage and transmission strategies, and scheduling strategies. Notably, the scheduling strategy is dependent
on the KV cache storage and transmission strategy. The PD Disaggregation design in the Unified Computing Model (UCM)
focuses
models. PD Disaggregation mainly includes three core components: **independent deployment strategies for Prefill and Decode**,
**KV cache storage and transmission strategies**, and **scheduling strategies**. Notably, the scheduling strategy is dependent
on the KV cache storage and transmission strategy. The PD Disaggregation design in UCM focuses
primarily on optimizing KV cache storage and transmission, thereby enabling more rational scheduling strategies.

Prefix Cache has become a standard component in inference systems. With the expanding application scope of large models,
Prefix Cache has become a standard component in inference systems. With the expanding application scope of large language models (LLMs),
the increase in sequence lengths, and the growing adoption of Agent-based applications, the performance benefits of
Prefix Cache will become even more significant. The PD Disaggregation in UCM takes Prefix Cache as a foundational
assumption
and is inherently dependent on its functionality.
assumption and is inherently dependent on its functionality.

## Transmission Modes of KV Cache Between Prefill and Decode Nodes

There are roughly three transmission modes for KV cache between Prefill (P) and Decode (D) nodes, each with distinct
characteristics and application scenarios:

1. **Direct Transmission**.KV cache is transmitted directly from the High-Bandwidth Memory (HBM) of the Prefill node to
1. **Direct Transmission**. KV cache is transmitted directly from the High-Bandwidth Memory (HBM) of the Prefill node to
the HBM of the Decode node, typically via a high-speed inter-HBM network or a direct pass-through protocol. This
approach is straightforward and efficient, making it highly suitable for scenarios with a 1:1 Prefill-to-Decode
ratio (1P1D) and homogeneous P/D nodes. On the scheduling side, coordination is usually required: Prefill and Decode
@@ -33,9 +30,9 @@ characteristics and application scenarios:
duration in the entire process, effectively reducing HBM resource consumption.
3. **Transmission via Unified Storage Pool (Leveraging Prefix Cache Logic)**. This mode fully utilizes Prefix Cache
logic, with a unified storage pool serving as the intermediate medium for KV cache transmission. Specifically, the
Prefill node offloads KV cache to the Prefix Cache, while the Decode node performs inference with high hit rates on
Prefill node offloads KV cache to the storage, while the Decode node performs inference with high hit rates on
the Prefix Cache. Compared with the first two modes, this approach is the "simplest" in terms of logic and
implementation, and achieves the highest degree of "decoupling" in the entire systemeven eliminating the need for a
implementation, and achieves the highest degree of decoupling in the entire system, even eliminating the need for a
strict distinction between Prefill and Decode nodes.
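
To make this third mode concrete, the following minimal sketch (a toy illustration, not UCM code) uses a plain dictionary as the unified storage pool and hashes of the token prefix as Prefix Cache keys; the Prefill and Decode sides interact only through that store:

```python
# Conceptual sketch only: Prefill and Decode exchange KV cache purely through a
# shared store, never directly with each other. All names are illustrative.
from hashlib import sha256

shared_store: dict[str, bytes] = {}  # stands in for the unified storage pool


def block_hash(prefix_tokens: list[int]) -> str:
    # Prefix-Cache-style key: hash of the token prefix up to this block.
    return sha256(repr(prefix_tokens).encode()).hexdigest()


def prefill_offload(token_ids: list[int], block_size: int = 4) -> None:
    # The Prefill node writes each full KV block to the shared store.
    for end in range(block_size, len(token_ids) + 1, block_size):
        shared_store[block_hash(token_ids[:end])] = f"kv-for-{end}-tokens".encode()


def decode_prefix_hits(token_ids: list[int], block_size: int = 4) -> int:
    # The Decode node only queries the shared store; because Prefill has just
    # written these blocks, the Prefix Cache lookup hits at a high rate.
    hits = 0
    for end in range(block_size, len(token_ids) + 1, block_size):
        hits += block_hash(token_ids[:end]) in shared_store
    return hits


prompt = list(range(16))
prefill_offload(prompt)
print(decode_prefix_hits(prompt))  # 4: every prefix block is served from the pool
```

In a real deployment the dictionary would be a pooled DRAM/SSD store and the values real KV tensors, but the decoupling property is the same: neither side needs to know the other's identity or topology.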

### Rationale for UCM’s Adoption of the Third Transmission Mode
@@ -70,9 +67,9 @@ scenarios include the following:

**1. Reducing GPU Compute Idle Time and Maximizing Compute Utilization**

- Under Dynamic Batching (DP), the scheduler merges sequences of different lengths to reduce idle time caused by DP,
- Under Data Parallelism (DP) in Dynamic Batching, the scheduler merges sequences of different lengths to reduce idle time caused by DP,
with task migration performed midway if necessary.
- The scheduler leverages Chunk Prefill to utilize residual compute resources on Decode instances.
- The scheduler leverages Chunked Prefill to utilize residual compute resources on Decode instances.
- By default, the scheduler stores KV cache generated from each inference task in a unified external memory. This not
only avoids recomputation in case of exceptions but also maximizes system-wide compute utilization through mid-task
migration.
@@ -105,13 +102,13 @@ can further reduce compute idle time (e.g., idle time caused by DP) and fully ex

However, it is important to recognize that large-model inference is still in its early stages, and PD Disaggregation
represents only the starting point for the transition toward large-scale distributed inference deployment. As more
application scenarios emerge, there will be an inevitable demand for stricter Service-Level Agreements (SLAs) and more
application scenarios emerge, there will be an inevitable demand for increasingly strict Service-Level Agreements (SLAs) and more
robust handling of extreme edge cases. Currently, simpler architectural designs (such as the third KV transmission mode
adopted by UCM) can provide greater design redundancy for more complex and effective solutions in the future. For
example, when implementing checkpoint-based resumption and offline inference, it has been found that these
functionalities can be extremely easily integrated into a simple architecture.

UCM’s understanding of PD Disaggregation remains rooted in the principles of "simplicity" and "decoupling"—to the extent
UCM’s understanding of PD Disaggregation remains rooted in the principles of "**simplicity**" and "**decoupling**", to the extent
that it may even sacrifice a certain degree of performance to preserve these core advantages.

:::{toctree}
8 changes: 4 additions & 4 deletions docs/source/user-guide/prefix-cache/dram_store.md
@@ -1,14 +1,14 @@
# DRAM Store

This document provides a usage example and configuration guide for the **DRAM Connector**. This connector enables offloading of KV cache from GPU HBM to CPU DRAM, helping reduce memory pressure and support larger models or batch sizes.
This document provides a usage example and configuration guide for the **DRAM Connector**. This connector enables offloading of KV cache from GPU HBM to CPU DRAM, helping reduce memory pressure and supporting larger models or batch sizes.

## Performance

### Overview
The following are the multi-concurrency performance test results of UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements of UCM on two different models.
During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from DRAM.

In the QwQ-32B model, the test used one H20 server with two GPUs.
In the QwQ-32B model, the test used one H20 server with 2 GPUs.

Here, Full Compute refers to pure VLLM inference, while DRAM80% indicates that after UCM pooling, the DRAM hit rate of the KV cache is 80%.

@@ -42,10 +42,10 @@ To use the DRAM connector, you need to configure the `connector_config` dictiona
### Required Parameters

- `max_cache_size` *(optional)*:
Specifies the maximum allowed DRAM memory usage (in **byte**) for caching in `kv_connector_extra_config["ucm_connector_config"]`.
Specifies the maximum allowed DRAM memory usage (in **bytes**) for caching in `kv_connector_extra_config["ucm_connector_config"]`.
If not provided, it defaults to **5 GB**.
- `kv_block_size` *(optional)*:
Specifies the memory size (in bytes) of a single key or value cache block used in vLLM’s paged attention mechanism, which is calculated as : `block_size * head_size * total_num_kv_heads * element_size`.
Specifies the memory size (in **bytes**) of a single key or value cache block used in vLLM’s paged attention mechanism, which is calculated as: `block_size * head_size * total_num_kv_heads * element_size`.
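
As a worked illustration of this formula (the model dimensions below are assumptions chosen for the arithmetic, not values from any particular model), a sketch of how the two parameters might be computed and passed:

```python
# Hypothetical dimensions, chosen only to show how kv_block_size is derived.
block_size = 16          # tokens per paged-attention block
head_size = 128          # per-head hidden size
total_num_kv_heads = 8   # number of KV heads (e.g. after GQA grouping)
element_size = 2         # bytes per element for fp16/bf16

kv_block_size = block_size * head_size * total_num_kv_heads * element_size
print(kv_block_size)     # 32768 bytes per key (or value) cache block

# Both optional parameters then sit under ucm_connector_config, for example:
ucm_connector_config = {
    "max_cache_size": 5 * 1024**3,  # an illustrative 5 GiB DRAM budget
    "kv_block_size": kv_block_size,
}
```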

### Example:

27 changes: 11 additions & 16 deletions docs/source/user-guide/prefix-cache/index.md
@@ -8,26 +8,22 @@ proliferation of Agent-based applications, the performance gains of Prefix Cache
the default capability for KVCache applications, Prefix Cache also lays the foundation for the PD disaggregation by UCM.
Concurrently, it imposes a requirement that sparse algorithms must support Prefix Cache.

The hit rate of Prefix Cache is its core performance metric, and there exists a direct positive correlation between
The core performance metric of Prefix Cache is the hit rate, and there exists a direct positive correlation between
cache capacity and hit rate. Taking the publicly released data from DeepSeek and Kimi as examples, a relatively large
cache capacity is required to reach the "hit rate sweet spot". In terms of input/output (IO) characteristics, Prefix
Cache primarily demands bandwidth-intensive IO, making it well-suited for storage on Solid-State Drives (SSDs).

Prefix Cache can leverage diverse storage media, including High-Bandwidth Memory (HBM), Dynamic Random-Access Memory (
DRAM), SSDs, and dedicated storage systems (e.g., DeepSeek’s 3fs, a storage system specifically developed for KVCache).
The fundamental design philosophy involves constructing a **multi-level cache** hierarchy using HBM, DRAM, local SSDs,
and
remote storage. In practice, the implementation of this hierarchy can be roughly categorized into two architectural
and remote storage. In practice, the implementation of this hierarchy can be roughly categorized into two architectural
directions:

- **Decentralized Architecture**: KVCache is deployed in an isolated manner for each inference instance, with each
KVCache
partition belonging to a distinct inference instance (or server). This distributed KVCache deployment is typically
paired with upper-layer KVCache-aware affinity scheduling. The goal of such scheduling is to route inference requests
to instances with higher KVCache hit rates, thereby maximizing performance.
Centralized Architecture: KVCache is stored in a centralized external storage system and shared across all computing
nodes. This architecture features inherent simplicity; DeepSeek’s 3fs adopts this design paradigm, and the Prefix
Cache module in UCM also tends to prioritize this centralized approach.
to instances with higher KVCache hit rates, thereby maximizing overall system performance.
- **Centralized Architecture**: KVCache is stored in a centralized external storage system and shared across all
computing
nodes. This architecture features inherent simplicity; DeepSeek’s 3fs adopts this design paradigm, and the Prefix
@@ -38,10 +34,10 @@ directions:
The decision to adopt DeepSeek’s centralized architecture (rather than Dynamo’s decentralized scheme) is driven by the
following key considerations, which align with UCM’s core design principles:

1. **Adherence to UCM’s First Foundational Principle: "Simplicity"**.A core tenet guiding UCM’s design is "avoiding
1. **Adherence to UCM’s First Foundational Principle: "Simplicity"**. A core tenet guiding UCM’s design is "avoiding
unnecessary investments in features that do not yield substantial benefits". Affinity scheduling, however, is not a
trivial module to implement. Most decentralized implementations require each inference instance to feed back KVCache
management status to the scheduler—enabling the scheduler to predict hit rates for requests routed to different
management status to the scheduler so that it can predict hit rates for requests routed to different
instances. Additionally, the scheduler must balance these hit rates against the load of each instance, introducing
significant complexity.

@@ -54,20 +50,19 @@ following key considerations, which align with UCM’s core design principles:
3. **Cost-Benefit Analysis: Insufficient Gains to Justify Principle Violations**. UCM’s evaluation indicates that
decentralized KVCache does not deliver benefits significant enough to offset the trade-offs of violating the "
Simplicity" and "Decoupling" principles. The primary purported advantages of decentralized KVCache—reduced KVCache
network bandwidth consumption and lower latency—are unavoidable under the PD-separated architecture. Moreover, when
network bandwidth consumption and lower latency—are hard to achieve in practice under the PD-disaggregated architecture. Moreover, when
compared to improvements in Time-to-First-Token (TTFT), the latency reduction benefits of decentralization are
marginal.

4. **Facilitation of Commercial-Grade Inference Solutions**. Decentralized KVCache introduces additional complexity to
fault tolerance and multi-instance deployment. To advance toward a "commercially viable inference solution", UCM
4. **Facilitation of Commercial-Grade Inference Solutions**. Decentralized KVCache introduces additional complexity in achieving fault tolerance and supporting multi-instance deployment. To advance toward a "commercially viable inference solution", UCM
prioritizes architectures that are structurally simple and robust to anomalies.

5. **Mitigation of Data Silos**.Decentralized KVCache inherently creates data silos: redundant KVCache data accumulates
5. **Mitigation of Data Silos**. Decentralized KVCache inherently creates data silos: redundant KVCache data accumulates
across isolated instances, and the limited capacity of individual silos constrains the overall Prefix Cache hit
rateundermining a key performance objective.
rate, undermining a key performance objective.

6. **Enhanced Compatibility with PD Separation and Large-Scale Deployment**.The centralized architecture exhibits
superior compatibility with the PD-separated paradigm and is more scalable for large-scale inference deployments, a
6. **Enhanced Compatibility with PD Disaggregation and Large-Scale Deployment**. The centralized architecture exhibits
superior compatibility with the PD-disaggregated paradigm and is more scalable for large-scale inference deployments, a
critical requirement for industrial-grade LLM applications.

It is important to note that the distinction between decentralized and centralized architectures is not absolute. For
8 changes: 4 additions & 4 deletions docs/source/user-guide/prefix-cache/nfs_store.md
@@ -8,8 +8,8 @@ This document provides a usage example and configuration guide for the **NFS Con
The following are the multi-concurrency performance test results of UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements of UCM on two different models.
During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from SSD.

In the QwQ-32B model, the test used one H20 server with two GPUs.
In the DeepSeek-V3 model, the test used two H20 servers with sixteen GPUs.
In the QwQ-32B model, the test used one H20 server with 2 GPUs.
In the DeepSeek-V3 model, the test used two H20 servers with 16 GPUs.

Here, Full Compute refers to pure VLLM inference, while Disk80% indicates that after UCM pooling, the SSD hit rate of the KV cache is 80%.

@@ -176,6 +176,6 @@ To quickly experience the NFS Connector's effect:
|--------------|-----------------------------------------------------------------------------|
| `task_id` | Unique identifier for the task |
| `direction` | `D2S`: Dump to Storage (Device → SSD)<br>`S2D`: Load from Storage (SSD → Device) |
| `task_count` | Number of tasks executed in this operation |
| `size` | Total size of data transferred in bytes (across all tasks) |
| `task_count` | Number of tasks executed in this operation |
| `size` | Total size of data transferred in **bytes** (across all tasks) |
| `time` | Time taken for the complete operation in seconds |
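
Because `size` is reported in bytes and `time` in seconds, these fields are enough to derive the effective bandwidth of a dump or load. The record below is invented purely for illustration; it is not actual NFS Connector output:

```python
# Derive transfer bandwidth from the reported fields (made-up sample record).
record = {
    "task_id": "sample-task",
    "direction": "D2S",   # Device -> SSD dump
    "task_count": 4,
    "size": 2 * 1024**3,  # bytes transferred across all tasks
    "time": 1.6,          # seconds for the complete operation
}

gib_per_s = record["size"] / record["time"] / 1024**3
print(f'{record["direction"]} bandwidth: {gib_per_s:.2f} GiB/s')  # 1.25 GiB/s
```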
6 changes: 3 additions & 3 deletions docs/source/user-guide/sparse-attention/index.md
@@ -27,9 +27,9 @@ The core concept of our UCMSparse attention framework is to offload the complete
- UCMSparse in model_runner: this instance lives in the same process as the `Worker`.
A typical sparse attention algorithm works like this:
1. In prefill, it dumps full KV Cache from HBM to storage.
2. In decode, it retrieves the most relevant blocks based on the context and loads the blocks from store to HBM.
3. In decoode, it also dumps new generated blocks to keep the latest context accessible.
- By fine-grained task scheduling, retrieval and loading can be executed asynchronously and overlap with the model execution. Therefore no overhead is introduced by UCMSparse and generation speed is boosted benefitted by less computational load and fewer memory accesses.
2. In decode, it retrieves the most relevant blocks based on the context and loads the blocks from storage to HBM.
3. In decode, it also dumps new generated blocks to keep the latest context accessible.
- By fine-grained task scheduling, retrieval and loading can be executed asynchronously and overlap with the model execution. Therefore, no overhead is introduced by UCMSparse, and generation speed is boosted by the lower computational load and fewer memory accesses.
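
A minimal sketch of the dump/retrieve/load cycle described above; the class and method names are illustrative assumptions, not the actual UCMSparse interfaces:

```python
# Illustrative sketch only: not the real UCMSparse API.
from typing import Any, Callable


class SparseKVSketch:
    def __init__(self, score: Callable[[Any, Any], float]):
        self.store: dict[tuple[str, int], Any] = {}  # external KV-cache storage
        self.score = score  # relevance of a stored block to the current query

    def on_prefill(self, seq_id: str, kv_blocks: list) -> None:
        # 1. Prefill: dump the full KV cache from HBM to storage.
        for idx, blk in enumerate(kv_blocks):
            self.store[(seq_id, idx)] = blk

    def on_decode_step(self, seq_id: str, query: Any, top_k: int, new_block: Any = None) -> list:
        # 3. Decode: dump the newly generated block so the latest context
        #    stays accessible for later retrieval steps.
        if new_block is not None:
            next_idx = sum(1 for key in self.store if key[0] == seq_id)
            self.store[(seq_id, next_idx)] = new_block
        # 2. Decode: retrieve the most relevant blocks for this context and
        #    load them from storage back to HBM (here, simply return them).
        keys = [key for key in self.store if key[0] == seq_id]
        keys.sort(key=lambda key: self.score(query, self.store[key]), reverse=True)
        return [self.store[key] for key in keys[:top_k]]
```

In the real integration these steps run as fine-grained asynchronous tasks so that retrieval and loading overlap with model execution, which is what keeps the extra storage traffic off the critical path.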


See `ESA` for more details.
14 changes: 10 additions & 4 deletions examples/offline_inference_kvstar.py
@@ -23,6 +23,7 @@ def setup_environment_variables():
os.environ["PYTHONHASHSEED"] = "123456"
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"


@contextlib.contextmanager
def build_llm_with_uc(module_path: str, name: str, model: str):
ktc = KVTransferConfig(
@@ -41,8 +42,8 @@ def build_llm_with_uc(module_path: str, name: str, model: str):
"local_window_sz": 2,
"sparse_ratio": 0.25,
"retrieval_stride": 8,
"blk_repre_dim_prune_ratio": 0.25, # 块表征维度裁剪
"blk_repre_inner_token_merge": 2 # 块内几个token融合成一个表征
"blk_repre_dim_prune_ratio": 0.25, # 块表征维度裁剪
"blk_repre_inner_token_merge": 2, # 块内几个token融合成一个表征
}
},
},
@@ -162,8 +163,13 @@ def main():

sampling_params = SamplingParams(temperature=0, top_k=1, max_tokens=300)

print_output(llm, prompts_prefill_more_than_2_full_blk, sampling_params, "first")
print_output(llm, prompts_prefill_more_than_2_full_blk, sampling_params, "second")
print_output(
llm, prompts_prefill_more_than_2_full_blk, sampling_params, "first"
)
print_output(
llm, prompts_prefill_more_than_2_full_blk, sampling_params, "second"
)


if __name__ == "__main__":
main()
4 changes: 3 additions & 1 deletion ucm/integration/vllm/ucm_sparse/factory.py
@@ -49,4 +49,6 @@ def create_sparse_method(
"KvComp", "ucm.sandbox.sparse.kvcomp.kvcomp", "KvComp"
)
UcmSparseFactory.register_sparse_method("GSA", "ucm.ucm_sparse.gsa", "GSA")
UcmSparseFactory.register_sparse_method("KVStarMultiStep", "ucm.ucm_sparse.kvstar.multistep", "KVStarMultiStep")
UcmSparseFactory.register_sparse_method(
"KVStarMultiStep", "ucm.ucm_sparse.kvstar.multistep", "KVStarMultiStep"
)