11 changes: 10 additions & 1 deletion README.md
@@ -5,13 +5,22 @@
</picture>
</p>

<p align="center">
| <a href="docs/source/index.md"><b>Documentation</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management/issues/16"><b>Roadmap</b></a> |
</p>

---

*Latest News* 🔥
- [2025/07/30] We are excited to announce the alpha release of Unified Cache Manager.
- [2025/08/01] We are excited to announce the alpha release of Unified Cache Manager.

---

## Performance
The NFS connector achieves roughly a 4x TTFT (time to first token) speedup.

![perf](docs/source/images/nfs_performance.png)

## Overview

### Motivation
7 changes: 3 additions & 4 deletions docker/Dockerfile
@@ -14,11 +14,10 @@ ENV VLLM_USE_PRECOMPILED=1
RUN VLLM_TARGET_DEVICE=cuda pip install -v -e /vllm-workspace/vllm --extra-index=https://download.pytorch.org/whl/nightly/cu128

# Install unified-cache-management
ARG UCM_REPO=https://github.com/ModelEngine-Group/unified-cache-management.git
ARG UCM_BRANCH=develop
RUN git clone --depth 1 $UCM_REPO --branch $UCM_BRANCH /vllm-workspace/unified-cache-management
COPY . /vllm-workspace/unified-cache-management

RUN pip install -v -e /vllm-workspace/unified-cache-management
RUN export PLATFORM="cuda" && \
pip install -v -e /vllm-workspace/unified-cache-management

# Apply patch for vLLM
RUN cd /vllm-workspace/vllm \
8 changes: 4 additions & 4 deletions docker/Dockerfile-NPU
@@ -4,11 +4,11 @@ FROM quay.io/ascend/vllm-ascend:v0.9.2rc1
WORKDIR /workspace

# Install unified-cache-management
ARG UCM_REPO=https://github.com/ModelEngine-Group/unified-cache-management.git
ARG UCM_BRANCH=develop
RUN git clone --depth 1 $UCM_REPO --branch $UCM_BRANCH /vllm-workspace/unified-cache-management
COPY . /vllm-workspace/unified-cache-management

RUN pip install -v -e /vllm-workspace/unified-cache-management
RUN export PLATFORM="ascend" && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
pip install -v -e /vllm-workspace/unified-cache-management

# Apply patch for vLLM
RUN cd /vllm-workspace/vllm \
2 changes: 1 addition & 1 deletion docs/source/developer/add_connector.md
@@ -1 +1 @@
# Add New Connector
# How To Add New Connector
1 change: 0 additions & 1 deletion docs/source/developer/block_layout.md

This file was deleted.

3 changes: 2 additions & 1 deletion docs/source/developer/index.md
@@ -3,7 +3,8 @@
:::{toctree}
:maxdepth: 2
architecture.md
block_layout.md
add_connector.md
nfs_connector.md
performance_benchmark.md
:::

1 change: 1 addition & 0 deletions docs/source/developer/nfs_connector.md
@@ -0,0 +1 @@
# NFS Connector
1 change: 1 addition & 0 deletions docs/source/developer/performance_benchmark.md
@@ -0,0 +1 @@
# Performance Benchmark
33 changes: 26 additions & 7 deletions docs/source/getting-started/example/dram_conn.md
@@ -2,6 +2,14 @@

This document provides a usage example and configuration guide for the **DRAM Connector**. This connector enables offloading of KV cache from GPU HBM to CPU DRAM, helping reduce memory pressure and support larger models or batch sizes.

## Performance

Combining UCM with vLLM delivers 3–10× improvements in latency and GPU efficiency, especially for long-context LLM tasks.

<p align="center">
<img alt="UCM" src="../../images/dram_perform.png" width="90%">
</p>

## Features

The DRAM connector supports the following functionalities:
@@ -21,38 +29,48 @@ To use the DRAM connector, you need to configure the `connector_config` dictiona
- `max_cache_size` *(optional)*:
Specifies the maximum allowed DRAM memory usage (in **byte**) for caching in `kv_connector_extra_config["ucm_connector_config"]`.
If not provided, it defaults to **5 GB**.
- `kv_block_size` *(optional)*:
Specifies the memory size (in bytes) of a single key or value cache block used in vLLM’s paged attention mechanism, calculated as `block_size * head_size * total_num_kv_heads * element_size` (see the sketch below).
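
A minimal sketch of this calculation, using assumed values (128-token blocks, head size 128, 8 KV heads, bf16 elements) that happen to produce the 262144-byte block size and the 5 GB `max_cache_size` used in the example below; substitute your model's actual configuration:

```python
# Illustrative only: derive the DRAM connector's kv_block_size and max_cache_size.
# All values here are assumptions for the sake of the example; replace them with
# your model's real configuration and the vLLM block_size you run with.
block_size = 128            # tokens per vLLM paged-attention block (assumed)
head_size = 128             # dimension of each attention head (assumed)
total_num_kv_heads = 8      # number of KV heads (assumed)
element_size = 2            # bytes per element, e.g. bfloat16

kv_block_size = block_size * head_size * total_num_kv_heads * element_size
max_cache_size = 5 * 1024 ** 3   # 5 GB expressed in bytes

print(kv_block_size)   # 262144
print(max_cache_size)  # 5368709120
```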

### Example:

```python
kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120}}
# Allocate up to 5 GB of DRAM for the KV cache (5368709120 bytes)
# KV block size (in bytes) is 262144
kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120, "kv_block_size": 262144}}
```

## Launching Inference

### Offline Inference

To start **offline inference** with the DRAM connector,modify the script `examples/vllm_kv_offload.py` to include the `kv_connector_extra_config` for DRAM connector usage:
To start **offline inference** with the DRAM connector, modify the script `examples/offline_inference.py` to include the `kv_connector_extra_config` for DRAM connector usage:

```python
# In examples/vllm_kv_offload.py
# In examples/offline_inference.py
ktc = KVTransferConfig(
...
kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120}}
kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120, "kv_block_size": 262144}}
)
```

Then run the script as follows:

```bash
cd examples/
python vllm_kv_offload.py
python offline_inference.py
```

### Online Inference

For **online inference** , vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:
For **online inference**, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.

First, set the Python hash seed:
```bash
export PYTHONHASHSEED=123456
```

Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:

```bash
vllm serve /home/models/Qwen2.5-14B-Instruct \
@@ -69,7 +87,8 @@ vllm serve /home/models/Qwen2.5-14B-Instruct \
"kv_connector_extra_config": {
"ucm_connector_name": "UcmDram",
"ucm_connector_config": {
"max_cache_size": 5368709120
"max_cache_size": 5368709120,
"kv_block_size": 262144
}
}
}'
126 changes: 126 additions & 0 deletions docs/source/getting-started/example/nfs_conn.md
@@ -1,2 +1,128 @@
# NFS Connector

This document provides a usage example and configuration guide for the **NFS Connector**. This connector enables offloading of KV cache from GPU HBM to SSD or Local Disk, helping reduce memory pressure and support larger models or batch sizes.

## Performance: DRAM Connector vs NFS Connector

### Overview
When the total size of `kvcache` does not exceed the `max_cache_size` configured for the DRAM Connector, the DRAM Connector demonstrates superior performance. However, when the `kvcache` size exceeds `max_cache_size`, the DRAM Connector experiences significant performance degradation, at which point the NFS Connector becomes the better-performing option.

<p align="center">
<img alt="UCM" src="../../images/nfs_performance.png" width="90%">
</p>

## Features

The NFS connector supports the following functionalities:

- `dump`: Offload KV cache blocks from HBM to SSD or Local Disk.
- `load`: Load KV cache blocks from SSD or Local Disk back to HBM.
- `lookup`: Look up KV blocks stored in SSD or Local Disk by block hash.
- `wait`: Ensure that all dump or load operations have completed.
- `commit`: Mark cache operations as complete and ready for reuse.

## Configuration

To use the NFS connector, you need to configure the `connector_config` dictionary in your model's launch configuration.

### Required Parameters

- `storage_backends` *(required)*:
The `storage_backends` path can be either a local directory or an NFS-mounted directory backed by an SSD drive.
- `kv_block_size` *(required)*:
`kv_block_size` is the size in bytes of a single KV cache block, calculated as `block_size * head_size * total_num_kv_heads * element_size * num_layers * 2` (see the sketch below).
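
As a hedged illustration, the sketch below plugs assumed values (128-token blocks, head size 128, 8 KV heads, bf16 elements, 64 layers) into this formula; the product matches the 33554432 bytes used in the example below, but your model's real configuration will differ:

```python
# Illustrative only: derive the NFS connector's kv_block_size. All values here are
# assumptions; replace them with your model's actual configuration.
block_size = 128            # tokens per vLLM paged-attention block (assumed)
head_size = 128             # dimension of each attention head (assumed)
total_num_kv_heads = 8      # number of KV heads (assumed)
element_size = 2            # bytes per element, e.g. bfloat16
num_layers = 64             # transformer layers (assumed)

# The factor of 2 accounts for storing both the key and the value cache.
kv_block_size = block_size * head_size * total_num_kv_heads * element_size * num_layers * 2
print(kv_block_size)        # 33554432
```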

### Example:

```python
kv_connector_extra_config={"ucm_connector_name": "UcmNfsStore", "ucm_connector_config":{"storage_backends": "/mnt/test1", "kv_block_size": 33554432}}
```

## Launching Inference

### Offline Inference

To start **offline inference** with the NFS connector, modify the script `examples/offline_inference.py` to include the `kv_connector_extra_config` for NFS connector usage:

```python
# In examples/offline_inference.py
ktc = KVTransferConfig(
...
kv_connector_extra_config={"ucm_connector_name": "UcmNfsStore", "ucm_connector_config":{"storage_backends": "/mnt/test1", "kv_block_size": 33554432}}
)
```

Then run the script as follows:

```bash
cd examples/
export PYTHONHASHSEED=123456
python offline_inference.py
```

### Online Inference

For **online inference**, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:

```bash
export PYTHONHASHSEED=123456
vllm serve /home/models/Qwen2.5-14B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 2 \
--gpu_memory_utilization 0.87 \
--trust-remote-code \
--port 7800 \
--kv-transfer-config \
'{
"kv_connector": "UnifiedCacheConnectorV1",
"kv_connector_module_path": "unifiedcache.integration.vllm.uc_connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"ucm_connector_name": "UcmNfsStore",
"ucm_connector_config": {
"storage_backends": "/mnt/test",
"kv_block_size": 33554432
}
}
}'
```

If you see log output like the following:

```bash
INFO: Started server process [1049932]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

Congratulations, you have successfully started the vLLM server with the NFS Connector!

After successfully starting the vLLM server, you can interact with the API as follows:

```bash
curl http://localhost:7800/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/home/models/Qwen2.5-14B-Instruct",
"prompt": "Shanghai is a",
"max_tokens": 7,
"temperature": 0
}'
```
To quickly experience the NFS Connector's effect:

1. Start the service with:
`--no-enable-prefix-caching`
2. Send the same request (longer than 128 tokens) twice consecutively, as shown in the sketch after this list
3. Remember to enable prefix caching (do not add `--no-enable-prefix-caching`) in production environments.
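
A minimal sketch of steps 1–2, assuming the server started above is listening on port 7800, the model path from the serve command, and that the `requests` package is installed; the second run should finish noticeably faster when the KV cache is reloaded from storage:

```python
# Minimal sketch: send an identical long prompt twice and compare wall-clock latency.
# The server address, port, and model path are assumptions carried over from the
# serve command above; adjust them to your deployment.
import time
import requests

prompt = "Write a detailed travel guide for Shanghai. " * 40  # well over 128 tokens
payload = {
    "model": "/home/models/Qwen2.5-14B-Instruct",
    "prompt": prompt,
    "max_tokens": 7,
    "temperature": 0,
}
for attempt in (1, 2):
    start = time.perf_counter()
    requests.post("http://localhost:7800/v1/completions", json=payload, timeout=300)
    print(f"attempt {attempt}: {time.perf_counter() - start:.2f}s")
```
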
### Log Message Structure
```plaintext
[UCMNFSSTORE] [I] Task(<task_id>,<direction>,<task_count>,<size>) finished, elapsed <time>s
```
| Component | Description |
|--------------|-----------------------------------------------------------------------------|
| `task_id` | Unique identifier for the task |
| `direction` | `D2S`: Dump to Storage (Device → SSD)<br>`S2D`: Load from Storage (SSD → Device) |
| `task_count` | Number of tasks executed in this operation |
| `size` | Total size of data transferred in bytes (across all tasks) |
| `time` | Time taken for the complete operation in seconds |
1 change: 0 additions & 1 deletion docs/source/getting-started/index.md
@@ -4,7 +4,6 @@
:maxdepth: 2
installation.md
installation_npu.md
quick_start.md
example/index.md
:::

12 changes: 7 additions & 5 deletions docs/source/getting-started/installation.md
@@ -35,19 +35,21 @@ Refer to [Set up using docker](https://docs.vllm.ai/en/latest/getting_started/in
### Build from source code
Follow commands below to install unified-cache-management:
```bash
git clone --depth 1 --branch develop https://github.com/ModelEngine-Group/unified-cache-management.git
# Replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
export PLATFORM=cuda
pip install -v -e .
cd ..
```

## Setup from docker
Download the provided pre-built Docker image, or build the unified-cache-management Docker image with the commands below:
```bash
# Build docker image using source code
git clone --depth 1 --branch develop https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management/docker
docker build -t ucm-vllm:latest -f ./Dockerfile ./
# Build docker image using source code, replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
docker build -t ucm-vllm:latest -f ./docker/Dockerfile ./
```
Then run your container using the following command. You can add or remove Docker parameters as needed.
```bash
17 changes: 11 additions & 6 deletions docs/source/getting-started/installation_npu.md
@@ -44,24 +44,29 @@ Codes of vLLM and vLLM Ascend are placed in /vllm-workspace, you can refer to [v
### Build from source code
Follow commands below to install unified-cache-management:
```bash
git clone --depth 1 --branch develop https://github.com/ModelEngine-Group/unified-cache-management.git
# Replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
export PLATFORM=ascend
pip install -v -e .
cd ..
```

## Setup from docker
Download the provided pre-built Docker image, or build the unified-cache-management Docker image with the commands below:
```bash
# Build docker image using source code
git clone --depth 1 --branch develop https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management/docker
docker build -t ucm-vllm:latest -f ./Dockerfile-NPU ./
# Build docker image using source code, replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
docker build -t ucm-vllm:latest -f ./docker/Dockerfile-NPU ./
```
Then run your container using the following command. You can add or remove Docker parameters as needed.
```bash
# Use `--ipc=host` to make sure the shared memory is large enough.
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
docker run --rm \
--network=host \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
1 change: 0 additions & 1 deletion docs/source/getting-started/quick_start.md

This file was deleted.

Binary file added docs/source/images/dram_perform.png
Binary file added docs/source/images/nfs_performance.png