diff --git a/README.md b/README.md
index a14303ae..03818b72 100644
--- a/README.md
+++ b/README.md
@@ -5,13 +5,22 @@

+
+| Documentation | Roadmap |
+

 ---
 
 *Latest News* 🔥
 
-- [2025/07/30] We are excited to announce the alpha release of Unified Cache Manager.
+- [2025/08/01] We are excited to announce the alpha release of Unified Cache Manager.
 
 ---
 
+## Performance
+The NFS connector achieves about a 4x TTFT (time to first token) speedup.
+
+![perf](docs/source/images/nfs_performance.png)
+
 ## Overview
 
 ### Motivation
diff --git a/docker/Dockerfile b/docker/Dockerfile
index 06edc006..fe2c2b84 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -14,11 +14,10 @@ ENV VLLM_USE_PRECOMPILED=1
 RUN VLLM_TARGET_DEVICE=cuda pip install -v -e /vllm-workspace/vllm --extra-index=https://download.pytorch.org/whl/nightly/cu128
 
 # Install unified-cache-management
-ARG UCM_REPO=https://github.com/ModelEngine-Group/unified-cache-management.git
-ARG UCM_BRANCH=develop
-RUN git clone --depth 1 $UCM_REPO --branch $UCM_BRANCH /vllm-workspace/unified-cache-management
+COPY . /vllm-workspace/unified-cache-management
 
-RUN pip install -v -e /vllm-workspace/unified-cache-management
+RUN export PLATFORM="cuda" && \
+    pip install -v -e /vllm-workspace/unified-cache-management
 
 # Apply patch for vLLM
 RUN cd /vllm-workspace/vllm \
diff --git a/docker/Dockerfile-NPU b/docker/Dockerfile-NPU
index 4216e292..519d4253 100644
--- a/docker/Dockerfile-NPU
+++ b/docker/Dockerfile-NPU
@@ -4,11 +4,11 @@ FROM quay.io/ascend/vllm-ascend:v0.9.2rc1
 WORKDIR /workspace
 
 # Install unified-cache-management
-ARG UCM_REPO=https://github.com/ModelEngine-Group/unified-cache-management.git
-ARG UCM_BRANCH=develop
-RUN git clone --depth 1 $UCM_REPO --branch $UCM_BRANCH /vllm-workspace/unified-cache-management
+COPY . /vllm-workspace/unified-cache-management
 
-RUN pip install -v -e /vllm-workspace/unified-cache-management
+RUN export PLATFORM="ascend" && \
+    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
+    pip install -v -e /vllm-workspace/unified-cache-management
 
 # Apply patch for vLLM
 RUN cd /vllm-workspace/vllm \
diff --git a/docs/source/developer/add_connector.md b/docs/source/developer/add_connector.md
index 37552e9f..8d456ede 100644
--- a/docs/source/developer/add_connector.md
+++ b/docs/source/developer/add_connector.md
@@ -1 +1 @@
-# Add New Connector
+# How to Add a New Connector
diff --git a/docs/source/developer/block_layout.md b/docs/source/developer/block_layout.md
deleted file mode 100644
index a7365ff1..00000000
--- a/docs/source/developer/block_layout.md
+++ /dev/null
@@ -1 +0,0 @@
-# Block Layout
diff --git a/docs/source/developer/index.md b/docs/source/developer/index.md
index d6069244..488d86cf 100644
--- a/docs/source/developer/index.md
+++ b/docs/source/developer/index.md
@@ -3,7 +3,8 @@
 :::{toctree}
 :maxdepth: 2
 architecture.md
-block_layout.md
 add_connector.md
+nfs_connector.md
+performance_benchmark.md
 :::
diff --git a/docs/source/developer/nfs_connector.md b/docs/source/developer/nfs_connector.md
new file mode 100644
index 00000000..629c2daa
--- /dev/null
+++ b/docs/source/developer/nfs_connector.md
@@ -0,0 +1 @@
+# NFS Connector
\ No newline at end of file
diff --git a/docs/source/developer/performance_benchmark.md b/docs/source/developer/performance_benchmark.md
new file mode 100644
index 00000000..927cc276
--- /dev/null
+++ b/docs/source/developer/performance_benchmark.md
@@ -0,0 +1 @@
+# Performance Benchmark
\ No newline at end of file
diff --git a/docs/source/getting-started/example/dram_conn.md b/docs/source/getting-started/example/dram_conn.md
index 6739b637..5ffcf276 100644
--- a/docs/source/getting-started/example/dram_conn.md
+++ b/docs/source/getting-started/example/dram_conn.md
@@ -2,6 +2,14 @@
 This document provides a usage example and configuration guide for the **DRAM Connector**. This connector enables offloading of KV cache from GPU HBM to CPU DRAM, helping reduce memory pressure and support larger models or batch sizes.
 
+## Performance
+
+Combining UCM with vLLM delivers 3–10× improvements in latency and GPU efficiency, especially for long-context LLM tasks.
+

+ UCM

+
 ## Features
 
 The DRAM connector supports the following functionalities:
@@ -21,25 +29,28 @@ To use the DRAM connector, you need to configure the `connector_config` dictiona
 - `max_cache_size` *(optional)*:
   Specifies the maximum allowed DRAM memory usage (in **byte**) for caching in `kv_connector_extra_config["ucm_connector_config"]`. If not provided, it defaults to **5 GB**.
+- `kv_block_size` *(optional)*:
+  Specifies the memory size (in bytes) of a single key or value cache block used in vLLM’s paged attention mechanism, which is calculated as: `block_size * head_size * total_num_kv_heads * element_size` (see the sketch after the example below).
 
 ### Example:
 
 ```python
-kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120}} # Allocate up to 8GB DRAM for KV cache
+# KV block size (in bytes) is 262144
+kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120, "kv_block_size": 262144}}
 ```
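+
+As a rough illustration of where a value like `262144` can come from, the sketch below simply evaluates the formula above for one hypothetical configuration. Every parameter value is an assumption chosen for this example; in practice, read the real values from your model and engine configuration.
+
+```python
+# Hedged sketch: evaluate kv_block_size = block_size * head_size * total_num_kv_heads * element_size.
+# All values below are illustrative assumptions, not recommendations.
+block_size = 128           # tokens per KV cache block (assumed)
+head_size = 128            # per-head hidden dimension (assumed)
+total_num_kv_heads = 8     # number of KV heads, e.g. with grouped-query attention (assumed)
+element_size = 2           # bytes per element for fp16/bf16
+
+kv_block_size = block_size * head_size * total_num_kv_heads * element_size
+print(kv_block_size)       # 262144 with these assumed values
+```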
 
 ## Launching Inference
 
 ### Offline Inference
 
-To start **offline inference** with the DRAM connector,modify the script `examples/vllm_kv_offload.py` to include the `kv_connector_extra_config` for DRAM connector usage:
+To start **offline inference** with the DRAM connector, modify the script `examples/offline_inference.py` to include the `kv_connector_extra_config` for DRAM connector usage:
 
 ```python
-# In examples/vllm_kv_offload.py
+# In examples/offline_inference.py
 ktc = KVTransferConfig(
     ...
-    kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120}}
+    kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120, "kv_block_size": 262144}}
 )
 ```
@@ -47,12 +58,19 @@ Then run the script as follows:
 
 ```bash
 cd examples/
-python vllm_kv_offload.py
+python offline_inference.py
 ```
 
 ### Online Inference
 
-For **online inference** , vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:
+For **online inference**, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.
+
+First, set the Python hash seed:
+```bash
+export PYTHONHASHSEED=123456
+```
+
+Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:
 
 ```bash
 vllm serve /home/models/Qwen2.5-14B-Instruct \
@@ -69,7 +87,8 @@ vllm serve /home/models/Qwen2.5-14B-Instruct \
         "kv_connector_extra_config": {
             "ucm_connector_name": "UcmDram",
             "ucm_connector_config": {
-                "max_cache_size": 5368709120
+                "max_cache_size": 5368709120,
+                "kv_block_size": 262144
             }
         }
     }'
diff --git a/docs/source/getting-started/example/nfs_conn.md b/docs/source/getting-started/example/nfs_conn.md
index d43aae88..95da8f69 100644
--- a/docs/source/getting-started/example/nfs_conn.md
+++ b/docs/source/getting-started/example/nfs_conn.md
@@ -1,2 +1,128 @@
 # NFS Connector
+
+This document provides a usage example and configuration guide for the **NFS Connector**. This connector enables offloading of KV cache from GPU HBM to SSD or local disk, helping reduce memory pressure and support larger models or batch sizes.
+
+## Performance: DRAM Connector vs NFS Connector
+
+### Overview
+When the total size of `kvcache` does not exceed the `max_cache_size` configured for the DRAM Connector, the DRAM Connector demonstrates superior performance. However, when the `kvcache` size exceeds `max_cache_size`, the DRAM Connector experiences significant performance degradation, at which point the NFS Connector becomes the better-performing option.
+

+ UCM
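+
+To make the comparison above concrete, the hedged sketch below estimates the total KV cache footprint of a workload and checks it against a DRAM `max_cache_size` budget. Every number is an assumption made up for illustration; substitute your own deployment values before drawing conclusions.
+
+```python
+# Hedged sketch: rough back-of-envelope check for choosing between the DRAM and NFS connectors.
+# All inputs are illustrative assumptions, not measured values.
+kv_block_size = 33554432        # assumed bytes of KV cache per block across all layers (matches the example below)
+block_size = 128                # assumed tokens covered by one block
+concurrent_requests = 32        # assumed number of in-flight requests
+avg_context_tokens = 16000      # assumed average context length per request
+
+blocks_per_request = -(-avg_context_tokens // block_size)   # ceiling division
+total_kv_bytes = blocks_per_request * kv_block_size * concurrent_requests
+
+max_cache_size = 5 * 1024**3    # assumed 5 GB DRAM budget (the DRAM connector's default)
+if total_kv_bytes <= max_cache_size:
+    print("Total KV cache fits in the DRAM budget; the DRAM connector is likely the faster choice.")
+else:
+    print("Total KV cache exceeds the DRAM budget; consider the NFS connector.")
+```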
+
+## Features
+
+The NFS connector supports the following functionalities:
+
+- `dump`: Offload KV cache blocks from HBM to SSD or local disk.
+- `load`: Load KV cache blocks from SSD or local disk back to HBM.
+- `lookup`: Look up KV blocks stored on SSD or local disk by block hash.
+- `wait`: Ensure that all dump or load operations have completed.
+- `commit`: Mark cache operations as complete and ready for reuse.
+
+## Configuration
+
+To use the NFS connector, you need to configure the `connector_config` dictionary in your model's launch configuration.
+
+### Required Parameters
+
+- `storage_backends` *(required)*:
+  The `storage_backends` directory can be either a local folder or an NFS-mounted directory backed by an SSD drive.
+- `kv_block_size` *(required)*:
+  The size (in bytes) of one KV cache block across all layers, calculated as `block_size * head_size * total_num_kv_heads * element_size * num_layers * 2`.
+
+### Example:
+
+```python
+kv_connector_extra_config={"ucm_connector_name": "UcmNfsStore", "ucm_connector_config":{"storage_backends": "/mnt/test1", "kv_block_size": 33554432}}
+```
+
+## Launching Inference
+
+### Offline Inference
+
+To start **offline inference** with the NFS connector, modify the script `examples/offline_inference.py` to include the `kv_connector_extra_config` for NFS connector usage:
+
+```python
+# In examples/offline_inference.py
+ktc = KVTransferConfig(
+    ...
+    kv_connector_extra_config={"ucm_connector_name": "UcmNfsStore", "ucm_connector_config":{"storage_backends": "/mnt/test1", "kv_block_size": 33554432}}
+)
+```
+
+Then run the script as follows:
+
+```bash
+cd examples/
+export PYTHONHASHSEED=123456
+python offline_inference.py
+```
+
+### Online Inference
+
+For **online inference**, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:
+
+```bash
+export PYTHONHASHSEED=123456
+vllm serve /home/models/Qwen2.5-14B-Instruct \
+--max-model-len 20000 \
+--tensor-parallel-size 2 \
+--gpu_memory_utilization 0.87 \
+--trust-remote-code \
+--port 7800 \
+--kv-transfer-config \
+'{
+    "kv_connector": "UnifiedCacheConnectorV1",
+    "kv_connector_module_path": "unifiedcache.integration.vllm.uc_connector",
+    "kv_role": "kv_both",
+    "kv_connector_extra_config": {
+        "ucm_connector_name": "UcmNfsStore",
+        "ucm_connector_config": {
+            "storage_backends": "/mnt/test",
+            "kv_block_size": 33554432
+        }
+    }
+}'
+```
+
+If you see log output like the following:
+
+```bash
+INFO: Started server process [1049932]
+INFO: Waiting for application startup.
+INFO: Application startup complete.
+```
+
+Congratulations, you have successfully started the vLLM server with the NFS Connector!
+
+After the vLLM server has started successfully, you can interact with the API as follows:
+
+```bash
+curl http://localhost:7800/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "/home/models/Qwen2.5-14B-Instruct",
+    "prompt": "Shanghai is a",
+    "max_tokens": 7,
+    "temperature": 0
+  }'
+```
+To quickly experience the NFS Connector's effect:
+
+1. Start the service with:
+   `--no-enable-prefix-caching`
+2. Send the same request (longer than 128 tokens) twice consecutively.
+3. Remember to enable prefix caching (do not add `--no-enable-prefix-caching`) in production environments.
+### Log Message Structure
+```plaintext
+[UCMNFSSTORE] [I] Task(,,,) finished, elapsed