11 changes: 10 additions & 1 deletion README.md
@@ -5,13 +5,22 @@
</picture>
</p>

<p align="center">
| <a href="docs/source/index.md"><b>Documentation</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management/issues/16"><b>Roadmap</b></a> |
</p>

---

*Latest News* 🔥
- [2025/07/30] We are excited to announce the alpha release of Unified Cache Manager.
- [2025/08/01] We are excited to announce the alpha release of Unified Cache Manager.

---

## Performance
The NFS connector achieves roughly a 4x TTFT (time to first token) speedup.

![perf](docs/source/images/nfs_performance.png)

## Overview

### Motivation
7 changes: 3 additions & 4 deletions docker/Dockerfile
@@ -14,11 +14,10 @@ ENV VLLM_USE_PRECOMPILED=1
RUN VLLM_TARGET_DEVICE=cuda pip install -v -e /vllm-workspace/vllm --extra-index=https://download.pytorch.org/whl/nightly/cu128

# Install unified-cache-management
ARG UCM_REPO=https://github.com/ModelEngine-Group/unified-cache-management.git
ARG UCM_BRANCH=develop
RUN git clone --depth 1 $UCM_REPO --branch $UCM_BRANCH /vllm-workspace/unified-cache-management
COPY . /vllm-workspace/unified-cache-management

RUN pip install -v -e /vllm-workspace/unified-cache-management
RUN export PLATFORM="cuda" && \
pip install -v -e /vllm-workspace/unified-cache-management

# Apply patch for vLLM
RUN cd /vllm-workspace/vllm \
8 changes: 4 additions & 4 deletions docker/Dockerfile-NPU
@@ -4,11 +4,11 @@ FROM quay.io/ascend/vllm-ascend:v0.9.2rc1
WORKDIR /workspace

# Install unified-cache-management
ARG UCM_REPO=https://github.com/ModelEngine-Group/unified-cache-management.git
ARG UCM_BRANCH=develop
RUN git clone --depth 1 $UCM_REPO --branch $UCM_BRANCH /vllm-workspace/unified-cache-management
COPY . /vllm-workspace/unified-cache-management

RUN pip install -v -e /vllm-workspace/unified-cache-management
RUN export PLATFORM="ascend" && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
pip install -v -e /vllm-workspace/unified-cache-management

# Apply patch for vLLM
RUN cd /vllm-workspace/vllm \
2 changes: 1 addition & 1 deletion docs/source/developer/add_connector.md
@@ -1 +1 @@
# Add New Connector
# How To Add New Connector
1 change: 0 additions & 1 deletion docs/source/developer/block_layout.md

This file was deleted.

3 changes: 2 additions & 1 deletion docs/source/developer/index.md
@@ -3,7 +3,8 @@
:::{toctree}
:maxdepth: 2
architecture.md
block_layout.md
add_connector.md
nfs_connector.md
performance_benchmark.md
:::

1 change: 1 addition & 0 deletions docs/source/developer/nfs_connector.md
@@ -0,0 +1 @@
# NFS Connector
1 change: 1 addition & 0 deletions docs/source/developer/performance_benchmark.md
@@ -0,0 +1 @@
# Performance Benchmark
33 changes: 26 additions & 7 deletions docs/source/getting-started/example/dram_conn.md
@@ -2,6 +2,14 @@

This document provides a usage example and configuration guide for the **DRAM Connector**. This connector enables offloading of KV cache from GPU HBM to CPU DRAM, helping reduce memory pressure and support larger models or batch sizes.

## Performance

Combining UCM with vLLM delivers 3–10× improvements in latency and GPU efficiency, especially for long-context LLM tasks.

<p align="center">
<img alt="UCM" src="../../images/dram_perform.png" width="90%">
</p>

## Features

The DRAM connector supports the following functionalities:
@@ -21,38 +29,48 @@ To use the DRAM connector, you need to configure the `connector_config` dictiona
- `max_cache_size` *(optional)*:
Specifies the maximum allowed DRAM memory usage (in **byte**) for caching in `kv_connector_extra_config["ucm_connector_config"]`.
If not provided, it defaults to **5 GB**.
- `kv_block_size` *(optional)*:
Specifies the memory size (in bytes) of a single key or value cache block used in vLLM’s paged attention mechanism, calculated as `block_size * head_size * total_num_kv_heads * element_size` (see the sketch below).
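
A minimal sketch of this calculation, using assumed values (128-token blocks, head size 128, 8 KV heads, bf16 elements) that happen to produce the 262144-byte block size and the 5 GB `max_cache_size` used in the example below; substitute your model's actual configuration:

```python
# Illustrative only: derive the DRAM connector's kv_block_size and max_cache_size.
# All values here are assumptions for the sake of the example; replace them with
# your model's real configuration and the vLLM block_size you run with.
block_size = 128            # tokens per vLLM paged-attention block (assumed)
head_size = 128             # dimension of each attention head (assumed)
total_num_kv_heads = 8      # number of KV heads (assumed)
element_size = 2            # bytes per element, e.g. bfloat16

kv_block_size = block_size * head_size * total_num_kv_heads * element_size
max_cache_size = 5 * 1024 ** 3   # 5 GB expressed in bytes

print(kv_block_size)   # 262144
print(max_cache_size)  # 5368709120
```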

### Example:

```python
kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120}}
# Allocate up to 5 GB of DRAM for the KV cache (5368709120 bytes)
# KV block size (in bytes) is 262144
kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120, "kv_block_size": 262144}}
```

## Launching Inference

### Offline Inference

To start **offline inference** with the DRAM connector,modify the script `examples/vllm_kv_offload.py` to include the `kv_connector_extra_config` for DRAM connector usage:
To start **offline inference** with the DRAM connector, modify the script `examples/offline_inference.py` to include the `kv_connector_extra_config` for DRAM connector usage:

```python
# In examples/vllm_kv_offload.py
# In examples/offline_inference.py
ktc = KVTransferConfig(
...
kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120}}
kv_connector_extra_config={"ucm_connector_name": "UcmDram", "ucm_connector_config":{"max_cache_size": 5368709120, "kv_block_size": 262144}}
)
```

Then run the script as follows:

```bash
cd examples/
python vllm_kv_offload.py
python offline_inference.py
```

### Online Inference

For **online inference** , vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:
For **online inference**, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol.

First, set the Python hash seed:
```bash
export PYTHONHASHSEED=123456
```

Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:

```bash
vllm serve /home/models/Qwen2.5-14B-Instruct \
@@ -69,7 +87,8 @@ vllm serve /home/models/Qwen2.5-14B-Instruct \
"kv_connector_extra_config": {
"ucm_connector_name": "UcmDram",
"ucm_connector_config": {
"max_cache_size": 5368709120
"max_cache_size": 5368709120,
"kv_block_size": 262144
}
}
}'
126 changes: 126 additions & 0 deletions docs/source/getting-started/example/nfs_conn.md
@@ -1,2 +1,128 @@
# NFS Connector

This document provides a usage example and configuration guide for the **NFS Connector**. This connector enables offloading of KV cache from GPU HBM to SSD or Local Disk, helping reduce memory pressure and support larger models or batch sizes.

## Performance: DRAM Connector vs NFS Connector

### Overview
When the total size of `kvcache` does not exceed the `max_cache_size` configured for the DRAM Connector, the DRAM Connector demonstrates superior performance. However, when the `kvcache` size exceeds `max_cache_size`, the DRAM Connector experiences significant performance degradation, at which point the NFS Connector becomes the better-performing option.

<p align="center">
<img alt="UCM" src="../../images/nfs_performance.png" width="90%">
</p>

## Features

The NFS connector supports the following functionalities:

- `dump`: Offload KV cache blocks from HBM to SSD or Local Disk.
- `load`: Load KV cache blocks from SSD or Local Disk back to HBM.
- `lookup`: Look up KV blocks stored in SSD or Local Disk by block hash.
- `wait`: Ensure that all dump or load operations have completed.
- `commit`: Mark cache operations as complete and ready for reuse.

## Configuration

To use the NFS connector, you need to configure the `connector_config` dictionary in your model's launch configuration.

### Required Parameters

- `storage_backends` *(required)*:
The `storage_backends` path can be either a local directory or an NFS-mounted directory backed by an SSD drive.
- `kv_block_size` *(required)*:
`kv_block_size` is the size in bytes of a single KV cache block, calculated as `block_size * head_size * total_num_kv_heads * element_size * num_layers * 2` (see the sketch below).
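
As a hedged illustration, the sketch below plugs assumed values (128-token blocks, head size 128, 8 KV heads, bf16 elements, 64 layers) into this formula; the product matches the 33554432 bytes used in the example below, but your model's real configuration will differ:

```python
# Illustrative only: derive the NFS connector's kv_block_size. All values here are
# assumptions; replace them with your model's actual configuration.
block_size = 128            # tokens per vLLM paged-attention block (assumed)
head_size = 128             # dimension of each attention head (assumed)
total_num_kv_heads = 8      # number of KV heads (assumed)
element_size = 2            # bytes per element, e.g. bfloat16
num_layers = 64             # transformer layers (assumed)

# The factor of 2 accounts for storing both the key and the value cache.
kv_block_size = block_size * head_size * total_num_kv_heads * element_size * num_layers * 2
print(kv_block_size)        # 33554432
```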

### Example:

```python
kv_connector_extra_config={"ucm_connector_name": "UcmNfsStore", "ucm_connector_config":{"storage_backends": "/mnt/test1", "kv_block_size": 33554432}}
```

## Launching Inference

### Offline Inference

To start **offline inference** with the NFS connector, modify the script `examples/offline_inference.py` to include the `kv_connector_extra_config` for NFS connector usage:

```python
# In examples/offline_inference.py
ktc = KVTransferConfig(
...
kv_connector_extra_config={"ucm_connector_name": "UcmNfsStore", "ucm_connector_config":{"storage_backends": "/mnt/test1", "kv_block_size": 33554432}}
)
```

Then run the script as follows:

```bash
cd examples/
export PYTHONHASHSEED=123456
python offline_inference.py
```

### Online Inference

For **online inference**, vLLM with our connector can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:

```bash
export PYTHONHASHSEED=123456
vllm serve /home/models/Qwen2.5-14B-Instruct \
--max-model-len 20000 \
--tensor-parallel-size 2 \
--gpu_memory_utilization 0.87 \
--trust-remote-code \
--port 7800 \
--kv-transfer-config \
'{
"kv_connector": "UnifiedCacheConnectorV1",
"kv_connector_module_path": "unifiedcache.integration.vllm.uc_connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"ucm_connector_name": "UcmNfsStore",
"ucm_connector_config": {
"storage_backends": "/mnt/test",
"kv_block_size": 33554432
}
}
}'
```

If you see log output like the following:

```bash
INFO: Started server process [1049932]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

Congratulations, you have successfully started the vLLM server with the NFS Connector!

After successfully starting the vLLM server, you can interact with the API as follows:

```bash
curl http://localhost:7800/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/home/models/Qwen2.5-14B-Instruct",
"prompt": "Shanghai is a",
"max_tokens": 7,
"temperature": 0
}'
```
To quickly experience the NFS Connector's effect:

1. Start the service with:
`--no-enable-prefix-caching`
2. Send the same request (longer than 128 tokens) twice consecutively, as shown in the sketch after this list
3. Remember to enable prefix caching (do not add `--no-enable-prefix-caching`) in production environments.
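
A minimal sketch of steps 1–2, assuming the server started above is listening on port 7800, the model path from the serve command, and that the `requests` package is installed; the second run should finish noticeably faster when the KV cache is reloaded from storage:

```python
# Minimal sketch: send an identical long prompt twice and compare wall-clock latency.
# The server address, port, and model path are assumptions carried over from the
# serve command above; adjust them to your deployment.
import time
import requests

prompt = "Write a detailed travel guide for Shanghai. " * 40  # well over 128 tokens
payload = {
    "model": "/home/models/Qwen2.5-14B-Instruct",
    "prompt": prompt,
    "max_tokens": 7,
    "temperature": 0,
}
for attempt in (1, 2):
    start = time.perf_counter()
    requests.post("http://localhost:7800/v1/completions", json=payload, timeout=300)
    print(f"attempt {attempt}: {time.perf_counter() - start:.2f}s")
```
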
### Log Message Structure
```plaintext
[UCMNFSSTORE] [I] Task(<task_id>,<direction>,<task_count>,<size>) finished, elapsed <time>s
```
| Component | Description |
|--------------|-----------------------------------------------------------------------------|
| `task_id` | Unique identifier for the task |
| `direction` | `D2S`: Dump to Storage (Device → SSD)<br>`S2D`: Load from Storage (SSD → Device) |
| `task_count` | Number of tasks executed in this operation |
| `size` | Total size of data transferred in bytes (across all tasks) |
| `time` | Time taken for the complete operation in seconds |
1 change: 0 additions & 1 deletion docs/source/getting-started/index.md
@@ -4,7 +4,6 @@
:maxdepth: 2
installation.md
installation_npu.md
quick_start.md
example/index.md
:::

12 changes: 7 additions & 5 deletions docs/source/getting-started/installation.md
@@ -35,19 +35,21 @@ Refer to [Set up using docker](https://docs.vllm.ai/en/latest/getting_started/in
### Build from source code
Follow commands below to install unified-cache-management:
```bash
git clone --depth 1 --branch develop https://github.com/ModelEngine-Group/unified-cache-management.git
# Replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
export PLATFORM=cuda
pip install -v -e .
cd ..
```

## Setup from docker
Download the provided pre-built Docker image, or build the unified-cache-management Docker image with the commands below:
```bash
# Build docker image using source code
git clone --depth 1 --branch develop https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management/docker
docker build -t ucm-vllm:latest -f ./Dockerfile ./
# Build docker image using source code, replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
docker build -t ucm-vllm:latest -f ./docker/Dockerfile ./
```
Then run your container using the following command. You can add or remove Docker parameters as needed.
```bash
17 changes: 11 additions & 6 deletions docs/source/getting-started/installation_npu.md
@@ -44,24 +44,29 @@ Codes of vLLM and vLLM Ascend are placed in /vllm-workspace, you can refer to [v
### Build from source code
Follow commands below to install unified-cache-management:
```bash
git clone --depth 1 --branch develop https://github.com/ModelEngine-Group/unified-cache-management.git
# Replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
export PLATFORM=ascend
pip install -v -e .
cd ..
```

## Setup from docker
Download the provided pre-built Docker image, or build the unified-cache-management Docker image with the commands below:
```bash
# Build docker image using source code
git clone --depth 1 --branch develop https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management/docker
docker build -t ucm-vllm:latest -f ./Dockerfile-NPU ./
# Build docker image using source code, replace <branch_or_tag_name> with the branch or tag name needed
git clone --depth 1 --branch <branch_or_tag_name> https://github.com/ModelEngine-Group/unified-cache-management.git
cd unified-cache-management
docker build -t ucm-vllm:latest -f ./docker/Dockerfile-NPU ./
```
Then run your container using the following command. You can add or remove Docker parameters as needed.
```bash
# Use `--ipc=host` to make sure the shared memory is large enough.
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
docker run --rm \
--network=host \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
1 change: 0 additions & 1 deletion docs/source/getting-started/quick_start.md

This file was deleted.

Binary file added docs/source/images/dram_perform.png
Binary file added docs/source/images/nfs_performance.png