From 4fd830e6d8966ac749dbe1cf246b67cda250334a Mon Sep 17 00:00:00 2001
From: harrisonyhq
Date: Fri, 26 Sep 2025 23:01:02 +0800
Subject: [PATCH 1/3] [Docs]Update DRAM perform data

---
 .../user-guide/prefix-cache/dram_store.md | 27 ++++++++++++++-----
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/docs/source/user-guide/prefix-cache/dram_store.md b/docs/source/user-guide/prefix-cache/dram_store.md
index b51bc6cd..46a3c603 100644
--- a/docs/source/user-guide/prefix-cache/dram_store.md
+++ b/docs/source/user-guide/prefix-cache/dram_store.md
@@ -4,12 +4,27 @@ This document provides a usage example and configuration guide for the **DRAM Co
 
 ## Performance
 
-Combining UCM with vLLM delivers 3–10× improvements in latency and GPU efficiency, especially for long-context LLM tasks.
-
-
- UCM
-
-
+### Overview
+The following are the multi-concurrency performance test results of UCM in the Prefix Cache scenario under a CUDA environment, showing the performance improvements of UCM on two different models.
+During the tests, HBM cache was disabled, and KV Cache was retrieved and matched only from DRAM.
+
+In the QwQ-32B model, the test used one H20 server with two GPUs.
+
+Here, Full Compute refers to pure VLLM inference, while DRAM80% indicates that after UCM pooling, the DRAM hit rate of the KV cache is 80%.
+
+The following table shows the results on the QwQ-32B model:
+| **QwQ-32B** | | | | |
+| ---------------: | -------------: | ------------------: | -------------: | :----------- |
+| **Input length** | **Concurrent** | **Full Compute(s)** | **DRAM80%(s)** | **Speedup** |
+| 4 000 | 1 | 1.0269 | 0.3102 | **+230.9 %** |
+| 8 000 | 1 | 2.0902 | 0.5718 | **+265.5 %** |
+| 16 000 | 1 | 4.4852 | 1.1914 | **+276.4 %** |
+| 4 000 | 2 | 1.5383 | 0.4209 | **+265.4 %** |
+| 8 000 | 2 | 3.1323 | 0.8231 | **+280.5 %** |
+| 16 000 | 2 | 6.7984 | 1.7420 | **+290.2 %** |
+| 4 000 | 4 | 2.8173 | 0.9444 | **+198.2 %** |
+| 8 000 | 4 | 5.2643 | 1.8290 | **+187.8 %** |
+| 16 000 | 4 | 11.3651 | 3.6706 | **+209.6 %** |
 ## Features
 
 The DRAM connector supports the following functionalities:

From 0191943a71556479a4a499159ff6fbc3eead716f Mon Sep 17 00:00:00 2001
From: harrisonyhq
Date: Fri, 26 Sep 2025 23:15:04 +0800
Subject: [PATCH 2/3] [Fix] fix workflow not exiting while error inside bash commands

---
 .github/workflows/unifiedcache_test.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/workflows/unifiedcache_test.yml b/.github/workflows/unifiedcache_test.yml
index 3242ae24..9754691e 100644
--- a/.github/workflows/unifiedcache_test.yml
+++ b/.github/workflows/unifiedcache_test.yml
@@ -43,6 +43,7 @@ jobs:
  --entrypoint /bin/bash \
  vllm/vllm-openai:v0.9.2 \
  -c "
+ set -euo pipefail
  pip install -v -e . --no-build-isolation
  cd \$(pip show vllm | grep Location | awk '{print \$2}') &&
  git apply /workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch &&

From 5e2d19a36c8886d3f94a185da6a2b4bec513bbb9 Mon Sep 17 00:00:00 2001
From: harrisonyhq
Date: Fri, 26 Sep 2025 23:33:26 +0800
Subject: [PATCH 3/3] [Fix] fix workflow

---
 .github/workflows/unifiedcache_test.yml | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/.github/workflows/unifiedcache_test.yml b/.github/workflows/unifiedcache_test.yml
index 9754691e..1cdb4208 100644
--- a/.github/workflows/unifiedcache_test.yml
+++ b/.github/workflows/unifiedcache_test.yml
@@ -46,8 +46,7 @@ jobs:
  set -euo pipefail
  pip install -v -e . --no-build-isolation
  cd \$(pip show vllm | grep Location | awk '{print \$2}') &&
- git apply /workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch &&
- git apply /workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt-sparse.patch
+ git apply /workspace/unified-cache-management/ucm/integration/vllm/patch/0.9.2/vllm-adapt.patch
  cd /workspace/unified-cache-management
  python3 -m unittest discover -s test
  "
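
Review note on PATCH 2/3: the sketch below illustrates what `set -euo pipefail` changes. It assumes the workflow previously reported success because a multi-line `bash -c` script exits with the status of its last command. The two inline scripts are hypothetical stand-ins, not the workflow's actual commands, and `false` plays the role of a failing step.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in scripts; `false` simulates a failing step.

# Without strict mode, bash continues after the failure, so the script's
# exit status is that of its last command (echo), which is 0.
status_without=0
bash -c '
  false
  echo "still running"
' >/dev/null || status_without=$?

# With strict mode, the first failing command aborts the script and its
# non-zero status becomes the script's exit status.
status_with=0
bash -c '
  set -euo pipefail
  false
  echo "never reached"
' >/dev/null || status_with=$?

echo "without strict mode: $status_without"
echo "with strict mode:    $status_with"
```

The first script exits with 0 and the second with 1. One caveat: under `set -e`, a failure inside an `&&` list aborts the script only when the failing command is the last one in the list. In `cd ... && git apply ...`, a failing `git apply` stops the run, but a failing `cd` on its own would not.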