Purpose

Add a DRAM Connector that offloads the KV cache from GPU HBM to CPU DRAM, reducing GPU memory pressure and supporting larger models or batch sizes.
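For context, the core mechanism behind such a connector is swapping KV blocks between GPU HBM and pinned host (DRAM) memory. The sketch below illustrates that raw transfer pattern in PyTorch; the shapes and variable names are illustrative assumptions, not the actual API in this PR.

```python
import torch

# Hypothetical KV block layout: (K/V, num_blocks, block_size, heads, head_dim).
kv_gpu = torch.randn(2, 4, 16, 8, 128, device="cuda", dtype=torch.float16)

# Pinned (page-locked) DRAM buffer; pinning enables asynchronous DMA copies.
kv_dram = torch.empty(kv_gpu.shape, dtype=kv_gpu.dtype, pin_memory=True)

# Offload: HBM -> DRAM, asynchronous w.r.t. the current CUDA stream.
kv_dram.copy_(kv_gpu, non_blocking=True)
torch.cuda.synchronize()  # wait before reusing the HBM block or reading DRAM

# ... the GPU block can now be freed or reused for other requests ...

# Reload on a cache hit: DRAM -> HBM.
kv_gpu.copy_(kv_dram, non_blocking=True)
torch.cuda.synchronize()
```

A real connector additionally manages block mapping, lookup, and lifetimes on top of this transfer path; the snippet only shows the HBM/DRAM copy itself.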

Modifications

Test

Unit Test

Passes the unit test in test/test_ucm_dram.py:

```bash
python test/test_ucm_dram.py
```

Performance Test

Using llmperf with the following command:

```bash
python token_benchmark_ray.py --model "/home/models/QwQ-32B" --mean-input-tokens 16000 --mean-output-tokens 1 --max-num-completed-requests 10 --num-concurrent-requests 1
```

to evaluate the connector's TTFT (time to first token) across a range of input token lengths with the QwQ-32B model, we got the following results:

| Input tokens | Local Disk Connector | vLLM, prefix cache disabled | DRAM Connector |
|---|---|---|---|
| 8K | 0.88 s | 2.16 s | 0.65 s |
| 16K | 1.79 s | 5.01 s | 1.26 s |
| 32K | 3.51 s | 12.54 s | 2.21 s |

The DRAM Connector gives the lowest TTFT at every input length, and its advantage grows as the input gets longer.
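The relative gains can be read straight off the table; the short script below merely restates it as speedup ratios (no new measurements):

```python
# TTFT from the table above, in seconds.
ttft = {
    "8K":  {"disk": 0.88, "no_prefix_cache": 2.16, "dram": 0.65},
    "16K": {"disk": 1.79, "no_prefix_cache": 5.01, "dram": 1.26},
    "32K": {"disk": 3.51, "no_prefix_cache": 12.54, "dram": 2.21},
}
for length, t in ttft.items():
    print(f"{length}: {t['disk'] / t['dram']:.2f}x faster than local disk, "
          f"{t['no_prefix_cache'] / t['dram']:.2f}x faster than no prefix cache")
```

This works out to roughly 1.35x–1.59x over the Local Disk Connector and 3.3x–5.7x over vLLM with the prefix cache disabled.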

@harrisonyhq requested a review from @ygwpz on July 29, 2025.
@ygwpz merged commit dfe3e14 into ModelEngine-Group:develop on July 30, 2025.