[PD Disaggregation] decode use cpu buffer to receive cache from prefill #5027
base: develop
Conversation
Thanks for your contribution!
Pull Request Overview
This PR implements a feature for decode instances to use a CPU buffer to receive cache from prefill in PD (Prefill-Decode) disaggregation deployments. This optimization allows the decode instance to buffer incoming cache data in CPU memory before swapping it to GPU, potentially improving resource utilization and system throughput.
Key Changes
- Refactored cache information sending logic to separate prefill-to-messager and decode-to-prefill communication paths
- Added CPU buffer allocation and management for splitwise cache in decode mode via the `--splitwise-cache-buffer-size` parameter
- Implemented a CPU-to-GPU cache swapping mechanism for buffered cache data
- Updated resource management to support pre-allocation and deferred GPU resource assignment for prefilled requests
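The receive-then-swap flow described by these changes can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the class name `SplitwiseCpuBuffer`, its methods, and the `swap_to_gpu` callback are all hypothetical.

```python
import threading
import queue


class SplitwiseCpuBuffer:
    """Illustrative sketch of a decode-side CPU staging buffer: cache
    blocks arriving from prefill land in CPU memory first, and a
    background thread drains them to the GPU and recycles the blocks."""

    def __init__(self, num_cpu_blocks, swap_to_gpu):
        self.free_cpu_blocks = list(range(num_cpu_blocks))
        self.pending = queue.Queue()       # (request_id, cpu_block_ids)
        self.swap_to_gpu = swap_to_gpu     # callback performing the CPU->GPU copy
        self.lock = threading.Lock()
        self.worker = threading.Thread(target=self._swap_loop, daemon=True)
        self.worker.start()

    def allocate(self, n):
        """Reserve n CPU blocks for an incoming prefill transfer."""
        with self.lock:
            if len(self.free_cpu_blocks) < n:
                return None                # caller retries once blocks free up
            return [self.free_cpu_blocks.pop() for _ in range(n)]

    def on_cache_received(self, request_id, cpu_block_ids):
        """Called once prefill has finished writing into the CPU blocks."""
        self.pending.put((request_id, cpu_block_ids))

    def _swap_loop(self):
        while True:
            request_id, cpu_block_ids = self.pending.get()
            self.swap_to_gpu(request_id, cpu_block_ids)  # CPU -> GPU copy
            with self.lock:                              # recycle the buffer
                self.free_cpu_blocks.extend(cpu_block_ids)
            self.pending.task_done()
```

Decoupling receipt from GPU assignment this way is what allows the "deferred GPU resource assignment" mentioned above: GPU blocks only need to exist when the swap runs, not when the transfer starts.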
Reviewed Changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| `fastdeploy/splitwise/splitwise_connector.py` | Refactored `send_cache_infos` into two separate methods: `send_cache_info_to_messager` for prefill and `send_cache_info_to_prefill` for decode |
| `fastdeploy/engine/sched/resource_manager_v1.py` | Added splitwise CPU buffer support with `preallocated_reqs` tracking, `pre_recycle_resource`, and `add_prefilled_request` methods |
| `fastdeploy/engine/common_engine.py` | Updated task insertion and prefilled request processing to support the CPU buffer workflow |
| `fastdeploy/cache_manager/cache_messager.py` | Implemented CPU buffer allocation and a CPU-to-GPU swap thread for decode instances |
| `fastdeploy/cache_manager/prefix_cache_manager.py` | Added splitwise CPU buffer management APIs including allocate/recycle/swap operations |
| `fastdeploy/engine/args_utils.py` | Added `--splitwise-cache-buffer-size` parameter with validation for decode mode only |
| `fastdeploy/config.py` | Added cache buffer configuration with block number calculation |
| `fastdeploy/cache_manager/utils.py` | Extracted cache byte size and dtype conversion logic into reusable utility functions |
| `fastdeploy/cache_manager/cache_data.py` | Added `SPLITWISE_CPU2GPU` cache status enum value |
| `examples/splitwise/*.sh` | Updated example scripts with port checking utilities and improved consistency |
| `examples/splitwise/README.md` | Added documentation for running splitwise examples |
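The decode-only validation noted for `args_utils.py` could look roughly like the following sketch. The role flag name, defaults, and error message here are assumptions for illustration, not the PR's actual code:

```python
import argparse


def build_parser():
    """Minimal parser modeling the two flags relevant to this validation."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--splitwise-role",
                        choices=["prefill", "decode", "mixed"],
                        default="mixed")
    parser.add_argument("--splitwise-cache-buffer-size", type=int, default=0,
                        help="CPU buffer size used by decode instances to "
                             "stage cache received from prefill")
    return parser


def validate(args):
    # The CPU staging buffer only makes sense on the decode side,
    # so reject the flag for any other role.
    if args.splitwise_cache_buffer_size > 0 and args.splitwise_role != "decode":
        raise ValueError(
            "--splitwise-cache-buffer-size is only supported when "
            "--splitwise-role is 'decode'")
    return args
```

Failing fast at argument-parsing time keeps a misconfigured prefill instance from silently allocating an unused CPU buffer.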
Motivation
Enable the decode instance to use a CPU buffer to receive cache from the prefill instance.
Modifications
Usage or Command
The decode instance can set the `--splitwise-cache-buffer-size 10` argument.
Accuracy Tests
TODO
Checklist
- PR title tags: `[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`
- Run `pre-commit` before commit.
- For a `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.