[RFC]: Disaggregated prefilling and KV cache transfer roadmap #10818
Will the logic for model upgrades and instance service discovery be introduced in XpYd?
Model upgrades --- not in the scope of the disaggregated prefill roadmap for now. But this IS important; RLHF-style training also needs it, so I can add it if it is a common need. Please ❤️ this message if you need this feature.
When using XpYd in production, model upgrades are frequent. During the upgrade period, two versions of the model coexist, so I think the vLLM gateway needs to pair prefill and decode instances, ensuring they come from the same model version.
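A gateway-side pairing rule like the one described above can be sketched in a few lines. This is a hypothetical illustration, not vLLM gateway code: the `Instance` fields and the `pick_pd_pair` helper are invented for the example.

```python
# Hypothetical sketch: version-aware pairing of prefill/decode instances,
# so a P/D pair is never split across two model versions during an upgrade.
import itertools
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instance:
    name: str
    role: str            # "prefill" or "decode"
    model_version: str
    load: float          # e.g. current queue depth; lower is better

def pick_pd_pair(instances: list) -> Optional[tuple]:
    """Return the least-loaded (prefill, decode) pair sharing a model version."""
    prefills = [i for i in instances if i.role == "prefill"]
    decodes = [i for i in instances if i.role == "decode"]
    candidates = [
        (p, d) for p, d in itertools.product(prefills, decodes)
        if p.model_version == d.model_version   # never mix versions
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda pd: pd[0].load + pd[1].load)
```

During an upgrade the gateway simply sees two version groups and routes each request entirely within one group.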
Glad to see the progress on supporting the P/D disaggregation feature.
To better support the P/D disaggregated architecture, we are actively developing a dual-tiered scheduler, implemented in Go, to optimize XpYd and request management. This upgrade builds on our P/D disaggregated feature within vLLM and is now live in our production environment, showing improved performance with good stability. The core design of our scheduler is outlined below:

● Observability: To reduce reliance on any single inference engine, we have implemented a Go-based reverse proxy that directly collects and computes instance-level performance metrics in real time, such as TTFT, TPOT, instance load, and cache status.

● Hierarchical Scheduling System: Our system features a Cluster Level Scheduler (CLS) and an Instance Level Scheduler (ILS), aiming to maximize goodput per GPU while meeting latency SLOs. The CLS leverages a workload-aware performance-cost model to refine request routing, determining whether to use a disaggregated or colocated serving mode and pinpointing the most cost-effective GPU types. The ILS then assigns the most suitable P/D instance pairs for incoming requests, optimizing load balancing and cache reuse.

● Dynamic P/D Adjustment: By leveraging instance-level metrics, we've developed a role-shift module that periodically evaluates instance load stats and decides when to add, remove, or switch P/D instances as needed.

We look forward to releasing the code for our global scheduler to OSS shortly. Additional features are currently in development. We welcome any discussions and opportunities for collaboration.
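The role-shift idea above can be illustrated with a toy threshold policy. This is a minimal sketch under assumed utilization thresholds, not the Go scheduler described in the comment; the function name and thresholds are invented for illustration.

```python
# Hypothetical sketch of a periodic role-shift decision: given aggregate
# prefill and decode utilization (0.0-1.0), pick an adjustment action.
# The real scheduler described above uses a richer performance-cost model.
def decide_role_shift(prefill_util: float, decode_util: float,
                      high: float = 0.85, low: float = 0.3) -> str:
    """Return the action a role-shift loop might take on this tick."""
    if prefill_util > high and decode_util < low:
        return "shift_decode_to_prefill"   # decode capacity is idle
    if decode_util > high and prefill_util < low:
        return "shift_prefill_to_decode"   # prefill capacity is idle
    if prefill_util > high and decode_util > high:
        return "scale_out"                 # both pools saturated
    if prefill_util < low and decode_util < low:
        return "scale_in"                  # both pools underused
    return "no_op"
```

In practice such a loop would also smooth the metrics over a window to avoid flapping between roles.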
@yuleil Hello, I was wondering how to use nsys to profile such a distributed system. I have a lot of experience using nsys to profile vLLM, but for P/D disaggregation I have to run the prefill and decode instances separately, and I want to use one nsys session to profile both. After checking the help docs I still cannot find a solution.
Let's also add some orchestration support to the roadmap. It seems that how to orchestrate such a stateful application is not covered yet. Let's create a sub-task to track it.
Hi @KuntaiDu |
Chunked prefill is useful for controlling the peak GPU memory usage when prefilling a very long context. So for the long-context use case, it makes sense to use both.
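The memory-control point above can be sketched in a few lines: the prompt is processed in fixed-size chunks, so each prefill step only materializes activations for `chunk_size` tokens at a time. The helper name and chunk size here are arbitrary example values, not vLLM internals.

```python
# Illustrative sketch: split a long prompt into fixed-size chunks so the
# peak activation memory of prefill is bounded by chunk_size, not by the
# full prompt length. Chunk size 512 is an arbitrary example value.
def chunk_prompt(token_ids: list, chunk_size: int = 512) -> list:
    """Split token_ids into consecutive chunks of at most chunk_size."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]
```

With disaggregated prefill, each chunk is prefilled on the P instance before the accumulated KV cache is handed to the D instance.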
Fully asynchronous KV cache transfer is a great feature; it can reduce latency. I hope it can be merged into the main branch soon. Will it be merged, and if so, when? @yuleil
@KuntaiDu I am trying to implement XpYd (taking 1P3D as an example); here are my method and the problem I encountered.

Method

Problem
But the problem I encountered is: when I send an instance of the request

Note: Even if I use TCPStore to transfer the KV cache, the same problem occurs: the system gets stuck after the next request changes pd_pair, and from the logs I found that the stuck position is after the D instance sends the signal. So it is not a problem with NCCL at all; it is stuck somewhere else. Maybe I need to check my system implementation again.

Note
My send func is as follows:

recv func:
I solved the problem of the system hanging when changing pd_pair.

Problem description
I only passed in parameters when creating the

Solution
I use a queue to pass parameters to the

New Bug
But I encountered a new problem:

What is going on here? How can I solve this problem?
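The queue-based fix described above (passing updated pairing parameters at transfer time instead of only at construction time) can be sketched as follows. All names here are hypothetical illustrations, not the poster's actual code.

```python
# Hypothetical sketch of the queue-based fix: instead of freezing pd_pair
# at construction, the transfer worker drains an update queue before each
# transfer, so it always acts on the newest pairing decision.
import queue
from typing import Optional

class KVTransferWorker:
    def __init__(self):
        self.pair_updates = queue.Queue()          # scheduler -> worker
        self.pd_pair: Optional[tuple] = None       # latest (prefill, decode)

    def update_pair(self, prefill: str, decode: str) -> None:
        # Called by the scheduler whenever routing changes.
        self.pair_updates.put((prefill, decode))

    def current_pair(self) -> Optional[tuple]:
        # Drain the queue so we keep only the newest pairing.
        while True:
            try:
                self.pd_pair = self.pair_updates.get_nowait()
            except queue.Empty:
                break
        return self.pd_pair
```

The key property is that a stale pairing captured at construction time can no longer cause the sender and receiver to wait on different peers.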
Hi @wyzzbond, can you tell me the progress of this scheduler? Do we have a link for this issue?
We are planning to launch this next week and want to ensure everything is on track. |
Updated the current status in the tracker at the beginning. |
I made a tutorial explaining how to integrate LMCache with vLLM to support cross-server P/D disaggregation: https://github.com/bytedance-iaas/splitwise-demos . This demo can be expanded to nPmD, and each service can be scaled independently.
@KuntaiDu, when is it expected that a version of vLLM with P/D disaggregation will be released?
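For readers wanting to try what already exists: vLLM ships an experimental single-pair disaggregated prefill behind the `--kv-transfer-config` flag. The sketch below assumes the experimental `PyNcclConnector`-based setup; the model name and ports are placeholders, and these flag names may change between releases.

```shell
# Prefill instance (KV producer) -- experimental API, subject to change.
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8100 \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'

# Decode instance (KV consumer).
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8200 \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'
```

A proxy then forwards each request to the prefill instance first and streams the completion from the decode instance.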
Motivation.
Here is the roadmap for disaggregated prefill (and general-purpose kv cache transfer). Feel free to contribute 😁.
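The "general-purpose KV cache transfer" goal can be pictured as a minimal store interface keyed by a hash of the token prefix. This is a hedged sketch with invented names; real designs (like the KVCache-store-based XpYd approach discussed in this thread) add eviction, partial-prefix lookup, and RDMA transports.

```python
# Hypothetical sketch of a KV-cache store: the prefill side puts serialized
# KV blocks keyed by the token prefix; the decode side gets them back.
import hashlib
from typing import Optional

class KVCacheStore:
    def __init__(self):
        self._store = {}  # key -> serialized KV blob

    @staticmethod
    def key(token_ids: list) -> str:
        # Content-addressed key: identical prefixes map to the same entry.
        return hashlib.sha256(repr(token_ids).encode("utf-8")).hexdigest()

    def put(self, token_ids: list, kv_blob: bytes) -> None:
        self._store[self.key(token_ids)] = kv_blob

    def get(self, token_ids: list) -> Optional[bytes]:
        # Returns None on a miss, in which case the caller recomputes prefill.
        return self._store.get(self.key(token_ids))
```

Because the store decouples producers from consumers, any P instance can serve any D instance, which is what makes XpYd scaling possible without pairwise P2P connections.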
Proposed Change.
- [Feature] Allow specifying region-of-interest / `roi` on the `num_head` dimension and the `layer` dimension (currently the `roi` tensor only contains the tokens dimension) (mooncake team proposed new design)
- [Feature] XpYd support by building multiple connections between Xp and Yd (we now go for a KVCache-store-based design; if you prefer direct P2P please raise concerns in the vLLM #feat-prefill-disaggregation channel)
- [Feature] `vllm connect` talking to the Engine instead of talking to the API server ([Frontend] Disaggregate prefill decode with zmq #11791)
- Better memory control (postponed to 2025 Q2)
- [Perf] Reusing vLLM page table to avoid memory fragmentation
- [Perf] Reduce number of tensor copies
- [ ] InfiniteStore (Yet another Prefill-Decode separation in vllm #9079 @chenqianfzh) (no response from the developer)
- [ ] Valkey ([Core] Disaggregated prefilling supports valkey #8724 @zeroorhero @pizhenwei) (no response from the developer)

Feedback Period.
No response
CC List.
@youkaichao @zeroorhero @comaniac @rkooo567 @WoosukKwon @liweiqing1997 @ShangmingCai @Leaf996 @coolkp @sjnaj @K-Mistele @ApostaC @YaoJiayi @njhill
Any Other Things.
No response