[Feature]: Simple Data Parallelism in vLLM #9206
Comments
Hi, I'll be working on this issue
I like this idea, providing flexibility around deployment is great!
Related issue: #9198
Previously, I encountered an issue when running vLLM (an older version) on a machine with multiple GPUs (a CI environment) without specifying ... @archthegit and I chatted a little and plan to investigate and potentially work together on this. 😄
Specifying CUDA_VISIBLE_DEVICES before starting vLLM lets each of several vLLM instances on one machine use a different card, but the instances may then have port conflicts, so the best way to run multiple vLLM instances on a single machine is Docker. The current implementation of pp/tp is coupled with the model layers, so it is very difficult to implement dp. If you don't use pp/tp, you can patch GroupCoordinator to implement dp, but without asynchronous scheduling there is ... If you have many GPUs doing decoder-only model inference, a "Disaggregating Prefill and Decoding" architecture has higher throughput than dp; refer to DistServe and Mooncake.
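For illustration, here is a minimal sketch of that per-GPU launch pattern, started from a Python driver to keep ports distinct per instance. The model name, GPU indices, and port numbers are assumptions, not values from the thread:

```python
# Sketch: launch one OpenAI-compatible vLLM server per GPU, each on its own port,
# so the instances do not collide. GPU indices, ports, and the model are assumptions.
import os
import subprocess

MODEL = "facebook/opt-125m"  # placeholder model

procs = []
for gpu_id, port in [(0, 8000), (1, 8001)]:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # pin this instance to a single card
    procs.append(
        subprocess.Popen(
            ["python", "-m", "vllm.entrypoints.openai.api_server",
             "--model", MODEL, "--port", str(port)],
            env=env,
        )
    )

# Two independent vLLM servers are now running; a client can shard requests
# across http://localhost:8000 and http://localhost:8001.
for p in procs:
    p.wait()
```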
A prototype: first implement asynchronous scheduling, then implement dp by patching GroupCoordinator.
What is patching GroupCoordinator? Can I verify the performance of dp through benchmark_throughput.py?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
It seems that sglang supports dp by launching multiple servers, e.g. https://github.com/sgl-project/sglang/blob/main/docs/backend/server_arguments.md. Why doesn't vLLM just implement it like this?
Reopening this to focus on the simple offline use case.
🚀 The feature, motivation and pitch
It is common to have a scenario where folks want to deploy multiple vLLM instances on a single machine because the machine has several GPUs (commonly 8). The work can then be sharded across the replicated instances. This issue describes the intended UX for such a feature. Notably, we might not want to tackle large distributed settings (100s of parallel vLLM instances), which are better handled by a higher layer.
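As a concrete illustration of the offline side of this, a minimal sketch of sharding a prompt list across per-GPU engine processes; the model name, replica count, and round-robin sharding scheme are assumptions for illustration, not the proposed API:

```python
# Sketch: simple offline data parallelism by sharding prompts across N worker
# processes, each owning one GPU and one vLLM engine. Model and replica count
# are placeholders.
import os
from multiprocessing import Pool, set_start_method

MODEL = "facebook/opt-125m"  # placeholder model
NUM_REPLICAS = 2             # e.g. one replica per GPU


def run_shard(args):
    gpu_id, prompts = args
    # Pin this replica to a single GPU before the engine initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM, SamplingParams  # import inside the worker process

    llm = LLM(model=MODEL)
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    return [o.outputs[0].text for o in outputs]


if __name__ == "__main__":
    set_start_method("spawn")  # avoid forking a CUDA-initialized parent
    prompts = [f"Prompt number {i}" for i in range(16)]
    # Round-robin shard the prompts across replicas.
    shards = [(r, prompts[r::NUM_REPLICAS]) for r in range(NUM_REPLICAS)]
    with Pool(NUM_REPLICAS) as pool:
        results = pool.map(run_shard, shards)
    for shard_outputs in results:
        print(shard_outputs)
```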
For the server, the same argument applies: route requests to different engine processes. We can start with simple round-robin load balancing, but a good stretch goal is session affinity or prefix-aware routing.
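A minimal sketch of the round-robin starting point, written as a thin client-side balancer over independent OpenAI-compatible vLLM servers; the endpoint URLs and model name are assumptions:

```python
# Sketch: round-robin routing across independent vLLM OpenAI-compatible servers.
# Endpoint URLs and the model name are placeholders.
import itertools
import requests

ENDPOINTS = ["http://localhost:8000", "http://localhost:8001"]
MODEL = "facebook/opt-125m"  # placeholder model

_next_endpoint = itertools.cycle(ENDPOINTS)


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send a completion request to the next engine in round-robin order."""
    base = next(_next_endpoint)
    resp = requests.post(
        f"{base}/v1/completions",
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]


if __name__ == "__main__":
    for i in range(4):
        print(complete(f"Prompt number {i}"))
```

Session affinity or prefix-aware routing would replace the `itertools.cycle` choice with a lookup keyed on the session or prompt prefix, so repeated prefixes land on the engine that already has them cached.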
Alternatives
Additional context
No response