[Feature]: Simple Data Parallelism in vLLM #9206
Comments
Hi, I'll be working on this issue
I like this idea, providing flexibility around deployment is great!
Related issue: #9198
Previously, I encountered an issue when running vLLM (an older version) on a machine with multiple GPUs (a CI environment) without specifying ... @archthegit and I chatted a little and plan to investigate and potentially work together on this. 😄
Specifying CUDA_VISIBLE_DEVICES before starting vLLM lets each of several vLLM instances on one machine use a different card, but the instances may then have port conflicts, so the best way to run multiple vLLM instances on a single machine is Docker. The current implementation of pp/tp is coupled with the model layers, so it is very difficult to implement dp. If you don't use pp/tp, you can patch GroupCoordinator to implement dp, but without asynchronous scheduling there is ... If you have many GPUs doing decoder-only model inference, a "Disaggregating Prefill and Decoding" architecture has higher throughput than dp; refer to DistServe and Mooncake.
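For illustration, here is a minimal sketch of that per-GPU launch pattern, started from a Python driver to keep ports distinct per instance. The model name, GPU indices, and port numbers are assumptions, not values from the thread:

```python
# Sketch: launch one OpenAI-compatible vLLM server per GPU, each on its own port,
# so the instances do not collide. GPU indices, ports, and the model are assumptions.
import os
import subprocess

MODEL = "facebook/opt-125m"  # placeholder model

procs = []
for gpu_id, port in [(0, 8000), (1, 8001)]:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # pin this instance to a single card
    procs.append(
        subprocess.Popen(
            ["python", "-m", "vllm.entrypoints.openai.api_server",
             "--model", MODEL, "--port", str(port)],
            env=env,
        )
    )

# Two independent vLLM servers are now running; a client can shard requests
# across http://localhost:8000 and http://localhost:8001.
for p in procs:
    p.wait()
```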
A prototype: first implement asynchronous scheduling, then implement dp by patching GroupCoordinator.
What is patching GroupCoordinator? Can I verify the performance of dp through benchmark_throughput.py?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
It seems that sglang supports dp by launching multiple servers, e.g. https://github.com/sgl-project/sglang/blob/main/docs/backend/server_arguments.md. Why doesn't vLLM just implement it like this?
Reopening this to focus on the simple offline use case.
🚀 The feature, motivation and pitch
It is common to have a scenario where folks want to deploy multiple vLLM instances on a single machine because the machine has several GPUs (commonly 8). The work can then be sharded across the replicated instances. This issue describes the intended UX for such a feature. Notably, we might not want to tackle large distributed settings (100s of parallel vLLM instances), which are better handled by a higher layer.
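As a concrete illustration of the offline side of this, a minimal sketch of sharding a prompt list across per-GPU engine processes; the model name, replica count, and round-robin sharding scheme are assumptions for illustration, not the proposed API:

```python
# Sketch: simple offline data parallelism by sharding prompts across N worker
# processes, each owning one GPU and one vLLM engine. Model and replica count
# are placeholders.
import os
from multiprocessing import Pool, set_start_method

MODEL = "facebook/opt-125m"  # placeholder model
NUM_REPLICAS = 2             # e.g. one replica per GPU


def run_shard(args):
    gpu_id, prompts = args
    # Pin this replica to a single GPU before the engine initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM, SamplingParams  # import inside the worker process

    llm = LLM(model=MODEL)
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    return [o.outputs[0].text for o in outputs]


if __name__ == "__main__":
    set_start_method("spawn")  # avoid forking a CUDA-initialized parent
    prompts = [f"Prompt number {i}" for i in range(16)]
    # Round-robin shard the prompts across replicas.
    shards = [(r, prompts[r::NUM_REPLICAS]) for r in range(NUM_REPLICAS)]
    with Pool(NUM_REPLICAS) as pool:
        results = pool.map(run_shard, shards)
    for shard_outputs in results:
        print(shard_outputs)
```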
For the server, the same argument applies: route requests to different engine processes. We can start with simple round-robin load balancing, but a good stretch goal is session affinity or prefix-aware routing.
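A minimal sketch of the round-robin starting point, written as a thin client-side balancer over independent OpenAI-compatible vLLM servers; the endpoint URLs and model name are assumptions:

```python
# Sketch: round-robin routing across independent vLLM OpenAI-compatible servers.
# Endpoint URLs and the model name are placeholders.
import itertools
import requests

ENDPOINTS = ["http://localhost:8000", "http://localhost:8001"]
MODEL = "facebook/opt-125m"  # placeholder model

_next_endpoint = itertools.cycle(ENDPOINTS)


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send a completion request to the next engine in round-robin order."""
    base = next(_next_endpoint)
    resp = requests.post(
        f"{base}/v1/completions",
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]


if __name__ == "__main__":
    for i in range(4):
        print(complete(f"Prompt number {i}"))
```

Session affinity or prefix-aware routing would replace the `itertools.cycle` choice with a lookup keyed on the session or prompt prefix, so repeated prefixes land on the engine that already has them cached.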
Alternatives
Additional context
No response