[Distributed][refactor] Add base class for device-specific communicator #11324
Conversation
CI failed due to network issues. This PR is ready for review now, thanks in advance! @youkaichao
@youkaichao Would you mind taking another look?
Or, if you are worried that the code changes are too big and want us to split the PR, for example:
- a separate PR for `CommunicatorBase` and the interface change
- adapting cuda/rocm, hpu, tpu, and xpu separately, split into three follow-up PRs

Please let us know; we'd be happy to do so.
Sorry, I'm super busy recently. Will review this week.
f"{current_platform.device_type}:{local_rank}") | ||
else: | ||
import torch_xla.core.xla_model as xm | ||
self.device = xm.xla_device(local_rank) |
Hi @youkaichao, I'm not sure whether the initialization of `self.device` is correct for the neuron, openvino and tpu devices. Appreciate your help!
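For context, here is a minimal sketch of how the device selection in the quoted hunk could look when hoisted into one helper. The function name `init_device` and the branch condition are assumptions for illustration, not the exact vLLM code:

```python
import torch
from vllm.platforms import current_platform


def init_device(local_rank: int):
    """Pick the per-rank device (sketch based on the quoted hunk)."""
    if current_platform.device_type != "tpu":
        # cuda/rocm/hpu/xpu accept a "<device_type>:<index>" string.
        return torch.device(f"{current_platform.device_type}:{local_rank}")
    # XLA-based devices (e.g. TPU) are addressed via torch_xla instead
    # of a plain torch.device string.
    import torch_xla.core.xla_model as xm
    return xm.xla_device(local_rank)
```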
part of #11162

This PR provides a base class `CommunicatorBase` for device-specific communicators (`HpuCommunicator`, `TpuCommunicator` and `XpuCommunicator`), avoiding the cumbersome dispatch in each communicator operator of `GroupCoordinator`, e.g., https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L342-L353

In this PR, the communication-related classes are organized as in the following figure. This allows new backends to implement their own communicators and have the platform dispatch to them dynamically.
