[Bugfix][Ray] Set the cuda context eagerly in the ray worker #19583

Merged: 9 commits into vllm-project:main, Jun 20, 2025

Conversation

@kouroshHakha kouroshHakha (Collaborator) commented Jun 12, 2025

This PR sets the CUDA context eagerly in the Ray actor (torch.cuda.set_device is actually lazy, I think).

There was a bug in Ray + P/D where the NIXL/UCX CUDA context check, done via direct CUDA calls, fails because Ray runs model execution in a background thread that does not inherit the CUDA context.
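A minimal sketch of what the eager approach amounts to (helper name hypothetical; this is an illustration, not the PR's exact code):

import torch

def ensure_cuda_context_eagerly(device: torch.device) -> None:
    # torch.cuda.set_device records the target device, but the CUDA
    # context itself may only be created lazily on the first CUDA call
    # made from the current thread.
    torch.cuda.set_device(device)
    # Touching the device forces eager context creation, so that later
    # driver-level checks (e.g. by NIXL/UCX) see a valid context.
    torch.zeros(1, device=device)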

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@kouroshHakha kouroshHakha changed the title Set the cuda context immediately in the ray worker [Bugfix][Ray] Set the cuda context immediately in the ray worker Jun 12, 2025
@mergify mergify bot added the ci/build label Jun 12, 2025
@kouroshHakha kouroshHakha changed the title [Bugfix][Ray] Set the cuda context immediately in the ray worker [Bugfix][Ray] Set the cuda context eagerly in the ray worker Jun 12, 2025
@ruisearch42 ruisearch42 added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 12, 2025
@ruisearch42 ruisearch42 (Collaborator) left a comment

Thanks for the fix.

cc @youkaichao, who recently investigated the same problem in a different context, to take a look:
pytorch/pytorch#155668
https://forums.developer.nvidia.com/t/whats-the-expected-behavior-of-calling-cudagetdevice-when-the-process-has-no-cuda-context/335784

@@ -759,6 +759,7 @@ steps:
- torchrun --nproc_per_node=2 distributed/test_ca_buffer_sharing.py
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
- pytest -v -s -x lora/test_mixtral.py
- pytest -v -s cuda/test_cuda_context.py
Collaborator

This doesn't need to be under 4xA100 tests?

@@ -113,6 +113,9 @@ def setup_device_if_necessary(self):
# Not needed
pass
else:
if current_platform.is_cuda():
from vllm.platforms.cuda import set_cuda_context
set_cuda_context(self.worker.device)
Collaborator

Should we assert that this returns True here?

Member

Why change here? You just need to change the set_device implementation in the CUDA platform.

@kouroshHakha kouroshHakha (Collaborator, Author) commented Jun 13, 2025

So the set_device method does not explicitly exist on the CUDA Platform definition. It's going through __getattr__, I think, here: https://github.com/vllm-project/vllm/blob/main/vllm/platforms/interface.py#L498? Otherwise I don't know how set_device is mapped to torch.cuda.set_device on the CUDA Platform class. For lack of that understanding, I implemented it separately.

Member

Yeah, you can add a set_device interface, default it to torch.xxx.set_device, and override it in CUDA.
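A rough sketch of that shape (class and method names assumed from the discussion, not the merged code):

import torch

class Platform:
    device_type: str = ""

    def set_device(self, device: torch.device) -> None:
        # Default: delegate to the matching torch backend module,
        # e.g. torch.cuda when device_type == "cuda".
        getattr(torch, self.device_type).set_device(device)

class CudaPlatform(Platform):
    device_type = "cuda"

    def set_device(self, device: torch.device) -> None:
        torch.cuda.set_device(device)
        # Force eager context creation; set_device alone is lazy.
        torch.zeros(1, device=device)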


try:
# Load CUDA driver library
cuda = ctypes.CDLL('libcuda.so')
Collaborator

Wondering how robust this is, as compared to the workaround of creating a tensor.

@@ -50,6 +51,65 @@ def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> _R:
return wrapper


def set_cuda_context(device: Union[torch.device, int]) -> bool:
Member

I would suggest just creating a tensor on the target device and then calling torch.cuda.set_device. Creating a context this way might interfere with other functionality; e.g., cuCtxCreate_v2 might be deprecated in the future (in favor of cuCtxCreate_v3), and calling the function directly could easily break other things.
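A sketch of the suggested workaround (assuming a CUDA device is available; not the merged code):

import torch

def set_cuda_context(device: torch.device) -> None:
    # Allocating a small tensor on the target device makes PyTorch
    # create and bind the CUDA context in the calling thread, without
    # direct driver calls such as cuCtxCreate_v2.
    torch.zeros(1, device=device)
    torch.cuda.set_device(device)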

@youkaichao youkaichao (Member) left a comment

Didn't expect it to occur here too 👀

@kouroshHakha kouroshHakha (Collaborator, Author) commented Jun 13, 2025

Cool, I think I addressed all the comments. Please take a look again @youkaichao @ruisearch42
PS: I think the failed tests are not relevant?
PS: new tests pass:
[screenshot of passing tests]

from vllm.platforms import current_platform


def check_cuda_context():
Member

use torch._C._cuda_hasPrimaryContext(int device)

@kouroshHakha kouroshHakha (Collaborator, Author) commented Jun 18, 2025

I'm not sure how this API captures what we're trying to do here. torch._C._cuda_hasPrimaryContext(0) returns True even when called from a newly created background thread, while the current method returns False, which is consistent with the problem I was trying to solve.
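For illustration, a per-thread context check via the driver API might look like this (assuming libcuda.so is loadable; a sketch of the distinction being made, not the test's exact code):

import ctypes

def has_current_cuda_context() -> bool:
    # cuCtxGetCurrent is per-thread: a fresh background thread reports
    # a NULL current context even while the primary context exists,
    # which is the case torch._C._cuda_hasPrimaryContext cannot
    # distinguish.
    cuda = ctypes.CDLL("libcuda.so")
    ctx = ctypes.c_void_p()
    result = cuda.cuCtxGetCurrent(ctypes.byref(ctx))
    return result == 0 and ctx.value is not None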

except Exception as e:
return False, f"Wrong exception: {type(e).__name__}: {e}"

with ThreadPoolExecutor(max_workers=1) as executor:
Member

Why do we use ThreadPool? Just using normal functions starting with test_ should be fine.

Collaborator Author

noted. changed.

@kouroshHakha kouroshHakha requested a review from youkaichao June 18, 2025 04:50
ekagra-ranjan and others added 8 commits June 17, 2025 21:52
…more methods and datasets (vllm-project#18847)

Signed-off-by: kouroshhakha <kourosh@anyscale.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
@mergify mergify bot added the documentation label (Improvements or additions to documentation) Jun 18, 2025
@simon-mo simon-mo merged commit 5e666f7 into vllm-project:main Jun 20, 2025
93 of 99 checks passed
juncheoll pushed a commit to juncheoll/vllm that referenced this pull request Jun 23, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
fhl2000 pushed a commit to fhl2000/vllm that referenced this pull request Jun 25, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
"""
Set the device for the current platform.
"""
torch.cuda.set_device(device)
Member

should be getattr(torch, self.device_type) ?
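A tiny illustration of the device-agnostic lookup being suggested (assumes a CUDA machine; not the merged code):

import torch

device_type = "cuda"  # would be self.device_type on the Platform object
backend = getattr(torch, device_type)  # torch.cuda here
backend.set_device(0)  # equivalent to torch.cuda.set_device(0)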
