[Bugfix][Ray] Set the cuda context eagerly in the ray worker #19583
Conversation
Thanks for the fix.
cc @youkaichao, who recently investigated the same problem in a different context, to take a look:
pytorch/pytorch#155668
https://forums.developer.nvidia.com/t/whats-the-expected-behavior-of-calling-cudagetdevice-when-the-process-has-no-cuda-context/335784
.buildkite/test-pipeline.yaml (outdated)
@@ -759,6 +759,7 @@ steps:
   - torchrun --nproc_per_node=2 distributed/test_ca_buffer_sharing.py
   - TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
   - pytest -v -s -x lora/test_mixtral.py
+  - pytest -v -s cuda/test_cuda_context.py
Does this need to be under the 4xA100 tests?
vllm/executor/ray_utils.py (outdated)
@@ -113,6 +113,9 @@ def setup_device_if_necessary(self):
             # Not needed
             pass
         else:
+            if current_platform.is_cuda():
+                from vllm.platforms.cuda import set_cuda_context
+                set_cuda_context(self.worker.device)
Should we assert that this returns `True` here?
Why change it here? You just need to change the `set_device` implementation in the CUDA platform.
So a `set_device` method does not explicitly exist on the CUDA `Platform` definition; it goes through `__getattr__`, I think here: https://github.com/vllm-project/vllm/blob/main/vllm/platforms/interface.py#L498? Otherwise I don't know how `set_device` is mapped to `torch.cuda.set_device` on the CUDA `Platform` class. Lacking that understanding, I implemented it separately.
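(For illustration, a minimal sketch of the delegation being described; this is not the actual vLLM code, see the linked `interface.py` for the real implementation:)

```python
import torch

class Platform:
    device_type = "cuda"

    def __getattr__(self, name):
        # Unknown attributes fall through to the torch module for this
        # device type, so `platform.set_device` resolves to
        # `torch.cuda.set_device` without being defined explicitly.
        return getattr(getattr(torch, self.device_type), name)

assert Platform().set_device is torch.cuda.set_device
```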
Yeah, you can add a `set_device` interface that defaults to `torch.xxx.set_device`, and override it in CUDA.
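(A rough sketch of that suggestion; names and placement are assumed here, not the merged implementation:)

```python
import torch

class Platform:
    device_type: str

    def set_device(self, device: torch.device) -> None:
        """Default: delegate to the torch module for this platform,
        e.g. torch.xpu.set_device when device_type == "xpu"."""
        getattr(torch, self.device_type).set_device(device)

class CudaPlatform(Platform):
    device_type = "cuda"

    def set_device(self, device: torch.device) -> None:
        torch.cuda.set_device(device)
        # Eagerly establish the context as well, since
        # torch.cuda.set_device on its own is lazy.
        torch.zeros(1, device=device)
```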
vllm/platforms/cuda.py (outdated)
try:
    # Load CUDA driver library
    cuda = ctypes.CDLL('libcuda.so')
Wondering how robust this is, compared to the workaround of creating a tensor.
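(For context, a reduced sketch of the driver-API approach under discussion. This variant retains the primary context rather than calling `cuCtxCreate`; the signatures are from the public CUDA driver API, and error handling is collapsed to a bool:)

```python
import ctypes
from typing import Union

import torch

CUDA_SUCCESS = 0

def set_cuda_context(device: Union[torch.device, int]) -> bool:
    """Eagerly bind a CUDA context to the calling thread.
    Assumes an indexed device such as torch.device("cuda:0")."""
    try:
        cuda = ctypes.CDLL('libcuda.so')  # CUDA driver library
    except OSError:
        return False
    index = device.index if isinstance(device, torch.device) else device
    if cuda.cuInit(0) != CUDA_SUCCESS:
        return False
    dev = ctypes.c_int()
    if cuda.cuDeviceGet(ctypes.byref(dev), index) != CUDA_SUCCESS:
        return False
    ctx = ctypes.c_void_p()
    # Retain the device's primary context (the one the CUDA runtime and
    # torch share) and make it current on this thread.
    if cuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev) != CUDA_SUCCESS:
        return False
    return cuda.cuCtxSetCurrent(ctx) == CUDA_SUCCESS
```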
vllm/platforms/cuda.py (outdated)
@@ -50,6 +51,65 @@ def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> _R:
     return wrapper


+def set_cuda_context(device: Union[torch.device, int]) -> bool:
I would suggest just creating a tensor on the target device, and then calling `torch.cuda.set_device`. Creating a context this way might interfere with other functionality; e.g., `cuCtxCreate_v2` might be deprecated in the future (in favor of `cuCtxCreate_v3`), and calling that function directly could easily break other things.
Didn't expect it to occur here too 👀
Cool, I think I've addressed all the comments. Please take another look @youkaichao @ruisearch42
from vllm.platforms import current_platform


def check_cuda_context():
Use `torch._C._cuda_hasPrimaryContext(device: int)`.
I'm not sure that API captures what we're trying to do here. `torch._C._cuda_hasPrimaryContext(0)` returns true even from the newly created background thread, whereas the current method returns false, which is consistent with the problem I was trying to solve.
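(A sketch of that distinction: `cuCtxGetCurrent` is a per-thread check, versus the process-wide primary-context check:)

```python
import ctypes

CUDA_SUCCESS = 0

def check_cuda_context() -> bool:
    """True iff the *calling thread* has a current CUDA context.
    torch._C._cuda_hasPrimaryContext(0) answers a different question:
    whether the device's primary context exists anywhere in the process."""
    try:
        cuda = ctypes.CDLL('libcuda.so')
    except OSError:
        return False
    ctx = ctypes.c_void_p()
    if cuda.cuCtxGetCurrent(ctypes.byref(ctx)) != CUDA_SUCCESS:
        return False
    # cuCtxGetCurrent succeeds with a NULL context when the calling
    # thread has no context bound.
    return ctx.value is not None
```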
tests/cuda/test_cuda_context.py (outdated)
    except Exception as e:
        return False, f"Wrong exception: {type(e).__name__}: {e}"

with ThreadPoolExecutor(max_workers=1) as executor:
Why do we use a thread pool? Just using normal functions starting with `test_` should be fine.
Noted, changed.
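(The resulting test shape might look roughly like this; a hedged sketch only, since a thread is still needed to reproduce the background-thread scenario, and `check_cuda_context` is the helper sketched above:)

```python
import threading

import torch
from vllm.platforms import current_platform

def test_cuda_context_in_background_thread():
    # Sketch: verify the platform's set_device leaves a usable CUDA
    # context even when invoked from a freshly spawned thread, mirroring
    # how Ray executes the model in a background thread.
    result = {}

    def worker():
        current_platform.set_device(torch.device("cuda:0"))
        result["ok"] = check_cuda_context()

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    assert result["ok"]
```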
""" | ||
Set the device for the current platform. | ||
""" | ||
torch.cuda.set_device(device) |
Should this be `getattr(torch, self.device_type)`?
This PR sets the CUDA context eagerly in the Ray actor (`torch.cuda.set_device` is actually lazy, I think).
There was a bug in Ray + P/D where the nixl/ucx CUDA context check, done via direct CUDA driver calls, fails because Ray runs model execution in a background thread that does not inherit the CUDA context.
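(To make the failure mode concrete, a minimal repro sketch; it assumes a CUDA machine, and `has_current_context` is an illustrative helper, not vLLM code:)

```python
import ctypes
import threading

import torch

def has_current_context() -> bool:
    cuda = ctypes.CDLL('libcuda.so')
    ctx = ctypes.c_void_p()
    cuda.cuCtxGetCurrent(ctypes.byref(ctx))
    return ctx.value is not None

torch.cuda.set_device(0)
torch.zeros(1, device="cuda:0")  # force context creation on the main thread
print(has_current_context())     # True on the main thread

t = threading.Thread(target=lambda: print(has_current_context()))
t.start()
t.join()  # prints False: the background thread did not inherit the context,
          # which is why nixl/ucx-style driver-level checks fail there
```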