
[Bug]: CUDA_VISIBLE_DEVICES is not supported #14807

Open

chenhongyu2048 opened this issue Mar 14, 2025 · 3 comments
Labels
bug Something isn't working ray anything related with ray

Comments

@chenhongyu2048

Your current environment

The output of `python collect_env.py` was not provided.

🐛 Describe the bug

It seems the executor always selects GPUs 0,1,2,3... as the devices for the ray workers. This makes it impossible for users to assign devices via `export CUDA_VISIBLE_DEVICES=2,3` or `os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"`.
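For reference, this is the usage being attempted (a minimal sketch; normally the variable must be set before any CUDA context or ray cluster is created):

```python
import os

# Attempted way to restrict vLLM's ray workers to GPUs 2 and 3.
# This must run before ray / CUDA is initialized in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

# The executor nevertheless rewrites CUDA_VISIBLE_DEVICES for the
# workers, as the printed output below shows.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 2,3 (before the executor runs)
```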

The bug appears to come from the code below:

    worker_node_and_gpu_ids = []
    for worker in [self.driver_dummy_worker] + self.workers:
        if worker is None:
            # driver_dummy_worker can be None when using ray spmd worker.
            continue
        worker_node_and_gpu_ids.append(
            ray.get(worker.get_node_and_gpu_ids.remote()))  # type: ignore

    for i, (node_id, gpu_ids) in enumerate(worker_node_and_gpu_ids):
        node_workers[node_id].append(i)
        # `gpu_ids` can be a list of strings or integers.
        # convert them to integers for consistency.
        # NOTE: gpu_ids can be larger than 9 (e.g. 16 GPUs),
        # string sorting is not sufficient.
        # see https://github.com/vllm-project/vllm/issues/5590
        gpu_ids = [int(x) for x in gpu_ids]
        node_gpus[node_id].extend(gpu_ids)
    for node_id, gpu_ids in node_gpus.items():
        node_gpus[node_id] = sorted(gpu_ids)

    # Set environment variables for the driver and workers.
    all_args_to_update_environment_variables = [{
        current_platform.device_control_env_var:
        ",".join(map(str, node_gpus[node_id])),
    } for (node_id, _) in worker_node_and_gpu_ids]

in vllm/vllm/executor/ray_distributed_executor.py.
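The NOTE about string sorting in that snippet can be checked standalone: with 10 or more GPUs, lexicographic order diverges from numeric order, which is why the ids are converted to `int` first.

```python
# GPU ids as reported by ray can be strings; sorting them as strings
# misorders ids >= 10 (see https://github.com/vllm-project/vllm/issues/5590).
gpu_ids = ["0", "1", "10", "11", "2", "3"]

print(sorted(gpu_ids))                  # ['0', '1', '10', '11', '2', '3']
print(sorted(int(x) for x in gpu_ids))  # [0, 1, 2, 3, 10, 11]
```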

Printing `worker_node_and_gpu_ids`, `node_gpus`, and `all_args_to_update_environment_variables` gives:

worker_node_and_gpu_ids:  [('5627fe05f249fc3f956418f07961cd12015ef5da2ea6b98b13761542', ['0']), ('5627fe05f249fc3f956418f07961cd12015ef5da2ea6b98b13761542', ['1'])]
node_gpus:  defaultdict(<class 'list'>, {'5627fe05f249fc3f956418f07961cd12015ef5da2ea6b98b13761542': [0, 1]})
all_args_to_update_environment_variables:  [{'CUDA_VISIBLE_DEVICES': '0,1'}, {'CUDA_VISIBLE_DEVICES': '0,1'}]

Then, in the function `update_environment_variables`, the `CUDA_VISIBLE_DEVICES` in `os.environ` is overwritten by the settings above:

def update_environment_variables(self, envs_list: List[Dict[str, str]]) -> None:
    envs = envs_list[self.rpc_rank]
    key = 'CUDA_VISIBLE_DEVICES'
    if key in envs and key in os.environ:
        # overwriting CUDA_VISIBLE_DEVICES is desired behavior
        # suppress the warning in `update_environment_variables`
        del os.environ[key]
    update_environment_variables(envs)
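A self-contained sketch of why the user's setting is lost (using a plain `os.environ.update` as a stand-in for vLLM's helper of the same name):

```python
import os
from typing import Dict

def update_environment_variables(envs: Dict[str, str]) -> None:
    # stand-in for vLLM's helper: copy the mapping into os.environ
    os.environ.update(envs)

os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"   # what the user exported
envs = {"CUDA_VISIBLE_DEVICES": "0,1"}       # what the executor computed

key = "CUDA_VISIBLE_DEVICES"
if key in envs and key in os.environ:
    # deleting first suppresses the "overwriting" warning in the helper
    del os.environ[key]
update_environment_variables(envs)

print(os.environ[key])  # 0,1 -- the user's "2,3" is silently replaced
```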

Other issue #14334 and #14191 report similar problems.

It seems the ray workers do not give users a way to change the visible GPU devices.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@chenhongyu2048 chenhongyu2048 added the bug Something isn't working label Mar 14, 2025
@chenhongyu2048
Author

OK, I got it.

export CUDA_VISIBLE_DEVICES=2,3
ray start --head
python ......

will work.
Perhaps this should be noted in the vLLM documentation?

@DarkLight1337
Member

cc @youkaichao

@youkaichao
Member

Yes, this is more about ray usage: to control the GPUs managed by ray, you have to set `CUDA_VISIBLE_DEVICES` before `ray start`.

@youkaichao youkaichao added the ray anything related with ray label Mar 22, 2025
@github-project-automation github-project-automation bot moved this to Backlog in Ray Mar 22, 2025