
[Bug]: CUDA_VISIBLE_DEVICES is not supported #14807

Open

chenhongyu2048 opened this issue Mar 14, 2025 · 3 comments
Labels
bug Something isn't working ray anything related with ray

Comments

@chenhongyu2048

Your current environment

The output of `python collect_env.py` was not provided.

🐛 Describe the bug

It seems the executor always selects GPUs 0,1,2,3... as the devices for the ray workers. This makes it impossible for users to assign devices via `export CUDA_VISIBLE_DEVICES=2,3` or `os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"`.
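For reference, this is the usage being attempted (a minimal sketch; normally the variable must be set before any CUDA context or ray cluster is created):

```python
import os

# Attempted way to restrict vLLM's ray workers to GPUs 2 and 3.
# This must run before ray / CUDA is initialized in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

# The executor nevertheless rewrites CUDA_VISIBLE_DEVICES for the
# workers, as the printed output below shows.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 2,3 (before the executor runs)
```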

The bug appears to come from the code below:

    worker_node_and_gpu_ids = []
    for worker in [self.driver_dummy_worker] + self.workers:
        if worker is None:
            # driver_dummy_worker can be None when using ray spmd worker.
            continue
        worker_node_and_gpu_ids.append(
            ray.get(worker.get_node_and_gpu_ids.remote()))  # type: ignore

    for i, (node_id, gpu_ids) in enumerate(worker_node_and_gpu_ids):
        node_workers[node_id].append(i)
        # `gpu_ids` can be a list of strings or integers.
        # convert them to integers for consistency.
        # NOTE: gpu_ids can be larger than 9 (e.g. 16 GPUs),
        # string sorting is not sufficient.
        # see https://github.com/vllm-project/vllm/issues/5590
        gpu_ids = [int(x) for x in gpu_ids]
        node_gpus[node_id].extend(gpu_ids)
    for node_id, gpu_ids in node_gpus.items():
        node_gpus[node_id] = sorted(gpu_ids)

    # Set environment variables for the driver and workers.
    all_args_to_update_environment_variables = [{
        current_platform.device_control_env_var:
        ",".join(map(str, node_gpus[node_id])),
    } for (node_id, _) in worker_node_and_gpu_ids]

in vllm/vllm/executor/ray_distributed_executor.py.
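The NOTE about string sorting in that snippet can be checked standalone: with 10 or more GPUs, lexicographic order diverges from numeric order, which is why the ids are converted to `int` first.

```python
# GPU ids as reported by ray can be strings; sorting them as strings
# misorders ids >= 10 (see https://github.com/vllm-project/vllm/issues/5590).
gpu_ids = ["0", "1", "10", "11", "2", "3"]

print(sorted(gpu_ids))                  # ['0', '1', '10', '11', '2', '3']
print(sorted(int(x) for x in gpu_ids))  # [0, 1, 2, 3, 10, 11]
```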

Printing `worker_node_and_gpu_ids`, `node_gpus`, and `all_args_to_update_environment_variables` gives:

worker_node_and_gpu_ids:  [('5627fe05f249fc3f956418f07961cd12015ef5da2ea6b98b13761542', ['0']), ('5627fe05f249fc3f956418f07961cd12015ef5da2ea6b98b13761542', ['1'])]
node_gpus:  defaultdict(<class 'list'>, {'5627fe05f249fc3f956418f07961cd12015ef5da2ea6b98b13761542': [0, 1]})
all_args_to_update_environment_variables:  [{'CUDA_VISIBLE_DEVICES': '0,1'}, {'CUDA_VISIBLE_DEVICES': '0,1'}]

Then, in the function `update_environment_variables`, the `CUDA_VISIBLE_DEVICES` in `os.environ` is overwritten by the settings above:

def update_environment_variables(self, envs_list: List[Dict[str, str]]) -> None:
    envs = envs_list[self.rpc_rank]
    key = 'CUDA_VISIBLE_DEVICES'
    if key in envs and key in os.environ:
        # overwriting CUDA_VISIBLE_DEVICES is desired behavior
        # suppress the warning in `update_environment_variables`
        del os.environ[key]
    update_environment_variables(envs)
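A self-contained sketch of why the user's setting is lost (using a plain `os.environ.update` as a stand-in for vLLM's helper of the same name):

```python
import os
from typing import Dict

def update_environment_variables(envs: Dict[str, str]) -> None:
    # stand-in for vLLM's helper: copy the mapping into os.environ
    os.environ.update(envs)

os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"   # what the user exported
envs = {"CUDA_VISIBLE_DEVICES": "0,1"}       # what the executor computed

key = "CUDA_VISIBLE_DEVICES"
if key in envs and key in os.environ:
    # deleting first suppresses the "overwriting" warning in the helper
    del os.environ[key]
update_environment_variables(envs)

print(os.environ[key])  # 0,1 -- the user's "2,3" is silently replaced
```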

Other issue #14334 and #14191 report similar problems.

It seems the ray workers do not give users a way to change the visible GPU devices.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@chenhongyu2048 chenhongyu2048 added the bug Something isn't working label Mar 14, 2025
@chenhongyu2048
Author

OK, I got it.

export CUDA_VISIBLE_DEVICES=2,3
ray start --head
python ......

will work.
Perhaps this should be noted in the vLLM documentation?

@DarkLight1337
Member

cc @youkaichao

@youkaichao
Member

Yes, this is more about ray usage: to control the GPUs managed by ray, you have to set `CUDA_VISIBLE_DEVICES` before `ray start`.

@youkaichao youkaichao added the ray anything related with ray label Mar 22, 2025
@github-project-automation github-project-automation bot moved this to Backlog in Ray Mar 22, 2025