Description
Hi, thank you so much for open-sourcing this great project and dataset!
I'm currently using vLLM together with a GRPO training pipeline and ran into the following issue while serving inference through trl.scripts.vllm_serve. No matter which of the available ports I switch to, I consistently get an OSError: [Errno 98] Address already in use during the init_communicator phase.
My launch script:

```bash
CUDA_VISIBLE_DEVICES=4 python -m trl.scripts.vllm_serve \
    --model "pretrain_model/Qwen25-Omni-7B" \
    --tensor_parallel_size 1 \
    --port 8004 &

CUDA_VISIBLE_DEVICES=6,7 accelerate launch \
    --num_processes 2 \
    --use_deepspeed \
    --zero_stage 3 \
    my_grpo_main_omni.py \
    --yaml_config config/my_grpo_main_omni.yaml \
    --vllm_server_port 8004
```
Setup:

- vLLM version: 0.8.5
- CUDA version: 12.1
- PyTorch version: 2.6 (required by vLLM 0.8.5)
- Python version: 3.10
Error:

```
INFO 07-28 14:24:07 [core_client.py:439] Core engine process 0 ready.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8004 (Press CTRL+C to quit)
INFO: 127.0.0.1:55498 - "GET /health/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:55514 - "GET /get_world_size/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:55530 - "POST /init_communicator/ HTTP/1.1" 200 OK
ERROR 07-28 14:24:08 [core.py:459] Invocation of collective_rpc method failed
ERROR 07-28 14:24:08 [core.py:459] Traceback (most recent call last):
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/v1/engine/core.py", line 456, in _handle_client_request
ERROR 07-28 14:24:08 [core.py:459]     output.result = method(
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/v1/engine/core.py", line 306, in collective_rpc
ERROR 07-28 14:24:08 [core.py:459]     return self.model_executor.collective_rpc(method, timeout, args,
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 07-28 14:24:08 [core.py:459]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/utils.py", line 2456, in run_method
ERROR 07-28 14:24:08 [core.py:459]     return func(*args, **kwargs)
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/miniforge3/envs/audsemthinker/lib/python3.10/site-packages/trl/scripts/vllm_serve.py", line 105, in init_communicator
ERROR 07-28 14:24:08 [core.py:459]     pg = StatelessProcessGroup.create(host=host, port=port, rank=rank, world_size=world_size)
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/distributed/utils.py", line 247, in create
ERROR 07-28 14:24:08 [core.py:459]     listen_socket.bind((host, port))
ERROR 07-28 14:24:08 [core.py:459] OSError: [Errno 98] Address already in use
```
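From the traceback, the bind failure happens inside StatelessProcessGroup.create, which opens its own TCP listener separate from the Uvicorn HTTP port. To rule out a stale or conflicting listener, I can probe whether a candidate port is actually bindable on the host (a minimal diagnostic sketch; the port number used here is just an example, not the port the communicator necessarily picks):

```python
import socket

def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if `port` can be bound on `host`, i.e. no other
    process currently holds a listener there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:  # Errno 98: address already in use
            return False

# Example: check the HTTP port from my launch script before starting the server.
print(f"port 8004 free: {is_port_free(8004)}")
```

Running this for the communicator port right before launching the server would show whether another process already holds it, or whether the clash comes from vllm_serve itself.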
How should I properly configure vllm_serve + GRPO so that collective_rpc and StatelessProcessGroup.create don’t clash on ports?