OSError: [Errno 98] Address already in use when using GRPO with vLLM 0.8.5 (collective_rpc failure during init_communicator)

Hi, thank you so much for open-sourcing this great project and dataset!

I'm using vLLM together with a GRPO training pipeline and hit the following issue when serving the model through trl.scripts.vllm_serve. No matter which free port I pass to --port, I consistently get OSError: [Errno 98] Address already in use during the init_communicator phase.

my bash script:

`CUDA_VISIBLE_DEVICES=4 python -m trl.scripts.vllm_serve --model "pretrain_model/Qwen25-Omni-7B" --tensor_parallel_size 1 --port 8004 & CUDA_VISIBLE_DEVICES=6,7 accelerate launch --num_processes 2 --use_deepspeed --zero_stage 3 my_grpo_main_omni.py --yaml_config config/my_grpo_main_omni.yaml --vllm_server_port 8004`

setup:
vLLM version: 0.8.5
CUDA version: 12.1
PyTorch version: 2.6 (vLLM 0.8.5 requires 2.6)
Python version: 3.10

error:
INFO 07-28 14:24:07 [core_client.py:439] Core engine process 0 ready.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8004 (Press CTRL+C to quit)
INFO:     127.0.0.1:55498 - "GET /health/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:55514 - "GET /get_world_size/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:55530 - "POST /init_communicator/ HTTP/1.1" 200 OK
ERROR 07-28 14:24:08 [core.py:459] Invocation of collective_rpc method failed
ERROR 07-28 14:24:08 [core.py:459] Traceback (most recent call last):
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/v1/engine/core.py", line 456, in _handle_client_request
ERROR 07-28 14:24:08 [core.py:459]     output.result = method(
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/v1/engine/core.py", line 306, in collective_rpc
ERROR 07-28 14:24:08 [core.py:459]     return self.model_executor.collective_rpc(method, timeout, args,
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 07-28 14:24:08 [core.py:459]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/utils.py", line 2456, in run_method
ERROR 07-28 14:24:08 [core.py:459]     return func(*args, **kwargs)
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/miniforge3/envs/audsemthinker/lib/python3.10/site-packages/trl/scripts/vllm_serve.py", line 105, in init_communicator
ERROR 07-28 14:24:08 [core.py:459]     pg = StatelessProcessGroup.create(host=host, port=port, rank=rank, world_size=world_size)
ERROR 07-28 14:24:08 [core.py:459]   File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/distributed/utils.py", line 247, in create
ERROR 07-28 14:24:08 [core.py:459]     listen_socket.bind((host, port))
ERROR 07-28 14:24:08 [core.py:459] OSError: [Errno 98] Address already in use


How should I properly configure vllm_serve + GRPO so that collective_rpc and StatelessProcessGroup.create don’t clash on ports?
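
One observation from the log above: Uvicorn binds 8004 successfully, and the failing `bind` happens inside `StatelessProcessGroup.create` with the `host` and `port` handed to `init_communicator`. So the clash seems to be on a separate communicator port, not on the HTTP `--port` I kept changing. Below is a minimal diagnostic sketch, using only the Python standard library, for checking whether a given port can be bound on this machine. The helper names `port_is_free` and `find_free_port` are hypothetical (they are not part of TRL or vLLM), and how a free port would be passed to the trainer is an assumption I have not verified.

```python
import errno
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper: it mirrors the plain bind() call shown in the
    traceback, so it fails with EADDRINUSE under the same conditions.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise  # any other error (e.g. permission denied) is unexpected
        return True


def find_free_port(host: str = "0.0.0.0") -> int:
    """Ask the OS for an unused port by binding to port 0.

    Note: the port is released when this function returns, so another
    process could still grab it before it is reused.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # 8004 is the HTTP port from my script; it should report "in use"
    # while vllm_serve is running.
    print("8004 free:", port_is_free("0.0.0.0", 8004))
    print("a free port right now:", find_free_port())
```

Running this on the serving node while vllm_serve is up, with the communicator port substituted for 8004, should show whether some other process (for example a leftover server from a previous run) is still holding that port.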
