
OSError: [Errno 98] Address already in use when using GRPO with vLLM 0.8.5 (collective_rpc failure during init_communicator) #1

@Kiri0824

Hi, thank you so much for open-sourcing this great project and dataset!

I'm using vLLM together with a GRPO training pipeline and hit the following error when serving the model through trl.scripts.vllm_serve. No matter which free port I switch to, I consistently get OSError: [Errno 98] Address already in use during the init_communicator phase.

My bash script:

CUDA_VISIBLE_DEVICES=4 python -m trl.scripts.vllm_serve --model "pretrain_model/Qwen25-Omni-7B" --tensor_parallel_size 1 --port 8004 &
CUDA_VISIBLE_DEVICES=6,7 accelerate launch --num_processes 2 --use_deepspeed --zero_stage 3 my_grpo_main_omni.py --yaml_config config/my_grpo_main_omni.yaml --vllm_server_port 8004

Setup:
vLLM version: 0.8.5
CUDA version: 12.1
PyTorch version: 2.6 (vLLM needs 2.6)
Python version: 3.10

Error:
INFO 07-28 14:24:07 [core_client.py:439] Core engine process 0 ready.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8004 (Press CTRL+C to quit)
INFO: 127.0.0.1:55498 - "GET /health/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:55514 - "GET /get_world_size/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:55530 - "POST /init_communicator/ HTTP/1.1" 200 OK
ERROR 07-28 14:24:08 [core.py:459] Invocation of collective_rpc method failed
ERROR 07-28 14:24:08 [core.py:459] Traceback (most recent call last):
ERROR 07-28 14:24:08 [core.py:459] File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/v1/engine/core.py", line 456, in _handle_client_request
ERROR 07-28 14:24:08 [core.py:459] output.result = method(
ERROR 07-28 14:24:08 [core.py:459] File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/v1/engine/core.py", line 306, in collective_rpc
ERROR 07-28 14:24:08 [core.py:459] return self.model_executor.collective_rpc(method, timeout, args,
ERROR 07-28 14:24:08 [core.py:459] File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 07-28 14:24:08 [core.py:459] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-28 14:24:08 [core.py:459] File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/utils.py", line 2456, in run_method
ERROR 07-28 14:24:08 [core.py:459] return func(*args, **kwargs)
ERROR 07-28 14:24:08 [core.py:459] File "/mnt/juicefs/user/zhaojinghua/miniforge3/envs/audsemthinker/lib/python3.10/site-packages/trl/scripts/vllm_serve.py", line 105, in init_communicator
ERROR 07-28 14:24:08 [core.py:459] pg = StatelessProcessGroup.create(host=host, port=port, rank=rank, world_size=world_size)
ERROR 07-28 14:24:08 [core.py:459] File "/mnt/juicefs/user/zhaojinghua/soft/vllm-0.8.5/vllm/distributed/utils.py", line 247, in create
ERROR 07-28 14:24:08 [core.py:459] listen_socket.bind((host, port))
ERROR 07-28 14:24:08 [core.py:459] OSError: [Errno 98] Address already in use

How should I properly configure vllm_serve + GRPO so that collective_rpc and StatelessProcessGroup.create don’t clash on ports?
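
For reference, the log shows Uvicorn already listening on 8004 before the failure, and the traceback ends at listen_socket.bind((host, port)) inside StatelessProcessGroup.create, so the failing bind is a second, separate bind whose port number is not printed in the log. Below is a minimal sketch, using only the Python standard library, for checking whether a candidate port can be bound on the server machine before launching. The helper names port_is_free and pick_free_port are hypothetical and are not part of TRL or vLLM; how a chosen port is passed to the communicator depends on the installed TRL version and is not shown here.

```python
import errno
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if (host, port) can be bound right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as exc:
            # EADDRINUSE is the errno behind "Address already in use" (98 on Linux).
            if exc.errno == errno.EADDRINUSE:
                return False
            raise
        return True


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused port by binding to port 0 (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # 8004 is the HTTP port from the script above; once the server is up,
    # this check is expected to report it as taken.
    for candidate in (8004, pick_free_port()):
        state = "free" if port_is_free("127.0.0.1", candidate) else "in use"
        print(f"port {candidate}: {state}")
```

A port reported as free can still be taken by another process between the check and the real bind, so this is a diagnostic aid rather than a guarantee.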
