
ray issues when replicating training runs #26

@gm-kns

Hey!

I have been running into issues when replicating the training runs, mostly related to ray/vllm it seems. Starting from Qwen 2.5 0.5B Instruct, some of the package versions don't appear to work well together.
Could you share more details about the environment you used for training?

For additional context, these are the package versions I have installed, following your README:

ray==2.49.2
torch==2.4.0+cu121
torchvision==0.19.0
vllm==0.6.3
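
To make sure the versions above are actually what gets imported inside my virtualenv, I ran a quick check like this (just a minimal sanity-check script, nothing specific to the repo):

```python
# Minimal sanity check: print the versions actually imported at runtime,
# to rule out a mismatch between pip metadata and the active virtualenv.
import ray
import torch
import torchvision
import vllm

for mod in (ray, torch, torchvision, vllm):
    print(f"{mod.__name__}=={mod.__version__}")
```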

This is the error I'm getting:

Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=data/train.parquet', 'data.val_files=data/test.parquet', 'data.train_batch_size=4', 'data.val_batch_size=4', 'data.max_prompt_length=4096', 'data.max_response_length=2048', 'actor_rollout_ref.model.path=models/Qwen2.5-Coder-0.5B-Instruct', 'actor_rollout_ref.actor.optim.lr=3e-7', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=4', 'actor_rollout_ref.actor.ppo_micro_batch_size=4', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=False', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=80', 'actor_rollout_ref.rollout.tensor_model_parallel_size=4', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.2', 'actor_rollout_ref.rollout.n=4', 'actor_rollout_ref.rollout.temperature=1.1', 'actor_rollout_ref.ref.log_prob_micro_batch_size=80', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[wandb]', 'trainer.project_name=SQL-R1', 'trainer.experiment_name=8GPU-Qwen2.5-Coder-0.5B-Instruct-7B', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.default_local_dir=logs/SQL-R1/8GPU-Qwen2.5-Coder-0.5B-Instruct-7B', 'trainer.default_hdfs_dir=null', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.total_epochs=10']
...
  File "python/ray/includes/common.pxi", line 100, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'sgqx0FWorkerDict_0:0'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
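
From what I understand, Ray named actors are only visible inside the namespace they were created in, so point 3 of that message seemed plausible. This is roughly the check I ran to see which named actors exist at all (a minimal sketch; the "verl" namespace is just an assumption on my side, the actor name is the one from the log above):

```python
# Minimal sketch to inspect named actors across namespaces.
# The "verl" namespace is an assumption; the actor name is from the error log.
import ray
from ray.util import list_named_actors

ray.init(address="auto", namespace="verl")

# Named actors are scoped to their namespace; listing all namespaces
# shows whether the WorkerDict actors were registered at all.
print(list_named_actors(all_namespaces=True))

# The same ValueError is raised here if the name/namespace don't match
# or the actor has already died.
actor = ray.get_actor("sgqx0FWorkerDict_0:0", namespace="verl")
```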

Downgrading ray to 2.10.0 gets me a bit further, but the run still fails:

(WorkerDict pid=139513)   File "/home/.pyenv/versions/3.9.18/envs/sql_r1_39_v1/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
(WorkerDict pid=139513)     return TCPStore(
(WorkerDict pid=139513) torch.distributed.DistNetworkError: Connection reset by peer
.......
  File "/home/RL/SQL-R1/verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
    assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: SYolz4_register_center in []
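
Since the TCPStore rendezvous is what resets the connection, I also tried a bare-bones connectivity check on the node, outside of verl/ray, to rule out a firewall or hostname problem (host and port below are just placeholders I picked for the test):

```python
# Bare-bones TCPStore rendezvous check, independent of verl/ray.
# Host and port are placeholders chosen for this test.
from datetime import timedelta
from torch.distributed import TCPStore

# Server side of the rendezvous (what the rank-0 process would create).
server = TCPStore("127.0.0.1", 29500, is_master=True,
                  timeout=timedelta(seconds=30))

# Client side (what a worker creates); a healthy network setup should not
# produce "Connection reset by peer" here.
client = TCPStore("127.0.0.1", 29500, is_master=False,
                  timeout=timedelta(seconds=30))
client.set("ping", "pong")
print(server.get("ping"))  # b'pong' if the round trip works
```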

I've tried disabling offloading, reducing batch sizes, etc., but the issue seems to come from the verl initialization itself.

Thank you in advance for the help!
