
ray issues when replicating training runs #26

@gm-kns

Hey!

I have been running into issues when replicating the training runs, mostly related to ray/vllm it seems. Starting from Qwen 2.5 0.5B Instruct, some of the package versions don't appear to work well together.
Could you share more details about the environment you used for training?

For additional context, these are the package versions I have installed, following your README:

ray==2.49.2
torch==2.4.0+cu121
torchvision==0.19.0
vllm==0.6.3
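
To make sure the versions above are actually what gets imported inside my virtualenv, I ran a quick check like this (just a minimal sanity-check script, nothing specific to the repo):

```python
# Minimal sanity check: print the versions actually imported at runtime,
# to rule out a mismatch between pip metadata and the active virtualenv.
import ray
import torch
import torchvision
import vllm

for mod in (ray, torch, torchvision, vllm):
    print(f"{mod.__name__}=={mod.__version__}")
```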

This is the error I'm getting:

Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=data/train.parquet', 'data.val_files=data/test.parquet', 'data.train_batch_size=4', 'data.val_batch_size=4', 'data.max_prompt_length=4096', 'data.max_response_length=2048', 'actor_rollout_ref.model.path=models/Qwen2.5-Coder-0.5B-Instruct', 'actor_rollout_ref.actor.optim.lr=3e-7', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=4', 'actor_rollout_ref.actor.ppo_micro_batch_size=4', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=False', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=80', 'actor_rollout_ref.rollout.tensor_model_parallel_size=4', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.2', 'actor_rollout_ref.rollout.n=4', 'actor_rollout_ref.rollout.temperature=1.1', 'actor_rollout_ref.ref.log_prob_micro_batch_size=80', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[wandb]', 'trainer.project_name=SQL-R1', 'trainer.experiment_name=8GPU-Qwen2.5-Coder-0.5B-Instruct-7B', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.default_local_dir=logs/SQL-R1/8GPU-Qwen2.5-Coder-0.5B-Instruct-7B', 'trainer.default_hdfs_dir=null', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.total_epochs=10']
...
  File "python/ray/includes/common.pxi", line 100, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'sgqx0FWorkerDict_0:0'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
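
From what I understand, Ray named actors are only visible inside the namespace they were created in, so point 3 of that message seemed plausible. This is roughly the check I ran to see which named actors exist at all (a minimal sketch; the "verl" namespace is just an assumption on my side, the actor name is the one from the log above):

```python
# Minimal sketch to inspect named actors across namespaces.
# The "verl" namespace is an assumption; the actor name is from the error log.
import ray
from ray.util import list_named_actors

ray.init(address="auto", namespace="verl")

# Named actors are scoped to their namespace; listing all namespaces
# shows whether the WorkerDict actors were registered at all.
print(list_named_actors(all_namespaces=True))

# The same ValueError is raised here if the name/namespace don't match
# or the actor has already died.
actor = ray.get_actor("sgqx0FWorkerDict_0:0", namespace="verl")
```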

Downgrading ray to 2.10.0 gets me a bit further, but the run still fails:

(WorkerDict pid=139513)   File "/home/.pyenv/versions/3.9.18/envs/sql_r1_39_v1/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
(WorkerDict pid=139513)     return TCPStore(
(WorkerDict pid=139513) torch.distributed.DistNetworkError: Connection reset by peer
.......
  File "/home/RL/SQL-R1/verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
    assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: SYolz4_register_center in []
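
Since the TCPStore rendezvous is what resets the connection, I also tried a bare-bones connectivity check on the node, outside of verl/ray, to rule out a firewall or hostname problem (host and port below are just placeholders I picked for the test):

```python
# Bare-bones TCPStore rendezvous check, independent of verl/ray.
# Host and port are placeholders chosen for this test.
from datetime import timedelta
from torch.distributed import TCPStore

# Server side of the rendezvous (what the rank-0 process would create).
server = TCPStore("127.0.0.1", 29500, is_master=True,
                  timeout=timedelta(seconds=30))

# Client side (what a worker creates); a healthy network setup should not
# produce "Connection reset by peer" here.
client = TCPStore("127.0.0.1", 29500, is_master=False,
                  timeout=timedelta(seconds=30))
client.set("ping", "pong")
print(server.get("ping"))  # b'pong' if the round trip works
```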

I've tried disabling offloading, reducing batch sizes, etc., but the issue seems to come from the verl initialization itself.

Thank you in advance for the help!
