Hey!
I have been facing issues (mostly related to ray/vllm, it seems). Starting with Qwen2.5 0.5B Instruct, some of the package configurations don't seem to work well together.
Could you share more details about the environment you used for training?
For additional context, these are the package versions I have, following your README:
ray==2.49.2
torch==2.4.0+cu121
torchvision==0.19.0
vllm==0.6.3
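(For completeness, these were collected with a quick snippet along the lines below; verl itself comes from the repo checkout, judging by the paths further down, so it isn't listed here.)

```python
from importlib.metadata import version, PackageNotFoundError

# Dump the versions of the packages relevant to this issue from the active environment.
for pkg in ["ray", "torch", "torchvision", "vllm"]:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```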
As for the error, I'm getting the following:
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=data/train.parquet', 'data.val_files=data/test.parquet', 'data.train_batch_size=4', 'data.val_batch_size=4', 'data.max_prompt_length=4096', 'data.max_response_length=2048', 'actor_rollout_ref.model.path=models/Qwen2.5-Coder-0.5B-Instruct', 'actor_rollout_ref.actor.optim.lr=3e-7', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=4', 'actor_rollout_ref.actor.ppo_micro_batch_size=4', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=False', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=80', 'actor_rollout_ref.rollout.tensor_model_parallel_size=4', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.2', 'actor_rollout_ref.rollout.n=4', 'actor_rollout_ref.rollout.temperature=1.1', 'actor_rollout_ref.ref.log_prob_micro_batch_size=80', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[wandb]', 'trainer.project_name=SQL-R1', 'trainer.experiment_name=8GPU-Qwen2.5-Coder-0.5B-Instruct-7B', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.default_local_dir=logs/SQL-R1/8GPU-Qwen2.5-Coder-0.5B-Instruct-7B', 'trainer.default_hdfs_dir=null', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.total_epochs=10']
...
File "python/ray/includes/common.pxi", line 100, in ray._raylet.check_status
ValueError: Failed to look up actor with name 'sgqx0FWorkerDict_0:0'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
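For context on cause 3 in that message, this minimal sketch (made-up actor/namespace names, not the verl code path) shows how named-actor lookup in ray depends on the namespace:

```python
import ray

@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

ray.init(namespace="train")

# Detached named actors are registered under the driver's namespace ("train").
handle = Counter.options(name="register_center", lifetime="detached").remote()
ray.get(handle.incr.remote())  # make sure the actor is actually up

# Lookup in the same namespace succeeds.
print(ray.get_actor("register_center"))

# Lookup under a different namespace raises the same kind of ValueError as above.
try:
    ray.get_actor("register_center", namespace="other")
except ValueError as exc:
    print("cross-namespace lookup fails:", exc)
```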
Downgrading ray to 2.10.0 helps me get a bit further, but still doesn't fix it:
(WorkerDict pid=139513) File "/home/.pyenv/versions/3.9.18/envs/sql_r1_39_v1/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
(WorkerDict pid=139513) return TCPStore(
(WorkerDict pid=139513) torch.distributed.DistNetworkError: Connection reset by peer
.......
File "/home/RL/SQL-R1/verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: SYolz4_register_center in []
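Since the assertion prints an empty list from list_named_actors(all_namespaces=True), I can also inspect the cluster directly with something like this (sketch; assumes the ray cluster from the failed run is still up):

```python
import ray
from ray.util import list_named_actors

# Attach to the already-running cluster started by the trainer.
ray.init(address="auto")

# With all_namespaces=True this returns one {"name": ..., "namespace": ...} entry
# per named actor; an empty list matches the failing assertion above.
print(list_named_actors(all_namespaces=True))
```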
I've tried disabling offloading, reducing batch sizes, etc., but the issue seems to come from the verl initialization.
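For the worker-side TCPStore failure above, here is a minimal two-process check (placeholder address/port, not the trainer's actual rendezvous settings) that could help rule out basic networking, in case that's useful:

```python
import sys
from datetime import timedelta
import torch.distributed as dist

# Minimal TCPStore rendezvous check; 127.0.0.1:29500 is a placeholder.
# Start one process as "server", then another (on the same or a peer node) as "client".
role = sys.argv[1] if len(sys.argv) > 1 else "server"
store = dist.TCPStore(
    "127.0.0.1",
    29500,
    world_size=2,
    is_master=(role == "server"),
    timeout=timedelta(seconds=30),
)
if role == "server":
    store.set("ping", "pong")
print(role, "sees:", store.get("ping"))
```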
Thanks in advance for the help!