HTTPError when running train_ppo_llama_ray.sh #290

Open

Zeyuan-Liu opened this issue May 12, 2024 · 5 comments

@Zeyuan-Liu
What happened + What you expected to happen:

Steps to reproduce:
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

The head node started successfully:

Usage stats collection is enabled. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 0.0.0.0

Ray runtime started.

Next steps
To add another node to this Ray cluster, run
ray start --address='0.0.0.0:6379'

To connect to this Ray cluster:
import ray
ray.init(_node_ip_address='0.0.0.0')

To submit a Ray job using the Ray Jobs CLI:
RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py

See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
for more information on submitting Ray jobs to the Ray cluster.

To terminate the Ray runtime, run
ray stop

To view the status of the cluster, use
ray status

To monitor and debug Ray, view the dashboard at
127.0.0.1:8265

If connection to the dashboard fails, check your firewall settings and network configuration.
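
One thing worth ruling out at this step: --node-ip-address 0.0.0.0 hands Ray the bind-all wildcard as the node's address rather than a routable IP, which can leave the dashboard (and with it the job submission endpoint on port 8265) unhealthy even though the runtime reports a successful start. A minimal restart sketch, assuming a single-node setup (<this-node-ip> is a placeholder for the machine's actual address; --dashboard-host and --dashboard-port are existing ray start flags):

ray stop
# Pass the node's real IP; 0.0.0.0 is a wildcard bind address, not a node address.
# Binding the dashboard to 0.0.0.0 only controls which interfaces it listens on.
ray start --head --node-ip-address <this-node-ip> --num-gpus 8 \
    --dashboard-host 0.0.0.0 --dashboard-port 8265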

My Configuration

set -x
export PATH=$HOME/.local/bin/:$PATH

ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
    -- python3 examples/train_ppo_ray.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 1 \
    --reward_num_nodes 1 \
    --reward_num_gpus_per_node 1 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 2 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 4 \
    --pretrain OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
    --reward_pretrain OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt \
    --save_path /openrlhf/examples/test_scripts/ckpt/7b_llama \
    --micro_train_batch_size 8 \
    --train_batch_size 128 \
    --micro_rollout_batch_size 16 \
    --rollout_batch_size 1024 \
    --max_epochs 1 \
    --prompt_max_len 1024 \
    --generate_max_len 1024 \
    --zero_stage 2 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --prompt_data Open-Orca/OpenOrca \
    --prompt_data_probs 1 \
    --max_samples 80000 \
    --normalize_reward \
    --actor_init_on_gpu \
    --adam_offload \
    --flash_attn \
    --gradient_checkpointing \
    --use_wandb {wandb_token}

Error Information

Traceback (most recent call last):
File "/root/miniconda3/envs/lzy/bin/ray", line 33, in
sys.exit(load_entry_point('ray==2.12.0', 'console_scripts', 'ray')())
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2612, in main
return cli()
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 264, in submit
client = _get_sdk_client(
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 29, in _get_sdk_client
client = JobSubmissionClient(
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/sdk.py", line 109, in init
self._check_connection_and_version(
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 248, in _check_connection_and_version
self._check_connection_and_version_with_url(min_version, version_error_message)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 267, in _check_connection_and_version_with_url
r.raise_for_status()
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: http://127.0.0.1:8265/api/version
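
The 503 is raised by the client's pre-flight check against the dashboard, so the failure happens before any job, runtime env, or training code runs. A quick way to reproduce it outside of ray job submit (a hedged diagnostic sketch; /api/version is the exact URL from the last line of the traceback):

# A healthy head node answers HTTP 200 with a small JSON version payload;
# a 503 here means the dashboard process is up but reporting itself unhealthy.
curl -i http://127.0.0.1:8265/api/version

If the probe still returns 503 after a restart, dashboard.log under the session's log directory (typically /tmp/ray/session_latest/logs/ on the head node) should say why the dashboard failed to come up.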

@Zeyuan-Liu (Author)

Environment:

Python 3.10.14

Package                        Version
------------------------------ ------------
accelerate 0.29.3
aiohttp 3.9.5
aiohttp-cors 0.7.0
aiosignal 1.3.1
annotated-types 0.6.0
anyio 4.2.0
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.0.5
async-lru 2.0.4
async-timeout 4.0.3
attrs 23.1.0
Babel 2.11.0
beautifulsoup4 4.12.2
bitsandbytes 0.43.1
bleach 4.1.0
Brotli 1.0.9
cachetools 5.3.3
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 2.0.4
click 8.1.7
coloredlogs 15.0.1
colorful 0.5.6
comm 0.2.1
datasets 2.19.0
debugpy 1.6.7
decorator 5.1.1
deepspeed 0.13.2
defusedxml 0.7.1
dill 0.3.8
distlib 0.3.8
docker-pycreds 0.4.0
einops 0.7.0
exceptiongroup 1.2.0
executing 0.8.3
fastjsonschema 2.16.2
filelock 3.13.1
flash-attn 2.4.2
frozenlist 1.4.1
fsspec 2024.3.1
gitdb 4.0.11
GitPython 3.1.43
gmpy2 2.1.2
google-api-core 2.18.0
google-auth 2.29.0
googleapis-common-protos 1.63.0
grpcio 1.62.2
hjson 3.1.0
huggingface-hub 0.22.2
humanfriendly 10.0
idna 3.4
ipykernel 6.28.0
ipython 8.20.0
ipywidgets 8.1.2
isort 5.13.2
jedi 0.18.1
Jinja2 3.1.3
json5 0.9.6
jsonlines 4.0.0
jsonschema 4.19.2
jsonschema-specifications 2023.12.1
jupyter 1.0.0
jupyter_client 8.6.0
jupyter-console 6.6.3
jupyter_core 5.5.0
jupyter-events 0.8.0
jupyter-lsp 2.2.0
jupyter_server 2.10.0
jupyter_server_terminals 0.4.4
jupyterlab 4.0.11
jupyterlab-pygments 0.1.2
jupyterlab_server 2.25.1
jupyterlab-widgets 3.0.10
lightning-utilities 0.11.2
linkify-it-py 2.0.3
loralib 0.1.2
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mdit-py-plugins 0.4.0
mdurl 0.1.2
memray 1.12.0
mistune 2.0.4
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
mpi4py 3.1.4
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.16
nbclient 0.8.0
nbconvert 7.10.0
nbformat 5.9.2
nest-asyncio 1.6.0
networkx 3.1
ninja 1.11.1.1
notebook 7.0.8
notebook_shim 0.2.3
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
opencensus 0.11.4
opencensus-context 0.1.3
openrlhf 0.2.6
optimum 1.19.1
overrides 7.4.0
packaging 23.2
pandas 2.2.2
pandocfilters 1.5.0
parso 0.8.3
peft 0.10.0
pexpect 4.8.0
pillow 10.2.0
pip 23.3.1
platformdirs 3.10.0
ply 3.11
prometheus-client 0.14.1
prompt-toolkit 3.0.43
proto-plus 1.23.0
protobuf 4.25.3
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
py-spy 0.3.14
pyarrow 16.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.0
pyasn1_modules 0.4.0
pycparser 2.21
pydantic 2.7.1
pydantic_core 2.18.2
Pygments 2.15.1
pynvml 11.5.0
PyQt5 5.15.10
PyQt5-sip 12.13.0
PySocks 1.7.1
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2024.1
PyYAML 6.0.1
pyzmq 25.1.2
qtconsole 5.5.1
QtPy 2.4.1
ray 2.12.0
referencing 0.35.0
regex 2024.4.16
requests 2.31.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.7.1
rpds-py 0.10.6
rsa 4.9
safetensors 0.4.3
Send2Trash 1.8.2
sentencepiece 0.2.0
sentry-sdk 2.0.1
setproctitle 1.3.3
setuptools 68.2.2
sip 6.7.12
six 1.16.0
smart-open 7.0.4
smmap 5.0.1
sniffio 1.3.0
soupsieve 2.5
stack-data 0.2.0
sympy 1.12
terminado 0.17.1
textual 0.58.0
tinycss2 1.2.1
tokenizers 0.15.2
tomli 2.0.1
torch 2.3.0
torchaudio 2.3.0
torchmetrics 1.3.2
torchvision 0.18.0
tornado 6.3.3
tqdm 4.66.2
traitlets 5.7.1
transformers 4.38.2
transformers-stream-generator 0.0.5
triton 2.3.0
typing_extensions 4.11.0
tzdata 2024.1
uc-micro-py 1.0.3
urllib3 2.1.0
virtualenv 20.26.0
wandb 0.16.6
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.58.0
wheel 0.41.3
widgetsnbextension 4.0.10
wrapt 1.16.0
xxhash 3.4.1
yarl 1.9.4

@hijkzzz (Collaborator) commented May 12, 2024

could you try the docker container?

@Zeyuan-Liu (Author)

> could you try the docker container?

When I try to run bash docker_run.sh build, the following error occurs:

+ build=build
+++ dirname docker_run.sh
++ cd ./../../
++ pwd
+ PROJECT_PATH=/home/wangyx/lzy/rlhf/OpenRLHF-main
+ IMAGE_NAME=nvcr.io/nvidia/pytorch:23.12-py3
+ [[ build == \b ]]
+ docker image rm nvcr.io/nvidia/pytorch:23.12-py3
Error response from daemon: No such image: nvcr.io/nvidia/pytorch:23.12-py3
+ docker build -t nvcr.io/nvidia/pytorch:23.12-py3 /home/wangyx/lzy/rlhf/OpenRLHF-main/dockerfile
[+] Building 0.7s (2/2) FINISHED docker:default
 => [internal] load build definition from Dockerfile 0.1s
 => => transferring dockerfile: 701B 0.0s
 => ERROR [internal] load metadata for nvcr.io/nvidia/pytorch:23.12-py3 0.5s

[internal] load metadata for nvcr.io/nvidia/pytorch:23.12-py3:

Dockerfile:1

1 | >>> FROM nvcr.io/nvidia/pytorch:23.12-py3
2 |
3 | WORKDIR /app

ERROR: failed to solve: nvcr.io/nvidia/pytorch:23.12-py3: failed to resolve source metadata for nvcr.io/nvidia/pytorch:23.12-py3: failed to do request: Head "https://nvcr.io/v2/nvidia/pytorch/manifests/23.12-py3": dial tcp 54.148.47.228:443: connect: network is unreachable
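
The dial tcp ... network is unreachable at the end means the Docker daemon cannot reach nvcr.io at all, so the FROM line fails before any layer is built. A hedged workaround sketch, assuming a second machine that can reach the registry is available (docker pull, docker save, and docker load are standard Docker CLI commands; the tar filename is illustrative):

# On a machine with network access to nvcr.io:
docker pull nvcr.io/nvidia/pytorch:23.12-py3
docker save nvcr.io/nvidia/pytorch:23.12-py3 -o pytorch-23.12-py3.tar
# Copy the tar to the offline machine, then:
docker load -i pytorch-23.12-py3.tar

With the base image in the local image store, the docker build step should resolve FROM locally; alternatively, configuring an HTTP(S) proxy for the Docker daemon addresses the unreachable-network error directly.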

@hijkzzz (Collaborator) commented May 13, 2024

That looks like a network issue; can you access nvcr.io/nvidia/pytorch:23.12-py3?

@Zeyuan-Liu (Author) commented May 13, 2024

When I click the aforementioned link nvcr.io/nvidia/pytorch:23.12-py3, I can access

https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags

but cannot open

https://nvcr.io/nvidia/pytorch:23.12-py3
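
That second URL failing in a browser is expected: nvcr.io/nvidia/pytorch:23.12-py3 is an image reference consumed by the Docker client, not a web page, so browser access proves little either way. A hedged reachability check from the machine where the build failed (/v2/ is the standard Docker registry API root; a reachable registry answers with some HTTP status, often 401, whereas this host should reproduce the connect: network is unreachable error):

curl -I https://nvcr.io/v2/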
