HTTPError when running train_ppo_llama_ray.sh #290

Open

Zeyuan-Liu opened this issue May 12, 2024 · 5 comments

@Zeyuan-Liu
What happened + What you expected to happen:

Steps to reproduce:
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

The head node started successfully:

Usage stats collection is enabled. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 0.0.0.0

Ray runtime started.

Next steps
To add another node to this Ray cluster, run
ray start --address='0.0.0.0:6379'

To connect to this Ray cluster:
import ray
ray.init(_node_ip_address='0.0.0.0')

To submit a Ray job using the Ray Jobs CLI:
RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py

See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
for more information on submitting Ray jobs to the Ray cluster.

To terminate the Ray runtime, run
ray stop

To view the status of the cluster, use
ray status

To monitor and debug Ray, view the dashboard at
127.0.0.1:8265

If connection to the dashboard fails, check your firewall settings and network configuration.
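
One thing worth ruling out at this step: --node-ip-address 0.0.0.0 hands Ray the bind-all wildcard as the node's address rather than a routable IP, which can leave the dashboard (and with it the job submission endpoint on port 8265) unhealthy even though the runtime reports a successful start. A minimal restart sketch, assuming a single-node setup (<this-node-ip> is a placeholder for the machine's actual address; --dashboard-host and --dashboard-port are existing ray start flags):

ray stop
# Pass the node's real IP; 0.0.0.0 is a wildcard bind address, not a node address.
# Binding the dashboard to 0.0.0.0 only controls which interfaces it listens on.
ray start --head --node-ip-address <this-node-ip> --num-gpus 8 \
    --dashboard-host 0.0.0.0 --dashboard-port 8265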

My Configuration

set -x
export PATH=$HOME/.local/bin/:$PATH

ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
    -- python3 examples/train_ppo_ray.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 1 \
    --reward_num_nodes 1 \
    --reward_num_gpus_per_node 1 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 2 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 4 \
    --pretrain OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
    --reward_pretrain OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt \
    --save_path /openrlhf/examples/test_scripts/ckpt/7b_llama \
    --micro_train_batch_size 8 \
    --train_batch_size 128 \
    --micro_rollout_batch_size 16 \
    --rollout_batch_size 1024 \
    --max_epochs 1 \
    --prompt_max_len 1024 \
    --generate_max_len 1024 \
    --zero_stage 2 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --prompt_data Open-Orca/OpenOrca \
    --prompt_data_probs 1 \
    --max_samples 80000 \
    --normalize_reward \
    --actor_init_on_gpu \
    --adam_offload \
    --flash_attn \
    --gradient_checkpointing \
    --use_wandb {wandb_token}

Error Information

Traceback (most recent call last):
File "/root/miniconda3/envs/lzy/bin/ray", line 33, in
sys.exit(load_entry_point('ray==2.12.0', 'console_scripts', 'ray')())
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2612, in main
return cli()
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 264, in submit
client = _get_sdk_client(
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/cli.py", line 29, in _get_sdk_client
client = JobSubmissionClient(
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/job/sdk.py", line 109, in init
self._check_connection_and_version(
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 248, in _check_connection_and_version
self._check_connection_and_version_with_url(min_version, version_error_message)
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 267, in _check_connection_and_version_with_url
r.raise_for_status()
File "/root/miniconda3/envs/lzy/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: http://127.0.0.1:8265/api/version
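
The 503 is raised by the client's pre-flight check against the dashboard, so the failure happens before any job, runtime env, or training code runs. A quick way to reproduce it outside of ray job submit (a hedged diagnostic sketch; /api/version is the exact URL from the last line of the traceback):

# A healthy head node answers HTTP 200 with a small JSON version payload;
# a 503 here means the dashboard process is up but reporting itself unhealthy.
curl -i http://127.0.0.1:8265/api/version

If the probe still returns 503 after a restart, dashboard.log under the session's log directory (typically /tmp/ray/session_latest/logs/ on the head node) should say why the dashboard failed to come up.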

@Zeyuan-Liu (Author)

Environment:

Python 3.10.14

Package                        Version
------------------------------ ------------
accelerate 0.29.3
aiohttp 3.9.5
aiohttp-cors 0.7.0
aiosignal 1.3.1
annotated-types 0.6.0
anyio 4.2.0
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.0.5
async-lru 2.0.4
async-timeout 4.0.3
attrs 23.1.0
Babel 2.11.0
beautifulsoup4 4.12.2
bitsandbytes 0.43.1
bleach 4.1.0
Brotli 1.0.9
cachetools 5.3.3
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 2.0.4
click 8.1.7
coloredlogs 15.0.1
colorful 0.5.6
comm 0.2.1
datasets 2.19.0
debugpy 1.6.7
decorator 5.1.1
deepspeed 0.13.2
defusedxml 0.7.1
dill 0.3.8
distlib 0.3.8
docker-pycreds 0.4.0
einops 0.7.0
exceptiongroup 1.2.0
executing 0.8.3
fastjsonschema 2.16.2
filelock 3.13.1
flash-attn 2.4.2
frozenlist 1.4.1
fsspec 2024.3.1
gitdb 4.0.11
GitPython 3.1.43
gmpy2 2.1.2
google-api-core 2.18.0
google-auth 2.29.0
googleapis-common-protos 1.63.0
grpcio 1.62.2
hjson 3.1.0
huggingface-hub 0.22.2
humanfriendly 10.0
idna 3.4
ipykernel 6.28.0
ipython 8.20.0
ipywidgets 8.1.2
isort 5.13.2
jedi 0.18.1
Jinja2 3.1.3
json5 0.9.6
jsonlines 4.0.0
jsonschema 4.19.2
jsonschema-specifications 2023.12.1
jupyter 1.0.0
jupyter_client 8.6.0
jupyter-console 6.6.3
jupyter_core 5.5.0
jupyter-events 0.8.0
jupyter-lsp 2.2.0
jupyter_server 2.10.0
jupyter_server_terminals 0.4.4
jupyterlab 4.0.11
jupyterlab-pygments 0.1.2
jupyterlab_server 2.25.1
jupyterlab-widgets 3.0.10
lightning-utilities 0.11.2
linkify-it-py 2.0.3
loralib 0.1.2
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mdit-py-plugins 0.4.0
mdurl 0.1.2
memray 1.12.0
mistune 2.0.4
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
mpi4py 3.1.4
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.16
nbclient 0.8.0
nbconvert 7.10.0
nbformat 5.9.2
nest-asyncio 1.6.0
networkx 3.1
ninja 1.11.1.1
notebook 7.0.8
notebook_shim 0.2.3
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
opencensus 0.11.4
opencensus-context 0.1.3
openrlhf 0.2.6
optimum 1.19.1
overrides 7.4.0
packaging 23.2
pandas 2.2.2
pandocfilters 1.5.0
parso 0.8.3
peft 0.10.0
pexpect 4.8.0
pillow 10.2.0
pip 23.3.1
platformdirs 3.10.0
ply 3.11
prometheus-client 0.14.1
prompt-toolkit 3.0.43
proto-plus 1.23.0
protobuf 4.25.3
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
py-spy 0.3.14
pyarrow 16.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.0
pyasn1_modules 0.4.0
pycparser 2.21
pydantic 2.7.1
pydantic_core 2.18.2
Pygments 2.15.1
pynvml 11.5.0
PyQt5 5.15.10
PyQt5-sip 12.13.0
PySocks 1.7.1
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2024.1
PyYAML 6.0.1
pyzmq 25.1.2
qtconsole 5.5.1
QtPy 2.4.1
ray 2.12.0
referencing 0.35.0
regex 2024.4.16
requests 2.31.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.7.1
rpds-py 0.10.6
rsa 4.9
safetensors 0.4.3
Send2Trash 1.8.2
sentencepiece 0.2.0
sentry-sdk 2.0.1
setproctitle 1.3.3
setuptools 68.2.2
sip 6.7.12
six 1.16.0
smart-open 7.0.4
smmap 5.0.1
sniffio 1.3.0
soupsieve 2.5
stack-data 0.2.0
sympy 1.12
terminado 0.17.1
textual 0.58.0
tinycss2 1.2.1
tokenizers 0.15.2
tomli 2.0.1
torch 2.3.0
torchaudio 2.3.0
torchmetrics 1.3.2
torchvision 0.18.0
tornado 6.3.3
tqdm 4.66.2
traitlets 5.7.1
transformers 4.38.2
transformers-stream-generator 0.0.5
triton 2.3.0
typing_extensions 4.11.0
tzdata 2024.1
uc-micro-py 1.0.3
urllib3 2.1.0
virtualenv 20.26.0
wandb 0.16.6
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.58.0
wheel 0.41.3
widgetsnbextension 4.0.10
wrapt 1.16.0
xxhash 3.4.1
yarl 1.9.4

@hijkzzz (Collaborator) commented May 12, 2024

could you try the docker container?

@Zeyuan-Liu (Author)

> could you try the docker container?

When I try to run bash docker_run.sh build, the following error occurs:

+ build=build
+++ dirname docker_run.sh
++ cd ./../../
++ pwd
+ PROJECT_PATH=/home/wangyx/lzy/rlhf/OpenRLHF-main
+ IMAGE_NAME=nvcr.io/nvidia/pytorch:23.12-py3
+ [[ build == \b ]]
+ docker image rm nvcr.io/nvidia/pytorch:23.12-py3
Error response from daemon: No such image: nvcr.io/nvidia/pytorch:23.12-py3
+ docker build -t nvcr.io/nvidia/pytorch:23.12-py3 /home/wangyx/lzy/rlhf/OpenRLHF-main/dockerfile
[+] Building 0.7s (2/2) FINISHED docker:default
 => [internal] load build definition from Dockerfile 0.1s
 => => transferring dockerfile: 701B 0.0s
 => ERROR [internal] load metadata for nvcr.io/nvidia/pytorch:23.12-py3 0.5s

[internal] load metadata for nvcr.io/nvidia/pytorch:23.12-py3:

Dockerfile:1

1 | >>> FROM nvcr.io/nvidia/pytorch:23.12-py3
2 |
3 | WORKDIR /app

ERROR: failed to solve: nvcr.io/nvidia/pytorch:23.12-py3: failed to resolve source metadata for nvcr.io/nvidia/pytorch:23.12-py3: failed to do request: Head "https://nvcr.io/v2/nvidia/pytorch/manifests/23.12-py3": dial tcp 54.148.47.228:443: connect: network is unreachable
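
The dial tcp ... network is unreachable at the end means the Docker daemon cannot reach nvcr.io at all, so the FROM line fails before any layer is built. A hedged workaround sketch, assuming a second machine that can reach the registry is available (docker pull, docker save, and docker load are standard Docker CLI commands; the tar filename is illustrative):

# On a machine with network access to nvcr.io:
docker pull nvcr.io/nvidia/pytorch:23.12-py3
docker save nvcr.io/nvidia/pytorch:23.12-py3 -o pytorch-23.12-py3.tar
# Copy the tar to the offline machine, then:
docker load -i pytorch-23.12-py3.tar

With the base image in the local image store, the docker build step should resolve FROM locally; alternatively, configuring an HTTP(S) proxy for the Docker daemon addresses the unreachable-network error directly.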

@hijkzzz (Collaborator) commented May 13, 2024

That looks like a network issue; can you access nvcr.io/nvidia/pytorch:23.12-py3?

@Zeyuan-Liu (Author) commented May 13, 2024

When I click the aforementioned link nvcr.io/nvidia/pytorch:23.12-py3, I can access

https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags

but cannot open

https://nvcr.io/nvidia/pytorch:23.12-py3
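
That second URL failing in a browser is expected: nvcr.io/nvidia/pytorch:23.12-py3 is an image reference consumed by the Docker client, not a web page, so browser access proves little either way. A hedged reachability check from the machine where the build failed (/v2/ is the standard Docker registry API root; a reachable registry answers with some HTTP status, often 401, whereas this host should reproduce the connect: network is unreachable error):

curl -I https://nvcr.io/v2/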
