[Baseline] LLaMA2-7B RLHF training curves #263

Open · hijkzzz opened this issue Apr 9, 2024 · 2 comments

hijkzzz (Collaborator) commented Apr 9, 2024

deepspeed ./train_ppo.py \
    --pretrain OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
    --reward_pretrain OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt \
    --save_path ./ckpt/7b_llama \
    --save_steps -1 \
    --logging_steps 1 \
    --eval_steps -1 \
    --micro_train_batch_size 2 \
    --train_batch_size 128 \
    --micro_rollout_batch_size 4 \
    --rollout_batch_size 1024 \
    --max_epochs 1 \
    --prompt_max_len 1024 \
    --generate_max_len 1024 \
    --zero_stage 2 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --prompt_data Open-Orca/OpenOrca,Dahoas/full-hh-rlhf,tasksource/oasst1_pairwise_rlhf_reward \
    --prompt_data_probs 0.4,0.5,0.1 \
    --max_samples 80000 \
    --normalize_reward \
    --adam_offload \
    --flash_attn \
    --gradient_checkpointing

[image: PPO training curves for LLaMA2-7B]
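For readers unfamiliar with some of these flags, here is a minimal sketch (illustrative only, not OpenRLHF's actual implementation) of the reward shaping that `--init_kl_coef` and `--normalize_reward` typically control in PPO-based RLHF: the reward-model score is optionally standardized with running statistics, and a per-token KL penalty against the frozen SFT reference policy keeps the actor from drifting. The function and variable names below are hypothetical.

```python
import torch

def shaped_rewards(rm_score, actor_logprobs, ref_logprobs,
                   kl_coef=0.01, running_mean=0.0, running_std=1.0,
                   normalize_reward=True):
    """Illustrative PPO-RLHF reward shaping (not OpenRLHF's exact code).

    rm_score:       (batch,)      scalar reward-model score per response
    actor_logprobs: (batch, seq)  log-probs of generated tokens under the actor
    ref_logprobs:   (batch, seq)  log-probs under the frozen SFT/reference model
    """
    if normalize_reward:
        # --normalize_reward: standardize RM scores so their scale
        # stays comparable across prompts and datasets
        rm_score = (rm_score - running_mean) / (running_std + 1e-8)

    # Per-token KL estimate between actor and reference policy
    kl = actor_logprobs - ref_logprobs                 # (batch, seq)

    # --init_kl_coef: weight of the KL penalty that keeps the
    # actor close to the SFT model during PPO updates
    rewards = -kl_coef * kl                            # penalty on every token
    rewards[:, -1] += rm_score                         # RM score on the final token
    return rewards
```

With `--init_kl_coef 0.01` the KL term starts small; some implementations also adapt this coefficient during training to hold a target KL.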

@hijkzzz changed the title from "LLaMA2-7B RLHF curves" to "LLaMA2-7B Ray+RLHF curves" on Apr 9, 2024
@hijkzzz changed the title from "LLaMA2-7B Ray+RLHF curves" to "LLaMA2-7B Ray+RLHF+default setting training curves" on Apr 9, 2024
@hijkzzz changed the title from "LLaMA2-7B Ray+RLHF+default setting training curves" to "[Baseline] LLaMA2-7B RLHF training curves" on Apr 9, 2024
mickelliu (Contributor) commented Apr 28, 2024

Very interesting. Glad to see you were able to get good results with the current setup.
I'm contributing the training curve for fine-tuning another LLaMA2-based model, Tulu2-7B, with UltraRM-13B on the UltraFeedback dataset.

[image: training curves for Tulu2-7B with UltraRM-13B on UltraFeedback]

The fine-tuned result (in terms of rewards) isn't as high as with other libraries (e.g., EasyLM) under similar hyperparameter settings, and I'm still trying to figure out why.

mickelliu (Contributor) commented:

> The fine-tuned result (in terms of rewards) isn't as high as with other libraries (e.g., EasyLM) under similar hyperparameter settings, and I'm still trying to figure out why.

Btw, this is resolved. I was able to get well-performing models, comparable to our other setups, with just a few minor differences. Great work!
