
Out-of-memory issue #277

Open
burger-pb opened this issue Apr 27, 2024 · 4 comments

Comments

@burger-pb

When training a 13B model with PPO, memory usage is extremely high. How should I solve this?

@hijkzzz
Collaborator

hijkzzz commented Apr 27, 2024

If you are using Adam Offload, switch to BF16 gradient accumulation:
--grad_accum_dtype bf16
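
A minimal sketch of what that looks like in a torchrun launch; the --adam_offload flag name and the FP32 default for gradient accumulation are assumptions about OpenRLHF's CLI, so verify them against your installed version:

# With CPU Adam offload (flag name assumed), accumulate gradients in BF16
# rather than the assumed FP32 default to roughly halve the accumulation buffer
torchrun --nproc_per_node=4 train_ppo.py ... --adam_offload --grad_accum_dtype bf16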

@burger-pb
Author

burger-pb commented Apr 28, 2024

Thanks. Now I'm also running out of GPU memory. The model is codellama-13b; the arguments are as follows:
sudo torchrun \
  --nproc_per_node=4 \
  train_ppo.py \
  --pretrain xxxxx/codellama_13b_sft_v3 \
  --reward_pretrain xxxxx/codellama_13b_v3_rm \
  --save_path ./result/13b_codellama_ppo \
  --save_steps 1000 \
  --logging_steps 500 \
  --eval_steps 500 \
  --micro_train_batch_size 1 \
  --train_batch_size 4 \
  --micro_rollout_batch_size 4 \
  --rollout_batch_size 1024 \
  --max_epochs 1 \
  --prompt_max_len 1024 \
  --generate_max_len 1024 \
  --zero_stage 2 \
  --bf16 \
  --actor_learning_rate 5e-7 \
  --critic_learning_rate 9e-6 \
  --init_kl_coef 0.01 \
  --prompt_data xxxx/ppo_train_data \
  --max_samples 1500000 \
  --normalize_reward \
  --actor_init_on_gpu \
  --flash_attn \
  --gradient_checkpointing \
  --input_template '[INST]{}[/INST]' \
  --input_key prompt \
  --grad_accum_dtype bf16
I'm using A800 GPUs, but GPU memory is starting to run out. I've already changed Adam Offload to --grad_accum_dtype bf16 as you said. Could you take a look at what I should change? train_batch_size is already at the minimum. Because I'm using a LoRA-merged model, I can't use DeepSpeed stage 3.
My dataset is fairly large, roughly 1.5 million samples.

@hijkzzz
Collaborator

hijkzzz commented Apr 28, 2024

If it's GPU memory that's insufficient, you need to use Ray: train_ppo_ray.py.
See examples/test_scripts/train_ppo_llama_ray.sh
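
A minimal sketch of a Ray-based launch on a single 4-GPU node; the placement flags (--actor_num_gpus_per_node and friends) are assumed from the referenced example script, so check examples/test_scripts/train_ppo_llama_ray.sh for the exact options in your version:

# Start a local Ray cluster on this node
ray start --head --num-gpus 4

# Place the actor, critic, reward and reference models on separate GPUs so no
# single process has to hold all four models at once (flag names assumed)
python train_ppo_ray.py \
  --ref_num_nodes 1 --ref_num_gpus_per_node 1 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 1 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 1 \
  --critic_num_nodes 1 --critic_num_gpus_per_node 1 \
  --pretrain xxxxx/codellama_13b_sft_v3 \
  --reward_pretrain xxxxx/codellama_13b_v3_rm \
  --micro_train_batch_size 1 --train_batch_size 4 \
  --zero_stage 2 --bf16 --grad_accum_dtype bf16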

@mickelliu
Contributor

hi @hijkzzz, I observed high RAM usage in the 70B Llama-2 fine-tuning task. I got a CPU RAM OOM (not a CUDA OOM) when I tried to run it on a 1 TB RAM machine; each actor uses around 250 GiB.

I have already tried bfloat16 as the grad accumulation dtype; inference runs fine, but it OOMs at the first training step. If I don't use bfloat16 grad accumulation, it won't even survive the inference step. I guess this is expected, but I'm curious to hear about your experience.

Do you have a rough estimate of how many A100 80 GB GPUs would be needed if we got rid of the Adam CPU offloading and put everything on GPUs? And roughly how much of a speed increase would that be? We don't have NVLink enabled on our machine.
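
For scale, a rough back-of-envelope (my own estimate, not the maintainers'): with mixed-precision Adam and no offload, model states cost about 2 B (bf16 weights) + 2 B (bf16 grads) + 12 B (fp32 master weights, momentum, variance) = 16 B per parameter, so a 70B actor alone is roughly 70e9 × 16 B ≈ 1.1 TB. Even fully sharded (e.g. ZeRO-3), that is on the order of 14+ A100 80 GB GPUs just for the actor's model states, before activations, generation KV cache, and the critic/reward/reference models.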

btw, very good library, love from the USA
