
Out-of-memory issue #277

Open
burger-pb opened this issue Apr 27, 2024 · 4 comments

Comments

@burger-pb

When training a 13B model with PPO, memory usage is extremely high. How should I solve this?

@hijkzzz
Collaborator

hijkzzz commented Apr 27, 2024

If you are using Adam Offload, switch to BF16 gradient accumulation:
--grad_accum_dtype bf16
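
A minimal sketch of what that looks like in a torchrun launch; the --adam_offload flag name and the FP32 default for gradient accumulation are assumptions about OpenRLHF's CLI, so verify them against your installed version:

# With CPU Adam offload (flag name assumed), accumulate gradients in BF16
# rather than the assumed FP32 default to roughly halve the accumulation buffer
torchrun --nproc_per_node=4 train_ppo.py ... --adam_offload --grad_accum_dtype bf16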

@burger-pb
Author

burger-pb commented Apr 28, 2024

Thanks. Now I'm also running out of GPU memory. The model is codellama-13b; the arguments are as follows:
sudo torchrun \
  --nproc_per_node=4 \
  train_ppo.py \
  --pretrain xxxxx/codellama_13b_sft_v3 \
  --reward_pretrain xxxxx/codellama_13b_v3_rm \
  --save_path ./result/13b_codellama_ppo \
  --save_steps 1000 \
  --logging_steps 500 \
  --eval_steps 500 \
  --micro_train_batch_size 1 \
  --train_batch_size 4 \
  --micro_rollout_batch_size 4 \
  --rollout_batch_size 1024 \
  --max_epochs 1 \
  --prompt_max_len 1024 \
  --generate_max_len 1024 \
  --zero_stage 2 \
  --bf16 \
  --actor_learning_rate 5e-7 \
  --critic_learning_rate 9e-6 \
  --init_kl_coef 0.01 \
  --prompt_data xxxx/ppo_train_data \
  --max_samples 1500000 \
  --normalize_reward \
  --actor_init_on_gpu \
  --flash_attn \
  --gradient_checkpointing \
  --input_template '[INST]{}[/INST]' \
  --input_key prompt \
  --grad_accum_dtype bf16
I'm using A800 GPUs, but GPU memory is starting to run out. I've already changed Adam Offload to --grad_accum_dtype bf16 as you said. Could you take a look at what I should change? train_batch_size is already at the minimum. Because I'm using a LoRA-merged model, I can't use DeepSpeed stage 3.
My dataset is fairly large, roughly 1.5 million samples.

@hijkzzz
Collaborator

hijkzzz commented Apr 28, 2024

If it's GPU memory that's insufficient, you need to use Ray: train_ppo_ray.py.
See examples/test_scripts/train_ppo_llama_ray.sh
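
A minimal sketch of a Ray-based launch on a single 4-GPU node; the placement flags (--actor_num_gpus_per_node and friends) are assumed from the referenced example script, so check examples/test_scripts/train_ppo_llama_ray.sh for the exact options in your version:

# Start a local Ray cluster on this node
ray start --head --num-gpus 4

# Place the actor, critic, reward and reference models on separate GPUs so no
# single process has to hold all four models at once (flag names assumed)
python train_ppo_ray.py \
  --ref_num_nodes 1 --ref_num_gpus_per_node 1 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 1 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 1 \
  --critic_num_nodes 1 --critic_num_gpus_per_node 1 \
  --pretrain xxxxx/codellama_13b_sft_v3 \
  --reward_pretrain xxxxx/codellama_13b_v3_rm \
  --micro_train_batch_size 1 --train_batch_size 4 \
  --zero_stage 2 --bf16 --grad_accum_dtype bf16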

@mickelliu
Contributor

hi @hijkzzz, I observed high RAM usage in the 70B Llama-2 fine-tuning task. I got a CPU RAM OOM (not a CUDA OOM) when I tried to run it on a 1 TB RAM machine; each actor uses around 250 GiB.

I have already tried bfloat16 as the grad accumulation dtype; inference runs fine, but it OOMs at the first training step. If I don't use bfloat16 grad accumulation, it won't even survive the inference step. I guess this is expected, but I'm curious to hear about your experience.

Do you have a rough estimate of how many A100 80 GB GPUs would be needed if we got rid of the Adam CPU offloading and put everything on GPUs? And roughly how much of a speed increase would that be? We don't have NVLink enabled on our machine.
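
For scale, a rough back-of-envelope (my own estimate, not the maintainers'): with mixed-precision Adam and no offload, model states cost about 2 B (bf16 weights) + 2 B (bf16 grads) + 12 B (fp32 master weights, momentum, variance) = 16 B per parameter, so a 70B actor alone is roughly 70e9 × 16 B ≈ 1.1 TB. Even fully sharded (e.g. ZeRO-3), that is on the order of 14+ A100 80 GB GPUs just for the actor's model states, before activations, generation KV cache, and the critic/reward/reference models.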

btw, very good library, love from the USA
