
Memory usage with multi-GPU training #548

Open
yuanyaaa opened this issue Aug 22, 2023 · 1 comment

@yuanyaaa

When I use trlx to fine-tune Flan-T5-Large on a single GPU, it uses about 11 GB of memory; however, when I use accelerate for parallel training, it uses 4 × 16 GB! I can't understand why. Can I keep the usage at about 11 GB per GPU for parallel training? Is the problem caused by my config?
The accelerate config is:

distributed_type: MULTI_GPU 
downcast_bf16: 'no' 
gpu_ids: all 
machine_rank: 0 
main_training_function: main 
mixed_precision: 'no' 
num_machines: 1 
num_processes: 4 
rdzv_backend: static 
same_network: true 
tpu_env: [] 
tpu_use_cluster: false 
tpu_use_sudo: false 
use_cpu: false 

Thank you very much for your reply!
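
For context, distributed_type: MULTI_GPU runs plain PyTorch DistributedDataParallel (DDP), which keeps a full replica of the model, gradients, and optimizer state on every rank. A rough back-of-the-envelope estimate, assuming Flan-T5-Large at roughly 780M parameters and fp32 training with Adam (activations excluded):

weights:         ~780M params × 4 bytes ≈ 3.1 GB
gradients:       ~780M params × 4 bytes ≈ 3.1 GB
Adam moments:    ~780M params × 8 bytes ≈ 6.2 GB
per-rank total:  ≈ 12.4 GB, replicated on all four GPUs

On top of that, DDP can allocate flattened gradient-bucket and NCCL communication buffers on every rank, which could plausibly account for the jump from ~11 GB on one GPU to ~16 GB per GPU. ZeRO, by contrast, shards optimizer state (stages 1-2) and parameters (stage 3) across ranks instead of replicating them.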

@maxreciprocate
Collaborator

Hi @yuanyaaa! Sorry for the late reply 😞. I can't reproduce this behaviour with this config: accelerate launch --config_file configs/accelerate/zero2-bf16.yaml.
https://wandb.ai/carperai/trlx/reports/Memory-occupy-with-multi-GPUs-Training-548---Vmlldzo1MjUzMjMy

Also consider using ZeRO3 if you want to save more memory; otherwise, you may want to lower these options in the config: https://github.com/CarperAI/trlx#configure-hyperparameters
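
For reference, ZeRO3 is enabled through Accelerate's DeepSpeed integration rather than the plain MULTI_GPU backend. A minimal sketch of such a config, assuming the standard accelerate config keys (the values are illustrative, not the repo's actual configs/accelerate/zero2-bf16.yaml):

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

With zero_stage: 3, parameters, gradients, and optimizer state are all sharded across the four processes, so the per-GPU footprint should fall well below the fully replicated DDP case.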
