
Memory usage with multi-GPU training #548

Open
yuanyaaa opened this issue Aug 22, 2023 · 1 comment

@yuanyaaa

When I use trlx to fine-tune Flan-T5-Large on a single GPU, it uses about 11 GB of memory; however, when I use accelerate for parallel training, it uses 4 × 16 GB! I can't understand why. Can I keep the usage at about 11 GB per GPU for parallel training? Is the problem caused by my config?
The accelerate config is:

distributed_type: MULTI_GPU 
downcast_bf16: 'no' 
gpu_ids: all 
machine_rank: 0 
main_training_function: main 
mixed_precision: 'no' 
num_machines: 1 
num_processes: 4 
rdzv_backend: static 
same_network: true 
tpu_env: [] 
tpu_use_cluster: false 
tpu_use_sudo: false 
use_cpu: false 

Thank you very much for your reply!
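
For context, distributed_type: MULTI_GPU runs plain PyTorch DistributedDataParallel (DDP), which keeps a full replica of the model, gradients, and optimizer state on every rank. A rough back-of-the-envelope estimate, assuming Flan-T5-Large at roughly 780M parameters and fp32 training with Adam (activations excluded):

weights:         ~780M params × 4 bytes ≈ 3.1 GB
gradients:       ~780M params × 4 bytes ≈ 3.1 GB
Adam moments:    ~780M params × 8 bytes ≈ 6.2 GB
per-rank total:  ≈ 12.4 GB, replicated on all four GPUs

On top of that, DDP can allocate flattened gradient-bucket and NCCL communication buffers on every rank, which could plausibly account for the jump from ~11 GB on one GPU to ~16 GB per GPU. ZeRO, by contrast, shards optimizer state (stages 1-2) and parameters (stage 3) across ranks instead of replicating them.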

@maxreciprocate
Collaborator

Hi @yuanyaaa! Sorry for the late reply 😞. I can't reproduce this behaviour with this config: accelerate launch --config_file configs/accelerate/zero2-bf16.yaml.
https://wandb.ai/carperai/trlx/reports/Memory-occupy-with-multi-GPUs-Training-548---Vmlldzo1MjUzMjMy

Also consider using ZeRO3 if you want to save more memory; otherwise, you may want to lower these options in the config: https://github.com/CarperAI/trlx#configure-hyperparameters
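
For reference, ZeRO3 is enabled through Accelerate's DeepSpeed integration rather than the plain MULTI_GPU backend. A minimal sketch of such a config, assuming the standard accelerate config keys (the values are illustrative, not the repo's actual configs/accelerate/zero2-bf16.yaml):

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

With zero_stage: 3, parameters, gradients, and optimizer state are all sharded across the four processes, so the per-GPU footprint should fall well below the fully replicated DDP case.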
