Consuming much more gpu memory than expected using model_training and model_eval #3611

Closed
SingL3 opened this issue Jul 27, 2023 · 2 comments · Fixed by #3614
Comments

@SingL3
Contributor

SingL3 commented Jul 27, 2023

Previously, I trained a pythia-6.9b model using the code here: dolly
I can train with the settings below on 4xA100 (80G) without GPU OOM:

per-device-train-batch-size: 8
per-device-eval-batch-size: 8
gradient-accumulation-steps: 2
max len: 2048
gradient checkpointing: false
use_cache: true
bf16: true

with the deepspeed config here.
I can also evaluate the output model with lm-evaluation-harness on a single GPU with a batch size larger than one.
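For reference, a minimal sketch of how these settings map onto the Hugging Face `TrainingArguments` plus a DeepSpeed ZeRO-3 config (this is not the exact dolly invocation; the output directory and config file name are placeholders):

```python
# Minimal sketch, assuming transformers + deepspeed are installed.
# Paths below are placeholders, not the actual dolly files.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="pythia-6.9b-sft",        # hypothetical output path
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=False,        # not needed in this run
    bf16=True,
    deepspeed="ds_zero3_config.json",    # placeholder for the linked ZeRO-3 config
)
# Max sequence length (2048) is enforced by the tokenizer/data pipeline, not here.
```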
However, now I am using model_training to train a reward model.
I can only run with the settings below on 8xA100 (80G):

per_device_train_batch_size: 4 # can be bigger using gradient checkpointing
per_device_eval_batch_size: 4
gradient_accumulation_steps: 4
max len: 2048
gradient checkpointing: true # otherwise got GPU OOM even with per_device_train_batch_size 1
use_cache: false # has to be turned off since it conflicts with gradient checkpointing
bf16: true

with the deepspeed config zero3_config_sft.config (as you can see, it is very similar to the one above).
In addition, I cannot evaluate the output model using eval_rm.py on a single GPU (even with batch size 1) because of GPU OOM.
I didn't find any code that reduces GPU memory usage in dolly or lm-evaluation-harness. And judging from the code of the output layer, GPTNeoXForCausalLM should consume more memory than GPTNeoXRewardModel.
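To illustrate the two points above (the use_cache/gradient-checkpointing conflict, and the relative size of the output layers), here is a small sketch; it uses the lightweight EleutherAI/pythia-70m checkpoint as a stand-in for pythia-6.9b, and the attribute names follow the Hugging Face GPT-NeoX implementation:

```python
# Minimal sketch, using a small GPT-NeoX checkpoint as a stand-in for pythia-6.9b.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", torch_dtype=torch.bfloat16
)

# Gradient checkpointing is incompatible with the KV cache, so the cache
# must be turned off when checkpointing is enabled for training.
model.gradient_checkpointing_enable()
model.config.use_cache = False

# The causal-LM head projects hidden_size -> vocab_size (tens of thousands of
# columns), whereas a reward head is typically a single hidden_size -> 1
# projection, so the causal LM's output layer is the larger of the two.
print(model.embed_out)  # GPT-NeoX names its LM head `embed_out`
```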

@andreaskoepf
Collaborator

Yes, I also noticed that our current trainer code / configurations don't work even for smaller models on a single 80 GB GPU. It would be great to get this analyzed/fixed.

@SingL3
Contributor Author

SingL3 commented Jul 28, 2023

@andreaskoepf
I will look into this issue and try to fix some of the causes (I think there may be multiple reasons for this). If you have any clues or suggestions, please let me know; I would appreciate it.

shahules786 pushed a commit that referenced this issue Aug 30, 2023
Fix #3611.
Still debugging for model_training.

---------

Co-authored-by: Lin Junpeng <linjunpeng@sensetime.com>