Previously, I have trained a pythia-6.9b using the code here: dolly.
I can train with the below setting on 4xA100 (80G) without GPU OOM:
with deepspeed config here.
I can also evaluate the output model with lm-evaluation-harness on a single GPU with a batch size larger than one.
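For reference, the harness invocation looks roughly like this (model path and tasks are placeholders, not my exact command):

```bash
python main.py \
  --model hf-causal \
  --model_args pretrained=/path/to/sft-output \
  --tasks lambada_openai,hellaswag \
  --batch_size 8 \
  --device cuda:0
```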
However, now I am using model_training to train a reward model, and I can only run it with the below setting on 8xA100 (80G):
```yaml
per_device_train_batch_size: 4  # can be bigger using gradient checkpointing
per_device_eval_batch_size: 4
gradient_accumulation_steps: 4
max len: 2048
gradient checkpointing: true  # otherwise got GPU OOM even with per_device_train_batch_size 1
use_cache: false  # has to be turned off since it conflicts with gradient checkpointing
bf16: true
```
with deepspeed config zero3_config_sft.config. (As you can see, it is very similar to the one above.)
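(For orientation, not the repo's actual file: a typical HF-Trainer-style ZeRO-3 config looks like the sketch below, with the `auto` values filled in by the trainer at launch.)

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```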
In addition, I cannot evaluate the output model using eval_rm.py on a single GPU (even with batch size 1) because of GPU OOM.
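For scale: just loading a ~7B reward model in bf16 and scoring one example under `torch.no_grad()` should need far less than 80 GB, which is why the OOM surprises me. A hypothetical sketch (the path is a placeholder, and `AutoModelForSequenceClassification` stands in for `GPTNeoXRewardModel`; eval_rm.py's actual loading code may differ):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "/path/to/rm-output"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(path)
# bf16 weights only: a ~7B model is roughly 14 GB of parameters
model = AutoModelForSequenceClassification.from_pretrained(
    path, num_labels=1, torch_dtype=torch.bfloat16
).to("cuda").eval()

with torch.no_grad():  # no autograd graph during eval
    batch = tokenizer("example prompt and reply", return_tensors="pt").to("cuda")
    score = model(**batch).logits  # scalar reward per sequence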
I didn't find any code that reduces GPU memory in dolly or lm-evaluation-harness. And GPTNeoXForCausalLM should consume more memory than GPTNeoXRewardModel, judging from the code of the output layers.
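Rough numbers behind that claim (hidden size is the published pythia-6.9b value; the vocab size is approximate, used only for illustration):

```python
hidden_size = 4096   # pythia-6.9b hidden size
vocab_size = 50_304  # ~50k GPT-NeoX vocab (exact padded size varies)

lm_head = hidden_size * vocab_size  # GPTNeoXForCausalLM: Linear(hidden, vocab)
rm_head = hidden_size * 1           # reward model head: Linear(hidden, 1)

print(f"LM head params: {lm_head / 1e6:.1f}M")  # ~206M extra params, plus
                                                # [batch, seq, vocab] logits at runtime
print(f"RM head params: {rm_head}")             # 4096
```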
Yes, I also noticed that our current trainer code / configurations don't work even for smaller models on single 80 GB GPUs. It would be great to get this analyzed/fixed.
@andreaskoepf
I will take a look into this issue and try to fix some of the causes (I think there may be multiple reasons for this). If you have any clue or suggestion, please let me know; I would appreciate it.