DDP failed with big model under `deepspeed_stage_3` strategy #11760
Comments
It's pretty similar to #4420.
@feipenghe can you report back after trying #4420? Did it resolve your issue? :)
@justusschock After several trials, I finally determined that the problem is insufficient RAM. My machine has 256 GB of RAM. I tried T5-3B, which peaked at about 126 GB of RAM, so I think T5-11B needs around 500 GB. When I ran the code, RAM usage climbed to a peak and then went down after certain messages. I read some DeepSpeed documentation (https://deepspeed.readthedocs.io/en/latest/memory.html); under the general RAM section, it says a certain amount of memory is needed at the beginning to initialize the model in CPU memory. That roughly matches my case according to the formula. However, I am not using "deepspeed_stage_3_offload" but only "deepspeed_stage_3". I wonder if ~500 GB of RAM is still needed to initialize deepspeed distributed?
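For reference, the DeepSpeed memory documentation linked above exposes helper functions that print the estimated GPU and host memory needed for ZeRO stage 3. A minimal sketch, assuming an already-instantiated Hugging Face model and a single node with 8 GPUs (neither is specified in this thread):

```python
# Sketch based on the DeepSpeed memory docs linked above; the model name and
# GPU/node counts are illustrative assumptions, not values from this issue.
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")

# Prints the estimated per-GPU and per-node CPU memory required for the model
# states under ZeRO-3, with and without optimizer/parameter offload.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
```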
hey @feipenghe I can confirm that I can reproduce this simply by instantiating t5-11b in the BoringModel. Further investigation leads me to believe that this memory issue is caused by the pre-trained weights being loaded on each individual device internally. I'm unsure what a fix could be in this case; I'll brainstorm a bit here, as this would be a good use case to tackle!
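One mitigation sometimes suggested for this pattern (not confirmed as the fix in this thread) is to defer model construction to the `configure_sharded_model` hook, so the DeepSpeed stage-3 strategy can partition parameters as they are created instead of materializing a full copy per rank in `__init__`. A rough sketch, with the model name as an assumption:

```python
# Hypothetical sketch of deferring construction; random-init only, since loading
# the actual pre-trained checkpoint without replicating it per rank is exactly
# the open problem being discussed here.
import pytorch_lightning as pl
from transformers import T5Config, T5ForConditionalGeneration


class ShardedT5(pl.LightningModule):
    def __init__(self, model_name: str = "t5-11b"):
        super().__init__()
        self.model_name = model_name
        self.model = None  # nothing large is allocated at __init__ time

    def configure_sharded_model(self):
        # Lightning's DeepSpeed strategy calls this hook inside its sharding
        # context, so parameters created here can be partitioned across ranks
        # as they are built rather than fully replicated in host memory.
        config = T5Config.from_pretrained(self.model_name)
        self.model = T5ForConditionalGeneration(config)
```

Note that this only avoids the per-rank host copy of freshly constructed parameters; pre-trained weights would still have to be loaded separately, e.g. from a sharded checkpoint.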
I ran into the same issue when using T5-11B. I had some discussions with a DeepSpeed developer: microsoft/DeepSpeed#1814
@SeanNaren, have you come up with some way to fix this?
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Hi, any updates here? I'm hitting the same issue with PL.
🐛 Bug
I am running a T5-11B (~40 GB) model with the `deepspeed_stage_3` strategy and it fails at `_init_deepspeed_distributed()`.
The Trainer works for smaller models such as T5-large (~3 GB).
Here is the traceback.
Another run gives a different traceback.
To Reproduce
I think a PyTorch Lightning trainer with a T5-11B model is enough to reproduce the bug; see the sketch below.
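A minimal reproduction sketch, assuming the Hugging Face "t5-11b" checkpoint and a multi-GPU node; the dataset and hyperparameters below are placeholders, not values from the issue:

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import T5ForConditionalGeneration


class T5Module(pl.LightningModule):
    def __init__(self, model_name: str = "t5-11b"):
        super().__init__()
        # Each rank loads the full pre-trained checkpoint into CPU RAM here,
        # which is where the host-memory blow-up described above shows up.
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)

    def training_step(self, batch, batch_idx):
        input_ids, labels = batch
        return self.model(input_ids=input_ids, labels=labels).loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-5)


if __name__ == "__main__":
    # Tiny random batches just to exercise the training loop.
    data = TensorDataset(
        torch.randint(0, 1000, (8, 16)),
        torch.randint(0, 1000, (8, 16)),
    )
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,
        strategy="deepspeed_stage_3",
        precision=16,
        max_steps=2,
    )
    trainer.fit(T5Module(), DataLoader(data, batch_size=2))
```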
Expected behavior
It is expected to start training.
Environment
Additional context
cc @SeanNaren @awaelchli @rohitgr7 @akihironitta