Multi node deepspeed can't load_from_checkpoint
#11947
Comments
load_from_checkpoint
However,
hey @thomas-happify we removed To ensure I understand what the issue is: it seems that DeepSpeed is saving the corresponding optim states on local drives that do not know of each other? Or is it a shared drive?
@SeanNaren
@SeanNaren do you face this issue in a multi-node setting? Or is this Azure-specific?
hey @thomas-happify I'll try to find time to think of what the right solution is. I think it makes sense that all processes save their own individual shard onto the disk; however, in a perfect world it would be possible for all processes to save to a shared disk (thus making the checkpoint available on all machines). I think this is doable via an NFS drive, but I'm unsure how viable it is in your setup.
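For anyone debugging this, a quick way to confirm whether every rank's shard is visible from a given node is to list the optimizer-state files in the checkpoint directory. The following is only a minimal sketch: it assumes the shard names follow the zero_pp_rank_<rank>_mp_rank_00_optim_states.pt pattern seen in this issue, and the helper names and the checkpoint_dir argument are hypothetical, not part of Lightning or DeepSpeed.

```python
from pathlib import Path


def expected_optim_shards(world_size, mp_rank=0):
    """Build the shard file names we expect, one per data-parallel rank.

    Assumption: names follow the pattern seen in this issue,
    e.g. zero_pp_rank_1_mp_rank_00_optim_states.pt.
    """
    return [
        f"zero_pp_rank_{rank}_mp_rank_{mp_rank:02d}_optim_states.pt"
        for rank in range(world_size)
    ]


def missing_optim_shards(checkpoint_dir, world_size):
    """Return the expected shard names that are not visible under checkpoint_dir.

    The directory is searched recursively because the exact sub-folder
    layout of the checkpoint is not assumed here.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        # Nothing is visible from this node at all.
        return expected_optim_shards(world_size)
    present = {path.name for path in root.rglob("*_optim_states.pt")}
    return [name for name in expected_optim_shards(world_size) if name not in present]
```

Running missing_optim_shards on each node with the total number of processes as world_size should return an empty list everywhere if the checkpoint lives on a shared drive. If node 0 reports the rank 1 shard as missing, each node is probably writing to its own local disk.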
Thx @SeanNaren !
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
Hi, any update here? I ran into a similar problem, and only rank 0 is saved for both optim states and model states.
Hi, have you solved the problem? I have the same problem.
I encountered this issue too. Any updates?
🐛 Bug
load_from_checkpoint() doesn't work under multi-node training.
It looks like zero_pp_rank_1_mp_rank_00_optim_states.pt is stored on node 1, but the trainer.is_global_zero process doesn't have access to it.
Here are the full logs:
node_0_log.txt
node_1_log.txt
Please feel free to contact me if you need temp access to Azure computes and I will work something out.
Thanks!
To Reproduce
Expected behavior
load_from_checkpoint() should be able to load a DeepSpeed checkpoint.
Environment
Additional context
cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @SeanNaren @akihironitta