Cannot save and load checkpoints with FSDPStrategy #17702
Comments
I think saving the full state dict only on rank 0 is the proper behavior (though Fabric provides an option to switch between full and sharded state). This issue is also mentioned in #16815; however, I can't reproduce it now.
The problem I encounter is that only the first shard is saved, which is useless. Is the issue solved on your side? (Is that what you mean by "can't reproduce it"?)
I tested your code and can't reproduce the error. It seems the weights are stored properly.
Which version of Lightning are you using?
master; tried on both 2x 3090 and 8x V100.
Considering the implementation of lightning_module_state_dict() in FSDPStrategy, this was the expected behavior; changing rank0_only to False made it work as I wanted.
Why would you like to move the full state dict to all ranks? It's not a sharded state dict.
In this particular case, yes. But with my config (I tried several releases and master), the default FSDP implementation saved only N/k parameters (N being the number of model parameters and k the number of shards), i.e. only shard 0, which is not usable. Do you get several checkpoints, one for each shard, with my first code?
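To make the N/k point concrete, here is a toy, single-process sketch of the difference between writing out only the local shard and gathering a full state dict. It does not use torch or Lightning; the helpers `shard` and `full_state_dict` are hypothetical names, and the `rank0_only` flag only imitates the behavior discussed in this thread (full dict on rank 0, empty dict on the other ranks) under that assumption.

```python
# Toy model of sharded vs. full state dicts (no torch, no GPUs).
# All names are hypothetical and only mirror the terms used in this thread.

def shard(params, world_size):
    """Split a flat parameter dict into `world_size` contiguous shards."""
    names = sorted(params)
    per_rank = -(-len(names) // world_size)  # ceiling division
    return [
        {n: params[n] for n in names[r * per_rank:(r + 1) * per_rank]}
        for r in range(world_size)
    ]

def full_state_dict(shards, rank, rank0_only=True):
    """Gather every shard; with rank0_only, non-zero ranks get an empty dict."""
    if rank0_only and rank != 0:
        return {}
    full = {}
    for s in shards:
        full.update(s)
    return full

params = {f"layer{i}.weight": float(i) for i in range(8)}  # N = 8 parameters
shards = shard(params, world_size=2)                       # k = 2 shards

# Saving only the local shard on rank 0 keeps N/k entries.
local_only = shards[0]

# Gathering first gives rank 0 all N entries; rank 1 gets nothing to save.
on_rank0 = full_state_dict(shards, rank=0)
on_rank1 = full_state_dict(shards, rank=1)

# With rank0_only=False every rank holds the full dict.
on_rank1_all = full_state_dict(shards, rank=1, rank0_only=False)
```

In this sketch a checkpoint written from `local_only` cannot be reloaded on a different number of ranks, while one written from `on_rank0` can, since it no longer depends on how the parameters were split.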
So, in your case, it's not recommended to use
Bug description
Hello,
I must be missing something, but when using FSDPStrategy with 2 GPUs and the following code, I encounter several problems:
Is it possible to save a full state dict with FSDP (and to load the model afterwards on a different number of GPUs)?
Thank you for your help
What version are you seeing the problem on?
master
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
No response
cc @awaelchli @carmocca