Model restore fails from stored checkpoint when using Deepspeed #7282
Labels
3rd party
Related to a 3rd-party
bug
Something isn't working
distributed
Generic distributed-related topic
help wanted
Open to be worked on
priority: 1
Medium priority task
🐛 Bug
I am trying to restore a checkpoint to resume training, but the restore fails with the exceptions below.
To Reproduce
Run the following model with the restore argument commented out, then run it again with the restore argument uncommented. The exception appears on the second run.
Expected behavior
Training resumes successfully from the stored checkpoint.
Environment
PyTorch Lightning: 1.2.10, 1.3.0.rc1, and master
PyTorch: 1.7.1
OS: Ubuntu 18.04
@SeanNaren As discussed on Slack ^^