Resume training #13

My training stopped because my PC accidentally shut down. Is it possible to resume the training? If yes, how would I do it?

Comments
Please correct me if I am wrong. This project uses the pytorch-accelerated framework, and I do not see checkpointing activated by default, so unfortunately I doubt the interrupted run can be recovered. You have to enable checkpointing manually beforehand with the trainer; see https://pytorch-accelerated.readthedocs.io/en/latest/callbacks.html.
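For reference, a rough sketch of enabling checkpoint saving up front with a pytorch-accelerated callback. The SaveBestModelCallback name and its arguments are my reading of the linked callback docs and may need adjusting; model, loss_func, optimizer and the dataset variables are placeholders from your own training script:

```python
# Sketch only: save the best model during training via a callback,
# so an interrupted run leaves a checkpoint behind to resume from.
from pytorch_accelerated import Trainer
from pytorch_accelerated.callbacks import SaveBestModelCallback

trainer = Trainer(
    model=model,          # your model instance
    loss_func=loss_func,
    optimizer=optimizer,
    callbacks=[
        SaveBestModelCallback(save_path="best_model.pt", save_optimizer=True),
        # ...plus whatever other callbacks your script already uses
    ],
)

trainer.train(
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    num_epochs=num_epochs,
    per_device_batch_size=batch_size,
)
```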
From the example provided in train_cars.py, the best model is saved as "best_model.pt". You can use the trainer's load_checkpoint method to load it and resume from there.
Following @varshanth's suggestion, I tried using the load_checkpoint method before trainer.train to resume my training. Something like this:
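Roughly, the attempt looked like the following sketch (this is not the exact original snippet; it assumes trainer is a pytorch-accelerated Trainer and that "best_model.pt" is the checkpoint written during the earlier run):

```python
# Sketch: restore state from the saved checkpoint, then continue training.
trainer = Trainer(model=model, loss_func=loss_func, optimizer=optimizer)

# Load model (and, if saved, optimizer) state before starting the new run
trainer.load_checkpoint("best_model.pt")

trainer.train(
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    num_epochs=remaining_epochs,
    per_device_batch_size=batch_size,
)
```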
But this error occurred:
Is my implementation correct?
The error you received basically says that the model you instantiated and the model you loaded through the checkpoint are not the same. There is a mismatch between the parameter shapes the optimizer was tracking for its momentum buffers and the parameters loaded from the checkpoint for that particular layer. Can you please double-check that the model you trained and the model you loaded are the same, with no changes made between the save and the load?
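One way to double-check is to compare the shapes in the freshly built model against the checkpoint on disk. A small sketch, assuming the checkpoint is a dict containing a model_state_dict entry (the key name may differ in your setup):

```python
import torch

# Compare the instantiated model's parameters with the saved checkpoint.
checkpoint = torch.load("best_model.pt", map_location="cpu")
saved_state = checkpoint.get("model_state_dict", checkpoint)  # key name may vary

for name, param in model.state_dict().items():
    if name not in saved_state:
        print(f"missing in checkpoint: {name}")
    elif saved_state[name].shape != param.shape:
        print(f"shape mismatch at {name}: {tuple(saved_state[name].shape)} "
              f"vs {tuple(param.shape)}")
```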
Hi @varshanth, I confirm that the trained model and the loaded model are the same, with no changes made. However, when printing out the
Output:
Could this be the issue?