ModelCheckpoint does not save checkpoint on training end #8126
Comments
Dear @GuillaumeTong, Thanks for raising this issue. This is currently being baked. We are working on Fault Tolerant Training and we will enable this pretty soon. Best,
@GuillaumeTong - I believe this issue is highlighting 2 different requests.
Not yet, but soon; progress tracking is not yet integrated.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
Status update: we save a dedicated checkpoint on exception to resume fault tolerance. However, this is still experimental, and the saving logic and naming are bound to change.
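For reference, switching on the experimental feature at the time looked roughly like this; a minimal sketch, assuming a PyTorch Lightning 1.5-era release (the environment variable was the documented toggle as far as I know, so check the docs for your exact version):

```python
import os

# Experimental in PyTorch Lightning ~1.5: with this set before the Trainer
# runs, an interrupted run writes a dedicated checkpoint to resume from.
# The variable name may change between releases; verify against your docs.
os.environ["PL_FAULT_TOLERANT_TRAINING"] = "1"
```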
Dear @GuillaumeTong, Have you tried the fault tolerant training feature? Best,
Hi @tchaton,
Dear @GuillaumeTong, I believe your approach is best for your specific case. I will be closing this issue. Best,
🚀 Feature
See title: ModelCheckpoint should save a final checkpoint when training ends.
Motivation
When training finishes, whether through a keyboard interrupt, an unexpected error, reaching the end of the intended training period, or any other means, it is very desirable to keep a checkpoint of the most recent training state.
Pitch
Imagine you need to interrupt the current training run, but the last checkpoint was saved hours ago and the next one is 3000 steps away. You need the system to drop a checkpoint for you when you stop the training.
Alternatives
Users could subclass ModelCheckpoint themselves and implement an on_fit_end hook, as sketched below.
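A minimal sketch of that alternative, assuming a recent PyTorch Lightning release. The class name, the `checkpoints/` directory, and the `final.ckpt` filename are arbitrary choices for illustration, not part of the library:

```python
import os

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class FinalCheckpoint(ModelCheckpoint):
    """ModelCheckpoint variant that also saves once when fit ends."""

    def on_fit_end(self, trainer, pl_module):
        # `dirpath` comes from the constructor call below; `final.ckpt`
        # is an arbitrary filename chosen for this sketch.
        trainer.save_checkpoint(os.path.join(self.dirpath, "final.ckpt"))


# Hypothetical usage: passed to the Trainer like any other callback.
trainer = pl.Trainer(callbacks=[FinalCheckpoint(dirpath="checkpoints/")])
```

Depending on the release, `on_fit_end` may not fire on every exit path (e.g. a hard crash), so a hook such as `on_exception` might also be needed to cover interrupts.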