
ModelCheckpoint does not save checkpoint on training end #8126

Closed
GuillaumeTong opened this issue Jun 25, 2021 · 8 comments
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on), priority: 2 (Low priority task), waiting on author (Waiting on user action, correction, or update)
Milestone: v1.5

Comments

@GuillaumeTong commented Jun 25, 2021

🚀 Feature

See title

Motivation

When training ends, whether through a keyboard interrupt, an unexpected error, reaching the end of the intended training period, or any other means, it is very desirable to keep a checkpoint of the most recent training state.

Pitch

Imagine you need to interrupt the current training, but the last checkpoint was made hours ago, and you cannot wait for the next checkpoint to be saved in 3000 more steps. You need the system to drop a checkpoint for you when you stop the training.

Alternatives

Users could extend ModelCheckpoint themselves with an on_fit_end hook, as in the sketch below.
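
For illustration only, a minimal sketch of this alternative, assuming the standard Callback.on_fit_end hook and the ModelCheckpoint.save_checkpoint(trainer) method available in PyTorch Lightning 1.x; the subclass name is made up:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class FitEndCheckpoint(ModelCheckpoint):  # hypothetical subclass name
    def on_fit_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        # Save one last checkpoint right before Trainer.fit() returns.
        self.save_checkpoint(trainer)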

GuillaumeTong added the feature and help wanted labels on Jun 25, 2021
@tchaton (Contributor) commented Jun 28, 2021

Dear @GuillaumeTong,

Thanks for raising this issue. This is currently in progress: we are working on Fault Tolerant Training and will enable this soon.
As there are many challenges, we are making sure it is done properly.

Best,
T.C

tchaton mentioned this issue on Jun 28, 2021
@ananthsub (Contributor) commented:

@GuillaumeTong - I believe this issue is highlighting 2 different requests:

  1. Saving the checkpoint at the end of training (e.g. on_fit_end or on_train_end). This I believe we should do in the ModelCheckpoint callback if save_last=True. #6671 (Consolidate Training End Model Checkpoint, blocked by #6997) started this, but we held off for the progress tracking items (@carmocca @awaelchli should we revisit this now?)

  2. Saving the checkpoint in case of errors. This is much trickier, and was actually in Lightning before. However, saving checkpoints can involve collective operations. If one rank receives an exception, the error handling path that invokes checkpoint saving will lead to hangs and timeouts, as not all processes participate in the collective call. In turn, this obscures the original error/stacktrace, which makes debugging much more difficult. A minimal illustration of this failure mode follows below.
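
A minimal standalone sketch of the failure mode described in point 2 (not Lightning internals): one rank raises before a collective call that checkpoint saving would rely on, so the surviving rank only ever sees a barrier timeout rather than the original error. It uses the CPU-only gloo backend; the port, timeout, and world size are arbitrary choices for this sketch.

import os
from datetime import timedelta
from multiprocessing import Process

import torch.distributed as dist


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 1:
        # This rank fails before it ever reaches the collective below.
        raise RuntimeError("rank 1 failed before the checkpoint collective")
    # Rank 0 waits for all ranks and eventually raises a timeout error
    # instead of surfacing rank 1's original exception.
    dist.monitored_barrier(timeout=timedelta(seconds=5))


if __name__ == "__main__":
    processes = [Process(target=worker, args=(rank, 2)) for rank in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()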

@carmocca (Contributor) commented:

> #6671 started this, but we held off for the progress tracking items (@carmocca @awaelchli should we revisit this now?)

Not yet, but soon: progress tracking is not yet integrated.

> the error handling path that invokes checkpoint saving will lead to hangs and timeouts, as not all processes participate in the collective call

@tchaton added #8167 for this.

stale bot commented Jul 30, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Jul 30, 2021
carmocca added this to the v1.5 milestone on Jul 30, 2021
stale bot removed the won't fix label on Jul 30, 2021
@carmocca (Contributor) commented:

Status update: we now save a dedicated checkpoint on exception so that fault-tolerant training can resume. However, this is still experimental, and the saving logic and naming are bound to change.

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L1078

https://github.com/PyTorchLightning/pytorch-lightning/blob/529c42f848055c80ea429d0dda6012bb5304a365/pytorch_lightning/trainer/trainer.py#L1325-L1330
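
For reference, a rough sketch of how the experimental feature could be exercised at the time (PyTorch Lightning ~1.5): fault-tolerant training was opted into through the PL_FAULT_TOLERANT_TRAINING environment variable, and an interrupted run could be resumed by pointing fit() at the auto-saved checkpoint. The module, data, and checkpoint filename below are illustrative placeholders, not the library's own example.

import os

os.environ["PL_FAULT_TOLERANT_TRAINING"] = "1"  # experimental opt-in flag at the time

import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl


class TinyModule(pl.LightningModule):  # placeholder module for the sketch
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def make_loader() -> DataLoader:
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    return DataLoader(data, batch_size=8)


if __name__ == "__main__":
    trainer = pl.Trainer(max_epochs=2)
    # If a previous run was interrupted, the auto-saved checkpoint (named
    # ".pl_auto_save.ckpt" at the time, subject to change) could be passed back in:
    # trainer.fit(TinyModule(), make_loader(), ckpt_path=".pl_auto_save.ckpt")
    trainer.fit(TinyModule(), make_loader())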

@tchaton (Contributor) commented Oct 6, 2021

Dear @GuillaumeTong,

Have you tried the fault-tolerant training feature?

Best,
T.C

tchaton added the waiting on author and priority: 2 labels on Oct 6, 2021
@GuillaumeTong (Author) commented:

Hi @tchaton,
Sorry, I have not tried the fault-tolerant feature.
My main issue is with the "Saving the checkpoint at the end of training (e.g. on_fit_end or on_train_end)" part, as @ananthsub puts it.
I ended up writing my own extension of ModelCheckpoint, shown below, to handle training end, together with a try/except block around my training script to handle errors. As far as I understand, this mostly serves me better than what the fault-tolerant feature alone offers.

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class CustomModelCheckpoint(ModelCheckpoint):
    def on_train_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        # Save the latest state when training finishes normally.
        self.save_checkpoint(trainer)

    def on_keyboard_interrupt(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        # Save the latest state when the run is stopped with Ctrl+C.
        self.save_checkpoint(trainer)

@tchaton (Contributor) commented Oct 6, 2021

Dear @GuillaumeTong,

I believe your approach is best for your specific case. I will be closing this issue.

Best,
T.C

tchaton closed this as completed on Oct 6, 2021