
ModelCheckpoint does not save checkpoint on training end #8126

Closed
GuillaumeTong opened this issue Jun 25, 2021 · 8 comments
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on), priority: 2 (Low priority task), waiting on author (Waiting on user action, correction, or update)
Milestone: v1.5

Comments

@GuillaumeTong commented Jun 25, 2021

🚀 Feature

See title

Motivation

When training ends, whether through a keyboard interrupt, an unexpected error, reaching the end of the intended training period, or any other means, it is very desirable to keep a checkpoint of the most recent training state.

Pitch

Imagine you need to interrupt the current training, but the last checkpoint was made hours ago, and you cannot wait for the next checkpoint to be saved in 3000 more steps. You need the system to drop a checkpoint for you when you stop the training.

Alternatives

Users could extend ModelCheckpoint themselves with an on_fit_end hook, as in the sketch below.
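
For illustration only, a minimal sketch of this alternative, assuming the standard Callback.on_fit_end hook and the ModelCheckpoint.save_checkpoint(trainer) method available in PyTorch Lightning 1.x; the subclass name is made up:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class FitEndCheckpoint(ModelCheckpoint):  # hypothetical subclass name
    def on_fit_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        # Save one last checkpoint right before Trainer.fit() returns.
        self.save_checkpoint(trainer)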

GuillaumeTong added the feature and help wanted labels on Jun 25, 2021
@tchaton (Contributor) commented Jun 28, 2021

Dear @GuillaumeTong,

Thanks for raising this issue. This is currently in progress: we are working on Fault Tolerant Training and will enable this soon.
As there are many challenges, we are making sure it is done properly.

Best,
T.C

tchaton mentioned this issue on Jun 28, 2021
@ananthsub (Contributor) commented:

@GuillaumeTong - I believe this issue is highlighting 2 different requests:

  1. Saving the checkpoint at the end of training (e.g. on_fit_end or on_train_end). This I believe we should do in the ModelCheckpoint callback if save_last=True. #6671 (Consolidate Training End Model Checkpoint, blocked by #6997) started this, but we held off for the progress tracking items (@carmocca @awaelchli should we revisit this now?)

  2. Saving the checkpoint in case of errors. This is much trickier, and was actually in Lightning before. However, saving checkpoints can involve collective operations. If one rank receives an exception, the error handling path that invokes checkpoint saving will lead to hangs and timeouts, as not all processes participate in the collective call. In turn, this obscures the original error/stacktrace, which makes debugging much more difficult. A minimal illustration of this failure mode follows below.
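
A minimal standalone sketch of the failure mode described in point 2 (not Lightning internals): one rank raises before a collective call that checkpoint saving would rely on, so the surviving rank only ever sees a barrier timeout rather than the original error. It uses the CPU-only gloo backend; the port, timeout, and world size are arbitrary choices for this sketch.

import os
from datetime import timedelta
from multiprocessing import Process

import torch.distributed as dist


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 1:
        # This rank fails before it ever reaches the collective below.
        raise RuntimeError("rank 1 failed before the checkpoint collective")
    # Rank 0 waits for all ranks and eventually raises a timeout error
    # instead of surfacing rank 1's original exception.
    dist.monitored_barrier(timeout=timedelta(seconds=5))


if __name__ == "__main__":
    processes = [Process(target=worker, args=(rank, 2)) for rank in range(2)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()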

@carmocca (Contributor) commented:

> #6671 started this, but we held off for the progress tracking items (@carmocca @awaelchli should we revisit this now?)

Not yet, but soon: progress tracking is not yet integrated.

> the error handling path that invokes checkpoint saving will lead to hangs and timeouts, as not all processes participate in the collective call

@tchaton added #8167 for this.

stale bot commented Jul 30, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label on Jul 30, 2021
carmocca added this to the v1.5 milestone on Jul 30, 2021
stale bot removed the won't fix label on Jul 30, 2021
@carmocca (Contributor) commented:

Status update: we now save a dedicated checkpoint on exception so that fault-tolerant training can resume. However, this is still experimental, and the saving logic and naming are bound to change.

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L1078

https://github.com/PyTorchLightning/pytorch-lightning/blob/529c42f848055c80ea429d0dda6012bb5304a365/pytorch_lightning/trainer/trainer.py#L1325-L1330
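
For reference, a rough sketch of how the experimental feature could be exercised at the time (PyTorch Lightning ~1.5): fault-tolerant training was opted into through the PL_FAULT_TOLERANT_TRAINING environment variable, and an interrupted run could be resumed by pointing fit() at the auto-saved checkpoint. The module, data, and checkpoint filename below are illustrative placeholders, not the library's own example.

import os

os.environ["PL_FAULT_TOLERANT_TRAINING"] = "1"  # experimental opt-in flag at the time

import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl


class TinyModule(pl.LightningModule):  # placeholder module for the sketch
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def make_loader() -> DataLoader:
    data = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    return DataLoader(data, batch_size=8)


if __name__ == "__main__":
    trainer = pl.Trainer(max_epochs=2)
    # If a previous run was interrupted, the auto-saved checkpoint (named
    # ".pl_auto_save.ckpt" at the time, subject to change) could be passed back in:
    # trainer.fit(TinyModule(), make_loader(), ckpt_path=".pl_auto_save.ckpt")
    trainer.fit(TinyModule(), make_loader())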

@tchaton (Contributor) commented Oct 6, 2021

Dear @GuillaumeTong,

Have you tried the fault-tolerant training feature?

Best,
T.C

tchaton added the waiting on author and priority: 2 labels on Oct 6, 2021
@GuillaumeTong (Author) commented:

Hi @tchaton,
Sorry, I have not tried the fault-tolerant feature.
My main issue is with the "Saving the checkpoint at the end of training (e.g. on_fit_end or on_train_end)" part, as @ananthsub puts it.
I ended up writing my own extension of ModelCheckpoint, shown below, to handle training end, together with a try/except block around my training script to handle errors. As far as I understand, this mostly serves me better than what the fault-tolerant feature alone offers.

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class CustomModelCheckpoint(ModelCheckpoint):
    def on_train_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        # Save the latest state when training finishes normally.
        self.save_checkpoint(trainer)

    def on_keyboard_interrupt(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
        # Save the latest state when the run is stopped with Ctrl+C.
        self.save_checkpoint(trainer)

@tchaton (Contributor) commented Oct 6, 2021

Dear @GuillaumeTong,

I believe your approach is best for your specific case. I will be closing this issue.

Best,
T.C

tchaton closed this as completed on Oct 6, 2021