Resume training with resetting / increasing max number of epochs #2823
Labels
- feature (Is an improvement or enhancement)
- help wanted (Open to be worked on)
- won't fix (This will not be worked on)
Hi! I would like to know how one can continue training from an existing checkpoint when resuming restores the saved learning rate, current epoch, and other state, which causes training to stop immediately.
Let's say I train a classifier with ReduceLROnPlateau and save the best epoch via the ModelCheckpoint callback. I set `max_epochs` to 10 and train; the metric plateaus for 5 epochs, the LR scheduler fires at epoch 9, and the metric improves. So the learning rate is reduced at epoch 10, and the best checkpoint also points to epoch 10. Then I resume training from this checkpoint, again with `max_epochs` set to 10 and intending to start from a different learning rate. But all I get is: the current epoch is restored as 10, the learning rate reverts to the value it had when the checkpoint was saved, and training stops because epoch 10 is the last one. How can we improve such situations?

This would also be very useful when training in stages. You might have a first stage that pretrains for 100 epochs and then want to train for another 50 epochs on a different dataset, etc., but the best checkpoint might land at, say, epoch 77, and then you will not be able to run the second stage because `max_epochs` would be set to 50.
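For concreteness, here is a minimal sketch of the first scenario, assuming the PyTorch Lightning 1.x-era API in which `resume_from_checkpoint` is still a `Trainer` argument (newer releases pass `ckpt_path` to `trainer.fit` instead); the model and data are toy stand-ins for illustration:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class Classifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.cross_entropy(self.layer(x), y))

    def configure_optimizers(self):
        opt = torch.optim.Adam(self.parameters(), lr=1e-3)
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt)
        # ReduceLROnPlateau needs a metric to watch
        return [opt], [{"scheduler": sched, "monitor": "val_loss"}]

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
        return DataLoader(data, batch_size=16)

    def val_dataloader(self):
        return self.train_dataloader()

# First run: up to 10 epochs, best epoch kept by ModelCheckpoint.
ckpt = pl.callbacks.ModelCheckpoint(monitor="val_loss", save_top_k=1)
trainer = pl.Trainer(max_epochs=10, callbacks=[ckpt])
trainer.fit(Classifier())

# Resume with the same max_epochs: the checkpoint restores the epoch
# counter at the limit, so the Trainer considers training finished and
# exits almost immediately.
trainer = pl.Trainer(max_epochs=10, resume_from_checkpoint=ckpt.best_model_path)
trainer.fit(Classifier())

# Workaround today: raise max_epochs on the resumed Trainer to leave
# room, e.g. max_epochs=20 allows up to 10 more epochs.
trainer = pl.Trainer(max_epochs=20, resume_from_checkpoint=ckpt.best_model_path)
trainer.fit(Classifier())
```

The only workaround I see right now is the last block above: manually bumping `max_epochs` relative to the restored epoch counter, which is exactly the bookkeeping this issue asks to avoid. An option to reset the restored epoch counter, or to interpret `max_epochs` as additional epochs on resume, would cover both scenarios.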