
Resume training with resetting / increasing max number of epochs #2823

Closed
thepowerfuldeez opened this issue Aug 4, 2020 · 3 comments

Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on), won't fix (This will not be worked on)

Comments

@thepowerfuldeez

Hi! I would like to know how one can continue training from an existing checkpoint, given that resuming restores the saved learning rate, current epoch, and other state, which can stop training immediately.
Let's say I train a classifier with ReduceLROnPlateau and save the best epoch via the ModelCheckpoint callback. I set max_epochs to 10 and train; the metric plateaus after about 5 epochs, the LR scheduler is triggered at epoch 9, and the metric improves. So the learning rate is reduced at the 10th epoch and the best checkpoint also comes from the 10th epoch.

Then I resume training from this checkpoint. I again set max_epochs to 10 and want to start from a different learning rate. But what I get is the current epoch restored to 10, the learning rate reset to the value it had when the checkpoint was saved, and training stopping immediately because 10 is the last epoch. How can we improve such situations?

This would also be very useful when training in stages. You might have a first stage that pretrains for 100 epochs and then want to train for another 50 epochs on another dataset, but if your checkpoint is from, say, epoch 77, you will not be able to run the second stage because max_epochs would be set to 50.
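
For the single-run case, a minimal sketch of the usual workaround: resume from the checkpoint but raise max_epochs above the epoch stored in it, so the restored epoch counter no longer hits the stopping condition. This assumes a Lightning version where fit() accepts ckpt_path (older releases used the Trainer(resume_from_checkpoint=...) argument instead); the model, dataloader, and checkpoint path below are placeholders:

    import torch
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, TensorDataset

    # Tiny stand-in LightningModule so the sketch is self-contained; use your own model.
    class TinyClassifier(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = torch.nn.functional.cross_entropy(self.layer(x), y)
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

    train_loader = DataLoader(
        TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,))),
        batch_size=16,
    )

    model = TinyClassifier()
    # Set max_epochs past the epoch saved in the checkpoint (10 in the example above),
    # otherwise the restored epoch counter stops training immediately.
    trainer = pl.Trainer(max_epochs=20)
    # "checkpoints/best.ckpt" is a placeholder path; on older releases pass
    # Trainer(resume_from_checkpoint="checkpoints/best.ckpt") instead of ckpt_path.
    trainer.fit(model, train_loader, ckpt_path="checkpoints/best.ckpt")

With this, the restored loop continues from epoch 10 up to the new max_epochs instead of exiting right away.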

@thepowerfuldeez added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels on Aug 4, 2020
@Borda
Member

Borda commented Aug 4, 2020

we have a discussion about it in #2146
cc: @williamFalcon

@stale

stale bot commented Oct 22, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix (This will not be worked on) label on Oct 22, 2020
@stale stale bot closed this as completed Oct 29, 2020
@hwidong-na

I want to run a small number of epochs for each outer-loop iteration. Here is a workaround that calls reset_on_epoch():

    ...
    trainer = pl.Trainer(
        ...
        max_epochs=num_inner_epochs,
        ...
    )
    for epoch in range(num_outer_epochs):
        # reset the fit loop's epoch counter so this iteration can train for another num_inner_epochs
        trainer.fit_loop.epoch_progress.reset_on_epoch()
        ...
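
A slightly fuller sketch of the same pattern, assuming the elided parts construct the Trainer once and call trainer.fit(model) on each outer iteration, and that the installed Lightning version exposes trainer.fit_loop.epoch_progress; the model and the epoch counts are placeholders:

    import pytorch_lightning as pl

    num_inner_epochs = 2   # epochs run per fit() call
    num_outer_epochs = 5   # how many times training is repeated

    trainer = pl.Trainer(max_epochs=num_inner_epochs)
    for outer_epoch in range(num_outer_epochs):
        # clear the epoch counter so the next fit() call can run num_inner_epochs again
        trainer.fit_loop.epoch_progress.reset_on_epoch()
        # e.g. swap dataloaders or adjust hyperparameters between stages here
        trainer.fit(model)  # "model" is any LightningModule (placeholder)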
