Resume training with resetting / increasing max number of epochs #2823
Labels
- feature (Is an improvement or enhancement)
- help wanted (Open to be worked on)
- won't fix (This will not be worked on)
Hi! I would like to know how one can continue training from an existing checkpoint when resuming restores the saved learning rate, current epoch, and other state, which causes training to stop immediately.
Let's say I train a classifier with ReduceLROnPlateau and save the best epoch via the ModelCheckpoint callback. I set `max_epochs` to 10 and train; the metric plateaus for 5 epochs, the LR scheduler fires at epoch 9, and the metric improves. So the learning rate is reduced at epoch 10, and the best checkpoint also points to epoch 10. Then I resume training from this checkpoint, again with `max_epochs` set to 10 and intending to start from a different learning rate. But all I get is: the current epoch is restored as 10, the learning rate reverts to the value it had when the checkpoint was saved, and training stops because epoch 10 is the last one. How can we improve such situations?

This would also be very useful when training in stages. You might have a first stage that pretrains for 100 epochs and then want to train for another 50 epochs on a different dataset, etc., but the best checkpoint might land at, say, epoch 77, and then you will not be able to run the second stage because `max_epochs` would be set to 50.
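For concreteness, here is a minimal sketch of the first scenario, assuming the PyTorch Lightning 1.x-era API in which `resume_from_checkpoint` is still a `Trainer` argument (newer releases pass `ckpt_path` to `trainer.fit` instead); the model and data are toy stand-ins for illustration:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class Classifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.cross_entropy(self.layer(x), y))

    def configure_optimizers(self):
        opt = torch.optim.Adam(self.parameters(), lr=1e-3)
        sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt)
        # ReduceLROnPlateau needs a metric to watch
        return [opt], [{"scheduler": sched, "monitor": "val_loss"}]

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
        return DataLoader(data, batch_size=16)

    def val_dataloader(self):
        return self.train_dataloader()

# First run: up to 10 epochs, best epoch kept by ModelCheckpoint.
ckpt = pl.callbacks.ModelCheckpoint(monitor="val_loss", save_top_k=1)
trainer = pl.Trainer(max_epochs=10, callbacks=[ckpt])
trainer.fit(Classifier())

# Resume with the same max_epochs: the checkpoint restores the epoch
# counter at the limit, so the Trainer considers training finished and
# exits almost immediately.
trainer = pl.Trainer(max_epochs=10, resume_from_checkpoint=ckpt.best_model_path)
trainer.fit(Classifier())

# Workaround today: raise max_epochs on the resumed Trainer to leave
# room, e.g. max_epochs=20 allows up to 10 more epochs.
trainer = pl.Trainer(max_epochs=20, resume_from_checkpoint=ckpt.best_model_path)
trainer.fit(Classifier())
```

The only workaround I see right now is the last block above: manually bumping `max_epochs` relative to the restored epoch counter, which is exactly the bookkeeping this issue asks to avoid. An option to reset the restored epoch counter, or to interpret `max_epochs` as additional epochs on resume, would cover both scenarios.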