Description
The saved epoch number seems to be wrong, though I am not sure whether the mistake is mine.
Specifically, I first train my model for 2 epochs with the following code:
exp = Experiment(save_dir='.')
trainer = Trainer(experiment=exp, max_nb_epochs=2, gpus=[0], checkpoint_callback=checkpoint_callback)
trainer.fit(model)
During the first epoch, the progress bar shows epoch=0. After the first epoch finishes, it prints:
Epoch 00001: avg_val_loss improved from inf to 1.42368, saving model to checkpoints//_ckpt_epoch_1.ckpt
During the second epoch, the progress bar shows epoch=1. After the second epoch finishes, it prints:
Epoch 00002: avg_val_loss improved from 1.42368 to 1.23873, saving model to checkpoints//_ckpt_epoch_2.ckpt
At this point, I save exp with:
exp.save()
and the progress bar reads:
100%|████| 15000/15000 [04:31<00:00, 454.06it/s, avg_val_loss=1.24, batch_nb=12499, epoch=1, gpu=0, loss=1.283, v_nb=0]
And then, I want to continue my training with the following code:
new_exp = Experiment(save_dir='.', version=0)
new_trainer = Trainer(experiment=new_exp, max_nb_epochs=3, gpus=[0], checkpoint_callback=checkpoint_callback)
new_model = Net()
new_trainer.fit(new_model)
It starts at epoch=1 instead of epoch=2. Therefore, to reach new_trainer's max_nb_epochs=3, another 2 epochs will be run instead of 1.
The epoch number saved in exp looks wrong. After the first two epochs, the saved epoch number should be 2, but epoch=1 was saved, which causes the resumed training to start from epoch=1.
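To make the off-by-one concrete, here is a minimal sketch in plain Python. It does not use the library at all: `train`, `start_epoch`, and `saved_epoch` are hypothetical names, and I am assuming that the value being saved is the 0-based index of the last completed epoch.

```python
# Toy model of the resume logic; none of these names come from the library.

def train(start_epoch, max_nb_epochs):
    """Return the 0-based epoch indices that a run would execute."""
    return list(range(start_epoch, max_nb_epochs))

# First run: max_nb_epochs=2, so epochs 0 and 1 are trained.
first_run = train(start_epoch=0, max_nb_epochs=2)

# Assumption: what gets saved is the index of the last completed epoch.
saved_epoch = first_run[-1]

# Behaviour I observe: the resumed run starts at the saved index,
# so the second epoch is trained again.
observed = train(start_epoch=saved_epoch, max_nb_epochs=3)

# Behaviour I expect: the resumed run starts one past the saved index.
expected = train(start_epoch=saved_epoch + 1, max_nb_epochs=3)

print("first run:      ", first_run)
print("observed resume:", observed)
print("expected resume:", expected)
```

Under that assumption, the resumed run covers two epochs where I would expect one, which matches what I see.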
It really confused me. Looking forward to your help. Thanks.