The saved epoch number seems to be wrong? #296

@btyu

The saved epoch number seems to be wrong, though I am not sure whether the mistake is on my side.
Specifically, I first train my model for 2 epochs with the following code:

exp = Experiment(save_dir='.')
trainer = Trainer(experiment=exp, max_nb_epochs=2, gpus=[0], checkpoint_callback=checkpoint_callback)
trainer.fit(model)

During the first epoch, the progress bar shows epoch=0. When the first epoch finishes, it prints:

Epoch 00001: avg_val_loss improved from inf to 1.42368, saving model to checkpoints//_ckpt_epoch_1.ckpt

During the second epoch, the progress bar shows epoch=1. When the second epoch finishes, it prints:

Epoch 00002: avg_val_loss improved from 1.42368 to 1.23873, saving model to checkpoints//_ckpt_epoch_2.ckpt

At this point, I save exp with:

exp.save()

and it gives:

100%|████| 15000/15000 [04:31<00:00, 454.06it/s, avg_val_loss=1.24, batch_nb=12499, epoch=1, gpu=0, loss=1.283, v_nb=0]

Then I want to continue training with the following code:

new_exp = Experiment(save_dir='.', version=0)
new_trainer = Trainer(experiment=new_exp, max_nb_epochs=3, gpus=[0], checkpoint_callback=checkpoint_callback)
new_model = Net()
new_trainer.fit(new_model)

It starts with epoch=1 instead of epoch=2. As a result, reaching new_trainer's max_nb_epochs=3 takes another 2 epochs rather than the 1 epoch I expected.

So the epoch number in the saved exp appears to be off by one. After the first two epochs, the saved epoch number should be 2, but epoch=1 was saved, which causes the resumed training to start from epoch=1 and repeat the second epoch.
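
To make the suspected off-by-one concrete, here is a minimal sketch in plain Python. It does not use the library at all: `train`, `start_epoch`, and the other variable names are hypothetical, and it assumes the trainer simply loops over the epoch indices from the restored value up to max_nb_epochs, which I have not verified against the actual source.

```python
def train(start_epoch, max_nb_epochs):
    """Hypothetical stand-in for a training loop.

    Returns the 0-based epoch indices that would be run.
    """
    return list(range(start_epoch, max_nb_epochs))


# First run: two epochs, with 0-based indices 0 and 1.
first_run = train(start_epoch=0, max_nb_epochs=2)
last_epoch_index = first_run[-1]   # 0-based index of the last epoch that ran
completed_epochs = len(first_run)  # number of epochs that finished

# Resuming from the 0-based index of the last epoch repeats that epoch.
resumed_from_index = train(start_epoch=last_epoch_index, max_nb_epochs=3)

# Resuming from the number of completed epochs runs only the remaining one.
resumed_from_count = train(start_epoch=completed_epochs, max_nb_epochs=3)

print(resumed_from_index, resumed_from_count)
```

Under that assumption, restoring the last 0-based epoch index gives two more epochs (what I observe), while restoring the count of completed epochs gives one more epoch (what I expected).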

This really confuses me. Looking forward to your help. Thanks.
