Add version suffix to last.ckpt #5030

Closed
carmocca opened this issue Dec 8, 2020 · 11 comments · Fixed by #12902
Labels: callback: model checkpoint, feature (Is an improvement or enhancement)

Comments

@carmocca
Contributor

carmocca commented Dec 8, 2020

🚀 Feature

If you use ModelCheckpoint(save_last=True) and you run an experiment twice in the same directory, then this set of checkpoints is generated:

file-v0.ckpt
file-v1.ckpt
...
last.ckpt (the last of the second run)

The idea is to also add a version suffix to last.ckpt if it would otherwise be overwritten:

file-v0.ckpt
file-v1.ckpt
...
last-v0.ckpt (last of the first run)
last-v1.ckpt (last of the second run)
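A minimal sketch of the proposed naming, working on a plain set of filenames rather than on disk. It assumes that, on the first conflict, the existing unversioned file is renamed to the next free `-vN` name and the new file takes the version after it. The function name `plan_versioned_save` and its return shape are hypothetical and are not part of Lightning.

```python
import re


def plan_versioned_save(existing, stem, ext=".ckpt"):
    """Return (renames, new_name) such that saving `stem` overwrites nothing.

    `existing` is an iterable of filenames already in the checkpoint directory.
    `renames` maps old names to new names for files that must be moved first.
    Hypothetical helper for illustration only.
    """
    existing = set(existing)
    plain = f"{stem}{ext}"
    pattern = re.compile(rf"^{re.escape(stem)}-v(\d+){re.escape(ext)}$")

    # Collect the version numbers that are already taken for this stem.
    versions = []
    for name in existing:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))

    # No conflict at all: keep the unversioned name.
    if plain not in existing and not versions:
        return {}, plain

    renames = {}
    if plain in existing:
        # Assumption: the unversioned file moves to the next free version.
        moved_to = max(versions) + 1 if versions else 0
        renames[plain] = f"{stem}-v{moved_to}{ext}"
        versions.append(moved_to)

    return renames, f"{stem}-v{max(versions) + 1}{ext}"


print(plan_versioned_save({"last.ckpt"}, "last"))
```

With only `last.ckpt` present, the sketch plans a rename to `last-v0.ckpt` and saves the second run's file as `last-v1.ckpt`, matching the listing above.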

Motivation

Avoid overwriting existing checkpoints

Alternatives

Modifying the default CHECKPOINT_NAME_LAST to avoid the conflict

Additional context

Discussed in #5000 (comment)

To be done after #5008 is merged.

cc @Borda @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7

@carmocca carmocca added feature Is an improvement or enhancement help wanted Open to be worked on labels Dec 8, 2020
@carmocca
Contributor Author

carmocca commented Dec 9, 2020

@awaelchli answering #5000 (comment) here

How do you propose we fix this then?

@awaelchli
Member

awaelchli commented Dec 9, 2020

@carmocca Before I make a suggestion, could you help me understand the context a bit more? Is this issue a) about damage control for users who accidentally write to the same directory, or b) about a particular use case where a user does this on purpose?

I am asking for two reasons. 1) I notice that in your description you didn't append epoch numbers to the filenames, and the primary reason the save_last option exists is for sequential checkpoints numbered by some metric like epoch or step. 2) last.ckpt is always a copy of the last epoch, i.e. epoch_N.ckpt is identical to last.ckpt. Therefore it would be safe to overwrite last.ckpt in a subsequent run.

Note that situation a) cannot be solved by appending version suffixes to the files. Consider the case where the user changes the epoch size between runs: they would get a mix of files, and you wouldn't be able to tell whether epoch=x_step=y.ckpt or epoch=x_step=z.ckpt belonged to the first or second run. One would only be able to tell by inspecting the individual checkpoints for hyperparameters.

@awaelchli
Member

One may consider this option:

file-v0.ckpt
file-v1.ckpt
...
last-v0.ckpt (last of the first run)
last-v1.ckpt (last of the second run)
last.ckpt (copy of last-v1)

@carmocca
Contributor Author

carmocca commented Dec 9, 2020

It is about a)

I notice in your description you didn't append epoch numbers

Sorry, I didn't for brevity

last.ckpt is always a copy of the last epoch, i.e. epoch_N.ckpt is identical to last.ckpt. Therefore it would be safe to overwrite last.ckpt in a subsequent run.

That is not the case if you are monitoring something. For example, if you run an experiment for 10 epochs and your best performance is epoch_4.ckpt, last.ckpt corresponds to the tenth epoch. We discussed this in #4335 (comment)

Consider the case where the user changes the epoch size between runs: they would get a mix of files, and you wouldn't be able to tell whether epoch=x_step=y.ckpt or epoch=x_step=z.ckpt belonged to the first or second run. One would only be able to tell by inspecting the individual checkpoints for hyperparameters.

This is already impossible on master: version suffixes do not imply that all -v0 files correspond to the first run, all -v1 files to the second run, and so on. They are just a means of avoiding overwriting files.

@carmocca
Contributor Author

carmocca commented Dec 9, 2020

One may consider this option:

file-v0.ckpt
file-v1.ckpt
...
last-v0.ckpt (last of the first run)
last-v1.ckpt (last of the second run)
last.ckpt (copy of last-v1)

If last.ckpt must stay, then I would keep it as is and make version suffixes start at 1.

However, this would be inconsistent with what was discussed in #5000.

So should we just start versions at 1 and forget about renaming files?

file.ckpt (instead of file-v0.ckpt)
file-v1.ckpt
...
last.ckpt (last of the first run)
last-v1.ckpt (last of the second run)
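A minimal sketch of this alternative, assuming existing files are never renamed and the first conflicting save gets `-v1`. The helper name `first_free_name` is hypothetical, and the sketch works on a set of filenames rather than a real directory.

```python
def first_free_name(existing, stem, ext=".ckpt", start=1):
    """Return the first filename for `stem` that is not in `existing`.

    The unversioned name is tried first; after that, suffixes count up
    from `start`. Nothing already saved is renamed. Hypothetical helper.
    """
    existing = set(existing)
    candidate = f"{stem}{ext}"
    version = start
    while candidate in existing:
        candidate = f"{stem}-v{version}{ext}"
        version += 1
    return candidate


print(first_free_name({"last.ckpt"}, "last"))
```

The trade-off compared with renaming is that the unversioned file always belongs to the earliest save, so the newest checkpoint has to be found by looking for the highest suffix.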

cc @rohitgr7 @Borda

@awaelchli
Member

awaelchli commented Dec 9, 2020

For example, if you run an experiment for 10 epochs and your best performance is epoch_4.ckpt, last.ckpt corresponds to the tenth epoch.

This is correct behaviour. last.ckpt, as the name suggests, should point to the last saved checkpoint in a run; in this case that would be epoch 10 (epoch=9.ckpt). Again, last.ckpt is there for convenience: if users expect their run to be interrupted (for example manually), they can easily restore and continue training from last.ckpt without having to look up the exact filename. Having a file best.ckpt is a separate feature one could look into :)

@carmocca
Contributor Author

carmocca commented Dec 9, 2020

Exactly, but you might not have saved on your tenth epoch because your model wasn't in the top-k.

last.ckpt is always a copy of the last epoch, i.e. epoch_N.ckpt is identical to last.ckpt. Therefore it would be safe to overwrite last.ckpt in a subsequent run.

So epoch_N.ckpt doesn't exist but last.ckpt does. Then it is not safe to overwrite last.ckpt.
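The edge case can be illustrated with a small simulation. This is only a sketch under stated assumptions: the monitored metric is lower-is-better, only the top-k epochs keep an `epoch=N.ckpt` file, and `last.ckpt` always holds the final epoch. The function `simulate_run` and the loss values are made up for illustration and do not reproduce Lightning's implementation.

```python
def simulate_run(val_losses, save_top_k=1):
    """Return the set of files left after one run (lower loss is better).

    Hypothetical model of top-k checkpointing combined with save_last=True.
    """
    kept = []  # (loss, epoch) pairs currently in the top-k
    for epoch, loss in enumerate(val_losses):
        kept.append((loss, epoch))
        kept.sort()
        kept = kept[:save_top_k]  # checkpoints outside the top-k are dropped

    files = {f"epoch={epoch}.ckpt" for _, epoch in kept}
    files.add("last.ckpt")  # always written, holds the final epoch's state
    return files


# Ten epochs; the best validation loss occurs at epoch index 4.
losses = [0.9, 0.7, 0.6, 0.5, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8]
files = simulate_run(losses, save_top_k=1)
print(sorted(files))
print("epoch=9.ckpt" in files)
```

In this run the final epoch's state exists only in `last.ckpt`, so overwriting that file in a later run would lose it.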

@awaelchli
Member

These edge cases are terrible. Looks like the versioning is necessary then.

@rohitgr7
Contributor

rohitgr7 commented Dec 9, 2020

No strong opinion here. I am fine with either option in #5030 (comment): renaming or not, and 0 or 1 as the first version suffix.

@carmocca carmocca self-assigned this Dec 10, 2020
@carmocca carmocca added design Includes a design discussion checkpointing Related to checkpointing and removed help wanted Open to be worked on labels Dec 10, 2020
@Borda Borda added this to the 1.1.x milestone Dec 11, 2020
@Borda Borda modified the milestones: 1.1.x, 1.2 Dec 30, 2020
@edenlightning edenlightning modified the milestones: 1.2, 1.3 Feb 8, 2021
@edenlightning
Contributor

@carmocca is this done?

@carmocca
Contributor Author

carmocca commented Feb 16, 2021

@carmocca is this done?

No

@edenlightning edenlightning removed this from the v1.3 milestone Apr 27, 2021
@carmocca carmocca added this to the v1.5 milestone Jun 6, 2021
@carmocca carmocca removed the design Includes a design discussion label Jun 6, 2021
@awaelchli awaelchli modified the milestones: v1.5, v1.6 Nov 4, 2021
@carmocca carmocca added the help wanted Open to be worked on label Feb 3, 2022
@carmocca carmocca removed their assignment Feb 3, 2022
@carmocca carmocca modified the milestones: 1.6, future Feb 3, 2022
@carmocca carmocca added callback: model checkpoint and removed checkpointing Related to checkpointing labels Feb 3, 2022
@otaj otaj self-assigned this Apr 19, 2022
@carmocca carmocca modified the milestones: future, 1.7 Apr 27, 2022
@carmocca carmocca removed the help wanted Open to be worked on label Apr 27, 2022