
split restore_training_state into logical parts [1 / 2] #7901

Merged
merged 6 commits into master from feature/resume-6-1 on Jun 10, 2021

Conversation

Member

@awaelchli awaelchli commented Jun 9, 2021

What does this PR do?

In #7900 CheckpointConnector.restore_training_state gets split into multiple pieces:

restore_callbacks()
restore_progress()
restore_optimizers()
restore_lr_schedulers()

This PR only introduces the new functions; they are not used anywhere yet.
#7900 will then refactor the CheckpointConnector.restore_training_state method to call them.

This is mainly to reduce a hard-to-read diff for reviewers!
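For orientation, here is a minimal sketch of the intended end state, i.e. how restore_training_state could delegate to the new functions once #7900 lands. This is an illustrative mock-up, not the actual Lightning implementation; the class name and the method bodies are placeholders.

    # Hypothetical sketch only -- illustrates the intended split, not the actual Lightning code.
    from typing import Any, Dict, Optional


    class CheckpointConnectorSketch:

        def __init__(self, loaded_checkpoint: Optional[Dict[str, Any]] = None) -> None:
            self._loaded_checkpoint = loaded_checkpoint or {}

        def restore_training_state(self) -> None:
            # Delegation order is illustrative; #7900 defines (and should document) the real order.
            self.restore_callbacks()
            self.restore_progress()
            self.restore_optimizers()
            self.restore_lr_schedulers()

        def restore_callbacks(self) -> None:
            """Restores all callbacks from the pre-loaded checkpoint."""
            if not self._loaded_checkpoint:
                return

        def restore_progress(self) -> None:
            """Restores loop progress (epoch, global step) from the pre-loaded checkpoint."""
            if not self._loaded_checkpoint:
                return

        def restore_optimizers(self) -> None:
            """Restores optimizer states from the pre-loaded checkpoint."""
            if not self._loaded_checkpoint:
                return

        def restore_lr_schedulers(self) -> None:
            """Restores learning-rate scheduler states from the pre-loaded checkpoint."""
            if not self._loaded_checkpoint:
                return

Each restore_* function guards on self._loaded_checkpoint so it becomes a no-op when no checkpoint was loaded, mirroring the guard visible in the review snippets below.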

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli awaelchli changed the title from "split CheckpointConnector.restore_training_state into logical parts [1 / 2]" to "split CheckpointConnector.restore_training_state into logical parts [1 / 2]" Jun 9, 2021
@awaelchli awaelchli changed the title from "split CheckpointConnector.restore_training_state into logical parts [1 / 2]" to "split restore_training_state into logical parts [1 / 2]" Jun 9, 2021

codecov bot commented Jun 9, 2021

Codecov Report

Merging #7901 (ee0867a) into master (ec4f885) will increase coverage by 4%.
The diff coverage is 12%.

@@           Coverage Diff           @@
##           master   #7901    +/-   ##
=======================================
+ Coverage      88%     92%    +4%     
=======================================
  Files         204     200     -4     
  Lines       13667   12837   -830     
=======================================
- Hits        12047   11819   -228     
+ Misses       1620    1018   -602     

@awaelchli awaelchli added this to the v1.4 milestone Jun 9, 2021
@awaelchli awaelchli added the checkpointing (Related to checkpointing) and feature (Is an improvement or enhancement) labels Jun 9, 2021
@awaelchli awaelchli marked this pull request as ready for review June 9, 2021 14:43
@awaelchli awaelchli requested a review from ananthsub June 9, 2021 20:43
Comment on lines +211 to +213
""" Restores all callbacks from the pre-loaded checkpoint. """
if not self._loaded_checkpoint:
return
Contributor

@ananthsub ananthsub Jun 9, 2021


A few questions:

  • Is there a risk of these restoration functions being called outside of this context? Should the start and end of restoring from the checkpoint be wrapped in a dedicated context manager?
  • In splitting these out, should we be prescriptive about the order in which they are loaded?

Member Author

@awaelchli awaelchli Jun 9, 2021


Hey, good question.
If you look at the "end result" in #7652 (open to discussion), you will see that in the Trainer file resume_start() and resume_end() are actually called in very different places, so I can't turn it into a context manager.

Yes, I think it's best to document the order; the order may be important. In the future we will want to make it configurable what gets restored, so some of these functions will be called on demand and some won't be called at all.

Member Author


Actually, maybe a context manager could still work. I will investigate it in #7652.
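For illustration, a minimal sketch of what such a context manager could look like. Apart from resume_start() and resume_end(), which the discussion above mentions, the class and method names here are hypothetical and not taken from this PR or #7652:

    # Hypothetical sketch only -- wrapping resume_start()/resume_end() in a context manager.
    from contextlib import contextmanager
    from typing import Iterator


    class ResumeSketch:

        def resume_start(self) -> None:
            print("load the checkpoint into memory")

        def resume_end(self) -> None:
            print("release the checkpoint from memory")

        @contextmanager
        def restoring(self) -> Iterator[None]:
            # Guarantees resume_end() runs even if a restore call inside the block raises.
            self.resume_start()
            try:
                yield
            finally:
                self.resume_end()


    connector = ResumeSketch()
    with connector.restoring():
        pass  # call restore_callbacks(), restore_optimizers(), ... here

Whether resume_start() and resume_end() can actually be brought back-to-back around the restore calls in the Trainer is exactly the open question deferred to #7652.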

Contributor


I think Ananth's suggestion is good.

Also, could any other class want to call the start and end methods?

Member Author


No, I think we would only want to call them for unit testing, or from the context manager if that works out.
So I see what you are saying; yes, I will put the underscores everywhere.

Member Author


Okay, a context manager could sort of work, but I see an issue. Can we move the conversation to #7652 so I can point directly to the code in trainer.py?

Comment on lines 215 to 221
if any([key in self._loaded_checkpoint for key in DEPRECATED_CHECKPOINT_KEYS]):
    raise ValueError(
        "The checkpoint you're attempting to load follows an"
        " outdated schema. You can upgrade to the current schema by running"
        " `python -m pytorch_lightning.utilities.upgrade_checkpoint --file model.ckpt`"
        " where `model.ckpt` is your checkpoint file."
    )
Contributor


Should this validation be done in resume_start?

Member Author

@awaelchli awaelchli Jun 9, 2021


Good question, maybe we could.
One thought, though: in the future we will have a way to configure what gets loaded, so if these functions are called individually, we may want to keep the validation together with the particular objects being restored.

Comment on lines +261 to +265
if "optimizer_states" not in self._loaded_checkpoint or "lr_schedulers" not in self._loaded_checkpoint:
raise KeyError(
"Trying to restore training state but checkpoint contains only the model."
" This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`."
)
Contributor


Does this chain of PRs plan to tackle the issue of restoring only part of the checkpoint?

#5339

Member Author

@awaelchli awaelchli Jun 9, 2021


Not the aim directly, but it will definitely help, and we can continue with it right after these PRs. There will be nothing standing in the way as far as I can tell :)
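As background for the snippet discussed above: ModelCheckpoint(save_weights_only=True) produces a checkpoint containing only the model weights, so the "optimizer_states" and "lr_schedulers" keys the check looks for are simply absent. A hypothetical, simplified illustration (the dictionary contents are made up):

    # Hypothetical illustration -- a weights-only checkpoint lacks the keys the check above requires.
    weights_only_checkpoint = {"state_dict": {"layer.weight": [0.0]}}  # made-up contents

    missing = [
        key for key in ("optimizer_states", "lr_schedulers")
        if key not in weights_only_checkpoint
    ]
    if missing:
        # The reviewed code raises a KeyError in this situation.
        print(f"Cannot restore full training state; checkpoint is missing: {missing}")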

Contributor

@tchaton tchaton left a comment


Love it! Really nice cleanup.

@tchaton tchaton added the ready (PRs ready to be merged) label Jun 10, 2021
@mergify mergify bot requested a review from a team June 10, 2021 13:34
Member

@ethanwharris ethanwharris left a comment


LGTM, small comment

@awaelchli awaelchli enabled auto-merge (squash) June 10, 2021 14:53
@awaelchli awaelchli merged commit d209b68 into master Jun 10, 2021
@awaelchli awaelchli deleted the feature/resume-6-1 branch June 10, 2021 15:36
Labels
checkpointing (Related to checkpointing)
feature (Is an improvement or enhancement)
ready (PRs ready to be merged)
refactor

6 participants