
split restore_training_state into logical parts [1 / 2] #7901

Merged
merged 6 commits into master from feature/resume-6-1 on Jun 10, 2021

Conversation

Member

@awaelchli awaelchli commented Jun 9, 2021

What does this PR do?

In #7900 CheckpointConnector.restore_training_state gets split into multiple pieces:

restore_callbacks()
restore_progress()
restore_optimizers()
restore_lr_schedulers()

This PR only introduces the new functions; they are not used anywhere yet.
#7900 will then refactor the CheckpointConnector.restore_training_state method to call them.

This is mainly to reduce a hard-to-read diff for reviewers!
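For orientation, here is a minimal sketch of the intended end state, i.e. how restore_training_state could delegate to the new functions once #7900 lands. This is an illustrative mock-up, not the actual Lightning implementation; the class name and the method bodies are placeholders.

    # Hypothetical sketch only -- illustrates the intended split, not the actual Lightning code.
    from typing import Any, Dict, Optional


    class CheckpointConnectorSketch:

        def __init__(self, loaded_checkpoint: Optional[Dict[str, Any]] = None) -> None:
            self._loaded_checkpoint = loaded_checkpoint or {}

        def restore_training_state(self) -> None:
            # Delegation order is illustrative; #7900 defines (and should document) the real order.
            self.restore_callbacks()
            self.restore_progress()
            self.restore_optimizers()
            self.restore_lr_schedulers()

        def restore_callbacks(self) -> None:
            """Restores all callbacks from the pre-loaded checkpoint."""
            if not self._loaded_checkpoint:
                return

        def restore_progress(self) -> None:
            """Restores loop progress (epoch, global step) from the pre-loaded checkpoint."""
            if not self._loaded_checkpoint:
                return

        def restore_optimizers(self) -> None:
            """Restores optimizer states from the pre-loaded checkpoint."""
            if not self._loaded_checkpoint:
                return

        def restore_lr_schedulers(self) -> None:
            """Restores learning-rate scheduler states from the pre-loaded checkpoint."""
            if not self._loaded_checkpoint:
                return

Each restore_* function guards on self._loaded_checkpoint so it becomes a no-op when no checkpoint was loaded, mirroring the guard visible in the review snippets below.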

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli awaelchli changed the title from "split CheckpointConnector.restore_training_state into logical parts [1 / 2]" to "split CheckpointConnector.restore_training_state into logical parts [1 / 2]" Jun 9, 2021
@awaelchli awaelchli changed the title from "split CheckpointConnector.restore_training_state into logical parts [1 / 2]" to "split restore_training_state into logical parts [1 / 2]" Jun 9, 2021

codecov bot commented Jun 9, 2021

Codecov Report

Merging #7901 (ee0867a) into master (ec4f885) will increase coverage by 4%.
The diff coverage is 12%.

@@           Coverage Diff           @@
##           master   #7901    +/-   ##
=======================================
+ Coverage      88%     92%    +4%     
=======================================
  Files         204     200     -4     
  Lines       13667   12837   -830     
=======================================
- Hits        12047   11819   -228     
+ Misses       1620    1018   -602     

@awaelchli awaelchli added this to the v1.4 milestone Jun 9, 2021
@awaelchli awaelchli added the checkpointing (Related to checkpointing) and feature (Is an improvement or enhancement) labels Jun 9, 2021
@awaelchli awaelchli marked this pull request as ready for review June 9, 2021 14:43
@awaelchli awaelchli requested a review from ananthsub June 9, 2021 20:43
Comment on lines +211 to +213
""" Restores all callbacks from the pre-loaded checkpoint. """
if not self._loaded_checkpoint:
return
Contributor

@ananthsub ananthsub Jun 9, 2021


A few questions:

  • Is there a risk of these restoration functions being called outside of this context? Should the start and end of restoring from the checkpoint be wrapped in a dedicated context manager?
  • In splitting these out, should we be prescriptive about the order in which they are loaded?

Member Author

@awaelchli awaelchli Jun 9, 2021


Hey, good question.
If you look at the "end result" in #7652 (open to discussion), you will see that in the Trainer file resume_start() and resume_end() are actually called in very different places, so I can't turn it into a context manager.

Yes, I think it's best to document the order; the order may be important. In the future we will want to make it configurable what gets restored, so some of these functions will be called on demand and some won't be called at all.

Member Author


Actually, maybe a context manager could still work. I will investigate it in #7652.
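For illustration, a minimal sketch of what such a context manager could look like. Apart from resume_start() and resume_end(), which the discussion above mentions, the class and method names here are hypothetical and not taken from this PR or #7652:

    # Hypothetical sketch only -- wrapping resume_start()/resume_end() in a context manager.
    from contextlib import contextmanager
    from typing import Iterator


    class ResumeSketch:

        def resume_start(self) -> None:
            print("load the checkpoint into memory")

        def resume_end(self) -> None:
            print("release the checkpoint from memory")

        @contextmanager
        def restoring(self) -> Iterator[None]:
            # Guarantees resume_end() runs even if a restore call inside the block raises.
            self.resume_start()
            try:
                yield
            finally:
                self.resume_end()


    connector = ResumeSketch()
    with connector.restoring():
        pass  # call restore_callbacks(), restore_optimizers(), ... here

Whether resume_start() and resume_end() can actually be brought back-to-back around the restore calls in the Trainer is exactly the open question deferred to #7652.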

Contributor


I think Ananth's suggestion is good.

Also, could any other class want to call the start and end methods?

Member Author


No, I think we would only want to call them for unit testing, or from the context manager if that works out.
So I see what you are saying; yes, I will put the underscores everywhere.

Member Author


Okay, a context manager could sort of work, but I see an issue. Can we move the conversation to #7652 so I can point directly to the code in trainer.py?

Comment on lines 215 to 221
if any([key in self._loaded_checkpoint for key in DEPRECATED_CHECKPOINT_KEYS]):
    raise ValueError(
        "The checkpoint you're attempting to load follows an"
        " outdated schema. You can upgrade to the current schema by running"
        " `python -m pytorch_lightning.utilities.upgrade_checkpoint --file model.ckpt`"
        " where `model.ckpt` is your checkpoint file."
    )
Contributor


Should this validation be done in resume_start?

Member Author

@awaelchli awaelchli Jun 9, 2021


Good question, maybe we could.
One thought, though: in the future we will have a way to configure what gets loaded, so if these functions are called individually, we may want to keep the validation together with the particular objects being restored.

Comment on lines +261 to +265
if "optimizer_states" not in self._loaded_checkpoint or "lr_schedulers" not in self._loaded_checkpoint:
raise KeyError(
"Trying to restore training state but checkpoint contains only the model."
" This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`."
)
Contributor


Does this chain of PRs plan to tackle the issue of restoring only part of the checkpoint?

#5339

Member Author

@awaelchli awaelchli Jun 9, 2021


Not the aim directly, but it will definitely help, and we can continue with it right after these PRs. There will be nothing standing in the way as far as I can tell :)
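As background for the snippet discussed above: ModelCheckpoint(save_weights_only=True) produces a checkpoint containing only the model weights, so the "optimizer_states" and "lr_schedulers" keys the check looks for are simply absent. A hypothetical, simplified illustration (the dictionary contents are made up):

    # Hypothetical illustration -- a weights-only checkpoint lacks the keys the check above requires.
    weights_only_checkpoint = {"state_dict": {"layer.weight": [0.0]}}  # made-up contents

    missing = [
        key for key in ("optimizer_states", "lr_schedulers")
        if key not in weights_only_checkpoint
    ]
    if missing:
        # The reviewed code raises a KeyError in this situation.
        print(f"Cannot restore full training state; checkpoint is missing: {missing}")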

Contributor

@tchaton tchaton left a comment


Love it! Really nice cleanup.

@tchaton tchaton added the ready (PRs ready to be merged) label Jun 10, 2021
@mergify mergify bot requested a review from a team June 10, 2021 13:34
Member

@ethanwharris ethanwharris left a comment


LGTM, small comment

@awaelchli awaelchli enabled auto-merge (squash) June 10, 2021 14:53
@awaelchli awaelchli merged commit d209b68 into master Jun 10, 2021
@awaelchli awaelchli deleted the feature/resume-6-1 branch June 10, 2021 15:36
Labels
checkpointing (Related to checkpointing)
feature (Is an improvement or enhancement)
ready (PRs ready to be merged)
refactor

6 participants