
[bug-fix] Trainer.test points to latest best_model_path #5161

Merged
merged 33 commits into master from bugfix/5091_resume_from_checkpoint_test on Jan 5, 2021

Conversation

tchaton
Contributor

@tchaton tchaton commented Dec 16, 2020

What does this PR do?

Fixes #5091 #5318 #5288

The test for PipeRPC was failing; this PR also resolves it.

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; bug fixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃

@tchaton tchaton self-assigned this Dec 16, 2020
@tchaton tchaton added this to the 1.1.x milestone Dec 16, 2020
@tchaton tchaton added the checkpointing Related to checkpointing label Dec 16, 2020
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# Running special tests
set -e
Member

what does this do?

Contributor Author

I noticed special_tests.sh doesn't return an error exit code when a test fails; adding set -e should address that.

}

def on_load_checkpoint(self, checkpointed_state: Dict[str, Any]):
    self.best_model_score = checkpointed_state["best_model_score"]
    self.best_model_path = checkpointed_state["best_model_path"]
    self.dirpath = checkpointed_state.get("dirpath", self.dirpath)
Contributor

what if dirpath is changed when Trainer is reinitialized with a new checkpoint callback??

Contributor

also, I don't see it being used anywhere.

Contributor Author

Great question! I had a chat with Adrian; it is more of a design choice. Say we want to fine-tune a model from a given checkpoint: it would make sense for the new checkpoint to be saved in the same folder. Happy to brainstorm on this one.
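For illustration only, here is a minimal sketch of the scenario being discussed. The paths, the callback configuration, and the behaviour comments are assumptions based on the diff above, not code from this PR:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# A fine-tuning run that resumes from an earlier training run (placeholder paths).
checkpoint_callback = ModelCheckpoint(dirpath="finetune_run/checkpoints")
trainer = Trainer(
    max_epochs=3,
    callbacks=[checkpoint_callback],
    resume_from_checkpoint="pretrain_run/checkpoints/epoch=9.ckpt",
)
# With the on_load_checkpoint change shown above, restoring the checkpoint would set
# checkpoint_callback.dirpath to the dirpath stored in the old checkpoint (if present),
# so checkpoints created while fine-tuning land in the original folder
# ("pretrain_run/checkpoints") rather than in "finetune_run/checkpoints".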

Contributor

@rohitgr7 rohitgr7 Dec 28, 2020

I think there are more things we need to restore, like save_top_k, best_k_models, etc., so I'd suggest keeping that for another PR. We also need to handle the case where the new checkpoint instance has a different dirpath when fine-tuning. There is an open issue for that.

Comment on lines 56 to 67
def resolve_resume_from_checkpoint(self):
    if not self._trainer_has_checkpoint_callbacks():
        return self.trainer.resume_from_checkpoint
    checkpoint_callbacks = self.trainer.checkpoint_callbacks[0]
    if os.path.exists(checkpoint_callbacks.best_model_path):
        resume_from_checkpoint_options = [
            checkpoint_callbacks.best_model_path,
            self.trainer.resume_from_checkpoint
        ]
        resume_from_checkpoint_options.sort()
        return resume_from_checkpoint_options[-1]
    return self.trainer.resume_from_checkpoint
Contributor

@rohitgr7 rohitgr7 Dec 23, 2020

I don't think this is correct. If I set test(ckpt_path=some_checkpoint_path), it will possibly reload best_model_path again during restore. Why does it even reload from resume_from_checkpoint while testing?

I think to better resolve this issue, this should be fixed/refactored properly:
https://github.com/PyTorchLightning/pytorch-lightning/blob/176735097ab5be9ee21d3e7a3dedc174f3e0dd3f/pytorch_lightning/accelerators/gpu_accelerator.py#L61-L69
setup_training is called for both training and testing, and it contains a few things required in both cases and others required only during training (e.g. restoring the checkpoint).

Contributor Author

Hey @rohitgr7.

Here is the problem I was trying to resolve:

  1. Load a checkpoint from resume_from_checkpoint.
  2. Fine-tune the model for several epochs, which might create a new best_model_path checkpoint.
  3. When calling trainer.test(), it should use the new best_model_path and not resume_from_checkpoint (see the sketch below).
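For illustration, here is a minimal, hypothetical sketch of that flow. TinyModel, the random data, and the checkpoint path are placeholders rather than code from this PR, and it uses the 1.1-era resume_from_checkpoint argument:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule, Trainer


class TinyModel(LightningModule):  # placeholder model, for illustration only
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def test_step(self, batch, batch_idx):
        x, y = batch
        self.log("test_loss", F.cross_entropy(self(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def loader():
    # random data, just to make the sketch self-contained
    return DataLoader(TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=8)


model = TinyModel()
trainer = Trainer(
    max_epochs=3,
    resume_from_checkpoint="pretrain_run/checkpoints/epoch=9.ckpt",  # placeholder path (step 1)
)
trainer.fit(model, loader())             # fine-tuning may produce a new best_model_path (step 2)
trainer.test(test_dataloaders=loader())  # should load the new best_model_path, not the resumed one (step 3)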

If I understand correctly, you are saying we should skip this restore from resume_from_checkpoint in .test and load directly from best_model_path?

Best,
T.C

Contributor

Yes, the best checkpoint or any other checkpoint will be loaded here: https://github.com/PyTorchLightning/pytorch-lightning/blob/d1e97a4f114a285349e31e330c7bf8937bc1ee04/pytorch_lightning/trainer/trainer.py#L770-L785, so we can just skip the .restore call. Also, a few more hooks like on_pretrain_routine_start and on_pretrain_routine_end are called while doing .test, since setup_training is called for both train and test, which is incorrect too IMO. So I'd suggest splitting/refactoring setup_training a bit.

@codecov

codecov bot commented Dec 23, 2020

Codecov Report

Merging #5161 (16ccc66) into master (062800a) will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #5161   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         134     134           
  Lines        9970    9976    +6     
======================================
+ Hits         9286    9294    +8     
+ Misses        684     682    -2     

tests/checkpointing/test_trainer_checkpoint.py: 5 review threads (outdated, resolved)
pytorch_lightning/trainer/connectors/callback_connector.py: review thread (outdated, resolved)
Member

@justusschock justusschock left a comment

One general question: if I pretrain on one PC, does this mean I cannot fine-tune on another if the filesystem isn't the same?

@pep8speaks

pep8speaks commented Jan 4, 2021

Hello @tchaton! Thanks for updating this PR.

Line 26:121: E501 line too long (132 > 120 characters)

Comment last updated at 2021-01-05 08:52:09 UTC

pytorch_lightning/trainer/training_loop.py: review thread (outdated, resolved)
tests/checkpointing/test_trainer_checkpoint.py: 2 review threads (outdated, resolved)
tests/plugins/test_ddp_sequential_plugin.py: review thread (outdated, resolved)
Member

@Borda Borda left a comment

missing changelog

pytorch_lightning/plugins/rpc_plugin.py: review thread (outdated, resolved)
tests/plugins/test_ddp_sequential_plugin.py: review thread (outdated, resolved)
@tchaton
Contributor Author

tchaton commented Jan 4, 2021

missing changelog

Done!

@tchaton tchaton enabled auto-merge (squash) January 4, 2021 20:45
@tchaton tchaton merged commit d5b3678 into master Jan 5, 2021
@tchaton tchaton deleted the bugfix/5091_resume_from_checkpoint_test branch January 5, 2021 10:02
@carmocca carmocca mentioned this pull request Jan 5, 2021
Borda pushed a commit that referenced this pull request Jan 6, 2021
* resolve bug

* update code

* add set -e

* Update pytorch_lightning/callbacks/model_checkpoint.py

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* update test

* Update tests/checkpointing/test_trainer_checkpoint.py

Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>

* Update tests/checkpointing/test_trainer_checkpoint.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* update on comments

* resolve test

* convert to set

* update

* add error triggering

* update

* update on comments

* update

* resolve import

* update

* update

* Update pytorch_lightning/plugins/rpc_plugin.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* update

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-62-109.ec2.internal>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

(cherry picked from commit d5b3678)
@ananthsub
Contributor

ananthsub commented Jan 12, 2021

@tchaton by not restoring when testing, this use case breaks:

model = BoringModel()
callback_that_implements_on_load_checkpoint = MyCallback()
trainer = Trainer(
    default_root_dir=root_dir,
    max_steps=1,
    callbacks=[callback_that_implements_on_load_checkpoint],
    resume_from_checkpoint=some_dummy_path,
)
trainer.test(model)

This is because we skip calling on_load_checkpoint() for all callbacks here: https://github.com/PyTorchLightning/pytorch-lightning/blob/1f6236accce78303249c55de656b71501e607d1a/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L149-L150

A concrete use case to motivate this: we developed an Exponential Moving Average (EMA) callback. It loads the EMA state in its on_load_checkpoint hook, so we can then use the model with EMA weights for testing.
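For readers unfamiliar with the pattern, here is a minimal, hypothetical sketch of such an EMA callback. It is not the actual callback referenced above; the hook signatures follow the 1.1-era Callback API used elsewhere in this PR:

import copy
from typing import Any, Dict

import torch
from pytorch_lightning.callbacks import Callback


class EMACallback(Callback):  # hypothetical sketch, not the callback described above
    """Keep an exponential moving average (EMA) of the model parameters."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.ema_state: Dict[str, torch.Tensor] = {}

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        # update the moving average after every training batch
        with torch.no_grad():
            for name, param in pl_module.named_parameters():
                if name not in self.ema_state:
                    self.ema_state[name] = param.detach().clone()
                else:
                    self.ema_state[name].mul_(self.decay).add_(param.detach(), alpha=1.0 - self.decay)

    def on_save_checkpoint(self, trainer, pl_module) -> Dict[str, Any]:
        # persist the EMA weights alongside the regular checkpoint state
        return {"ema_state": copy.deepcopy(self.ema_state)}

    def on_load_checkpoint(self, checkpointed_state: Dict[str, Any]):
        # restore the EMA weights; this is the hook that never runs
        # if the callback-state restore is skipped during .test()
        self.ema_state = checkpointed_state["ema_state"]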

I think we should be able to restore other states when resuming for testing. WDYT?

@rohitgr7
Contributor

Seems reasonable to load the trainer & callback states at least 👍. Or should we have trainer.test(..., restore_states=True/False)?

Labels
checkpointing Related to checkpointing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trainer.test() in combination with resume_from_checkpoint is broken
10 participants