Fix checkpoint callback & Trainer.test(_) issue for TPUs #6654

kaushikb11 · 2021-03-23T14:03:19Z

What does this PR do?

Fixes the checkpoint callback issue on TPUs when it is set to `False`, and the `trainer.test(..)` issue.
Fixes #6230

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

ananthsub · 2021-03-24T08:20:46Z

pytorch_lightning/plugins/training_type/tpu_spawn.py

+        checkpoint_callback = self.lightning_module.trainer.checkpoint_callback
+        best_model_path = checkpoint_callback.best_model_path if checkpoint_callback else None
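
The guard in the two added lines can be looked at in isolation. Below is a minimal sketch, assuming stand-in classes (`FakeCheckpoint` and `FakeTrainer` are hypothetical and not part of pytorch-lightning) in place of the real trainer; it only illustrates why falling back to `None` avoids an attribute error when no checkpoint callback is configured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeCheckpoint:
    """Hypothetical stand-in for a checkpoint callback."""
    best_model_path: str = ""


@dataclass
class FakeTrainer:
    """Hypothetical stand-in exposing only `checkpoint_callback`."""
    checkpoint_callback: Optional[FakeCheckpoint] = None


def resolve_best_model_path(trainer: FakeTrainer) -> Optional[str]:
    # Same shape as the guard in the diff: only read the path if a callback exists.
    checkpoint_callback = trainer.checkpoint_callback
    return checkpoint_callback.best_model_path if checkpoint_callback else None


# No callback configured: the lookup yields None instead of raising.
print(resolve_best_model_path(FakeTrainer()))
# Callback present: its best path is returned unchanged.
print(resolve_best_model_path(FakeTrainer(FakeCheckpoint("epoch=3.ckpt"))))
```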


what happens if there are multiple checkpoint callbacks attached? should we save once per path?

@awaelchli @carmocca this is gonna be amplified if people are tracking multiple versions of "best model paths" at the same time in an example like this

    trainer = Trainer(...., callbacks=[checkpoint1, checkpoint2])
    trainer.fit(module)
    trainer.test()  # <--- which checkpoint path is used for running this?

should this raise an error due to ambiguity?

I think I'd rather use the first best path and log the path used when running test
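
The suggestion above (use the first best path and log which one was used) could look roughly like the sketch below. It assumes a plain sequence of callback-like objects with a `best_model_path` attribute, and it interprets "first" as the first callback with a non-empty path; `CheckpointStub` and `first_best_model_path` are hypothetical names for illustration, not pytorch-lightning API.

```python
import logging
from dataclasses import dataclass
from typing import Optional, Sequence

log = logging.getLogger(__name__)


@dataclass
class CheckpointStub:
    """Hypothetical stand-in for a checkpoint callback."""
    best_model_path: str = ""


def first_best_model_path(callbacks: Sequence[CheckpointStub]) -> Optional[str]:
    """Return the first non-empty best path and log which one was chosen."""
    for callback in callbacks:
        if callback.best_model_path:
            # Logging the choice makes the otherwise ambiguous selection visible.
            log.info("Testing with checkpoint: %s", callback.best_model_path)
            return callback.best_model_path
    # No callback has saved a checkpoint yet.
    return None
```

Raising an error on ambiguity, as asked in the question above, would be the stricter alternative; this sketch only covers the "pick the first and log it" option.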

tchaton

LGTM !

SeanNaren

Nice! on a side note how far are we from enabling TPU tests again?

kaushikb11 · 2021-03-24T19:29:23Z

@SeanNaren Thanks for bringing it up! We could keep an eye on it for the next couple of days; I have seen instances where the TPU Pods got killed mid-testing. AFK, will create an issue and track it.

Borda

have you tested it locally, as TPU is out for now...?

kaushikb11 · 2021-03-25T08:40:35Z

@Borda

All tests are passing. Had to do a tweak. Will investigate it soon.

codecov · 2021-03-25T08:48:18Z

Codecov Report

Merging #6654 (4c69b62) into master (d471fa3) will decrease coverage by 5%.
The diff coverage is 29%.

@@           Coverage Diff           @@
##           master   #6654    +/-   ##
=======================================
- Coverage      91%     86%    -5%     
=======================================
  Files         192     192            
  Lines       12227   12373   +146     
=======================================
- Hits        11111   10654   -457     
- Misses       1116    1719   +603
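
The percentages in the report can be re-derived from the hit and line counts in the table. This is only an arithmetic check, assuming coverage is hits divided by lines, rounded to the nearest whole percent; Codecov's exact rounding rule is not stated in the report, and `coverage_percent` is a hypothetical helper.

```python
def coverage_percent(hits: int, lines: int) -> int:
    """Return coverage as hits / lines, rounded to a whole percent (assumed rounding)."""
    return round(100 * hits / lines)


master = coverage_percent(11111, 12227)  # master column of the table
pr = coverage_percent(10654, 12373)      # #6654 column of the table

print(master, pr, pr - master)  # 91 86 -5
```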

* Fix checkpoint callback issue for TPUs
* update changelog
* add barrier
* apply code suggestions
* update trainer test
* remove spaces
* fix tpu tests
* Apply suggestions from code review
* add comment

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:

### New commit log messages

## [UnReleased] - 2021-MM-DD

### Added

- Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667))
- Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492))
- Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417))
- Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)).
- Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470))
- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))
- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072))
- Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945))
- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945))
- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948))
- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915))
- Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673))
- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277))
- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274))
- Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370))
- Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633))
- Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139))
- Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543))
- Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621))
- Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349))
- Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618))
- Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120))
- Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679))
- Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595))
- Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736))
- Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677))
- Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259))
- Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945))
- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945))
- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386))
- Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621))
- Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349))

### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))
- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945))
- Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621))
- Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349))
- Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505), [#6530](Lightning-AI/pytorch-lightning#6530), [#6540](Lightning-AI/pytorch-lightning#6540), [#6547](Lightning-AI/pytorch-lightning#6547), [#6515](Lightning-AI/pytorch-lightning#6515), [#6572](Lightning-AI/pytorch-lightning#6572), [#6573](Lightning-AI/pytorch-lightning#6573), [#6584](Lightning-AI/pytorch-lightning#6584), [#6636](Lightning-AI/pytorch-lightning#6636), [#6637](Lightning-AI/pytorch-lightning#6637), [#6649](Lightning-AI/pytorch-lightning#6649), [#6659](Lightning-AI/pytorch-lightning#6659), )

### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164))
- Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139))
- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166))
- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163))
- Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161))
    * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
    * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`
- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162))
- Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167))
- Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016))
- Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207))
- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734))
- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093))

### Fixed

- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802))
- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011))
- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070))
- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109))
- Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436))
- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136))
- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136))
- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))
- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))
- Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416))
- Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506))
- Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705))

## [1.2.7] - 2021-04-06

### Fixed

- Fixed resolve a bug with omegaconf and xm.save ([#6741](Lightning-AI/pytorch-lightning#6741))
- Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828))
- Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836))
- Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588))
- Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816))
- Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730))

## [1.2.6] - 2021-03-30

### Changed

- Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498))

### Removed

- Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682))

### Fixed

- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398))
- Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657))
- Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719))

## [1.2.5] - 2021-03-23

### Changed

- Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576))
- Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590))

### Fixed

- Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587))
- Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434))
- Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275))
- Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565))

Reviewed By: shuyingsunshine21

Differential Revision: D27528929

fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f

Fix checkpoint callback issue for TPUs

3f6fe20

kaushikb11 requested review from awaelchli, Borda, carmocca, justusschock, SeanNaren, tchaton and williamFalcon as code owners March 23, 2021 14:03

kaushikb11 added this to the 1.2.x milestone Mar 23, 2021

update changelog

c2dc663

mergify bot added the has conflicts label Mar 23, 2021

carmocca reviewed Mar 23, 2021

ananthsub reviewed Mar 24, 2021

tchaton approved these changes Mar 24, 2021

add barrier

6541db6

kaushikb11 changed the title ~~Fix checkpoint callback issue for TPUs~~ Fix checkpoint callback & Trainer.test(_) issue for TPUs Mar 24, 2021

Merge branch 'master' into tpu/checkpoint

9f6aa40

mergify bot removed the has conflicts label Mar 24, 2021

kaushikb11 added 2 commits March 24, 2021 23:57

apply code suggestions

d47fa9b

update trainer test

312b84e

SeanNaren approved these changes Mar 24, 2021

kaushikb11 added the bug Something isn't working label Mar 24, 2021

Borda reviewed Mar 24, 2021

kaushikb11 added 2 commits March 25, 2021 10:52

remove spaces

6a4ee36

fix tpu tests

38dc8e2

Borda reviewed Mar 25, 2021

Borda reviewed Mar 25, 2021

Apply suggestions from code review

e18dfe4

Borda approved these changes Mar 25, 2021

Borda enabled auto-merge (squash) March 25, 2021 09:55

tchaton reviewed Mar 25, 2021

kaushikb11 added 2 commits March 25, 2021 15:46

add comment

80f15c1

Merge branch 'tpu/checkpoint' of https://github.com/kaushikb11/pytorc…

4c69b62

…h-lightning into tpu/checkpoint

Borda disabled auto-merge March 25, 2021 10:27

Borda enabled auto-merge (squash) March 25, 2021 10:27

Borda merged commit 2cbdc01 into Lightning-AI:master Mar 25, 2021

carmocca mentioned this pull request Mar 29, 2021

1.2.x cherries 🍒 #6083

Closed
