[RFC] Deprecate should_rank_save_checkpoint #9074

Closed · ananthsub opened this issue Aug 24, 2021 · 2 comments · Fixed by #9433
Labels: checkpointing, feature, help wanted, refactor
@ananthsub (Contributor)

Proposed refactoring or deprecation

Now that checkpoint saving is consolidated in the training type plugin, we no longer need this property: which rank writes the checkpoint becomes an internal implementation detail of the training type.

Users (via the checkpoint callback) only need to call trainer.save_checkpoint and can assume the plugin handles the rank checks for them.
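On the caller side the change would look roughly like this (a sketch, not the exact callback code; `trainer` and `filepath` are assumed in scope):

```python
# Before: the checkpoint callback guards on a public Trainer property.
if trainer.should_rank_save_checkpoint:
    trainer.save_checkpoint(filepath)

# After: every rank calls save_checkpoint unconditionally; the training
# type plugin decides internally which rank(s) actually write to disk.
trainer.save_checkpoint(filepath)
```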

Motivation

  • Ensure consolidation of saving logic in one place (e.g. all fsspec code for checkpoint paths should sit in one place vs. being scattered around the codebase)
  • API simplification: fewer properties exposed on the Trainer

Pitch

Deprecate this property: https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/trainer/properties.py#L117-L120

Deprecate the corresponding training type plugin property: https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/plugins/training_type/training_type_plugin.py#L313-L316

Move the directory creation from here: https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/callbacks/model_checkpoint.py#L511-L514

Into here: https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/plugins/io/torch_plugin.py#L30-L41
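Roughly, TorchCheckpointIO.save_checkpoint would absorb the makedirs call. A sketch, assuming the import paths below match that commit (treat the exact signatures as approximate):

```python
import os
from typing import Any, Dict

from pytorch_lightning.plugins.io import CheckpointIO
from pytorch_lightning.utilities.cloud_io import atomic_save, get_filesystem
from pytorch_lightning.utilities.cloud_io import load as pl_load


class TorchCheckpointIO(CheckpointIO):
    def save_checkpoint(self, checkpoint: Dict[str, Any], path: str) -> None:
        fs = get_filesystem(path)
        # Directory creation moves here from ModelCheckpoint.__init_ckpt_dir,
        # keeping all fsspec path handling next to the actual write.
        fs.makedirs(os.path.dirname(path), exist_ok=True)
        atomic_save(checkpoint, path)

    def load_checkpoint(self, path: str, map_location: Any = None) -> Dict[str, Any]:
        # Unchanged; shown only so the sketch is a complete CheckpointIO.
        return pl_load(path, map_location=map_location)
```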

@SeanNaren - this is also where we should add a rm_checkpoint to the CheckpointIO plugin, so that we can deprecate this from ModelCheckpoint: https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/callbacks/model_checkpoint.py#L502-L505
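i.e. ModelCheckpoint._del_model would shrink to a call into something like this (a sketch; `rm_checkpoint` is the name proposed here, not an existing API):

```python
from pytorch_lightning.utilities.cloud_io import get_filesystem


class TorchCheckpointIO(CheckpointIO):
    ...  # save_checkpoint / load_checkpoint as in the sketch above

    def rm_checkpoint(self, path: str) -> None:
        # Pure filesystem removal -- no rank or trainer awareness at this layer.
        fs = get_filesystem(path)
        if fs.exists(path):
            fs.rm(path)
```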

One thing I'm not sure about: because should_rank_save_checkpoint is exposed from the accelerator to the trainer as part of the public Trainer API, does this need to go through a full deprecation cycle? Or is a breaking change permissible as part of the plugins API?
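If a full cycle is needed, the shim would presumably follow the usual pattern (a sketch; assumes the existing `rank_zero_deprecation` utility):

```python
from pytorch_lightning.utilities import rank_zero_deprecation


# In pytorch_lightning/trainer/properties.py (sketch of the deprecated shim):
@property
def should_rank_save_checkpoint(self) -> bool:
    rank_zero_deprecation(
        "`Trainer.should_rank_save_checkpoint` is deprecated and will be removed;"
        " rank handling during checkpointing is internal to the training type plugin."
    )
    return self.training_type_plugin.should_rank_save_checkpoint
```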


ananthsub added the feature, help wanted, refactor, and checkpointing labels Aug 24, 2021
kaushikb11 self-assigned this Sep 3, 2021
@jjenniferdai (Contributor) commented Sep 8, 2021

Outlined this some more with @ananthsub; draft proposal:

  1. Utilities (sketched below, after this list):
     • Add a rm_checkpoint fn to CheckpointIO (only the fs.rm utility).
     • Add a rm_checkpoint fn to TrainingTypePlugin.
       -- This checks should_rank_save_checkpoint: if true, it calls CheckpointIO's rm_checkpoint.
  2. model_checkpoint callback: move it onto the new utilities.
     Motivation 1: Let training type plugins handle all should_rank_save_checkpoint logic (fewer bugs, cleaner code).
     Motivation 2: Consolidate/migrate fsspec code into CheckpointIO.
  3. Deprecate the Trainer property should_rank_save_checkpoint:
     https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/trainer/properties.py#L117-L120
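A sketch of the step-1 pair, assuming the plugin already holds a `checkpoint_io` instance; the `rm_checkpoint` on CheckpointIO is the fs-only utility sketched in the issue body above:

```python
class TrainingTypePlugin:  # sketch: only the new/relevant members shown
    checkpoint_io: "CheckpointIO"

    @property
    def should_rank_save_checkpoint(self) -> bool:
        # Plugin-internal rank check, e.g. rank-zero only under DDP;
        # trivially True on a single device.
        return True

    def rm_checkpoint(self, path: str) -> None:
        # Rank logic lives in the plugin; filesystem work stays in CheckpointIO.
        if self.should_rank_save_checkpoint:
            self.checkpoint_io.rm_checkpoint(path)
```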

ananthsub added this to To do in Sprint Q3-6: 6 Sep - 17 Sep via automation Sep 8, 2021
@SeanNaren (Contributor)
Thanks @ananthsub, I like this cleanup a lot!

Regarding deprecation of the property, I'm unsure. Personally I wouldn't mind a breaking change here, since we do label the plugin/accelerator API as experimental IIRC, and this variable is primarily used internally in Lightning. But cc @tchaton for any additional thoughts :)
