[RFC] Deprecate should_rank_save_checkpoint #9074

Closed · ananthsub opened this issue Aug 24, 2021 · 2 comments · Fixed by #9433
Labels: checkpointing, feature, help wanted, refactor
@ananthsub (Contributor)

Proposed refactoring or deprecation

Now that checkpoint saving is consolidated in the training type plugin, we no longer need this property: which rank writes the checkpoint becomes an internal implementation detail of the training type.

Users (via the checkpoint callback) only need to call trainer.save_checkpoint and can assume the plugin handles the rank checks for them.
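On the caller side the change would look roughly like this (a sketch, not the exact callback code; `trainer` and `filepath` are assumed in scope):

```python
# Before: the checkpoint callback guards on a public Trainer property.
if trainer.should_rank_save_checkpoint:
    trainer.save_checkpoint(filepath)

# After: every rank calls save_checkpoint unconditionally; the training
# type plugin decides internally which rank(s) actually write to disk.
trainer.save_checkpoint(filepath)
```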

Motivation

  • Ensure consolidation of saving logic in one place (e.g. all fsspec code for checkpoint paths should sit in one place vs. being scattered around the codebase)
  • API simplification: fewer properties exposed on the Trainer

Pitch

Deprecate this property: https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/trainer/properties.py#L117-L120

Deprecate the corresponding training type plugin property: https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/plugins/training_type/training_type_plugin.py#L313-L316

Move the directory creation from here: https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/callbacks/model_checkpoint.py#L511-L514

Into here: https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/plugins/io/torch_plugin.py#L30-L41
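Roughly, TorchCheckpointIO.save_checkpoint would absorb the makedirs call. A sketch, assuming the import paths below match that commit (treat the exact signatures as approximate):

```python
import os
from typing import Any, Dict

from pytorch_lightning.plugins.io import CheckpointIO
from pytorch_lightning.utilities.cloud_io import atomic_save, get_filesystem
from pytorch_lightning.utilities.cloud_io import load as pl_load


class TorchCheckpointIO(CheckpointIO):
    def save_checkpoint(self, checkpoint: Dict[str, Any], path: str) -> None:
        fs = get_filesystem(path)
        # Directory creation moves here from ModelCheckpoint.__init_ckpt_dir,
        # keeping all fsspec path handling next to the actual write.
        fs.makedirs(os.path.dirname(path), exist_ok=True)
        atomic_save(checkpoint, path)

    def load_checkpoint(self, path: str, map_location: Any = None) -> Dict[str, Any]:
        # Unchanged; shown only so the sketch is a complete CheckpointIO.
        return pl_load(path, map_location=map_location)
```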

@SeanNaren - this is also where we should add a rm_checkpoint to the CheckpointIO plugin, so that we can deprecate this from ModelCheckpoint: https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/callbacks/model_checkpoint.py#L502-L505
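i.e. ModelCheckpoint._del_model would shrink to a call into something like this (a sketch; `rm_checkpoint` is the name proposed here, not an existing API):

```python
from pytorch_lightning.utilities.cloud_io import get_filesystem


class TorchCheckpointIO(CheckpointIO):
    ...  # save_checkpoint / load_checkpoint as in the sketch above

    def rm_checkpoint(self, path: str) -> None:
        # Pure filesystem removal -- no rank or trainer awareness at this layer.
        fs = get_filesystem(path)
        if fs.exists(path):
            fs.rm(path)
```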

One thing I'm not sure about: because should_rank_save_checkpoint is exposed from the accelerator to the trainer as part of the public Trainer API, does this need to go through a full deprecation cycle? Or is a breaking change permissible as part of the plugins API?
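If a full cycle is needed, the shim would presumably follow the usual pattern (a sketch; assumes the existing `rank_zero_deprecation` utility):

```python
from pytorch_lightning.utilities import rank_zero_deprecation


# In pytorch_lightning/trainer/properties.py (sketch of the deprecated shim):
@property
def should_rank_save_checkpoint(self) -> bool:
    rank_zero_deprecation(
        "`Trainer.should_rank_save_checkpoint` is deprecated and will be removed;"
        " rank handling during checkpointing is internal to the training type plugin."
    )
    return self.training_type_plugin.should_rank_save_checkpoint
```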


ananthsub added the feature, help wanted, refactor, and checkpointing labels Aug 24, 2021
kaushikb11 self-assigned this Sep 3, 2021
@jjenniferdai (Contributor) commented Sep 8, 2021

Outlined this some more with @ananthsub; draft proposal:

  1. Utilities (sketched below, after this list):
     • Add a rm_checkpoint fn to CheckpointIO (only the fs.rm utility).
     • Add a rm_checkpoint fn to TrainingTypePlugin.
       -- This checks should_rank_save_checkpoint: if true, it calls CheckpointIO's rm_checkpoint.
  2. model_checkpoint callback: move it onto the new utilities.
     Motivation 1: Let training type plugins handle all should_rank_save_checkpoint logic (fewer bugs, cleaner code).
     Motivation 2: Consolidate/migrate fsspec code into CheckpointIO.
  3. Deprecate the Trainer property should_rank_save_checkpoint:
     https://github.com/PyTorchLightning/pytorch-lightning/blob/538e743f17c7da4624c902f762922e2837661818/pytorch_lightning/trainer/properties.py#L117-L120
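A sketch of the step-1 pair, assuming the plugin already holds a `checkpoint_io` instance; the `rm_checkpoint` on CheckpointIO is the fs-only utility sketched in the issue body above:

```python
class TrainingTypePlugin:  # sketch: only the new/relevant members shown
    checkpoint_io: "CheckpointIO"

    @property
    def should_rank_save_checkpoint(self) -> bool:
        # Plugin-internal rank check, e.g. rank-zero only under DDP;
        # trivially True on a single device.
        return True

    def rm_checkpoint(self, path: str) -> None:
        # Rank logic lives in the plugin; filesystem work stays in CheckpointIO.
        if self.should_rank_save_checkpoint:
            self.checkpoint_io.rm_checkpoint(path)
```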

ananthsub added this to To do in Sprint Q3-6: 6 Sep - 17 Sep via automation Sep 8, 2021
@SeanNaren (Contributor)
Thanks @ananthsub, I like this cleanup a lot!

Regarding deprecation of the property, I'm unsure. Personally I wouldn't mind a breaking change here, since we do label the plugin/accelerator API as experimental IIRC, and this variable is primarily used internally in Lightning. But cc @tchaton for any additional thoughts :)
