
HPC Resubmit resume on most recent epoch checkpoint #13773

Open
SpontaneousDuck opened this issue Jul 20, 2022 · 11 comments

@SpontaneousDuck
Contributor

SpontaneousDuck commented Jul 20, 2022

🚀 Feature

Allow resuming from the most recent end-of-epoch checkpoint when a training job is killed and resubmitted in a SLURM environment.

Motivation

Mid-epoch checkpointing does not appear to work with my model; even with fault-tolerant training I still get some weird results. Since I am training on a smaller dataset with a larger number of epochs, it would be really useful to be able to resume from the most recent checkpoint saved by the normal end-of-epoch checkpointing.

Pitch

Instead of forcing users into the checkpoint process defined by the SLURMEnvironment plugin, allowing users to customize the pause/resume operation would be a useful feature. Maybe add this as an option to the SLURMEnvironment plugin? Since my SLURM job ID is the same after resubmission, default_root_dir resolves to the same path as in the previous job, so the newest checkpoint should be easy to find.

Alternatives

Just an option for resuming from end-of-epoch checkpoints would solve my problem. Exposing hooks for full customization of this behavior would be the most flexible version, but it would also put the most work on the user.
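
For illustration, the kind of option I have in mind might look roughly like this (the save_hpc_checkpoint_on_requeue argument below is purely hypothetical and does not exist in the SLURMEnvironment plugin today):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Purely hypothetical sketch: save_hpc_checkpoint_on_requeue is NOT a real
# argument; it only illustrates the kind of knob this request is asking for.
trainer = Trainer(
    plugins=[SLURMEnvironment(auto_requeue=True, save_hpc_checkpoint_on_requeue=False)]
)
# On resubmission, the trainer would then resume from the newest end-of-epoch
# checkpoint under default_root_dir instead of an hpc_ckpt_* file.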

cc @Borda @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj

@SpontaneousDuck SpontaneousDuck added the needs triage Waiting to be triaged by maintainers label Jul 20, 2022
@carmocca carmocca added question Further information is requested environment: slurm and removed needs triage Waiting to be triaged by maintainers labels Aug 5, 2022
@carmocca
Contributor

carmocca commented Aug 5, 2022

This is already supported by doing:

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])

trainer.fit(..., ckpt_path="last")

@SpontaneousDuck
Contributor Author

Won't this disable the auto re-queueing functionality? I am looking to use the two features together.

@carmocca
Contributor

Yes, it would. I think I missed what you need. You want to use auto_requeue and have ckpt_path="last" pick up the most recent checkpoint. Does that not work?

@SpontaneousDuck
Contributor Author

So from what I can tell, the checkpoint resume feature will select a checkpoint in the following order:

https://github.com/Lightning-AI/lightning/blob/34f98836fb452674f43f66babb2325ec17e8e192/src/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L123-L129

https://github.com/Lightning-AI/lightning/blob/34f98836fb452674f43f66babb2325ec17e8e192/src/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L78

When Lightning resumes, it appears it will always select the HPC checkpoint file if one is available. If auto_requeue=True, an HPC checkpoint will always be created, and thus that checkpoint will always be restored from. I can't see a way to override this logic currently. Am I missing something?
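
Roughly paraphrased, the linked connector logic boils down to the following priority (this is not the actual source, just its effective behaviour):

def select_resume_path(hpc_resume_path, user_ckpt_path):
    # hpc_resume_path: the highest-numbered hpc_ckpt_<n>.ckpt found in the
    # weights save path, or None if there is none.
    # user_ckpt_path: whatever was passed to trainer.fit(ckpt_path=...).
    # An HPC checkpoint, when present, always wins.
    return hpc_resume_path or user_ckpt_path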

@carmocca
Contributor

carmocca commented Aug 25, 2022

Thanks, I see what your problem is now.

It has been like this since the very beginning, and customizing it would require a bit of a refactor. My proposal would be:

  1. Add trainer.fit(ckpt_path="hpc"), which explicitly selects hpc checkpoints. This would be the alternative to the existing behavior where they are always preferred. cc @awaelchli
  2. When SLURM is used, ckpt_path=None (the default) selects ckpt_path="hpc" for backwards compatibility. You would then be able to override it manually (see the usage sketch after this list).
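
Usage under this proposal would look roughly like the following (ckpt_path="hpc" does not exist yet; this only sketches the proposed behaviour):

# Proposed, not yet implemented: explicitly opt in to hpc_ckpt_* checkpoints.
trainer.fit(model, ckpt_path="hpc")

# Under SLURM, the default ckpt_path=None would keep resolving to "hpc" for
# backwards compatibility, but could now be overridden manually, e.g. to
# resume from the newest end-of-epoch checkpoint instead:
trainer.fit(model, ckpt_path="last")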

Additionally, we would want trainer.fit(ckpt_path="last") to also track hpc checkpoints. I'm not sure whether the current behaviour does that. cc @otaj

Another improvement that came to mind: we should check whether the "last", "best", and "hpc" directories/files exist and warn users about the potential naming conflict.

@carmocca carmocca added feature Is an improvement or enhancement help wanted Open to be worked on checkpointing Related to checkpointing and removed question Further information is requested labels Aug 25, 2022
@carmocca carmocca added this to the pl:future milestone Aug 25, 2022
@SpontaneousDuck
Contributor Author

That sounds like a great solution to me! From what I can tell, though, trainer.fit(ckpt_path="last") does not automatically pick up the last checkpoint, since no training has occurred in the process at the beginning of a requeue. Setting "last" only pulls last_model_path from the ModelCheckpoint callback (shown in the code block below), which appears to be set only when the callback saves a checkpoint. I currently cannot get "last" to load the most recent checkpoint with a fresh instance of Trainer and ModelCheckpoint. Am I missing something to get the trainer to pick up my previous checkpoints?

https://github.com/Lightning-AI/lightning/blob/807435885ea265580fee9f4e69c063eace46def2/src/pytorch_lightning/trainer/trainer.py#L1432-L1437
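
For illustration, the attribute that "last" falls back to has not been populated yet in a freshly requeued process:

from pytorch_lightning.callbacks import ModelCheckpoint

# Illustration only: in a fresh process nothing has been saved yet, so the
# attribute that ckpt_path="last" reads from is still empty.
checkpoint_cb = ModelCheckpoint(dirpath="checkpoints", save_last=True)
print(repr(checkpoint_cb.last_model_path))  # '' until this process saves a checkpoint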

@SpontaneousDuck
Contributor Author

Ah, so it appears there is not really any logic in Lightning for auto-resuming from checkpoints saved with ModelCheckpoint. The user is expected to pass in the checkpoint file they want via load_from_checkpoint or ckpt_path. Checkpoint auto-selection on a clean training run only happens when the Trainer is in HPC mode; in that case, only checkpoints in the folder whose names start with hpc_ckpt_ are considered, as shown in the function below.

https://github.com/Lightning-AI/lightning/blob/d4bcafad7a64d7c39598fa7e4e33b81a1be31828/src/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L59-L68

I took a first stab at accomplishing what I need by adding a flag to the SLURMEnvironment plugin that allows disabling the automatic creation of checkpoints before restarting. The HPC resume process also chooses the most recent checkpoint in the weights save folder instead of just the highest-numbered HPC checkpoint. This lets me disable the mid-epoch checkpoints created by Lightning and have the most recent checkpoint I created loaded when the auto-requeued job initializes. How does this look?

https://github.com/SpontaneousDuck/pytorch-lightning/tree/hpc_checkpoints

@carmocca
Contributor

From what I can tell, trainer.fit(ckpt_path="last") does not automatically pick up the last checkpoint...

Yes. I believe there's an issue with the loading order. Also reported in #14466

@12michi34

@SpontaneousDuck I just found this thread. Is your workaround still available to look at? I tried the link above and only got a 404.
I would like to do the same as you described: use an end-of-epoch checkpoint instead of a mid-epoch HPC one.

@SpontaneousDuck
Contributor Author

Sorry, I do not have the code anymore, and PyTorch Lightning has changed a bunch since then. Essentially, I removed the code that saves a checkpoint when stopping and pointed the checkpoint-loading code at the path from my checkpoint callback. It should be pretty simple to recreate!
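
If patching Lightning is not an option, a rough user-side approximation is sketched below. Note that it gives up Lightning's automatic requeueing (auto_requeue=False), so the sbatch script has to handle resubmission itself, and the checkpoint directory and filename pattern are only assumptions to adjust for your setup:

import glob
import os

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Assumption: end-of-epoch checkpoints are written to this directory.
ckpt_dir = "checkpoints"
checkpoint_cb = ModelCheckpoint(dirpath=ckpt_dir, save_last=True)

# Resume from the newest end-of-epoch checkpoint, if any exist yet.
candidates = glob.glob(os.path.join(ckpt_dir, "*.ckpt"))
resume_path = max(candidates, key=os.path.getmtime) if candidates else None

trainer = Trainer(
    callbacks=[checkpoint_cb],
    # auto_requeue=False keeps Lightning from writing an hpc_ckpt_* file that
    # would otherwise take precedence over resume_path on restart.
    plugins=[SLURMEnvironment(auto_requeue=False)],
)
trainer.fit(model, ckpt_path=resume_path)  # model: your LightningModule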

@12michi34

No worries. With your description I think I can sort it out, thanks!
