
HPC Resubmit resume on most recent epoch checkpoint #13773

Open
SpontaneousDuck opened this issue Jul 20, 2022 · 11 comments

@SpontaneousDuck
Contributor

SpontaneousDuck commented Jul 20, 2022

🚀 Feature

Allow resuming from the most recent end-of-epoch checkpoint when a training job is killed and resubmitted in a SLURM environment.

Motivation

Mid-epoch checkpointing does not appear to work with my model; even with fault-tolerant training I still get some weird results. Since I am training on a smaller dataset with a larger number of epochs, it would be really useful to be able to resume from the most recent checkpoint saved by the normal end-of-epoch checkpointing.

Pitch

Instead of forcing users into the checkpoint process defined by the SLURMEnvironment plugin, allowing users to customize the pause/resume operation would be a useful feature. Maybe add this as an option to the SLURMEnvironment plugin? Since my SLURM job ID is the same after resubmission, default_root_dir resolves to the same path as in the previous job, so the newest checkpoint should be easy to find.

Alternatives

Just an option for resuming from end-of-epoch checkpoints would solve my problem. Exposing hooks for full customization of this behavior would be the most flexible version, but it would also put the most work on the user.
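
For illustration, the kind of option I have in mind might look roughly like this (the save_hpc_checkpoint_on_requeue argument below is purely hypothetical and does not exist in the SLURMEnvironment plugin today):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Purely hypothetical sketch: save_hpc_checkpoint_on_requeue is NOT a real
# argument; it only illustrates the kind of knob this request is asking for.
trainer = Trainer(
    plugins=[SLURMEnvironment(auto_requeue=True, save_hpc_checkpoint_on_requeue=False)]
)
# On resubmission, the trainer would then resume from the newest end-of-epoch
# checkpoint under default_root_dir instead of an hpc_ckpt_* file.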

cc @Borda @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj

@SpontaneousDuck SpontaneousDuck added the needs triage Waiting to be triaged by maintainers label Jul 20, 2022
@carmocca carmocca added question Further information is requested environment: slurm and removed needs triage Waiting to be triaged by maintainers labels Aug 5, 2022
@carmocca
Contributor

carmocca commented Aug 5, 2022

This is already supported by doing:

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])

trainer.fit(..., ckpt_path="last")

@SpontaneousDuck
Contributor Author

Won't this disable the auto re-queueing functionality? I am looking to use the two features together.

@carmocca
Contributor

Yes, it would. I think I missed what you need. You want to use auto_requeue and have ckpt_path="last" pick up the most recent checkpoint. Does that not work?

@SpontaneousDuck
Contributor Author

So from what I can tell, the checkpoint resume feature will select a checkpoint in the following order:

https://github.com/Lightning-AI/lightning/blob/34f98836fb452674f43f66babb2325ec17e8e192/src/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L123-L129

https://github.com/Lightning-AI/lightning/blob/34f98836fb452674f43f66babb2325ec17e8e192/src/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L78

When Lightning resumes, it appears it will always select the HPC checkpoint file if one is available. If auto_requeue=True, an HPC checkpoint will always be created, and thus that checkpoint will always be restored from. I can't see a way to override this logic currently. Am I missing something?
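
Roughly paraphrased, the linked connector logic boils down to the following priority (this is not the actual source, just its effective behaviour):

def select_resume_path(hpc_resume_path, user_ckpt_path):
    # hpc_resume_path: the highest-numbered hpc_ckpt_<n>.ckpt found in the
    # weights save path, or None if there is none.
    # user_ckpt_path: whatever was passed to trainer.fit(ckpt_path=...).
    # An HPC checkpoint, when present, always wins.
    return hpc_resume_path or user_ckpt_path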

@carmocca
Contributor

carmocca commented Aug 25, 2022

Thanks, I see what your problem is now.

It has been like this since the very beginning, and customizing it would require a bit of a refactor. My proposal would be:

  1. Add trainer.fit(ckpt_path="hpc"), which explicitly selects hpc checkpoints. This would be the alternative to the existing behavior where they are always preferred. cc @awaelchli
  2. When SLURM is used, ckpt_path=None (the default) selects ckpt_path="hpc" for backwards compatibility. You would then be able to override it manually (see the usage sketch after this list).
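
Usage under this proposal would look roughly like the following (ckpt_path="hpc" does not exist yet; this only sketches the proposed behaviour):

# Proposed, not yet implemented: explicitly opt in to hpc_ckpt_* checkpoints.
trainer.fit(model, ckpt_path="hpc")

# Under SLURM, the default ckpt_path=None would keep resolving to "hpc" for
# backwards compatibility, but could now be overridden manually, e.g. to
# resume from the newest end-of-epoch checkpoint instead:
trainer.fit(model, ckpt_path="last")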

Additionally, we would want trainer.fit(ckpt_path="last") to also track hpc checkpoints. I'm not sure whether the current behaviour does that. cc @otaj

Another improvement that came to mind: we should check whether the "last", "best", and "hpc" directories/files exist and warn users about the potential naming conflict.

@carmocca carmocca added feature Is an improvement or enhancement help wanted Open to be worked on checkpointing Related to checkpointing and removed question Further information is requested labels Aug 25, 2022
@carmocca carmocca added this to the pl:future milestone Aug 25, 2022
@SpontaneousDuck
Contributor Author

That sounds like a great solution to me! From what I can tell, though, trainer.fit(ckpt_path="last") does not automatically pick up the last checkpoint, since no training has occurred in the process at the beginning of a requeue. Setting "last" only pulls last_model_path from the ModelCheckpoint callback (shown in the code block below), which appears to be set only when the callback saves a checkpoint. I currently cannot get "last" to load the most recent checkpoint with a fresh instance of Trainer and ModelCheckpoint. Am I missing something to get the trainer to pick up my previous checkpoints?

https://github.com/Lightning-AI/lightning/blob/807435885ea265580fee9f4e69c063eace46def2/src/pytorch_lightning/trainer/trainer.py#L1432-L1437
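
For illustration, the attribute that "last" falls back to has not been populated yet in a freshly requeued process:

from pytorch_lightning.callbacks import ModelCheckpoint

# Illustration only: in a fresh process nothing has been saved yet, so the
# attribute that ckpt_path="last" reads from is still empty.
checkpoint_cb = ModelCheckpoint(dirpath="checkpoints", save_last=True)
print(repr(checkpoint_cb.last_model_path))  # '' until this process saves a checkpoint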

@SpontaneousDuck
Contributor Author

Ah, so it appears there is not really any logic in Lightning for auto-resuming from checkpoints saved with ModelCheckpoint. The user is expected to pass in the checkpoint file they want via load_from_checkpoint or ckpt_path. Checkpoint auto-selection on a clean training run only happens when the Trainer is in HPC mode; in that case, only checkpoints in the folder whose names start with hpc_ckpt_ are considered, as shown in the function below.

https://github.com/Lightning-AI/lightning/blob/d4bcafad7a64d7c39598fa7e4e33b81a1be31828/src/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L59-L68

I took a first stab at accomplishing what I need by adding a flag to the SLURMEnvironment plugin that allows disabling the automatic creation of checkpoints before restarting. The HPC resume process also chooses the most recent checkpoint in the weights save folder instead of just the highest-numbered HPC checkpoint. This lets me disable the mid-epoch checkpoints created by Lightning and have the most recent checkpoint I created loaded when the auto-requeued job initializes. How does this look?

https://github.com/SpontaneousDuck/pytorch-lightning/tree/hpc_checkpoints

@carmocca
Contributor

From what I can tell, trainer.fit(ckpt_path="last") does not automatically pick up the last checkpoint...

Yes. I believe there's an issue with the loading order. Also reported in #14466

@12michi34

@SpontaneousDuck I just found this thread. Is your workaround still available to look at? I tried the link above and only got a 404.
I would like to do the same as you described: use an end-of-epoch checkpoint instead of a mid-epoch HPC one.

@SpontaneousDuck
Contributor Author

Sorry, I do not have the code anymore, and PyTorch Lightning has changed a bunch since then. Essentially, I removed the code that saves a checkpoint when stopping and pointed the checkpoint-loading code at the path from my checkpoint callback. It should be pretty simple to recreate!
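
If patching Lightning is not an option, a rough user-side approximation is sketched below. Note that it gives up Lightning's automatic requeueing (auto_requeue=False), so the sbatch script has to handle resubmission itself, and the checkpoint directory and filename pattern are only assumptions to adjust for your setup:

import glob
import os

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Assumption: end-of-epoch checkpoints are written to this directory.
ckpt_dir = "checkpoints"
checkpoint_cb = ModelCheckpoint(dirpath=ckpt_dir, save_last=True)

# Resume from the newest end-of-epoch checkpoint, if any exist yet.
candidates = glob.glob(os.path.join(ckpt_dir, "*.ckpt"))
resume_path = max(candidates, key=os.path.getmtime) if candidates else None

trainer = Trainer(
    callbacks=[checkpoint_cb],
    # auto_requeue=False keeps Lightning from writing an hpc_ckpt_* file that
    # would otherwise take precedence over resume_path on restart.
    plugins=[SLURMEnvironment(auto_requeue=False)],
)
trainer.fit(model, ckpt_path=resume_path)  # model: your LightningModule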

@12michi34

No worries. With your description I think I can sort it out, thanks!
