HPC Resubmit resume on most recent epoch checkpoint #13773
Comments
This is already supported by doing:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])
trainer.fit(..., ckpt_path="last")
```
Won't this disable the auto re-queueing functionality? I am looking to use the two features together.
Yes, it would. I think I missed what you need: you want to use both features together.
So from what I can tell, the checkpoint resume feature selects a checkpoint in a fixed order: when Lightning resumes, it appears to always select the HPC checkpoint file if one is available, before considering anything else (a sketch of my understanding follows).
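For illustration, a minimal sketch of that selection behavior, assuming the `hpc_ckpt_<n>.ckpt` naming Lightning used for HPC checkpoints at the time; the function name is invented:

```python
import os
import re

def pick_resume_checkpoint(ckpt_dir, user_ckpt_path=None):
    # If any hpc_ckpt_<n>.ckpt file exists, the highest-numbered one
    # always wins over whatever the user passed as ckpt_path.
    versions = []
    for fname in os.listdir(ckpt_dir):
        match = re.fullmatch(r"hpc_ckpt_(\d+)\.ckpt", fname)
        if match:
            versions.append(int(match.group(1)))
    if versions:
        return os.path.join(ckpt_dir, f"hpc_ckpt_{max(versions)}.ckpt")
    return user_ckpt_path  # e.g. "last" or an explicit path
```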
Thanks, I see what your problem is now. It has worked this way since the very beginning, and customizing it would require a bit of a refactor; my proposal would be to make this resume order configurable instead of hard-coded. Another improvement that came to mind: we should check whether "last", "best", and "hpc" directories/files exist, so we can warn users about the potential naming conflict (see the sketch below).
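A minimal sketch of that naming-conflict check, assuming a simple prefix match; the function name and warning text are invented for illustration:

```python
import os
import warnings

def warn_on_reserved_ckpt_names(ckpt_dir: str) -> None:
    # The proposal above treats "last", "best", and "hpc" as names with
    # special meaning, so existing files starting with them are ambiguous.
    for reserved in ("last", "best", "hpc"):
        if any(f.startswith(reserved) for f in os.listdir(ckpt_dir)):
            warnings.warn(
                f"Found files starting with {reserved!r} in {ckpt_dir!r}; "
                f"these may conflict with ckpt_path={reserved!r}."
            )
```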
That sounds like a great solution to me!
Ah, so it appears there is not really any logic in Lightning for auto-resuming from checkpoints saved with the checkpoint callback.

I took a first stab at accomplishing what I need by adding a flag to the SLURMEnvironment plugin that allows disabling the automatic creation of checkpoints before restarting. The HPC resume process will also choose the most recent checkpoint in the weights save folder instead of just the highest-numbered HPC checkpoint. This lets me disable the mid-epoch checkpoints Lightning creates and have the most recent checkpoint I saved be loaded when the auto-requeued job initializes (sketched below).

How does this look? https://github.com/SpontaneousDuck/pytorch-lightning/tree/hpc_checkpoints
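A rough sketch of that approach, not the actual branch: the `save_on_requeue` flag and the helper below are invented for illustration; only `auto_requeue` is a real `SLURMEnvironment` argument:

```python
import os

from pytorch_lightning.plugins.environments import SLURMEnvironment

class RequeueFriendlySLURMEnvironment(SLURMEnvironment):
    """Hypothetical variant: keep auto-requeue, but skip the checkpoint
    Lightning would save right before the job is requeued."""

    def __init__(self, auto_requeue: bool = True, save_on_requeue: bool = True):
        super().__init__(auto_requeue=auto_requeue)
        self.save_on_requeue = save_on_requeue  # invented flag, not in Lightning

def most_recent_checkpoint(ckpt_dir: str):
    # Choose the newest .ckpt by modification time, whether it is an
    # hpc_ckpt_* file or an end-of-epoch file from ModelCheckpoint.
    paths = [
        os.path.join(ckpt_dir, f)
        for f in os.listdir(ckpt_dir)
        if f.endswith(".ckpt")
    ]
    return max(paths, key=os.path.getmtime) if paths else None
```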
Yes. I believe there's an issue with the loading order. Also reported in #14466.
@SpontaneousDuck just found this thread. Is your workaround still available to look at? I tried the link above and got only a 404.
Sorry, I do not have the code anymore, and PyTorch Lightning has changed a bunch since then. Essentially, what I did was remove the code that saves a checkpoint when stopping, and point the checkpoint-loading code at the path used by my checkpoint callback. Should be pretty simple to recreate!
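For anyone recreating it, a hedged sketch of the resume side under those assumptions; the helper name, checkpoint directory, and `save_last` usage are illustrative, not the original code:

```python
import glob
import os

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

def fit_resuming_from_newest(model, ckpt_dir="checkpoints/"):
    # `model` is your LightningModule; `ckpt_dir` is where the
    # ModelCheckpoint callback writes its end-of-epoch files.
    checkpoint_cb = ModelCheckpoint(dirpath=ckpt_dir, save_last=True)
    trainer = Trainer(callbacks=[checkpoint_cb])

    # Resume from the newest callback checkpoint (by mtime) instead of
    # letting an hpc_ckpt_* file take precedence.
    candidates = glob.glob(os.path.join(ckpt_dir, "*.ckpt"))
    latest = max(candidates, key=os.path.getmtime) if candidates else None
    trainer.fit(model, ckpt_path=latest)
```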
No worries. With your description I think I can sort it out, thanks!
🚀 Feature
Allow end-of-epoch checkpoints for resuming killed and resubmitted training jobs in a SLURM environment.
Motivation
Mid-epoch checkpointing does not appear to work with my model; even with fault-tolerant training I still get some weird results. Since I am training on a smaller dataset for a larger number of epochs, it would be really useful to be able to resume from the most recent checkpoint saved by the normal end-of-epoch checkpointing.
Pitch
Instead of forcing users into the checkpoint process defined by the `SLURMEnvironment` plugin, allowing users to customize the pause/resume operation would be a useful feature. Maybe add this as an option to the `SLURMEnvironment` plugin? Since my SLURM job ID is the same after resubmission, the `default_root_dir` is set to the same path as the previous job, so the newest checkpoint should be easy to find.

Alternatives
Just an option for resuming from end-of-epoch checkpoints would solve my problem. Allowing hooks for full customization of this behavior would be the most flexible option, but it puts the most work on the user (see the hypothetical sketch below).
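Purely to illustrate the hook idea, a sketch of what such a customization point might look like; the subclass and the `resume_checkpoint_path` method are invented and not part of the Lightning API:

```python
from pytorch_lightning.plugins.environments import SLURMEnvironment

class MySLURMEnvironment(SLURMEnvironment):
    # Hypothetical hook: let the environment decide which checkpoint
    # the requeued job should resume from. Not part of Lightning.
    def resume_checkpoint_path(self, default_root_dir: str):
        # e.g. return the newest end-of-epoch checkpoint under
        # default_root_dir instead of the highest hpc_ckpt_* file.
        ...
```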
cc @Borda @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj