Find last checkpoints on restart #14907
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
LGTM. Waiting on a test though!
I found some tests related to this in
@awaelchli, @carmocca there is a test now. Btw, writing it helped me uncover a small bug with the code, see d3a3b35
Great!
* use more recent lightning cloud launcher * allow LightningApp to use custom cloud compute for flows * feedback from adrian * adjust other cloud tests * update * update * update commens * Update src/lightning_app/core/app.py Co-authored-by: Sherin Thomas <sherin@grid.ai> * Close profiler when `StopIteration` is raised (#14945) * Find last checkpoints on restart (#14907) Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> * Remove unused gcsfs dependency (#14962) * Update hpu mixed precision link (#14974) Signed-off-by: Jerome <janand@habana.ai> * Bump version of fsspec (#14975) fsspec verbump * Fix TPU test CI (#14926) * Fix TPU test CI * +x first * Lite first to uncovert errors faster * Fixes * One more * Simplify XLALauncher wrapping to avoid pickle error * debug * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Debug commit successful. Trying local definitions * Require tpu for mock test * ValueError: The number of devices must be either 1 or 8, got 4 instead * Fix mock test * Simplify call, rely on defaults * Skip OSError for now. Maybe upgrading will help * Simplify launch tests, move some to lite * Stricter typing * RuntimeError: Accessing the XLA device before processes have spawned is not allowed. * Revert "RuntimeError: Accessing the XLA device before processes have spawned is not allowed." This reverts commit f65107e. * Alternative boring solution to the reverted commit * Fix failing test on CUDA machine * Workarounds * Try latest mkl * Revert "Try latest mkl" This reverts commit d06813a. * Wrong exception * xfail * Mypy * Comment change * Spawn launch refactor * Accept that we cannot lazy init now * Fix mypy and launch test failures * The base dockerfile already includes mkl-2022.1.0 - what if we use it? 
* try a different mkl version * Revert mkl version changes Co-authored-by: awaelchli <aedu.waelchli@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> * Trainer: fix support for non-distributed PyTorch (#14971) * Trainer: fix non-distributed use * Update CHANGELOG * fixes typing errors in rich_progress.py (#14963) * revert default cloud compute rename * allow LightningApp to use custom cloud compute for flows * feedback from adrian * update * resolve merge with master conflict * remove preemptible * update CHANGELOG * add basic flow cloud compute documentation * fix docs build * add missing symlink * try to fix sphinx * another attempt for docs * fix new test Signed-off-by: Jerome <janand@habana.ai> Co-authored-by: thomas chaton <thomas@grid.ai> Co-authored-by: Sherin Thomas <sherin@grid.ai> Co-authored-by: Ziyad Sheebaelhamd <47150407+ziyadsheeba@users.noreply.github.com> Co-authored-by: otaj <6065855+otaj@users.noreply.github.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: Jerome Anand <88475913+jerome-habana@users.noreply.github.com> Co-authored-by: awaelchli <aedu.waelchli@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Akihiro Nitta <nitta@akihironitta.com> Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com> Co-authored-by: DP <10988155+donlapark@users.noreply.github.com>
What does this PR do?
Fixes #14466
Although #12816 enabled support for passing "last" as an argument for reloading a checkpoint, reloading happened only for FT checkpoints and checkpoints that were created within the same run. This PR adds support for reloading the last checkpoint on restart of the run.
Note that this makes the "best" and "last" keywords divergent in their behavior: "best" doesn't reload the best available checkpoint when starting a new run for training.
Does your PR introduce any breaking changes? If yes, please list them.
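To illustrate the restart behavior described above, here is a minimal standalone sketch of what "finding the last checkpoint" from a previous run could look like. It is not the code in this PR: the helper name `find_last_checkpoint`, the `*.ckpt` glob, the `"last"` filename marker, and the choice of modification time as the tie-breaker are all assumptions made for illustration, and it uses only the standard library rather than Lightning internals.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_last_checkpoint(dirpath: str, marker: str = "last") -> Optional[str]:
    """Hypothetical helper: return the newest checkpoint file whose name contains `marker`.

    Assumes checkpoints are `*.ckpt` files in a single directory and that
    "newest" means the largest modification time. Returns None if nothing matches.
    """
    directory = Path(dirpath)
    if not directory.is_dir():
        return None
    # Only files marked as "last" are considered; other checkpoints are ignored.
    candidates = [p for p in directory.glob("*.ckpt") if marker in p.name]
    if not candidates:
        return None
    return str(max(candidates, key=lambda p: p.stat().st_mtime))


if __name__ == "__main__":
    # Simulate a checkpoint directory left behind by an earlier run.
    with tempfile.TemporaryDirectory() as tmp:
        for name, mtime in [("last.ckpt", 1000), ("last-v1.ckpt", 2000)]:
            path = os.path.join(tmp, name)
            Path(path).touch()
            os.utime(path, (mtime, mtime))  # set access and modification times
        print(find_last_checkpoint(tmp))
```

From the user's side, the feature is reached by passing `"last"` when resuming, for example `trainer.fit(model, ckpt_path="last")`. With this PR, that call is meant to pick up a checkpoint written by a previous run as well.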
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃
cc @Borda @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @carmocca @jjenniferdai @akihironitta