[NeMo-UX] Integrate experiment manager features with NeMo-UX APIs #9460
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
dist_ckpt = ckpt_to_dir(filepath)
shutil.rmtree(dist_ckpt, ignore_errors=True)
logging.info(f"Removed distributed checkpoint: {dist_ckpt}")
except:
Check notice
Code scanning / CodeQL
Except block handles 'BaseException' Note
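The CodeQL note above refers to the fact that a bare `except:` is equivalent to `except BaseException:`, so it also swallows `KeyboardInterrupt` and `SystemExit`. The following is a minimal sketch of the difference, not the change made in this PR; the helper names (`remove_checkpoint_dir`, `swallows_interrupt`, `propagates_interrupt`) are hypothetical and exist only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def remove_checkpoint_dir(path: Path) -> bool:
    """Hypothetical helper: delete a directory, return True if it was removed."""
    try:
        shutil.rmtree(path)
    except OSError as err:  # narrow handler: filesystem failures only
        print(f"Could not remove {path}: {err}")
        return False
    return True


def swallows_interrupt() -> str:
    """A bare except catches BaseException, including KeyboardInterrupt."""
    try:
        raise KeyboardInterrupt
    except:  # noqa: E722 (deliberately bare, to show the problem)
        return "swallowed"


def propagates_interrupt() -> str:
    """`except Exception` lets KeyboardInterrupt pass to the outer handler."""
    try:
        try:
            raise KeyboardInterrupt
        except Exception:
            return "swallowed"
    except KeyboardInterrupt:
        return "propagated"


if __name__ == "__main__":
    tmp = Path(tempfile.mkdtemp())
    print(remove_checkpoint_dir(tmp))   # directory exists, so it is removed
    print(swallows_interrupt())
    print(propagates_interrupt())
```

Catching `OSError` (or at least `Exception`) keeps the cleanup best-effort while still letting a user interrupt or interpreter shutdown stop the process.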
marker_path = ModelCheckpoint.format_checkpoint_unfinished_marker_path(checkpoint_path)
if marker_path.exists():
marker_path.unlink()
except:
Check notice
Code scanning / CodeQL
Except block handles 'BaseException' Note
continue
index = checkpoint.find(self.monitor) + len(self.monitor) + 1 # Find monitor in str + 1 for '='
if index != len(self.monitor):
match = re.search('[A-z]', checkpoint[index:])
Check warning
Code scanning / CodeQL
Overly permissive regular expression range Medium
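The warning above arises because the range `A-z` spans ASCII codes 65 through 122, which includes six punctuation characters that sit between `Z` and `a`. A small sketch of the effect follows; the sample string `0.1234_step=100` is a made-up checkpoint-name fragment, not one taken from this PR, and `[A-Za-z]` is shown as one common fix rather than as the fix the authors chose.

```python
import re

PERMISSIVE = re.compile(r"[A-z]")    # the flagged pattern
STRICT = re.compile(r"[A-Za-z]")     # letters only

# ASCII characters accepted by the permissive range but not by the strict one.
extra = [
    c for c in map(chr, range(128))
    if PERMISSIVE.fullmatch(c) and not STRICT.fullmatch(c)
]

# Hypothetical fragment that follows "<monitor>=" in a checkpoint file name.
sample = "0.1234_step=100"
permissive_pos = PERMISSIVE.search(sample).start()  # stops at the underscore
strict_pos = STRICT.search(sample).start()          # stops at the first letter

if __name__ == "__main__":
    print(extra)
    print(permissive_pos, strict_pos)
```

Whether matching the underscore matters depends on how the surrounding code slices the metric value out of the name, so the stricter class is mainly a way to make the intent explicit.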
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
7ee166d to 0ee7236
0ee7236 to 309cec3
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
import os
import re
import shutil
from dataclasses import dataclass
Check notice
Code scanning / CodeQL
Unused import Note
@@ -8,6 +8,7 @@
from lightning_fabric.plugins.io.checkpoint_io import CheckpointIO
from lightning_fabric.utilities.cloud_io import get_filesystem
from lightning_fabric.utilities.types import _PATH
from megatron.core.dist_checkpointing.strategies import tensorstore
Check notice
Code scanning / CodeQL
Unused import Note
…IDIA#9460)
* [WIP] move experiement manager features into PTL
* cleanup and minor refactoring
* add async checkpointing support, some cleanup of modelcheckpoint and setup_nemo
* more cleanup
* cleanup, reorganization, minor debugging
* Apply isort and black reformatting
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
* Proposal to have AutoResume & Experiment
* Apply isort and black reformatting
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
* small fix
* small bug fixes and cleanup
* Apply isort and black reformatting
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
* remove async checkpointing support. Support will be added in a subsequent PR
* Apply isort and black reformatting
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
* remove unneeded import
* bug fix
* remove deprecated prefix
* rename Experiment to NeMoLogger
* add option to instantiate model checkpoint callback inside of nemo_logger setup
* Apply isort and black reformatting
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
* Proposal to move ModelCheckpoint into NeMoLogger
* Apply isort and black reformatting
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
* minor fixes
* fix merge conflict
* Apply isort and black reformatting
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
* remove unused imports
---------
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Co-authored-by: ashors1 <ashors1@users.noreply.github.com>
Co-authored-by: Marc Romeyn <mromeijn@nvidia.com>
Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com>
What does this PR do?
Add a one-line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information