[NeMo-UX] Integrate experiment manager features with NeMo-UX APIs #9460

Merged
merged 27 commits into from
Jun 14, 2024

@ashors1 ashors1 commented Jun 13, 2024

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@ashors1 ashors1 requested a review from marcromeyn June 13, 2024 05:22
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
dist_ckpt = ckpt_to_dir(filepath)
shutil.rmtree(dist_ckpt, ignore_errors=True)
logging.info(f"Removed distributed checkpoint: {dist_ckpt}")
except:

Check notice

Code scanning / CodeQL

Except block handles 'BaseException' Note

Except block directly handles BaseException.
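To illustrate why CodeQL flags this pattern, here is a minimal standalone sketch (not code from this PR; all function names are hypothetical). A bare `except:` also traps `SystemExit` and `KeyboardInterrupt`, while `except Exception` lets them propagate.

```python
def swallow_everything(action):
    """Bare `except:` traps every BaseException, including SystemExit."""
    try:
        action()
    except:  # noqa: E722 - kept only to show the flagged pattern
        return "swallowed"
    return "ok"


def swallow_errors_only(action):
    """`except Exception` handles ordinary errors but not exit requests."""
    try:
        action()
    except Exception:
        return "swallowed"
    return "ok"


def request_exit():
    # Hypothetical stand-in for an interpreter shutdown request.
    raise SystemExit(1)


def fail():
    # Hypothetical stand-in for a filesystem failure during cleanup.
    raise OSError("disk error")
```

With the narrower handler, `swallow_errors_only(fail)` still returns `"swallowed"`, but `swallow_errors_only(request_exit)` raises `SystemExit` instead of hiding it.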
marker_path = ModelCheckpoint.format_checkpoint_unfinished_marker_path(checkpoint_path)
if marker_path.exists():
marker_path.unlink()
except:

Check notice

Code scanning / CodeQL

Except block handles 'BaseException' Note

Except block directly handles BaseException.
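One possible way to address this note for the marker cleanup is sketched below. It is an assumption-laden illustration, not the fix applied in this PR: `remove_marker`, `demo`, and the marker file name are hypothetical, and only standard-library calls are used.

```python
import logging
import tempfile
from pathlib import Path


def remove_marker(marker_path: Path) -> bool:
    """Delete a marker file if it exists; report whether cleanup succeeded."""
    try:
        # missing_ok=True (Python 3.8+) makes a separate exists() check unnecessary.
        marker_path.unlink(missing_ok=True)
    except OSError as err:
        # Only filesystem errors are handled; KeyboardInterrupt and SystemExit propagate.
        logging.warning("Could not remove marker %s: %s", marker_path, err)
        return False
    return True


def demo() -> tuple:
    """Create a marker in a temporary directory, then remove it twice."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "step=10.ckpt-unfinished"  # hypothetical file name
        marker.touch()
        first = remove_marker(marker)
        second = remove_marker(marker)  # already gone, still succeeds
        return first, second, marker.exists()
```

Catching `OSError` keeps the cleanup best-effort, as the original `except:` appears to intend, without masking interpreter-level exceptions.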
continue
index = checkpoint.find(self.monitor) + len(self.monitor) + 1 # Find monitor in str + 1 for '='
if index != len(self.monitor):
match = re.search('[A-z]', checkpoint[index:])

Check warning

Code scanning / CodeQL

Overly permissive regular expression range Medium

Suspicious character range that is equivalent to [A-Z\[\\\]^_`a-z].
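The warning can be reproduced with a short standalone check (the sample string is made up for illustration). In ASCII, the range `A-z` spans code points 65 to 122, so it also covers the six punctuation characters that sit between `Z` and `a`.

```python
import re

# Characters between 'Z' and 'a' in ASCII that [A-z] accepts unintentionally.
extra_chars = [chr(code) for code in range(ord("Z") + 1, ord("a"))]

# Hypothetical text following a monitored metric value in a checkpoint name.
text = "0.1234_step=100"

permissive = re.search("[A-z]", text)   # stops at the underscore
strict = re.search("[A-Za-z]", text)    # stops at the first real letter

print(extra_chars)                           # ['[', '\\', ']', '^', '_', '`']
print(permissive.group(), permissive.start())  # _ 6
print(strict.group(), strict.start())          # s 7
```

Whether matching the underscore matters depends on how the caller uses the match position; `[A-Za-z]` states the intent explicitly if only letters are meant.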
nemo/lightning/io/pl.py Fixed
nemo/lightning/io/pl.py Fixed
nemo/lightning/io/pl.py Fixed
nemo/lightning/pytorch/resume.py Fixed
nemo/lightning/pytorch/resume.py Fixed
nemo/lightning/pytorch/trainer.py Fixed
marcromeyn and others added 2 commits June 13, 2024 15:14
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
@ashors1 ashors1 force-pushed the ashors/nemo-ux-remove-exp-manager branch from 7ee166d to 0ee7236 Compare June 13, 2024 23:04
@ashors1 ashors1 force-pushed the ashors/nemo-ux-remove-exp-manager branch from 0ee7236 to 309cec3 Compare June 13, 2024 23:13
ashors1 and others added 3 commits June 14, 2024 09:34
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
import os
import re
import shutil
from dataclasses import dataclass

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'dataclass' is not used.
@@ -8,6 +8,7 @@
from lightning_fabric.plugins.io.checkpoint_io import CheckpointIO
from lightning_fabric.utilities.cloud_io import get_filesystem
from lightning_fabric.utilities.types import _PATH
from megatron.core.dist_checkpointing.strategies import tensorstore

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'tensorstore' is not used.
@marcromeyn marcromeyn marked this pull request as ready for review June 14, 2024 18:57
@marcromeyn marcromeyn self-requested a review June 14, 2024 18:59
@marcromeyn marcromeyn merged commit 77dbb00 into main Jun 14, 2024
112 checks passed
@marcromeyn marcromeyn deleted the ashors/nemo-ux-remove-exp-manager branch June 14, 2024 21:10
JesusPaz pushed a commit to JesusPaz/NeMo that referenced this pull request Jun 18, 2024
…IDIA#9460)

* [WIP] move experiment manager features into PTL

* cleanup and minor refactoring

* add async checkpointing support, some cleanup of modelcheckpoint and setup_nemo

* more cleanup

* cleanup, reorganization, minor debugging

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* Proposal to have AutoResume & Experiment

* Apply isort and black reformatting

Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>

* small fix

* small bug fixes and cleanup

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* remove async checkpointing support. Support will be added in a subsequent PR

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* remove unneeded import

* bug fix

* remove deprecated prefix

* rename Experiment to NeMoLogger

* add option to instantiate model checkpoint callback inside of nemo_logger setup

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* Proposal to move ModelCheckpoint into NeMoLogger

* Apply isort and black reformatting

Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>

* minor fixes

* fix merge conflict

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* remove unused imports

---------

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Co-authored-by: ashors1 <ashors1@users.noreply.github.com>
Co-authored-by: Marc Romeyn <mromeijn@nvidia.com>
Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
@ko3n1g ko3n1g mentioned this pull request Jul 18, 2024
2 tasks