Set MLFlowLogger status to FAILED when training raises an error #12292

Merged
merged 29 commits into Lightning-AI:master on Sep 20, 2022

Conversation

ritsuki1227
Contributor

@ritsuki1227 ritsuki1227 commented Mar 10, 2022

What does this PR do?

If training with MLFlowLogger raises an error, the user should be able to see in the MLflow UI that the run has failed.
In the current implementation, the MLflow run status remains "RUNNING" even after trainer.fit raises an error, so the user cannot tell whether training is still in progress or has failed.

Fixes #12291
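
For illustration, the intended behavior can be sketched as follows. This is a minimal, self-contained sketch and not the code in this PR: `FakeTrackingClient`, `SketchLogger`, and `run_and_finalize` are hypothetical names standing in for the MLflow client, the logger, and the trainer's error handling. It assumes the logger exposes a `finalize(status)` hook and that the run statuses of interest are "RUNNING", "FINISHED", and "FAILED".

```python
from typing import Callable, Dict


class FakeTrackingClient:
    """Hypothetical stand-in for an MLflow tracking client; it only records run status."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}

    def create_run(self, run_id: str) -> None:
        # A freshly created run is reported as RUNNING.
        self.status[run_id] = "RUNNING"

    def set_terminated(self, run_id: str, status: str) -> None:
        self.status[run_id] = status


class SketchLogger:
    """Hypothetical logger that maps a training outcome to a run status."""

    _STATUS_MAP = {"success": "FINISHED", "failed": "FAILED"}

    def __init__(self, client: FakeTrackingClient, run_id: str) -> None:
        self.client = client
        self.run_id = run_id
        client.create_run(run_id)

    def finalize(self, status: str = "success") -> None:
        # Unknown outcomes are passed through unchanged (an assumption of this sketch).
        self.client.set_terminated(self.run_id, self._STATUS_MAP.get(status, status))


def run_and_finalize(fit_fn: Callable[[], None], logger: SketchLogger) -> None:
    """Run fit_fn and always finalize the logger, marking the run failed on any error."""
    try:
        fit_fn()
    except BaseException:
        logger.finalize("failed")
        raise  # the original error still reaches the caller
    logger.finalize("success")


def failing_fit() -> None:
    raise RuntimeError("boom")


client = FakeTrackingClient()
logger = SketchLogger(client, "run-1")
try:
    run_and_finalize(failing_fit, logger)
except RuntimeError:
    pass
print(client.status["run-1"])  # FAILED
```

Without the `except` branch, the run in this sketch would stay "RUNNING" after the error, which is the situation described above.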

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@ritsuki1227 ritsuki1227 changed the title from "bugfix: update MLFlowLogger's status to be FAILED when trainig raises…" to "bugfix: update MLFlowLogger's status to be FAILED when training raises an error" Mar 10, 2022
@ananthsub ananthsub added logger: mlflow logger Related to the Loggers labels Mar 10, 2022
@ritsuki1227 ritsuki1227 marked this pull request as draft March 12, 2022 18:35
@ritsuki1227
Contributor Author

The commit fixes the bug, but it fails an existing unit test, and I couldn't find a solution.
Can anyone help me with this draft PR?

@ritsuki1227 ritsuki1227 marked this pull request as ready for review March 16, 2022 17:00
@ritsuki1227
Contributor Author

I would appreciate it if you approve running the workflows 🙂

pytorch_lightning/loggers/mlflow.py (outdated, resolved)
pytorch_lightning/loggers/tensorboard.py (outdated, resolved)
pytorch_lightning/trainer/trainer.py (outdated, resolved)
tests/loggers/test_mlflow.py (outdated, resolved)
@awaelchli
Member

I would appreciate it if you approve running the workflows 🙂

Most of the team was out of office. Sorry about the delay.

The commit has succeeded to fix the bug, but failed to pass an existing unit test, which I couldn't find a solution to.
Can anyone help me with the draft PR?

Which test was it?

@ritsuki1227
Contributor Author

ritsuki1227 commented Mar 20, 2022

@awaelchli
Thank you for your review and modification!

Which test was it?

It was tests/utilities/test_cli.py::test_cli_distributed_save_config_callback.
The latest implementation passes this test, but the failure can be reproduced on commit bee913f.

@ritsuki1227
Contributor Author

@awaelchli @daniellepintz
Do you have any update on this?
I understand some PRs may be pending because of the 1.6 release.

@carmocca carmocca added this to the 1.7 milestone Apr 6, 2022
@carmocca carmocca added the feature Is an improvement or enhancement label Apr 6, 2022
@awaelchli
Member

@ritsuki1227 I created a fix in #14762 to unblock this PR for the remaining issues.

@mergify mergify bot removed the has conflicts label Sep 18, 2022
@mergify mergify bot added ready PRs ready to be merged has conflicts and removed ready PRs ready to be merged labels Sep 18, 2022
Contributor

@rohitgr7 rohitgr7 left a comment

Would you mind adding a changelog entry?

src/pytorch_lightning/loggers/wandb.py (outdated, resolved)
@mergify mergify bot added ready PRs ready to be merged and removed has conflicts ready PRs ready to be merged labels Sep 19, 2022
@awaelchli
Member

@ritsuki1227 before we merge, would you mind checking your MLflow use case again to confirm that the status is reported as you expect? Thank you!

@ritsuki1227
Contributor Author

@awaelchli Thank you for your commits! I've checked that the MLflow logger works correctly for my use case.

@awaelchli
Member

Awesome, let's get it in then! Thanks a lot for your help in making the integration with PL better!

@awaelchli awaelchli merged commit 6855f65 into Lightning-AI:master Sep 20, 2022
rohitgr7 pushed a commit that referenced this pull request Sep 24, 2022
…2292)

Co-authored-by: Ritsuki Yamada <ritsuki.yamada@uzabase.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Labels
community (This PR is from the community), feature (Is an improvement or enhancement), logger: mlflow, logger (Related to the Loggers), pl (Generic label for PyTorch Lightning package), ready (PRs ready to be merged)
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

MLFlowLogger's status is "RUNNING" even after training failed
6 participants