-
Notifications
You must be signed in to change notification settings - Fork 3.6k
[BugFix] Resolve DDP hanging bug within ModelCheckpoint and val_loss #6004
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @tchaton! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-02-17 13:12:13 UTC |
Codecov Report
@@ Coverage Diff @@
## master #6004 +/- ##
=======================================
- Coverage 93% 90% -3%
=======================================
Files 160 170 +10
Lines 11340 11786 +446
=======================================
+ Hits 10550 10637 +87
- Misses 790 1149 +359 |
# when `val_loss` is being logged and no ModelCheckpoint is being provided | ||
# `val_loss` or `checkpoint_on` will be selected for monitor and need to be reduced to | ||
# prevent processes divergence | ||
if self.monitor in ("val_loss", "checkpoint_on"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we reduce it always?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics already perform reduction during compute or self.log.
I think the issue only happens with legacy metrics.
@awaelchli Should it be always be done ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this. I thought the metrics that the checkpoint gets from the trainer/logger connector are already reduced? We shouldn't let the checkpoint have the responsibility to reduce metrics or to assume how. the mean is not always correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get a test case of this legacy metric? It will help debug and figure out why this reduction is necessary
loss should always be reduced I think, so I'm a bit puzzled how this would fix an issue
Note that we should deprecate and remove |
Sounds good ! |
Seems like this didn't fix the issue in #5865. We're really missing a test case here for us to debug, so might be a good idea to get the user to help out here. |
Pull request was closed
What does this PR do?
Currently, the training can hang under the following conditions:
monitor
is being provided to ModelCheckPointInternally in ModelCheckpoint, self.monitor will become val_loss.
And the code will diverge slightly later if check_monitor_top_k returns different value for both processes.
Solution:
val_loss
andcheckpoint_on
should be reduced.Fixes #5865
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃