Fix log_dir tracking in case of multiple Trainer instances + DDP #7403
Conversation
Codecov Report

@@            Coverage Diff            @@
##           master    #7403     +/-   ##
=========================================
+ Coverage      88%      92%       +4%
=========================================
  Files         218      218
  Lines       14401    14397        -4
=========================================
+ Hits        12680    13286      +606
+ Misses       1721     1111      -610
Hello @awaelchli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues and found none in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-22 19:14:57 UTC
noob q: where does this happen?
@carmocca for example:

trainer = Trainer(logger=True)
if trainer.local_rank > 0:
    time.sleep(2)
# rank 0 gets here earlier while rank 1 is sleeping
trainer.logger.log_hyperparams(...)  # or any other operation that requires the version to write the logs out
# now:
print(trainer.logger.version)  # prints version_0 on rank 0 and version_1 on rank 1
# the Trainer itself will still correctly log to version_0
trainer.fit(model)
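The divergence comes from the directory-scan style of version auto-increment: if one rank creates its `version_N` folder before another rank scans, the late rank sees one more existing folder and picks the next number. A minimal standalone sketch of that scheme (illustrative only, not the actual Lightning implementation):

```python
import os
import re
import tempfile

def next_version(save_dir: str) -> int:
    """Auto-increment scheme similar in spirit to TensorBoardLogger:
    scan save_dir for existing `version_N` folders and return N + 1
    (0 if none exist). Simplified for illustration."""
    if not os.path.isdir(save_dir):
        return 0
    versions = [
        int(m.group(1))
        for name in os.listdir(save_dir)
        if (m := re.match(r"version_(\d+)$", name))
    ]
    return max(versions) + 1 if versions else 0

root = tempfile.mkdtemp()
# "rank 0" resolves its version first and creates the directory...
v0 = next_version(root)  # -> 0
os.makedirs(os.path.join(root, f"version_{v0}"))
# ...so a "rank 1" that scans only afterwards sees version_0 already taken.
v1 = next_version(root)  # -> 1
print(v0, v1)  # prints: 0 1
```

The two processes agree only if they both scan before either creates a folder, which is exactly the race described above.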
Do you think we should print a warning when this is accessed outside of the trainer running scope?
Yes, it would, and I tried to come up with something like that, but the problem is that the logger does not know anything about the Trainer. Such a warning would only apply if world_size > 1 and distributed is not initialized, and that information can only be accessed through the Trainer.
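The condition described here is easy to state once the two facts are available; the hard part, as noted, is that the logger cannot see them. A hypothetical guard (the function name and its parameters are made up for illustration, this is not a Lightning API) might look like:

```python
import warnings

def check_version_access(world_size: int, distributed_initialized: bool) -> None:
    """Hypothetical guard: warn when the logger version is read before
    distributed is set up, since each rank may then auto-increment to a
    different version number. Both inputs would have to come from the
    Trainer, which is exactly the information the logger lacks."""
    if world_size > 1 and not distributed_initialized:
        warnings.warn(
            "Logger version accessed before distributed initialization; "
            "ranks may resolve different versions and log dirs.",
            UserWarning,
        )
```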
Hey @awaelchli, would adding a setup function to the logger help resolve this problem? We could block version access before setup has been called. Best,
@tchaton Yes, I think this idea is worth investigating. Blocking the access will depend on the logger type: TensorBoard and TestTube are the only ones with the problematic auto-increment version functionality; the other loggers handle it differently.
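The gating idea from this exchange could look roughly like the sketch below. Everything here is hypothetical (class name, `setup` signature, the choice to raise rather than warn); it only illustrates the shape of "block version access before setup has been called":

```python
class VersionGatedLogger:
    """Hypothetical sketch, not a Lightning class: the version property
    refuses to resolve until the Trainer has called setup(), so no rank
    can trigger the auto-increment race on its own."""

    def __init__(self) -> None:
        self._version: int | None = None

    def setup(self, version: int) -> None:
        # In the real scenario the Trainer would resolve/broadcast the
        # version once and pass it in here.
        self._version = version

    @property
    def version(self) -> int:
        if self._version is None:
            raise RuntimeError(
                "Logger version accessed before setup(); the version is "
                "only well-defined once the Trainer has resolved it."
            )
        return self._version
```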
What does this PR do?
Fixes #4866 (2nd attempt)
The log dir gets out of sync when multiple Trainers are created in DDP.
Proposed solution:
We remove the PL_EXP_VERSION env variable completely. This solves the problem in the reported issue. Since this env variable was a pseudo-mechanism for synchronizing the logging version, we lose that functionality in the case where the user tries to log manually too early.
"Too early" means the user could run into a race condition where multiple ranks do the auto-increment and end up with different version numbers, and therefore different log dirs.
However, the same thing already happens if the user instantiates a custom logger and tries this, and that case was never covered before either.
A script for testing this branch can be found here: https://gist.github.com/awaelchli/8840b58c340d6b532cc9075248effe3e
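The pseudo-mechanism being removed can be sketched roughly as follows. This is a deliberate simplification (function name and signature are invented for illustration, not the actual Lightning code): rank 0 pins the version it resolved into the env variable, and other ranks read it back instead of re-scanning:

```python
import os

ENV_KEY = "PL_EXP_VERSION"  # the env variable this PR removes

def resolve_version(rank: int, scanned_version: int) -> int:
    """Rough sketch of the old pseudo-synchronization. In reality each
    rank is a separate process; the value propagates because spawned
    workers inherit rank 0's environment. The bug this PR fixes: a
    second Trainer instance sees the stale pinned value instead of
    re-resolving, so its log_dir goes out of sync."""
    if rank == 0:
        os.environ[ENV_KEY] = str(scanned_version)
        return scanned_version
    return int(os.environ.get(ENV_KEY, scanned_version))
```

Once the env variable is stale, every later Trainer agrees on the wrong version, which is why removing the mechanism (and resolving the version per Trainer instance) fixes the reported issue.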
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃