Skip to content

Conversation

@awaelchli
Copy link
Contributor

@awaelchli awaelchli commented May 6, 2021

What does this PR do?

Fixes #4866 (2nd attempt)

The log dir gets out of sync if multiple Trainers get created in DDP.

Proposed solution:
We remove the PL_EXP_VERSION env variable completely. This will solve the problem in the reported issue completely.
Since this env variable was a pseudo-mechanism for synchronizing the logging version, we lose that functionality in the case when the user tries to manually log too early.
Too early means, the user could run into a race condition where multiple ranks do the auto-increment and end up with different version numbers -> different log dirs.
However, then on the other hand this anyway happens if the user instantiates a custom logger and tries the same, and this case was never covered before anyway.

A script for testing this branch can be found here: https://gist.github.com/awaelchli/8840b58c340d6b532cc9075248effe3e

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
    • suggestions welcome, not sure if there is an efficient way to test anything here
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli awaelchli added bug Something isn't working logger Related to the Loggers labels May 6, 2021
@codecov
Copy link

codecov bot commented May 6, 2021

Codecov Report

Merging #7403 (bc07789) into master (b7dbcc3) will increase coverage by 4%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #7403    +/-   ##
=======================================
+ Coverage      88%     92%    +4%     
=======================================
  Files         218     218            
  Lines       14401   14397     -4     
=======================================
+ Hits        12680   13286   +606     
+ Misses       1721    1111   -610     

@awaelchli awaelchli added this to the v1.4 milestone May 6, 2021
@Borda Borda modified the milestones: v1.4, v1.3, v1.3.x May 6, 2021
@awaelchli awaelchli force-pushed the bugfix/ddp-logdir branch from 4e3a317 to cfcca32 Compare July 2, 2021 09:41
@pep8speaks
Copy link

pep8speaks commented Jul 2, 2021

Hello @awaelchli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-22 19:14:57 UTC

@awaelchli awaelchli force-pushed the bugfix/ddp-logdir branch from 572f465 to 565d1ec Compare July 2, 2021 09:44
@Borda Borda modified the milestones: v1.3.x, v1.4 Jul 6, 2021
@edenlightning edenlightning modified the milestones: v1.4, v1.3.x Jul 6, 2021
@awaelchli awaelchli force-pushed the bugfix/ddp-logdir branch 2 times, most recently from bdf7847 to 6f84df6 Compare July 20, 2021 08:06
@awaelchli awaelchli force-pushed the bugfix/ddp-logdir branch from c4c6d4b to c91c1af Compare July 20, 2021 09:22
@awaelchli awaelchli added the distributed Generic distributed-related topic label Jul 20, 2021
@carmocca
Copy link
Contributor

the user could run into a race condition where multiple ranks do the auto-increment

noob q: where does this happen?

@awaelchli
Copy link
Contributor Author

@carmocca for example:

trainer = Trainer(logger=True)

if trainer.local_rank > 0:
    time.sleep(2)

# rank 0 gets here earlier while 1 is sleeping
trainer.logger.log_hyperparameters(...)  # or any other operation that requires the version to write the logs out

# now: 
print(trainer.logger.version) # prints version_0 on rank 0 and version_1 on rank 1

# trainer internally will correctly log to version_0
trainer.fit(model)

@carmocca
Copy link
Contributor

trainer.logger.version

Do you think we should print a warning when this is accessed outside of the trainer running scope?

@awaelchli
Copy link
Contributor Author

yes it would and I tried to come up with something like that but the problem is that the logger does not know anything about the trainer. such a warning would only apply if world_size > 1 and distributed is not initialized. such information can only be accessed through the trainer

@mergify mergify bot added ready PRs ready to be merged has conflicts labels Jul 20, 2021
@mergify mergify bot removed the has conflicts label Jul 20, 2021
@tchaton
Copy link
Contributor

tchaton commented Jul 20, 2021

Hey @awaelchli,

Would adding a setup function to the logger help to resolve this problem where we would block version access before setup has been called ?

Best,
T.C

@awaelchli
Copy link
Contributor Author

@tchaton Yes, I think it's a good idea to investigate this idea.

Blocking the access will depend on the logger type. TensorBoard and TestTube are the only ones with the problematic auto-increment version functionality. The other loggers handle it differently.

@edenlightning edenlightning modified the milestones: v1.3.x, v1.4 Jul 21, 2021
@mergify mergify bot removed the has conflicts label Jul 22, 2021
@awaelchli awaelchli merged commit 0ad7f3a into master Jul 23, 2021
@awaelchli awaelchli deleted the bugfix/ddp-logdir branch July 23, 2021 07:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working distributed Generic distributed-related topic logger Related to the Loggers ready PRs ready to be merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DDP Logdir Multiple Runs Bug

7 participants