Fix log_dir tracking in case of multiple Trainer instances + DDP #7403

awaelchli · 2021-05-06T12:35:54Z

What does this PR do?

Fixes #4866 (2nd attempt)

The log dir gets out of sync if multiple Trainers get created in DDP.

Proposed solution:
We remove the PL_EXP_VERSION env variable completely. This solves the problem in the reported issue.
This env variable was a pseudo-mechanism for synchronizing the logging version, so we lose that functionality when the user tries to log manually too early.
"Too early" means the user could run into a race condition where multiple ranks do the auto-increment and end up with different version numbers, and therefore different log dirs.
However, the same thing already happens if the user instantiates a custom logger and tries to log early, and that case was never covered before.
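
To make the race concrete, here is a minimal sketch of a directory-based auto-increment. It assumes the logger picks the next free `version_N` folder under its save directory; the helper names `next_version` and `simulate_two_ranks` are hypothetical, and the two "ranks" run sequentially in one process for illustration. This is not Lightning's actual implementation.

```python
import os
import tempfile


def next_version(save_dir: str) -> int:
    """Return the next free version number by scanning for ``version_N`` folders."""
    existing = []
    for name in os.listdir(save_dir):
        prefix, _, suffix = name.partition("_")
        if prefix == "version" and suffix.isdigit():
            existing.append(int(suffix))
    return max(existing) + 1 if existing else 0


def simulate_two_ranks() -> tuple:
    """Hypothetical scenario: rank 0 logs early, rank 1 resolves its version later."""
    with tempfile.TemporaryDirectory() as save_dir:
        # Rank 0 resolves its version and writes logs, which creates the folder.
        rank0_version = next_version(save_dir)
        os.makedirs(os.path.join(save_dir, f"version_{rank0_version}"))

        # Rank 1 arrives later, sees the folder from rank 0, and increments again.
        rank1_version = next_version(save_dir)
    return rank0_version, rank1_version


if __name__ == "__main__":
    print(simulate_two_ranks())
```

In this sketch the two ranks disagree on the version only because rank 0 created its folder before rank 1 looked; if neither had written anything yet, both would have resolved the same number.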

A script for testing this branch can be found here: https://gist.github.com/awaelchli/8840b58c340d6b532cc9075248effe3e

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
- suggestions welcome, not sure if there is an efficient way to test anything here
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2021-05-06T12:37:24Z

Codecov Report

Merging #7403 (bc07789) into master (b7dbcc3) will increase coverage by 4%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #7403    +/-   ##
=======================================
+ Coverage      88%     92%    +4%     
=======================================
  Files         218     218            
  Lines       14401   14397     -4     
=======================================
+ Hits        12680   13286   +606     
+ Misses       1721    1111   -610

pep8speaks · 2021-07-02T09:41:45Z

Hello @awaelchli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-22 19:14:57 UTC

carmocca · 2021-07-20T12:26:40Z

> the user could run into a race condition where multiple ranks do the auto-increment

noob q: where does this happen?

awaelchli · 2021-07-20T12:35:31Z

@carmocca for example:

import time

from pytorch_lightning import Trainer

trainer = Trainer(logger=True)

if trainer.local_rank > 0:
    time.sleep(2)

# rank 0 gets here earlier while rank 1 is sleeping
trainer.logger.log_hyperparameters(...)  # or any other operation that requires the version to write the logs out

# now:
print(trainer.logger.version)  # prints version_0 on rank 0 and version_1 on rank 1

# trainer internally will correctly log to version_0
trainer.fit(model)

carmocca · 2021-07-20T12:41:59Z

> trainer.logger.version

Do you think we should print a warning when this is accessed outside of the trainer running scope?

awaelchli · 2021-07-20T12:43:58Z

Yes, I think we should, and I tried to come up with something like that, but the problem is that the logger does not know anything about the trainer. Such a warning would only apply if world_size > 1 and distributed is not initialized, and that information can only be accessed through the trainer.
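
A minimal sketch of the condition described above, assuming the two pieces of state are passed in explicitly (in practice they would have to come from the trainer, since the logger cannot see them). The function name `warn_if_version_unsynced` and the warning text are hypothetical and not part of Lightning.

```python
import warnings


def warn_if_version_unsynced(world_size: int, distributed_initialized: bool) -> bool:
    """Warn when the logger version may differ across ranks.

    Both arguments are assumed to be supplied by the trainer, because the
    logger itself has no access to them.
    """
    at_risk = world_size > 1 and not distributed_initialized
    if at_risk:
        warnings.warn(
            "Logger version accessed before distributed initialization; "
            "ranks may resolve different versions.",
            RuntimeWarning,
        )
    return at_risk
```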

tchaton · 2021-07-20T14:55:39Z

Hey @awaelchli,

Would adding a setup function to the logger help resolve this problem, so that we block version access before setup has been called?

Best,
T.C

awaelchli · 2021-07-21T08:45:31Z

@tchaton Yes, I think this is a good idea to investigate.

Blocking the access will depend on the logger type. TensorBoard and TestTube are the only ones with the problematic auto-increment version functionality. The other loggers handle it differently.
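
One possible shape for the setup idea, as a rough sketch only. The class `SetupGuardedLogger` and its `setup(version)` signature are hypothetical; it assumes the trainer would call `setup` once all ranks agree on a version, and it does not reflect any existing Lightning logger API.

```python
from typing import Optional


class SetupGuardedLogger:
    """Hypothetical logger that refuses to resolve its version before setup()."""

    def __init__(self) -> None:
        self._version: Optional[int] = None
        self._is_setup = False

    def setup(self, version: int) -> None:
        # Assumption: the trainer calls this once all ranks agree on the version.
        self._version = version
        self._is_setup = True

    @property
    def version(self) -> int:
        if not self._is_setup:
            raise RuntimeError("version accessed before setup() was called")
        return self._version
```

With a guard like this, the early `trainer.logger.version` access from the example earlier in the thread would raise instead of silently auto-incrementing. Whether to raise or only warn would be a design choice per logger type.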

awaelchli added the bug (Something isn't working) and logger (Related to the Loggers) labels May 6, 2021

awaelchli added this to the v1.4 milestone May 6, 2021

Borda modified the milestones: v1.4, v1.3, v1.3.x May 6, 2021

Borda assigned awaelchli Jun 15, 2021

awaelchli force-pushed the bugfix/ddp-logdir branch from 4e3a317 to cfcca32 Compare July 2, 2021 09:41

awaelchli force-pushed the bugfix/ddp-logdir branch from 572f465 to 565d1ec Compare July 2, 2021 09:44

Borda modified the milestones: v1.3.x, v1.4 Jul 6, 2021

edenlightning modified the milestones: v1.4, v1.3.x Jul 6, 2021

awaelchli force-pushed the bugfix/ddp-logdir branch 2 times, most recently from bdf7847 to 6f84df6 Compare July 20, 2021 08:06

awaelchli added 2 commits July 20, 2021 11:12

apply fix

24b269d

changelog

c91c1af

awaelchli force-pushed the bugfix/ddp-logdir branch from c4c6d4b to c91c1af Compare July 20, 2021 09:22

unused import

4e13cf8

This was referenced Jul 20, 2021

convert logger environment version to int when necessary #8489

Closed

DDP Logdir Multiple Runs Bug #4866

Closed

awaelchli marked this pull request as ready for review July 20, 2021 11:02

awaelchli requested review from Borda, SeanNaren, carmocca, justusschock and kaushikb11 as code owners July 20, 2021 11:02

awaelchli requested review from tchaton and williamFalcon as code owners July 20, 2021 11:02

awaelchli added the distributed (Generic distributed-related topic) label Jul 20, 2021

carmocca approved these changes Jul 20, 2021


SeanNaren approved these changes Jul 20, 2021


mergify bot added the ready (PRs ready to be merged) and has conflicts labels Jul 20, 2021

Merge branch 'master' into bugfix/ddp-logdir

5664629

mergify bot removed the has conflicts label Jul 20, 2021

mergify bot added the has conflicts label Jul 21, 2021

edenlightning modified the milestones: v1.3.x, v1.4 Jul 21, 2021

Borda approved these changes Jul 22, 2021


Merge branch 'master' into bugfix/ddp-logdir

bc07789

mergify bot removed the has conflicts label Jul 22, 2021

awaelchli merged commit 0ad7f3a into master Jul 23, 2021

awaelchli deleted the bugfix/ddp-logdir branch July 23, 2021 07:18

7 participants
