
Ensure sync across val/test step when using DDP #371

Merged
2 commits merged on Nov 17, 2020

Conversation

SeanNaren
Contributor

What does this PR do?

Fixes Lightning-AI/pytorch-lightning#4693

Lightning's checkpointing behaviour assumes that, if no monitor key is passed to the checkpoint callback, val_loss or checkpoint_on is used as the key for saving the top-k models. This means we have to ensure this value is synced correctly across all processes, so that every rank agrees when tracking the best models.
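For context, here is a minimal sketch of the situation described above. ModelCheckpoint, Trainer and self.log are real Lightning APIs, but the argument values are only illustrative and this is not the code changed by this PR:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# No monitor key passed: Lightning falls back to 'val_loss' / 'checkpoint_on'
# as the quantity that decides which top-k checkpoints to keep.
checkpoint_callback = ModelCheckpoint(save_top_k=3)

# Under DDP every process tracks the "best" value independently, so the
# monitored quantity has to be reduced across ranks to stay consistent.
# (Flag names vary slightly between Lightning versions.)
trainer = Trainer(gpus=2, accelerator='ddp', callbacks=[checkpoint_callback])

# Inside the LightningModule's validation_step, the fallback key therefore
# needs to be logged with sync_dist=True:
#     self.log('val_loss', loss, sync_dist=True)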

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov

codecov bot commented Nov 16, 2020

Codecov Report

Merging #371 (df0c174) into master (b746be0) will not change coverage.
The diff coverage is 0.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #371   +/-   ##
=======================================
  Coverage   82.00%   82.00%           
=======================================
  Files         100      100           
  Lines        5639     5639           
=======================================
  Hits         4624     4624           
  Misses       1015     1015           
Flag        Coverage Δ
cpu         24.55% <0.00%> (ø)
pytest      24.55% <0.00%> (ø)
unittests   81.25% <0.00%> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                                      Coverage Δ
pl_bolts/models/self_supervised/ssl_finetuner.py   24.13% <0.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b746be0...df0c174. Read the comment docs.

@SeanNaren SeanNaren requested a review from Borda November 16, 2020 15:18
@SeanNaren
Contributor Author

To reduce confusion, I can also remove the returned loss from the test/validation steps, since we're explicitly logging within the function. Let me know if this works, @ananyahjha93.

@Borda Borda added the enhancement (New feature or request) label Nov 16, 2020
Member

@Borda Borda left a comment


lgtm, @ananyahjha93 mind testing it?

@SeanNaren
Contributor Author

@ananyahjha93 and I spoke earlier; we need to clean this up a bit.

We have both class metrics and plain metrics in use, and we log the val loss directly. This means the sync_dist call doesn't need to be added to the self.log call for the class metric, but the two calls then look strangely inconsistent next to each other:

self.log('val_loss', loss, prog_bar=True, sync_dist=True)  # requires sync_dist because it is not a class metric
self.log('val_acc', self.val_acc)  # class metric handles the sync for us

I'm not sure what the resolution is here. I've tried to rally for making sync_dist=True the default, but that has unexpected side effects (what if you wanted the val loss reduced with sum but forgot to change it from the default?).
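Purely as an illustration (LitClassifier, the linear layer and the metric wiring below are assumptions for the sketch, not the module this PR touches), the two styles end up next to each other in a validation_step roughly like this:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.metrics import Accuracy  # metric location around the time of this PR

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)  # illustrative model body
        self.val_acc = Accuracy()  # class metric: knows how to sync itself across processes

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        # plain tensor: needs sync_dist=True so every rank sees the same reduced value
        self.log('val_loss', loss, prog_bar=True, sync_dist=True)
        # class metric: the cross-process reduction is handled internally
        self.val_acc(logits.argmax(dim=-1), y)
        self.log('val_acc', self.val_acc)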

@Borda Borda added the model and Priority (High priority task) labels Nov 16, 2020
@ananthsub

ananthsub commented Nov 16, 2020

Note this came up in Lightning-AI/pytorch-lightning#4323 too

@SeanNaren
Contributor Author

Hey @ananthsub @Borda, what do you think about adding a warning in self.log if this case comes up? Is there a way to make it a one-time warning?

I.e. if sync_dist is False for the logged metric and distributed training is enabled, emit a warning?
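A rough sketch of what that could look like, assuming a standalone helper rather than Lightning's actual internals (warn_once and the keyword arguments below are made up for illustration):

import warnings
from functools import lru_cache

@lru_cache(maxsize=None)  # caching on the message makes each distinct warning fire only once
def warn_once(message: str) -> None:
    warnings.warn(message, UserWarning)

def log(name, value, sync_dist=False, is_distributed=False, is_class_metric=False):
    # hypothetical check: a plain tensor logged under DDP without sync_dist is not
    # reduced across processes, so its value can differ between ranks
    if is_distributed and not sync_dist and not is_class_metric:
        warn_once(
            f"'{name}' is logged with sync_dist=False while running distributed; "
            "its value may differ across processes."
        )
    # ... actual logging would happen here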

Labels
enhancement (New feature or request), model, Priority (High priority task)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

trainer.test(datamodule=dm) stores reference to wrong checkpoint
5 participants