
Ensure sync across val/test step when using DDP #371

Merged
2 commits merged on Nov 17, 2020

Conversation

SeanNaren
Contributor

What does this PR do?

Fixes Lightning-AI/pytorch-lightning#4693

Lightning's checkpointing behaviour assumes that, if no monitor key is passed to the checkpoint callback, val_loss or checkpoint_on is used as the key for saving the top-k models. This means we have to ensure this value is synced correctly across all processes, so that every rank agrees when tracking the best models.
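For context, here is a minimal sketch of the situation described above. ModelCheckpoint, Trainer and self.log are real Lightning APIs, but the argument values are only illustrative and this is not the code changed by this PR:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# No monitor key passed: Lightning falls back to 'val_loss' / 'checkpoint_on'
# as the quantity that decides which top-k checkpoints to keep.
checkpoint_callback = ModelCheckpoint(save_top_k=3)

# Under DDP every process tracks the "best" value independently, so the
# monitored quantity has to be reduced across ranks to stay consistent.
# (Flag names vary slightly between Lightning versions.)
trainer = Trainer(gpus=2, accelerator='ddp', callbacks=[checkpoint_callback])

# Inside the LightningModule's validation_step, the fallback key therefore
# needs to be logged with sync_dist=True:
#     self.log('val_loss', loss, sync_dist=True)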

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov

codecov bot commented Nov 16, 2020

Codecov Report

Merging #371 (df0c174) into master (b746be0) will not change coverage.
The diff coverage is 0.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #371   +/-   ##
=======================================
  Coverage   82.00%   82.00%           
=======================================
  Files         100      100           
  Lines        5639     5639           
=======================================
  Hits         4624     4624           
  Misses       1015     1015           
Flag        Coverage Δ
cpu         24.55% <0.00%> (ø)
pytest      24.55% <0.00%> (ø)
unittests   81.25% <0.00%> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                                      Coverage Δ
pl_bolts/models/self_supervised/ssl_finetuner.py   24.13% <0.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b746be0...df0c174. Read the comment docs.

@SeanNaren SeanNaren requested a review from Borda November 16, 2020 15:18
@SeanNaren
Contributor Author

To reduce confusion, I can also remove the returned loss from the test/validation steps, since we're explicitly logging within the function. Let me know if this works, @ananyahjha93.

@Borda Borda added the enhancement (New feature or request) label Nov 16, 2020
Member

@Borda Borda left a comment


lgtm, @ananyahjha93 mind testing it?

@SeanNaren
Contributor Author

@ananyahjha93 and I spoke earlier; we need to clean this up a bit.

We have both class metrics and plain metrics in use, and we log the val loss directly. This means the sync_dist call doesn't need to be added to the self.log call for the class metric, but the two calls then look strangely inconsistent next to each other:

self.log('val_loss', loss, prog_bar=True, sync_dist=True)  # requires sync_dist because it is not a class metric
self.log('val_acc', self.val_acc)  # class metric handles the sync for us

I'm not sure what the resolution is here. I've tried to rally for making sync_dist=True the default, but that has unexpected side effects (what if you wanted the val loss reduced with sum but forgot to change it from the default?).
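Purely as an illustration (LitClassifier, the linear layer and the metric wiring below are assumptions for the sketch, not the module this PR touches), the two styles end up next to each other in a validation_step roughly like this:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.metrics import Accuracy  # metric location around the time of this PR

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)  # illustrative model body
        self.val_acc = Accuracy()  # class metric: knows how to sync itself across processes

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        # plain tensor: needs sync_dist=True so every rank sees the same reduced value
        self.log('val_loss', loss, prog_bar=True, sync_dist=True)
        # class metric: the cross-process reduction is handled internally
        self.val_acc(logits.argmax(dim=-1), y)
        self.log('val_acc', self.val_acc)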

@Borda Borda added the model and Priority (High priority task) labels Nov 16, 2020
@ananthsub

ananthsub commented Nov 16, 2020

Note this came up in Lightning-AI/pytorch-lightning#4323 too

@SeanNaren
Contributor Author

Hey @ananthsub @Borda, what do you think about adding a warning in self.log if this case comes up? Is there a way to make it a one-time warning?

I.e. if sync_dist is False for the logged metric and distributed training is enabled, emit a warning?
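A rough sketch of what that could look like, assuming a standalone helper rather than Lightning's actual internals (warn_once and the keyword arguments below are made up for illustration):

import warnings
from functools import lru_cache

@lru_cache(maxsize=None)  # caching on the message makes each distinct warning fire only once
def warn_once(message: str) -> None:
    warnings.warn(message, UserWarning)

def log(name, value, sync_dist=False, is_distributed=False, is_class_metric=False):
    # hypothetical check: a plain tensor logged under DDP without sync_dist is not
    # reduced across processes, so its value can differ between ranks
    if is_distributed and not sync_dist and not is_class_metric:
        warn_once(
            f"'{name}' is logged with sync_dist=False while running distributed; "
            "its value may differ across processes."
        )
    # ... actual logging would happen here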

Labels
enhancement (New feature or request), model, Priority (High priority task)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

trainer.test(datamodule=dm) stores reference to wrong checkpoint
5 participants