
Move optimizer creation after device placement for ddp backends. #2904

Merged

Conversation

@PhilJd (Contributor) commented Aug 10, 2020

What does this PR do?

This PR moves optimizer creation to after device placement for DDP backends. This lets the user access the actual parameters used for training when creating the optimizers in the configure_optimizers function.

See discussion in #2902.

Fixes #2902.
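
For illustration (not part of the original PR description), a minimal sketch of the pattern this change enables; the module and hyperparameters below are made up:

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        # With this change, DDP backends call this hook after the model has
        # been moved to its target device, so self.parameters() yields the
        # tensors that are actually updated during training (their .device
        # already reflects the assigned GPU).
        return torch.optim.SGD(self.parameters(), lr=0.02)
```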

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team August 10, 2020 13:29
@PhilJd (Contributor, Author) commented Aug 10, 2020

Seems like the tests that previously passed still work. I did some digging into why test_full_loop_ddp_spawn fails, but then realized that it had already been failing before this change (compare, for example, the CI of #2852).

On one of the machines with more than 2 GPUs that I have access to, this test fails (without this PR) with RuntimeError: [enforce fail at inline_container.cc:144]. PytorchStreamReader failed reading zip archive: failed finding central directory. I'll see if I have some time this evening to take a closer look at this bug, but I can't promise anything, so if someone else is interested in fixing this, please go ahead. :)

@Borda Borda added the bug Something isn't working label Aug 11, 2020
@Borda Borda added the ready PRs ready to be merged label Aug 11, 2020
@mergify mergify bot requested a review from a team August 11, 2020 23:12
@Borda Borda added this to the 0.9.0 milestone Aug 11, 2020
@williamFalcon (Contributor) commented

this is awesome. can we get the GPU tests to pass?

@Borda (Member) commented Aug 11, 2020

> this is awesome. can we get the GPU tests to pass?

It's an unrelated issue: the failing case happens because we set the test accuracy threshold too high and sometimes the run simply does not reach it. It should be fixed by setting a fixed seed, @awaelchli.
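
For reference (not part of the original comment), a minimal sketch of the suggested fix, assuming the test builds its own model and Trainer; `seed_everything` is Lightning's helper for seeding the Python, NumPy, and PyTorch RNGs:

```python
from pytorch_lightning import Trainer, seed_everything

# Fix all RNG sources before creating the model, the data, and the Trainer,
# so the accuracy reached by the short training run is reproducible and the
# test's threshold assertion no longer fails by chance.
seed_everything(1234)

# deterministic=True additionally requests deterministic CuDNN kernels.
trainer = Trainer(deterministic=True, max_epochs=1)
```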

@awaelchli (Member) commented

There is a Fatal Python error: Bus error. This has been happening frequently lately; I usually restart the Drone job and it works the second time. Maybe the GPU is dying :)

@codecov (bot) commented Aug 12, 2020

Codecov Report

Merging #2904 into master will increase coverage by 1%.
The diff coverage is 33%.

@@           Coverage Diff           @@
##           master   #2904    +/-   ##
=======================================
+ Coverage      89%     90%    +1%     
=======================================
  Files          80      80            
  Lines        7531    7825   +294     
=======================================
+ Hits         6738    7041   +303     
+ Misses        793     784     -9     

@Borda (Member) commented Aug 12, 2020

> There is a Fatal Python error: Bus error. This has been happening frequently lately; I usually restart the Drone job and it works the second time. Maybe the GPU is dying :)

I've already seen it several times this week on multiple PRs, kind of randomly...

@PhilJd (Contributor, Author) commented Aug 12, 2020

Hm, how would I go about fixing the codecov/patch check? I'm not familiar with this tool and don't find the details link too helpful ;)
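
For context (not part of the original thread): the codecov/patch status only measures coverage of the lines changed in this diff, so it typically improves by adding a test that runs through the moved code path. A hypothetical sketch, assuming a CI machine with at least two GPUs and the EvalModelTemplate helper from Lightning's test suite at the time:

```python
import pytest
import torch
from pytorch_lightning import Trainer
from tests.base import EvalModelTemplate  # assumed test helper, not part of this PR


@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="needs at least 2 GPUs")
def test_configure_optimizers_runs_after_device_placement(tmpdir):
    model = EvalModelTemplate()
    trainer = Trainer(
        default_root_dir=tmpdir,
        gpus=2,
        distributed_backend="ddp_spawn",
        max_epochs=1,
        limit_train_batches=2,
        limit_val_batches=2,
    )
    # Running a short fit with a DDP backend exercises the reordered
    # optimizer-creation path introduced by this PR.
    trainer.fit(model)
```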

@williamFalcon williamFalcon merged commit e3528af into Lightning-AI:master Aug 12, 2020
ameliatqy pushed a commit to ameliatqy/pytorch-lightning that referenced this pull request Aug 17, 2020
Labels
bug (Something isn't working), ready (PRs ready to be merged)
Successfully merging this pull request may close these issues.

Optimizer initialization with DDP
4 participants