
Support sharded optimizer state dumping outside of sharded strategies #14208

Merged
merged 17 commits into master from feature/oss-state-outside-ddp on Aug 26, 2022

Conversation

@awaelchli (Member) commented Aug 15, 2022

What does this PR do?

Fixes #6387
Redo of #11867

Motivation: a user wants to checkpoint the sharded optimizer outside of the ddp/sharded strategy, i.e. without using that strategy.

This PR implements it exactly as proposed in #6387 and #11867. However, this leaks fairscale-specific logic into the base strategy. We have a few options to mitigate the issue (a hedged sketch of the consolidation step in question follows the options below):

Option 0:
Do not care about this. Move forward with this PR as is.

Option A:
Only apply the current approach to torch's native sharded optimizer (ZeroRedundancyOptimizer). For fairscale, the user is still forced to use the dedicated strategy, and we keep the logic for it in the fairscale sharded strategy.

Option B:
Instead of moving it all the way up into the base strategy, move it into the parallel strategy. This means users would have to use at least a strategy in which parallel execution is assumed. Otherwise, the logic stays the same.
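
For context, here is a minimal, hedged sketch of the consolidation step the options above are weighing. It uses only torch's native ZeroRedundancyOptimizer (fairscale's OSS exposes a consolidate_state_dict method as well); the helper name is hypothetical and this is not the PR's implementation:

```python
from typing import Any, Dict

import torch
from torch.distributed.optim import ZeroRedundancyOptimizer


def full_optimizer_state(optimizer: torch.optim.Optimizer) -> Dict[str, Any]:
    """Hypothetical helper: gather the shards before dumping the optimizer state."""
    if isinstance(optimizer, ZeroRedundancyOptimizer):
        # Gather the per-rank shards onto rank 0; only that rank ends up with the
        # complete state, so state_dict() should be read on rank 0.
        optimizer.consolidate_state_dict(to=0)
    return optimizer.state_dict()
```

The design question in the options above is where a check like this should live: in the base strategy, in the parallel strategy, or only in the dedicated sharded strategies.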

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

cc @Borda @justusschock @awaelchli @rohitgr7 @akihironitta

@github-actions github-actions bot added the pl (Generic label for PyTorch Lightning package) label Aug 15, 2022
@awaelchli awaelchli changed the title from "implement" to "Support for OSS optimizers when dumping checkpoints outside of sharded strategies" Aug 15, 2022
@awaelchli awaelchli added the strategy: fairscale sharded (removed), feature, and refactor labels Aug 15, 2022
@awaelchli awaelchli changed the title from "Support for OSS optimizers when dumping checkpoints outside of sharded strategies" to "Support for sharded optimizer state dumping outside of sharded strategies" Aug 15, 2022
@awaelchli awaelchli changed the title from "Support for sharded optimizer state dumping outside of sharded strategies" to "Support sharded optimizer state dumping outside of sharded strategies" Aug 15, 2022
@awaelchli awaelchli force-pushed the feature/oss-state-outside-ddp branch from fffb49c to 3ab0c6a on August 15, 2022 at 12:48
@awaelchli awaelchli marked this pull request as ready for review August 16, 2022 15:42
@awaelchli awaelchli self-assigned this Aug 16, 2022
@justusschock (Member) left a comment


Implementation- and test-wise this is fine; I just really don't like the solution we opted for (I know I'm late to the party, but I missed the issue).

This feels like negating all the efforts we made during the refactor to separate concerns.

Review thread on src/pytorch_lightning/strategies/strategy.py (outdated, resolved)
@mergify mergify bot removed the has conflicts label Aug 18, 2022
@awaelchli awaelchli added this to the pl:1.8 milestone Aug 18, 2022
@mergify mergify bot added the ready (PRs ready to be merged) label and removed the has conflicts and ready labels Aug 18, 2022
@rohitgr7 (Contributor) left a comment


I think this might be breaking the strategy encapsulation, and now, since PyTorch copied ZeRO from DeepSpeed, we need to provide special support for it.

Why not let users override the Strategy if they are mixing the configuration, or use DeepSpeed stage 1 if they want to use ZeRO or some special wrapper?

Also, I remember @otaj is working on a checkpoint-related part that will let users choose how the checkpoint is created. Maybe that could be a better way to handle such a configuration?
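
As an illustration of the override suggested above, here is a hedged sketch of what a user-side subclass might look like, assuming the optimizer_state hook that Lightning strategies expose around this release. The class name is hypothetical and this is not code from the PR:

```python
from typing import Any, Dict

import torch
from pytorch_lightning.strategies import DDPStrategy


class ConsolidatingDDPStrategy(DDPStrategy):
    """Hypothetical user-side strategy that consolidates sharded optimizer state itself."""

    def optimizer_state(self, optimizer: torch.optim.Optimizer) -> Dict[str, Any]:
        if hasattr(optimizer, "consolidate_state_dict"):
            # fairscale's OSS and torch's ZeroRedundancyOptimizer both gather the
            # per-rank shards here before the full state dict is produced.
            optimizer.consolidate_state_dict()
        return super().optimizer_state(optimizer)
```

Passing such a subclass via Trainer(strategy=ConsolidatingDDPStrategy()) would keep the fairscale/ZeRO knowledge out of the base Strategy, at the cost of pushing it onto the user.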

Review thread on tests/tests_pytorch/strategies/test_sharded_strategy.py (outdated, resolved)
Review thread on tests/tests_pytorch/strategies/test_ddp_strategy.py (outdated, resolved)
Review thread on tests/tests_pytorch/strategies/test_ddp_strategy.py (outdated, resolved)
@awaelchli awaelchli requested a review from otaj as a code owner August 22, 2022 12:29
@awaelchli (Member, Author) replied:

@rohitgr7

I think this might be breaking the strategy encapsulation, and now, since PyTorch copied ZeRO from DeepSpeed, we need to provide special support for it.

I agree, this does not fit well into our strategy design, and personally I don't think we should do it this way either. But in sprint planning it was decided that this issue is important to work on, and I was assigned to it, so I am going to complete the task regardless.

Also, I remember @otaj is working on checkpoint-related part which will let users choose how the checkpoint is created. So maybe that could be a better way to handle such configuration?

It is not the responsibility of the checkpoint callback to know WHAT to save, so it doesn't belong there.
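
To make the separation-of-concerns argument concrete, here is a purely illustrative sketch with hypothetical, simplified classes (not Lightning's actual implementation): the callback decides when and where a checkpoint is written, while the strategy decides what optimizer state goes into it.

```python
from typing import Any, Dict


class Strategy:
    """Decides WHAT goes into the checkpoint, e.g. whether to consolidate shards."""

    def optimizer_state(self, optimizer: Any) -> Dict[str, Any]:
        return optimizer.state_dict()


class CheckpointCallback:
    """Decides WHEN and WHERE to save; it never inspects the optimizer itself."""

    def __init__(self, dirpath: str) -> None:
        self.dirpath = dirpath

    def on_validation_end(self, trainer: Any) -> None:
        # Content assembly is delegated back to the trainer and its strategy.
        trainer.save_checkpoint(f"{self.dirpath}/last.ckpt")
```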

@rohitgr7 (Contributor) replied:

But in sprint planning it was decided that this issue is important to work on, and I was assigned to it.

Can we discuss this again today or in the retro?

It is not the responsibility of the checkpoint callback to know WHAT to save, so it doesn't belong there.

I didn't say checkpoint callback.

@Borda Borda assigned awaelchli and unassigned awaelchli Aug 22, 2022
@awaelchli (Member, Author) commented:

Should this PR and the related issue be closed?

  • 👀 Yes, close it
  • 👍 No, move forward with this solution

@carmocca (Member) commented:

This PR resolves an existing issue in our codebase. Whether the patch is perfect design-wise should be secondary to the improvement in stability; the design can be refined over time or in a follow-up as it matures, especially if there's no clear alternative at the moment. My vote is 👍

@awaelchli awaelchli enabled auto-merge (squash) August 23, 2022 21:01
@codecov codecov bot commented Aug 26, 2022

Codecov Report

Merging #14208 (fd4cab8) into master (a01e016) will increase coverage by 15%.
The diff coverage is 71%.

@@            Coverage Diff            @@
##           master   #14208     +/-   ##
=========================================
+ Coverage      61%      76%    +15%     
=========================================
  Files         332      332             
  Lines       26852    26883     +31     
=========================================
+ Hits        16421    20428   +4007     
+ Misses      10431     6455   -3976     

@awaelchli awaelchli merged commit e67842d into master Aug 26, 2022
@awaelchli awaelchli deleted the feature/oss-state-outside-ddp branch August 26, 2022 07:58
Labels
  • feature (Is an improvement or enhancement)
  • pl (Generic label for PyTorch Lightning package)
  • ready (PRs ready to be merged)
  • refactor
  • strategy: fairscale sharded (removed) (Sharded Data Parallel)
Development

Successfully merging this pull request may close the following issue:

Support for sharded optimizers when dumping checkpoints outside of the DDP sharded training type plugin
5 participants