[FIX] Native FSDP precision + tests #12985
Conversation
These tests were seriously broken 😅 There is an error in the test and integration currently. Going to push some changes to support the subprocess launcher + make the tests fail.
After some thought, I think I've reached a consensus: let's wait for 1.12 to be released and revisit this PR when tests start failing (if we GPU-test latest). Given past experience with BFloat16, I think it's best to hold off trying to stabilize the API till the major release is out. If people end up trying to use FSDP native, we can point them to this PR for a fixed version.
Most of the code is gated by the PYTORCH_1_12 flag, so should we just land this before PyTorch 1.12 is released? cc @SeanNaren
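For reference, such a gate is usually a module-level constant checked before touching any 1.12-only API. A minimal sketch, assuming `packaging` is available; the flag name here is illustrative and may not match the exact constant in the codebase:

```python
# Hypothetical version gate; the real flag name in the repo may differ.
import torch
from packaging.version import Version

# True only when the installed torch ships the native FSDP APIs this PR relies on.
_TORCH_GREATER_EQUAL_1_12 = Version(torch.__version__.split("+")[0]) >= Version("1.12.0")

if _TORCH_GREATER_EQUAL_1_12:
    # Import lazily so older torch installs do not fail at import time.
    from torch.distributed.fsdp import FullyShardedDataParallel, MixedPrecision  # noqa: F401
```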
@SeanNaren is there any plan to revisit this PR now that 1.12 has landed? It would be great to have the subprocess launcher and the additional features (mixed precision) for FSDP.
@rohan-varma Absolutely! This is a high priority for the PL 1.7 release.
As discussed with @carmocca offline, we should get this PR in even if we currently do not have 1.12 tests running on the GPU. I have manually confirmed all tests pass using 1.12.
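Since the GPU CI is not on 1.12 yet, those tests can only run where both the torch version and a CUDA device are available. A hedged sketch of such a guard using plain pytest markers (the project's own `RunIf` helper plays this role in the real test suite):

```python
# Illustrative pytest guards only; not the project's actual test code.
import pytest
import torch
from packaging.version import Version

_HAS_TORCH_1_12 = Version(torch.__version__.split("+")[0]) >= Version("1.12.0")


@pytest.mark.skipif(not _HAS_TORCH_1_12, reason="native FSDP mixed precision needs torch>=1.12")
@pytest.mark.skipif(not torch.cuda.is_available(), reason="needs a CUDA device")
def test_fsdp_native_mixed_precision_smoke():
    # Placeholder body: a real test would wrap a small module with FSDP + MixedPrecision
    # and run a forward/backward step to confirm dtypes are cast as expected.
    assert torch.cuda.is_available()
```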
rohitgr7 left a comment:
nit
src/pytorch_lightning/plugins/precision/fully_sharded_native_amp.py (outdated, resolved)
What does this PR do?
Related: #12334, #12447.
After the associated PR above was merged and I tried to use native FSDP, I noticed several things were wrong.
This PR addresses points 2 and 3; point 1 remains a separate issue. I've run each test individually to ensure the integration works (and updated it a bit), and introduced the mixed-precision support that was missing. I've also moved the requirement to 1.12dev, which will work for PyTorch nightly.
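As a rough illustration of the mixed-precision piece (a sketch under the assumption of torch>=1.12, not the PR's exact diff): native FSDP accepts a `MixedPrecision` config describing the dtypes used for parameters, gradient reduction, and buffers, which a precision plugin can derive from the user-facing precision flag.

```python
# Sketch only: map a user-facing precision flag to the MixedPrecision settings
# understood by torch.distributed.fsdp (torch>=1.12).
import torch
from torch.distributed.fsdp import MixedPrecision


def fsdp_mixed_precision_config(precision: str) -> MixedPrecision:
    if precision == "16":
        dtype = torch.float16
    elif precision == "bf16":
        dtype = torch.bfloat16
    else:
        raise ValueError(f"Unsupported precision for native FSDP: {precision!r}")
    # Cast parameters, gradient reduction, and buffers to the same reduced dtype.
    return MixedPrecision(param_dtype=dtype, reduce_dtype=dtype, buffer_dtype=dtype)
```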
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃
cc @Borda @tchaton @rohitgr7 @otaj