[FIX] Native FSDP precision + tests #12985
Conversation
These tests were seriously broken 😅 There is an error in the test and integration currently. Going to push some changes to support the subprocess launcher + make the tests fail.
After some thought, I think I've reached a consensus: let's wait for 1.12 to be released and revisit this PR when tests start failing (if we GPU-test latest). Given past experience with BFloat16, I think it's best to hold off trying to stabilize the API till the major release is out. If people end up trying to use FSDP native, we can point them to this PR for a fixed version.
Most of the code is gated by the PYTORCH_1_12 flag, so should we just land this before PyTorch 1.12 is released? cc @SeanNaren
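For reference, such a gate is usually a module-level constant checked before touching any 1.12-only API. A minimal sketch, assuming `packaging` is available; the flag name here is illustrative and may not match the exact constant in the codebase:

```python
# Hypothetical version gate; the real flag name in the repo may differ.
import torch
from packaging.version import Version

# True only when the installed torch ships the native FSDP APIs this PR relies on.
_TORCH_GREATER_EQUAL_1_12 = Version(torch.__version__.split("+")[0]) >= Version("1.12.0")

if _TORCH_GREATER_EQUAL_1_12:
    # Import lazily so older torch installs do not fail at import time.
    from torch.distributed.fsdp import FullyShardedDataParallel, MixedPrecision  # noqa: F401
```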
@SeanNaren is there any plan to revisit this PR now that 1.12 has landed? It would be great to have the subprocess launcher and the additional features (mixed precision) for FSDP.
@rohan-varma Absolutely! This is a high priority for the PL 1.7 release.
As discussed with @carmocca offline, we should get this PR in even if we currently do not have 1.12 tests running on the GPU. I have manually confirmed all tests pass using 1.12.
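Since the GPU CI is not on 1.12 yet, those tests can only run where both the torch version and a CUDA device are available. A hedged sketch of such a guard using plain pytest markers (the project's own `RunIf` helper plays this role in the real test suite):

```python
# Illustrative pytest guards only; not the project's actual test code.
import pytest
import torch
from packaging.version import Version

_HAS_TORCH_1_12 = Version(torch.__version__.split("+")[0]) >= Version("1.12.0")


@pytest.mark.skipif(not _HAS_TORCH_1_12, reason="native FSDP mixed precision needs torch>=1.12")
@pytest.mark.skipif(not torch.cuda.is_available(), reason="needs a CUDA device")
def test_fsdp_native_mixed_precision_smoke():
    # Placeholder body: a real test would wrap a small module with FSDP + MixedPrecision
    # and run a forward/backward step to confirm dtypes are cast as expected.
    assert torch.cuda.is_available()
```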
rohitgr7 left a comment:
nit
src/pytorch_lightning/plugins/precision/fully_sharded_native_amp.py (outdated, resolved)
What does this PR do?
Related: #12334, #12447.
After the associated PR above was merged and I tried to use native FSDP, I noticed several things were wrong.
This PR addresses points 2 and 3; point 1 remains a separate issue. I've run each test individually to ensure the integration works (and updated it a bit), and introduced the mixed-precision support that was missing. I've also moved the requirement to 1.12dev, which will work for PyTorch nightly.
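As a rough illustration of the mixed-precision piece (a sketch under the assumption of torch>=1.12, not the PR's exact diff): native FSDP accepts a `MixedPrecision` config describing the dtypes used for parameters, gradient reduction, and buffers, which a precision plugin can derive from the user-facing precision flag.

```python
# Sketch only: map a user-facing precision flag to the MixedPrecision settings
# understood by torch.distributed.fsdp (torch>=1.12).
import torch
from torch.distributed.fsdp import MixedPrecision


def fsdp_mixed_precision_config(precision: str) -> MixedPrecision:
    if precision == "16":
        dtype = torch.float16
    elif precision == "bf16":
        dtype = torch.bfloat16
    else:
        raise ValueError(f"Unsupported precision for native FSDP: {precision!r}")
    # Cast parameters, gradient reduction, and buffers to the same reduced dtype.
    return MixedPrecision(param_dtype=dtype, reduce_dtype=dtype, buffer_dtype=dtype)
```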
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃
cc @Borda @tchaton @rohitgr7 @otaj