Simplify enabling CPU offload in FSDP #15832

awaelchli · 2022-11-26T10:33:57Z

What does this PR do?

A UX improvement: CPU offloading in FSDP can now be enabled with a plain boolean, without importing the CPUOffload config class from the FSDP library.

Before:

from torch.distributed.fsdp.fully_sharded_data_parallel import CPUOffload

strategy = FSDPStrategy(cpu_offload=CPUOffload(offload_params=True))

After:

strategy = FSDPStrategy(cpu_offload=True)
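
The snippet below is a minimal sketch of how a boolean flag could be mapped onto the FSDP config object, so that both the old and the new call style keep working. It is an illustration under assumptions, not the code merged in this PR: the `CPUOffload` dataclass is a stand-in for the PyTorch class of the same name (so the example runs without PyTorch installed), and `normalize_cpu_offload` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class CPUOffload:
    """Stand-in for the PyTorch FSDP config class (assumption: a single bool field)."""

    offload_params: bool = False


def normalize_cpu_offload(cpu_offload: Optional[Union[bool, CPUOffload]]) -> CPUOffload:
    """Hypothetical helper: accept a bool, None, or a ready-made config object."""
    if isinstance(cpu_offload, CPUOffload):
        # Old call style: the user already built the config object, pass it through.
        return cpu_offload
    # New call style: True/False (or None) is wrapped in a config object.
    return CPUOffload(offload_params=bool(cpu_offload))


# Both call styles resolve to an equivalent config.
old_style = normalize_cpu_offload(CPUOffload(offload_params=True))
new_style = normalize_cpu_offload(True)
assert old_style == new_style
```

Accepting both types keeps existing user code that passes a config object working while allowing the shorter boolean form.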

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, check the following list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

cc @Borda @justusschock @awaelchli @carmocca

github-actions · 2022-11-26T10:38:35Z

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow

Check ID	Status
pl-cpu (macOS-11, pytorch, 3.8, 1.11)	success	✅
pl-cpu (macOS-11, pytorch, 3.9, 1.12)	success	✅
pl-cpu (macOS-11, pytorch, 3.10, 1.13)	success	✅
pl-cpu (macOS-11, pytorch, 3.8, 1.10, oldest)	success	✅
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.10)	success	✅
pl-cpu (ubuntu-20.04, pytorch, 3.9, 1.11)	success	✅
pl-cpu (ubuntu-20.04, pytorch, 3.10, 1.12)	success	✅
pl-cpu (ubuntu-20.04, pytorch, 3.10, 1.13)	success	✅
pl-cpu (ubuntu-20.04, pytorch, 3.7, 1.10, oldest)	success	✅
pl-cpu (windows-2022, pytorch, 3.9, 1.11)	success	✅
pl-cpu (windows-2022, pytorch, 3.10, 1.12)	success	✅
pl-cpu (windows-2022, pytorch, 3.10, 1.13)	success	✅
pl-cpu (windows-2022, pytorch, 3.7, 1.10, oldest)	success	✅
pl-cpu (slow, macOS-11, pytorch, 3.7, 1.11)	success	✅
pl-cpu (slow, ubuntu-20.04, pytorch, 3.7, 1.11)	success	✅
pl-cpu (slow, windows-2022, pytorch, 3.7, 1.11)	success	✅
pl-cpu (macOS-11, lightning, 3.8, 1.13)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.13)	success	✅
pl-cpu (windows-2022, lightning, 3.8, 1.13)	success	✅

These checks are required after the changes to src/lightning_lite/strategies/fsdp.py, src/pytorch_lightning/strategies/fully_sharded_native.py, tests/tests_pytorch/strategies/test_ddp_fully_sharded_native.py.

🟢 pytorch_lightning: Azure GPU

Check ID	Status
pytorch-lightning (GPUs)	success	✅

These checks are required after the changes to src/pytorch_lightning/strategies/fully_sharded_native.py, tests/tests_pytorch/strategies/test_ddp_fully_sharded_native.py, src/lightning_lite/strategies/fsdp.py.

🟢 pytorch_lightning: Azure HPU

Check ID	Status
pytorch-lightning (HPUs)	success	✅

These checks are required after the changes to src/lightning_lite/strategies/fsdp.py, src/pytorch_lightning/strategies/fully_sharded_native.py, tests/tests_pytorch/strategies/test_ddp_fully_sharded_native.py.

🟢 pytorch_lightning: Azure IPU

Check ID	Status
pytorch-lightning (IPUs)	success	✅

These checks are required after the changes to src/lightning_lite/strategies/fsdp.py, src/pytorch_lightning/strategies/fully_sharded_native.py, tests/tests_pytorch/strategies/test_ddp_fully_sharded_native.py.

🟢 pytorch_lightning: Docs

Check ID	Status
make-doctest (pytorch)	success	✅
make-html (pytorch)	success	✅

These checks are required after the changes to src/pytorch_lightning/strategies/fully_sharded_native.py, docs/source-pytorch/advanced/model_parallel.rst.

🟢 lightning_lite: CPU workflow

Check ID	Status
lite-cpu (macOS-11, lite, 3.8, 1.11)	success	✅
lite-cpu (macOS-11, lite, 3.9, 1.12)	success	✅
lite-cpu (macOS-11, lite, 3.10, 1.13)	success	✅
lite-cpu (macOS-11, lite, 3.7, 1.10, oldest)	success	✅
lite-cpu (ubuntu-20.04, lite, 3.8, 1.10)	success	✅
lite-cpu (ubuntu-20.04, lite, 3.9, 1.11)	success	✅
lite-cpu (ubuntu-20.04, lite, 3.10, 1.12)	success	✅
lite-cpu (ubuntu-20.04, lite, 3.10, 1.13)	success	✅
lite-cpu (ubuntu-20.04, lite, 3.7, 1.10, oldest)	success	✅
lite-cpu (windows-2022, lite, 3.9, 1.11)	success	✅
lite-cpu (windows-2022, lite, 3.10, 1.12)	success	✅
lite-cpu (windows-2022, lite, 3.10, 1.13)	success	✅
lite-cpu (windows-2022, lite, 3.7, 1.10, oldest)	success	✅
lite-cpu (macOS-11, lightning, 3.8, 1.13)	success	✅
lite-cpu (ubuntu-20.04, lightning, 3.8, 1.13)	success	✅
lite-cpu (windows-2022, lightning, 3.8, 1.13)	success	✅

These checks are required after the changes to src/lightning_lite/strategies/fsdp.py, tests/tests_lite/strategies/test_fsdp.py.

🟢 lightning_lite: Azure GPU

Check ID	Status
lightning-lite (GPUs)	success	✅

These checks are required after the changes to src/lightning_lite/strategies/fsdp.py, tests/tests_lite/strategies/test_fsdp.py.

🟢 mypy

Check ID	Status
mypy	success	✅

These checks are required after the changes to src/lightning_lite/strategies/fsdp.py, src/pytorch_lightning/strategies/fully_sharded_native.py.

🟢 install

Check ID	Status
install-pkg (ubuntu-22.04, app, 3.7)	success	✅
install-pkg (ubuntu-22.04, app, 3.10)	success	✅
install-pkg (ubuntu-22.04, lite, 3.7)	success	✅
install-pkg (ubuntu-22.04, lite, 3.10)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.7)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.10)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.7)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.10)	success	✅
install-pkg (macOS-12, app, 3.7)	success	✅
install-pkg (macOS-12, app, 3.10)	success	✅
install-pkg (macOS-12, lite, 3.7)	success	✅
install-pkg (macOS-12, lite, 3.10)	success	✅
install-pkg (macOS-12, pytorch, 3.7)	success	✅
install-pkg (macOS-12, pytorch, 3.10)	success	✅
install-pkg (macOS-12, lightning, 3.7)	success	✅
install-pkg (macOS-12, lightning, 3.10)	success	✅
install-pkg (windows-2022, app, 3.7)	success	✅
install-pkg (windows-2022, app, 3.10)	success	✅
install-pkg (windows-2022, lite, 3.7)	success	✅
install-pkg (windows-2022, lite, 3.10)	success	✅
install-pkg (windows-2022, pytorch, 3.7)	success	✅
install-pkg (windows-2022, pytorch, 3.10)	success	✅
install-pkg (windows-2022, lightning, 3.7)	success	✅
install-pkg (windows-2022, lightning, 3.10)	success	✅

These checks are required after the changes to src/lightning_lite/strategies/fsdp.py, src/pytorch_lightning/strategies/fully_sharded_native.py.

Thank you for your contribution! 💜

Note
This comment is automatically generated and is refreshed every 180 seconds for 60 minutes. If you have any other questions, contact carmocca for help.

awaelchli added 2 commits November 26, 2022 11:31

simplify enabling cpu offload in fsdp strategy

0faa278

update docs

e92a933

github-actions bot added the pl label (Generic label for PyTorch Lightning package) Nov 26, 2022

pre-commit-ci bot and others added 2 commits November 26, 2022 10:35

[pre-commit.ci] auto fixes from pre-commit.com hooks

88f142e

for more information, see https://pre-commit.ci

changelog

8a5c3bf

awaelchli added the refactor, fabric (lightning.fabric.Fabric), strategy: fsdp (Fully Sharded Data Parallel), and fun (Staff contributions outside working hours - to differentiate from the "community" label) labels Nov 26, 2022

awaelchli added this to the v1.9 milestone Nov 26, 2022

awaelchli self-assigned this Nov 26, 2022

awaelchli marked this pull request as ready for review November 26, 2022 10:38

awaelchli requested review from carmocca, justusschock, williamFalcon, edenlightning, tchaton and Borda as code owners November 26, 2022 10:38

awaelchli added 3 commits November 26, 2022 11:39

precommit

0a83951

Merge remote-tracking branch 'origin/feature/fsdp-simplify-offload-api' into feature/fsdp-simplify-offload-api

2cfa5b8

fix mypy issue

4e2d069

Borda approved these changes Nov 28, 2022

src/lightning_lite/strategies/fsdp.py (outdated, resolved)

src/pytorch_lightning/strategies/fully_sharded_native.py (outdated, resolved)

awaelchli and others added 2 commits December 1, 2022 18:53

Update src/lightning_lite/strategies/fsdp.py

78dc79b

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>

Update src/pytorch_lightning/strategies/fully_sharded_native.py

fb50fa2

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>

justusschock approved these changes Dec 6, 2022

mergify bot added the ready label (PRs ready to be merged) Dec 6, 2022

carmocca reviewed Dec 6, 2022

src/lightning_lite/strategies/fsdp.py (resolved)

mergify bot added the has conflicts label and removed the ready label (PRs ready to be merged) Dec 6, 2022

Merge branch 'master' into feature/fsdp-simplify-offload-api

feda5e5

awaelchli enabled auto-merge (squash) December 7, 2022 02:17

mergify bot added the ready label (PRs ready to be merged) and removed the has conflicts and ready labels Dec 7, 2022

[pre-commit.ci] auto fixes from pre-commit.com hooks

58786cd

for more information, see https://pre-commit.ci

awaelchli merged commit 2debd1c into master Dec 7, 2022

awaelchli deleted the feature/fsdp-simplify-offload-api branch December 7, 2022 02:55
