
[Model Parallel] Add configure sharded model hook #6679

Merged: 30 commits merged into master from feat/model_parallel_hook on Mar 29, 2021

Conversation

@kaushikb11 (Contributor) commented on Mar 25, 2021

What does this PR do?

Adds a configure sharded model hook. This is required for both DeepSpeed and Fully Sharded training: both provide a context manager within which model layers can be wrapped so that they are sharded instantly.

Why does it need to be a hook?

It needs to be a hook because model layers can only be sharded instantly after torch distributed is set up. The Trainer ensures that the distributed environment is set up and then calls the hook.

We also wrap the hook with the model parallel context for DeepSpeed/Fully Sharded so the user doesn't need to do anything extra.

What about fit/test/predict?

For now it will be treated like the setup hook: we assume we're using fit/test/predict and can call the hook for the user. We can iterate on this in follow-up PRs.
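
For illustration only (not code from this PR; the class name and layer sizes below are made up), a LightningModule could defer layer creation to the new hook so that its layers are built after torch.distributed is initialized and inside the plugin's sharded context:

import torch
import pytorch_lightning as pl

class MyShardedModule(pl.LightningModule):
    def configure_sharded_model(self) -> None:
        # Runs after torch.distributed is set up, inside the accelerator's
        # model-sharded context, so layers created here can be sharded as
        # soon as they are instantiated.
        self.block = torch.nn.Sequential(
            torch.nn.Linear(32, 32),
            torch.nn.ReLU(),
            torch.nn.Linear(32, 2),
        )

    def forward(self, x):
        return self.block(x)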

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks commented on Mar 25, 2021

Hello @kaushikb11! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-29 19:15:47 UTC

@mergify mergify bot removed the has conflicts label Mar 25, 2021
@kaushikb11 kaushikb11 added the feature Is an improvement or enhancement label Mar 25, 2021
@kaushikb11 kaushikb11 added this to the 1.3 milestone Mar 25, 2021
@codecov bot commented on Mar 25, 2021

Codecov Report

Merging #6679 (aa35583) into master (646cf2f) will decrease coverage by 4%.
The diff coverage is 98%.

@@           Coverage Diff           @@
##           master   #6679    +/-   ##
=======================================
- Coverage      91%     87%    -4%     
=======================================
  Files         192     192            
  Lines       12189   12231    +42     
=======================================
- Hits        11143   10659   -484     
- Misses       1046    1572   +526     


class TestModel(BoringModel):

    def on_model_parallel_setup(self):
@tchaton (Contributor) Mar 25, 2021:

Add a check that the context manager is actually being yielded.

@kaushikb11 (Author) replied:

Thanks, it made me see an issue in the implementation. Fixing!

Contributor:

Might want to check the hook is not called within the callbacks.

@kaushikb11 (Author) replied:

I think we can assume users would be aware of the implications.

Comment on lines 32 to 34
def on_model_parallel_setup(self, trainer, pl_module: LightningModule) -> None:
    """Called before model parallel accelerator setup"""

Contributor:

But those two snippets you shared don't use on_model_parallel_setup; they use model_parallel_context.

I think the question is more along the lines of "when would you override on_model_parallel_setup?"

SeanNaren and others added 2 commits March 29, 2021 12:36
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
@SeanNaren SeanNaren changed the title Add model parallel setup hook [Model Parallel] Add model parallel setup hook Mar 29, 2021
@SeanNaren SeanNaren changed the title [Model Parallel] Add model parallel setup hook [Model Parallel] Add configure sharded model hook Mar 29, 2021
@tchaton (Contributor) left a review comment:

LGTM !

@SeanNaren SeanNaren self-assigned this Mar 29, 2021
SeanNaren added 2 commits March 29, 2021 20:05
# Conflicts:
#	pytorch_lightning/accelerators/accelerator.py
#	pytorch_lightning/plugins/training_type/training_type_plugin.py
@SeanNaren SeanNaren enabled auto-merge (squash) March 29, 2021 19:06
@mergify mergify bot removed the has conflicts label Mar 29, 2021
@SeanNaren SeanNaren merged commit f79a13e into master Mar 29, 2021
@SeanNaren SeanNaren deleted the feat/model_parallel_hook branch March 29, 2021 20:50
Comment on lines +1086 to +1094
def call_configure_sharded_model(self, model: LightningModule) -> None:
    # Call the configure_sharded_model hook if the accelerator requests it. In some
    # cases we will not call the hook; for example, when the hook has already
    # initialized the sharded model.
    if self.accelerator.call_configure_sharded_model_hook:
        with self.accelerator.model_sharded_context():
            model.configure_sharded_model()
            self.configure_sharded_model(model)
        self.accelerator.call_configure_sharded_model_hook = False
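
For context, model_sharded_context() is expected to be a context manager supplied by the accelerator's training-type plugin. A minimal, purely illustrative stand-in (the class name and flag are invented; real plugins such as DeepSpeed or Fully Sharded enter their own wrapping contexts here) might look like:

from contextlib import contextmanager
from typing import Generator

class IllustrativeShardingPlugin:
    """Hypothetical plugin, used only to show the shape of the context manager."""

    @contextmanager
    def model_sharded_context(self) -> Generator[None, None, None]:
        # A real plugin would enable a scope here in which newly created
        # nn.Modules are wrapped/sharded immediately.
        self._shard_on_creation = True
        try:
            yield
        finally:
            self._shard_on_creation = False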

Contributor:

@SeanNaren @kaushikb11 why is this needed in the trainer? Could this be part of the accelerator setup, in L440 instead?

Contributor:

Which part exactly? The trainer has control over the hook; we do not want the hook to be called within the accelerator, I think.

@ananthsub (Contributor) Mar 31, 2021:

I see, on first read I thought that we could do this:

self.call_setup_hook(model)
# inside accelerator.setup(): apply the model sharded context and call model.configure_sharded_model()
self.accelerator.setup(self, model)

# now that the accelerator has finished the model sharding setup, call the callbacks
self.configure_sharded_model(model)

The upside is that more of the training-type + accelerator logic lives inside the accelerator internals, but the downside is that it splits the callback hook from the sharded model setup, which can be less convenient.

Contributor:

Agreed, as long as we're OK with calling the model hook within the accelerator (cc @justusschock @awaelchli).

facebook-github-bot pushed a commit to facebookresearch/d2go that referenced this pull request Apr 14, 2021
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:
### New commit log messages
## [UnReleased] - 2021-MM-DD

### Added

- Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667))

- Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492))

- Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417))

- Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)).

- Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470))

- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072))

- Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948))

- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915))

- Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673))

- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277))

- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274))

- Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370))

- Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633))

- Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139))

- Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543))

- Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621))

- Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618))

- Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120))

- Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679))

- Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595))

- Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736))

- Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677))

- Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259))

- Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621))

- Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349))

### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621))

- Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505),
    [#6530](Lightning-AI/pytorch-lightning#6530),
    [#6540](Lightning-AI/pytorch-lightning#6540),
    [#6547](Lightning-AI/pytorch-lightning#6547),
    [#6515](Lightning-AI/pytorch-lightning#6515),
    [#6572](Lightning-AI/pytorch-lightning#6572),
    [#6573](Lightning-AI/pytorch-lightning#6573),
    [#6584](Lightning-AI/pytorch-lightning#6584),
    [#6636](Lightning-AI/pytorch-lightning#6636),
    [#6637](Lightning-AI/pytorch-lightning#6637),
    [#6649](Lightning-AI/pytorch-lightning#6649),
    [#6659](Lightning-AI/pytorch-lightning#6659),
)

### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164))

- Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139))

- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166))

- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163))

- Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161))
    * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
    * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`

- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162))

- Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167))

- Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016))

- Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207))

- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734))

- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093))

### Fixed

- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802))

- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011))

- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070))

- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109))

- Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436))

- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416))

- Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506))

- Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705))

## [1.2.7] - 2021-04-06

### Fixed

- Fixed a bug with omegaconf and xm.save ([#6741](Lightning-AI/pytorch-lightning#6741))
- Fixed an issue with `IterableDataset` when `__len__` is not defined ([#6828](Lightning-AI/pytorch-lightning#6828))
- Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836))
- Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588))
- Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816))
- Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730))

## [1.2.6] - 2021-03-30

### Changed

- Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498))

### Removed

- Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682))

### Fixed

- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398))
- Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657))
- Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719))

## [1.2.5] - 2021-03-23

### Changed

- Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576))
- Refactored setup to be typing-friendly ([#6590](Lightning-AI/pytorch-lightning#6590))

### Fixed

- Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587))
- Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434))
- Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275))
- Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565))

Reviewed By: shuyingsunshine21

Differential Revision: D27528929

fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f
Labels: feature (Is an improvement or enhancement)
Projects: none yet
Linked issues: none yet
9 participants