
Support for sharded optimizers when dumping checkpoints outside of the DDP sharded training type plugin #6387

Closed
ananthsub opened this issue Mar 7, 2021 · 4 comments · Fixed by facebookresearch/fairscale#500 or #14208
Labels: checkpointing (Related to checkpointing) · priority: 1 (Medium priority task)

ananthsub (Contributor) commented Mar 7, 2021

🐛 Bug

Using fairscale distributed optimizers without DDP sharded leads to crashes or an inconsistent state in the trainer when checkpointing. This will also occur with PyTorch's latest prototype ZeroRedundancyOptimizer. @SeanNaren

Imagine this scenario:

  • Someone wraps their optimizer with fairscale OSS inside their LightningModule (see the sketch after this list), but does not use the DDP sharded plugin.
  • At checkpoint time, when the trainer dumps the checkpoint dict, it looks up the optimizer state.
  • The optimizer state goes through the training type plugin.
  • The training type plugin calls optimizer.state_dict(). For fairscale/PyTorch distributed optimizers, we first need to consolidate the state dict onto one rank, and afterwards read the state dict only from that rank.
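
For reference, here is a minimal sketch of that setup (the model and hyperparameters are hypothetical), assuming fairscale is installed and the distributed process group is already initialized when configure_optimizers runs:

```python
import torch
import pytorch_lightning as pl
from fairscale.optim.oss import OSS


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        # Wrap a regular optimizer class in fairscale's OSS so that the optimizer
        # state is sharded across ranks. The trainer is unaware of the wrapping
        # because no sharded training type plugin is selected.
        return OSS(params=self.parameters(), optim=torch.optim.Adam, lr=1e-3)
```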

One could add a callback that implements on_save_checkpoint to call consolidate_state_dict() on the optimizer across all ranks. However, the trainer then calls state_dict() on all ranks, leading to the exception here: https://github.com/facebookresearch/fairscale/blob/1204c7cf54ec301d46a0d3f3fd703da6b306f8f5/fairscale/optim/oss.py#L354-L358
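
A sketch of what such a callback could look like (the class name is made up, and the hook signature assumed here is the standard Lightning Callback.on_save_checkpoint):

```python
import pytorch_lightning as pl
from fairscale.optim.oss import OSS


class ConsolidateOSSStateDict(pl.Callback):
    """Hypothetical workaround: gather the sharded optimizer state before checkpointing."""

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        for optimizer in trainer.optimizers:
            if isinstance(optimizer, OSS):
                # Collective call: every rank must participate so the shards
                # can be gathered onto rank 0.
                optimizer.consolidate_state_dict(recipient_rank=0)
        # The trainer afterwards still calls optimizer.state_dict() on every rank,
        # which is what raises on non-zero ranks (only rank 0 holds the full state).
```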

This error occurs only on non-zero ranks. As a result, the checkpointing failure is compounded by the exception-handling logic for training and its interaction with checkpointing, as discussed here: #6343 (comment)

Proposal to fix:

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @tchaton @akihironitta @blefaudeux

@ananthsub ananthsub added bug Something isn't working help wanted Open to be worked on labels Mar 7, 2021
blefaudeux commented Mar 7, 2021

This is new to me, but on fairscale's or PyTorch's side it is easy to make the checkpointing compatible with calls from all ranks. It was not the default because some frameworks (classy and vissl at least) only call state_dict from a single rank, and until now I thought Lightning was doing the same. If it's useful, both behaviors can be supported, through a flag for instance.

SeanNaren (Contributor) commented

Thanks for the issue @ananthsub!

Just to make sure I follow: the issue is that if a user wants to use a sharded optimizer outside of the plugin, we do not support it.

I'll need to think this over more. Currently, configure_optimizers is called before distributed communication is initialized (the FSDP plugin integration will add an option for the training type plugin to delay this until after), so wrapping your optimizers in OSS should lead to a crash unless you initialize distributed yourself.

Just to clear up: we call consolidate_state_dict on all processes, but only get the state dict from rank 0. Just to make sure I understand, @ananthsub, are you suggesting upstreaming the consolidation/return to Fairscale?

@edenlightning edenlightning added priority: 1 Medium priority task and removed bug Something isn't working labels Mar 8, 2021
ananthsub (Contributor, Author) commented

> Just to clear up: we call consolidate_state_dict on all processes, but only get the state dict from rank 0. Just to make sure I understand, @ananthsub, are you suggesting upstreaming the consolidation/return to Fairscale?

In this case, could we upstream the optimizer_state from the sharded plugin into the base training type plugin?
https://github.com/PyTorchLightning/pytorch-lightning/blob/523c59bfddca48d003ce20168e727e6683f3efd4/pytorch_lightning/plugins/training_type/sharded.py#L56-L60

If fairscale is available and the optimizer is of type OSS, then we call consolidate_state_dict on all ranks and return the optimizer state from rank 0; otherwise we return the optimizer state from each rank as today. A sketch of what that could look like is below.
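
A minimal sketch of the proposed behavior, written as a standalone function for illustration (the empty-dict return on non-zero ranks and the function placement are assumptions, not the actual Lightning API):

```python
from typing import Any, Dict

import torch
from torch.optim import Optimizer

try:
    from fairscale.optim import OSS
    _FAIRSCALE_AVAILABLE = True
except ImportError:
    _FAIRSCALE_AVAILABLE = False


def optimizer_state(optimizer: Optimizer) -> Dict[str, Any]:
    """Return the optimizer state dict, consolidating sharded optimizers first."""
    if _FAIRSCALE_AVAILABLE and isinstance(optimizer, OSS):
        # Collective call: every rank must participate so the shards can be
        # gathered onto the recipient rank (rank 0 by default).
        optimizer.consolidate_state_dict(recipient_rank=0)
        if torch.distributed.get_rank() != 0:
            # Only rank 0 holds the consolidated state; other ranks return an
            # empty dict instead of calling state_dict(), which would raise.
            return {}
    return optimizer.state_dict()
```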


stale bot commented Apr 10, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Apr 10, 2021
@stale stale bot closed this as completed Apr 18, 2021
@ananthsub ananthsub reopened this Feb 3, 2022
@stale stale bot removed the won't fix This will not be worked on label Feb 3, 2022
@ananthsub ananthsub added checkpointing Related to checkpointing and removed help wanted Open to be worked on labels Feb 16, 2022
@ananthsub ananthsub added this to the 1.6 milestone Feb 16, 2022
@carmocca carmocca modified the milestones: 1.6, 1.5.x Feb 16, 2022
@Borda Borda modified the milestones: 1.5.x, 1.6 Mar 21, 2022
@awaelchli awaelchli modified the milestones: 1.6, 1.7 Mar 21, 2022
@carmocca carmocca modified the milestones: pl:1.7, pl:future Jul 19, 2022
@carmocca carmocca modified the milestones: pl:future, pl:1.8 Aug 26, 2022