
Failed Manual Backward during DeepSpeed training. #7957

Closed
Zasder3 opened this issue Jun 12, 2021 · 5 comments · Fixed by #7970
Labels
bug Something isn't working · help wanted Open to be worked on · priority: 1 Medium priority task

Comments

Zasder3 commented Jun 12, 2021

🐛 Bug

When attempting to use manual optimization in Lightning, the backward pass fails as it causes a reference to an undefined attribute.

On line 293 of pytorch-lightning/plugins/training_type/ddp.py the line:

if not self.lightning_module.automatic_optimization and self.model.require_backward_grad_sync:

Throws the error torch.nn.modules.module.ModuleAttributeError: 'DeepSpeedEngine' object has no attribute 'require_backward_grad_sync'.

I attempted to solve it by tweaking the boilerplate, but that only led to more errors of the same kind, each assuming the existence of this attribute.
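
For illustration, one way to sidestep the missing attribute would be to guard the check so it only applies when the wrapper actually exposes it (a hypothetical tweak, not necessarily the fix that eventually landed in #7970):

```python
# Hypothetical guard for the check in ddp.py (illustration only).
# DistributedDataParallel exposes `require_backward_grad_sync`, but the
# DeepSpeedEngine wrapper does not, so fall back gracefully when it is absent.
if not self.lightning_module.automatic_optimization and getattr(
    self.model, "require_backward_grad_sync", False
):
    ...  # existing DDP-specific preparation for the backward pass
```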

Please reproduce using the BoringModel

The BoringModel didn't fully accommodate the command-line arguments I needed, so I put together the following notebook instead.

To Reproduce

The colab notebook in question can be found here.

Expected behavior

The backward pass should run smoothly, as it does when the DeepSpeed plugin is not used.
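
For reference, the pattern expected to work is roughly the standard manual-optimization recipe run under the DeepSpeed plugin. Below is a simplified sketch (placeholder model, data, and Trainer flags, not the exact notebook code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt into manual optimization
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.layer(batch[0]).sum()
        opt.zero_grad()
        self.manual_backward(loss)  # raises the ModuleAttributeError under DeepSpeed in 1.3.5
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    trainer = pl.Trainer(gpus=1, plugins="deepspeed", precision=16, max_epochs=1)
    trainer.fit(ManualOptModel(), data)
```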

Environment

* CUDA:
	- GPU:
		- Tesla P100-PCIE-16GB
	- available:         True
	- version:           10.2
* Packages:
	- numpy:             1.19.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.7.1
	- pytorch-lightning: 1.3.5
	- tqdm:              4.41.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- 
	- processor:         x86_64
	- python:            3.7.10
	- version:           #1 SMP Sat Jun 5 09:50:34 PDT 2021

Additional context

This error was first noticed on an 8x 2080 Ti setup under near-identical conditions. It should just be a matter of tweaking the backward pass. In the meantime, it would be nice to have an alternative to self.manual_backward that I could use.

@Zasder3 Zasder3 added bug Something isn't working help wanted Open to be worked on labels Jun 12, 2021
@carmocca carmocca added the priority: 1 Medium priority task label Jun 13, 2021

tchaton commented Jun 14, 2021

Dear @Zasder3,

Thanks for reporting this bug.
Currently, manual optimization is only supported with DDP.
We apologise for the inconvenience.

Best,
T.C

@SeanNaren

Thanks @Zasder3

DeepSpeed requires control over more of the training steps, which makes manual optimization a bit tricky to support. In a future release we might be able to support it, but it would require the user to access the DeepSpeed engine to run backward and step, and to use the wrapped model for the forward pass; that would require changes specific to DeepSpeed.

Apologies, the message should've been clearer! I've added a clearer message in #7234.
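
For illustration, the engine-driven approach described above would amount to replacing the training_step of a manual-optimization module (such as the one sketched under "Expected behavior") with something roughly like this; the attribute path to the engine is an assumption and this is not a supported Lightning API:

```python
def training_step(self, batch, batch_idx):
    # Reach into the training-type plugin for the DeepSpeedEngine wrapper
    # (assumed attribute path; may differ between Lightning versions).
    engine = self.trainer.training_type_plugin.model
    loss = self.layer(batch[0]).sum()
    engine.backward(loss)  # DeepSpeed handles loss scaling and the backward pass
    engine.step()          # optimizer step (and zero_grad) handled by the engine
```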

@SeanNaren

Thanks to @tchaton, who managed to get this working :)

Just note that only one optimizer is supported, and manual optimization with DeepSpeed is still largely untested. We do have a test for a basic manual optimization example, which you can see here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/plugins/test_deepspeed_plugin.py#L36

@kushalj001

Hi @SeanNaren @tchaton
I just wanted to check up on this thread about manual optimization with DeepSpeed. Is it supported right now? My training loop is a bit complex (it involves RL), so I cannot use automatic optimization. I had been using automatic optimization with DeepSpeed earlier and it worked very well for me. I'd like to know whether I should continue using Lightning in my case or switch to native torch + DeepSpeed (or any other recommendations).
Thanks!
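
For context, the native torch + DeepSpeed alternative mentioned above boils down to a loop like the following. This is a rough sketch with a placeholder model, random data, and a minimal config; it assumes launch via the deepspeed launcher and a DeepSpeed version that accepts the config as a dict:

```python
import torch
import deepspeed

# Placeholder model and minimal config; adapt to the real training setup.
model = torch.nn.Linear(32, 2)
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    batch = torch.randn(8, 32).to(engine.device)
    loss = engine(batch).sum()  # forward through the engine
    engine.backward(loss)       # DeepSpeed scales and runs the backward pass
    engine.step()               # optimizer step (and zero_grad) handled by the engine
```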

@SophieOstmeier

I have the same issue. Considering switching too.
