
Failed Manual Backward during DeepSpeed training. #7957

Closed
Zasder3 opened this issue Jun 12, 2021 · 5 comments · Fixed by #7970
Labels
bug Something isn't working · help wanted Open to be worked on · priority: 1 Medium priority task

Comments

Zasder3 commented Jun 12, 2021

🐛 Bug

When attempting to use manual optimization in Lightning, the backward pass fails as it causes a reference to an undefined attribute.

On line 293 of pytorch-lightning/plugins/training_type/ddp.py the line:

if not self.lightning_module.automatic_optimization and self.model.require_backward_grad_sync:

Throws the error torch.nn.modules.module.ModuleAttributeError: 'DeepSpeedEngine' object has no attribute 'require_backward_grad_sync'.

I attempted to solve it by tweaking the boilerplate, but that only led to more errors of the same kind, each assuming the existence of this attribute.
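
For illustration, one way to sidestep the missing attribute would be to guard the check so it only applies when the wrapper actually exposes it (a hypothetical tweak, not necessarily the fix that eventually landed in #7970):

```python
# Hypothetical guard for the check in ddp.py (illustration only).
# DistributedDataParallel exposes `require_backward_grad_sync`, but the
# DeepSpeedEngine wrapper does not, so fall back gracefully when it is absent.
if not self.lightning_module.automatic_optimization and getattr(
    self.model, "require_backward_grad_sync", False
):
    ...  # existing DDP-specific preparation for the backward pass
```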

Please reproduce using the BoringModel

The BoringModel didn't fully accommodate the command-line arguments I needed, so I put together the following notebook instead.

To Reproduce

The colab notebook in question can be found here.

Expected behavior

The backward pass should run smoothly, as it does when the DeepSpeed plugin is not used.
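
For reference, the pattern expected to work is roughly the standard manual-optimization recipe run under the DeepSpeed plugin. Below is a simplified sketch (placeholder model, data, and Trainer flags, not the exact notebook code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt into manual optimization
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.layer(batch[0]).sum()
        opt.zero_grad()
        self.manual_backward(loss)  # raises the ModuleAttributeError under DeepSpeed in 1.3.5
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    trainer = pl.Trainer(gpus=1, plugins="deepspeed", precision=16, max_epochs=1)
    trainer.fit(ManualOptModel(), data)
```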

Environment

* CUDA:
	- GPU:
		- Tesla P100-PCIE-16GB
	- available:         True
	- version:           10.2
* Packages:
	- numpy:             1.19.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.7.1
	- pytorch-lightning: 1.3.5
	- tqdm:              4.41.1
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- 
	- processor:         x86_64
	- python:            3.7.10
	- version:           #1 SMP Sat Jun 5 09:50:34 PDT 2021

Additional context

This error was first noticed on an 8x 2080 Ti setup under near-identical conditions. It should just be a matter of tweaking the backward pass. In the meantime, it would be nice to have an alternative to self.manual_backward that I could use.

@Zasder3 Zasder3 added bug Something isn't working help wanted Open to be worked on labels Jun 12, 2021
@carmocca carmocca added the priority: 1 Medium priority task label Jun 13, 2021

tchaton commented Jun 14, 2021

Dear @Zasder3,

Thanks for reporting this bug.
Currently, manual optimization is only supported with DDP.
We apologise for the inconvenience.

Best,
T.C

@SeanNaren

Thanks @Zasder3

DeepSpeed requires control over more of the training steps, which makes manual optimization a bit tricky to support. In a future release we might be able to support it, but it would require the user to access the DeepSpeed engine to run backward and step, and to use the wrapped model for the forward pass; that would require changes specific to DeepSpeed.

Apologies, the message should've been clearer! I've added a clearer message in #7234.
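
For illustration, the engine-driven approach described above would amount to replacing the training_step of a manual-optimization module (such as the one sketched under "Expected behavior") with something roughly like this; the attribute path to the engine is an assumption and this is not a supported Lightning API:

```python
def training_step(self, batch, batch_idx):
    # Reach into the training-type plugin for the DeepSpeedEngine wrapper
    # (assumed attribute path; may differ between Lightning versions).
    engine = self.trainer.training_type_plugin.model
    loss = self.layer(batch[0]).sum()
    engine.backward(loss)  # DeepSpeed handles loss scaling and the backward pass
    engine.step()          # optimizer step (and zero_grad) handled by the engine
```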

@SeanNaren

Thanks to @tchaton, who managed to get this working :)

Just note that only one optimizer is supported, and manual optimization with DeepSpeed is still largely untested. We do have a test for a basic manual optimization example, which you can see here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/plugins/test_deepspeed_plugin.py#L36

@kushalj001

Hi @SeanNaren @tchaton
I just wanted to check up on this thread about manual optimization with DeepSpeed. Is it supported right now? My training loop is a bit complex (it involves RL), so I cannot use automatic optimization. I had been using automatic optimization with DeepSpeed earlier and it worked very well for me. I'd like to know whether I should continue using Lightning in my case or switch to native torch + DeepSpeed (or any other recommendations).
Thanks!
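
For context, the native torch + DeepSpeed alternative mentioned above boils down to a loop like the following. This is a rough sketch with a placeholder model, random data, and a minimal config; it assumes launch via the deepspeed launcher and a DeepSpeed version that accepts the config as a dict:

```python
import torch
import deepspeed

# Placeholder model and minimal config; adapt to the real training setup.
model = torch.nn.Linear(32, 2)
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    batch = torch.randn(8, 32).to(engine.device)
    loss = engine(batch).sum()  # forward through the engine
    engine.backward(loss)       # DeepSpeed scales and runs the backward pass
    engine.step()               # optimizer step (and zero_grad) handled by the engine
```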

@SophieOstmeier

I have the same issue. Considering switching too.
