Failed Manual Backward during DeepSpeed training. #7957
Comments
Dear @Zasder3, thanks for reporting this bug. Best,
Thanks @Zasder3. DeepSpeed requires control over more of the training loop, which makes manual optimization a bit tricky to support. In a future release we might be able to support it, but that will require the user to access the DeepSpeed engine directly to run the backward pass. Apologies, the message should've been clearer! I've added a clearer message in #7234
Thanks to @tchaton, this is now working :) Just note that only one optimizer is supported, and manual optimization with DeepSpeed is largely untested. We do have a test for a basic manual optimization example, which you can see here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/plugins/test_deepspeed_plugin.py#L36
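For context, the pattern that manual optimization replaces is the plain PyTorch zero_grad/backward/step loop. A minimal sketch (plain PyTorch, not the Lightning test itself; in Lightning this logic lives inside training_step with self.automatic_optimization = False, and loss.backward() is replaced by self.manual_backward(loss), which is the call that breaks under the DeepSpeed plugin):

```python
import torch

# Toy model and optimizer standing in for a LightningModule's internals.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

before = model.weight.detach().clone()
for _ in range(3):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    # In Lightning manual optimization this line is self.manual_backward(loss),
    # which delegates to the training-type plugin -- the layer where the
    # reported 'require_backward_grad_sync' attribute error surfaces.
    loss.backward()
    opt.step()

changed = not torch.allclose(before, model.weight)
print(changed)
```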
Hi @SeanNaren @tchaton
I have the same issue. Considering switching too. |
🐛 Bug
When attempting to use manual optimization in Lightning, the backward pass fails as it causes a reference to an undefined attribute.
On line 293 of pytorch-lightning/plugins/training_type/ddp.py, the referenced line throws the error:

torch.nn.modules.module.ModuleAttributeError: 'DeepSpeedEngine' object has no attribute 'require_backward_grad_sync'

I attempted to solve it by modifying the boilerplate, but that only led to more errors of the same kind, each assuming the existence of this attribute.
Please reproduce using the BoringModel
The BoringModel didn't fully accommodate the command-line arguments I needed, so I have provided a notebook instead.
To Reproduce
The colab notebook in question can be found here.
Expected behavior
Normally the training script would allow the backward pass to run smoothly, as it does without a DeepSpeed plugin.
Environment
Additional context
This error was first noticed on an 8x2080ti setup under near-identical conditions. The fix should just be a tweak to the backward pass. In the meantime, it would be nice to have an alternative to self.manual_backward that I could use.
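One possible interim shape for such an alternative, per the maintainer comment above about accessing the DeepSpeed engine directly: route the backward through the engine when one is present, and fall back to plain PyTorch otherwise. This is a hypothetical helper (the name backward_step and the engine parameter are illustrative, not Lightning API); it assumes DeepSpeed's public engine methods backward(loss) and step(). Only the plain-PyTorch path is exercised below:

```python
import torch

def backward_step(loss, opt, engine=None):
    """Run one backward/step, routing through a DeepSpeed engine when given.

    `engine` is assumed to expose DeepSpeed's `backward(loss)` and `step()`.
    With `engine=None` this falls back to plain PyTorch, which is roughly
    what `self.manual_backward` amounts to without a plugin.
    """
    if engine is not None:
        # DeepSpeed owns loss scaling and gradient accumulation, so the
        # engine must drive both backward and step.
        engine.backward(loss)
        engine.step()
    else:
        opt.zero_grad()
        loss.backward()
        opt.step()

# Exercise the fallback path on a toy model.
model = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = torch.nn.functional.mse_loss(model(torch.randn(4, 2)), torch.zeros(4, 1))
backward_step(loss, opt)
```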