Training process freezes on step 2 when training with manual optimization. #15395
Labels: bug (Something isn't working) · distributed (Generic distributed-related topic) · logging (Related to the `LoggerConnector` and `log()`) · repro needed (The issue is missing a reproducible example)
Bug description
I'm using manual optimization for multi-task learning with two datasets. Due to memory limits, I want to run a forward and backward pass on a batch from one dataset, then a forward and backward pass on a batch from the other.
Even with manual optimization enabled on just one dataset, my training process freezes at step 2 if I log scalars with `sync_dist=True` in the `on_after_backward` hook.
How to reproduce the bug
Error messages and logs
Environment
More info
No response
cc @awaelchli @rohitgr7 @akihironitta @carmocca @edward-io @ananthsub @Blaizzy