Lightning throws an error when using a saved model to run inference with the ddp_notebook strategy:

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call torch.cuda.* functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.
"GPU operations such as moving tensors to the GPU or calling torch.cuda functions before invoking Trainer.fit is not allowed."
This means that no CUDA tensors may exist in the process before the Trainer launches its workers. By default, when training on a GPU, PyTorch Lightning saves the model's state_dict as CUDA tensors, so loading the checkpoint restores those tensors to the GPU and initializes CUDA. You can verify this with a simple check:
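For example (a minimal sketch, where MyModel and "checkpoint.ckpt" are placeholders for your own LightningModule subclass and checkpoint file):

import torch

print(torch.cuda.is_initialized())  # False: nothing has touched CUDA yet

# The default map_location restores tensors to the device they were saved on (the GPU)
model = MyModel.load_from_checkpoint("checkpoint.ckpt")

print(torch.cuda.is_initialized())  # True: loading the checkpoint initialized CUDA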
You'll observe that CUDA becomes initialized during load_from_checkpoint, and once CUDA has been initialized in the main process it cannot be re-initialized in the forked worker processes that the ddp_notebook strategy creates.
The Fix:
Pass map_location="cpu" when calling load_from_checkpoint:
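A minimal sketch of the corrected loading and inference step (MyModel, "checkpoint.ckpt", and predict_loader are again placeholders for your own module, checkpoint file, and DataLoader):

from lightning.pytorch import Trainer

# Keep the restored weights on the CPU so the main process never initializes CUDA
model = MyModel.load_from_checkpoint("checkpoint.ckpt", map_location="cpu")

# ddp_notebook can now fork its workers safely; each worker moves the model to its own GPU
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")
predictions = trainer.predict(model, dataloaders=predict_loader)

Loading onto the CPU costs little here, since each worker transfers the weights to its assigned GPU anyway.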
Bug description
Lightning throws a RuntimeError when a model restored with load_from_checkpoint is used to run inference under the ddp_notebook strategy; the full message is listed under "Error messages and logs" below. I have included a minimal working example to reproduce the error under "How to reproduce the bug".
What version are you seeing the problem on?
v2.1
How to reproduce the bug
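The original example was not preserved in this copy of the issue; the following is a minimal sketch of the scenario it describes, assuming a toy TinyModel module and at least two GPUs. Steps 1 and 2 must run in separate kernel sessions, so that in the second session CUDA is initialized only by the checkpoint load:

import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch import LightningModule, Trainer

class TinyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def predict_step(self, batch, batch_idx):
        x, _ = batch
        return self.layer(x)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

loader = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)

# --- Step 1 (first session): train on a GPU and save a checkpoint.
# The saved state_dict contains CUDA tensors.
trainer = Trainer(accelerator="gpu", devices=1, max_epochs=1)
trainer.fit(TinyModel(), loader)
trainer.save_checkpoint("tiny.ckpt")

# --- Step 2 (fresh kernel): load the checkpoint, then request distributed inference.
model = TinyModel.load_from_checkpoint("tiny.ckpt")  # default map_location initializes CUDA

trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")
trainer.predict(model, loader)  # raises: RuntimeError: Lightning can't create new processes ...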
Error messages and logs
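RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call torch.cuda.* functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.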
Environment
More info
No response