Closed
Labels
bug: Something isn't working
checkpointing: Related to checkpointing
distributed: Generic distributed-related topic
Description
🐛 Bug
I am getting an exception in migration.py, specifically at this line:
del sys.modules["pytorch_lightning.utilities.argparse_utils"]
A cheap workaround was to wrap that line in a try/except.
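As a rough illustration of the workaround described above, the deletion can be made tolerant of a missing key. This is only a sketch: `LEGACY_NAME` and `remove_legacy_module` are hypothetical names, and the stand-in module name is used so the snippet runs without pytorch_lightning installed. It is not the library's actual fix.

```python
import sys
from types import ModuleType

# Hypothetical stand-in for "pytorch_lightning.utilities.argparse_utils".
LEGACY_NAME = "legacy_pkg.argparse_utils"


def remove_legacy_module(name: str = LEGACY_NAME) -> bool:
    """Remove `name` from sys.modules and report whether it was present."""
    # dict.pop with a default never raises, unlike `del sys.modules[name]`.
    return sys.modules.pop(name, None) is not None


# Register a placeholder module, then remove it twice.
sys.modules[LEGACY_NAME] = ModuleType(LEGACY_NAME)
first = remove_legacy_module()   # the module was registered
second = remove_legacy_module()  # already gone, but no KeyError is raised
```

Using `pop` with a default is equivalent in effect to wrapping the `del` in `try`/`except KeyError`, but it does the lookup and removal in a single dictionary operation.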
To Reproduce
I am using a thread pool executor to start 4 threads, each running a model on its own CUDA device. At some point, while the threads are loading checkpoints, I get a KeyError at the line mentioned above. I think the first load deletes this module, and subsequent loads do not check whether it has already been deleted.
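The suspected mechanism can be sketched without pytorch_lightning. The snippet below assumes the legacy module is registered in `sys.modules` before a load and deleted afterwards without a guard; `LegacyPatch`, `LEGACY_NAME`, and `load_checkpoint` are hypothetical stand-ins, and a barrier forces the four "loads" to overlap so the outcome is reproducible.

```python
import sys
import threading
from concurrent.futures import ThreadPoolExecutor
from types import ModuleType

# Hypothetical stand-in for "pytorch_lightning.utilities.argparse_utils".
LEGACY_NAME = "legacy_pkg.argparse_utils"


class LegacyPatch:
    """Assumed shape of the patch: register on entry, delete on exit."""

    def __enter__(self):
        sys.modules[LEGACY_NAME] = ModuleType(LEGACY_NAME)
        return self

    def __exit__(self, *exc_info):
        # Unguarded delete: raises KeyError if another thread got here first.
        del sys.modules[LEGACY_NAME]


# All four workers must be inside the patch before any of them leaves it.
barrier = threading.Barrier(4)


def load_checkpoint(worker_id: int) -> str:
    """Pretend to load a checkpoint; report whether the cleanup failed."""
    try:
        with LegacyPatch():
            barrier.wait(timeout=10)
    except KeyError:
        return "KeyError"
    return "ok"


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(load_checkpoint, range(4)))
```

Under these assumptions, only the first thread to leave the patch finds the key; the other three hit the `KeyError` described in this report.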
Expected behavior
The checkpoints should load successfully in all of the threads.
Environment
- CUDA:
  - GPU:
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
    - NVIDIA RTX A6000
  - available: True
  - version: 11.3
- Packages:
  - numpy: 1.21.2
  - pyTorch_debug: False
  - pyTorch_version: 1.10.0
  - pytorch-lightning: 1.6.0
  - tqdm: 4.62.3
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.9.7
  - version: #119-Ubuntu SMP Mon Mar 7 18:49:24 UTC 2022
Additional context
The idea was to have each model in a thread feeding it data.
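One possible user-side workaround for this setup, sketched under the assumption that the failure comes from overlapping loads, is to serialize only the checkpoint loading with a lock and let the threads run concurrently afterwards. `fake_load`, `load_serialized`, and the checkpoint paths are hypothetical; `fake_load` merely imitates a loader that registers and then deletes a module without a guard.

```python
import sys
import threading
from concurrent.futures import ThreadPoolExecutor
from types import ModuleType

# Hypothetical stand-in for "pytorch_lightning.utilities.argparse_utils".
LEGACY_NAME = "legacy_pkg.argparse_utils"

# One process-wide lock shared by every thread that loads a checkpoint.
_load_lock = threading.Lock()


def fake_load(path: str) -> dict:
    """Hypothetical loader that registers, then deletes, a legacy module."""
    sys.modules[LEGACY_NAME] = ModuleType(LEGACY_NAME)
    try:
        return {"path": path}
    finally:
        del sys.modules[LEGACY_NAME]  # unguarded, as in the report


def load_serialized(path: str) -> dict:
    # Only one thread at a time runs the register/delete pair.
    with _load_lock:
        return fake_load(path)


paths = [f"model_{i}.ckpt" for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = list(pool.map(load_serialized, paths))
```

The lock covers only the load step, so each thread can still feed data to its own model in parallel once loading has finished.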
cc @awaelchli @rohitgr7 @akihironitta @ananthsub @ninginthecloud