Crash on multiple initializations in load_checkpoint() calls  #12557

@pedrocolon93

Description

🐛 Bug

I am getting an exception in migration.py, specifically from this line:

del sys.modules["pytorch_lightning.utilities.argparse_utils"]

A cheap patch was to wrap the line in a try/except.
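For example, a guarded deletion along these lines avoids the KeyError (a sketch of the workaround, not the actual migration.py code):

```python
import sys

# Sketch of the workaround: only remove the legacy alias if another thread has
# not already removed it. pop() with a default behaves like try/except KeyError.
sys.modules.pop("pytorch_lightning.utilities.argparse_utils", None)
```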

To Reproduce

I am using a ThreadPoolExecutor to spin up 4 threads and run a model on each thread (each on its own CUDA device). At some point between the checkpoint loads across threads I get a KeyError on the line mentioned above. I think the first load deletes this module, but subsequent loads do not check whether it has already been deleted. A minimal sketch of the setup follows.
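Here is a minimal sketch of the setup described above, assuming a hypothetical model class and checkpoint path; the crash is triggered by concurrent load_from_checkpoint calls:

```python
# Minimal sketch, not the exact production code.
# MyLightningModule and CKPT_PATH are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

from my_project import MyLightningModule  # hypothetical LightningModule subclass

CKPT_PATH = "model.ckpt"  # hypothetical checkpoint path

def load_and_run(device_index: int):
    # Each thread loads the same checkpoint onto its own CUDA device.
    model = MyLightningModule.load_from_checkpoint(CKPT_PATH)
    model = model.to(f"cuda:{device_index}")
    model.eval()
    return model

with ThreadPoolExecutor(max_workers=4) as executor:
    # Concurrent load_from_checkpoint calls intermittently raise
    # KeyError: 'pytorch_lightning.utilities.argparse_utils' from migration.py.
    models = list(executor.map(load_and_run, range(4)))
```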

Expected behavior

The checkpoints should load successfully in each of the threads.

Environment

  • CUDA:
    • GPU:
      • NVIDIA RTX A6000
      • NVIDIA RTX A6000
      • NVIDIA RTX A6000
      • NVIDIA RTX A6000
    • available: True
    • version: 11.3
  • Packages:
    • numpy: 1.21.2
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0
    • pytorch-lightning: 1.6.0
    • tqdm: 4.62.3
  • System:

Additional context

The idea was to have one model per thread, with each thread feeding its model data.

cc @awaelchli @rohitgr7 @akihironitta @ananthsub @ninginthecloud

Labels

bug (Something isn't working) · checkpointing (Related to checkpointing) · distributed (Generic distributed-related topic)
