KFold example with ddp_spawn crash #12439

@niberger

🐛 Bug

When the cross-validation loop from the example pl_examples/loop_examples/kfold.py is run with the ddp_spawn strategy, it hits a SIGABRT and the program crashes.

To Reproduce

import os
from pytorch_lightning import seed_everything, Trainer

from pl_examples.loop_examples.kfold import KFoldLoop, LitImageClassifier, MNISTKFoldDataModule


def run():
    seed_everything(42)
    model = LitImageClassifier()
    datamodule = MNISTKFoldDataModule()

    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        strategy="ddp_spawn",
    )
    internal_fit_loop = trainer.fit_loop
    trainer.fit_loop = KFoldLoop(5, export_path="./")
    trainer.fit_loop.connect(internal_fit_loop)
    trainer.fit(model, datamodule=datamodule)


if __name__ == "__main__":
    run()
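
For readers unfamiliar with what the loop in the script above iterates over: the `5` passed to `KFoldLoop` is presumably the number of folds. The sketch below shows, in plain Python, how a dataset can be partitioned into that many train/validation splits. It is only an illustration and not the code in kfold.py; `kfold_indices` is a hypothetical helper, and it does not reproduce or explain the crash.

```python
from typing import List, Tuple


def kfold_indices(num_samples: int, num_folds: int) -> List[Tuple[List[int], List[int]]]:
    """Hypothetical helper: return one (train_indices, val_indices) pair per fold."""
    if not 2 <= num_folds <= num_samples:
        raise ValueError("num_folds must be between 2 and num_samples")

    indices = list(range(num_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    base, remainder = divmod(num_samples, num_folds)

    splits = []
    start = 0
    for fold in range(num_folds):
        size = base + (1 if fold < remainder else 0)
        val = indices[start:start + size]                     # held-out fold
        train = indices[:start] + indices[start + size:]      # everything else
        splits.append((train, val))
        start += size
    return splits


if __name__ == "__main__":
    for fold, (train, val) in enumerate(kfold_indices(10, 5)):
        print(f"fold {fold}: train on {len(train)} samples, validate on {val}")
```

Each sample lands in exactly one validation fold, so a cross-validation loop would train once per fold on the remaining samples.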

Expected behavior

Training completes without raising an exception.

Environment

  • CUDA:
    • GPU:
      • Tesla P100-PCIE-16GB
      • Tesla P100-PCIE-16GB
      • Tesla P100-PCIE-16GB
      • Tesla P100-PCIE-16GB
      • Tesla P100-PCIE-16GB
      • Tesla P100-PCIE-16GB
      • Tesla P100-PCIE-16GB
      • Tesla P100-PCIE-16GB
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.20.2
    • pyTorch_debug: False
    • pyTorch_version: 1.9.0
    • pytorch-lightning: 1.6.0rc0
    • tqdm: 4.60.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor:
    • python: 3.8.9
    • version: #1 SMP Fri Jan 14 13:59:45 UTC 2022

Additional context

I am working on a fix.

cc @Borda @carmocca @justusschock @ananthsub @ninginthecloud @rohitgr7

Labels

  • bug: Something isn't working
  • example
  • good first issue: Good for newcomers
  • loops: Related to the Loop API
