TracerPythonScript is stuck when using same work to run different DDP scripts #14720

Open
rohitgr7 opened this issue Sep 15, 2022 · 1 comment
Labels
app:lightningwork lightning_app.LightningWork app Generic label for Lightning App package bug Something isn't working

🐛 Bug

To Reproduce

Create two scripts, bug_report_model.py and bug_report_model2.py, each containing this code:

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)


def run():
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_test_batches=1,
        accelerator='cpu',
        devices=2,
        strategy='ddp',
    )
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()

Then create app.py:

import lightning as L
from lightning.app.components.python.tracer import TracerPythonScript


class ScriptRunner(TracerPythonScript):
    def __init__(self, script_path):
        super().__init__(script_path=script_path, cache_calls=True, parallel=False)

    def run(self, script_path, **kwargs):
        self.script_path = script_path
        super().run()


class Text2ImgFlow(L.LightningFlow):
    def __init__(self):
        super().__init__()
        self.script_runner = ScriptRunner('')

    def run(self):
        self.script_runner.run('bug_report_model.py')
        self.script_runner.run('bug_report_model2.py')
        print('done..........')


class RootFlow(L.LightningFlow):
    def __init__(self):
        super().__init__()
        self.text2img = Text2ImgFlow()

    def run(self):
        self.text2img.run()

app = L.LightningApp(RootFlow())

The second script gets stuck while creating its processes.

I am reusing the same tracer because TracerPythonScript is a LightningWork: if I create a separate work for each script, each work will eventually be allocated its own machine. Ideally, a work should be flexible enough to run a script and exit without any issues, so that users can run different scripts on the same machine.
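One way to picture "run the script and exit" on a single machine is to launch each script in a fresh interpreter, so that nothing the first run leaves behind in the parent process carries over to the second. The sketch below uses only the standard library. `run_script_isolated` is a hypothetical helper, not part of Lightning, and whether this approach avoids the hang reported here has not been verified.

```python
import subprocess
import sys


def run_script_isolated(script_path, *script_args):
    """Run a Python script in a fresh interpreter and return its exit code.

    Hypothetical helper for illustration; not a Lightning API.
    """
    # sys.executable is the interpreter running this code, so the child
    # process sees the same installed packages.
    completed = subprocess.run(
        [sys.executable, script_path, *script_args],
        check=False,  # report the exit code instead of raising
    )
    return completed.returncode
```

Calling it twice in a row, for example with 'bug_report_model.py' and then 'bug_report_model2.py', would run each script in its own process on the same machine.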

Also:

| script 1  | script 2  | result       |
|-----------|-----------|--------------|
| ddp_spawn | ddp_spawn | works        |
| ddp_spawn | ddp       | works        |
| ddp       | ddp_spawn | doesn't work |
| ddp       | ddp       | doesn't work |
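Reading the four results together, the outcome seems to depend only on the strategy used by the first script. The snippet below is a small sanity check of that reading against the reported results. The `OBSERVED` mapping is copied from the results above, while `second_run_expected_to_work` is a hypothetical name for the guessed rule, which is an assumption rather than a confirmed diagnosis.

```python
# Results copied from the issue: (script 1 strategy, script 2 strategy) -> worked?
OBSERVED = {
    ("ddp_spawn", "ddp_spawn"): True,
    ("ddp_spawn", "ddp"): True,
    ("ddp", "ddp_spawn"): False,
    ("ddp", "ddp"): False,
}


def second_run_expected_to_work(first_strategy):
    """Guessed rule (assumption): the second run works unless the first used 'ddp'."""
    return first_strategy != "ddp"


def mismatches():
    """Return the strategy pairs where the guessed rule disagrees with the report."""
    return [
        pair
        for pair, worked in OBSERVED.items()
        if second_run_expected_to_work(pair[0]) != worked
    ]
```

An empty list from `mismatches()` means the rule is consistent with all four reported combinations; it does not explain why a first run with 'ddp' leaves the work in that state.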


@rohitgr7 rohitgr7 added bug Something isn't working app Generic label for Lightning App package app:lightningwork lightning_app.LightningWork labels Sep 15, 2022

stale bot commented Oct 16, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Oct 16, 2022
@rohitgr7 rohitgr7 added this to the future milestone Oct 17, 2022
@stale stale bot removed the won't fix This will not be worked on label Oct 17, 2022