
Hydra + DDP Fails for NeMo after Hydra refactor in 1.8 #15689

Closed
SeanNaren opened this issue Nov 15, 2022 · 5 comments · Fixed by #15737
Labels
performance priority: 1 Medium priority task strategy: ddp DistributedDataParallel
Milestone
v1.8.x

Comments

@SeanNaren
Contributor

SeanNaren commented Nov 15, 2022

Bug description

Related #15545

NeMo PR where regression was spotted NVIDIA/NeMo#5353

After #11617 was merged and included in 1.8, NeMo breaks with DDP (as NeMo uses Hydra internally). I'm cross-pasting the explanation from the above PR:

In #11617, Lightning changed the way sub-processes are started in DDP. Instead of re-running the original command (and passing environment variables to set the rank), sub-processes are now launched from a config YAML file auto-generated by Hydra, stored by default in .hydra, which passes the arguments via config.
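
To make the change concrete, here is a rough, illustrative sketch of the two launch strategies. This is not the actual Lightning source; the exact paths, flags, and environment variables are approximations.

import os
import subprocess
import sys


def launch_worker_old(local_rank: int) -> None:
    # Pre-#11617 behaviour (approximate): re-run the exact command line and
    # tell the child process which rank it is via environment variables.
    env = {**os.environ, "LOCAL_RANK": str(local_rank)}
    subprocess.Popen([sys.executable] + sys.argv, env=env)


def launch_worker_new(local_rank: int, hydra_output_dir: str = ".hydra") -> None:
    # Post-#11617 behaviour (approximate): re-launch the script from the config
    # YAML that Hydra saved for the rank-0 process, and give each child its own
    # Hydra output folder (".pl_hydra_local_rank_{rank}") in the current
    # working directory.
    env = {**os.environ, "LOCAL_RANK": str(local_rank)}
    command = [
        sys.executable,
        sys.argv[0],
        "--config-path", hydra_output_dir,  # standard Hydra CLI flags
        "--config-name", "config.yaml",
        f"hydra.output_subdir=.pl_hydra_local_rank_{local_rank}",
    ]
    subprocess.Popen(command, env=env)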

In NeMo, we disable the creation of this output subdirectory, and we always set the run directory to the current working directory. This lets the experiment manager handle everything regarding the checkpoint/logging directories instead of Hydra.
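
For reference, this is roughly equivalent to running a script with the standard Hydra overrides below (the script name is a placeholder, and the exact mechanism inside NeMo's hydra_runner may differ):

python your_script.py hydra.run.dir=. hydra.output_subdir=null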

The issue is that when the sub-processes are launched, the Hydra runner is not aware of the experiment manager's directory choices. If we allowed the subdirectory to be created in the default .hydra location, Lightning's DDP code would start processes in the current working directory, each with a new folder (`.pl_hydra_local_rank_{rank}`).

The problem is that if you have multiple runs in the same repository, they will overwrite each other's folders and it becomes a mess.
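
For concreteness (illustrative layout, not taken from an actual run), two jobs launched concurrently from the same working directory would both create and overwrite the same per-rank folders:

./.pl_hydra_local_rank_1/   # Hydra output for run A's local rank 1, overwritten by run B
./.pl_hydra_local_rank_2/   # likewise for any additional local ranks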

I have been unable to come up with an elegant solution between NeMo and Lightning.

  • The easiest option would be to monkey-patch the launcher class back to the old argv-based behaviour; however, this isn't future-proof at all, if anything it goes in the opposite direction.

I may have missed something however, so if anyone has any other suggestions on how we can fix this please let me know!

cc @tchaton @justusschock @awaelchli @akihironitta @Borda @titu1994 @ericharper

How to reproduce the bug

Requires 2 devices (not sure if it has to be on the GPU though).

# install NeMo with Lightning 1.8 (WIP) support
pip install https://github.com/NVIDIA/NeMo/archive/refs/heads/feat/lightning_1.8_support.zip

Create config in conf/config.yaml:

name: "QuartzNet15x5"
exp_manager:
  exp_dir: null
  name: ${name}
  create_tensorboard_logger: False
  create_checkpoint_callback: False
  create_wandb_logger: False

Create a file (e.g. report.py, as used below) with this code:

import os

import torch
from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


@hydra_runner(config_path="conf", config_name="config")
def run(cfg):
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        max_epochs=1,
        devices=2,
        accelerator='gpu',
        logger=False,
        enable_checkpointing=False,
        strategy='ddp',
    )
    exp_manager(trainer=trainer, cfg=cfg.exp_manager)
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)


if __name__ == "__main__":
    run()

Run the file:

python report.py
[NeMo W 2022-11-15 05:08:44 nemo_logging:349] /home/snarenthiran/anaconda3/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo I 2022-11-15 05:08:44 exp_manager:343] Experiments will be logged at /home/snarenthiran/nemo_experiments/QuartzNet15x5/2022-11-15_05-08-44
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Cannot find primary config 'config.yaml'. Check that it's in your config search path.

Config search path:
	provider=hydra, path=pkg://hydra.conf
	provider=main, path=file:///home/snarenthiran
	provider=schema, path=structured://

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
@SeanNaren SeanNaren added the needs triage Waiting to be triaged by maintainers label Nov 15, 2022
@akihironitta akihironitta added strategy: ddp DistributedDataParallel performance and removed needs triage Waiting to be triaged by maintainers labels Nov 15, 2022
@awaelchli awaelchli added the priority: 1 Medium priority task label Nov 18, 2022
@awaelchli awaelchli added this to the v1.8.x milestone Nov 18, 2022
@awaelchli
Member

Brief update: We decided to revert #11617 in #15727 to restore the previous behavior. We believe that this issue with NeMo is affecting more users than the original #11300, but we will revisit the original fix. #15727 is implementing the revert and I'm considering breaking out a test case based on what @SeanNaren has shared here. However, I need to study the behavior of the exp manager first.

@SeanNaren
Contributor Author


Great! To be honest, the exp_manager case is specific to NeMo: we do not want Hydra to do anything but config management, so we disable everything else (and handle our own output folders and such). I can imagine other repos doing the same, but I just wanted to make it clear this isn't Hydra's default behaviour.

@justusschock
Member

@SeanNaren would it be reasonable for you to create a custom Launcher based on these changes if we made custom launchers publicly available, or do you consider this digging too deep into Lightning internals?
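
(A hypothetical sketch of what such a custom launcher could look like; the base class import reflects Lightning 1.8's internal module layout, but the override hook is an assumption about a possible public API, not the actual interface.)

# Hypothetical sketch only: assumes Lightning exposed a public, overridable
# launcher hook. "_worker_command" is an invented name for illustration.
import sys

from pytorch_lightning.strategies.launchers.subprocess_script import _SubprocessScriptLauncher


class ArgvRelaunchLauncher(_SubprocessScriptLauncher):
    """Re-launch DDP workers from the original command line (pre-1.8 behaviour)."""

    def _worker_command(self, local_rank: int) -> list:  # hypothetical hook
        # Ignore the Hydra-saved config and simply repeat the original argv;
        # the rank would be communicated via environment variables as before.
        return [sys.executable] + sys.argv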

@titu1994
Contributor

titu1994 commented Nov 19, 2022

If we required custom launchers by default just to use Lightning, it would mostly defeat the purpose of having Lightning be a stable engine to train our models. We've already had to make some necessary overrides of PyTorch internals to support model parallelism, and we'd rather not have to do that for every use case of NeMo.

@justusschock
Member

@titu1994 that's fair. We'll definitely be looking into other options!
I just wanted to know whether we need to look in this direction or another.
