Hydra + DDP Fails for NeMo after Hydra refactor in 1.8 #15689
Comments
Brief update: We decided to revert #11617 in #15727 to restore the previous behavior. We believe this issue with NeMo is affecting more users than the original #11300, but we will revisit the original fix. #15727 implements the revert, and I'm considering breaking out a test case based on what @SeanNaren has shared here. However, I need to study the behavior of the exp manager first.
Great! I'll be honest, the
@SeanNaren would it be reasonable for you to create a custom Launcher based on these changes if we made custom launchers publicly available, or do you consider this digging too deep into Lightning internals?
If using Lightning requires custom launchers by default, it would largely defeat the purpose of having Lightning be a stable engine for training our models. We've already had to override some PyTorch internals to support model parallelism, and we'd rather not do that for every NeMo use case.
@titu1994 that's fair. We'll definitely be looking for other options!
Bug description
Related #15545
NeMo PR where the regression was spotted: NVIDIA/NeMo#5353
After #11617 was merged and included in 1.8, this has caused NeMo to break with DDP (as NeMo uses Hydra internally). I'm going to cross-paste the explanation from the above PR:
In #11617, Lightning changed the way sub-processes are started in DDP. Instead of re-running the command (and passing environment variables to set the rank), sub-processes now read a config yaml file auto-generated by Hydra, stored by default in `.hydra`, to receive their arguments.

In NeMo, we disable the creation of this output subdirectory, and we always set the run directory to the current working directory. This lets the experiment manager handle everything regarding the checkpoint/logging directories instead of Hydra.

The issue is that when sub-processes are launched, the Hydra runner is not aware of the experiment manager's directory choices. If we allow the subdirectory to be created in the default `.hydra`, the DDP Lightning code starts processes in the current working directory, each with a new folder (`.pl_hydra_local_rank_{rank}`). If you have multiple runs in the same repo, they will overwrite each other and it becomes a mess.
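To make the clash concrete, here is a simplified stdlib-only simulation of the behavior described above. This is not Lightning's actual launcher code: the folder name is taken from the issue, and the helper `save_subprocess_config` is hypothetical.

```python
# Simplified simulation of the collision: after #11617, Lightning saves the
# resolved Hydra config into a per-rank folder in the working directory
# (".pl_hydra_local_rank_{rank}"). With the run dir pinned to the CWD, two
# runs in the same directory write to the same path. NOT Lightning's code.
import pathlib
import tempfile

def save_subprocess_config(run_dir: pathlib.Path, local_rank: int, config_text: str) -> pathlib.Path:
    """Mimic saving the Hydra config for a DDP subprocess of the given rank."""
    folder = run_dir / f".pl_hydra_local_rank_{local_rank}"
    folder.mkdir(exist_ok=True)
    path = folder / "config.yaml"
    path.write_text(config_text)  # silently clobbers any previous run's file
    return path

with tempfile.TemporaryDirectory() as tmp:
    cwd = pathlib.Path(tmp)  # stands in for the shared working directory
    # First experiment writes its subprocess config for local rank 0.
    first = save_subprocess_config(cwd, 0, "experiment: first\n")
    # A second experiment in the same directory overwrites it.
    second = save_subprocess_config(cwd, 0, "experiment: second\n")
    assert first == second  # both runs resolve to the very same path
    clobbered = first.read_text()  # the first run's config is gone
```

Hydra's default behavior of creating a fresh timestamped output directory per run sidesteps this, which is why the problem only surfaces once the run directory is forced to the current working directory.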
I have been unable to come up with an elegant solution between NeMo and Lightning.
I may have missed something, however, so if anyone has any other suggestions on how we can fix this, please let me know!
cc @tchaton @justusschock @awaelchli @akihironitta @Borda @titu1994 @ericharper
How to reproduce the bug
Requires 2 devices (not sure whether it has to be on a GPU, though).
Create a config in `conf/config.yaml`:

Create a file with this code:

Run the file.
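The original `conf/config.yaml` and script did not survive this copy of the issue. As a rough sketch only, a NeMo-style repro config typically pins Hydra's run directory to the current working directory and disables the `.hydra` subdirectory; every key below is an assumption, not the issue's actual config.

```yaml
# conf/config.yaml -- hypothetical minimal repro config (assumption, not the
# original). The hydra overrides mirror what NeMo's runner does: no .hydra
# output subdirectory, and the run directory pinned to the CWD.
trainer:
  devices: 2
  accelerator: auto
  strategy: ddp
  max_epochs: 1

hydra:
  output_subdir: null   # disable creation of the .hydra folder
  run:
    dir: .              # keep the run directory at the current working dir
```

The accompanying script would presumably decorate its entry point with `@hydra.main(config_path="conf", config_name="config")` and construct a `Trainer` from the `trainer` section, so that DDP spawns the sub-processes whose config folders collide as described above.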