
Logging on slurm stopped working #2317

Closed
kumuji opened this issue Jun 22, 2020 · 5 comments
Labels
bug Something isn't working help wanted Open to be worked on

Comments

@kumuji
Contributor

kumuji commented Jun 22, 2020

🐛 Bug

Logging and checkpoint saving stopped working when I run experiments through Slurm.
I return `log` keys from training_epoch_end/validation_epoch_end.
Version 0.7.6 works.
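
For readers unfamiliar with the pattern, here is a minimal sketch of what "returning `log` keys" from an epoch-end hook looks like. It is not the reporter's code: the function is a hypothetical stand-in for a LightningModule method, it uses plain floats instead of tensors, and the metric name `val_loss` is an assumption.

```python
def validation_epoch_end(outputs):
    """Hypothetical stand-in for a LightningModule hook, using plain floats."""
    # Average the per-step losses collected over the epoch.
    avg_loss = sum(o["val_loss"] for o in outputs) / len(outputs)
    # Metrics placed under the "log" key are the ones handed to the logger.
    return {"val_loss": avg_loss, "log": {"val_loss": avg_loss}}


result = validation_epoch_end([{"val_loss": 1.0}, {"val_loss": 3.0}])
```

With the bug described here, a dictionary like `result["log"]` would be produced as usual but never reach TensorBoard.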

To Reproduce

Steps to reproduce the behaviour:

  1. Define a TensorBoard logger.
  2. Run training through Slurm with sbatch ...
  3. No logs are written.

Code sample

Expected behaviour

Environment

  • PyTorch: 1.4.0
  • PyTorch-Lightning: 0.8.1
  • OS: Linux
  • Python: 3.7.6
  • CUDA/cuDNN: 10.1 / 7.6.5
@kumuji kumuji added bug Something isn't working help wanted Open to be worked on labels Jun 22, 2020
@github-actions
Contributor

Hi! Thanks for your contribution! Great first issue!

@sebamenabar

sebamenabar commented Jun 23, 2020

Hi, I think I'm having the same problem. When I run locally, logging works correctly (I'm sending to Comet). When I run on a cluster through Slurm using sbatch or srun, the experiments are created in Comet, but nothing is logged to them.

Edit: Downgraded to 0.7.6 and it works.

@ExpectationMax
Contributor

I think this might be caused by how the rank ID is set. I'm not totally sure, but it could have been introduced in #2231.
My guess is that rank_zero_only malfunctions, such that the gated code is never executed.
See also the comment in #2278.
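
To illustrate the suspected failure mode, here is a minimal sketch of a rank-gated decorator. It is not the library's implementation: the decorator body, the `rank` attribute, and the helpers `log_metrics` and `simulate` are assumptions made for the example. The point is that if no process ends up with rank 0, the gated function returns without doing anything and raises no error.

```python
import functools


def rank_zero_only(fn):
    """Run fn only on the process whose rank is 0 (hypothetical sketch)."""
    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        if rank_zero_only.rank == 0:
            return fn(*args, **kwargs)
        return None  # silently skipped on every other rank
    return wrapped


rank_zero_only.rank = 0  # default for a single-process run

logged = []  # stands in for the logger backend


@rank_zero_only
def log_metrics(metrics):
    logged.append(metrics)
    return True


def simulate(rank):
    """Pretend the launcher assigned this rank, then try to log."""
    rank_zero_only.rank = rank
    return log_metrics({"loss": 0.5})
```

Calling `simulate(0)` appends to `logged`, while `simulate(1)` returns `None` and leaves it untouched. If the rank were derived incorrectly under Slurm so that every process saw a nonzero value, no process would ever log, which would match the symptoms reported above.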

@kumuji
Contributor Author

kumuji commented Jun 23, 2020

If you want a quick fix, just remove this line. (Dirty solution)

@williamFalcon
Contributor

williamFalcon commented Jun 24, 2020

Fixed by #2339

Please run from master or 0.8.2 on June 25
