Validate environment variables in SLURMEnvironment and warn user about incompatible settings #10150

Closed
awaelchli opened this issue Oct 26, 2021 · 3 comments · Fixed by #15011
Labels: environment: slurm, feature (Is an improvement or enhancement), let's do it! (approved to implement)
Comments

Contributor

awaelchli commented Oct 26, 2021

🚀 Feature

Check the SLURM environment settings and print a warning when they look incompatible.

Motivation

If SLURM srun variables are set incorrectly, the processes can hang and the user will not know why.

Examples:

Pitch

In the SLURM cluster environment, check for example:

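import os
from warnings import warn

# Environment values are strings (or missing), so convert before comparing.
if int(os.environ.get("SLURM_CPUS_PER_TASK", "1")) > 1:
    warn(
        "You asked SLURM for multiple CPUs but we are not sure your machine has multiple CPUs. "
        "If your process hangs, consider setting --cpus-per-task=1"
    )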
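
For illustration only, here is a slightly fuller sketch of how such a check could be made testable. It is not the SLURMEnvironment implementation: the function names `check_cpus_per_task` and `warn_on_slurm_settings` are hypothetical, and treating a non-integer value as its own warning is an assumption added for the example.

```python
import os
import warnings
from typing import Mapping, Optional


def check_cpus_per_task(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a warning message for a suspicious SLURM_CPUS_PER_TASK, else None.

    Hypothetical helper: `env` defaults to os.environ so tests can pass a dict.
    """
    env = os.environ if env is None else env
    raw = env.get("SLURM_CPUS_PER_TASK")
    if raw is None:
        return None  # variable not set: nothing to check
    try:
        cpus = int(raw)  # environment values are always strings
    except ValueError:
        # Assumption for this sketch: report a malformed value instead of crashing.
        return f"SLURM_CPUS_PER_TASK={raw!r} is not an integer."
    if cpus > 1:
        return (
            "You asked SLURM for multiple CPUs but we are not sure your machine "
            "has multiple CPUs. If your process hangs, consider setting "
            "--cpus-per-task=1"
        )
    return None


def warn_on_slurm_settings(env: Optional[Mapping[str, str]] = None) -> None:
    """Emit a Python warning if the check above finds a problem."""
    message = check_cpus_per_task(env)
    if message is not None:
        warnings.warn(message)
```

Returning the message rather than warning directly keeps the check easy to unit test, and further checks on other variables could follow the same pattern.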

Alternatives

Do nothing. Users will keep submitting issues :)

Additional context



cc @Borda @awaelchli

@awaelchli awaelchli added feature Is an improvement or enhancement environment: slurm labels Oct 26, 2021

stale bot commented Nov 25, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Nov 25, 2021
@awaelchli
Contributor Author

No!

@stale stale bot removed the won't fix This will not be worked on label Nov 25, 2021
Contributor

tchaton commented Nov 26, 2021

Great idea! Let's do it.

3 participants