New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate Trainer settings against cluster environment #18292
Conversation
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
…tings' into feature/validate-cluster-env-settings
⚡ Required checks status: All passing 🟢Groups summary🟢 pytorch_lightning: Tests workflow
These checks are required after the changes to 🟢 pytorch_lightning: Azure GPU
These checks are required after the changes to 🟢 pytorch_lightning: Benchmarks
These checks are required after the changes to 🟢 fabric: Docs
These checks are required after the changes to 🟢 pytorch_lightning: Docs
These checks are required after the changes to 🟢 lightning_fabric: CPU workflowThese checks are required after the changes to 🟢 lightning_fabric: Azure GPU
These checks are required after the changes to 🟢 mypy
These checks are required after the changes to 🟢 installThese checks are required after the changes to 🟢 link-check
These checks are required after the changes to Thank you for your contribution! 💜
|
f" `--ntasks-per-node={ntasks_per_node}` does not match. HINT: Set `devices={ntasks_per_node}`." | ||
) | ||
nnodes = os.environ.get("SLURM_NNODES") | ||
if nnodes is not None and int(nnodes) != num_nodes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @awaelchli, just a minor note. I came across this error when using lightning nightly build
ValueError: You set `num_nodes=4` in Lightning, but the number of nodes configured in SLURM `--nodes=4` does not match. HINT: Set `num_nodes=4`.
I wasn't able to investigate this deeply but wanted to flag it here in case its of help to you. Not sure if it happens due to type of num_nodes, but casting it to int gets the code through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Sumith1896
That's very odd. I don't understand how the check above can pass 4 != 4 when the raised error message shows that both nnodes=4 and num_nodes=4.
int(nnodes)
makes it an int and num_nodes
should already be one.
but casting it to int gets the code through.
Could you tell me where a cast to int is needed? Is num_nodes
not an int for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_nodes
wasn't an int in my case. I'm quite puzzled myself as the same lightning env worked on different compute instances/slurm fine. Not sure if its a slurm version / instance specific. This same PR seems to have removed cast of num_nodes to int, maybe that's the cause. I can also help narrow it down for you on my end if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you using Trainer or Fabric?
I removed the cast because I deduced it as redundent, as earlier in the code the num_nodes is assumed to be int and converted to int. In Fabric, this happens at the very beginning here: https://github.com/Lightning-AI/lightning/blob/32415c68ae981a83c72d9587cb02769a4a5b894e/src/lightning/fabric/connector.py#L116
And in Trainer, the num_nodes input is expected to be int from the start: https://github.com/Lightning-AI/lightning/blob/32415c68ae981a83c72d9587cb02769a4a5b894e/src/lightning/pytorch/trainer/connectors/accelerator_connector.py#L81
Random guess. Do you have an argparser that takes the num_nodes argument from the command line, and keeps it as string instead of converting to int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you do me a favor and on your system check where in the connector I linked above the num_nodes is not an int. Is it a the input or later in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@awaelchli ah yes! It's a faulty argparse that sends num_nodes
as str
, I'm a bit surprised by the non-deterministic behavior, but it doesn't seem to be lightning's fault. One could also check if num_nodes
is an str
that could be cast as int
and do so if possible but I'll let you take the call. Thanks for engaging!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for confirming! I added input validation yesterday: #18598, this should help resolve any confusions in the future.
What does this PR do?
Fixes #10107
Fixes #8993
This PR adds checks that the user sets
devices
andnum_nodes
correctly in a cluster environment where processes get configured and launched externally. In this PR, we do this for SLURM and torchelastic launch, and others are left for future work (not needed for LightningEnvironment).Note: This is an intermediate step towards automatically inferring the number of nodes in a cluster environment, so the user doesn't have to set it by hand: #14078, #7361
cc @Borda @carmocca @justusschock @awaelchli