Validate Trainer settings against cluster environment #18292

Merged
merged 11 commits into from Aug 12, 2023

Conversation

awaelchli
Member

@awaelchli awaelchli commented Aug 12, 2023

What does this PR do?

Fixes #10107
Fixes #8993

This PR adds checks that the user sets devices and num_nodes correctly in a cluster environment where processes get configured and launched externally. In this PR, we do this for SLURM and torchelastic launch, and others are left for future work (not needed for LightningEnvironment).
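As a rough illustration of the kind of check described above, here is a minimal sketch for the SLURM case. The helper name `validate_slurm_settings` and its parameters are hypothetical, and this is not the code merged in this PR. `SLURM_NTASKS_PER_NODE` and `SLURM_NNODES` are environment variables set by SLURM; the `num_nodes` message follows the error quoted later in this thread, while the first half of the `devices` message is assumed.

```python
import os


def validate_slurm_settings(num_devices: int, num_nodes: int) -> None:
    """Hypothetical helper: compare Lightning settings with the SLURM environment."""
    # Only present when `--ntasks-per-node` was passed to sbatch/srun.
    ntasks_per_node = os.environ.get("SLURM_NTASKS_PER_NODE")
    if ntasks_per_node is not None and int(ntasks_per_node) != num_devices:
        raise ValueError(
            # The opening of this message is assumed for illustration.
            f"You set `devices={num_devices}` in Lightning, but the number of tasks per node configured in SLURM"
            f" `--ntasks-per-node={ntasks_per_node}` does not match. HINT: Set `devices={ntasks_per_node}`."
        )
    # Environment variables are always strings, so convert before comparing.
    nnodes = os.environ.get("SLURM_NNODES")
    if nnodes is not None and int(nnodes) != num_nodes:
        raise ValueError(
            f"You set `num_nodes={num_nodes}` in Lightning, but the number of nodes configured in SLURM"
            f" `--nodes={nnodes}` does not match. HINT: Set `num_nodes={nnodes}`."
        )
```

Outside a SLURM job neither variable is set, so the sketch does nothing; inside a job it raises as soon as one of the two settings disagrees with what SLURM reports.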

Note: This is an intermediate step towards automatically inferring the number of nodes in a cluster environment, so the user doesn't have to set it by hand: #14078, #7361

cc @Borda @carmocca @justusschock @awaelchli

@awaelchli awaelchli added feature Is an improvement or enhancement environment labels Aug 12, 2023
@github-actions github-actions bot added fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package and removed feature Is an improvement or enhancement environment labels Aug 12, 2023
@awaelchli awaelchli added the fun Staff contributions outside working hours - to differentiate from the "community" label label Aug 12, 2023
@awaelchli awaelchli added this to the 2.1 milestone Aug 12, 2023
@awaelchli awaelchli marked this pull request as ready for review August 12, 2023 11:14
@github-actions
Contributor

github-actions bot commented Aug 12, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 1.11) success
pl-cpu (macOS-11, lightning, 3.9, 1.12) success
pl-cpu (macOS-11, lightning, 3.10, 1.13) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
pl-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
pl-cpu (windows-2022, lightning, 3.8, 1.11) success
pl-cpu (windows-2022, lightning, 3.9, 1.12) success
pl-cpu (windows-2022, lightning, 3.10, 1.13) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
pl-cpu (macOS-11, pytorch, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.13) success
pl-cpu (windows-2022, pytorch, 3.8, 1.13) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/fabric/plugins/environments/cluster_environment.py, src/lightning/fabric/plugins/environments/slurm.py, src/lightning/fabric/plugins/environments/torchelastic.py, src/lightning/fabric/strategies/launchers/subprocess_script.py, src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/strategies/launchers/test_subprocess_script.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
[pytorch-lightning (GPUs) (testing Lightning latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=169594&view=logs&jobId=47e66f3c-897a-5428-da11-bf5c7745762e) success
[pytorch-lightning (GPUs) (testing PyTorch latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=169594&view=logs&jobId=3f274fac-2e11-54ca-487e-194c91f3ae9f) success

These checks are required after the changes to src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py, tests/tests_pytorch/strategies/launchers/test_subprocess_script.py, src/lightning/fabric/connector.py, src/lightning/fabric/plugins/environments/cluster_environment.py, src/lightning/fabric/plugins/environments/slurm.py, src/lightning/fabric/plugins/environments/torchelastic.py, src/lightning/fabric/strategies/launchers/subprocess_script.py.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/fabric/plugins/environments/cluster_environment.py, src/lightning/fabric/plugins/environments/slurm.py, src/lightning/fabric/plugins/environments/torchelastic.py, src/lightning/fabric/strategies/launchers/subprocess_script.py, src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.

🟢 fabric: Docs
Check ID Status
make-doctest (fabric) success
make-html (fabric) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/fabric/plugins/environments/cluster_environment.py, src/lightning/fabric/plugins/environments/slurm.py, src/lightning/fabric/plugins/environments/torchelastic.py, src/lightning/fabric/strategies/launchers/subprocess_script.py.

🟢 pytorch_lightning: Docs
Check ID Status
make-doctest (pytorch) success
make-html (pytorch) success

These checks are required after the changes to src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, lightning, 3.8, 1.11) success
fabric-cpu (macOS-11, lightning, 3.9, 1.12) success
fabric-cpu (macOS-11, lightning, 3.10, 1.13) success
fabric-cpu (macOS-11, lightning, 3.10, 2.0) success
fabric-cpu (macOS-11, lightning, 3.8, 1.11, oldest) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11) success
fabric-cpu (ubuntu-20.04, lightning, 3.9, 1.12) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.13) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11) success
fabric-cpu (windows-2022, lightning, 3.9, 1.12) success
fabric-cpu (windows-2022, lightning, 3.10, 1.13) success
fabric-cpu (windows-2022, lightning, 3.10, 2.0) success
fabric-cpu (windows-2022, lightning, 3.8, 1.11, oldest) success
fabric-cpu (macOS-11, fabric, 3.8, 1.13) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.13) success
fabric-cpu (windows-2022, fabric, 3.8, 1.13) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/fabric/plugins/environments/cluster_environment.py, src/lightning/fabric/plugins/environments/slurm.py, src/lightning/fabric/plugins/environments/torchelastic.py, src/lightning/fabric/strategies/launchers/subprocess_script.py, tests/tests_fabric/plugins/environments/test_slurm.py, tests/tests_fabric/plugins/environments/test_torchelastic.py, tests/tests_fabric/strategies/launchers/test_subprocess_script.py.

🟢 lightning_fabric: Azure GPU
Check ID Status
[lightning-fabric (GPUs) (testing Fabric latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=169596&view=logs&jobId=3f274fac-2e11-54ca-487e-194c91f3ae9f) success
[lightning-fabric (GPUs) (testing Lightning latest)](https://dev.azure.com/Lightning-AI/72ab7ed8-b00f-4b6e-b131-3388f7ffafa7/_build/results?buildId=169596&view=logs&jobId=47e66f3c-897a-5428-da11-bf5c7745762e) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/fabric/plugins/environments/cluster_environment.py, src/lightning/fabric/plugins/environments/slurm.py, src/lightning/fabric/plugins/environments/torchelastic.py, src/lightning/fabric/strategies/launchers/subprocess_script.py, tests/tests_fabric/plugins/environments/test_slurm.py, tests/tests_fabric/plugins/environments/test_torchelastic.py, tests/tests_fabric/strategies/launchers/test_subprocess_script.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/fabric/plugins/environments/cluster_environment.py, src/lightning/fabric/plugins/environments/slurm.py, src/lightning/fabric/plugins/environments/torchelastic.py, src/lightning/fabric/strategies/launchers/subprocess_script.py, src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.10) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.10) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.10) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.10) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.10) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.10) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.10) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.10) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.10) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.10) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.10) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.10) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.10) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.10) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.10) success

These checks are required after the changes to src/lightning/fabric/connector.py, src/lightning/fabric/plugins/environments/cluster_environment.py, src/lightning/fabric/plugins/environments/slurm.py, src/lightning/fabric/plugins/environments/torchelastic.py, src/lightning/fabric/strategies/launchers/subprocess_script.py, src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/connectors/accelerator_connector.py.

🟢 link-check
Check ID Status
check-md-links / markdown-link-check success

These checks are required after the changes to src/lightning/fabric/CHANGELOG.md, src/lightning/pytorch/CHANGELOG.md.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@mergify mergify bot added the ready PRs ready to be merged label Aug 12, 2023
@awaelchli awaelchli merged commit c95dbac into master Aug 12, 2023
109 of 115 checks passed
@awaelchli awaelchli deleted the feature/validate-cluster-env-settings branch August 12, 2023 19:26
f" `--ntasks-per-node={ntasks_per_node}` does not match. HINT: Set `devices={ntasks_per_node}`."
)
nnodes = os.environ.get("SLURM_NNODES")
if nnodes is not None and int(nnodes) != num_nodes:

Hi @awaelchli, just a minor note. I came across this error when using the Lightning nightly build:

ValueError: You set `num_nodes=4` in Lightning, but the number of nodes configured in SLURM `--nodes=4` does not match. HINT: Set `num_nodes=4`.

I wasn't able to investigate this deeply but wanted to flag it here in case it's of help to you. I'm not sure whether it happens because of the type of num_nodes, but casting it to int gets the code through.

Member Author

Hey @Sumith1896
That's very odd. I don't understand how the check above can pass 4 != 4 when the raised error message shows that both nnodes=4 and num_nodes=4.

int(nnodes) makes it an int and num_nodes should already be one.

> but casting it to int gets the code through.

Could you tell me where a cast to int is needed? Is num_nodes not an int for some reason?


num_nodes wasn't an int in my case. I'm quite puzzled myself, as the same Lightning env worked fine on different compute instances/SLURM setups. I'm not sure if it's specific to a SLURM version or instance. This same PR seems to have removed the cast of num_nodes to int; maybe that's the cause. I can also help narrow it down for you on my end if needed.
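For readers following along, the puzzling "4 does not match 4" message is what one would expect if `num_nodes` arrived as the string `"4"`: the comparison sees an int on one side and a str on the other, while the f-string renders both as `4`. A minimal sketch, with the values assumed for illustration:

```python
nnodes = "4"     # SLURM_NNODES as read from the environment, always a string
num_nodes = "4"  # assumed: a string that leaked in where an int was expected

# An int and a str never compare equal in Python, so the check triggers ...
mismatch = int(nnodes) != num_nodes

# ... yet the formatted message shows two identical-looking values.
message = (
    f"You set `num_nodes={num_nodes}` in Lightning, but the number of nodes configured in SLURM"
    f" `--nodes={nnodes}` does not match. HINT: Set `num_nodes={nnodes}`."
)
```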

Member Author

Are you using Trainer or Fabric?
I removed the cast because I concluded it was redundant: earlier in the code, num_nodes is assumed to be an int and is converted to int. In Fabric, this happens at the very beginning here: https://github.com/Lightning-AI/lightning/blob/32415c68ae981a83c72d9587cb02769a4a5b894e/src/lightning/fabric/connector.py#L116
And in Trainer, the num_nodes input is expected to be int from the start: https://github.com/Lightning-AI/lightning/blob/32415c68ae981a83c72d9587cb02769a4a5b894e/src/lightning/pytorch/trainer/connectors/accelerator_connector.py#L81

Random guess: do you have an argparse parser that takes the num_nodes argument from the command line and keeps it as a string instead of converting it to int?
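The scenario in this guess is easy to reproduce with the standard library. The sketch below is illustrative only; the argument names `--num_nodes_raw` and `--num_nodes` are hypothetical and not taken from the reporter's script.

```python
import argparse

parser = argparse.ArgumentParser()
# Without `type=...`, values given on the command line stay strings.
parser.add_argument("--num_nodes_raw", default=1)
# With `type=int`, argparse converts the value before storing it.
parser.add_argument("--num_nodes", type=int, default=1)

args = parser.parse_args(["--num_nodes_raw", "4", "--num_nodes", "4"])
raw_type = type(args.num_nodes_raw).__name__    # type when the flag is given
typed_type = type(args.num_nodes).__name__

# When the flag is omitted, the non-string default is kept as-is.
defaults = parser.parse_args([])
default_type = type(defaults.num_nodes_raw).__name__
```

Because the untyped argument keeps its int default when the flag is omitted but becomes a str when the flag is passed, the type of the value can differ from one invocation to the next.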

Member Author

@awaelchli awaelchli Sep 20, 2023

Could you do me a favor and check on your system where in the connector I linked above num_nodes is not an int? Is it at the input or later in the code?


@awaelchli ah yes! It's a faulty argparse setup that sends num_nodes as a str. I'm a bit surprised by the non-deterministic behavior, but it doesn't seem to be Lightning's fault. One could also check whether num_nodes is a str that can be cast to int and do so if possible, but I'll let you make the call. Thanks for engaging!

Member Author

Thanks for confirming! I added input validation yesterday in #18598; this should help avoid any confusion in the future.
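To make the idea of input validation concrete, here is one possible shape for such a check, combining it with the str-to-int coercion suggested earlier in the thread. The function `check_num_nodes` is hypothetical; this is not the code from #18598, and whether Lightning coerces or rejects strings is not stated here.

```python
def check_num_nodes(num_nodes: object) -> int:
    """Hypothetical validator: accept ints and integer-like strings, reject the rest."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(num_nodes, bool) or not isinstance(num_nodes, (int, str)):
        raise TypeError(f"`num_nodes` must be an int, got {num_nodes!r} ({type(num_nodes).__name__}).")
    try:
        value = int(num_nodes)  # also accepts strings such as "4"
    except ValueError:
        raise TypeError(f"`num_nodes` must be an int, got {num_nodes!r}.") from None
    if value < 1:
        raise ValueError(f"`num_nodes` must be a positive integer, got {value}.")
    return value
```

Failing early with a message that names the offending value and type avoids the misleading "4 does not match 4" error further down the line.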

Labels
fabric lightning.fabric.Fabric fun Staff contributions outside working hours - to differentiate from the "community" label pl Generic label for PyTorch Lightning package ready PRs ready to be merged
Projects
None yet
4 participants