Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Slurm Autorequeue for Array Jobs #15040

Merged
merged 7 commits into from
Oct 10, 2022

Conversation

Queuecumber
Copy link
Contributor

@Queuecumber Queuecumber commented Oct 8, 2022

What does this PR do?

This PR adds support for autorequeueing slurm array jobs.

Array jobs are groups of jobs which are grouped in some meaningful way. Slurm provides facilities to quickly (i.e. with less load on the scheduler) create such jobs.

These jobs behave slightly differently from regular jobs.

Although they have SLURM_JOB_ID set, and it is unique for each job in the array, some commands, notably scontrol, differ in their meaning when the JOB_ID is used vs. the array specific SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID. For example, if scontrol is called with the SLURM_JOB_ID of the "parent job" it will apply the scontrol command to all array jobs. In order to requeue (for example) only a specific array job, the string <array job id>_<array_task_id> needs to be used.

For example if a job has JOB_ID 1234, the first array job will have ARRAY_JOB_ID 1234 and TASK_ID 1 and the second job will have JOB_ID 1235 ARRAY_JOB_ID 1234 and TASK_ID 2. In this case scontrol requeue 1234 would requeue both jobs in the array. scontrol requeue 1234_1 would requeue only the first job in the array and scontrol requeue 1234_2 would requeue only the second.

Since lightning was only checking the JOB_ID this was causing it to prematurely requeue some array jobs. This was causing lots of weird issues like failed requeues and timeouts with no requeue attempted.

Fixes #15022

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Oct 8, 2022
@Queuecumber
Copy link
Contributor Author

I'm pretty confident in the test cases I wrote but I'm leaving this in draft mode until I get a chance to run it on one of our clusters. Should be before next week.

Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
@Queuecumber
Copy link
Contributor Author

This is confirmed working on one of our clusters so I think we're good to go

Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
@Queuecumber
Copy link
Contributor Author

Not sure why the python 3.7 tests are failing, when I run them locally with python 3.7.14 they work

@awaelchli awaelchli added environment: slurm feature Is an improvement or enhancement labels Oct 9, 2022
@awaelchli awaelchli added this to the pl:future milestone Oct 9, 2022
@awaelchli awaelchli self-assigned this Oct 9, 2022
Copy link
Member

@awaelchli awaelchli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Queuecumber I fixed the tests.
Thanks for extending this feature!!

@@ -78,7 +78,13 @@ def slurm_sigusr_handler_fn(self, signum: _SIGNUM, frame: FrameType) -> None:

if self.trainer.is_global_zero:
# find job id
job_id = os.environ["SLURM_JOB_ID"]
array_job_id = os.getenv("SLURM_ARRAY_JOB_ID")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should integrate this into SLURMEnvironment class similar to job_id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah actually the entire function should probably be in the slurm environment since that would keep all the slurm stuff in one place

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolve to merge, but you should follow-up on this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for merging, I definitely could have included this in the PR if you wanted, although it's somewhat of an invasive change there's probably not much risk to it either

@awaelchli awaelchli added the community This PR is from the community label Oct 9, 2022
@carmocca carmocca enabled auto-merge (squash) October 10, 2022 11:43
@mergify mergify bot added the ready PRs ready to be merged label Oct 10, 2022
@carmocca carmocca merged commit 5a3007c into Lightning-AI:master Oct 10, 2022
@carmocca carmocca modified the milestones: pl:future, pl:1.7.x Oct 10, 2022
@carmocca carmocca modified the milestones: pl:1.7.x, pl:1.8 Oct 10, 2022
nicolai86 pushed a commit that referenced this pull request Oct 13, 2022
Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community This PR is from the community environment: slurm feature Is an improvement or enhancement pl Generic label for PyTorch Lightning package ready PRs ready to be merged
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Slurm Autorequeue Doesn't Work for Job Arrays
3 participants