Support Slurm Autorequeue for Array Jobs #15040
Signed-off-by: Max Ehrlich <max.ehr@gmail.com>
I'm pretty confident in the test cases I wrote but I'm leaving this in draft mode until I get a chance to run it on one of our clusters. Should be before next week.
This is confirmed working on one of our clusters so I think we're good to go.
Not sure why the Python 3.7 tests are failing; when I run them locally with Python 3.7.14 they work.
@Queuecumber I fixed the tests.
Thanks for extending this feature!!
@@ -78,7 +78,13 @@ def slurm_sigusr_handler_fn(self, signum: _SIGNUM, frame: FrameType) -> None:

        if self.trainer.is_global_zero:
            # find job id
            job_id = os.environ["SLURM_JOB_ID"]
            array_job_id = os.getenv("SLURM_ARRAY_JOB_ID")
Do you think we should integrate this into SLURMEnvironment class similar to job_id?
Yeah actually the entire function should probably be in the slurm environment since that would keep all the slurm stuff in one place
Resolve to merge, but you should follow-up on this
Thanks for merging. I definitely could have included this in the PR if you wanted; although it's a somewhat invasive change, there's probably not much risk to it either.
Signed-off-by: Max Ehrlich <max.ehr@gmail.com> Co-authored-by: awaelchli <aedu.waelchli@gmail.com>
What does this PR do?
This PR adds support for autorequeueing slurm array jobs.
Array jobs are groups of jobs which are grouped in some meaningful way. Slurm provides facilities to quickly (i.e. with less load on the scheduler) create such jobs.
These jobs behave slightly differently from regular jobs.
Although they have SLURM_JOB_ID set, and it is unique for each job in the array, some commands, notably scontrol, differ in their meaning when the JOB_ID is used vs. the array-specific SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID. For example, if scontrol is called with the SLURM_JOB_ID of the "parent job" it will apply the scontrol command to all array jobs. In order to requeue (for example) only a specific array job, the string <array job id>_<array_task_id> needs to be used.

For example, if a job has JOB_ID 1234, the first array job will have ARRAY_JOB_ID 1234 and TASK_ID 1, and the second job will have JOB_ID 1235, ARRAY_JOB_ID 1234 and TASK_ID 2. In this case scontrol requeue 1234 would requeue both jobs in the array, scontrol requeue 1234_1 would requeue only the first job in the array, and scontrol requeue 1234_2 would requeue only the second.

Since Lightning was only checking the JOB_ID, this was causing it to prematurely requeue some array jobs. This was causing lots of weird issues like failed requeues and timeouts with no requeue attempted.

Fixes #15022
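For reviewers, the ID selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `requeue_target` and `requeue_command` are hypothetical, and the environment is passed in as a mapping only so the logic can be exercised without a Slurm cluster.

```python
import os
from typing import List, Mapping, Optional


def requeue_target(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the identifier to pass to `scontrol requeue` (hypothetical helper)."""
    env = os.environ if env is None else env
    array_job_id = env.get("SLURM_ARRAY_JOB_ID")
    if array_job_id is not None:
        # Array job: address only this task with <array job id>_<array task id>,
        # so the other jobs in the array are left alone.
        return f"{array_job_id}_{env['SLURM_ARRAY_TASK_ID']}"
    # Regular job: the plain job id is enough.
    return env["SLURM_JOB_ID"]


def requeue_command(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the argument list for the requeue call (hypothetical helper)."""
    return ["scontrol", "requeue", requeue_target(env)]
```

With the second job from the example (JOB_ID 1235, ARRAY_JOB_ID 1234, TASK_ID 2) this yields the target 1234_2, while a job without SLURM_ARRAY_JOB_ID set falls back to its SLURM_JOB_ID.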
Does your PR introduce any breaking changes? If yes, please list them.
None
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃