Fix different issues when requesting memory per cpu/node.
The first issue was identified on multi-partition requests.
job_limits_check() was overriding the original memory requests, so when
Slurm validated limits against the next partition it was no longer using
the original values. The solution adds three members to the job_details
struct to preserve the original requests. This issue is reported in
bug 4895.
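
The idea of preserving the original request can be sketched as follows. This is only an illustration, not the code from the patch: the struct, the orig_* member names, and the helper functions are hypothetical, and the commit message does not say which three members were added.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the job_details struct. */
struct job_details_sketch {
	uint64_t pn_min_memory;       /* working value, may be adjusted */
	uint16_t cpus_per_task;       /* working value, may be adjusted */
	uint16_t pn_min_cpus;         /* working value, may be adjusted */
	uint64_t orig_pn_min_memory;  /* assumed name: preserved request */
	uint16_t orig_cpus_per_task;  /* assumed name: preserved request */
	uint16_t orig_pn_min_cpus;    /* assumed name: preserved request */
};

/* Record the user's request once, at submission time. */
void save_orig_request(struct job_details_sketch *d)
{
	assert(d);
	d->orig_pn_min_memory = d->pn_min_memory;
	d->orig_cpus_per_task = d->cpus_per_task;
	d->orig_pn_min_cpus = d->pn_min_cpus;
}

/* Reset the working values before validating against a partition. */
void restore_orig_request(struct job_details_sketch *d)
{
	assert(d);
	d->pn_min_memory = d->orig_pn_min_memory;
	d->cpus_per_task = d->orig_cpus_per_task;
	d->pn_min_cpus = d->orig_pn_min_cpus;
}

/*
 * Hypothetical per-partition check. Returns 1 if the request fits under
 * max_mem, 0 otherwise. On failure it clamps the working value, imitating
 * an in-place adjustment; restoring first keeps that adjustment from
 * leaking into the check of the next partition.
 */
int check_partition_sketch(struct job_details_sketch *d, uint64_t max_mem)
{
	assert(d);
	restore_orig_request(d);
	if (d->pn_min_memory > max_mem) {
		d->pn_min_memory = max_mem;
		return 0;
	}
	return 1;
}
```

Calling check_partition_sketch() once per candidate partition means each partition is validated against the request as submitted, regardless of what an earlier check changed.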

The second issue was that memory enforcement behaved differently
depending on whether the job request was issued against a reservation.

The third issue had to do with the automatic adjustments Slurm made
under the hood when the memory request exceeded the limit. These
adjustments included increasing pn_min_cpus (even, incorrectly, beyond
the number of CPUs available on the nodes) or various tricks that
increased cpus_per_task and decreased mem_per_cpu.

The fourth issue was identified when requesting the special case of 0
memory, which was handled inside the select plugin after the partition
validations and could therefore be used to incorrectly bypass the limits.
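
One way to close that gap is to resolve a zero-memory request to a concrete value before the limit is checked. The sketch below assumes, as the code removed from job_test.c does, that 0 means "all memory on the node" and that the effective value is the lowest available memory across the candidate nodes. The type and function names are hypothetical and this is not the patch's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a node's memory configuration. */
struct node_mem_sketch {
	uint64_t real_memory;     /* total memory on the node */
	uint64_t mem_spec_limit;  /* memory reserved for system use */
};

/* Lowest usable memory across the candidate nodes (0 if there are none). */
uint64_t lowest_avail_mem(const struct node_mem_sketch *nodes, size_t n)
{
	uint64_t lowest = 0;

	assert(nodes != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t avail = 0;

		/* Guard against unsigned underflow on odd configurations. */
		if (nodes[i].real_memory > nodes[i].mem_spec_limit)
			avail = nodes[i].real_memory - nodes[i].mem_spec_limit;
		if ((i == 0) || (avail < lowest))
			lowest = avail;
	}
	return lowest;
}

/*
 * Returns 1 if the request fits under max_mem_per_node, 0 otherwise.
 * A request of 0 is first replaced by the lowest available node memory,
 * so it is validated like any other request instead of skipping the check.
 */
int mem_request_within_limit(uint64_t requested,
			     const struct node_mem_sketch *nodes, size_t n,
			     uint64_t max_mem_per_node)
{
	uint64_t effective = requested;

	if (effective == 0)
		effective = lowest_avail_mem(nodes, n);
	return effective <= max_mem_per_node;
}
```

With this ordering, a zero request on large-memory nodes is rejected by a smaller partition limit rather than slipping past it.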

Issues 2-4 were identified in bug 4976.

The patch also includes a full refactor of how and when job memory is
set to default values (if not requested initially) and how and when
limits are validated.

Co-authored-by: Dominik Bartkiewicz <bart@schedmd.com>
asanchez1987 and fafik23 committed May 10, 2018
1 parent b67d735 commit bf4cb0b
Showing 6 changed files with 293 additions and 213 deletions.
1 change: 1 addition & 0 deletions NEWS
@@ -4,6 +4,7 @@ documents those changes that are of interest to users and administrators.
 * Changes in Slurm 17.11.7
 ==========================
  -- Fix for possible slurmctld daemon abort with NULL pointer.
+ -- Fix different issues when requesting memory per cpu/node.

 * Changes in Slurm 17.11.6
 ==========================
4 changes: 4 additions & 0 deletions src/plugins/sched/backfill/backfill.c
@@ -1401,6 +1401,10 @@ static int _attempt_backfill(void)
 			continue;
 		}
 		job_ptr->part_ptr = part_ptr;
+		if (job_limits_check(&job_ptr, true) != WAIT_NO_REASON) {
+			/* should never happen */
+			continue;
+		}

 		if (debug_flags & DEBUG_FLAG_BACKFILL) {
 			char job_id_str[64];
18 changes: 1 addition & 17 deletions src/plugins/select/cons_res/job_test.c
@@ -3858,23 +3858,7 @@ extern int cr_job_test(struct job_record *job_ptr, bitstr_t *node_bitmap,
 		for (i = 0; i < job_res->nhosts; i++) {
 			job_res->memory_allocated[i] = save_mem;
 		}
-	} else {	/* --mem=0, allocate job all memory on node */
-		uint64_t avail_mem, lowest_mem = 0;
-		first = bit_ffs(job_res->node_bitmap);
-		if (first != -1)
-			last = bit_fls(job_res->node_bitmap);
-		else
-			last = first - 1;
-		for (i = first, j = 0; i <= last; i++) {
-			if (!bit_test(job_res->node_bitmap, i))
-				continue;
-			avail_mem = select_node_record[i].real_memory -
-				    select_node_record[i].mem_spec_limit;
-			if ((j == 0) || (lowest_mem > avail_mem))
-				lowest_mem = avail_mem;
-			job_res->memory_allocated[j++] = avail_mem;
-		}
-		details_ptr->pn_min_memory = lowest_mem;
+	}

 	return error_code;
 }
