Fix issue where job steps wouldn't run if the first node was full
In a multi-node job it was possible to reach a state where CPUs were still available for steps to use, but new steps would not launch. For example, if each node has 2 cores and 1 thread per core and this job is submitted:

    sbatch -N2 --ntasks-per-node=2 --mem=1000 job.bash

And job.bash contains the following:

    for i in {1..4}
    do
        srun --exact --mem=100 -N1 -c1 -n1 sleep 60 &
    done
    wait

In this case, two steps would run on the first node and one step on the second node, but the fourth step would not run until the first step completed, even though an available task and CPU remained on the second node in the allocation.

Why does this happen? If the step requests a number of CPUs <= the number of nodes, then when _pick_step_nodes() calls _pick_step_nodes_cpus():

    node_tmp = _pick_step_nodes_cpus(job_ptr, nodes_avail,
                                     nodes_needed, cpus_needed,
                                     usable_cpu_cnt);

it simply returns the first N nodes from the nodes_avail bitmap, where N is the number of nodes that the step requested. In this example job, all the CPUs on the first node are allocated, but the first node remains in the nodes_avail bitmap, so _pick_step_nodes_cpus() selects it and adds it to the nodes_picked bitmap. Right after that, _pick_step_nodes() counts the CPUs available on the nodes in the nodes_picked bitmap and gets 0.

The fix is to remove fully allocated nodes from the nodes_avail bitmap. On its own, though, this creates a new problem: once all the nodes are fully allocated and another valid step request arrives, the incorrect error ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE would be returned, when the correct error is ESLURM_NODES_BUSY. So we also increment job_blocked_nodes for each node that has no available CPUs.
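Below is a minimal, self-contained sketch of the shape of the fix, illustrative rather than the actual patch: plain bool arrays stand in for Slurm's bitmaps, pick_first_avail() is a hypothetical stand-in for the first-N-nodes selection in _pick_step_nodes_cpus(), and NODE_CNT and the usable_cpu_cnt values are made-up numbers chosen to match the example above.

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative sketch only, not the actual Slurm code. */

    #define NODE_CNT 2

    /* Mimics the flawed selection: take the first node still present in
     * nodes_avail, ignoring whether it has any usable CPUs left. */
    static int pick_first_avail(const bool nodes_avail[])
    {
        for (int i = 0; i < NODE_CNT; i++)
            if (nodes_avail[i])
                return i;
        return -1;    /* no node available */
    }

    int main(void)
    {
        /* After three sleep steps start: node 0 has no free CPUs. */
        int usable_cpu_cnt[NODE_CNT] = { 0, 1 };
        bool nodes_avail[NODE_CNT] = { true, true };
        int job_blocked_nodes = 0;

        printf("before fix, step 4 lands on node %d (0 usable CPUs)\n",
               pick_first_avail(nodes_avail));

        /* The fix: drop fully allocated nodes from nodes_avail, and
         * count each as blocked so a request that still cannot run
         * reports ESLURM_NODES_BUSY rather than
         * ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE. */
        for (int i = 0; i < NODE_CNT; i++) {
            if (nodes_avail[i] && usable_cpu_cnt[i] == 0) {
                nodes_avail[i] = false;
                job_blocked_nodes++;
            }
        }

        printf("after fix, step 4 lands on node %d (job_blocked_nodes=%d)\n",
               pick_first_avail(nodes_avail), job_blocked_nodes);
        return 0;
    }

Clearing full nodes before selection keeps the cheap first-N-nodes pick correct, while the blocked-node count preserves the distinction between "busy right now" and "can never fit".

Bug 11357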