Sched crash when array job and reservation is submitted #1109

arungrover · 2019-05-01T22:06:37Z

Describe Bug or Feature

Scheduler crashes when it encounters an array job and a reservation in the same scheduling cycle.

Describe Your Change

To speed up the scheduler, we added a way to find a job or reservation by its index in the all_resresv array.
After these indexes are created, however, the all_resresv array is sorted to bring timed events to the front so that a calendar can be built. The sort invalidates the stored indexes and causes the crash. The sort itself is needed, but there is no reason for all_resresv to be the array that gets sorted.
As the fix, the create_events code, which builds the calendar, now makes a local copy of the array and sorts that copy instead of sorting all_resresv.
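
The idea can be illustrated with a minimal, standalone sketch. It is not the scheduler code: `item`, `count_items`, `cmp_timed_first`, and `sorted_copy` are hypothetical names, and `item` is a stand-in with only the two fields needed here. It shows a NULL-terminated pointer array being shallow-copied and the copy sorted, so that indexes recorded against the master array stay valid.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the scheduler's resource_resv type. */
typedef struct {
    int index;   /* position in the master array, recorded before any sort */
    long start;  /* event time; 0 means "not a timed event" in this sketch */
} item;

/* Count the entries of a NULL-terminated pointer array. */
static size_t count_items(item **arr)
{
    size_t n = 0;
    if (arr == NULL)
        return 0;
    while (arr[n] != NULL)
        n++;
    return n;
}

/* qsort comparator: timed items first, ordered by ascending start time. */
static int cmp_timed_first(const void *a, const void *b)
{
    const item *x = *(item *const *)a;
    const item *y = *(item *const *)b;
    if ((x->start != 0) != (y->start != 0))
        return (x->start != 0) ? -1 : 1;
    if (x->start < y->start)
        return -1;
    if (x->start > y->start)
        return 1;
    return 0;
}

/* Return a sorted shallow copy of master; master itself is left untouched.
 * The caller frees only the returned array, not the items it points to. */
static item **sorted_copy(item **master)
{
    size_t n = count_items(master);
    item **copy = malloc((n + 1) * sizeof(item *));
    if (copy == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        copy[i] = master[i];
    copy[n] = NULL;  /* keep the NULL terminator */
    qsort(copy, n, sizeof(item *), cmp_timed_first);
    return copy;
}
```

Because only pointers are copied, the extra cost is one allocation of pointer-sized slots per call, and a lookup such as `master[it->index]` still returns `it` after the copy has been sorted.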

Link to Design Doc

None

Attach Test Logs or Output

sched_dump_valgrind.txt
test_out.txt

bhroam · 2019-05-01T22:08:37Z

src/scheduler/simulate.c

 	 */
-	all = sinfo->all_resresv;
+	all_resresv_len = count_array((void **)sinfo->all_resresv);
+	all_resresv_copy = (resource_resv **)malloc((all_resresv_len + 1) * sizeof(resource_resv *));


There is no need to cast the return value of malloc().

bhroam · 2019-05-01T22:13:32Z

src/scheduler/simulate.c

 				return 0;
+			}


I think this is actually a bug. If we return here, we'll leak events. Instead of returning, errflag++ and break (that is what happens above). That will cause the code below to free everything and return.
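
The pattern suggested here can be sketched in isolation. This is an illustration under stated assumptions, not the code in simulate.c: `event`, `free_events`, and `build_events` are hypothetical names, and a negative time is used to simulate a failed event creation. On failure the loop increments `errflag` and breaks, so a single cleanup path frees whatever was already built.

```c
#include <stdlib.h>

/* Hypothetical singly linked event list for illustration only. */
typedef struct event {
    long when;
    struct event *next;
} event;

/* Free every node of the list. */
static void free_events(event *head)
{
    while (head != NULL) {
        event *next = head->next;
        free(head);
        head = next;
    }
}

/* Build a list from times[0..n-1]. A negative time stands in for a failed
 * event creation. Returns NULL on failure without leaking earlier nodes. */
static event *build_events(const long *times, size_t n)
{
    event *head = NULL, *tail = NULL;
    int errflag = 0;

    for (size_t i = 0; i < n; i++) {
        event *e = (times[i] < 0) ? NULL : malloc(sizeof(event));
        if (e == NULL) {
            errflag++;  /* record the failure instead of returning here */
            break;
        }
        e->when = times[i];
        e->next = NULL;
        if (tail == NULL)
            head = e;
        else
            tail->next = e;
        tail = e;
    }

    if (errflag) {
        free_events(head);  /* single cleanup path for partial results */
        return NULL;
    }
    return head;
}
```

An early `return` inside the loop would skip `free_events` and leak the nodes created in earlier iterations, which is the leak described in the comment above.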

bhroam · 2019-05-01T22:15:01Z

test/tests/pbs_smoketest.py

-        r = Reservation()
-        a = {'Resource_List.select': '1:ncpus=1'}
+        a = {'resources_available.ncpus': 4}
+        self.server.manager(MGR_CMD_SET, NODE, a, id=self.mom.shortname)


If you do this, I think you need to add the skip on cpuset decorator to this test. You can't modify the resources on a cpuset machine to differ from what the mom reported.

bhroam · 2019-05-01T22:16:02Z

test/tests/pbs_smoketest.py

+        r = Reservation(TEST_USER)
+        now = int(time.time())
+        a = {'Resource_List.select': '1:ncpus=4',
+             'reserve_start': now + 5,


5s might be too short on slow machines. I'd use 10s.

bhroam · 2019-05-01T22:16:58Z

test/tests/pbs_smoketest.py

+        jid1 = self.server.submit(j1)
+
+        a = {'Resource_List.select': '1:ncpus=1',
+             ATTR_q: rid.split('.')[0], ATTR_J: '1-2'}


Since you're splitting the rid twice, why not do it once and store it into a local variable?

arungrover · 2019-05-01T22:34:24Z

Thanks for the quick review, @bhroam. I've addressed your review comments.

bhroam

Looks good. I sign off.

nishiya

LGTM

Sched crash when array job is submitted to reservation

f40bd67

bhroam requested changes May 1, 2019

Addressed Bhroam's review comments

274d0a5

bhroam approved these changes May 1, 2019

nishiya approved these changes May 1, 2019

bhroam merged commit af90831 into openpbs:master May 1, 2019

arungrover added a commit to arungrover/openpbs that referenced this pull request Jun 7, 2019

Sched crash when array job and reservation is submitted (openpbs#1109)

fb11ea8

arungrover mentioned this pull request Jun 7, 2019

Sched crash when array job and reservation is submitted #1158

Merged
