PP-657: scheduler loops infinitely after subjob fails to start on node #264

arungrover · 2017-03-02T19:31:57Z

Issue-ID

PP-657

Problem description

When the server is unable to send a subjob to a MoM, the scheduler enters an infinite loop and never exits it.

Cause / Analysis

There were 3 problems here:
1 - The server did not set the qrank value on subjobs when enqueueing them. As a result, the subjob was placed first in the server's job list.
2 - The scheduler assumed that a subjob is always reported after its parent, and had no protective check for the case where this assumption fails. This caused the infinite loop.
3 - Once a subjob fails to be sent to a MoM, it is never deleted. This problem has existed since we moved to the async run job implementation. Because the subjob is never deleted, the scheduler can never run it again: the server is written with the assumption that a queued subjob is running, so it rejects every subsequent run request from the scheduler.
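
To illustrate the first problem, here is a minimal sketch in Python (not the actual server code, which is C). It assumes the job list is kept sorted by qrank on insertion and that an unset qrank defaults to 0. The `Job` class, the `enqueue` helper, and the job IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    qrank: int = 0  # assumed default when the rank is never set


def enqueue(job_list, job):
    """Insert job before the first entry with a larger qrank (hypothetical helper)."""
    for i, existing in enumerate(job_list):
        if job.qrank < existing.qrank:
            job_list.insert(i, job)
            return
    job_list.append(job)


jobs = []
enqueue(jobs, Job("100[].server", qrank=1000))  # array parent
enqueue(jobs, Job("101.server", qrank=1001))    # unrelated job
enqueue(jobs, Job("100[1].server"))             # subjob whose qrank was never set

order = [j.job_id for j in jobs]
print(order)
```

Under these assumptions the subjob sorts ahead of its own parent, which is the ordering the scheduler did not expect.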

Solution description

Set the qrank on a subjob when it is queued.
Change the scheduler to check for subjobs found without an array parent (just in case) and, if any are found, assign them the parent job's address.
Make sure the subjob gets deleted when send_job_exec fails.
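
The scheduler-side check can be sketched roughly as follows. This is an illustrative Python version, not the C change in this pull request. It assumes subjob names have the form `N[i].server` and parents `N[].server`; `JobInfo`, `parent_name`, and `link_orphan_subjobs` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobInfo:
    name: str
    parent: Optional["JobInfo"] = None


def parent_name(job_name):
    """Map '100[1].server' to '100[].server'; return None for non-subjobs."""
    m = re.fullmatch(r"(\d+)\[\d+\](.*)", job_name)
    return f"{m.group(1)}[]{m.group(2)}" if m else None


def link_orphan_subjobs(jobs):
    """Give every parentless subjob a reference to its array parent.

    Works regardless of the order in which jobs were reported.
    Returns the subjobs whose parent is not in the list at all.
    """
    by_name = {j.name: j for j in jobs}
    unresolved = []
    for job in jobs:
        pname = parent_name(job.name)
        if pname is None or job.parent is not None:
            continue  # not a subjob, or already linked
        parent = by_name.get(pname)
        if parent is None:
            unresolved.append(job)
        else:
            job.parent = parent
    return unresolved


# The subjob is reported before its parent, as in the failure case above.
reported = [JobInfo("100[1].server"), JobInfo("100[].server"), JobInfo("101.server")]
unresolved = link_orphan_subjobs(reported)
```

Resolving parents through a lookup over the whole list, rather than relying on report order, means a misplaced subjob no longer leaves the scheduler with a missing parent reference.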

Checklist:

I have joined the pbspro community forum.
My pull request contains a single, signed commit. See setting up gpg signature.
My code follows the coding style of this project.
My change requires project documentation. See required documentation checklist for details.
I have added documentation in the project documentation area.
I have added new PTL test(s) to my commit. (See using PTL for testing) (or)
I have added manual test(s) to the Jira ticket and explained why PTL is not appropriate for this case.
All new and existing automated tests have passed. (See running automated PTL tests).
I have attached test logs to the Jira ticket as evidence of testing/verification.

For further information please visit the Developer Guide Home.

bhroam

Looks good to me.

PP-657: scheduler infinite loop after subjob fails to start on node

41bdd80

arungrover changed the title ~~PP-657: scheduler infinite loop after subjob fails to start on node~~ PP-657: scheduler loops infinitely after subjob fails to start on node Mar 2, 2017

bhroam approved these changes Mar 2, 2017

mike0042 approved these changes Mar 2, 2017

mike0042 merged commit 41bdd80 into openpbs:master Mar 2, 2017

Shrini-h mentioned this pull request Jan 9, 2020

Fixing unknown state message in logs for maintainence state #1467

Merged
