PP-657: scheduler loops infinitely after subjob fails to start on node #264
Issue-ID
Problem description
When the server is unable to send a subjob to a MoM, the scheduler enters an infinite loop and never exits it.
Cause / Analysis
There were three problems:
1 - The server was not setting the qrank value for subjobs when enqueueing them. As a result, the subjob was placed as the first job in the server's job list.
2 - The scheduler assumed that a subjob is always reported after its parent, so it had no protective checks for the case where this assumption does not hold. This resulted in the infinite loop.
3 - Once a subjob fails to reach a MoM, it never gets deleted. This problem has existed since we moved to the async run job implementation. Because the subjob is never deleted, the scheduler can never run it again: the server is written with the assumption that a queued subjob is running, so it rejects every subsequent run request from the scheduler.
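The interaction between problems 1 and 2 can be illustrated with a small, self-contained sketch. It is not the PBS code: `struct demo_job`, `insert_by_qrank`, and `parent_seen_before` are hypothetical names, and the sketch assumes the job list is kept in ascending qrank order. Under that assumption, a subjob whose qrank was never set (left at 0) sorts ahead of its parent, and a single bounded pass over the list can detect that case instead of looping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified job record; not the actual PBS structures. */
struct demo_job {
    const char *id;
    long qrank;               /* queue rank; 0 here stands for "never set" */
    struct demo_job *parent;  /* array parent for a subjob, otherwise NULL */
    struct demo_job *next;
};

/* Insert j into a list kept in ascending qrank order; returns the new head.
 * Ascending order is an assumption made for this illustration. */
struct demo_job *insert_by_qrank(struct demo_job *head, struct demo_job *j)
{
    assert(j != NULL);
    struct demo_job **pp = &head;
    while (*pp != NULL && (*pp)->qrank <= j->qrank)
        pp = &(*pp)->next;
    j->next = *pp;
    *pp = j;
    return head;
}

/* Protective check: returns 1 if the subjob's parent appears earlier in the
 * list than the subjob itself, 0 otherwise. The walk stops at the subjob or
 * at the end of the list, so it always terminates. */
int parent_seen_before(const struct demo_job *head, const struct demo_job *subjob)
{
    for (const struct demo_job *p = head; p != NULL && p != subjob; p = p->next)
        if (p == subjob->parent)
            return 1;
    return 0;
}
```

With a parent at qrank 100 and a subjob at qrank 0, the subjob becomes the list head and `parent_seen_before` returns 0, which a caller can treat as "skip this subjob for now" rather than waiting for the parent to appear. Once the subjob is given a qrank above its parent's, the parent comes first and the check returns 1.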
Solution description
Checklist:
For further information please visit the Developer Guide Home.