PP-657: scheduler loops infinitely after subjob fails to start on node #264

Merged
merged 1 commit into from
Mar 2, 2017

arungrover
Contributor

Issue-ID

Problem description

When the server is unable to send a subjob to a MoM, the scheduler enters an infinite loop and never exits it.

Cause / Analysis

There were three problems here:
1 - The server was not setting the qrank value for subjobs when enqueueing them. As a result, the subjob was placed first in the server's job list.
2 - The scheduler assumed that a subjob is always reported after its parent, so it had no protective checks for the case where this assumption does not hold. This resulted in the infinite loop.
3 - Once a subjob fails to reach a MoM, it never gets deleted. This problem has existed since we moved to the async run job implementation. Because the subjob is never deleted, the scheduler can never run it again: the server is written with the assumption that if a subjob is queued, it is running, so it rejects every subsequent run request from the scheduler.
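
To make the first problem concrete, here is a minimal sketch of how an unset rank can put a subjob ahead of its parent. It assumes the job list is kept in ascending qrank order and that an unset qrank behaves like 0. The names `toy_job` and `enqueue_by_qrank` are hypothetical and are not the server's actual structures or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified job record; not the real server job structure. */
struct toy_job {
    const char *id;
    long qrank;              /* ordering key; 0 stands in for "never set" */
    struct toy_job *next;
};

/*
 * Insert a job into a singly linked list kept in ascending qrank order.
 * Jobs with equal qrank keep their arrival order. Returns the new head.
 */
struct toy_job *enqueue_by_qrank(struct toy_job *head, struct toy_job *job)
{
    struct toy_job **link = &head;

    assert(job != NULL);

    /* Walk past every job whose rank is not greater than the new job's. */
    while (*link != NULL && (*link)->qrank <= job->qrank)
        link = &(*link)->next;

    job->next = *link;
    *link = job;
    return head;
}
```

With a parent at qrank 100 and a subjob enqueued with qrank 0, the subjob becomes the head of the list, so it is reported before its parent. Giving the subjob a fresh rank at enqueue time places it after the parent instead.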

Solution description

  • Set the qrank on a subjob when it is queued.
  • Change the scheduler to check for subjobs found without an array parent (just in case) and, if any are found, assign them the parent job's address.
  • Make sure the subjob gets deleted when send_job_exec fails.
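
The scheduler-side check in the second bullet can be sketched roughly as follows. This is an illustration under assumptions, not the code in this change: `toy_resresv`, `find_by_name`, and `link_subjobs` are hypothetical names, and the real scheduler structures differ. The point is that the parent lookup does not depend on the order in which jobs are reported, and that a missing parent is counted rather than retried.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified scheduler-side job record. */
struct toy_resresv {
    const char *name;            /* e.g. "123[1]" or "123[]" */
    const char *array_id;        /* parent's name for a subjob, NULL otherwise */
    struct toy_resresv *parent;  /* resolved parent, NULL until linked */
};

/* Linear search over the whole job array, independent of report order. */
static struct toy_resresv *find_by_name(struct toy_resresv *jobs, size_t n,
                                        const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(jobs[i].name, name) == 0)
            return &jobs[i];
    return NULL;
}

/*
 * Give every subjob that has no parent pointer the address of its parent.
 * Returns the number of subjobs whose parent could not be found, so the
 * caller can skip them instead of looping on them.
 */
size_t link_subjobs(struct toy_resresv *jobs, size_t n)
{
    size_t orphans = 0;

    assert(jobs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        struct toy_resresv *j = &jobs[i];

        if (j->array_id == NULL || j->parent != NULL)
            continue;            /* not a subjob, or already linked */

        j->parent = find_by_name(jobs, n, j->array_id);
        if (j->parent == NULL)
            orphans++;
    }
    return orphans;
}
```

Each job is visited exactly once, so the pass terminates even when a subjob appears before its parent or the parent is missing altogether.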


@arungrover arungrover changed the title PP-657: scheduler infinite loop after subjob fails to start on node PP-657: scheduler loops infinitely after subjob fails to start on node Mar 2, 2017
Contributor

@bhroam bhroam left a comment

Looks good to me.
