Release lock on bounce when bounced with no running instances #1553

Merged
3 commits merged on Jun 19, 2017

PtrTeixeira
Contributor

Currently the lock on the bounce is released when all the tasks that were running at the time of the bounce have been killed. This means that when a bounce happens while there are no running tasks (for example, when Singularity is still starting tasks after every task has failed), the expiring bounce can't be removed. This shims a fix by removing the lock when the scheduler schedules the new tasks, provided that there are no currently running tasks, more tasks are being added, and there is an expiring bounce on the request. This shouldn't interfere with a scale or an incremental bounce, because they always keep at least the minimum number of instances.

I am concerned that it releases the bounce lock too early, though, because normally the bounce is guaranteed to be present until all the new instances have successfully started, which isn't something that this change preserves.
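As a rough illustration of the behavior described above, here is a toy model of the bounce lock. It is only a sketch: `BounceLockSketch` and all of its methods are hypothetical names invented for this example and are not part of Singularity. It assumes the lock is re-examined only when a bounced task is killed, which is why a bounce that starts with zero running tasks never releases it, and it models the shim as an extra check at scheduling time.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy model; none of these names come from Singularity itself.
final class BounceLockSketch {
  private final Set<String> tasksToKill = new HashSet<>();
  private boolean bounceLockHeld = false;

  // Starting a bounce takes the lock and records the tasks running right now.
  void startBounce(Set<String> runningTaskIds) {
    bounceLockHeld = true;
    tasksToKill.clear();
    tasksToKill.addAll(runningTaskIds);
  }

  // Assumed pre-fix behavior: the lock is only released from the kill path,
  // once the last task that was running at bounce time has been killed.
  void onTaskKilled(String taskId) {
    if (tasksToKill.remove(taskId) && tasksToKill.isEmpty()) {
      bounceLockHeld = false;
    }
  }

  // The shim's condition: nothing is running, more instances are being
  // scheduled, and the request has an expiring bounce.
  static boolean shouldReleaseOnSchedule(int activeTasks, int numMissingInstances, boolean hasExpiringBounce) {
    return activeTasks == 0 && numMissingInstances > 0 && hasExpiringBounce;
  }

  // Called after new tasks are scheduled; releases the lock if the shim applies.
  void onSchedule(int activeTasks, int numMissingInstances, boolean hasExpiringBounce) {
    if (bounceLockHeld && shouldReleaseOnSchedule(activeTasks, numMissingInstances, hasExpiringBounce)) {
      bounceLockHeld = false;
    }
  }

  boolean isBounceLockHeld() {
    return bounceLockHeld;
  }
}
```

In this model, calling `startBounce` with an empty set leaves the lock held indefinitely, because `onTaskKilled` is never invoked; only `onSchedule` with zero active tasks releases it. The sketch also makes the concern visible: the lock is dropped as soon as tasks are scheduled, not once they have started.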

/cc @ssalinas

If a bounce occurs in the quiet period between a task failing and a new
task being launched by Singularity, it seems that the new tasks are
launched and marked as having been started by the bounce, but because no
tasks were ever killed, the bounce isn't marked as finished. Add test
case to reproduce the issue; will work on fix next.

Shim a fix by removing the bounce lock after scheduling the new tasks
when there is a bounce lock and there are currently no active tasks.  This
means that the bounce lock is removed before those tasks become active.
I don't think that this will create any problems; this should be the
only case where there is a bounce present, more instances are being
scheduled, and there are currently no running instances, but it's an odd
enough case that I'm not really sure.
@@ -411,6 +412,12 @@ private int scheduleTasks(SingularitySchedulerStateCache stateCache, Singularity

    if (numMissingInstances > 0) {
      schedule(numMissingInstances, matchingTaskIds, request, state, deployStatistics, pendingRequest, maybePendingDeploy);

      List<SingularityTaskId> remainingActiveTasks = new ArrayList<>(matchingTaskIds);
Member

@ssalinas ssalinas May 25, 2017

Logic here is good, just debating if this is the right place to put this. Ideally, since bounce is all cleanup-related, this would live in the SingularityCleaner. Maybe it makes sense for each run of the cleaner to double-check any in-progress bounces or requests that are currently marked as bouncing and verify that they are correct?

Having it here will solve the current edge case we ran into, but I could see there still being more. Moving something like that to the cleaner might give us a better long-term solution.

Move the bounce release to the cleaner, which is a more reasonable place
for it than the scheduler. The bounce gets released slightly earlier now;
in particular, it is released immediately after a pending request is
enqueued to schedule more tasks.
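A minimal sketch of what a cleaner-side check along these lines might look like. This is not the code from the PR: `BouncingRequest`, `CleanerBounceCheck`, and `findStuckBounces` are hypothetical names, and the stuck condition (no active tasks left to kill, plus a pending request already enqueued to schedule replacements) is an assumption based on the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical snapshot of a request that is currently marked as bouncing.
final class BouncingRequest {
  final String requestId;
  final int activeTaskCount;
  final boolean hasPendingScheduleRequest;

  BouncingRequest(String requestId, int activeTaskCount, boolean hasPendingScheduleRequest) {
    this.requestId = requestId;
    this.activeTaskCount = activeTaskCount;
    this.hasPendingScheduleRequest = hasPendingScheduleRequest;
  }
}

// Hypothetical helper a cleaner run could call on every pass.
final class CleanerBounceCheck {
  // Returns the IDs of bouncing requests whose bounce lock looks safe to release:
  // there is nothing left to kill, and replacement tasks are already queued.
  static List<String> findStuckBounces(List<BouncingRequest> bouncingRequests) {
    List<String> stuck = new ArrayList<>();
    for (BouncingRequest request : bouncingRequests) {
      if (request.activeTaskCount == 0 && request.hasPendingScheduleRequest) {
        stuck.add(request.requestId);
      }
    }
    return stuck;
  }
}
```

Running a check like this on each cleaner pass, rather than once in the scheduler, means a bounce that slips through one edge case can still be caught on a later pass.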
@PtrTeixeira changed the title from "[WIP] Release lock on bounce when bounced with no running instances" to "Release lock on bounce when bounced with no running instances" on May 26, 2017
@ssalinas modified the milestone: 0.16.0 on Jun 2, 2017
@ssalinas merged commit a70f552 into master on Jun 19, 2017
@ssalinas deleted the catch-stuck-bounce branch on June 19, 2017 18:15