Release lock on bounce when bounced with no running instances #1553

Merged: ssalinas merged 3 commits into master from catch-stuck-bounce on Jun 19, 2017

@PtrTeixeira
Contributor

PtrTeixeira commented May 24, 2017

Currently the bounce lock is released once all the tasks that were running at the time of the bounce have been killed. This means that when a bounce happens while there are no running tasks (for example, while Singularity is still starting tasks after every task has failed), the expiring bounce can't be removed. This shims a fix: when the scheduler schedules the new tasks, it removes the lock if there are no currently running tasks, more tasks are being added, and there is an expiring bounce on the request. This shouldn't interfere with a scale or an incremental bounce, because they always keep at least the minimum number of instances.

I am concerned that it releases the bounce lock too early, though, because normally the bounce is guaranteed to be present until all the new instances have successfully started, which isn't something that this change preserves.
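For readers following along, the release condition described above can be sketched roughly as follows. This is only an illustration of the three-part check, not the code in this PR: the class `BounceLockSketch`, the method `shouldReleaseBounceLock`, and the use of `java.util.Optional` and plain `String` IDs are all hypothetical stand-ins for whatever the scheduler actually uses.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; names and types are illustrative, not Singularity's API.
final class BounceLockSketch {
    private BounceLockSketch() {}

    /**
     * Returns true only when all three conditions from the description hold:
     * no tasks are currently running, more instances are about to be
     * scheduled, and the request has an expiring bounce.
     */
    static boolean shouldReleaseBounceLock(List<String> activeTaskIds,
                                           int numMissingInstances,
                                           Optional<String> expiringBounce) {
        boolean noRunningTasks = activeTaskIds.isEmpty();
        boolean schedulingMoreTasks = numMissingInstances > 0;
        return noRunningTasks && schedulingMoreTasks && expiringBounce.isPresent();
    }
}
```

Because a scale or an incremental bounce always leaves at least one running instance, the first condition would be false in those cases and the lock would be left alone.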

/cc @ssalinas

PtrTeixeira added some commits May 24, 2017

Reproduce issue causing bounce to get stuck
If a bounce occurs in the quiet period between a task failing and a new
task being launched by Singularity, it seems that the new tasks are
launched and marked as having been started by the bounce, but because no
tasks were ever killed, the bounce isn't marked as finished. Add test
case to reproduce the issue; will work on fix next.

Kill bounce on scheduling new tasks
Shim a fix by removing the bounce lock after scheduling the new tasks
when there is a bounce lock and there are currently no active tasks. This
means that the bounce lock is removed before those tasks become active.
I don't think that this will create any problems; this should be the
only case where there is a bounce present, more instances are being
scheduled, and there are currently no running instances, but it's an odd
enough case that I'm not really sure.
...rc/main/java/com/hubspot/singularity/scheduler/SingularityScheduler.java
@@ -411,6 +412,12 @@ private int scheduleTasks(SingularitySchedulerStateCache stateCache, Singularity
if (numMissingInstances > 0) {
schedule(numMissingInstances, matchingTaskIds, request, state, deployStatistics, pendingRequest, maybePendingDeploy);
List<SingularityTaskId> remainingActiveTasks = new ArrayList<>(matchingTaskIds);

@ssalinas

ssalinas May 25, 2017

Member

The logic here is good; I'm just debating whether this is the right place to put it. Ideally, since a bounce is all cleanup-related, this would live in the SingularityCleaner. Maybe it makes sense for each run of the cleaner to double-check any in-progress bounces, or requests that are currently marked as bouncing, and verify that they are correct?

Having it here will solve the current edge case we ran into, but I could see there still being more. Moving something like that to the cleaner might give us a better long-term solution.
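One way to picture the cleaner-side double check suggested above is the rough sketch below. It assumes, purely for illustration, that a bounce is "stuck" when a request is still marked as bouncing but has no tasks left waiting on bounce cleanup. The class `StuckBounceCheckSketch`, the method `findStuckBounces`, and the map of per-request counts are hypothetical and do not correspond to real SingularityCleaner code.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; names and data shapes are illustrative only.
final class StuckBounceCheckSketch {
    private StuckBounceCheckSketch() {}

    /**
     * Given the IDs of requests currently marked as bouncing and, per request,
     * the number of tasks still waiting on bounce cleanup, returns the requests
     * whose bounce has nothing left to clean up (assumed here to be stuck).
     */
    static Set<String> findStuckBounces(Set<String> bouncingRequestIds,
                                        Map<String, Integer> bounceCleanupTaskCounts) {
        Set<String> stuck = new TreeSet<>();
        for (String requestId : bouncingRequestIds) {
            // A request with no entry is treated as having zero tasks to clean up.
            int remaining = bounceCleanupTaskCounts.getOrDefault(requestId, 0);
            if (remaining == 0) {
                stuck.add(requestId);
            }
        }
        return stuck;
    }
}
```

A check like this could run once per cleaner pass, with the bounce lock released for each returned request; whether zero remaining cleanup tasks is the right definition of "incorrect" is a design decision left open here.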

Move bounce release to cleaner
Move the bounce release to the cleaner, which is a more reasonable place
for it than the scheduler. The bounce now gets released slightly earlier:
in particular, it is released immediately after a pending request is
enqueued to schedule more tasks.

@PtrTeixeira PtrTeixeira changed the title from [WIP] Release lock on bounce when bounced with no running instances to Release lock on bounce when bounced with no running instances May 26, 2017

@ssalinas ssalinas modified the milestone: 0.16.0 Jun 2, 2017

@ssalinas ssalinas merged commit a70f552 into master Jun 19, 2017

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed

@ssalinas ssalinas deleted the catch-stuck-bounce branch Jun 19, 2017
