Release lock on bounce when bounced with no running instances#1553
Merged
Conversation
If a bounce occurs in the quiet period between a task failing and a new task being launched by Singularity, it seems that the new tasks are launched and marked as having been started by the bounce, but because no tasks were ever killed, the bounce isn't marked as finished. Add test case to reproduce the issue; will work on fix next.
Shim a fix by removing the bounce lock after scheduling the new tasks when there is a bounce lock and are currently no active tasks. This means that the bounce lock is removed before those tasks become active. I don't think that this will create any problems; this should be the only case where there is a bounce present, more instances are being scheduled, and there are currently no running instances, but it's an odd enough case that I'm not really sure.
ssalinas
reviewed
May 25, 2017
| if (numMissingInstances > 0) { | ||
| schedule(numMissingInstances, matchingTaskIds, request, state, deployStatistics, pendingRequest, maybePendingDeploy); | ||
|
|
||
| List<SingularityTaskId> remainingActiveTasks = new ArrayList<>(matchingTaskIds); |
Contributor
There was a problem hiding this comment.
Logic here is good, just debating if this is the right place to put this. Ideally since bounce is all cleanup related this would live in the SingularityCleaner. Maybe it makes sense for each run of the cleaner to double check any in progress bounces or requests that are currently marked as bouncing and verify that they are correct?
Having it here will solve the current edge case we ran into, but I could see there still being more. Moving something like that to the cleaner might give us a better long term solution
Move the bounce release to the cleaner, which is a more reasonable place for it than the scheduler. The bounce gets released slightly earlier now - in particular, it is released immediately after there is a pending request enqueued to schedule more tasks.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Currently the lock on the bounce is released when all the tasks that were running at the time of the bounce have been killed. This means that when a bounce happens when there are no running tasks - like when Singularity is still starting tasks after every task has failed - the expiring bounce can't be removed. This shims a fix by removing the lock when it schedules the new tasks if there are no currently running tasks, it is adding more running tasks, and there is an expiring bounce on the request. This shouldn't interfere with a scale or an incremental bounce, because they always keep at least the minimum number of instances.
I am concerned that it releases the bounce lock too early, though, because normally the bounce is guaranteed to be present until all the new instances have successfully started, which isn't something that this change preservers.
/cc @ssalinas