Release lock on bounce when bounced with no running instances #1553
Currently the lock on the bounce is released when all the tasks that were running at the time of the bounce have been killed. This means that when a bounce happens while there are no running tasks (for example, when Singularity is still starting tasks after every task has failed), the expiring bounce can't be removed. This shims a fix by removing the lock when the new tasks are scheduled, provided that there are no currently running tasks, more tasks are being added, and there is an expiring bounce on the request. This shouldn't interfere with a scale or an incremental bounce, because they always keep at least the minimum number of instances.
I am concerned that it releases the bounce lock too early, though, because normally the bounce is guaranteed to be present until all the new instances have successfully started, which isn't something that this change preserves.
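The release condition described above can be written as a single predicate. The following is only a minimal sketch for illustration: the class name, method name, and parameters are hypothetical and are not taken from the Singularity codebase, which tracks this state through its own request and task managers.

```java
// Hypothetical sketch of the early-release condition; not Singularity's actual code.
final class BounceLockSketch {
    private BounceLockSketch() {}

    /**
     * Returns true when the bounce lock should be released at scheduling time:
     * an expiring bounce exists, nothing is currently running (so no kill
     * will ever fire to release the lock), and new tasks are being scheduled.
     */
    static boolean shouldReleaseBounceLock(int runningTasks,
                                           int tasksBeingScheduled,
                                           boolean hasExpiringBounce) {
        return hasExpiringBounce
            && runningTasks == 0
            && tasksBeingScheduled > 0;
    }
}
```

Under this sketch, a scale or an incremental bounce never satisfies the predicate as long as at least one instance stays running, which matches the reasoning given above.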
If a bounce occurs in the quiet period between a task failing and a new task being launched by Singularity, it seems that the new tasks are launched and marked as having been started by the bounce, but because no tasks were ever killed, the bounce isn't marked as finished. Add test case to reproduce the issue; will work on fix next.
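The failure mode in this commit message can be illustrated with a toy model. This is a hedged sketch, not the test case added in the commit: `BounceTrackerSketch` and its methods are hypothetical, and the model assumes only that the lock is released from the task-killed path once every task that was running at bounce time has been killed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy model of the stuck bounce lock; not Singularity's actual code.
final class BounceTrackerSketch {
    // Tasks that were running when the bounce was requested.
    private final Set<String> tasksToKill;
    private boolean bounceLockHeld = true;

    BounceTrackerSketch(Set<String> runningAtBounce) {
        this.tasksToKill = new HashSet<>(runningAtBounce);
    }

    // The lock is only ever released here, in response to a kill.
    void onTaskKilled(String taskId) {
        tasksToKill.remove(taskId);
        if (tasksToKill.isEmpty()) {
            bounceLockHeld = false;
        }
    }

    boolean isBounceLockHeld() {
        return bounceLockHeld;
    }
}
```

If the tracker is constructed with an empty set, no kill event ever arrives, so `onTaskKilled` is never called and the lock stays held, which is the situation described in this commit message.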
Shim a fix by removing the bounce lock after scheduling the new tasks when there is a bounce lock and there are currently no active tasks. This means that the bounce lock is removed before those tasks become active. I don't think that this will create any problems; this should be the only case where there is a bounce present, more instances are being scheduled, and there are currently no running instances, but it's an odd enough case that I'm not really sure.
Move the bounce release to the cleaner, which is a more reasonable place for it than the scheduler. The bounce gets released slightly earlier now - in particular, it is released immediately after there is a pending request enqueued to schedule more tasks.