-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow immediate run in pending queue with deploy #1538
Conversation
Give immediate run requests in the pending queue a seperate key, so they are permitted to be in the pending queue at the same time as a new deploy for the same requestID / deployID. This means that a pending deploy will not block an immediate run from being enqueued.
Add test to validate that an immediate (run-now) test replaces a scheduled task in the pending list.
Thanks for the additional test, let's give this a go in staging 👍 |
In particular, if attempting to delete an immediate run node from ZK indicates that the node wasn't there, try deleting it from the other location instead. /cc @ssalinas
(existingRequest.get().getPendingType() == PendingType.TASK_DONE | ||
|| existingRequest.get().getPendingType() == PendingType.NEW_DEPLOY)) { | ||
boolean markImmediate = forceImmediate | ||
|| (existingRequest.isPresent() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may still be causing us some issues. I've seen a scheduled request or two pop through that never got a pending task in place after a successful deploy, I think due to the two possible deletes and the fact that they could be enqueued between those attempts. We also have a weird bit going on with the getPendingPath(String requestId, String deployId)
implementation which returns a different version of the path when we call getPendingRequest
I'm wondering about the consequences of returning String.format("%s%s", deployKey, pendingRequest.getTimestamp())
for all cases. Since we sort the pending requests by timestamp, we will still be able to put multiple in the queue. The work then falls to the scheduler to de-dupe and process appropriately. We should check out where else we might be relying on the EXISTED
response to see if this is possible
Defer checking duplicate requests from ZK into the scheduler. This allows slightly more control over what kinds of requests are allowed to be in the pending queue simultaneously, and avoids the kind of edge cases that we were running into previously by attempting to use nodenames in ZK to resolve the same problem. /cc @ssalinas
return ZKPaths.makePath(PENDING_PATH_ROOT, nodeName); | ||
} | ||
|
||
private String pendingQueueKey(SingularityPendingRequest pendingRequest) { | ||
SingularityDeployKey deployKey = new SingularityDeployKey(pendingRequest.getRequestId(), pendingRequest.getDeployId()); | ||
return String.format("%s%s", deployKey.toString(), pendingRequest.getTimestamp()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only immediate thing I see here is that we'd need to handle any pending requests that are currently in the queue when singularity starts. Otherwise we will not delete them correctly since we will calculate the wrong path, but we will continually process them since it's just a getChildren call. Should be able to write a simple zk migration for this that reads them all writes to the correct path and deletes the old path on startup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If only immediate / one-off requests have a different pending path now, do we still need the zk migration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We most likely do. This will still change the path of things that might be currently in the pending queue on startup. So we'd need to rewrite those to their new path
List<SingularityPendingRequest> effectivePendingRequests = new ArrayList<>(); | ||
pendingRequestsForDeploy.sort(Comparator.comparingLong(SingularityPendingRequest::getTimestamp)); | ||
for (SingularityPendingRequest pendingRequest : pendingRequestsForDeploy) { | ||
final SingularityRequest updatedRequest = updatedRequest(maybePendingDeploy, pendingRequest, maybeRequest.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should move this earlier on and make sure all methods that use the SingularityRequest object are using this one. i.e. in the old code getMatchingTaskIds
was also called with the updated one instead of maybeRequest.get().getRequest()
. Maybe put this up right after the isRequestActive check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The getMatchingTaskIds
method really only cares about the requestId
and whether the request is long-running, which I don't think can change per pending request. The reason that I'm pushing back here is that if it needs to be run on the updated request, then it needs to be run in the inner loop, rather than the outer one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that's right we've got the inner loop for the individual pending requests now, 👍
LOG.trace("Holding pending request {} because it is scheduled and has an active task", pendingRequest); | ||
heldForScheduledActiveTask++; | ||
continue; | ||
requestManager.deletePendingRequest(pendingRequest); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for any pending requests that we are scheduling tasks for, the delete should come after the scheduleTasks call. If we are interrupted between here and scheduleTasks we lose a pending request and could get into an inconsistent state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
second thing on this, if we hit the else
case there, should we really be deleting the pending request? In this particular case if we had a BOUNCE and a NEW_DEPLOY in the queue, we'd take the bounce and end up deleting the new deploy
|
||
requestManager.deletePendingRequest(pendingRequest); | ||
totalNewScheduledTasks += scheduledInstances; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep the variable as numScheduledTasks
since we call it tasks not instances everywhere else in this class
@@ -398,8 +416,30 @@ private void deleteScheduledTasks(final Collection<SingularityPendingTask> sched | |||
} | |||
} | |||
|
|||
private int scheduleTasks(SingularitySchedulerStateCache stateCache, SingularityRequest request, RequestState state, SingularityDeployStatistics deployStatistics, | |||
SingularityPendingRequest pendingRequest, List<SingularityTaskId> matchingTaskIds, Optional<SingularityPendingDeploy> maybePendingDeploy) { | |||
private List<SingularityTaskId> getMatchingTaskIds(SingularitySchedulerStateCache stateCache, SingularityRequest request, SingularityDeployKey deployKey) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason you duplicated this method?
In particular - Revert to only using the timestamp format for immediate & one-off requests - Correct the order of schedule / delete in the scheduler to guarantee that the scheduler is not left in an inconsistent state. - Rename some variables for consistency with the result of the class
Add a ZK migration that will rewrite pending requests onto the new format, where they are timestamped to allow multiples in the pending queue at a time.
👍 looks to go give this another go with the zk migration in there now |
this has looked good with the new updates, merging. Thanks @PtrTeixeira |
Give immediate run requests in the pending queue a separate key, so they
are permitted to be in the pending queue at the same time as a new
deploy for the same requestID / deployID. This means that a pending
deploy will not block an immediate run from being enqueued.
/cc @ssalinas