Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow immediate run in pending queue with deploy #1538

Merged
merged 7 commits into from
Jun 12, 2017

Conversation

PtrTeixeira
Copy link
Contributor

Give immediate run requests in the pending queue a separate key, so they
are permitted to be in the pending queue at the same time as a new
deploy for the same requestID / deployID. This means that a pending
deploy will not block an immediate run from being enqueued.

/cc @ssalinas

Give immediate run requests in the pending queue a seperate key, so they
are permitted to be in the pending queue at the same time as a new
deploy for the same requestID / deployID. This means that a pending
deploy will not block an immediate run from being enqueued.
Add test to validate that an immediate (run-now) test replaces a
scheduled task in the pending list.
@ssalinas ssalinas modified the milestone: 0.16.0 May 10, 2017
@ssalinas
Copy link
Member

Thanks for the additional test, let's give this a go in staging 👍

In particular, if attempting to delete an immediate run node from ZK
indicates that the node wasn't there, try deleting it from the other
location instead.

/cc @ssalinas
(existingRequest.get().getPendingType() == PendingType.TASK_DONE
|| existingRequest.get().getPendingType() == PendingType.NEW_DEPLOY)) {
boolean markImmediate = forceImmediate
|| (existingRequest.isPresent()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may still be causing us some issues. I've seen a scheduled request or two pop through that never got a pending task in place after a successful deploy, I think due to the two possible deletes and the fact that they could be enqueued between those attempts. We also have a weird bit going on with the getPendingPath(String requestId, String deployId) implementation which returns a different version of the path when we call getPendingRequest

I'm wondering about the consequences of returning String.format("%s%s", deployKey, pendingRequest.getTimestamp()) for all cases. Since we sort the pending requests by timestamp, we will still be able to put multiple in the queue. The work then falls to the scheduler to de-dupe and process appropriately. We should check out where else we might be relying on the EXISTED response to see if this is possible

Defer checking duplicate requests from ZK into the scheduler.  This
allows slightly more control over what kinds of requests are allowed to
be in the pending queue simultaneously, and avoids the kind of edge
cases that we were running into previously by attempting to use
nodenames in ZK to resolve the same problem.

/cc @ssalinas
return ZKPaths.makePath(PENDING_PATH_ROOT, nodeName);
}

private String pendingQueueKey(SingularityPendingRequest pendingRequest) {
SingularityDeployKey deployKey = new SingularityDeployKey(pendingRequest.getRequestId(), pendingRequest.getDeployId());
return String.format("%s%s", deployKey.toString(), pendingRequest.getTimestamp());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only immediate thing I see here is that we'd need to handle any pending requests that are currently in the queue when singularity starts. Otherwise we will not delete them correctly since we will calculate the wrong path, but we will continually process them since it's just a getChildren call. Should be able to write a simple zk migration for this that reads them all writes to the correct path and deletes the old path on startup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If only immediate / one-off requests have a different pending path now, do we still need the zk migration?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We most likely do. This will still change the path of things that might be currently in the pending queue on startup. So we'd need to rewrite those to their new path

List<SingularityPendingRequest> effectivePendingRequests = new ArrayList<>();
pendingRequestsForDeploy.sort(Comparator.comparingLong(SingularityPendingRequest::getTimestamp));
for (SingularityPendingRequest pendingRequest : pendingRequestsForDeploy) {
final SingularityRequest updatedRequest = updatedRequest(maybePendingDeploy, pendingRequest, maybeRequest.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move this earlier on and make sure all methods that use the SingularityRequest object are using this one. i.e. in the old code getMatchingTaskIds was also called with the updated one instead of maybeRequest.get().getRequest(). Maybe put this up right after the isRequestActive check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The getMatchingTaskIds method really only cares about the requestId and whether the request is long-running, which I don't think can change per pending request. The reason that I'm pushing back here is that if it needs to be run on the updated request, then it needs to be run in the inner loop, rather than the outer one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that's right we've got the inner loop for the individual pending requests now, 👍

LOG.trace("Holding pending request {} because it is scheduled and has an active task", pendingRequest);
heldForScheduledActiveTask++;
continue;
requestManager.deletePendingRequest(pendingRequest);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for any pending requests that we are scheduling tasks for, the delete should come after the scheduleTasks call. If we are interrupted between here and scheduleTasks we lose a pending request and could get into an inconsistent state

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

second thing on this, if we hit the else case there, should we really be deleting the pending request? In this particular case if we had a BOUNCE and a NEW_DEPLOY in the queue, we'd take the bounce and end up deleting the new deploy


requestManager.deletePendingRequest(pendingRequest);
totalNewScheduledTasks += scheduledInstances;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep the variable as numScheduledTasks since we call it tasks not instances everywhere else in this class

@@ -398,8 +416,30 @@ private void deleteScheduledTasks(final Collection<SingularityPendingTask> sched
}
}

private int scheduleTasks(SingularitySchedulerStateCache stateCache, SingularityRequest request, RequestState state, SingularityDeployStatistics deployStatistics,
SingularityPendingRequest pendingRequest, List<SingularityTaskId> matchingTaskIds, Optional<SingularityPendingDeploy> maybePendingDeploy) {
private List<SingularityTaskId> getMatchingTaskIds(SingularitySchedulerStateCache stateCache, SingularityRequest request, SingularityDeployKey deployKey) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you duplicated this method?

In particular
- Revert to only using the timestamp format for immediate & one-off
  requests
- Correct the order of schedule / delete in the scheduler to
  guarantee that the scheduler is not left in an inconsistent
  state.
- Rename some variables for consistency with the result of the
  class
Add a ZK migration that will rewrite pending requests onto the new
format, where they are timestamped to allow multiples in the pending
queue at a time.
@ssalinas
Copy link
Member

ssalinas commented Jun 1, 2017

👍 looks to go give this another go with the zk migration in there now

@ssalinas
Copy link
Member

this has looked good with the new updates, merging. Thanks @PtrTeixeira

@ssalinas ssalinas merged commit dd72a31 into master Jun 12, 2017
@ssalinas ssalinas deleted the add-items-in-pending-queue branch June 12, 2017 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants