Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry lost tasks #1773

Merged
merged 10 commits into from
May 8, 2018
Merged

Retry lost tasks #1773

merged 10 commits into from
May 8, 2018

Conversation

pschoenfelder
Copy link
Contributor

No description provided.

@@ -175,6 +176,10 @@ private void saveNewTaskStatusHolder(SingularityTaskId taskIdObj, SingularityTas
return Optional.absent();
}

private void relaunchTask(SingularityTask task) {
taskManager.savePendingTask(task.getTaskRequest().getPendingTask());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually how to put something on the pending queue? Not convinced this is correct

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm iffy on having this here. This is partially because we let the status update handler write to pending tasks, and partially because it means we are reusing a pending task id. Generally we let the SingularityScheduler do all of the work of creating a pending task to keep responsibility for those types of operations separate. We actually removed bits from the status update handler a little while back so that it would avoid mutating the pending task queue.

Instead I'd suggest using requestManager to add to the pending request queue. This will let the scheduler do it's normal thing and rebuild a full new pending task with new unique ID from that pending request

@pschoenfelder pschoenfelder requested a review from ssalinas April 3, 2018 17:44
Copy link
Member

@ssalinas ssalinas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments above on the pending task piece. Would also be good to see unit tests for this. We have a good unit test framework set up already for testing interactions with status updates

RequestType requestType = task.isPresent() ? task.get().getTaskRequest().getRequest().getRequestType() : null;
boolean isRelaunchable =
requestType != null
&& (requestType == RequestType.ON_DEMAND
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just use !isLongRunning here instead of having to specify all 3

LOG.info("Relaunching lost task {}", task);
relaunchTask(task.get());
} else {
lostTasksMeter.mark();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll still want the lost task meter to fire regardless of failure type. The lost task metric allows us to do things like alert on a large wave of these (indicating that something is wrong with singularity. Similar with the disaster detection piece below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this still cause the alerts that the original ticket wanted to resolve then?

Copy link
Member

@ssalinas ssalinas Apr 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There weren't alerts in the original issue. The original issue was that a failure that was mesos'/singularity's fault was counting towards the retry count for a scheduled task. i.e. it was given 2 attempts, both failed due to the invalid offers reason. So, even though the task code itself wasn't the thing that failed, it wasn't retried again

@@ -175,6 +176,10 @@ private void saveNewTaskStatusHolder(SingularityTaskId taskIdObj, SingularityTas
return Optional.absent();
}

private void relaunchTask(SingularityTask task) {
taskManager.savePendingTask(task.getTaskRequest().getPendingTask());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm iffy on having this here. This is partially because we let the status update handler write to pending tasks, and partially because it means we are reusing a pending task id. Generally we let the SingularityScheduler do all of the work of creating a pending task to keep responsibility for those types of operations separate. We actually removed bits from the status update handler a little while back so that it would avoid mutating the pending task queue.

Instead I'd suggest using requestManager to add to the pending request queue. This will let the scheduler do it's normal thing and rebuild a full new pending task with new unique ID from that pending request

@pschoenfelder
Copy link
Contributor Author

Where is that set of unit tests?

@ssalinas
Copy link
Member

ssalinas commented Apr 9, 2018

@pschoenfelder for unit test examples, you can look at anything that extends SingularitySchedulerTestBase. Those all have lots of helpers set up. You'll see the statusUpdate helper method called quite often, which you can use to test this out

@ssalinas ssalinas added this to the 0.20.0 milestone Apr 11, 2018
} else {
System.out.println(requestManager.getPendingRequests());
Assert.assertEquals(requestManager.getPendingRequests().size(), 0);
// Assert.assertEquals(requestManager.getPendingRequests().get(0).getPendingType(), PendingType.TASK_DONE);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm misunderstanding something because one of the two negative test cases fails here. The two negative test cases have different behaviors. One, itDoesNotRetryLostRequestsDueToNonAgentFailures, results in an empty pending queue, which is what I'd expect. The other, itDoesNotRetryLostLongRunningRequests, winds up with a request in the pending queue, even though I've manually verified through the debugger that the new "relaunch task" code path is never hit. This request also winds up with a PendingType of TASK_DONE which is definitely suspicious. Before I spend more time debugging, is there some obvious behavior I'm missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, that's normal that TASK_DONE is enqueued. What's different is that a TASK_DONE for a scheduled job won't necessarily result in a new task being started right away. We want to make sure we have our own pending request of a different type, that will trigger the desired behavior vs TASK_DONE which would do things like just schedule the next interval for a scheduled task instead of scheduling right away (and for an ON_DEMAND I believe TASK_DONE would be a no-op)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it sounds like this is actually okay as is then. Both of these cases are the negative scenario where we don't want to retry, and it sounds like empty or TASK_DONE both work for that. Only in the positive test case do we expect and get a RETRY.

@ssalinas
Copy link
Member

ssalinas commented May 3, 2018

🚢

@ssalinas ssalinas merged commit b2ae702 into master May 8, 2018
@ssalinas ssalinas deleted the retry-lost-tasks branch May 8, 2018 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants