Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry lost tasks #1773

Merged
merged 10 commits into from May 8, 2018
Merged

Retry lost tasks #1773

merged 10 commits into from May 8, 2018

Conversation

@pschoenfelder
Copy link
Contributor

@pschoenfelder pschoenfelder commented Apr 3, 2018

No description provided.

Paul Schoenfelder
@@ -175,6 +176,10 @@ private void saveNewTaskStatusHolder(SingularityTaskId taskIdObj, SingularityTas
return Optional.absent();
}

private void relaunchTask(SingularityTask task) {
taskManager.savePendingTask(task.getTaskRequest().getPendingTask());

This comment has been minimized.

@pschoenfelder

pschoenfelder Apr 3, 2018
Author Contributor

Is this actually how to put something on the pending queue? Not convinced this is correct

This comment has been minimized.

@ssalinas

ssalinas Apr 3, 2018
Member

I'm iffy on having this here. This is partially because we let the status update handler write to pending tasks, and partially because it means we are reusing a pending task id. Generally we let the SingularityScheduler do all of the work of creating a pending task to keep responsibility for those types of operations separate. We actually removed bits from the status update handler a little while back so that it would avoid mutating the pending task queue.

Instead I'd suggest using requestManager to add to the pending request queue. This will let the scheduler do it's normal thing and rebuild a full new pending task with new unique ID from that pending request

@pschoenfelder pschoenfelder requested a review from ssalinas Apr 3, 2018
Copy link
Member

@ssalinas ssalinas left a comment

Few comments above on the pending task piece. Would also be good to see unit tests for this. We have a good unit test framework set up already for testing interactions with status updates

RequestType requestType = task.isPresent() ? task.get().getTaskRequest().getRequest().getRequestType() : null;
boolean isRelaunchable =
requestType != null
&& (requestType == RequestType.ON_DEMAND

This comment has been minimized.

@ssalinas

ssalinas Apr 3, 2018
Member

You can just use !isLongRunning here instead of having to specify all 3

LOG.info("Relaunching lost task {}", task);
relaunchTask(task.get());
} else {
lostTasksMeter.mark();

This comment has been minimized.

@ssalinas

ssalinas Apr 3, 2018
Member

we'll still want the lost task meter to fire regardless of failure type. The lost task metric allows us to do things like alert on a large wave of these (indicating that something is wrong with singularity. Similar with the disaster detection piece below

This comment has been minimized.

@pschoenfelder

pschoenfelder Apr 4, 2018
Author Contributor

Won't this still cause the alerts that the original ticket wanted to resolve then?

This comment has been minimized.

@ssalinas

ssalinas Apr 4, 2018
Member

There weren't alerts in the original issue. The original issue was that a failure that was mesos'/singularity's fault was counting towards the retry count for a scheduled task. i.e. it was given 2 attempts, both failed due to the invalid offers reason. So, even though the task code itself wasn't the thing that failed, it wasn't retried again

@@ -175,6 +176,10 @@ private void saveNewTaskStatusHolder(SingularityTaskId taskIdObj, SingularityTas
return Optional.absent();
}

private void relaunchTask(SingularityTask task) {
taskManager.savePendingTask(task.getTaskRequest().getPendingTask());

This comment has been minimized.

@ssalinas

ssalinas Apr 3, 2018
Member

I'm iffy on having this here. This is partially because we let the status update handler write to pending tasks, and partially because it means we are reusing a pending task id. Generally we let the SingularityScheduler do all of the work of creating a pending task to keep responsibility for those types of operations separate. We actually removed bits from the status update handler a little while back so that it would avoid mutating the pending task queue.

Instead I'd suggest using requestManager to add to the pending request queue. This will let the scheduler do it's normal thing and rebuild a full new pending task with new unique ID from that pending request

Paul Schoenfelder added 2 commits Apr 4, 2018
Paul Schoenfelder
Paul Schoenfelder
@pschoenfelder
Copy link
Contributor Author

@pschoenfelder pschoenfelder commented Apr 5, 2018

Where is that set of unit tests?

@ssalinas
Copy link
Member

@ssalinas ssalinas commented Apr 9, 2018

@pschoenfelder for unit test examples, you can look at anything that extends SingularitySchedulerTestBase. Those all have lots of helpers set up. You'll see the statusUpdate helper method called quite often, which you can use to test this out

@ssalinas ssalinas added this to the 0.20.0 milestone Apr 11, 2018
Paul Schoenfelder
} else {
System.out.println(requestManager.getPendingRequests());
Assert.assertEquals(requestManager.getPendingRequests().size(), 0);
// Assert.assertEquals(requestManager.getPendingRequests().get(0).getPendingType(), PendingType.TASK_DONE);

This comment has been minimized.

@pschoenfelder

pschoenfelder Apr 11, 2018
Author Contributor

I think I'm misunderstanding something because one of the two negative test cases fails here. The two negative test cases have different behaviors. One, itDoesNotRetryLostRequestsDueToNonAgentFailures, results in an empty pending queue, which is what I'd expect. The other, itDoesNotRetryLostLongRunningRequests, winds up with a request in the pending queue, even though I've manually verified through the debugger that the new "relaunch task" code path is never hit. This request also winds up with a PendingType of TASK_DONE which is definitely suspicious. Before I spend more time debugging, is there some obvious behavior I'm missing?

This comment has been minimized.

@ssalinas

ssalinas Apr 13, 2018
Member

So, that's normal that TASK_DONE is enqueued. What's different is that a TASK_DONE for a scheduled job won't necessarily result in a new task being started right away. We want to make sure we have our own pending request of a different type, that will trigger the desired behavior vs TASK_DONE which would do things like just schedule the next interval for a scheduled task instead of scheduling right away (and for an ON_DEMAND I believe TASK_DONE would be a no-op)

This comment has been minimized.

@pschoenfelder

pschoenfelder Apr 24, 2018
Author Contributor

So it sounds like this is actually okay as is then. Both of these cases are the negative scenario where we don't want to retry, and it sounds like empty or TASK_DONE both work for that. Only in the positive test case do we expect and get a RETRY.

Paul Schoenfelder added 3 commits Apr 24, 2018
Paul Schoenfelder
…retry-lost-tasks
…retry-lost-tasks
baconmania and others added 3 commits Apr 27, 2018
Paul Schoenfelder
Paul Schoenfelder
@ssalinas
Copy link
Member

@ssalinas ssalinas commented May 3, 2018

🚢

@ssalinas ssalinas merged commit b2ae702 into master May 8, 2018
2 checks passed
2 checks passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
@ssalinas ssalinas deleted the retry-lost-tasks branch May 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.