Retry lost tasks #1773
@@ -175,6 +176,10 @@ private void saveNewTaskStatusHolder(SingularityTaskId taskIdObj, SingularityTas
    return Optional.absent();
  }

  private void relaunchTask(SingularityTask task) {
    taskManager.savePendingTask(task.getTaskRequest().getPendingTask());
Is this actually how to put something on the pending queue? I'm not convinced this is correct.
I'm iffy on having this here, partly because it has the status update handler write to pending tasks, and partly because it means we are reusing a pending task ID. Generally we let the SingularityScheduler do all of the work of creating a pending task, to keep responsibility for those types of operations separate. We actually removed bits from the status update handler a little while back so that it would avoid mutating the pending task queue.
Instead, I'd suggest using requestManager to add to the pending request queue. This will let the scheduler do its normal thing and rebuild a full new pending task, with a new unique ID, from that pending request.
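To make the suggestion concrete, here is a minimal, self-contained sketch of the shape I have in mind. Everything in it (`PendingRequest`, `PendingRequestQueue`, `SketchScheduler`, the ID format) is a hypothetical stand-in written for this comment, not Singularity's real classes or method signatures. The only point is that the handler enqueues a request, and the scheduler alone builds the pending task with a fresh ID.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class RelaunchViaRequestQueueSketch {
  // Hypothetical stand-in for a pending request: which request, and why it was queued.
  static final class PendingRequest {
    final String requestId;
    final String pendingType;

    PendingRequest(String requestId, String pendingType) {
      this.requestId = requestId;
      this.pendingType = pendingType;
    }
  }

  // Hypothetical stand-in for the pending request queue.
  static final class PendingRequestQueue {
    private final Queue<PendingRequest> queue = new ArrayDeque<>();

    void add(PendingRequest request) {
      queue.add(request);
    }

    PendingRequest poll() {
      return queue.poll();
    }
  }

  // Hypothetical stand-in for the scheduler: the only component that creates pending tasks.
  static final class SketchScheduler {
    private long nextSequence = 1;

    // Drains the request queue and builds one pending task ID per request.
    List<String> drain(PendingRequestQueue requests) {
      List<String> pendingTaskIds = new ArrayList<>();
      PendingRequest request;
      while ((request = requests.poll()) != null) {
        // A new ID is generated on every pass, so a relaunch never reuses an old pending task ID.
        pendingTaskIds.add(request.requestId + "-" + nextSequence++);
      }
      return pendingTaskIds;
    }
  }

  // What the status update handler would do for a lost task: enqueue a request and nothing more.
  static void onTaskLost(String requestId, PendingRequestQueue requests) {
    requests.add(new PendingRequest(requestId, "RETRY"));
  }

  public static void main(String[] args) {
    PendingRequestQueue requests = new PendingRequestQueue();
    SketchScheduler scheduler = new SketchScheduler();
    onTaskLost("my-job", requests);
    onTaskLost("my-job", requests);
    System.out.println(scheduler.drain(requests));
  }
}
```

The handler never touches the pending task queue in this shape, which keeps the separation of responsibility described above.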
A few comments above on the pending task piece. It would also be good to see unit tests for this. We already have a good unit test framework set up for testing interactions with status updates.
RequestType requestType = task.isPresent() ? task.get().getTaskRequest().getRequest().getRequestType() : null;
boolean isRelaunchable =
    requestType != null
        && (requestType == RequestType.ON_DEMAND
You can just use !isLongRunning here instead of having to specify all 3 request types.
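As a rough illustration of why the two checks should be equivalent, here is a small standalone sketch. The `RequestType` enum below is a stand-in defined for this example, and both its long-running flags and the choice of ON_DEMAND, SCHEDULED, and RUN_ONCE as the three spelled-out types are my assumptions, since the diff above is cut off after ON_DEMAND.

```java
final class RelaunchableCheckSketch {
  // Stand-in enum for this example only; the long-running flags are assumed.
  enum RequestType {
    SERVICE(true),
    WORKER(true),
    SCHEDULED(false),
    ON_DEMAND(false),
    RUN_ONCE(false);

    private final boolean longRunning;

    RequestType(boolean longRunning) {
      this.longRunning = longRunning;
    }

    boolean isLongRunning() {
      return longRunning;
    }
  }

  // Verbose form: spell out each non-long-running type by hand.
  static boolean isRelaunchableVerbose(RequestType type) {
    return type != null
        && (type == RequestType.ON_DEMAND
            || type == RequestType.SCHEDULED
            || type == RequestType.RUN_ONCE);
  }

  // Concise form: let the enum answer the question.
  static boolean isRelaunchable(RequestType type) {
    return type != null && !type.isLongRunning();
  }

  public static void main(String[] args) {
    for (RequestType type : RequestType.values()) {
      System.out.println(type + ": verbose=" + isRelaunchableVerbose(type)
          + ", concise=" + isRelaunchable(type));
    }
  }
}
```

The concise form also stays correct if another non-long-running type is added later, as long as its flag is set properly.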
  LOG.info("Relaunching lost task {}", task);
  relaunchTask(task.get());
} else {
  lostTasksMeter.mark();
We'll still want the lost task meter to fire regardless of failure type. The lost task metric allows us to do things like alert on a large wave of these (indicating that something is wrong with Singularity). The same goes for the disaster detection piece below.
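Roughly, the shape I'd expect is the meter call sitting outside the relaunch branch. The sketch below is self-contained and uses a hypothetical `Meter` stand-in plus a made-up `handleLostTask` method, so treat it as an illustration of the control flow rather than the actual handler code.

```java
import java.util.concurrent.atomic.AtomicLong;

final class LostTaskMeterSketch {
  // Hypothetical stand-in for a metrics meter that counts marked events.
  static final class Meter {
    private final AtomicLong count = new AtomicLong();

    void mark() {
      count.incrementAndGet();
    }

    long getCount() {
      return count.get();
    }
  }

  final Meter lostTasksMeter = new Meter();
  int relaunches = 0;

  // Made-up handler: counts every lost task, then relaunches only the relaunchable ones.
  void handleLostTask(boolean relaunchable) {
    // Mark before branching so every lost task is counted, relaunched or not.
    lostTasksMeter.mark();
    if (relaunchable) {
      relaunches++;
    }
  }

  public static void main(String[] args) {
    LostTaskMeterSketch handler = new LostTaskMeterSketch();
    handler.handleLostTask(true);
    handler.handleLostTask(false);
    handler.handleLostTask(true);
    System.out.println("lost=" + handler.lostTasksMeter.getCount()
        + ", relaunched=" + handler.relaunches);
  }
}
```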
Won't this still cause the alerts that the original ticket wanted to resolve then?
There weren't alerts in the original issue. The original issue was that a failure that was Mesos's or Singularity's fault was counting towards the retry count for a scheduled task, i.e. the task was given 2 attempts and both failed due to the invalid offers reason. So, even though the task code itself wasn't the thing that failed, it wasn't retried again.
Where is that set of unit tests?
@pschoenfelder for unit test examples, you can look at anything that extends
} else {
  System.out.println(requestManager.getPendingRequests());
  Assert.assertEquals(requestManager.getPendingRequests().size(), 0);
  // Assert.assertEquals(requestManager.getPendingRequests().get(0).getPendingType(), PendingType.TASK_DONE);
I think I'm misunderstanding something, because one of the two negative test cases fails here. The two negative test cases have different behaviors. One, itDoesNotRetryLostRequestsDueToNonAgentFailures, results in an empty pending queue, which is what I'd expect. The other, itDoesNotRetryLostLongRunningRequests, winds up with a request in the pending queue, even though I've manually verified through the debugger that the new "relaunch task" code path is never hit. This request also winds up with a PendingType of TASK_DONE, which is definitely suspicious. Before I spend more time debugging, is there some obvious behavior I'm missing?
So, it's normal that TASK_DONE is enqueued. What's different is that a TASK_DONE for a scheduled job won't necessarily result in a new task being started right away. We want to make sure we have our own pending request of a different type that will trigger the desired behavior. TASK_DONE, by contrast, would do things like just schedule the next interval for a scheduled task instead of scheduling right away (and for an ON_DEMAND request I believe TASK_DONE would be a no-op).
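Put as a toy decision table, the behavior described above looks something like the sketch below. The enums and the `actionFor` method are hypothetical and exist only for this comment. The mapping encodes my description rather than the scheduler's actual code, and the ON_DEMAND plus TASK_DONE case in particular is the part I'm least sure of.

```java
final class PendingTypeSketch {
  // Stand-in enums for this example only.
  enum RequestType { SCHEDULED, ON_DEMAND }

  enum PendingType { TASK_DONE, RETRY }

  enum Action { LAUNCH_NOW, WAIT_FOR_NEXT_INTERVAL, NO_OP }

  // Hypothetical mapping from (request type, pending type) to what the scheduler does next.
  static Action actionFor(RequestType requestType, PendingType pendingType) {
    if (pendingType == PendingType.RETRY) {
      // A dedicated retry type asks for a new task right away.
      return Action.LAUNCH_NOW;
    }
    // TASK_DONE: a scheduled job waits for its next interval; ON_DEMAND is assumed to do nothing.
    return requestType == RequestType.SCHEDULED
        ? Action.WAIT_FOR_NEXT_INTERVAL
        : Action.NO_OP;
  }

  public static void main(String[] args) {
    for (RequestType requestType : RequestType.values()) {
      for (PendingType pendingType : PendingType.values()) {
        System.out.println(requestType + " + " + pendingType + " -> "
            + actionFor(requestType, pendingType));
      }
    }
  }
}
```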
So it sounds like this is actually okay as is, then. Both of these cases are the negative scenario where we don't want to retry, and it sounds like empty or TASK_DONE both work for that. Only in the positive test case do we expect and get a RETRY.
🚢