This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Sending an 'IN_PROGRESS' update for task resurfaces it #1378

Open
Shole4ever opened this issue Nov 7, 2019 · 14 comments

Comments

@Shole4ever

Hi,
I need to use 'responseTimeoutSeconds' as a heartbeat for the consumer, and 'timeoutSeconds' as the terminal timeout. The use case is that the consumer of the task needs to notify the Conductor server, at least once every responseTimeoutSeconds, that it is still working on the task.

For sending the heartbeat, we send a task update via the POST /tasks API (see https://netflix.github.io/conductor/apispec/#polling-ack-and-update-task) with 'status' set to 'IN_PROGRESS'.
However, this results in the server resurfacing the task (see WorkflowExecutor.java:834).
Can you suggest whether I am doing something wrong here, or whether this is a bug?

Thanks.

@bharadwajrembar

@kishorebanala @apanicker-nflx Could you please clarify this?

@kishorebanala
Contributor

@Shole4ever I'm not sure I understood what you meant by resurfacing the task. But if you mean the task is being polled and executed immediately, and you want to delay the task execution for a while, you can set callbackAfterSeconds in the task result while updating the task with the IN_PROGRESS state. This keeps the task hidden for the provided duration, and no worker can poll for it in the meantime.
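As a sketch, that IN_PROGRESS-plus-callbackAfterSeconds update can be expressed as a TaskResult body (field names follow Conductor's task update API; the IDs used below are made up for illustration):

```python
def in_progress_update(workflow_id, task_id, callback_after_seconds, output=None):
    """Build a TaskResult body that postpones the task for callback_after_seconds.

    Posting this to the Conductor server (POST /api/tasks) keeps the task
    hidden from pollers for the given duration.
    """
    return {
        "workflowInstanceId": workflow_id,
        "taskId": task_id,
        "status": "IN_PROGRESS",
        "callbackAfterSeconds": callback_after_seconds,
        "outputData": output or {},
    }

# Example (IDs are illustrative):
body = in_progress_update("wf-123", "task-456", callback_after_seconds=60)
```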

@bharadwajrembar

@kishorebanala From experimentation, we see that responseTimeoutSeconds is not really a heartbeat, but more of a deadline for updating the status of the task. Once responseTimeoutSeconds is reached, the task is considered timed out and failed, and the retry logic kicks in, causing another attempt of the task to be scheduled.

Is there a notion of a heartbeat in Conductor? A way for a worker to update Conductor periodically while processing a task, without specifying callbackAfterSeconds? Because if a worker fails to update the task again with a callbackAfterSeconds, it becomes visible again for polling, in which case it will be picked up by another worker. Now we have 2 workers processing the same task.

@Shole4ever
Author

@kishorebanala essentially, we want what is mentioned here: https://netflix.github.io/conductor/tasklifecycle/#timeout-seconds

But sending an update for a task with IN_PROGRESS does not work as described at that link. On the contrary, sending an IN_PROGRESS update makes the task available for other workers to pick up, which completely defeats the purpose of sending this status update from the worker.

@kishorebanala
Contributor

@bharadwajrembar @Shole4ever A heartbeat without callbackAfterSeconds seems counterintuitive, as the worker would then have to keep informing the Conductor server about its health very frequently. Conductor doesn't have task lease extension today, meaning the worker must respond to the Conductor server before responseTimeoutSeconds. One workaround is to set responseTimeoutSeconds to a larger value and not update the task with the IN_PROGRESS state while the worker is still processing it. By updating the task, a worker gives up its lease and the next worker will pick it up.

Also, the Task's retry policy can be set to TIME_OUT_WF to not reschedule the task on timeouts.
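In task-definition terms, that workaround corresponds to something like the following (a sketch: in Conductor the timeout behavior is controlled by the task definition's timeoutPolicy field, and all names and numbers here are illustrative):

```python
# Illustrative task definition for a long-running task: give the worker the
# full expected processing time before the server considers it unresponsive,
# and time out the workflow rather than rescheduling the task.
long_task_def = {
    "name": "long_running_task",        # hypothetical task name
    "retryCount": 0,                    # no additional attempts
    "timeoutSeconds": 7200,             # hard deadline for the task overall
    "responseTimeoutSeconds": 3600,     # max expected time before the worker responds
    "timeoutPolicy": "TIME_OUT_WF",     # on timeout, fail the workflow instead of retrying
}
```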

@Shole4ever
Author

@kishorebanala The task is resurfaced after callbackAfterSeconds and picked up by another worker. I see 2 issues here:
-> The task is resurfaced with the same taskId. This means there are 2 workers working on exactly the same task.
-> When the task is resurfaced, the retry count is unchanged. This means the number of times a task is picked up by a worker can never be determined [as there will be cases of callbackAfterSeconds as well as the actual responseTimeoutSeconds].

My scenario is: I have a task that takes variable time, say 5 minutes on average, but can go up to 1 hour. Now, I wouldn't want to wait 1 hour to retry every failed task. Hence the use of a heartbeat, with the assurance that if there are any retries, they are traceable [which they aren't in the case of callbackAfterSeconds] (as per my understanding).

@smullins3000

smullins3000 commented Nov 22, 2019

FWIW, it never occurred to me to try to use the task response that way. For long-running jobs, we return IN_PROGRESS in the task result, along with a "callbackAfter" time and a "job ID" placed in the task output. Then, when the task is available to be polled again, the worker that receives it finds the job ID and calls the service doing the processing to check on its status.
The worker is just a kind of proxy between the service and Conductor, not the thing doing the actual work.
If the service expects a callback interface to call instead of being polled, it gets slightly more complicated, but it still basically works the same way.
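A minimal sketch of that proxy-worker pattern (start_job and get_job_status are hypothetical stand-ins for the external service's API; the returned dict mirrors Conductor's TaskResult fields):

```python
def proxy_worker(task, start_job, get_job_status, poll_interval=30):
    """Handle one poll of a long-running task by delegating to an external service.

    On the first poll, kick off the external job and stash its ID in the task
    output; on later polls, look the job up and report its status.
    """
    job_id = task.get("outputData", {}).get("jobId")
    if job_id is None:
        # First time any worker sees this task: start the external job.
        job_id = start_job(task["inputData"])
        return {"status": "IN_PROGRESS",
                "callbackAfterSeconds": poll_interval,
                "outputData": {"jobId": job_id}}
    # Subsequent polls: just check on the external job.
    state = get_job_status(job_id)
    if state == "DONE":
        return {"status": "COMPLETED", "outputData": {"jobId": job_id}}
    if state == "FAILED":
        return {"status": "FAILED", "outputData": {"jobId": job_id}}
    return {"status": "IN_PROGRESS",
            "callbackAfterSeconds": poll_interval,
            "outputData": {"jobId": job_id}}
```

Because any worker can service a later poll, the state that matters lives in the external service and in the task output, not in the worker process.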

@funston

funston commented Nov 22, 2019

> @Shole4ever I'm not sure I understood what you meant by resurfacing the task. But if you mean the task is being polled and executed immediately, and you want to delay the task execution for a while, you can set callbackAfterSeconds in the task result while updating the task with the IN_PROGRESS state. This keeps the task hidden for the provided duration, and no worker can poll for it in the meantime.

I'm doing the following, but as soon as I send this response, another worker picks up that same task, i.e. it isn't hidden:

```python
task['status'] = 'IN_PROGRESS'
task['callbackAfterSeconds'] = 6
```

The task has timeoutSeconds set to 60 and responseTimeoutSeconds set to 10.

Per the original poster, we have the same scenario. A worker grabs a task; it may take a minute or hours. We can't set an accurate timeout on the initial task, so we'd like to extend a heartbeat on the task while we are processing it, and if that heartbeat stops, the task is requeued as TIMED_OUT.

@kishorebanala
Contributor

@Shole4ever

> -> The task is resurfaced with the same taskId. This means there are 2 workers working on exactly the same task.

Correct. In this case, the task lifecycle is not complete yet, so it is not rescheduled with a new task id. The worker working on this task returns it to Conductor, saying it couldn't complete the task. For example, the worker might be waiting on some external dependency (say, a file to be written to S3).

But the 2 workers won't work on the task at the same time. The second worker can poll for this task only after the first worker returns it to Conductor. Conductor expects tasks to be idempotent and promises an at-least-once task scheduling guarantee. Exactly-once will have to be handled in the worker logic.
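The "exactly-once in worker logic" point is usually handled by deduplicating on taskId. A minimal sketch, assuming the store would be a durable shared one (a database or Redis) in a real deployment:

```python
_results = {}  # taskId -> result; use a durable shared store in production

def idempotent_execute(task, do_work):
    """Run do_work at most once per taskId, even if Conductor redelivers the task."""
    key = task["taskId"]
    if key not in _results:
        _results[key] = do_work(task["inputData"])
    return _results[key]
```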

> -> When the task is resurfaced, the retry count is unchanged. This means the number of times a task is picked up by a worker can never be determined [as there will be cases of callbackAfterSeconds as well as the actual responseTimeoutSeconds].

Following up from the above, the task lifecycle determines the retry count. The number of times a task is picked up by a worker can be found from pollCount, which increments every time the task is given to a worker.

> My scenario is: I have a task that takes variable time, say 5 minutes on average, but can go up to 1 hour. Now, I wouldn't want to wait 1 hour to retry every failed task. Hence the use of a heartbeat, with the assurance that if there are any retries, they are traceable [which they aren't in the case of callbackAfterSeconds] (as per my understanding).

Yes, callbackAfterSeconds wouldn't apply for a task that still needs more time to finish its execution. Conductor doesn't have worker lease extension, which would help your scenario. However, if you set your responseTimeoutSeconds to 1 hr (say, the max expected time your task would take to finish), you'd only wait for an hour when the worker processing the task has died, which should happen very rarely.

@kishorebanala
Contributor

@funston LMK if my above comment doesn't answer your question.

@funston

funston commented Nov 26, 2019

I don't see this behavior, or I'm not understanding it. If one worker takes task 1 and returns IN_PROGRESS with a callback time, a second worker does see the task and is able to poll it, work on it, and respond to complete it (I've verified this with some simple code).

I guess I'm confused about what IN_PROGRESS is for. Looking at the code, it seems to just follow the same logic that reschedules the task.

The proposed solution to set a super long timeout won’t work for our needs. We need a heartbeat mechanism

@kishorebanala
Contributor

> If one worker takes task 1 and returns IN_PROGRESS with a callback time, a second worker does see the task and is able to poll it, work on it, and respond to complete it (I've verified this with some simple code).

When worker_1 updates the task with the IN_PROGRESS state, it is giving up its claim on that task. The task will be put back into the task queue, and workers can poll for it after the callback time. This is helpful in cases like the above example of waiting for some external dependency to become available.
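The "put back in the queue, visible again after the callback time" behavior can be modeled with a toy in-memory queue (purely illustrative; this is not Conductor's actual queue implementation):

```python
import time

class CallbackQueue:
    """Toy model: postponed tasks become pollable only after their callback time."""

    def __init__(self):
        self._items = []  # list of (visible_at, task) pairs

    def postpone(self, task, callback_after_seconds, now=None):
        """Re-enqueue a task, hidden until now + callback_after_seconds."""
        now = time.time() if now is None else now
        self._items.append((now + callback_after_seconds, task))

    def poll(self, now=None):
        """Return the first task whose visibility time has passed, else None."""
        now = time.time() if now is None else now
        for i, (visible_at, task) in enumerate(self._items):
            if visible_at <= now:
                del self._items[i]
                return task
        return None
```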

But if the worker is expected to take an hour, and responseTimeout is configured to 5 min, the task would always be timed out. Conductor doesn't have a worker lease extension mechanism today. We shall add it to the roadmap, but if anyone is waiting for it, please feel free to start a conversation leading to a PR.

@Shole4ever
Author

@kishorebanala Why should sending the 'IN_PROGRESS' update be treated as 'giving up the claim on that task'? Rather, it should be the opposite: the worker is claiming the task, saying it is still processing it.

@kishorebanala
Contributor

@Shole4ever This comes back to the task lease extension mechanism, which Conductor doesn't support today. We're considering implementing this in the next major release, but if you'd like it sooner, we'd very much welcome contributions from the community.
