Retrying tasks wait to retry in the TaskRunner, blocking other tasks from running #3516
Comments
This makes a lot of sense! I'm not sure off the top of my head why it's behaving that way, but I'll try to dig into it soon.
I am unable to reproduce this. Do you have working code?
I can send a minimum viable example later today.
I also failed to reproduce this with the given code (added […])
This has to do with how retrying is currently implemented in Prefect. If a task retries and the retry time is < 10 min (hardcoded), the retry wait will be handled in the TaskRunner.
Some context: if we don't wait within the Task Runner, then the retry delay isn't really being respected (as the entire graph will need to be visited before returning to the retrying task, which generally takes time). When running against a Prefect backend, if we don't retry within the runner, then that will result in a new Agent job submission at the retry time, which gets really inefficient as the number of retries grows.
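To make the trade-off above concrete, here is a toy model of the two strategies. This is not Prefect's actual code; `run_blocking` and `run_requeue` are hypothetical names, and "failure" is simulated simply as a task needing more than one attempt:

```python
import time

def run_blocking(tasks, retry_delay):
    """Retry by sleeping in place: every retry delay blocks the whole
    runner, so tasks queued behind the failing one must wait."""
    finished = []
    for name, attempts_needed in tasks:
        for attempt in range(attempts_needed):
            if attempt:  # this attempt is a retry: sleep first, blocking everything
                time.sleep(retry_delay)
        finished.append(name)
    return finished

def run_requeue(tasks, retry_delay):
    """Retry by pushing the failing task to the back of the queue with a
    'ready at' timestamp: other tasks run while the delay elapses."""
    queue = [(name, attempts, 0.0) for name, attempts in tasks]
    finished = []
    while queue:
        name, attempts_left, ready_at = queue.pop(0)
        if time.monotonic() < ready_at:
            queue.append((name, attempts_left, ready_at))  # not ready yet, rotate
            continue
        attempts_left -= 1  # one attempt consumed
        if attempts_left > 0:
            # attempt "failed": schedule the retry instead of sleeping in place
            queue.append((name, attempts_left, time.monotonic() + retry_delay))
        else:
            finished.append(name)
    return finished
```

With tasks `[("a", 2), ("b", 1), ("c", 1)]`, the blocking runner finishes in order `a, b, c` after stalling on `a`'s retry, while the re-queueing runner finishes `b` and `c` first and comes back to `a` once its delay has elapsed. The trade-off the maintainers describe is that the second strategy requires revisiting the graph (or, against a backend, resubmitting work) to pick the retry back up.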
Right. The retry delay is more of a minimum bound, though: we can't guarantee that the task will retry exactly after that delay, since that's up to the OS/runtime and how threads are scheduled. Handling it at the […] We could do a quick fix for this […]
I don't see any benefit in changing the current behavior, especially if all solutions involve increasing complexity.
I definitely don't see a benefit to the more complicated solution. The […]
Do you still need code to reproduce? I still think there is a lot of added value here. Sometimes when requesting multiple URLs (say hundreds or thousands or more), if one fails (e.g. the endpoint is not yet available and will be some minutes or hours later), it makes sense to keep requesting the rest of the URLs in the meantime; or at least to have the option to specify such behaviour (that would actually be ideal) and keep the current behaviour as the default.
@snenkov It may be useful if you provided the code for future reference, but per the discussion above we're not going to be able to address this right now; the implications are too complex. I'll leave this issue open and hopefully we can address it down the road.
We also run flows with mapped tasks that frequently comprise thousands of task runs, so this creates significant delays and costs on our side as well. I don't have a deep understanding of Prefect yet, but here is a possible conceptual solution from my perspective: […]
This issue is stale because it has been open for 30 days with no activity. To keep this issue open, remove the stale label or comment.
This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add, feel free to re-open it.
Current behavior
Let's say we have a task like this (pseudo code below):
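The reporter's original snippet was not preserved in this extract; a minimal sketch of such a task in Prefect 1.x-style pseudocode might look like the following (the `fetch` name, the 3 retries, and the 5-minute delay are illustrative assumptions, not the reporter's exact code):

```python
@task(max_retries=3, retry_delay=timedelta(minutes=5))
def fetch(url):
    # request the URL; an exception here triggers Prefect's retry handling
    return requests.get(url).text
```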
If we execute this Flow like so:
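The flow snippet was also lost in this extract; one plausible shape, again as Prefect 1.x-style pseudocode, assuming a mapped task named `fetch` and a list `urls` (both names are illustrative):

```python
with Flow("fetch-urls") as flow:
    results = fetch.map(urls)  # hundreds or thousands of mapped runs

flow.run()
```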
Then if one of the requests fails and the task waits 5 minutes to be retried, none of the other URLs mapped to this task are executed. Meanwhile, the worker sits idle.
Proposed behavior
Ideally, I believe that when a single job in the mapped task fails, the worker should move on to the next job while waiting for the retry.