Durable function fails with TaskCanceledException and never gets retried (#2454)
Moving the conversation back to the issue so we don't lose context. I'll repeat some core points here:
The error you're seeing here is not your function failing; it's the platform shutting down. Since there was still an invocation request in flight, we fail the function so that it does not run; we don't actually send any invocation requests to your function.
It sounds like you're using durable functions, so "if you have inflight invocations, you should be able to handle the cancellation token" does not apply, as the cancellation token is not supported in durable orchestrations.
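To illustrate the distinction (a hedged sketch with hypothetical function names, using the isolated-worker Microsoft.DurableTask types): an ordinary trigger can accept the host's CancellationToken as a parameter, but an orchestration trigger has no equivalent, since Durable Functions manages its invocations through replay and checkpointing rather than through the host's cancellation mechanism.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public class Examples
{
    // Ordinary trigger: the host-supplied CancellationToken fires when the
    // host begins shutting down, so in-flight work can stop gracefully.
    [Function("QueueWorker")]
    public async Task RunQueueWorker(
        [QueueTrigger("work-items")] string item,
        CancellationToken cancellationToken)
    {
        await Task.Delay(100, cancellationToken); // placeholder for real work
    }

    // Orchestration trigger: there is no cancellation token parameter here;
    // the host's token is not surfaced to durable orchestrations.
    [Function("ShippingOrchestration")]
    public async Task RunOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        await context.CallActivityAsync("ProcessShipment");
    }
}
```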
Second, this is a little concerning:
This is probably because the host is shutting down, and durable doesn't support cancellation of the invocation request (as @jviau mentioned). I think there are two things to be called out here:
Please open a support ticket for this issue so that we can commit the time to do a deep dive into the issue(s) you might be experiencing. I should be starting my servicing loop this week, so I might be able to pick this up myself.
@liliankasem Thanks for your help. Just to be clear, do you mean a support request within our Azure subscription?
Yup!
Support Request Id: 2304190050002817. We can provide as many invocation ids as required across multiple environments. Thanks again for all the responses, it's appreciated.
@RichardBurns1982 , which version of the Durable Functions worker extension are you using?
And these never start back up? Can you record the instance ID of one of these orchestrations and call the get instance status API (https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-http-api#get-instance-status)? I want to know what state they are in.
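For reference, a sketch of calling that API with curl; the app name, instance ID, and key below are placeholders to fill in:

```shell
# Query the Durable Functions HTTP API for one stuck orchestration.
# Replace <app-name>, <instance-id>, and <system-key> with real values.
curl "https://<app-name>.azurewebsites.net/runtime/webhooks/durabletask/instances/<instance-id>?code=<system-key>"

# The runtimeStatus field of the JSON response reports states such as
# Pending, Running, Completed, Failed, or Terminated.
```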
It would also be helpful to know the type of trigger used by "***ShippingOrchestrationFunction".
Hello, apologies for the slow reply! We have been keeping pace with the latest version as released; we are currently on Microsoft.Azure.Functions.Worker 1.13.0.

***ShippingOrchestrationFunction is an orchestration instance which is started by a Timer trigger. These orchestration functions run a queue down, so we aren't too concerned about these failing: the next timer trigger will get it back up and running again and continue to work the queue down. Because of this, we try/catch at the orchestration function in this instance, and even in the event of an exception we handle it and let the orchestration function complete successfully, as we know a new instance will be started from our timer trigger if one isn't running.

In the example details below, the orchestration instance has a status of Failed, which can't happen from our code, as we are handling and logging exceptions but not letting them bubble up and fail the orchestration instance. Hopefully this helps to track down the problem: something before our orchestration function is invoked has failed the durable task.

App Insights: [screenshot]

Associated Table Storage Durable Function: [screenshot]
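The timer-restarts-the-orchestration pattern described above could be sketched like this in the isolated model (hypothetical names, schedule, and instance-ID scheme; this is not the poster's actual code):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public class ShippingTimer
{
    // Every 5 minutes, start a new queue-draining orchestration unless one
    // is already running; a Failed instance is simply replaced, so a lost
    // or failed run is recovered on the next timer tick.
    [Function("ShippingQueueTimer")]
    public async Task Run(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer,
        [DurableClient] DurableTaskClient client)
    {
        const string instanceId = "shipping-queue-singleton";
        OrchestrationMetadata? existing =
            await client.GetInstanceAsync(instanceId);

        if (existing is null ||
            existing.RuntimeStatus is OrchestrationRuntimeStatus.Completed
                or OrchestrationRuntimeStatus.Failed
                or OrchestrationRuntimeStatus.Terminated)
        {
            await client.ScheduleNewOrchestrationInstanceAsync(
                "ShippingOrchestrationFunction",
                options: new StartOrchestrationOptions(InstanceId: instanceId));
        }
    }
}
```

Using a fixed instance ID keeps at most one drainer alive per queue; whether that suits a given workload depends on how the queue is partitioned.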
Why is the host shutting down: scheduled shut down of the worker instance, or scaling; this leads to host shut down and potential cancellation of invocations (these, intentionally by design, never reach worker or user code). This is normal behaviour; there's nothing out of place here.

Why are we "losing orchestration functions mid flow and the retry logic isn't working": TL;DR: because DF manages ongoing invocations independently of the function host, when the host cancels an invocation and fails it (without sending it to the worker), DF considers the invocation complete and does not retry it (even though that function invocation never actually executed).
Next steps: we are going to try to repro this issue and work with the durable functions team to determine if this is a bug on their side, or if we need to design some host changes to help durable handle cancellations so they can retry these invocations later on a different host instance than the one that is shutting down.
Thanks for the detailed response! For most of our durable functions this isn't a problem, as we are primarily using them to work down a persisted queue or generate denormalized data, so we can run them again without issue. There are a few exceptions where this isn't true and is causing us a problem, but we can work around it in the short term. Regarding the TaskCanceledException, my only concern is that these errors are appearing in App Insights, which is what led us to investigate these issues. The two scenarios that are affecting us are:
Ideally, if the TaskCanceledException is happening internally in the host code, we'd prefer never to see it or have to handle it: when it occurs and is logged in App Insights, we investigate, triage, etc. I'd be interested to know whether you think we should be factoring this exception scenario in and specifically coding for it if it happens on an Activity, or whether this is an error we should never be seeing in App Insights or handling ourselves. Many thanks, Richard
Hi, we are upgrading two large durable function projects (5k+ orchestrations, 150k+ function executions total daily) from in-process 3.1 to isolated 7. We believe we are hitting this issue in our dev environments while testing and are investigating further. Happy to provide further info if needed. Thanks
I believe I understand what is going on. This appears to be a bug with the durable extension; or rather, a misunderstanding of how function triggering works. We were relying on the invocation cancellation exception to bubble out, which we would catch and then abort the invocation. However, it does not appear to bubble out, and instead surfaces as a function invocation result with a failed status, so we assume the invocation did finish. @liliankasem, something we can consider is to review this behavior; we can sync offline and I'll go over what I have found. In the meantime, I am going to transfer this issue to the durable extension repo and address it there.
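Conceptually (a hedged sketch of the behavior being described, with hypothetical type and method names; this is not the actual extension code), the mismatch is that a platform cancellation arrives as an ordinary failed invocation result rather than as an exception, so the extension needs to inspect the failure before treating the work item as finished:

```csharp
using System.Threading.Tasks;

// Hypothetical types throughout; this only illustrates the distinction
// between a platform cancellation and a genuine user-code failure.
public class InvocationResultHandler
{
    public async Task HandleAsync(InvocationResult result, WorkItem workItem)
    {
        if (result.Failed && result.Exception is TaskCanceledException)
        {
            // The host shut down before the invocation ever reached user
            // code: abandon the work item so it is redelivered and retried
            // on another host instance.
            await workItem.AbandonAsync();
        }
        else
        {
            // Genuine completion (success, or a real user-code failure):
            // record the outcome and finish the work item.
            await workItem.CompleteAsync(result);
        }
    }
}
```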
Addressed, pending release of 2.9.5
Thank you, is there any timeframe or plan for this release? Our upgrades are currently blocked pending this fix.
Thanks @jviau, does this need fixing in Microsoft.Azure.Functions.Worker.Extensions.DurableTask? We are running in isolated, which is where we have encountered this problem. This is having a pretty significant impact on us: today, out of 156k function executions, 6.07k failed with TaskCanceledException.
@RichardBurns1982 - yes, this fix is for the Java and dotnet isolated durable extensions. We will be aiming to release soon here.
I see that 2.9.5 of Microsoft.Azure.WebJobs.Extensions.DurableTask was released yesterday; do you know when you will be releasing the Microsoft.Azure.Functions.Worker.Extensions.DurableTask update which will include this fix for isolated? We are still seeing a lot of TaskCanceledExceptions. As mentioned in the original post, they are not always on durable orchestration functions; we have seen them on our CosmosDbTrigger and ServiceBusTrigger functions as well, so while this may reduce them, I am not certain it will completely eliminate the TaskCanceledExceptions we are seeing.
No release is needed. Rebuilding your worker app should pick up the newer version of the webjobs extension.
Thanks @jviau, I might need a little help understanding that. We are not referencing Microsoft.Azure.WebJobs.Extensions.DurableTask; we are only referencing Microsoft.Azure.Functions.Worker.Extensions.DurableTask in isolated, which hasn't had a release since 1.0.2 (April 7th 2023), and I'm not seeing any dependencies on Microsoft.Azure.WebJobs.Extensions.DurableTask. As I understand it, Microsoft.Azure.WebJobs.Extensions.DurableTask is for the legacy in-process model, which I believe is soon to be deprecated with .NET 8. We've moved all our functions to isolated, removed all the legacy in-process packages, and are referencing the new ones such as Microsoft.Azure.Functions.Worker.Extensions.DurableTask. This is where we are seeing the problem; we never saw it in-process on Core/.NET 5/6. Many thanks.
We have the same problem, also using Microsoft.Azure.Functions.Worker.Extensions.DurableTask. Will there be an upgrade for this extension as well?
The upgrade experience does make it look like you aren't doing anything, but dotnet isolated function apps have a build target which resolves the webjobs (host) extensions. In this case, that build target will resolve the newer Microsoft.Azure.WebJobs.Extensions.DurableTask package, so getting the new fix should be as simple as re-building and deploying your app.
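A quick way to check that a rebuild actually picked up the fix might be the following (a hedged sketch: the exact obj/ layout of the generated extensions project is an assumption and varies by SDK version, configuration, and target framework):

```shell
# Rebuild from scratch so the functions build target re-resolves the
# host (webjobs) extension packages.
dotnet clean
dotnet build

# The isolated-worker build generates an extensions project under obj/;
# search its output for the resolved DurableTask host extension version.
grep -rn "Microsoft.Azure.WebJobs.Extensions.DurableTask" obj/ | head
```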
Thank you @jviau, I did not know that and will keep an eye on the code in future. I will raise a separate bug for the TaskCanceledException, as regardless of Durable this is still a problem filling up our App Insights errors.
We have recently upgraded from .NET 6 in-process to .NET 7 isolated functions, and since this upgrade we are seeing intermittent TaskCanceledExceptions across multiple functions doing a variety of different work, with no common/consistent reason we can find. It does not appear to ever hit our code, and the exception is happening before the function is invoked, within the WorkerFunctionInvoker.
One example below, but we can provide more:
Additional Invocation Ids from last 24 hours on a sandbox we are running:
a0a7c75d-e997-4361-aa3b-a4c5501d0ec7
c6b083bf-60ab-4cb1-9ab6-8ca5f883c388
Stack
Repro steps
We cannot reproduce this consistently; the same function will work without issue 100 times, then for no apparent reason we will get a TaskCanceledException.