
Durable function fails with TaskCanceledException and never gets retried #2454

Closed
RichardBurns1982 opened this issue Mar 29, 2023 · 21 comments · Fixed by #2456

@RichardBurns1982

RichardBurns1982 commented Mar 29, 2023

We recently upgraded from .NET 6 in-process to .NET 7 isolated functions, and since this upgrade we are seeing intermittent TaskCanceledExceptions across multiple functions doing a variety of different work, so there is no common/consistent cause we can find. The exception never appears to hit our code; it happens before the function is invoked, within the WorkerFunctionInvoker.

One example below but we can provide more:

  • Timestamp: 2023-03-28T09:20:19.6220000Z
  • Function App version: v4
  • Function App name: -
  • Function name(s) (as appropriate): -
  • Invocation ID: c7ea2e1e-1e12-4d0b-ab60-fe3b75a92467
  • Region: Central US

Additional Invocation Ids from last 24 hours on a sandbox we are running:
a0a7c75d-e997-4361-aa3b-a4c5501d0ec7
c6b083bf-60ab-4cb1-9ab6-8ca5f883c388

Stack

System.Threading.Tasks.TaskCanceledException:
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker+<InvokeCore>d__9.MoveNext (Microsoft.Azure.WebJobs.Script, Version=4.16.0.0, Culture=neutral, PublicKeyToken=null: /_/src/WebJobs.Script/Description/Workers/WorkerFunctionInvoker.cs:96)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase+<Invoke>d__24.MoveNext (Microsoft.Azure.WebJobs.Script, Version=4.16.0.0, Culture=neutral, PublicKeyToken=null: /_/src/WebJobs.Script/Description/FunctionInvokerBase.cs:82)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator+<Coerce>d__3`1.MoveNext (Microsoft.Azure.WebJobs.Script, Version=4.16.0.0, Culture=neutral, PublicKeyToken=null: /_/src/WebJobs.Script/Description/FunctionGenerator.cs:225)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2+<InvokeAsync>d__10.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:52)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcMiddleware+<>c__DisplayClass10_0+<<CallOrchestratorAsync>b__0>d.MoveNext (Microsoft.Azure.WebJobs.Extensions.DurableTask, Version=2.0.0.0, Culture=neutral, PublicKeyToken=014045d636e89289: D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\OutOfProcMiddleware.cs:124)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.TriggeredFunctionExecutor`1+<>c__DisplayClass7_0+<<TryExecuteAsync>b__0>d.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\TriggeredFunctionExecutor.cs:50)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<InvokeWithTimeoutAsync>d__33.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:581)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<ExecuteWithWatchersAsync>d__32.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:527)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<ExecuteWithLoggingAsync>d__26.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:306)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<ExecuteWithLoggingAsync>d__26.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:352)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<TryExecuteAsync>d__18.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:108)

Repro steps

We cannot reproduce this consistently; the same function will work without issue 100 times and then, for no apparent reason, we will get a TaskCanceledException.

@liliankasem
Member

liliankasem commented Apr 18, 2023

Moving the conversation back to the issue so we don't lose context. I'll repeat some core points here:

Lilian: Unfortunately, when an invocation request occurs and the platform cancellation token has been signaled to cancel, it's not possible to just get rid of that invocation - that invocation request has still come through and we have to do something with it. In the cancellation scenario, we don't process new invocations and we "cancel" them before we send the work to the worker. Our only options here are to fail or succeed the invocation, as there is no "cancelled" state. We choose "fail" because we never send the invocation to the worker, and a "successful" invocation at this point would be misleading.

The error you're seeing here is not your function failing; it's the platform shutting down, and since there was still an invocation request, we fail the function so that it does not run - we don't actually send any invocation requests to your function.

Richard: Do you know why these are getting cancelled? This problem has plagued us since we moved from .NET 6 in-process to .NET 7 isolated functions. There is nothing in our code that is cancelling anything, and we can find no evidence of any issue within our code that would be causing it. We are intermittently losing orchestration functions mid-flow and the retry logic is not working.

Lilian: Typically, platform cancellations happen when we need to restart or shut down the host, or a function has reached the timeout limit. For example, a scale-down event occurs, a worker gets recycled, or we request to shut down the host, and the platform cancellation token gets cancelled. When this happens, if you have in-flight invocations, you should be able to handle the cancellation token. If you get new invocations during this period, they're not handled.
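
For non-durable functions, handling that token looks roughly like the following sketch (hypothetical names and queue; assumes the Service Bus worker extension and that the worker binds a CancellationToken parameter to the invocation's cancellation token):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Logging;

public class ProcessMessageFunction
{
    private readonly ILogger<ProcessMessageFunction> _logger;

    public ProcessMessageFunction(ILogger<ProcessMessageFunction> logger) => _logger = logger;

    [Function("ProcessMessage")]
    public async Task Run(
        [ServiceBusTrigger("my-queue")] string message, // hypothetical queue name
        CancellationToken cancellationToken)            // signalled when the host shuts down or scales in
    {
        try
        {
            // Flow the token into in-flight work so it can stop promptly on shutdown.
            await DoWorkAsync(message, cancellationToken);
        }
        catch (OperationCanceledException)
        {
            // Host is shutting down: clean up and rethrow so the trigger's normal
            // redelivery/retry behaviour reprocesses the message later.
            _logger.LogWarning("Invocation cancelled by the host; message will be reprocessed.");
            throw;
        }
    }

    private static Task DoWorkAsync(string message, CancellationToken token) =>
        Task.Delay(TimeSpan.FromSeconds(1), token); // placeholder for real work
}
```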

It sounds like you're using Durable Functions, so "if you have in-flight invocations, you should be able to handle the cancellation token" does not apply, as the cancellation token is not supported in Durable:

Jacob: DF isolated does not integrate with the invocation request cancellation of Azure Functions. The DF code does not listen to that cancellation token, nor does anything request cancellation of said token.

Second, this is a little concerning:

We are intermittently losing orchestration functions mid flow and retry logic not working.

This is probably because the host is shutting down, and Durable doesn't support invocation request cancellation (as @jviau mentioned). I think there are two things to be called out here:

  1. The issue of cancelled invocations, i.e. the invocations that are never sent to the worker: typically those errors can be ignored. However, if the functions that are getting cancelled are never retried, this could be a problem. @jviau I wonder how much this might be related to the other issue we have around signaling whether we ever reach customer code, so that extensions can deal with retries properly (#8994)?

  2. "Losing orchestration functions mid flow" - this is a little different, as it's not the same as an invocation never reaching the worker (point 1). This is again probably due to the platform shutting down the host. If it's drain mode, it should allow in-flight invocations to finish. I'm not 100% clear on this issue.

Please open a support ticket for this issue so that we can commit the time to doing a deep dive into the issue(s) you might be experiencing. I should be starting my servicing loop this week so I might be able to pick this up myself.

@RichardBurns1982
Author

@liliankasem Thanks for your help, just to be clear do you mean a support request within our Azure subscription?

@liliankasem
Member

@liliankasem Thanks for your help, just to be clear do you mean a support request within our Azure subscription?

Yup!

@RichardBurns1982
Author

Support Request Id: 2304190050002817

We can provide as many invocation ids as required across multiple environments.

Thanks again for all the responses, it's appreciated.

@jviau
Contributor

jviau commented Apr 19, 2023

@RichardBurns1982 , which version of the Durable Functions worker extension are you using?

"losing orchestration functions mid flow"

And these never start back up? Can you record the instance ID of one of these and call the get instance status API for one of these orchestrations: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-http-api#get-instance-status

I want to know what state they are in.
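
For reference, a rough sketch of checking that state from an isolated app with the durable client (hypothetical names; assumes the Microsoft.DurableTask client SDK, where the single-instance lookup is exposed as GetInstanceAsync - earlier client versions exposed it as GetInstancesAsync):

```csharp
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;

public static class GetOrchestrationStatus
{
    [Function("GetOrchestrationStatus")]
    public static async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = "status/{instanceId}")] HttpRequestData req,
        string instanceId,
        [DurableClient] DurableTaskClient client)
    {
        // Metadata includes RuntimeStatus (Running, Completed, Failed, Terminated, ...)
        // plus created/last-updated timestamps.
        OrchestrationMetadata? metadata = await client.GetInstanceAsync(instanceId);

        HttpResponseData response = req.CreateResponse(metadata is null ? HttpStatusCode.NotFound : HttpStatusCode.OK);
        if (metadata is not null)
        {
            await response.WriteAsJsonAsync(new
            {
                metadata.InstanceId,
                Status = metadata.RuntimeStatus.ToString(),
                metadata.CreatedAt,
                metadata.LastUpdatedAt,
            });
        }

        return response;
    }
}
```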

@liliankasem
Member

@RichardBurns1982 , which version of the Durable Functions worker extension are you using?

"losing orchestration functions mid flow"

And these never start back up? Can you record the instance ID of one of these and call the get instance status API for one of these orchestrations: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-http-api#get-instance-status

I want to know what state they are in.

It would also be helpful to get the type of trigger that is used by "***ShippingOrchestrationFunction"

@RichardBurns1982
Author

@RichardBurns1982 , which version of the Durable Functions worker extension are you using?

"losing orchestration functions mid flow"

And these never start back up? Can you record the instance ID of one of these and call the get instance status API for one of these orchestrations: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-http-api#get-instance-status
I want to know what state they are in.

It would also be helpful to get the type of trigger that is used by "***ShippingOrchestrationFunction"

Hello

Apologies for the slow reply!

We have been keeping pace with the latest versions as released. We are currently on:

Microsoft.Azure.Functions.Worker - 1.13.0
Microsoft.Azure.Functions.Worker.Extensions.DurableTask - 1.0.2

***ShippingOrchestrationFunction is an orchestration instance which is started by a Timer trigger.

These orchestration functions work a queue down, so we aren't too concerned about them failing: the next timer trigger will get things back up and running again and continue to work the queue down. Because of this, we try/catch at the orchestration function in this instance, and even in the event of an exception we handle it and let the orchestration function complete successfully, as we know the timer trigger will start a new instance if one isn't already running.
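
Roughly, the pattern looks like this sketch (hypothetical names and schedule; assumes the .NET isolated Timer and Durable Task worker extensions):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;
using Microsoft.Extensions.Logging;

public static class ReconcileFunctions
{
    [Function(nameof(ReconcileTimerStart))]
    public static async Task ReconcileTimerStart(
        [TimerTrigger("0 0 2 * * *")] TimerInfo timer,   // hypothetical schedule (02:00 daily)
        [DurableClient] DurableTaskClient client)
    {
        // Each run starts a fresh instance; a previous failed/abandoned run is simply superseded.
        await client.ScheduleNewOrchestrationInstanceAsync(nameof(ReconcileOrchestrator));
    }

    [Function(nameof(ReconcileOrchestrator))]
    public static async Task ReconcileOrchestrator(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        ILogger logger = context.CreateReplaySafeLogger(nameof(ReconcileOrchestrator));
        try
        {
            // Hypothetical activity that works the persisted queue down.
            await context.CallActivityAsync("WorkQueueDown");
        }
        catch (Exception ex)
        {
            // Handle and log, but complete successfully: the next timer-started instance
            // picks the queue back up on its next run.
            logger.LogError(ex, "Reconcile run failed; the next timer-started instance will continue.");
        }
    }
}
```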

In the example details below, the orchestration instance has a status of Failed, which can't happen from our code, as we handle and log exceptions rather than letting them bubble up and fail the orchestration instance.

Hopefully this helps to track down the problem, as something before our orchestration function is invoked has failed the durable task.

App Insights:
InvocationId: 1e02dcd1-36e3-4bf3-b154-41acec35ed95
Exception type: System.Threading.Tasks.TaskCanceledException
FormattedMessage: Executed 'Functions.ReconcileTransactionLocationsShippingOrchestrationFunction' (Failed, Id=1e02dcd1-36e3-4bf3-b154-41acec35ed95, Duration=24ms)

Associated Table Storage Durable Function:
Partition Key: 74e3409e5dec5610b457dbfd1a25691d
ExecutionId: 16cb566b7a164ca3b656c26c393bf1e3
Created Time: 2023-04-23T02:00:00.4598939Z
Runtime Status: Failed
Output:

{"message":"Function 'ReconcileTransactionLocationsShippingOrchestrationFunction' failed with an unhandled exception.","details":"System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n   at Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker.InvokeCore(Object[] parameters, FunctionInvocationContext context) in /_/src/WebJobs.Script/Description/Workers/WorkerFunctionInvoker.cs:line 96\r\n   at Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase.Invoke(Object[] parameters) in /_/src/WebJobs.Script/Description/FunctionInvokerBase.cs:line 82\r\n   at Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator.Coerce[T](Task`1 src) in /_/src/WebJobs.Script/Description/FunctionGenerator.cs:line 225\r\n   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionInvoker.cs:line 52\r\n   at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcMiddleware.<>c__DisplayClass10_0.<<CallOrchestratorAsync>b__0>d.MoveNext() in D:\\a\\_work\\1\\s\\src\\WebJobs.Extensions.DurableTask\\OutOfProcMiddleware.cs:line 124\r\n--- End of stack trace from previous location ---\r\n   at Microsoft.Azure.WebJobs.Host.Executors.TriggeredFunctionExecutor`1.<>c__DisplayClass7_0.<<TryExecuteAsync>b__0>d.MoveNext() in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\TriggeredFunctionExecutor.cs:line 50\r\n--- End of stack trace from previous location ---\r\n   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeWithTimeoutAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 581\r\n   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 527\r\n   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 306\r\n   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 352\r\n   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance, CancellationToken cancellationToken) in C:\\projects\\azure-webjobs-sdk-rqm4t\\src\\Microsoft.Azure.WebJobs.Host\\Executors\\FunctionExecutor.cs:line 108"}

@liliankasem
Member

(Still haven't received the support case so will respond here for now)

Why is the host shutting down:

Scheduled shutdown of a worker instance, or scaling; this leads to host shutdown and potential cancellation of invocations (by design, these never reach worker or user code). This is normal behaviour; there's nothing out of place here.

Why are we "losing orchestration functions mid flow and the retry logic isn't working":

TL;DR: Because DF manages ongoing invocations independently of the function host, when the host cancels an invocation and fails it (without sending it to the worker), DF considers the invocation complete and does not retry it (even though that function invocation never actually executed).

  • The orchestrator function gets triggered and executes; as part of that, an activity function gets invoked.
  • We then get a shutdown request, so we wait for this activity function to finish.
  • We wait for the activity invocation, but the orchestrator function itself doesn't have an active invocation, so we don't consider it active; Durable Functions manages its own notion of "ongoing invocation" that is separate from the host.
  • When the activity function finishes, we flag all invocations as complete and continue to shut down the host fully.
  • Then, the orchestrator gets triggered again (because the activity finished); however, this invocation request happens after we have begun full shutdown, so it gets rejected - we cancel it and don't invoke worker or user code.

[Screenshot attached: 2023-04-24 at 1:52:30 PM]

Next steps:

We are going to try to repro this issue and work with the durable functions team to determine if this is a bug on their side or if we need to design some host changes to help durable handle cancellations so they can retry these invocations later on a different host instance than the one that is shutting down.

@liliankasem liliankasem changed the title System.Threading.Tasks.TaskCanceledException Durable function fails with TaskCanceledException and never gets retried Apr 24, 2023
@RichardBurns1982
Author

(Still haven't received the support case so will respond here for now)

Why is the host shutting down:

Scheduled shutdown of a worker instance, or scaling; this leads to host shutdown and potential cancellation of invocations (by design, these never reach worker or user code). This is normal behaviour; there's nothing out of place here.

Why are we "losing orchestration functions mid flow and the retry logic isn't working":

TL;DR: Because DF manages ongoing invocations independently of the function host, when the host cancels an invocation and fails it (without sending it to the worker), DF considers the invocation complete and does not retry it (even though that function invocation never actually executed).

  • The orchestrator function gets triggered and executes; as part of that, an activity function gets invoked.
  • We then get a shutdown request, so we wait for this activity function to finish.
  • We wait for the activity invocation, but the orchestrator function itself doesn't have an active invocation, so we don't consider it active; Durable Functions manages its own notion of "ongoing invocation" that is separate from the host.
  • When the activity function finishes, we flag all invocations as complete and continue to shut down the host fully.
  • Then, the orchestrator gets triggered again (because the activity finished); however, this invocation request happens after we have begun full shutdown, so it gets rejected - we cancel it and don't invoke worker or user code.

[Screenshot attached: 2023-04-24 at 1:52:30 PM]

Next steps:

We are going to try to repro this issue and work with the durable functions team to determine if this is a bug on their side or if we need to design some host changes to help durable handle cancellations so they can retry these invocations later on a different host instance than the one that is shutting down.

Thanks for the detailed response!

For most of our durable functions this isn't a problem, as we are primarily using them to work down a persisted queue or generate denormalized data, so we can run them again without issue. There are a few exceptions where this isn't true and it is causing us a problem, but we can work around it in the short term.

Regarding the TaskCanceledException, my only concern is that these errors are appearing in App Insights, which is what led us to investigate these issues.

The two scenarios that are affecting us are:

  1. When we get a TaskCanceledException on a standard function or on the durable orchestration entry point. These we can't recover from, as it never hits our code, but we are seeing the errors in App Insights. It sounds like this isn't an issue for standard functions: while we see the error in App Insights, they are retried by the host and do run. Not so for Durable, where the instance entry in table storage is being failed.
  2. When we get a TaskCanceledException on an activity, it bubbles up to the orchestration function. We can try to handle this error and retry the activity (see the retry sketch after this list), but that would be a lot of manual code in every orchestration function we have.
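
The retry sketch referenced above - a rough illustration (hypothetical names) of the isolated Durable SDK's built-in per-activity retry policy, which avoids hand-written try/catch but still has to be applied to each activity call:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class ShippingOrchestration
{
    [Function(nameof(RunShippingOrchestration))]
    public static async Task RunShippingOrchestration(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // Retry the activity a few times with backoff instead of hand-writing try/catch
        // around every call; transient activity failures are re-attempted automatically.
        TaskOptions retryOptions = TaskOptions.FromRetryPolicy(new RetryPolicy(
            maxNumberOfAttempts: 3,
            firstRetryInterval: TimeSpan.FromSeconds(10),
            backoffCoefficient: 2.0));

        await context.CallActivityAsync("ReconcileShipping", options: retryOptions); // hypothetical activity
    }
}
```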

Ideally, if the TaskCanceledException is happening internally in the host code, we'd prefer never to see it or have to handle it, because when it occurs and is logged in App Insights we investigate, triage, etc. I'd be interested to know whether you think we should factor this exception scenario in and specifically code for it when it happens on an activity, or whether this is an error we should never be seeing in App Insights or handling ourselves.

Many thanks

Richard

@apringle

Hi, we are upgrading two large durable function projects (5k+ orchestrations, 150k+ function executions total daily) from in-process .NET Core 3.1 to isolated .NET 7.

We believe we are hitting this issue in our dev environments while testing and are investigating further. Happy to provide further info if needed. Thanks

@jviau
Contributor

jviau commented Apr 25, 2023

I believe I understand what is going on. This appears to be a bug with the durable extension - well, a misunderstanding of how function triggering works. We were relying on the invocation cancellation exception to bubble out, where we would catch it and then abort the invocation. However, it does not appear to bubble out; instead it surfaces as a function invocation result with a failed status, so we assume the invocation did finish.

@liliankasem something that we can consider is to review this behavior. We can sync offline and I'll go over what I have found.

In the meantime, I am going to transfer this issue to the durable extension repo and address it there.

@ghost ghost added the Needs: Triage 🔍 label Apr 25, 2023
@jviau jviau transferred this issue from Azure/azure-functions-host Apr 25, 2023
@jviau jviau self-assigned this Apr 25, 2023
@jviau
Contributor

jviau commented Apr 26, 2023

Addressed, pending release of 2.9.5

@apringle

apringle commented May 9, 2023

Addressed, pending release of 2.9.5

Thank you. Is there any timeframe or plan for this release? Our upgrades are currently blocked pending this fix.

@RichardBurns1982
Author

RichardBurns1982 commented May 11, 2023

Addressed, pending release of 2.9.5

Thanks @jviau. Does this need fixing in Microsoft.Azure.Functions.Worker.Extensions.DurableTask, as we are running isolated, which is where we have encountered this problem?

This is having a pretty significant impact on us: today, out of 156k function executions, 6.07k failed with TaskCanceledException.

@jviau
Contributor

jviau commented May 11, 2023

@RichardBurns1982 - yes, this fix is for Java and dotnet isolated durable extensions. We will be aiming to release soon here.

@RichardBurns1982
Author

RichardBurns1982 commented May 23, 2023

@RichardBurns1982 - yes, this fix is for Java and dotnet isolated durable extensions. We will be aiming to release soon here.

I see that 2.9.5 of Microsoft.Azure.WebJobs.Extensions.DurableTask was released yesterday. Do you know when you will be releasing the Microsoft.Azure.Functions.Worker.Extensions.DurableTask update that will include this fix for isolated?

We are still seeing a lot of TaskCanceledExceptions; as mentioned in the original post, they are not always on durable orchestration functions. We have seen them on our CosmosDbTrigger and ServiceBusTrigger functions as well, so while this may reduce them, I am not certain it will completely eliminate the TaskCanceledExceptions we are seeing.

@jviau
Contributor

jviau commented Jun 22, 2023

I see that 2.9.5 of Microsoft.Azure.WebJobs.Extensions.DurableTask was released yesterday. Do you know when you will be releasing the Microsoft.Azure.Functions.Worker.Extensions.DurableTask update that will include this fix for isolated?

No release is needed. Rebuilding your worker app should pick up the newer version of Microsoft.Azure.WebJobs.Extensions.DurableTask.

@RichardBurns1982
Author

RichardBurns1982 commented Jun 22, 2023

Thanks @jviau

I might need a little help understanding that. We are not referencing Microsoft.Azure.WebJobs.Extensions.DurableTask; we are only referencing Microsoft.Azure.Functions.Worker.Extensions.DurableTask in isolated, and that hasn't had a release since 1.0.2 (April 7th, 2023). I'm not seeing any dependencies on Microsoft.Azure.WebJobs.Extensions.DurableTask.

As I understand it, Microsoft.Azure.WebJobs.Extensions.DurableTask is for the legacy in-process model, which I believe is soon to be deprecated with .NET 8.

We've moved all our functions to isolated, removed all the legacy in-process packages, and are referencing the new ones such as Microsoft.Azure.Functions.Worker.Extensions.DurableTask. This is where we are seeing the problem; we never saw it in-process on .NET Core/.NET 5/6.

Many thanks.

@nhuurnink

We have the same problem, also using Microsoft.Azure.Functions.Worker.Extensions.DurableTask

Will there be an upgrade for this extension as well?

@jviau
Contributor

jviau commented Jul 5, 2023

The upgrade experience does make it look like you aren't doing anything, but dotnet isolated function apps have a build target which resolves the WebJobs (host) extensions. In this case, the package Microsoft.Azure.Functions.Worker.Extensions.DurableTask has an attribute which tells the build task to pull in Microsoft.Azure.WebJobs.Extensions.DurableTask/2.9.*. Since we use a * in the version, simply building your app again should trigger a restore of the latest version.

So getting the new fix should be as simple as re-building and deploying your app.
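
Conceptually, the mechanism looks like the following (illustrative only; the attribute usage and version string below reflect this comment, not the actual package source):

```csharp
// Inside the Microsoft.Azure.Functions.Worker.Extensions.DurableTask assembly (illustrative):
// the Functions SDK build target reads this assembly-level attribute and restores the
// matching WebJobs (host) extension package at build time, so a rebuild picks up 2.9.5+.
using Microsoft.Azure.Functions.Worker.Extensions.Abstractions;

[assembly: ExtensionInformation("Microsoft.Azure.WebJobs.Extensions.DurableTask", "2.9.*")]
```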

@RichardBurns1982
Author

The upgrade experience does make it look like you aren't doing anything, but dotnet isolated function apps have a build target which resolves the WebJobs (host) extensions. In this case, the package Microsoft.Azure.Functions.Worker.Extensions.DurableTask has an attribute which tells the build task to pull in Microsoft.Azure.WebJobs.Extensions.DurableTask/2.9.*. Since we use a * in the version, simply building your app again should trigger a restore of the latest version.

So getting the new fix should be as simple as re-building and deploying your app.

Thank you @jviau, I did not know that and will keep an eye on it in future.

I will raise a separate bug for the TaskCanceledException as, regardless of Durable, this is still a problem filling up our App Insights errors.
