Orchestration instance started stuck in pending status #342

Closed
pluralmonad opened this Issue Jun 12, 2018 · 26 comments

9 participants
@pluralmonad

pluralmonad commented Jun 12, 2018

I have a timer that fires every 6 hours. It triggers an orchestration starter function that checks for an existing instance and starts, terminates, or restarts the orchestration as needed.

Recently, the trigger log output for each trigger event has indicated that an existing instance was found in Pending status, with a last update time equal to its start time (6 hours earlier, at the previous trigger event).

Here is a representative trigger function output.

Triggered by <Timer>
Getting orchestrator state from table storage...
Found LastKnownInstanceID: b9ec9716-2768-4775-a84b-522440fa6240
Existing instance found with name Orchestrator
Started at 6/12/2018 12:00:01 PM
Last updated at 6/12/2018 12:00:01 PM
Current status: Pending
Termination requested for reason: <Existing instance passed time limit>
Terminating existing instance with instanceID <b9ec9716-2768-4775-a84b-522440fa6240>...
Instance terminated
Starting new instance of orchestrator <Orchestrator> with instanceID <7c72ba4c-6459-4fae-84c2-23ba242ceda4>...
Instance started
Updating orchestrator state in table storage...
Orchestrator state updated

Any ideas why it might never leave pending status?

I recently had a feral orchestration persist in a retry-and-fail state for an extended period of time. Might I need to do some cleanup in the DTFx control mechanisms in storage?

This application is running in the Central US region.
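For context, the starter pattern described above (check the last known instance, terminate it if it has exceeded a time limit, then start a fresh one) can be sketched roughly as follows. This is an illustrative Python sketch of the decision logic only, not the actual C# starter function; the status names mirror Durable Functions runtime statuses, and the 6-hour limit is assumed from the trigger interval.

```python
from datetime import datetime, timedelta

# Statuses that mean the previous run is still alive and may need terminating.
ACTIVE_STATUSES = {"Pending", "Running"}

def decide_action(status, started_at, now, time_limit=timedelta(hours=6)):
    """Return what the starter should do with the existing instance.

    status     -- runtime status string of the last known instance, or None
    started_at -- datetime the instance was created (ignored if status is None)
    """
    if status is None:
        return "start"                      # no previous instance recorded
    if status in ACTIVE_STATUSES:
        if now - started_at >= time_limit:
            return "terminate_and_restart"  # e.g. the stuck-in-Pending case
        return "leave_running"
    return "start"                          # Completed / Failed / Terminated

# Example: an instance stuck in Pending since the previous 6-hour trigger.
action = decide_action(
    "Pending",
    datetime(2018, 6, 12, 12, 0, 1),
    datetime(2018, 6, 12, 18, 0, 1),
)
# action == "terminate_and_restart"
```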

@pluralmonad

pluralmonad commented Jun 12, 2018

After looking over the execution history a little more, it appears that the last time the orchestration was successfully started, one of its sub-orchestrations never returned. Or maybe the parent never picked up again after the sub completed.

The parent orchestration had instance ID 29a352c4-aec5-4154-add8-fc6ba4b977db.

It calls two subs. The execution history and output indicate that the first sub's final activity completed successfully (and finished its work correctly). But the sub-orchestration never returns to the parent, and the second sub is never executed within the parent.

@cgillum

Collaborator

cgillum commented Jun 12, 2018

Thanks. Two questions:

  1. What is a "feral" orchestration that you referenced previously?
  2. What's the relationship between the pending instance you mentioned originally and the instance that had a non-completing sub-orchestration? Are these related, or are we describing two separate issues?
@pluralmonad

pluralmonad commented Jun 12, 2018

I meant a misbehaving durable task function, mentioned here. I manually cleared a control queue in order to kill the fail/retry cycle. I do not know the internals of the framework and wondered if this might have created a bad state with ripple effects into what I am seeing now.

The instances are not directly related. They reference the same orchestration; the one with the non-completing sub is from the last time the job was triggered, and it did not sit in Pending status indefinitely.

@pluralmonad

pluralmonad commented Jun 12, 2018

After restarting the app service, the orchestration most recently started began executing and ran to completion.

@cgillum

Collaborator

cgillum commented Jun 12, 2018

Clearing control queues is generally fine. It will cause all pending instances to stop processing, but shouldn't otherwise put anything into a bad state.

Glad to know you were able to restart the function app to mitigate the stuck instance. I would still like to investigate the cause; it seems something must have gotten stuck in memory. Is this the same 29a352c4-aec5-4154-add8-fc6ba4b977db instance you mentioned previously?

@pluralmonad

pluralmonad commented Jun 12, 2018

Yes, that was the ID of the instance that hung during execution. All the steps that ran completed successfully, but somewhere in between it hung and did not run all the steps. It was in Running status when the next timer trigger fired and was terminated at that time. The indefinite Pending began from then on.

@nagoh

nagoh commented Jun 13, 2018

I've hit this issue too. The only thing I can add, as far as what I've seen, is the runtime version.
I tested this out on West US 2 (2.0.11857.0 (beta)) after seeing this thread: Azure/azure-functions-host#2856

Then I moved the app to Australia East yesterday (12 Jun) and it was working fine. I can't recall the runtime version, but it was not 2.0.11857.0 (beta).

Now it's failing on Australia East again, and the Australia East runtime is showing as 2.0.11857.0 (beta).

@cgillum

Collaborator

cgillum commented Jun 13, 2018

Ah, I feel dumb for not asking this right away, but are both of you running Functions v2? All of my (unsuccessful) attempts to reproduce this were on v1.

@nagoh

nagoh commented Jun 13, 2018

@OTedikov

OTedikov commented Jun 14, 2018

It does not currently work in East US or East US 2. Durable Azure Functions stopped working: the DurableFunctionsHubInstances table records the runtime status for the activity as always Pending, messages in the durablefunctionshubcontrol queues cannot be read properly, and the dequeue count keeps increasing.

@vludax

vludax commented Jun 14, 2018

I suspect I have run into this problem too. I've deleted the entire resource group with the function app and the associated storage account and redeployed. No luck.

The logs reveal that a new instance of the orchestrator function gets scheduled, but it is never picked up from the control queue.

.csproj

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netstandard2.0</TargetFramework>
    <AzureFunctionsVersion>v2</AzureFunctionsVersion>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.NET.Sdk.Functions" Version="1.0.13"/>
    <PackageReference Include="Microsoft.Azure.Management.Sql" Version="1.16.0-preview"/>
    <PackageReference Include="Microsoft.Azure.Services.AppAuthentication" Version="1.1.0-preview"/>
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="1.4.1"/>
    <PackageReference Include="WindowsAzure.Storage" Version="8.6.0"/>
  </ItemGroup>
</Project>

@OTedikov

OTedikov commented Jun 14, 2018

@vludax Exactly the same issue

@cgillum

Collaborator

cgillum commented Jun 14, 2018

Thanks for confirming. This may have been caused by the recent update of the Azure Functions v2 runtime into Azure mentioned here.

Can you please try setting the FUNCTIONS_EXTENSION_VERSION app setting to 2.0.11776-alpha to see if that mitigates the issue? This is a temporary way to roll back the runtime update for your function app.
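If it helps, that app setting can be changed from the Azure CLI (the app and resource group names below are placeholders); the same setting can also be edited in the portal under the function app's Application settings:

```shell
# Temporarily pin the Functions runtime to the pre-update build.
az functionapp config appsettings set \
  --name <your-function-app> \
  --resource-group <your-resource-group> \
  --settings FUNCTIONS_EXTENSION_VERSION=2.0.11776-alpha
```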

@OTedikov

OTedikov commented Jun 14, 2018

@cgillum Thank you, FUNCTIONS_EXTENSION_VERSION setting change seemed to help.

@vludax

vludax commented Jun 14, 2018

Yep, worked for me too, cheers!

@kashimiz

Member

kashimiz commented Jun 14, 2018

Trying to reproduce this to get to the root cause. @nagoh , @OTedikov , @pluralmonad , can you provide the following information about the environment you experienced this issue in?

  • What version of the Durable Functions extension were you using?
  • How were you developing your function app? (Were you creating it locally in Visual Studio or Visual Studio Code and deploying it to Azure, or building it directly in the Portal?)
  • What language were your functions written in? C#? JS?
  • Can you share an instance ID and the region your app ran in?

Thanks!

@mmas0n

mmas0n commented Jun 14, 2018

Same issue here. Environment information shown below:

@nagoh

nagoh commented Jun 14, 2018

Using Microsoft.Azure.WebJobs.Extensions.DurableTask 1.4.1
Function developed in Visual Studio & deployed to Azure
C#
Region = Australia East

@kashimiz

Member

kashimiz commented Jun 15, 2018

Thanks for the information! I was able to reproduce this - the functions runtime host was recently updated to use Newtonsoft.Json 11.0.2, while one of the Durable Functions extension's dependencies was still on 10.0.3. There's a breaking change between the two versions that was causing an exception to be thrown whenever the Durable Task Framework tried to process a work item, which is why orchestrators were stuck in Pending state.

I've submitted a fix which will be going out in a release. In the meantime, please use Chris's workaround.

@jeffhollan

Member

jeffhollan commented Jun 15, 2018

@cgillum @kashimiz Just realized one of my apps was affected too; it went fine once I dropped back to the previous version. I wonder if there's another task here we could track that would enable CreateCheckStatusResponse to wait until the orchestrator has picked up the job before returning. I don't know if it's the right behavior, but as I was exploring my app all the metrics seemed green (my HTTP function was executing and returning a response just fine), while my orchestrator was never picking the job up (I'm guessing from a queue). I would have almost rather my app returned a 5xx if enough time elapsed and the job was never started (opt-in, potentially similar to WaitForCompletionOrCreateCheckStatusResponse). Again, not sure; just trying to think of ways I could have caught the cause (or at least the source) on my own.

@cgillum

Collaborator

cgillum commented Jun 15, 2018

We actually used to wait for the orchestrator to start before returning from the StartNewAsync API, but that created a lot of confusion for people when they ran Durable under load. They would start getting timeout exceptions because their start messages weren't getting picked up right away, but the orchestration did eventually run fine.

Were you able to observe the problem in your App Insights function logs? It ideally should have been visible in the host logs (Katy saw it in Kusto; I'm not sure whether that necessarily means it surfaced in AI). If so, I wonder if we could add some portal UX to make it clear that the orchestrator and activity trigger functions were unhealthy (I thought I saw an email thread discussing this possibility).
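For what it's worth, the opt-in wait Jeff describes could be approximated by polling the instance status after starting it and failing the HTTP response if it never leaves Pending. Here is a language-neutral sketch (Python; `get_status` is a hypothetical callback standing in for a status query such as GetStatusAsync, not a Durable Functions API):

```python
import time

def wait_for_pickup(get_status, timeout_s=30.0, poll_interval_s=1.0,
                    now=time.monotonic, sleep=time.sleep):
    """Poll get_status() until the instance leaves 'Pending' or the timeout
    elapses. Returns the last observed status; a caller could map a final
    'Pending' to an HTTP 5xx instead of the usual 202 check-status response.
    The now/sleep hooks exist so the loop can be tested with fake clocks."""
    deadline = now() + timeout_s
    while True:
        status = get_status()
        if status != "Pending":
            return status
        if now() >= deadline:
            return "Pending"
        sleep(poll_interval_s)
```

As Chris notes, making this the default would reintroduce timeout confusion under load, which is why it would have to stay opt-in with a generous timeout.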

@jeffhollan

Member

jeffhollan commented Jun 15, 2018

Yes, unfortunately I see the request go into HTTP and execute successfully, but nothing in the "exceptions" table in the last 24 hours, so I at least don't see anything in my AI. Agreed on the load piece. If you're interested in seeing whether exceptions were thrown at a different layer, the function app is named coreysa-store-function

@MaximRouiller

MaximRouiller commented Jun 21, 2018

Please note that, after many tests: pinning the FUNCTIONS_EXTENSION_VERSION app setting to 2.0.11776-alpha while using EF Core 2.0 brings back the error in Azure/azure-functions-core-tools#335, because that issue was fixed after 2.0.11776-alpha.

Just an FYI for everyone.

@pluralmonad

pluralmonad commented Jun 21, 2018

Getting caught up on this discussion...

The FUNCTIONS_EXTENSION_VERSION in my app settings is ~1.

The specific Runtime version is 1.0.11820.0.

@MaximRouiller

MaximRouiller commented Jun 21, 2018

@pluralmonad My issue was with .NET Core which is only supported in the beta v2 runtime.

@cgillum

Collaborator

cgillum commented Jun 21, 2018

@pluralmonad Unfortunately I think your original issue may have been overtaken by a similar issue which only impacted Functions v2. We've fixed the v2 issue, and several people are watching this issue to track that. It will be available in the 1.5 release, which is on nuget.org now (release announcement pending). For that reason, I'd like to close this issue. Can you open a new issue with your latest information, and specify that this is impacting Functions v1? Sorry for the mixup...
