Orchestration instance started stuck in pending status #342

Closed
pluralmonad opened this Issue Jun 12, 2018 · 26 comments

9 participants
@pluralmonad

pluralmonad commented Jun 12, 2018

I have a timer that fires every 6 hours. It triggers an orchestration starter function that checks for an existing instance and starts, terminates, or restarts the orchestration as needed.

Recently, the trigger log output for each trigger event has indicated that an existing instance was found in Pending status, with a last update time equal to its start time (6 hours earlier, at the previous trigger event).

Here is a representative trigger function output.

Triggered by <Timer>
Getting orchestrator state from table storage...
Found LastKnownInstanceID: b9ec9716-2768-4775-a84b-522440fa6240
Existing instance found with name Orchestrator
Started at 6/12/2018 12:00:01 PM
Last updated at 6/12/2018 12:00:01 PM
Current status: Pending
Termination requested for reason: <Existing instance passed time limit>
Terminating existing instance with instanceID <b9ec9716-2768-4775-a84b-522440fa6240>...
Instance terminated
Starting new instance of orchestrator <Orchestrator> with instanceID <7c72ba4c-6459-4fae-84c2-23ba242ceda4>...
Instance started
Updating orchestrator state in table storage...
Orchestrator state updated

Any ideas why it might never leave pending status?

I recently had a feral orchestration persist in a retry-and-fail state for an extended period of time. Might I need to do some cleanup in the DTFx control mechanisms in storage?

This application is running in the Central US region.
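For context, the starter pattern described above (check the last known instance, terminate it if it has exceeded a time limit, then start a fresh one) can be sketched roughly as follows. This is an illustrative Python sketch of the decision logic only, not the actual C# starter function; the status names mirror Durable Functions runtime statuses, and the 6-hour limit is assumed from the trigger interval.

```python
from datetime import datetime, timedelta

# Statuses that mean the previous run is still alive and may need terminating.
ACTIVE_STATUSES = {"Pending", "Running"}

def decide_action(status, started_at, now, time_limit=timedelta(hours=6)):
    """Return what the starter should do with the existing instance.

    status     -- runtime status string of the last known instance, or None
    started_at -- datetime the instance was created (ignored if status is None)
    """
    if status is None:
        return "start"                      # no previous instance recorded
    if status in ACTIVE_STATUSES:
        if now - started_at >= time_limit:
            return "terminate_and_restart"  # e.g. the stuck-in-Pending case
        return "leave_running"
    return "start"                          # Completed / Failed / Terminated

# Example: an instance stuck in Pending since the previous 6-hour trigger.
action = decide_action(
    "Pending",
    datetime(2018, 6, 12, 12, 0, 1),
    datetime(2018, 6, 12, 18, 0, 1),
)
# action == "terminate_and_restart"
```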

@pluralmonad

pluralmonad commented Jun 12, 2018

After looking over the execution history a little more, it appears that the last time the orchestration was successfully started, one of its sub-orchestrations never returned. Or maybe the parent never picked up again after the sub completed.

The parent orchestration had instance ID 29a352c4-aec5-4154-add8-fc6ba4b977db.

It calls two subs. The execution history and output indicate that the first sub's final activity completed successfully (and finished its work correctly). But the sub-orchestration never returns to the parent, and the second sub is never executed within the parent.

@cgillum

Collaborator

cgillum commented Jun 12, 2018

Thanks. Two questions:

  1. What is a "feral" orchestration that you referenced previously?
  2. What's the relationship between the pending instance you mentioned originally and the instance that had a non-completing sub-orchestration? Are these related, or are we describing two separate issues?
@pluralmonad

pluralmonad commented Jun 12, 2018

I meant a misbehaving durable task function, mentioned here. I manually cleared a control queue in order to kill the fail/retry cycle. I do not know the internals of the framework and wondered if this might have created a bad state with ripple effects into what I am seeing now.

The instances are not directly related. They reference the same orchestration; the one with the non-completing sub is from the last time the job was triggered, and it did not sit in Pending status indefinitely.

@pluralmonad

pluralmonad commented Jun 12, 2018

After restarting the app service, the orchestration most recently started began executing and ran to completion.

@cgillum

Collaborator

cgillum commented Jun 12, 2018

Clearing control queues is generally fine. It will cause all pending instances to stop processing, but shouldn't otherwise put anything into a bad state.

Glad to know you were able to restart the function app to mitigate the stuck instance. I would still like to investigate the cause; it seems something must have gotten stuck in memory. Is this the same 29a352c4-aec5-4154-add8-fc6ba4b977db instance you mentioned previously?

@pluralmonad

pluralmonad commented Jun 12, 2018

Yes, that was the ID of the instance that hung during execution. All the steps that ran completed successfully, but somewhere in between it hung and did not run all the steps. It was in Running status when the next timer trigger fired and was terminated at that time. The indefinite Pending began from then on.

@nagoh

nagoh commented Jun 13, 2018

I've hit this issue too. The only thing I can add, as far as what I've seen, is the runtime version.
I tested this out on West US 2 (2.0.11857.0 (beta)) after seeing this thread: Azure/azure-functions-host#2856

Then I moved the app to Australia East yesterday (12 Jun) and it was working fine. I can't recall the runtime version, but it was not 2.0.11857.0 (beta).

Now it's failing on Australia East again, and the Australia East runtime is showing as 2.0.11857.0 (beta).

@cgillum

Collaborator

cgillum commented Jun 13, 2018

Ah, I feel dumb for not asking this right away, but are both of you running Functions v2? All of my (unsuccessful) attempts to reproduce this were on v1.

@nagoh

nagoh commented Jun 13, 2018

@OTedikov

OTedikov commented Jun 14, 2018

It does not currently work in East US or East US 2. Durable Azure Functions stopped working: the DurableFunctionsHubInstances table records the runtime status for the activity as always Pending, messages in the durablefunctionshubcontrol queues cannot be read properly, and the dequeue count keeps increasing.

@vludax

vludax commented Jun 14, 2018

I suspect I have run into this problem too. I've deleted the entire resource group with the function app and the associated storage account and redeployed. No luck.

The logs reveal that a new instance of the orchestrator function gets scheduled, but it is never picked up from the control queue.

.csproj

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netstandard2.0</TargetFramework>
    <AzureFunctionsVersion>v2</AzureFunctionsVersion>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.NET.Sdk.Functions" Version="1.0.13"/>
    <PackageReference Include="Microsoft.Azure.Management.Sql" Version="1.16.0-preview"/>
    <PackageReference Include="Microsoft.Azure.Services.AppAuthentication" Version="1.1.0-preview"/>
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="1.4.1"/>
    <PackageReference Include="WindowsAzure.Storage" Version="8.6.0"/>
  </ItemGroup>
</Project>

@OTedikov

OTedikov commented Jun 14, 2018

@vludax Exactly the same issue

@cgillum

Collaborator

cgillum commented Jun 14, 2018

Thanks for confirming. This may have been caused by the recent update of the Azure Functions v2 runtime into Azure mentioned here.

Can you please try setting the FUNCTIONS_EXTENSION_VERSION app setting to 2.0.11776-alpha to see if that mitigates the issue? This is a temporary way to roll back the runtime update for your function app.
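If it helps, that app setting can be changed from the Azure CLI (the app and resource group names below are placeholders); the same setting can also be edited in the portal under the function app's Application settings:

```shell
# Temporarily pin the Functions runtime to the pre-update build.
az functionapp config appsettings set \
  --name <your-function-app> \
  --resource-group <your-resource-group> \
  --settings FUNCTIONS_EXTENSION_VERSION=2.0.11776-alpha
```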

@OTedikov

OTedikov commented Jun 14, 2018

@cgillum Thank you, FUNCTIONS_EXTENSION_VERSION setting change seemed to help.

@vludax

vludax commented Jun 14, 2018

Yep, worked for me too, cheers!

@kashimiz

Member

kashimiz commented Jun 14, 2018

Trying to reproduce this to get to the root cause. @nagoh , @OTedikov , @pluralmonad , can you provide the following information about the environment you experienced this issue in?

  • What version of the Durable Functions extension were you using?
  • How were you developing your function app? (Were you creating it locally in Visual Studio or Visual Studio Code and deploying it to Azure, or building it directly in the Portal?)
  • What language were your functions written in? C#? JS?
  • Can you share an instance ID and the region your app ran in?

Thanks!

@mmas0n

mmas0n commented Jun 14, 2018

Same issue here. Environment information shown below:

@nagoh

nagoh commented Jun 14, 2018

Using Microsoft.Azure.WebJobs.Extensions.DurableTask 1.4.1
Function developed in Visual Studio & deployed to Azure
C#
Region = Australia East

@kashimiz

Member

kashimiz commented Jun 15, 2018

Thanks for the information! I was able to reproduce this - the functions runtime host was recently updated to use Newtonsoft.Json 11.0.2, while one of the Durable Functions extension's dependencies was still on 10.0.3. There's a breaking change between the two versions that was causing an exception to be thrown whenever the Durable Task Framework tried to process a work item, which is why orchestrators were stuck in Pending state.

I've submitted a fix which will be going out in a release. In the meantime, please use Chris's workaround.

@jeffhollan

Member

jeffhollan commented Jun 15, 2018

@cgillum @kashimiz Just realized one of my apps was affected too; it went fine once I dropped back to the previous version. I wonder if there's another task here we could track that would enable CreateCheckStatusResponse to wait until the orchestrator has picked up the job before returning. I don't know if it's the right behavior, but as I was exploring my app all the metrics seemed green (my HTTP function was executing and returning a response just fine), while my orchestrator was never picking the job up (I'm guessing from a queue). I would have almost rather my app returned a 5xx if enough time elapsed and the job was never started (opt-in, potentially similar to WaitForCompletionOrCreateCheckStatusResponse). Again, not sure; just trying to think of ways I could have caught the cause (or at least the source) on my own.

@cgillum

Collaborator

cgillum commented Jun 15, 2018

We actually used to wait for the orchestrator to start before returning from the StartNewAsync API, but that created a lot of confusion for people when they ran Durable under load. They would start getting timeout exceptions because their start messages weren't getting picked up right away, but the orchestration did eventually run fine.

Were you able to observe the problem in your App Insights function logs? It ideally should have been visible in the host logs (Katy saw it in Kusto; I'm not sure whether that necessarily means it surfaced in AI). If so, I wonder if we could add some portal UX to make it clear that the orchestrator and activity trigger functions were unhealthy (I thought I saw an email thread discussing this possibility).
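For what it's worth, the opt-in wait Jeff describes could be approximated by polling the instance status after starting it and failing the HTTP response if it never leaves Pending. Here is a language-neutral sketch (Python; `get_status` is a hypothetical callback standing in for a status query such as GetStatusAsync, not a Durable Functions API):

```python
import time

def wait_for_pickup(get_status, timeout_s=30.0, poll_interval_s=1.0,
                    now=time.monotonic, sleep=time.sleep):
    """Poll get_status() until the instance leaves 'Pending' or the timeout
    elapses. Returns the last observed status; a caller could map a final
    'Pending' to an HTTP 5xx instead of the usual 202 check-status response.
    The now/sleep hooks exist so the loop can be tested with fake clocks."""
    deadline = now() + timeout_s
    while True:
        status = get_status()
        if status != "Pending":
            return status
        if now() >= deadline:
            return "Pending"
        sleep(poll_interval_s)
```

As Chris notes, making this the default would reintroduce timeout confusion under load, which is why it would have to stay opt-in with a generous timeout.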

@jeffhollan

Member

jeffhollan commented Jun 15, 2018

Yes, unfortunately I see the request go into HTTP and execute successfully, but nothing in the "exceptions" table in the last 24 hours, so I at least don't see anything in my AI. Agreed on the load piece. If you're interested in seeing whether exceptions were thrown at a different layer, the function app is named coreysa-store-function

@MaximRouiller

MaximRouiller commented Jun 21, 2018

Please note that, after many tests: pinning the FUNCTIONS_EXTENSION_VERSION app setting to 2.0.11776-alpha while using EF Core 2.0 brings back the error in Azure/azure-functions-core-tools#335, because that issue was fixed after 2.0.11776-alpha.

Just an FYI for everyone.

@pluralmonad

pluralmonad commented Jun 21, 2018

Getting caught up on this discussion...

The FUNCTIONS_EXTENSION_VERSION in my app settings is ~1.

The specific Runtime version is 1.0.11820.0.

@MaximRouiller

MaximRouiller commented Jun 21, 2018

@pluralmonad My issue was with .NET Core which is only supported in the beta v2 runtime.

@cgillum

Collaborator

cgillum commented Jun 21, 2018

@pluralmonad Unfortunately I think your original issue may have been overtaken by a similar issue which only impacted Functions v2. We've fixed the v2 issue, and several people are watching this issue to track that. It will be available in the 1.5 release, which is on nuget.org now (release announcement pending). For that reason, I'd like to close this issue. Can you open a new issue with your latest information, and specify that this is impacting Functions v1? Sorry for the mixup...
