Orchestration instances hang until host restart after history fetch failure #376

Closed
cgillum opened this issue Jul 2, 2018 · 0 comments
cgillum (Collaborator) commented Jul 2, 2018

This issue was found as part of #368 and is believed to be one of multiple root causes of #371.

Issue

It appears that any failure to fetch the instance history from Azure Storage causes the orchestration instance to hang indefinitely, or until the host restarts. The problem is with the activeOrchestrationSessions dictionary, which tracks active orchestrations. An entry is added to this dictionary as soon as we read a message from the control queue for an instance, and it is removed when the work item completes or fails, except when the failure occurs while fetching the orchestration history. Once an instance remains "leaked" in this table, it is no longer possible to process any further messages for that instance (we keep dequeuing them and keep issuing duplicate message warnings).
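
As an illustration of the pattern described above, here is a minimal C# sketch. It is not the actual DurableTask.AzureStorage source: the dictionary type, method signatures, and return types are assumptions, while the names activeOrchestrationSessions, GetOrchestrationRuntimeStateAsync, and LockNextTaskOrchestrationWorkItemAsync come from the description and the stack trace below.

using System.Collections.Concurrent;
using System.Threading.Tasks;

class OrchestrationSessionSketch
{
    // Tracks instances that currently have an in-flight orchestration work item.
    readonly ConcurrentDictionary<string, object> activeOrchestrationSessions =
        new ConcurrentDictionary<string, object>();

    public async Task<object> LockNextTaskOrchestrationWorkItemAsync(string instanceId)
    {
        // An entry is added as soon as a control queue message is dequeued...
        this.activeOrchestrationSessions.TryAdd(instanceId, new object());

        // ...but if the history fetch throws (e.g. a Storage timeout while reading
        // the History table), the method exits without removing the entry, and later
        // messages for this instance are treated as duplicates.
        return await this.GetOrchestrationRuntimeStateAsync(instanceId);
    }

    Task<object> GetOrchestrationRuntimeStateAsync(string instanceId)
    {
        // Placeholder for the history query against Azure Table storage.
        return Task.FromResult<object>(null);
    }
}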

Here is the specific exception that caused us to get into this state (it was hard to find because of #373):

TaskOrchestrationDispatcher-6ecfa37ed51f4d689c28be72394a1bd3-0: Exception while fetching workItem: The client could not finish the operation within specified timeout.
Exception: Microsoft.WindowsAzure.Storage.StorageException : The client could not finish the operation within specified timeout.
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 50
   at Microsoft.WindowsAzure.Storage.Table.CloudTable.EndExecuteQuerySegmented(IAsyncResult asyncResult) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Table\CloudTable.cs:line 336
   at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass1`1.<CreateCallback>b__0(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 66
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.<GetHistoryEventsAsync>d__31.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.<GetOrchestrationRuntimeStateAsync>d__78.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.<LockNextTaskOrchestrationWorkItemAsync>d__72.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at DurableTask.Core.WorkItemDispatcher`1.<DispatchAsync>d__33.MoveNext()
Inner Exception: System.TimeoutException: The client could not finish the operation within specified timeout.

Here is the internal Kusto query that ultimately revealed the issue:

DurableFunctionsEvents
| where TIMESTAMP between (datetime(2018-06-28 17:52:45) .. datetime(2018-06-29 16:56:22.2930228))
| where InstanceId == "instance_076773" or sessionId == "instance_076773" or (Level < 4 and (sessionId == "" or sessionId == "instance_076773"))
| where RoleInstance == "LargeDedicatedWebWorkerRole_IN_20" and Tenant == "2fb52e3fde3e4b029a78225d9a5d43e9"
| where IsReplay != true
| order by TIMESTAMP asc
| project-away PreciseTimeStamp, Tenant, Role, RoleInstance, SourceNamespace, SourceMoniker, SourceVersion, IsReplay
| take 10000

Fix

The fix is to handle any exceptions that occur during this window and remove the entry from the activeOrchestrationSessions dictionary.
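
Building on the illustrative sketch above (again an assumption about the exact shape of the code, not the real implementation), the handling could look like this:

public async Task<object> LockNextTaskOrchestrationWorkItemAsync(string instanceId)
{
    this.activeOrchestrationSessions.TryAdd(instanceId, new object());
    try
    {
        // Any failure here (such as the Storage timeout above) no longer strands
        // the entry; it is cleaned up before the exception propagates.
        return await this.GetOrchestrationRuntimeStateAsync(instanceId);
    }
    catch
    {
        this.activeOrchestrationSessions.TryRemove(instanceId, out _);
        throw;
    }
}

This way the work item still fails, but subsequent messages for the instance can be processed without waiting for a host restart.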
