Orchestration instances hang until host restart after history fetch failure #376

Closed
cgillum opened this issue Jul 2, 2018 · 0 comments
cgillum (Collaborator) commented Jul 2, 2018

This issue was found as part of #368 and is believed to be one of multiple root causes of #371.

Issue

It appears that any failure to fetch the instance history from Azure Storage causes the orchestration instance to hang indefinitely, or until the host restarts. The problem is with the activeOrchestrationSessions dictionary, which tracks active orchestrations. An entry is added to this dictionary as soon as we read a message from the control queue for an instance, and it is removed when the work item completes or fails, except when the failure occurs while fetching the orchestration history. Once an instance remains "leaked" in this table, it is no longer possible to process any further messages for that instance (we keep dequeuing them and keep issuing duplicate message warnings).
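
As an illustration of the pattern described above, here is a minimal C# sketch. It is not the actual DurableTask.AzureStorage source: the dictionary type, method signatures, and return types are assumptions, while the names activeOrchestrationSessions, GetOrchestrationRuntimeStateAsync, and LockNextTaskOrchestrationWorkItemAsync come from the description and the stack trace below.

using System.Collections.Concurrent;
using System.Threading.Tasks;

class OrchestrationSessionSketch
{
    // Tracks instances that currently have an in-flight orchestration work item.
    readonly ConcurrentDictionary<string, object> activeOrchestrationSessions =
        new ConcurrentDictionary<string, object>();

    public async Task<object> LockNextTaskOrchestrationWorkItemAsync(string instanceId)
    {
        // An entry is added as soon as a control queue message is dequeued...
        this.activeOrchestrationSessions.TryAdd(instanceId, new object());

        // ...but if the history fetch throws (e.g. a Storage timeout while reading
        // the History table), the method exits without removing the entry, and later
        // messages for this instance are treated as duplicates.
        return await this.GetOrchestrationRuntimeStateAsync(instanceId);
    }

    Task<object> GetOrchestrationRuntimeStateAsync(string instanceId)
    {
        // Placeholder for the history query against Azure Table storage.
        return Task.FromResult<object>(null);
    }
}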

Here is the specific exception that caused us to get into this state (it was hard to find because of #373):

TaskOrchestrationDispatcher-6ecfa37ed51f4d689c28be72394a1bd3-0: Exception while fetching workItem: The client could not finish the operation within specified timeout.
Exception: Microsoft.WindowsAzure.Storage.StorageException : The client could not finish the operation within specified timeout.
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 50
   at Microsoft.WindowsAzure.Storage.Table.CloudTable.EndExecuteQuerySegmented(IAsyncResult asyncResult) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Table\CloudTable.cs:line 336
   at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass1`1.<CreateCallback>b__0(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 66
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.<GetHistoryEventsAsync>d__31.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.<GetOrchestrationRuntimeStateAsync>d__78.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.<LockNextTaskOrchestrationWorkItemAsync>d__72.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at DurableTask.Core.WorkItemDispatcher`1.<DispatchAsync>d__33.MoveNext()
Inner Exception: System.TimeoutException: The client could not finish the operation within specified timeout.

Here is the internal Kusto query that ultimately revealed the issue:

DurableFunctionsEvents
| where TIMESTAMP between (datetime(2018-06-28 17:52:45) .. datetime(2018-06-29 16:56:22.2930228))
| where InstanceId == "instance_076773" or sessionId == "instance_076773" or (Level < 4 and (sessionId == "" or sessionId == "instance_076773"))
| where RoleInstance == "LargeDedicatedWebWorkerRole_IN_20" and Tenant == "2fb52e3fde3e4b029a78225d9a5d43e9"
| where IsReplay != true
| order by TIMESTAMP asc
| project-away PreciseTimeStamp, Tenant, Role, RoleInstance, SourceNamespace, SourceMoniker, SourceVersion, IsReplay
| take 10000

Fix

The fix is to handle any exceptions that occur during this window and remove the entry from the activeOrchestrationSessions dictionary.
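
Building on the illustrative sketch above (again an assumption about the exact shape of the code, not the real implementation), the handling could look like this:

public async Task<object> LockNextTaskOrchestrationWorkItemAsync(string instanceId)
{
    this.activeOrchestrationSessions.TryAdd(instanceId, new object());
    try
    {
        // Any failure here (such as the Storage timeout above) no longer strands
        // the entry; it is cleaned up before the exception propagates.
        return await this.GetOrchestrationRuntimeStateAsync(instanceId);
    }
    catch
    {
        this.activeOrchestrationSessions.TryRemove(instanceId, out _);
        throw;
    }
}

This way the work item still fails, but subsequent messages for the instance can be processed without waiting for a host restart.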
