This issue was found as part of #368 and is believed to be one of multiple root causes of #371.
Issue
It appears that any failure to fetch the instance history from Azure Storage will cause the orchestration instance to hang indefinitely, or until the host restarts. The problem is with the activeOrchestrationSessions dictionary, which tracks active orchestrations. This dictionary is populated as soon as we read a message from the control queue for an instance. When the work item completes or fails, we remove the entry from the dictionary, except when there is a failure fetching the orchestration history. If an instance remains "leaked" in this dictionary, it is not possible to process any more messages for that instance (we keep dequeuing them and issuing duplicate-message warnings).
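The failure mode can be sketched roughly as follows. This is a hypothetical, simplified model in Java, not the actual DurableTask.AzureStorage code; the dictionary name mirrors the one described above, and the method stands in for the real work-item locking path:

```java
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the leak: the session entry is added as soon as a
// control-queue message is dequeued, but removal only happens on the
// normal completion path -- an exception thrown while fetching the
// history skips the removal entirely, leaking the entry.
public class SessionLeakSketch {
    static final ConcurrentHashMap<String, Object> activeOrchestrationSessions =
            new ConcurrentHashMap<>();

    static void lockNextWorkItem(String instanceId, boolean historyFetchFails) {
        // Entry is added when the control-queue message is read.
        activeOrchestrationSessions.put(instanceId, new Object());

        if (historyFetchFails) {
            // A storage exception propagates from here; the entry is never
            // removed, so later messages for this instance keep getting
            // dequeued and flagged as duplicates.
            throw new RuntimeException(
                "The client could not finish the operation within specified timeout.");
        }

        // Normal path: the entry is eventually removed when the work item
        // completes or fails.
        activeOrchestrationSessions.remove(instanceId);
    }

    public static void main(String[] args) {
        try {
            lockNextWorkItem("instance_076773", true);
        } catch (RuntimeException e) {
            // Exception surfaced, but the session entry is still present.
        }
        System.out.println(
            activeOrchestrationSessions.containsKey("instance_076773"));
    }
}
```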
Here is the specific exception that caused us to get into this state (it was hard to find because of #373):
TaskOrchestrationDispatcher-6ecfa37ed51f4d689c28be72394a1bd3-0: Exception while fetching workItem: The client could not finish the operation within specified timeout.
Exception: Microsoft.WindowsAzure.Storage.StorageException : The client could not finish the operation within specified timeout.
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 50
at Microsoft.WindowsAzure.Storage.Table.CloudTable.EndExecuteQuerySegmented(IAsyncResult asyncResult) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Table\CloudTable.cs:line 336
at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass1`1.<CreateCallback>b__0(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 66
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.<GetHistoryEventsAsync>d__31.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at DurableTask.AzureStorage.AzureStorageOrchestrationService.<GetOrchestrationRuntimeStateAsync>d__78.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
at DurableTask.AzureStorage.AzureStorageOrchestrationService.<LockNextTaskOrchestrationWorkItemAsync>d__72.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at DurableTask.Core.WorkItemDispatcher`1.<DispatchAsync>d__33.MoveNext()
Inner Exception: System.TimeoutException: The client could not finish the operation within specified timeout.
Here is the internal Kusto query that ultimately revealed the issue:
DurableFunctionsEvents
| where TIMESTAMP between (datetime(2018-06-28 17:52:45) .. datetime(2018-06-29 16:56:22.2930228))
| where InstanceId == "instance_076773" or sessionId == "instance_076773" or (Level < 4 and (sessionId == "" or sessionId == "instance_076773"))
| where RoleInstance == "LargeDedicatedWebWorkerRole_IN_20" and Tenant == "2fb52e3fde3e4b029a78225d9a5d43e9"
| where IsReplay != true
| order by TIMESTAMP asc
| project-away PreciseTimeStamp, Tenant, Role, RoleInstance, SourceNamespace, SourceMoniker, SourceVersion, IsReplay
| take 10000
Fix
The fix is to handle any exceptions that occur during this window and remove the entry from the activeOrchestrationSessions dictionary.
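The shape of the fix can be sketched as follows, again as a hypothetical, simplified Java model rather than the actual C# change. The key point is that the entry is removed only on the failure path, since on success it must stay in the dictionary until the work item itself completes:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the fix: if the history fetch throws, release the session
// entry before rethrowing, so later messages for this instance can be
// processed instead of being discarded as duplicates. On success the
// entry intentionally remains until the work item completes.
public class SessionCleanupSketch {
    static final ConcurrentHashMap<String, Object> activeOrchestrationSessions =
            new ConcurrentHashMap<>();

    static void lockNextWorkItem(String instanceId, Runnable fetchHistory) {
        activeOrchestrationSessions.put(instanceId, new Object());
        try {
            fetchHistory.run(); // may throw, e.g. a storage timeout
        } catch (RuntimeException e) {
            // Clean up the leaked entry, then surface the original error.
            activeOrchestrationSessions.remove(instanceId);
            throw e;
        }
    }
}
```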