Reset pending orchestrations when worker restarts #1354
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Improves partition drain behavior for Azure Storage control queues by immediately re-exposing dequeued-but-undispatched control messages when a partition is released, reducing throughput gaps during lease transitions.
Changes:
- Abandon pending (in-memory) control queue messages for a drained partition with zero visibility timeout before removing the partition.
- Guard dispatch logic to skip “ready” nodes that were drained/removed from the pending list.
- Add a unit test covering the drained-ready-node scenario.
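The drain and dispatch-guard behavior listed above can be sketched with a small in-memory model. This is a sketch under stated assumptions, not the code in `OrchestrationSessionManager`: `PendingBatch`, `DispatchSketch`, and their members are hypothetical names, and using `LinkedListNode<T>.List == null` to detect a drained node is one possible way to implement the guard rather than the PR's confirmed approach.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the real pending-batch type.
sealed class PendingBatch
{
    public PendingBatch(string partitionId, string instanceId)
    {
        PartitionId = partitionId;
        InstanceId = instanceId;
    }

    public string PartitionId { get; }
    public string InstanceId { get; }
}

// Hypothetical model: a pending list plus a queue of nodes ready for dispatch.
sealed class DispatchSketch
{
    readonly LinkedList<PendingBatch> pending = new LinkedList<PendingBatch>();
    readonly Queue<LinkedListNode<PendingBatch>> ready =
        new Queue<LinkedListNode<PendingBatch>>();

    public LinkedListNode<PendingBatch> AddReady(PendingBatch batch)
    {
        LinkedListNode<PendingBatch> node = pending.AddLast(batch);
        ready.Enqueue(node);
        return node;
    }

    // Drain: remove every pending batch that belongs to the released partition.
    // The ready queue is left alone, so it may still hold removed nodes.
    public int Drain(string partitionId)
    {
        int removed = 0;
        LinkedListNode<PendingBatch> node = pending.First;
        while (node != null)
        {
            LinkedListNode<PendingBatch> next = node.Next; // capture before removal
            if (node.Value.PartitionId == partitionId)
            {
                pending.Remove(node);
                removed++;
            }
            node = next;
        }
        return removed;
    }

    // Dispatch: skip ready nodes that a drain removed from the pending list.
    public PendingBatch TryDequeueNext()
    {
        while (ready.Count > 0)
        {
            LinkedListNode<PendingBatch> node = ready.Dequeue();
            if (node.List == null)
            {
                continue; // node no longer belongs to any list, so it was drained
            }
            pending.Remove(node);
            return node.Value;
        }
        return null;
    }
}
```

In this model, a node that was enqueued as ready and then drained is silently skipped, so dispatch never hands out a batch whose messages were already returned to the queue.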
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/DurableTask.AzureStorage/OrchestrationSessionManager.cs | Abandons pending batches during drain and skips drained nodes during dispatch. |
| src/DurableTask.AzureStorage/Messaging/ControlQueue.cs | Adds a drain-specific abandon path that immediately re-queues messages (visibility timeout = 0). |
| Test/DurableTask.AzureStorage.Tests/OrchestrationSessionTests.cs | Adds test ensuring drained nodes in the ready queue are ignored. |
            this.settings.WorkerId,
            this.Name,
            $"Failed to abandon message {queueMessage.MessageId} during drain: {e}");
    }
    catch (RequestFailedException e)
    {
        this.settings.Logger.PartitionManagerWarning(
            this.storageAccountName,
            this.settings.TaskHubName,
            this.settings.WorkerId,
            this.Name,
            $"Failed to abandon message {queueMessage.MessageId} during drain: {e}");
    }
@copilot apply changes based on this feedback
Applied in dd23148. `AbandonMessageForDrainAsync` now catches the broader `Exception` type, so drain-abandon stays best-effort and won’t propagate failures that could interfere with partition cleanup. Warning logs still include full exception details via `{e}`.
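The best-effort pattern described in this reply can be illustrated in isolation. The sketch below is not the body of `AbandonMessageForDrainAsync`: `BestEffort`, `TryAbandonAsync`, and the delegate parameters are hypothetical, and the logger is reduced to an `Action<string>` so the example has no dependencies.

```csharp
using System;
using System.Threading.Tasks;

static class BestEffort
{
    // Runs the abandon operation and reports any failure through logWarning
    // instead of throwing, so the caller's cleanup can continue.
    public static async Task<bool> TryAbandonAsync(
        Func<Task> abandonAsync,
        string messageId,
        Action<string> logWarning)
    {
        try
        {
            await abandonAsync();
            return true;
        }
        catch (Exception e)
        {
            // {e} formats the full exception, including the stack trace.
            logWarning($"Failed to abandon message {messageId} during drain: {e}");
            return false;
        }
    }
}
```

One trade-off worth noting: catching `Exception` also swallows `OperationCanceledException`, which is acceptable only when the caller does not rely on cancellation surfacing from this call.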
    static object CreatePendingBatch(ControlQueue controlQueue)
    {
        Type pendingBatchType = typeof(OrchestrationSessionManager)
            .GetNestedType("PendingMessageBatch", BindingFlags.NonPublic);

        return Activator.CreateInstance(
            pendingBatchType,
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic,
            binder: null,
            args: new object[] { controlQueue, "instance1", "execution1" },
            culture: null);
    }

    static object AddPendingBatchNode(OrchestrationSessionManager manager, object pendingBatch)
    {
        object pendingBatches = GetPrivateField(manager, "pendingOrchestrationMessageBatches");
        MethodInfo addLast = pendingBatches.GetType().GetMethod("AddLast", new[] { pendingBatch.GetType() });
        return addLast.Invoke(pendingBatches, new[] { pendingBatch });
    }

    static void RemovePendingBatchNode(OrchestrationSessionManager manager, object node)
    {
        object pendingBatches = GetPrivateField(manager, "pendingOrchestrationMessageBatches");
        MethodInfo remove = pendingBatches.GetType().GetMethod("Remove", new[] { node.GetType() });
        remove.Invoke(pendingBatches, new[] { node });
    }

    static void EnqueueReadyForProcessingNode(OrchestrationSessionManager manager, object node)
    {
        object readyQueue = GetPrivateField(manager, "orchestrationsReadyForProcessingQueue");
        MethodInfo enqueue = readyQueue.GetType().GetMethod("Enqueue");
        enqueue.Invoke(readyQueue, new[] { node });
@copilot apply changes based on this feedback
Applied in dd23148. I added explicit Assert.IsNotNull(...) checks for reflection targets used by the helpers (PendingMessageBatch nested type and the AddLast/Remove/Enqueue methods), plus null-checks on constructed/invoked reflection results.
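For readers who want to see the shape of such guards, here is a minimal sketch of reflection lookups that fail with a clear message instead of a later `NullReferenceException`. It is an illustration, not the test code from this PR: `ReflectionGuards`, `SampleOuter`, and `Hidden` are hypothetical names, and exceptions stand in for `Assert.IsNotNull(...)` so the example does not depend on a test framework.

```csharp
using System;
using System.Reflection;

// Hypothetical type with a private nested class, used only for the demo.
class SampleOuter
{
    sealed class Hidden { }
}

static class ReflectionGuards
{
    // Looks up a non-public nested type and fails loudly if it is missing,
    // for example after a rename in the production code.
    public static Type GetNestedTypeOrThrow(Type outer, string name)
    {
        Type nested = outer.GetNestedType(name, BindingFlags.NonPublic);
        if (nested == null)
        {
            throw new InvalidOperationException(
                $"Nested type '{name}' was not found on {outer.FullName}.");
        }
        return nested;
    }

    // Looks up a public method by name and exact parameter types.
    public static MethodInfo GetMethodOrThrow(Type type, string name, Type[] parameterTypes)
    {
        MethodInfo method = type.GetMethod(name, parameterTypes);
        if (method == null)
        {
            throw new InvalidOperationException(
                $"Method '{name}' was not found on {type.FullName}.");
        }
        return method;
    }
}
```

In an MSTest project the `throw` statements would typically be replaced by `Assert.IsNotNull(value, message)`, which reports the same condition as a test failure.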
    catch (Exception e)
    {
        this.settings.Logger.PartitionManagerWarning(
            this.storageAccountName,
            this.settings.TaskHubName,
            this.settings.WorkerId,
            partitionId,
            $"Failed to abandon pending messages during drain: {e}");
    }
    catch (Exception e)
    {
        this.settings.Logger.PartitionManagerWarning(
            this.storageAccountName,
            this.settings.TaskHubName,
            this.settings.WorkerId,
            this.Name,
            $"Failed to abandon message {queueMessage.MessageId} during drain: {e}");
    }
This PR improves partition drain behavior for Azure Storage control queues. When a partition is released, any control queue messages that were already dequeued but not yet dispatched to an active orchestration session are now abandoned with zero visibility timeout, making them immediately visible for the next partition owner.
The change prevents a throughput gap during lease transitions where pending in-memory messages could otherwise remain invisible until their original visibility timeout expired.
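The throughput gap can be made concrete with a toy model of queue visibility. This is an illustrative model only and does not call the Azure Storage SDK: `QueueMessageModel` and its members are hypothetical, and the five-minute timeout in the usage note is an arbitrary value chosen for the example.

```csharp
using System;

// Hypothetical model of a single queue message and its visibility window.
sealed class QueueMessageModel
{
    public QueueMessageModel(DateTimeOffset now)
    {
        NextVisibleTime = now;
    }

    public DateTimeOffset NextVisibleTime { get; private set; }

    public bool IsVisible(DateTimeOffset now) => now >= NextVisibleTime;

    // Dequeue hides the message until the visibility timeout elapses.
    public void Dequeue(DateTimeOffset now, TimeSpan visibilityTimeout)
    {
        if (!IsVisible(now))
        {
            throw new InvalidOperationException("Message is not visible.");
        }
        NextVisibleTime = now + visibilityTimeout;
    }

    // Abandon resets the visibility window; TimeSpan.Zero re-exposes the
    // message immediately.
    public void Abandon(DateTimeOffset now, TimeSpan visibilityTimeout)
    {
        NextVisibleTime = now + visibilityTimeout;
    }
}
```

Suppose a worker dequeues a message with a five-minute timeout and releases the partition ten seconds later. Without the abandon, the next owner cannot see the message for the remaining four minutes and fifty seconds; calling `Abandon(now, TimeSpan.Zero)` during drain makes it visible at once.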
Related ICM: https://portal.microsofticm.com/imp/v5/incidents/details/21000001021644/summary