Race condition causes orchestrations to get stuck during lease failover #481

Closed
cgillum opened this issue Oct 20, 2018 · 1 comment

cgillum commented Oct 20, 2018

This is most frequently observed when running in the Azure Consumption plan under heavy load because leases will move around more rapidly than they do with the dedicated plans.

The problem occurs in the following sequence:

  • An orchestration running on VM1 starts and schedules activity 1.
  • The lease fails over from VM1 to VM2.
  • Activity 1 completes very quickly, before the orchestration commits its history.
  • The orchestration running on VM2 receives the activity 1 response before VM1 has finished committing history.
  • The orchestration running on VM2 doesn't see any history, so it deletes activity 1's response message.

VM1 finally commits the history, but by that point activity 1's response message has already been deleted, so the orchestration will never make forward progress.

The lease failover contributes to the problem because a VM can normally detect out-of-order messages using in-memory tracking. When the lease fails over to VM2, however, the new VM does not know that the orchestration is still running on VM1 and therefore tries to process its messages immediately.
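The interleaving above can be sketched in a few lines. This is a hypothetical simplification, not DurableTask code: `history_store`, `schedule_activity`, and `process_response` are invented names standing in for the framework's history table and message pump, and the "delete unmatched response" branch models the buggy behavior described in this issue.

```python
history_store = {}  # instance_id -> set of committed "TaskScheduled" event ids
message_queue = []  # pending activity-response messages

def schedule_activity(instance_id, event_id):
    """VM1 runs the orchestration and schedules an activity, but has NOT
    yet committed its history. The activity finishes almost immediately,
    so its response lands in the queue before the history commit."""
    message_queue.append(("response", instance_id, event_id))

def commit_history(instance_id, event_id):
    """VM1 finally persists the TaskScheduled event (too late)."""
    history_store.setdefault(instance_id, set()).add(event_id)

def process_response(instance_id, event_id):
    """VM2, holding the lease after failover, processes the response.
    Buggy behavior: with no matching history, the message looks stale
    and is deleted, so the orchestration can never progress."""
    if event_id not in history_store.get(instance_id, set()):
        return "deleted"
    return "processed"

# The race, in order:
schedule_activity("abc123", 1)        # VM1 schedules activity 1 (history uncommitted)
_, inst, eid = message_queue.pop(0)   # lease fails over; VM2 picks up the response
outcome = process_response(inst, eid) # "deleted" — response is lost
commit_history("abc123", 1)           # VM1 commits history, but nothing remains to process
```

With in-memory tracking on a single VM, `process_response` could instead defer the unmatched message and retry it later; failover discards that tracking state, which is why the delete path fires.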


cgillum commented Oct 23, 2018

Closing as a duplicate of #460

cgillum closed this as completed Oct 23, 2018