Race condition causes orchestrations to get stuck during lease failover #481

Closed
cgillum opened this issue Oct 20, 2018 · 1 comment

cgillum commented Oct 20, 2018

This is most frequently observed when running in the Azure Consumption plan under heavy load because leases will move around more rapidly than they do with the dedicated plans.

The problem occurs in the following sequence:

  • An orchestration running on VM1 starts and schedules activity 1.
  • The lease fails over from VM1 to VM2.
  • Activity 1 completes very quickly, before the orchestration commits its history.
  • The orchestration running on VM2 receives the activity 1 response before VM1 has finished committing history.
  • The orchestration running on VM2 doesn't see any history, so it deletes activity 1's response message.

VM1 finally commits the history, but by that point activity 1's response message has already been deleted, so the orchestration will never make forward progress.

The lease failover contributes to the problem because a VM can normally detect out-of-order messages using in-memory tracking. When the lease fails over to VM2, however, the new VM does not know that the orchestration is still running on VM1 and therefore tries to process its messages immediately.
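The interleaving above can be sketched in a few lines. This is a hypothetical simplification, not DurableTask code: `history_store`, `schedule_activity`, and `process_response` are invented names standing in for the framework's history table and message pump, and the "delete unmatched response" branch models the buggy behavior described in this issue.

```python
history_store = {}  # instance_id -> set of committed "TaskScheduled" event ids
message_queue = []  # pending activity-response messages

def schedule_activity(instance_id, event_id):
    """VM1 runs the orchestration and schedules an activity, but has NOT
    yet committed its history. The activity finishes almost immediately,
    so its response lands in the queue before the history commit."""
    message_queue.append(("response", instance_id, event_id))

def commit_history(instance_id, event_id):
    """VM1 finally persists the TaskScheduled event (too late)."""
    history_store.setdefault(instance_id, set()).add(event_id)

def process_response(instance_id, event_id):
    """VM2, holding the lease after failover, processes the response.
    Buggy behavior: with no matching history, the message looks stale
    and is deleted, so the orchestration can never progress."""
    if event_id not in history_store.get(instance_id, set()):
        return "deleted"
    return "processed"

# The race, in order:
schedule_activity("abc123", 1)        # VM1 schedules activity 1 (history uncommitted)
_, inst, eid = message_queue.pop(0)   # lease fails over; VM2 picks up the response
outcome = process_response(inst, eid) # "deleted" — response is lost
commit_history("abc123", 1)           # VM1 commits history, but nothing remains to process
```

With in-memory tracking on a single VM, `process_response` could instead defer the unmatched message and retry it later; failover discards that tracking state, which is why the delete path fires.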


cgillum commented Oct 23, 2018

Closing as a duplicate of #460

cgillum closed this as completed Oct 23, 2018