-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run a special set of subreconcilers when coordinator pods are missing #1949
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If 1 or more coordinator pods are deleted, quorum becomes unreachable and any attempt to read the cluster status will fail.
This change adds logic to special-case the first updateStatus reconciler so that in case of failure, only some reconcilers apt to re-add the missing pods are run.
Only if a quorum of coordinators gets unreachable the FDB cluster should get unreachable, otherwise the operator should still be able to fetch the status and change the coordinators to a new set of running Pods. I made some changes a few weeks ago that will help to bring the cluster back if the majority of coordinators are deleted: #1932. One important note here is that this will only work if DNS entries are used in the cluster file, otherwise the newly created Pods will get a new IP address.
Thanks for your PR.
Yes, my bad - I was too short on that sentence, I'll amend it.
Ok, in this case shall I close the PR? |
No worries :)
I think we can close this PR, I try to merge the other PR today, so it will be included in the next release today. I have to track down an issue in the FDB go bindings that you initially discovered, but I will do that in a separate PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks! Will definitely test it out ASAP.
That would be great! On this topic: I have seen the |
If you observed an issue and used DNS entries in the cluster file you might have hit: apple/foundationdb#11222. Especially in the case when the coordinators are not reachable. |
Description
If the majority of coordinator pods are deleted quorum becomes unreachable and any attempt to read the cluster status will fail.
This change adds logic to special-case the first
updateStatus
reconciler so that in case of failure, only some reconcilers apt to re-add the missing pods are run.This change has been experimentally tested to bring back the cluster online after coordinator pods are manually deleted.
Fixes #1931
I expect it would become even more efficient if/when #1757 is merged.
Type of change
Discussion
Feedback welcome for:
Testing
I would like to add an E2E test for this, but I am new and would like some feedback about it first.
Documentation
No documentation was updated, but possibly should - as one of the cases covered.
Follow-up
None needed.