
Run a special set of subreconcilers when coordinator pods are missing #1949

Closed

Conversation

@gm42 (Contributor) commented Feb 22, 2024

Description

If the majority of coordinator pods are deleted, quorum becomes unreachable and any attempt to read the cluster status will fail.
This change adds logic to special-case the first updateStatus reconciler so that, when it fails, only a subset of reconcilers suited to re-adding the missing pods is run.

This change has been tested experimentally and brings the cluster back online after coordinator pods are manually deleted.

Fixes #1931

I expect it would become even more efficient if/when #1757 is merged.
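
As an illustration of the intended control flow, here is a minimal sketch; `subReconciler`, `updateStatus`, `recoveryReconcilers`, and `Cluster` below are hypothetical stand-ins for the operator's actual types, not its real API:

```go
// Minimal sketch of the proposed flow. All names here (Cluster,
// subReconciler, updateStatus, recoveryReconcilers) are hypothetical
// stand-ins, not the operator's real types.
package main

import (
	"errors"
	"fmt"
)

// Cluster stands in for the FoundationDBCluster resource.
type Cluster struct {
	quorumReachable bool
}

// subReconciler models one step of the reconciliation chain.
type subReconciler func(c *Cluster) error

// updateStatus is the first subreconciler; it fails when quorum is
// unreachable and the cluster status cannot be read.
func updateStatus(c *Cluster) error {
	if !c.quorumReachable {
		return errors.New("cannot read cluster status: quorum unreachable")
	}
	return nil
}

// recoveryReconcilers is the reduced subset meant to re-add the
// missing coordinator pods without needing the cluster status.
var recoveryReconcilers = []subReconciler{
	func(c *Cluster) error { fmt.Println("re-creating missing pods"); return nil },
}

// allReconcilers is the normal, full chain.
var allReconcilers = []subReconciler{
	func(c *Cluster) error { fmt.Println("running full reconciliation"); return nil },
}

func reconcile(c *Cluster) error {
	reconcilers := allReconcilers
	if err := updateStatus(c); err != nil {
		// Special case: the initial status fetch failed, so run only
		// the reconcilers that can bring the missing pods back.
		reconcilers = recoveryReconcilers
	}
	for _, r := range reconcilers {
		if err := r(c); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = reconcile(&Cluster{quorumReachable: false})
}
```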

Type of change

  • New feature (non-breaking change which adds functionality)

Discussion

Feedback welcome for:

  • the subset of reconcilers to run when the initial status fetch fails
  • test coverage

Testing

I would like to add an E2E test for this, but I am new to the project and would like some feedback first.

Documentation

No documentation was updated, but it possibly should be, so that this scenario is covered as one of the documented failure cases.

Follow-up

None needed.

@johscheuer self-requested a review on February 22, 2024 at 12:25
@johscheuer (Member) left a comment:

> If 1 or more coordinator pods are deleted, quorum becomes unreachable and any attempt to read the cluster status will fail.
> This change adds logic to special-case the first updateStatus reconciler so that in case of failure, only some reconcilers apt to re-add the missing pods are run.

The FDB cluster should only become unreachable if a quorum of coordinators becomes unreachable; otherwise, the operator should still be able to fetch the status and change the coordinators to a new set of running Pods. I made some changes a few weeks ago that will help bring the cluster back if the majority of coordinators are deleted: #1932. One important note here is that this will only work if DNS entries are used in the cluster file; otherwise, the newly created Pods will get new IP addresses.

Thanks for your PR.

@gm42 (Contributor, Author) commented Feb 23, 2024

> The FDB cluster should only become unreachable if a quorum of coordinators becomes unreachable; otherwise, the operator should still be able to fetch the status and change the coordinators to a new set of running Pods.

Yes, my bad - I was too terse in that sentence; I'll amend it.

> I made some changes a few weeks ago that will help bring the cluster back if the majority of coordinators are deleted: #1932. One important note here is that this will only work if DNS entries are used in the cluster file; otherwise, the newly created Pods will get new IP addresses.

OK, in this case shall I close the PR?

@johscheuer (Member) replied:

> > The FDB cluster should only become unreachable if a quorum of coordinators becomes unreachable; otherwise, the operator should still be able to fetch the status and change the coordinators to a new set of running Pods.
>
> Yes, my bad - I was too terse in that sentence; I'll amend it.

No worries :)

> > I made some changes a few weeks ago that will help bring the cluster back if the majority of coordinators are deleted: #1932. One important note here is that this will only work if DNS entries are used in the cluster file; otherwise, the newly created Pods will get new IP addresses.
>
> OK, in this case shall I close the PR?

I think we can close this PR. I'll try to merge the other PR today, so it will be included in the next release. I have to track down an issue in the FDB Go bindings that you initially discovered, but I will do that in a separate PR.

@johscheuer (Member) left a comment:

Solved in #1932. @gm42, if you still observe issues, please feel free to raise an issue or another PR.

Thanks again!

@johscheuer closed this Feb 25, 2024
@gm42 (Contributor, Author) commented Feb 26, 2024

> Solved in #1932. @gm42, if you still observe issues, please feel free to raise an issue or another PR.
>
> Thanks again!

Great, thanks! I will definitely test it out ASAP.

> I have to track down an issue in the FDB Go bindings that you initially discovered, but I will do that in a separate PR.

That would be great! On this topic: I have seen the run_network crash also happen in the operator. A mitigation would be to use a wait group to track all the goroutines that call C.fdb_run_network(), and then have the Go finalizer for the client connection wait on that wait group after calling C.fdb_stop_network().
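
A minimal sketch of that shutdown pattern, in plain Go: `runNetwork` and `stopNetwork` below are placeholder stand-ins for the cgo calls `C.fdb_run_network()` and `C.fdb_stop_network()` (the real bindings link against libfdb_c), and the `conn` type is a hypothetical handle, not the bindings' actual API.

```go
// Sketch of the suggested mitigation: track the network goroutine with
// a WaitGroup and have the handle's finalizer wait on it after stopping
// the network. runNetwork/stopNetwork are placeholders for the cgo
// calls C.fdb_run_network() and C.fdb_stop_network().
package main

import (
	"runtime"
	"sync"
)

var (
	networkDone = make(chan struct{})
	stopOnce    sync.Once
)

// runNetwork blocks until stopNetwork is called, mimicking how
// C.fdb_run_network() blocks until C.fdb_stop_network().
func runNetwork()  { <-networkDone }
func stopNetwork() { stopOnce.Do(func() { close(networkDone) }) }

// conn models the Go-side handle that owns the network goroutine.
type conn struct {
	wg *sync.WaitGroup
}

func newConn() *conn {
	// The goroutine deliberately captures only the WaitGroup, not the
	// handle itself, so the handle can become unreachable and its
	// finalizer can actually run.
	wg := &sync.WaitGroup{}
	wg.Add(1)
	go func() {
		defer wg.Done()
		runNetwork() // would be C.fdb_run_network() in the real bindings
	}()
	c := &conn{wg: wg}
	// The finalizer stops the network thread and then waits on the
	// wait group, so the process never tears the handle down while
	// run_network is still executing.
	runtime.SetFinalizer(c, func(c *conn) {
		stopNetwork() // would be C.fdb_stop_network()
		c.wg.Wait()
	})
	return c
}

func main() {
	c := newConn()
	// For demonstration, trigger the shutdown path directly instead of
	// waiting for the garbage collector to invoke the finalizer.
	stopNetwork()
	c.wg.Wait()
}
```

The sync.Once guard makes the stop call idempotent, so an explicit shutdown path and the finalizer cannot race on closing the channel.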

@johscheuer (Member) replied:

> On this topic: I have seen the run_network crash also happen in the operator. A mitigation would be to use a wait group to track all the goroutines that call C.fdb_run_network(), and then have the Go finalizer for the client connection wait on that wait group after calling C.fdb_stop_network().

If you observed an issue and used DNS entries in the cluster file, you might have hit apple/foundationdb#11222, especially in the case where the coordinators are not reachable.
