
Run a special set of subreconcilers when coordinator pods are missing #1949

Closed

Conversation

@gm42 (Contributor) commented Feb 22, 2024

Description

If the majority of coordinator pods are deleted, quorum becomes unreachable and any attempt to read the cluster status will fail.
This change adds logic to special-case the first updateStatus reconciler so that, when it fails, only a subset of reconcilers suited to re-adding the missing pods is run.

This change has been tested experimentally and brings the cluster back online after coordinator pods are manually deleted.

Fixes #1931

I expect it would become even more efficient if/when #1757 is merged.
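
As an illustration of the intended control flow, here is a minimal sketch; `subReconciler`, `updateStatus`, `recoveryReconcilers`, and `Cluster` below are hypothetical stand-ins for the operator's actual types, not its real API:

```go
// Minimal sketch of the proposed flow. All names here (Cluster,
// subReconciler, updateStatus, recoveryReconcilers) are hypothetical
// stand-ins, not the operator's real types.
package main

import (
	"errors"
	"fmt"
)

// Cluster stands in for the FoundationDBCluster resource.
type Cluster struct {
	quorumReachable bool
}

// subReconciler models one step of the reconciliation chain.
type subReconciler func(c *Cluster) error

// updateStatus is the first subreconciler; it fails when quorum is
// unreachable and the cluster status cannot be read.
func updateStatus(c *Cluster) error {
	if !c.quorumReachable {
		return errors.New("cannot read cluster status: quorum unreachable")
	}
	return nil
}

// recoveryReconcilers is the reduced subset meant to re-add the
// missing coordinator pods without needing the cluster status.
var recoveryReconcilers = []subReconciler{
	func(c *Cluster) error { fmt.Println("re-creating missing pods"); return nil },
}

// allReconcilers is the normal, full chain.
var allReconcilers = []subReconciler{
	func(c *Cluster) error { fmt.Println("running full reconciliation"); return nil },
}

func reconcile(c *Cluster) error {
	reconcilers := allReconcilers
	if err := updateStatus(c); err != nil {
		// Special case: the initial status fetch failed, so run only
		// the reconcilers that can bring the missing pods back.
		reconcilers = recoveryReconcilers
	}
	for _, r := range reconcilers {
		if err := r(c); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = reconcile(&Cluster{quorumReachable: false})
}
```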

Type of change

  • New feature (non-breaking change which adds functionality)

Discussion

Feedback welcome for:

  • the subset of reconcilers to run when the initial status fetch fails
  • test coverage

Testing

I would like to add an E2E test for this, but I am new to the project and would like some feedback first.

Documentation

No documentation was updated, but it possibly should be, so that this scenario is covered as one of the documented failure cases.

Follow-up

None needed.

@johscheuer self-requested a review on February 22, 2024 at 12:25
@johscheuer (Member) left a comment:

> If 1 or more coordinator pods are deleted, quorum becomes unreachable and any attempt to read the cluster status will fail.
> This change adds logic to special-case the first updateStatus reconciler so that in case of failure, only some reconcilers apt to re-add the missing pods are run.

The FDB cluster should only become unreachable if a quorum of coordinators becomes unreachable; otherwise, the operator should still be able to fetch the status and change the coordinators to a new set of running Pods. I made some changes a few weeks ago that will help bring the cluster back if the majority of coordinators are deleted: #1932. One important note here is that this will only work if DNS entries are used in the cluster file; otherwise, the newly created Pods will get new IP addresses.

Thanks for your PR.

@gm42 (Contributor, Author) commented Feb 23, 2024

> The FDB cluster should only become unreachable if a quorum of coordinators becomes unreachable; otherwise, the operator should still be able to fetch the status and change the coordinators to a new set of running Pods.

Yes, my bad - I was too terse in that sentence; I'll amend it.

> I made some changes a few weeks ago that will help bring the cluster back if the majority of coordinators are deleted: #1932. One important note here is that this will only work if DNS entries are used in the cluster file; otherwise, the newly created Pods will get new IP addresses.

OK, in this case shall I close the PR?

@johscheuer (Member) replied:

> > The FDB cluster should only become unreachable if a quorum of coordinators becomes unreachable; otherwise, the operator should still be able to fetch the status and change the coordinators to a new set of running Pods.
>
> Yes, my bad - I was too terse in that sentence; I'll amend it.

No worries :)

> > I made some changes a few weeks ago that will help bring the cluster back if the majority of coordinators are deleted: #1932. One important note here is that this will only work if DNS entries are used in the cluster file; otherwise, the newly created Pods will get new IP addresses.
>
> OK, in this case shall I close the PR?

I think we can close this PR. I'll try to merge the other PR today, so it will be included in the next release. I have to track down an issue in the FDB Go bindings that you initially discovered, but I will do that in a separate PR.

@johscheuer (Member) left a comment:

Solved in #1932. @gm42, if you still observe issues, please feel free to raise an issue or another PR.

Thanks again!

@johscheuer closed this Feb 25, 2024
@gm42 (Contributor, Author) commented Feb 26, 2024

> Solved in #1932. @gm42, if you still observe issues, please feel free to raise an issue or another PR.
>
> Thanks again!

Great, thanks! I will definitely test it out ASAP.

> I have to track down an issue in the FDB Go bindings that you initially discovered, but I will do that in a separate PR.

That would be great! On this topic: I have seen the run_network crash also happen in the operator. A mitigation would be to use a wait group to track all the goroutines that call C.fdb_run_network(), and then have the Go finalizer for the client connection wait on that wait group after calling C.fdb_stop_network().
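
A minimal sketch of that shutdown pattern, in plain Go: `runNetwork` and `stopNetwork` below are placeholder stand-ins for the cgo calls `C.fdb_run_network()` and `C.fdb_stop_network()` (the real bindings link against libfdb_c), and the `conn` type is a hypothetical handle, not the bindings' actual API.

```go
// Sketch of the suggested mitigation: track the network goroutine with
// a WaitGroup and have the handle's finalizer wait on it after stopping
// the network. runNetwork/stopNetwork are placeholders for the cgo
// calls C.fdb_run_network() and C.fdb_stop_network().
package main

import (
	"runtime"
	"sync"
)

var (
	networkDone = make(chan struct{})
	stopOnce    sync.Once
)

// runNetwork blocks until stopNetwork is called, mimicking how
// C.fdb_run_network() blocks until C.fdb_stop_network().
func runNetwork()  { <-networkDone }
func stopNetwork() { stopOnce.Do(func() { close(networkDone) }) }

// conn models the Go-side handle that owns the network goroutine.
type conn struct {
	wg *sync.WaitGroup
}

func newConn() *conn {
	// The goroutine deliberately captures only the WaitGroup, not the
	// handle itself, so the handle can become unreachable and its
	// finalizer can actually run.
	wg := &sync.WaitGroup{}
	wg.Add(1)
	go func() {
		defer wg.Done()
		runNetwork() // would be C.fdb_run_network() in the real bindings
	}()
	c := &conn{wg: wg}
	// The finalizer stops the network thread and then waits on the
	// wait group, so the process never tears the handle down while
	// run_network is still executing.
	runtime.SetFinalizer(c, func(c *conn) {
		stopNetwork() // would be C.fdb_stop_network()
		c.wg.Wait()
	})
	return c
}

func main() {
	c := newConn()
	// For demonstration, trigger the shutdown path directly instead of
	// waiting for the garbage collector to invoke the finalizer.
	stopNetwork()
	c.wg.Wait()
}
```

The sync.Once guard makes the stop call idempotent, so an explicit shutdown path and the finalizer cannot race on closing the channel.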

@johscheuer (Member) replied:

> On this topic: I have seen the run_network crash also happen in the operator. A mitigation would be to use a wait group to track all the goroutines that call C.fdb_run_network(), and then have the Go finalizer for the client connection wait on that wait group after calling C.fdb_stop_network().

If you observed an issue and used DNS entries in the cluster file, you might have hit apple/foundationdb#11222, especially in the case where the coordinators are not reachable.
