DataPlane: add self healing for "live" resources when BG controller is enabled #160

Open
pmalek opened this issue Aug 28, 2023 · 5 comments
Labels
area/dataplane blocked migrated-from-archive Issues migrated from the archived KGO repository

Comments

@pmalek
Member

pmalek commented Aug 28, 2023

Problem statement

Kong/gateway-operator-archive#91 introduced the "self-healing" concept: the operator not only updates the subresources it manages when a configuration drift occurs, but also recreates them when they get deleted for any reason. For example, a DataPlane Deployment gets recreated whenever it is deleted.

After the introduction of the BlueGreen DataPlane controller, this stopped being the case when said controller is enabled, because reconciliation is delegated to DataPlaneReconciler only under specific conditions: currently, when a DataPlane doesn't have a BlueGreen rollout strategy defined, or when it's "not ready":

// Blue Green rollout strategy is not enabled, delegate to DataPlane controller.
if dataplane.Spec.Deployment.Rollout == nil || dataplane.Spec.Deployment.Rollout.Strategy.BlueGreen == nil {
	trace(log, "no Rollout with BlueGreen strategy specified, delegating to DataPlaneReconciler", req)
	return r.DataPlaneController.Reconcile(ctx, req)
}
if !k8sutils.IsReady(&dataplane) {
	debug(log, "DataPlane is not ready yet to proceed with BlueGreen rollout, delegating to DataPlaneReconciler", req)
	return r.DataPlaneController.Reconcile(ctx, req)
}

What currently works with regard to the above:

  • self-healing of "preview" subresources
  • recreation of "live" resources on promotion

This issue tracks the effort of re-introducing self-healing of "live" resources for DataPlanes that use the BlueGreen rollout strategy.

Proposed solution(s)

Additional information

Blocked by https://github.com/Kong/gateway-operator/issues/1031.

Acceptance criteria

  • As a User I can expect the operator to re-create "live" DataPlane subresources whenever they are deleted
  • As a User I can expect the operator to update/patch "live" DataPlane subresources whenever they are changed without promotion
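
The two acceptance criteria above boil down to one decision per "live" subresource: create it if it is missing, patch it if it drifted, otherwise do nothing. A minimal sketch of that decision follows, assuming the expected "live" spec is available from somewhere. `DeploymentSpec`, `Action` and `decideLiveAction` are hypothetical names used for illustration only and are not part of the operator's code.

```go
package main

import (
	"fmt"
	"reflect"
)

// DeploymentSpec is a hypothetical, trimmed-down stand-in for the spec of a
// "live" DataPlane subresource. It is not the operator's real type.
type DeploymentSpec struct {
	Image    string
	Replicas int
}

// Action is the self-healing step to take for a "live" subresource.
type Action string

const (
	ActionNone   Action = "none"
	ActionCreate Action = "create"
	ActionPatch  Action = "patch"
)

// decideLiveAction compares the expected "live" spec with what is observed
// in the cluster. A nil observed spec means the subresource was deleted.
func decideLiveAction(expected DeploymentSpec, observed *DeploymentSpec) Action {
	if observed == nil {
		return ActionCreate
	}
	if !reflect.DeepEqual(expected, *observed) {
		return ActionPatch
	}
	return ActionNone
}

func main() {
	live := DeploymentSpec{Image: "kong:3.4", Replicas: 2}

	// The subresource was deleted.
	fmt.Println(decideLiveAction(live, nil))

	// The subresource was changed without a promotion.
	drifted := DeploymentSpec{Image: "kong:3.4", Replicas: 5}
	fmt.Println(decideLiveAction(live, &drifted))

	// The subresource matches the expected spec.
	same := live
	fmt.Println(decideLiveAction(live, &same))
}
```

The sketch leaves open where the expected "live" spec comes from, which is the hard part of this issue: during a rollout the DataPlane spec describes the "preview" resources, not the "live" ones.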
@czeslavo czeslavo self-assigned this Aug 28, 2023
@czeslavo
Contributor

czeslavo commented Aug 30, 2023

I wasn't able to find this stated directly in the api-server documentation, but it appears from the code that there's garbage collection which prunes the oldest resource versions. It runs every 5 minutes and is not configurable. That makes storing a ResourceVersion and relying on it to get the proper "live" spec not robust enough, as it would break if the persisted version were garbage collected.

There's also this reply under a k8s issue which confirms this behavior.

I think we have to come up with our own way to persist the "live" spec explicitly in the DataPlane. I'll try to go with storing the whole spec as a JSON blob.
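
One way the JSON blob idea could look is sketched below, under loud assumptions: the comment does not say where in the DataPlane the blob would live, so the annotation key `example.com/live-spec`, the `LiveSpec` type and both helper functions are hypothetical and only illustrate the round trip.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LiveSpec is a hypothetical, trimmed-down stand-in for the "live" spec.
type LiveSpec struct {
	Image    string `json:"image"`
	Replicas int    `json:"replicas"`
}

// liveSpecAnnotation is an assumed key; the issue does not name one.
const liveSpecAnnotation = "example.com/live-spec"

// persistLiveSpec serializes the spec to JSON and stores it under the
// annotation key. The annotations map must be non-nil.
func persistLiveSpec(annotations map[string]string, spec LiveSpec) error {
	raw, err := json.Marshal(spec)
	if err != nil {
		return fmt.Errorf("marshalling live spec: %w", err)
	}
	annotations[liveSpecAnnotation] = string(raw)
	return nil
}

// loadLiveSpec reads the persisted spec back. The boolean reports whether
// a spec was stored at all.
func loadLiveSpec(annotations map[string]string) (LiveSpec, bool, error) {
	raw, ok := annotations[liveSpecAnnotation]
	if !ok {
		return LiveSpec{}, false, nil
	}
	var spec LiveSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		return LiveSpec{}, true, fmt.Errorf("unmarshalling live spec: %w", err)
	}
	return spec, true, nil
}

func main() {
	annotations := map[string]string{}
	if err := persistLiveSpec(annotations, LiveSpec{Image: "kong:3.4", Replicas: 2}); err != nil {
		panic(err)
	}
	spec, found, err := loadLiveSpec(annotations)
	fmt.Println(spec, found, err)
}
```

Unlike a stored ResourceVersion, a blob kept this way does not depend on the api-server retaining old revisions.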

@pmalek
Member Author

pmalek commented Aug 30, 2023

I'm not saying we should do this now, but the CRD to be implemented in #159 could serve the purpose of holding said spec.

It would also make it easier for the user to reason about which spec was used in a particular rollout.

@czeslavo
Contributor

czeslavo commented Aug 30, 2023

Yeah, it's definitely a good idea to combine those two so that we don't do the same job twice. 👍 I'll see if I can make it a minimally viable solution that just carries the spec for now, leaving it ready for extension.

@czeslavo
Contributor

czeslavo commented Aug 31, 2023

Together with @pmalek we came to conclusions:

@czeslavo
Contributor

Since, for now, https://github.com/Kong/gateway-operator/issues/1048 will be a simpler solution to the problem of accidental removal of DataPlane-owned resources, I'm moving this one out of the Cloud Gateways Phase 0 milestone, @pmalek.

@czeslavo czeslavo removed their assignment Sep 19, 2023
@pmalek pmalek added the migrated-from-archive Issues migrated from the archived KGO repository label Apr 19, 2024
@czeslavo czeslavo transferred this issue from another repository Apr 22, 2024
@czeslavo czeslavo transferred this issue from another repository Apr 23, 2024