fix(discovery) update non-terminating instances #4515
Conversation
Use the not-terminating condition instead of the ready condition to determine which Kong instances are available for configuration updates. This provides compatibility with Kong instances that use the 3.3+ /status/ready endpoint instead of /status.
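For illustration, a minimal sketch of the approach, assuming discovery reads EndpointSlices via the client-go types; the package and function names are hypothetical, not the actual KIC code:

```go
// Sketch: select admin API endpoints that are not terminating, rather than
// requiring the Ready condition. A fresh Kong 3.3+ instance fails its
// /status/ready probe until it receives configuration, so filtering on Ready
// would exclude exactly the instances that still need a config push.
package discovery

import (
	discoveryv1 "k8s.io/api/discovery/v1"
)

// nonTerminatingAddresses returns addresses of endpoints that are not being
// terminated, regardless of whether they currently report Ready.
func nonTerminatingAddresses(slice discoveryv1.EndpointSlice) []string {
	var addrs []string
	for _, ep := range slice.Endpoints {
		// Terminating is a *bool; nil or false both mean "not terminating".
		if ep.Conditions.Terminating != nil && *ep.Conditions.Terminating {
			continue
		}
		addrs = append(addrs, ep.Addresses...)
	}
	return addrs
}
```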
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## release/2.10.x #4515 +/- ##
================================================
- Coverage 61.1% 61.0% -0.1%
================================================
Files 152 152
Lines 16959 16959
================================================
- Hits 10367 10353 -14
- Misses 5953 5966 +13
- Partials 639 640 +1
☔ View full report in Codecov by Sentry.
a78a623 to ade87ba
@pmalek do you recall what the pending/ready distinction does in practice? I read through #4368 again and am not actually sure based on the description alone--it describes what it does regarding various subsystems in the code, but not the practical implications. Without it (2f420a9 only, essentially--the rest of the changes here are tests and backport compat stuff), everything still appears to work normally. Deploying the latest ingress chart with 2.10.4 results in the known behavior where instances of the proxy never become ready because the controller refuses to push configuration. Deploying with the image built from this branch and some config does result in ready gateway instances with the expected configuration.
I thought maybe scaling would be an issue, but don't see an obvious one: scaling up spawns new gateways that all do eventually (<10s) get configuration and become ready. Scaling down doesn't do much of interest since it just deletes instances. The controller shows some errors in logs indicating that it failed to push config due to instances going offline, but they don't continue indefinitely, just briefly, presumably when an update happens to catch a terminating instance before we see that it's terminating. I thought maybe excluding the additional code would prevent the controller from properly finding new instances or hold onto old ones indefinitely, but that doesn't appear to be the case. It could maybe do something to Konnect reporting--I haven't checked that yet.
@rainest pending clients are those that were discovered but are not ready. The bit of code that checks this is in kubernetes-ingress-controller/internal/clients/readiness.go, lines 79 to 87 (at 06f95cb); see also readiness.go lines 89 to 113 and kubernetes-ingress-controller/internal/clients/manager.go lines 138 to 151.
Then there's the synchronous part of it, which cannot react to objects in k8s (but now that I think of it, it could be rewritten to react to endpoints similarly): kubernetes-ingress-controller/internal/clients/manager.go, lines 221 to 222 (at 06f95cb).
Here's the doc written by @czeslavo for this problem: https://docs.google.com/document/d/1FdTBPD-8PACD93TjSzrBG_CSeMHir9CrVAoWFM5ru_4/edit#heading=h.ogl5esaoc0om Having said that, I'm surprised that it works in 2.10 with just the changes from this branch 🤔 To my understanding you'd need
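For reference, a rough sketch of the ready/pending split described above; the interface and function names are illustrative, not the actual internal/clients code:

```go
// Sketch of a readiness checker that promotes pending clients once their
// admin API responds and demotes ready clients that stop responding.
package clients

import "context"

// AdminClient is a stand-in for a Kong admin API client.
type AdminClient interface {
	// IsReady probes the instance, e.g. GET /status or /status/ready.
	IsReady(ctx context.Context) error
}

// CheckReadiness re-probes both sets and returns the updated ready and
// pending lists, handling pending->ready and ready->pending transitions.
func CheckReadiness(ctx context.Context, ready, pending []AdminClient) (newReady, newPending []AdminClient) {
	for _, c := range pending {
		if err := c.IsReady(ctx); err != nil {
			newPending = append(newPending, c) // still pending
		} else {
			newReady = append(newReady, c) // pending -> ready
		}
	}
	for _, c := range ready {
		if err := c.IsReady(ctx); err != nil {
			newPending = append(newPending, c) // ready -> pending
		} else {
			newReady = append(newReady, c)
		}
	}
	return newReady, newPending
}
```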
So it looks like the crux of it is:
AFAICT the additional code is just ensuring that the admin API is up. Based on that I think the minimal backport is fine, and the minor additional error-reporting cost is outweighed by the benefit of maintaining compatibility.
Yeah, that's the gist of it. 👍 The code that wasn't ported basically operates on the same set of non-terminating endpoints, but instead of using them blindly, it runs the readiness check manually. It also detects potential ready->not ready transitions and, in such a case, moves a gateway from the ready list to the pending list. If we assume that the Gateway is very unlikely to get stuck in a state in which it's not ready to serve Admin API requests, I believe this backport should work (with the quirk that @rainest mentions - there will be logs about failed configuration updates during rollouts, and some noise in metrics because KIC will try to send config to Gateway Pods that were just provisioned). I imagine that if we'd like to go cheap, but not the cheapest, we could add a synchronous status check before every config update to ensure a Gateway client is ready. If it's not ready, we don't treat it as a failure and just skip that Gateway in the update cycle.
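A sketch of that "cheap, but not the cheapest" option; the gatewayClient interface and helper are hypothetical, not the KIC API:

```go
// Sketch: probe each gateway synchronously before pushing config and skip the
// ones that aren't ready yet, instead of recording them as failed updates.
package clients

import (
	"context"
	"errors"
)

type gatewayClient interface {
	IsReady(ctx context.Context) error                 // e.g. GET /status/ready
	ApplyConfig(ctx context.Context, cfg []byte) error // push declarative config
}

func pushConfig(ctx context.Context, gateways []gatewayClient, cfg []byte) error {
	var errs []error
	for _, gw := range gateways {
		if err := gw.IsReady(ctx); err != nil {
			// Likely a freshly provisioned Pod whose /status/ready still fails:
			// skip it this cycle rather than treating it as a failure.
			continue
		}
		if err := gw.ApplyConfig(ctx, cfg); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```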
I'm 🤷 on the transient errors, and we can probably add the simpler synchronous check separately if we want, so marking this ready for review. Ripping out depguard since I don't want to bother backporting the golangci changes. Backports from future versions shouldn't be introducing unwanted deps anyway; they've passed those checks elsewhere already. 2.9 looks more annoying to backport due to further drift elsewhere from when these changes were introduced. Dunno if we try for it as well.
I removed setting up the initial config in the integration test suite - as we bump to KTF v0.38.0, Kong's readiness probe is set to
What this PR does / why we need it:
Use the not-terminating condition instead of the ready condition to determine which Kong instances are available for configuration updates. This provides compatibility with Kong instances that use the 3.3+ /status/ready endpoint instead of /status.
Partial backport of #4368.
Which issue this PR fixes:
2.11 is the first release fully compatible with the new /status/ready endpoint. It adds both a phantom upstream on empty config and this change, to send configuration to non-ready endpoints. Backporting this should allow broader compatibility with versions that use /status/ready without most of the 2.11 changes.
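To illustrate the compatibility problem being worked around here: on Kong 3.3+ the readiness endpoint only reports ready after the instance has received configuration, so a controller that waits for readiness before pushing config never gets to push it. A minimal probe sketch (not KIC code; the helper and the status listener address are assumptions):

```go
// Minimal probe against the Kong status listener: /status/ready is expected
// to return 200 once the instance is configured and a non-200 (typically 503
// in DB-less mode) while it is still waiting for its first configuration.
package main

import (
	"fmt"
	"net/http"
)

func kongReady(statusBaseURL string) (bool, error) {
	resp, err := http.Get(statusBaseURL + "/status/ready")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	ready, err := kongReady("http://localhost:8100") // assumed status listener address
	fmt.Println(ready, err)
}
```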
Special notes for your reviewer:
This is not the only change in #4368, and I don't know the full extent of other discovery changes in 2.11 either. This works in a spot check (a 2.10 image that includes it does come up successfully when Kong instances use /status/ready), but I'm unsure if there are further unexpected issues by making a minimal backport.
PR Readiness Checklist:
Complete these before marking the PR as ready to review:
- CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR