
fix(discovery) update non-terminating instances #4515

Merged
9 commits merged into release/2.10.x on Aug 29, 2023

Conversation

@rainest (Contributor) commented Aug 15, 2023

What this PR does / why we need it:

Use the "not terminating" condition instead of the "ready" condition to determine which Kong instances are available for configuration updates. This provides compatibility with Kong instances that use the 3.3+ /status/ready endpoint instead of /status.
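As a rough illustration of the rule change, here's a minimal sketch that assumes discovery filters EndpointSlice endpoints; `notTerminating` is a hypothetical helper for illustration, not the actual endpoints.go code:

```go
package discovery

import (
	discoveryv1 "k8s.io/api/discovery/v1"
)

// notTerminating treats any endpoint that is not actively terminating as a
// candidate for configuration updates, even if its Ready condition is still
// false because /status/ready reports unready until config arrives.
// A nil Terminating condition is interpreted as "not terminating".
func notTerminating(ep discoveryv1.Endpoint) bool {
	return ep.Conditions.Terminating == nil || !*ep.Conditions.Terminating
}
```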

Partial backport of #4368.

Which issue this PR fixes:

2.11 is the first release fully compatible with the new /status/ready endpoint. It adds both a phantom upstream on empty config and this change, to send configuration to non-ready endpoints.

Backporting this should allow broader compatibility with versions that use /status/ready without most of the 2.11 changes.

Special notes for your reviewer:

This is not the only change in #4368, and I don't know the full extent of other discovery changes in 2.11 either. This works in a spot check (a 2.10 image that includes it does come up successfully when Kong instances use /status/ready), but I'm unsure if there are further unexpected issues by making a minimal backport.

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR

codecov bot commented Aug 15, 2023

Codecov Report

Patch coverage: 100.0% and project coverage change: -0.1% ⚠️

Comparison is base (72cd759) 61.1% compared to head (3f58b5d) 61.0%.

Additional details and impacted files
```
@@               Coverage Diff                @@
##           release/2.10.x   #4515     +/-   ##
================================================
- Coverage            61.1%   61.0%   -0.1%
================================================
  Files                 152     152
  Lines               16959   16959
================================================
- Hits                10367   10353     -14
- Misses               5953    5966     +13
- Partials              639     640      +1
```

| Files Changed | Coverage | Δ |
| --- | --- | --- |
| internal/adminapi/endpoints.go | 88.1% <100.0%> | (ø) |

... and 1 file with indirect coverage changes


@pull-request-size pull-request-size bot added size/M and removed size/S labels Aug 16, 2023
@pull-request-size pull-request-size bot added size/L and removed size/M labels Aug 23, 2023
@rainest rainest force-pushed the backport/terminating-discovery branch from a78a623 to ade87ba on August 23, 2023 22:34
@rainest (Contributor, Author) commented Aug 23, 2023

@pmalek do you recall what the pending/ready distinction does in practice? I read through #4368 again and am still not sure based on the description alone--it describes what it does to the various subsystems in the code, but not the practical implications.

Without it (2f420a9 only, essentially--the rest of the changes here are tests and backport compat stuff), everything still appears to work normally.

Deploying the latest ingress chart with 2.10.4 results in the known behavior where instances of the proxy never become ready because the controller refuses to push configuration. Deploying with the image built from this branch and some config does result in ready gateway instances with the expected configuration:

```shell
helm install ana kong/ingress -f /tmp/values.yaml
kubectl apply -f https://raw.githubusercontent.com/rainest/kube-example/main/httpbin-basic.yaml
```

values.yaml.txt (attached)

I thought maybe scaling would be an issue, but don't see an obvious one: scaling up spawns new gateways that all do eventually (<10s) get configuration and become ready. Scaling down doesn't do much of interest since it just deletes instances.

The controller shows some errors in logs indicating that it failed to push config due to instances going offline, but they don't continue indefinitely, just briefly, presumably when an update happens to catch a terminating instance before we see that it's terminating.

I thought maybe excluding the additional code would prevent the controller from properly finding new instances or hold onto old ones indefinitely, but that doesn't appear to be the case. It could maybe do something to Konnect reporting--I haven't checked that yet.

@pmalek (Member) commented Aug 24, 2023

@rainest pending clients are those that were discovered but are not ready (the /status/ready probe doesn't return 200 for them because of a lack of config).

The bit of code that checks this is specifically in

```go
// checkPendingGatewayClients checks if the pending clients are ready to be used and returns the ones that are.
func (c DefaultReadinessChecker) checkPendingGatewayClients(ctx context.Context, lastPending []adminapi.DiscoveredAdminAPI) (turnedReady []*adminapi.Client) {
	for _, adminAPI := range lastPending {
		if client := c.checkPendingClient(ctx, adminAPI); client != nil {
			turnedReady = append(turnedReady, client)
		}
	}
	return turnedReady
}
```
and
```go
// checkPendingClient indirectly check readiness of the client by trying to create it. If it succeeds then it
// means that the client is ready to be used. It returns a non-nil client if the client is ready to be used, otherwise
// nil is returned.
func (c DefaultReadinessChecker) checkPendingClient(
	ctx context.Context,
	pendingClient adminapi.DiscoveredAdminAPI,
) (client *adminapi.Client) {
	defer func() {
		c.logger.V(util.DebugLevel).
			Info(fmt.Sprintf("checking readiness of pending client for %q", pendingClient.Address),
				"ok", client != nil,
			)
	}()

	ctx, cancel := context.WithTimeout(ctx, readinessCheckTimeout)
	defer cancel()

	client, err := c.factory.CreateAdminAPIClient(ctx, pendingClient)
	if err != nil {
		// Despite the error reason we still want to keep the client in the pending list to retry later.
		c.logger.V(util.DebugLevel).Error(err, fmt.Sprintf("pending client for %q is not ready yet", pendingClient.Address))
		return nil
	}

	return client
}
```
Both of these check whether a discovered Admin API is ready. Discovered APIs arrive through this subscription:

```go
func (c *AdminAPIClientsManager) Notify(discoveredAPIs []adminapi.DiscoveredAdminAPI) {
	// Ensure here that we're not done.
	select {
	case <-c.ctx.Done():
		return
	default:
	}

	// And here also listen on c.ctx.Done() to allow the notification to be interrupted.
	select {
	case <-c.ctx.Done():
	case c.discoveredAdminAPIsNotifyChan <- discoveredAPIs:
	}
}
```

which is called in https://github.com/Kong/kubernetes-ingress-controller/blob/main/internal/controllers/configuration/kongadminapi_controller.go#L175-L181. "Ready" here means you can create a client out of it: https://github.com/Kong/kubernetes-ingress-controller/blob/06f95cb8bc79e13faca5284f3043efd63daf7227/internal/clients/readiness.go#L105C27-L105C47


Then there's the synchronous part of it, which cannot react to objects in k8s (though now that I think of it, it could be rewritten to react to Endpoints similarly to how KongAdminAPIServiceReconciler does, but filtering out the unready ones 🤔, instead of relying on the synchronous code here. But that's another topic) in

```go
case <-c.readinessReconciliationTicker.Channel():
	c.onReadinessReconciliationTick()
```

which periodically checks the ready clients and the pending discovered Admin APIs, verifying whether they respond, and reorganizes the lists accordingly.


Here's the doc written by @czeslavo for this problem: https://docs.google.com/document/d/1FdTBPD-8PACD93TjSzrBG_CSeMHir9CrVAoWFM5ru_4/edit#heading=h.ogl5esaoc0om


Having said that, I'm surprised that it works in 2.10 with just the changes from this branch 🤔 To my understanding you'd need internal/clients/readiness.go and internal/clients/manager.go from 2.11 or main.

@rainest (Contributor, Author) commented Aug 24, 2023

So it looks like the crux of it is:

AdminAPIClientsManager, instead of blindly accepting all the endpoints discovered by KongAdminAPIServiceReconciler, actively runs Admin API health checks (using the old /status) on all the endpoints it receives from the KongAdminAPIServiceReconciler.

  • If an endpoint is not admin-ready yet, AdminAPIClientsManager keeps it on a pending list.
  • If an endpoint is admin-ready, it’s kept on an active list.

AdminAPIClientsManager periodically (with the readinessProbe’s default period and timeout) runs Admin API health checks of all the endpoints from both pending and active lists.

  • If an endpoint that was admin-ready becomes not ready - it’s moved to the pending list.
  • If an endpoint that wasn’t admin-ready becomes ready - it’s moved to the active list.

AFAICT the additional code is just ensuring that the admin API is up (a /status 200 doesn't tell you much else) before attempting to send config to it. Not including that means you may attempt to send config to instances that aren't listening yet, which lines up with the errors I see during scaling. In practice that's fine though--the controller will just retry later, and at worst you get some transient failures reported in metrics.
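For illustration, a minimal sketch of the pending/active bookkeeping described above; the names and structure are assumptions, not the actual AdminAPIClientsManager code:

```go
package clients

// readinessTracker is a toy model of the two lists: addresses that passed the
// readiness check and addresses that were discovered but are not ready yet.
type readinessTracker struct {
	active  []string
	pending []string
}

// reconcile re-checks every known address on each tick and moves it to the
// appropriate list, mirroring the ready<->pending transitions described above.
func (t *readinessTracker) reconcile(isReady func(addr string) bool) {
	var nextActive, nextPending []string
	for _, addr := range append(t.active, t.pending...) {
		if isReady(addr) {
			nextActive = append(nextActive, addr)
		} else {
			nextPending = append(nextPending, addr)
		}
	}
	t.active, t.pending = nextActive, nextPending
}
```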

Based on that I think the minimal backport is fine, and the minor additional error reporting cost is outweighed by the benefit of maintaining compatibility.

@czeslavo (Contributor) commented:

> AFAICT the additional code is just ensuring that the admin API is up (a /status 200 doesn't tell you much else) before attempting to send config to it.

Yeah, that's the gist of it. 👍 The code that wasn't ported basically operates on the same set of non-terminating endpoints, but instead of using them blindly, it runs the readiness check manually. It also detects potential ready->not ready transitions and, in such a case, moves a gateway from the ready list back to the pending list.

If we assume that the Gateway is very unlikely to get stuck in a state in which it's not ready to serve Admin API requests, I believe this backport should work (with the quirk that @rainest mentions - there will be logs about failed configuration updates during rollouts, and some noise in metrics because KIC will try to send config to Gateway Pods that were just provisioned).

I imagine that if we'd like to go cheap, but not the cheapest, we could add a synchronous status check before every config update to ensure a Gateway client is ready. If it's not, we don't treat that as a failure and simply skip that Gateway in the update cycle.
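A rough sketch of that idea; `readyForUpdate` and the 3-second timeout are assumptions for illustration, not code from this PR:

```go
package clients

import (
	"context"
	"net/http"
	"time"
)

// readyForUpdate probes a gateway's status endpoint right before a config
// push; a non-200 answer means "skip this gateway for this cycle" rather than
// "count a failed update".
func readyForUpdate(ctx context.Context, httpClient *http.Client, statusURL string) bool {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, statusURL, nil)
	if err != nil {
		return false
	}
	resp, err := httpClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```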

@rainest rainest marked this pull request as ready for review August 28, 2023 19:09
@rainest rainest requested a review from a team as a code owner August 28, 2023 19:09
@rainest (Contributor, Author) commented Aug 28, 2023

I'm 🤷 on the transient errors, and we can probably add the simpler synchronous check separately if we want, so I'm marking this ready for review.

Ripping out depguard since I don't want to bother backporting the golangci changes. Backports from future releases shouldn't be introducing unwanted deps anyway; they've already passed those checks elsewhere.

2.9 looks more annoying to backport due to further drift elsewhere since these changes were introduced. Dunno if we should try for it as well.

@czeslavo (Contributor) commented:

I removed setting up the initial config in the integration test suite - as we bump to KTF v0.38.0, Kong's readiness probe is set to /status when building the addon with the controller disabled, which is enough for it to become ready.

@rainest rainest merged commit 82db2db into release/2.10.x Aug 29, 2023
26 checks passed
@rainest rainest deleted the backport/terminating-discovery branch August 29, 2023 22:38