
feat(manager) block synchronizer on cache readiness #5483

Draft
wants to merge 2 commits into base: main
Conversation


@rainest rainest commented Jan 25, 2024

What this PR does / why we need it:

Removes the time-based startup delay in favor of obtaining the manager cache and calling its wait function during synchronizer start. If the cache has not fully synced, the dataplane synchronizer blocks until it has. If the cache never syncs within its grace period, the cache sync timeout terminates the controller, so it never pushes any configuration.

This better addresses situations where the controller may incorrectly delete configuration upon gaining leadership because it cannot yet see all resources on the cluster. Configuration previously added by another controller replica should remain intact.

The previous naive time-based delay addressed this for some cases, but relied on an arbitrary "good enough" wait time. AFAIK this time did work in practice during normal startup. However, it did not work if the cache sync got stuck, such as when the controller role was broken or when API resources were not defined.

Which issue this PR fixes:

Fix #2315

Special notes for your reviewer:

I could have sworn we had no means of getting the manager cache, that it was reserved for the manager's internal use. I then saw a GetCache() call in one of the controllers and re-checked; apparently it has (maybe always?) been accessible despite living in an internal.go.

Manual testing supports the notion that this works in problem scenarios. I deployed a test image to a normal install and confirmed successful start and config population. While watching the admin API for configuration changes, I deleted the Ingress permissions from the controller role, scaled to 0, and scaled back to 1. The new replica started and blocked without syncing configuration before eventually hitting a cache sync timeout and dying.

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR

Obtain the manager cache during synchronizer setup and store it in the
synchronizer. Block synchronizer start on cache sync.

When the manager gains leadership, it starts populating the cache and
starts runnables (such as the synchronizer). It waits for the cache to
sync (i.e. to read all entities present when it started) before certain
tasks, but does not gate generic runnables on that wait.

The synchronizer indirectly accesses the cache contents. Controllers
populate the store with a filtered cache. Depending on whether sync has
completed, the store may have an incomplete view of resources in the
cluster and may delete Kong configuration for as-yet uncached resources.
This is bad and can lead to catastrophic wiping of proxy config
associated with resources that are actually still present.

Blocking the synchronizer start on cache sync ensures this does not
happen.

Remove the naive time-based synchronizer delay. This was a hack to avoid
configuration deletion at startup without knowledge of the manager cache
state. It is unnecessary now that the synchronizer consults the manager
cache state directly.

rainest commented Jan 25, 2024

Although most of the behavior here works as expected, I'm seeing something I can't explain: after the sync timeout, the synchronizer still runs a sync (deleting config) during shutdown. I don't know exactly what's causing that, but it must not happen.

Best guess is that the manager cache marks itself "finished" after timing out, allowing the sync to proceed. This can hopefully be addressed by finding the appropriate context and also checking whether it has expired.


codecov bot commented Jan 25, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (dcea670) 69.7% compared to head (694ec5e) 51.3%.

Files Patch % Lines
internal/dataplane/synchronizer.go 71.4% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main   #5483      +/-   ##
========================================
- Coverage   69.7%   51.3%   -18.5%     
========================================
  Files        176     176              
  Lines      22726   22731       +5     
========================================
- Hits       15861   11671    -4190     
- Misses      5928    9984    +4056     
- Partials     937    1076     +139     



rainest commented Feb 6, 2024

There's some further weirdness: apparently our manager only considers some caches relevant.

https://github.com/kubernetes-sigs/controller-runtime/blob/v0.17.0/pkg/cache/internal/informers.go#L244-L261 gets all the per-resource cache wait functions. In our case, for reasons I cannot determine, this only includes four kinds. I'm not sure why we actually wait on these and not the others:

caches.txt

Offhand I'm not sure what would cause IngressClass to show up but not Ingress, given their controller definitions. Neither is a dynamic controller that waits for the type to appear before starting; both should initialize immediately at start.

The upshot is that simply disabling the permission for Ingress (or any other resource not in that list) initially does nothing: the manager cache will happily report that it has synced, though it will still eventually time out and exit everything. I'm unsure whether that would matter in practice for the issue that prompted this work (where there were presumably no permissions at all), but it could potentially cause issues where, for example, the IngressClass cache syncs before the Ingress cache because there are more of the latter.
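
A toy model of the linked informers.go behavior may make the failure mode clearer. Only informers that have actually been started contribute a wait function, so kinds whose informer never started are silently absent from the sync wait. All names here are illustrative, not controller-runtime's.

```go
package main

import "fmt"

// informer is a stand-in for a per-kind cache informer; only the
// started flag matters for this sketch.
type informer struct{ started bool }

// waitedKinds mimics collecting per-resource cache wait functions:
// unstarted informers simply never appear in the list.
func waitedKinds(tracked map[string]informer) []string {
	var kinds []string
	for kind, inf := range tracked {
		if inf.started {
			kinds = append(kinds, kind)
		}
	}
	return kinds
}

func main() {
	tracked := map[string]informer{
		"IngressClass": {started: true},
		"Ingress":      {started: false}, // never started: not waited on
	}
	fmt.Println(waitedKinds(tracked)) // prints [IngressClass]
}
```

Under this model, "cache synced" is only a claim about the informers that made it into the list, which is consistent with the observed happy report despite a missing Ingress permission.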


Successfully merging this pull request may close these issues.

Sleep-based synchronization between cache warmup and the first sync to Kong