
feat(manager) block synchronizer on cache readiness #5483

Draft
wants to merge 2 commits into base: main
Conversation


@rainest rainest commented Jan 25, 2024

What this PR does / why we need it:

Removes the time-based startup delay in favor of obtaining the manager cache and calling its wait function during synchronizer start. If the cache has not fully synced, the dataplane synchronizer blocks until it has. If the cache never syncs within its grace period, the cache sync timeout terminates the controller, so it never pushes any configuration.

This better addresses situations where the controller may incorrectly delete configuration upon gaining leadership because it cannot yet see all resources on the cluster. Configuration previously added by another controller replica should remain intact.

The previous naive time-based delay addressed this for some cases, but relied on an arbitrary "good enough" wait time. AFAIK this time did work in practice during normal startup. However, it did not work if the cache sync got stuck, such as when the controller role was broken or when API resources were not defined.

Which issue this PR fixes:

Fix #2315

Special notes for your reviewer:

I could have sworn we had no means of getting the manager cache, that it was reserved for the manager's internal use. I then saw a GetCache() call in one of the controllers and re-checked; apparently it has (maybe always?) been accessible despite living in an internal.go.

Manual testing supports the notion that this works in problem scenarios. I deployed a test image to a normal install and confirmed successful start and config population. While watching the admin API for configuration changes, I deleted the Ingress permissions from the controller role, scaled to 0, and scaled back to 1. The new replica started and blocked without syncing configuration before eventually hitting a cache sync timeout and dying.

PR Readiness Checklist:

Complete these before marking the PR as ready to review:

  • the CHANGELOG.md release notes have been updated to reflect any significant (and particularly user-facing) changes introduced by this PR

Obtain the manager cache during synchronizer setup and store it in the
synchronizer. Block synchronizer start on cache sync.

When the manager gains leadership, it starts populating the cache and
starts runnables (such as the synchronizer). It waits for the cache to
sync (i.e. to read all entities present when it started) before certain
tasks, but does not gate generic runnables on that wait.

The synchronizer indirectly accesses the cache contents. Controllers
populate the store with a filtered cache. Depending on whether sync has
completed, the store may have an incomplete view of resources in the
cluster and may delete Kong configuration for as-yet uncached resources.
This is bad and can lead to catastrophic wiping of proxy config
associated with resources that are actually still present.

Blocking the synchronizer start on cache sync ensures this does not
happen.

Remove the naive time-based synchronizer delay. This was a hack to avoid
configuration deletion at startup without knowledge of the manager cache
state. It is unnecessary now that the synchronizer consults the manager
cache state directly.

rainest commented Jan 25, 2024

Although most of the behavior here works as expected, I'm seeing something I can't explain: after the sync timeout, the synchronizer still runs a sync (deleting config) during shutdown. I don't know exactly what's causing that, but it must not happen.

Best guess is that the manager cache marks itself "finished" after timing out, allowing the sync to proceed. This can hopefully be addressed by finding the appropriate context and also checking whether it has expired.


codecov bot commented Jan 25, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (dcea670) 69.7% compared to head (694ec5e) 51.3%.

Files Patch % Lines
internal/dataplane/synchronizer.go 71.4% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main   #5483      +/-   ##
========================================
- Coverage   69.7%   51.3%   -18.5%     
========================================
  Files        176     176              
  Lines      22726   22731       +5     
========================================
- Hits       15861   11671    -4190     
- Misses      5928    9984    +4056     
- Partials     937    1076     +139     



rainest commented Feb 6, 2024

There's some further weirdness: apparently our manager only considers some caches relevant.

https://github.com/kubernetes-sigs/controller-runtime/blob/v0.17.0/pkg/cache/internal/informers.go#L244-L261 gets all the per-resource cache wait functions. In our case, for reasons I cannot determine, this only includes four kinds. I'm not sure why we actually wait on these and not the others:

caches.txt

Offhand I'm not sure what would cause IngressClass to show up but not Ingress, given their controller definitions. Neither is a dynamic controller that waits for the type to appear before starting; both should initialize immediately at start.

The upshot is that simply disabling the permission for Ingress (or any other resource not in that list) initially does nothing: the manager cache will happily report that it has synced, though it will still eventually time out and exit everything. I'm unsure whether that would matter in practice for the issue that prompted this work (where there were presumably no permissions at all), but it could potentially cause issues where, for example, the IngressClass cache syncs before the Ingress cache because there are more of the latter.
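
A toy model of the linked informers.go behavior may make the failure mode clearer. Only informers that have actually been started contribute a wait function, so kinds whose informer never started are silently absent from the sync wait. All names here are illustrative, not controller-runtime's.

```go
package main

import "fmt"

// informer is a stand-in for a per-kind cache informer; only the
// started flag matters for this sketch.
type informer struct{ started bool }

// waitedKinds mimics collecting per-resource cache wait functions:
// unstarted informers simply never appear in the list.
func waitedKinds(tracked map[string]informer) []string {
	var kinds []string
	for kind, inf := range tracked {
		if inf.started {
			kinds = append(kinds, kind)
		}
	}
	return kinds
}

func main() {
	tracked := map[string]informer{
		"IngressClass": {started: true},
		"Ingress":      {started: false}, // never started: not waited on
	}
	fmt.Println(waitedKinds(tracked)) // prints [IngressClass]
}
```

Under this model, "cache synced" is only a claim about the informers that made it into the list, which is consistent with the observed happy report despite a missing Ingress permission.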


Successfully merging this pull request may close these issues.

Sleep-based synchronization between cache warmup and the first sync to Kong