feat: shard rule-ConfigMaps to avoid etcd size limit#1922
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements rule sharding to handle large rule sets that exceed the Kubernetes ConfigMap size limit. It introduces a ConfigMapSyncer in the config-reloader sidecar to aggregate rule shards identified by labels and updates the operator to manage these shards. Review feedback points out a potential path traversal vulnerability in the syncer, the risk of filenames exceeding filesystem limits, redundant RBAC definitions, and the importance of implementing atomic file writes to avoid partial rule sets during synchronization.
| } | ||
| } | ||
|
|
||
| func (s *ConfigMapSyncer) writeFiles(files map[string][]byte) error { |
There was a problem hiding this comment.
The writeFiles method updates files non-atomically. If the rule evaluator or the reloader polls the directory while files are being written or removed, it may encounter a partial or inconsistent state. Consider using an atomic update pattern, such as writing to a temporary directory within the same volume and performing an atomic swap (e.g., using a symlink or renaming the directory) to ensure the downstream consumer always sees a consistent set of rules.
There was a problem hiding this comment.
This is fine in practice - the config-reloader already hashes the directory on each poll and only triggers a reload when the hash changes. If it catches a mid-write state, it just picks it up on the next cycle. Not worth the complexity of an atomic swap imho.
There was a problem hiding this comment.
Hm, so it worked with one file, but now we have many. Are we sure rule-eval will not load inconsistent state of rule groups, especially during 1->3 and 3->1 e.g. causing duplicate groups or key recording rule missing? What am I missing?
One alternative is writing to a big file locally (: Another is integrating K8s API into Thanos config-reloader (we could have our copy of it, it's not a lot of code). Symlink is odd but might work.
There was a problem hiding this comment.
True, fair point. There's a real window when going between one and multiple shards, where we can briefly see duplicates or gaps, but it self-heals on the next sync within about 10 seconds.
Having one merged file would fix the atomicity thingy, but it adds a global group-name uniqueness constraint that we don't have today, not sure we wanna deal with that.
If a few seconds out-of-sync'ness isn't acceptable, I would propose to implement the same ..data symlink swap that K8s does for configmap mounts. Then no rule-eval changes would be needed then. It seems like the least invasive option to me. What do you think? Can we live with the self-healing?
|
Hi @yama6a. Thanks for the PR. This is definitely an interesting feature. I approved the workflow to run, and it looks like the new e2e test is not passing, so the other ones got cancelled. We also use the conventional commits standard for commit message. Running The e2e/Kind tests are a bit tricky, because they've really be built around what's convenient for our small team of core developers. Sorry to hear they weren't working for you. It's somewhere on my list to make that barrier to entry a bit lower. Feel free to push your changes and I'll approve the workflow to run here in CI. From a release planning perspective, I would rather not delete the old configmap in the same update as the shards are introduced. That way, if we need to do a rollback for any reason, we're a bit less likely to be broken. At first glance, it seems like they may not conflict if the old configmap doesn't have the Thanks again for this contribution. We'll do our best to make sure you get timely reviews. |
63db833 to
040ed6e
Compare
@bernot-dev The old rule-evaluator deployment mounts it by name, so deleting it here would briefly crashloop the rule-evaluator on rollback, until the old operator recreates the CM. Can be dropped (along with the RBAC resourceNames entry) in a follow-up release once this one has been out for a while. I pushed changes to make it conform as well. Can you enable auto-approval for the e2e-test-runs? I think we did that in some of my previous PRs to this project, so I get a quicker feedback loop to fix the e2e tests. |
bwplotka
left a comment
There was a problem hiding this comment.
Great job! Some initial thoughts, thanks!
| } | ||
| } | ||
|
|
||
| func (s *ConfigMapSyncer) writeFiles(files map[string][]byte) error { |
There was a problem hiding this comment.
Hm, so it worked with one file, but now we have many. Are we sure rule-eval will not load inconsistent state of rule groups, especially during 1->3 and 3->1 e.g. causing duplicate groups or key recording rule missing? What am I missing?
One alternative is writing to a big file locally (: Another is integrating K8s API into Thanos config-reloader (we could have our copy of it, it's not a lot of code). Symlink is odd but might work.
| apiVersion: rbac.authorization.k8s.io/v1 | ||
| kind: Role | ||
| metadata: | ||
| name: collector |
There was a problem hiding this comment.
We probably should avoid touching collector here and switching recording rule-eval only. Should it be rule-eval?
Sharding Prometheus config would be also be interesting and possible, but switching nodes to use more K8s API is tricky.
There was a problem hiding this comment.
Hm, I see what you mean. But looking at how it works today, rule-eval is seems to be running under the "collector" SA, and it already uses kubernetes_sd_configs for Alertmanager discovery, so its SA isn't just leaning on configmaps from the collector ClusterRole, it also needs the endpoints/services/pods reads. So a clean split would mean a parallel ClusterRole with all of those, and if we miss one, ... not so good 😅
The only thing we'd actually be taking away from the collector SA is namespaced configmap list/watch (it already has cluster-wide configmap get anyway). Feels like a rather risky/invasive change. I can give it a shot if you want, tho. Your call, what do you think?
| apiVersion: rbac.authorization.k8s.io/v1 | ||
| kind: RoleBinding | ||
| metadata: | ||
| name: collector |
There was a problem hiding this comment.
ditto, did you mean rule-eval?
|
I changed the workflows so you don't need approve. |
|
@bwplotka @bernot-dev And If you let me know how you'd like me to move forward, I can work on the remaining open things next week some time. |
Problem
Rules CRDs (Rules, ClusterRules, GlobalRules) get packed into a single
rules-generatedConfigMap. Once you have enough rules, it blows past the 1MB etcd limit and the operator starts failing. Gzip helps but only buys time.Solution
The operator now bin-packs rule entries across N sharded ConfigMaps (
rules-generated-0,rules-generated-1, ...) based on actual measured size with an 800KB threshold. The number of shards is fully dynamic, nothing hardcoded.On the config-reloader side, instead of mounting a single ConfigMap as a volume, a new syncer goroutine discovers shards via a label selector (
monitoring.googleapis.com/rules-shard=true) through the K8s API and writes them to the existing emptyDir volume at/etc/rules. It hashes content so downstream reloads only fire when something actually changed.Stale shards and the legacy
rules-generatedConfigMap get cleaned up automatically on each reconciliation.Things worth noting
** E2E Tests**: I was not able to run them locally, I ran into issues running a Kind cluster on my machine. Perhaps we can run them here in CI? Don't have proof that they actually work. Unit tests pass tho.
Rule Evaluator: The rule evaluator just reads yaml files from a directory, it has no idea where they came from, so sharding is invisible to it and it should work as-is without changes, regardless if it's deployed stand-alone or not.
Standalone rule-evaluator is untouched.
charts/rule-evaluator/andmanifests/rule-evaluator.yamldon't change at all. If you deploy the rule-evaluator yourself without the operator, everything works as before - you just don't get sharding.Should be a no-op upgrade. On first reconciliation the operator creates shard-0 with the same content, deletes the old ConfigMap, and the syncer picks it up. Volume name and mount path are unchanged so nothing else needs to care.
RBAC got a bit wider. The operator Role now has unrestricted ConfigMap verbs (no resourceNames) because shard names are dynamic. Feels fine since it's namespace-scoped anyway. The collector gets a new namespace-scoped Role for list/watch - the ClusterRole stays untouched. Not sure if there;s a better way to do this, but I hope this is reasonable.
Couldn't run the full
make presubmit. The frontend stuff needs amd64 and I'm on an ARM Mac - fought with it for a bit and gave up. Go side compiles and tests pass, and I regeneratedmanifests/operator.yaml, but would appreciate a second pair of eyes on whether anything is missing.Might have gone overboard with tests. There's unit tests for the sharding logic, the syncer, the predicate, plus e2e for multi-shard with both no-compression and gzip. If any of them feel like noise, happy to trim.