feat: shard rule-ConfigMaps to avoid etcd size limit by yama6a · Pull Request #1922 · GoogleCloudPlatform/prometheus-engine

yama6a · 2026-05-16T12:00:35Z

Problem

Rules CRDs (Rules, ClusterRules, GlobalRules) get packed into a single rules-generated ConfigMap. Once you have enough rules, it blows past the 1MB etcd limit and the operator starts failing. Gzip helps but only buys time.

Solution

The operator now bin-packs rule entries across N sharded ConfigMaps (rules-generated-0, rules-generated-1, ...) based on actual measured size with an 800KB threshold. The number of shards is fully dynamic, nothing hardcoded.
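A minimal sketch of the size-based packing described above, assuming a first-fit strategy over sorted keys. The `shard` type, the `packShards` function, and the `maxShardBytes` constant are hypothetical names for illustration; the operator's actual algorithm and size accounting may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// maxShardBytes mirrors the 800KB threshold from the description; the
// constant name is an assumption, not taken from the PR.
const maxShardBytes = 800 * 1024

// shard is a hypothetical in-memory stand-in for one rules-generated-N ConfigMap.
type shard struct {
	Name string
	Data map[string]string
	Size int
}

// packShards places each entry into the first shard that still has room
// (first-fit) and opens a new shard when none fits. Keys are visited in
// sorted order so the result is deterministic across runs.
func packShards(entries map[string]string, limit int) []shard {
	keys := make([]string, 0, len(entries))
	for k := range entries {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var shards []shard
	for _, k := range keys {
		size := len(k) + len(entries[k])
		placed := false
		for i := range shards {
			if shards[i].Size+size <= limit {
				shards[i].Data[k] = entries[k]
				shards[i].Size += size
				placed = true
				break
			}
		}
		if !placed {
			// An entry larger than the limit still gets a shard of its own.
			shards = append(shards, shard{
				Name: fmt.Sprintf("rules-generated-%d", len(shards)),
				Data: map[string]string{k: entries[k]},
				Size: size,
			})
		}
	}
	return shards
}

func main() {
	entries := map[string]string{
		"a.yaml": string(make([]byte, 500*1024)),
		"b.yaml": string(make([]byte, 400*1024)),
		"c.yaml": string(make([]byte, 200*1024)),
	}
	for _, s := range packShards(entries, maxShardBytes) {
		fmt.Printf("%s: %d entries, %d bytes\n", s.Name, len(s.Data), s.Size)
	}
}
```

Sorting the keys first keeps shard assignment stable between reconciliations, which should reduce unnecessary ConfigMap updates.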

On the config-reloader side, instead of mounting a single ConfigMap as a volume, a new syncer goroutine discovers shards via a label selector (monitoring.googleapis.com/rules-shard=true) through the K8s API and writes them to the existing emptyDir volume at /etc/rules. It hashes content so downstream reloads only fire when something actually changed.
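The change-detection part can be sketched as a digest over the file set. This is an illustration only: `contentHash` is a hypothetical helper, and the syncer in the PR may hash differently. The Kubernetes API calls are left out so the example stays self-contained.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// contentHash returns a digest over file names and contents. Names are sorted
// first because Go map iteration order is random, and every field is
// length-prefixed so that ("ab", "c") and ("a", "bc") hash differently.
func contentHash(files map[string][]byte) string {
	names := make([]string, 0, len(files))
	for n := range files {
		names = append(names, n)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, n := range names {
		fmt.Fprintf(h, "%d:%s%d:", len(n), n, len(files[n]))
		h.Write(files[n])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	prev := contentHash(map[string][]byte{"a.yaml": []byte("groups: []")})
	next := contentHash(map[string][]byte{"a.yaml": []byte("groups: []")})
	// Only write to disk (and let downstream reload) when the digest differs.
	fmt.Println("changed:", prev != next)
}
```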

Stale shards and the legacy rules-generated ConfigMap get cleaned up automatically on each reconciliation.
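Conceptually, finding stale shards is a set difference between the ConfigMaps that exist and the ones the current packing produced. The sketch below assumes plain name lists; `staleShards` is a hypothetical helper, not a function from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// staleShards returns the names in existing that are absent from desired,
// i.e. the shard ConfigMaps a reconciliation would delete.
func staleShards(existing, desired []string) []string {
	keep := make(map[string]struct{}, len(desired))
	for _, n := range desired {
		keep[n] = struct{}{}
	}
	var stale []string
	for _, n := range existing {
		if _, ok := keep[n]; !ok {
			stale = append(stale, n)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	existing := []string{"rules-generated-0", "rules-generated-1", "rules-generated-2"}
	desired := []string{"rules-generated-0"}
	fmt.Println(staleShards(existing, desired))
}
```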

Things worth noting

E2E tests: I was not able to run them locally because I ran into issues running a Kind cluster on my machine. Perhaps we can run them here in CI? I don't have proof that they actually work. Unit tests pass, though.
Rule evaluator: The rule evaluator just reads YAML files from a directory and has no idea where they came from, so sharding is invisible to it. It should work as-is without changes, regardless of whether it's deployed standalone or not.
Standalone rule-evaluator is untouched. charts/rule-evaluator/ and manifests/rule-evaluator.yaml don't change at all. If you deploy the rule-evaluator yourself without the operator, everything works as before - you just don't get sharding.
Should be a no-op upgrade. On first reconciliation the operator creates shard-0 with the same content, deletes the old ConfigMap, and the syncer picks it up. Volume name and mount path are unchanged so nothing else needs to care.
RBAC got a bit wider. The operator Role now has unrestricted ConfigMap verbs (no resourceNames) because shard names are dynamic. Feels fine since it's namespace-scoped anyway. The collector gets a new namespace-scoped Role for list/watch - the ClusterRole stays untouched. Not sure if there's a better way to do this, but I hope this is reasonable.
Couldn't run the full make presubmit. The frontend stuff needs amd64 and I'm on an ARM Mac - fought with it for a bit and gave up. Go side compiles and tests pass, and I regenerated manifests/operator.yaml, but would appreciate a second pair of eyes on whether anything is missing.
Might have gone overboard with tests. There are unit tests for the sharding logic, the syncer, the predicate, plus e2e for multi-shard with both no-compression and gzip. If any of them feel like noise, happy to trim.

gemini-code-assist

Code Review

This pull request implements rule sharding to handle large rule sets that exceed the Kubernetes ConfigMap size limit. It introduces a ConfigMapSyncer in the config-reloader sidecar to aggregate rule shards identified by labels and updates the operator to manage these shards. Review feedback points out a potential path traversal vulnerability in the syncer, the risk of filenames exceeding filesystem limits, redundant RBAC definitions, and the importance of implementing atomic file writes to avoid partial rule sets during synchronization.
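A defensive file name check of the kind the review asks for might look like the sketch below. `safeFileName` and `maxNameLen` are hypothetical names, and the exact rules applied in the PR's follow-up commit are not shown in this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// maxNameLen is the usual per-component limit on Linux filesystems (NAME_MAX).
const maxNameLen = 255

// safeFileName rejects names that could escape the target directory or that
// the filesystem is likely to refuse.
func safeFileName(name string) error {
	switch {
	case name == "" || name == "." || name == "..":
		return fmt.Errorf("invalid file name %q", name)
	case len(name) > maxNameLen:
		return fmt.Errorf("file name of %d bytes exceeds limit of %d", len(name), maxNameLen)
	case strings.ContainsAny(name, "/\\\x00"):
		return fmt.Errorf("file name %q contains a path separator or NUL byte", name)
	}
	return nil
}

func main() {
	for _, n := range []string{"rules.yaml", "../etc/passwd", ".."} {
		fmt.Printf("%q -> %v\n", n, safeFileName(n))
	}
}
```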

gemini-code-assist · 2026-05-16T12:03:28Z

+	}
+}
+
+func (s *ConfigMapSyncer) writeFiles(files map[string][]byte) error {


The writeFiles method updates files non-atomically. If the rule evaluator or the reloader polls the directory while files are being written or removed, it may encounter a partial or inconsistent state. Consider using an atomic update pattern, such as writing to a temporary directory within the same volume and performing an atomic swap (e.g., using a symlink or renaming the directory) to ensure the downstream consumer always sees a consistent set of rules.

This is fine in practice - the config-reloader already hashes the directory on each poll and only triggers a reload when the hash changes. If it catches a mid-write state, it just picks it up on the next cycle. Not worth the complexity of an atomic swap imho.

Hm, so it worked with one file, but now we have many. Are we sure rule-eval will not load inconsistent state of rule groups, especially during 1->3 and 3->1 e.g. causing duplicate groups or key recording rule missing? What am I missing?

One alternative is writing to a big file locally (: Another is integrating K8s API into Thanos config-reloader (we could have our copy of it, it's not a lot of code). Symlink is odd but might work.

True, fair point. There's a real window when going between one and multiple shards, where we can briefly see duplicates or gaps, but it self-heals on the next sync within about 10 seconds.

Having one merged file would fix the atomicity thingy, but it adds a global group-name uniqueness constraint that we don't have today, not sure we wanna deal with that.

If a few seconds of being out of sync isn't acceptable, I would propose implementing the same ..data symlink swap that K8s does for ConfigMap mounts. No rule-eval changes would be needed then. It seems like the least invasive option to me. What do you think? Can we live with the self-healing?
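For reference, the symlink-swap idea discussed in this thread can be sketched as follows. This is a simplified illustration, not the kubelet's implementation and not code from the PR: `writeAtomic` and `demo` are hypothetical, readers are assumed to resolve files through `<root>/..data`, and a POSIX filesystem is assumed so that renaming over an existing symlink is atomic.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeAtomic writes files into a fresh generation directory under root and
// then repoints the "..data" symlink at it with a single rename, so a reader
// resolving root/..data sees either the old set or the new set, never a mix.
func writeAtomic(root string, files map[string][]byte) error {
	gen, err := os.MkdirTemp(root, "..gen-")
	if err != nil {
		return err
	}
	for name, data := range files {
		if err := os.WriteFile(filepath.Join(gen, name), data, 0o644); err != nil {
			return err
		}
	}
	link := filepath.Join(root, "..data")
	old, _ := os.Readlink(link) // empty string on the first run

	tmp := filepath.Join(root, "..data_tmp")
	_ = os.Remove(tmp) // clear leftovers from an interrupted run
	if err := os.Symlink(filepath.Base(gen), tmp); err != nil {
		return err
	}
	// Renaming the temporary link over the live one is the atomic switch.
	if err := os.Rename(tmp, link); err != nil {
		return err
	}
	if old != "" {
		return os.RemoveAll(filepath.Join(root, old))
	}
	return nil
}

// demo runs two syncs against a temporary directory and returns the file
// names visible through ..data afterwards.
func demo() ([]string, error) {
	root, err := os.MkdirTemp("", "rules-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	if err := writeAtomic(root, map[string][]byte{"a.yaml": []byte("v1")}); err != nil {
		return nil, err
	}
	if err := writeAtomic(root, map[string][]byte{"b.yaml": []byte("v2")}); err != nil {
		return nil, err
	}
	entries, err := os.ReadDir(filepath.Join(root, "..data"))
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names, nil
}

func main() {
	names, err := demo()
	fmt.Println(names, err)
}
```

One trade-off in this sketch: the previous generation is removed right after the switch, so a reader that opened the old directory just before the rename could still hit missing files. Delaying that cleanup by one sync cycle would narrow the window.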

bernot-dev · 2026-05-18T21:33:50Z

Hi @yama6a. Thanks for the PR. This is definitely an interesting feature.

I approved the workflow to run, and it looks like the new e2e test is not passing, so the other ones got cancelled. We also use the conventional commits standard for commit messages. Running make conform will tell you if the commit messages are compliant.

The e2e/Kind tests are a bit tricky, because they've really been built around what's convenient for our small team of core developers. Sorry to hear they weren't working for you. It's somewhere on my list to make that barrier to entry a bit lower. Feel free to push your changes and I'll approve the workflow to run here in CI.

From a release planning perspective, I would rather not delete the old configmap in the same update as the shards are introduced. That way, if we need to do a rollback for any reason, we're a bit less likely to be broken. At first glance, it seems like they may not conflict if the old configmap doesn't have the monitoring.googleapis.com/rules-shard=true label. However, if you're actually hitting the limit, that may not be helpful. Thoughts on this?

Thanks again for this contribution. We'll do our best to make sure you get timely reviews.

yama6a · 2026-05-19T10:02:41Z

@bernot-dev
Good catch, pushed a change to keep the legacy rules-generated ConfigMap around for this release.

The old rule-evaluator deployment mounts it by name, so deleting it here would briefly crashloop the rule-evaluator on rollback, until the old operator recreates the CM.

Can be dropped (along with the RBAC resourceNames entry) in a follow-up release once this one has been out for a while.

I pushed changes to make it conform as well. Can you enable auto-approval for the e2e-test-runs? I think we did that in some of my previous PRs to this project, so I get a quicker feedback loop to fix the e2e tests.

bwplotka

Great job! Some initial thoughts, thanks!

bwplotka · 2026-05-19T13:26:31Z

+	}
+}
+
+func (s *ConfigMapSyncer) writeFiles(files map[string][]byte) error {


Hm, so it worked with one file, but now we have many. Are we sure rule-eval will not load inconsistent state of rule groups, especially during 1->3 and 3->1 e.g. causing duplicate groups or key recording rule missing? What am I missing?

One alternative is writing to a big file locally (: Another is integrating K8s API into Thanos config-reloader (we could have our copy of it, it's not a lot of code). Symlink is odd but might work.

bwplotka · 2026-05-19T13:31:24Z

+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: collector


We probably should avoid touching collector here and switching recording rule-eval only. Should it be rule-eval?

Sharding Prometheus config would also be interesting and possible, but switching nodes to use more K8s API is tricky.

Hm, I see what you mean. But looking at how it works today, rule-eval seems to be running under the "collector" SA, and it already uses kubernetes_sd_configs for Alertmanager discovery, so its SA isn't just leaning on configmaps from the collector ClusterRole, it also needs the endpoints/services/pods reads. So a clean split would mean a parallel ClusterRole with all of those, and if we miss one, ... not so good 😅

The only thing we'd actually be taking away from the collector SA is namespaced configmap list/watch (it already has cluster-wide configmap get anyway). Feels like a rather risky/invasive change. I can give it a shot if you want, tho. Your call, what do you think?

bwplotka · 2026-05-19T14:30:57Z

+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: collector


ditto, did you mean rule-eval?

bwplotka · 2026-05-19T14:39:14Z

I changed the workflows so you don't need approval.

yama6a · 2026-05-29T11:01:17Z

@bwplotka @bernot-dev
Could you take a peek at my two remaining comments:

#1922 (comment)

And

#1922 (comment)

If you let me know how you'd like me to move forward, I can work on the remaining open things next week some time.

feat: shard rule-ConfigMaps to avoid etcd size limit

6fe4c18

gemini-code-assist Bot reviewed May 16, 2026

yama6a mentioned this pull request May 16, 2026

[Solution] rules-generated ConfigMap hits the 1MB etcd size limit when there are many Rules/ClusterRules/GlobalRules #1923

Open

yama6a added 4 commits May 19, 2026 11:37

fix(config-reloader): add defensive file name check

5d0fb47

build: bump conform image to v0.1.0-alpha.31 for arm64 support

f970e1c

fix(deps): bump golang.org/x/net from 0.48.0 to 0.53.0 (make govulncheck happy)

34f9262

test(e2e): sanitize hyphens in shard test record name

040ed6e

yama6a force-pushed the config-map-sharding branch from 63db833 to 040ed6e Compare May 19, 2026 09:45

refactor(operator): preserve legacy rules-generated ConfigMap for safe rollback

11f3a78

yama6a added 2 commits May 19, 2026 14:30

test(e2e): allow non-string fields in config-reloader log lines

b63f30e

fix(rule-evaluator): seed rules-out with placeholder.yaml at startup

dfe7ed5

bwplotka reviewed May 19, 2026

yama6a added 3 commits May 20, 2026 14:24

test(e2e): surface evaluator state on multi-shard rule poll failure

2900249

fix(config-reloader): ignore ConfigMaps with unexpected names

d325876

fix(rule-evaluator): route /api/v1/rules through the live rule manager

2424828

yama6a commented May 20, 2026

Comment thread cmd/rule-evaluator/main.go

yama6a added 2 commits May 20, 2026 15:58

chore(lint): exclude go-kit Logger.Log from errcheck

907c2e9

docs(config-reloader): generalize --config-dir-from-configmap-selector help

6b78901

bernot-dev self-assigned this May 20, 2026

yama6a requested a review from bwplotka May 22, 2026 07:14
