Skip to content

feat: shard rule-ConfigMaps to avoid etcd size limit#1922

Open
yama6a wants to merge 13 commits into
GoogleCloudPlatform:mainfrom
yama6a:config-map-sharding
Open

feat: shard rule-ConfigMaps to avoid etcd size limit#1922
yama6a wants to merge 13 commits into
GoogleCloudPlatform:mainfrom
yama6a:config-map-sharding

Conversation

@yama6a
Copy link
Copy Markdown
Contributor

@yama6a yama6a commented May 16, 2026

Problem

Rules CRDs (Rules, ClusterRules, GlobalRules) get packed into a single rules-generated ConfigMap. Once you have enough rules, it blows past the 1MB etcd limit and the operator starts failing. Gzip helps but only buys time.

Solution

The operator now bin-packs rule entries across N sharded ConfigMaps (rules-generated-0, rules-generated-1, ...) based on actual measured size with an 800KB threshold. The number of shards is fully dynamic, nothing hardcoded.

On the config-reloader side, instead of mounting a single ConfigMap as a volume, a new syncer goroutine discovers shards via a label selector (monitoring.googleapis.com/rules-shard=true) through the K8s API and writes them to the existing emptyDir volume at /etc/rules. It hashes content so downstream reloads only fire when something actually changed.

Stale shards and the legacy rules-generated ConfigMap get cleaned up automatically on each reconciliation.

Things worth noting

  • ** E2E Tests**: I was not able to run them locally, I ran into issues running a Kind cluster on my machine. Perhaps we can run them here in CI? Don't have proof that they actually work. Unit tests pass tho.

  • Rule Evaluator: The rule evaluator just reads yaml files from a directory, it has no idea where they came from, so sharding is invisible to it and it should work as-is without changes, regardless if it's deployed stand-alone or not.

  • Standalone rule-evaluator is untouched. charts/rule-evaluator/ and manifests/rule-evaluator.yaml don't change at all. If you deploy the rule-evaluator yourself without the operator, everything works as before - you just don't get sharding.

  • Should be a no-op upgrade. On first reconciliation the operator creates shard-0 with the same content, deletes the old ConfigMap, and the syncer picks it up. Volume name and mount path are unchanged so nothing else needs to care.

  • RBAC got a bit wider. The operator Role now has unrestricted ConfigMap verbs (no resourceNames) because shard names are dynamic. Feels fine since it's namespace-scoped anyway. The collector gets a new namespace-scoped Role for list/watch - the ClusterRole stays untouched. Not sure if there;s a better way to do this, but I hope this is reasonable.

  • Couldn't run the full make presubmit. The frontend stuff needs amd64 and I'm on an ARM Mac - fought with it for a bit and gave up. Go side compiles and tests pass, and I regenerated manifests/operator.yaml, but would appreciate a second pair of eyes on whether anything is missing.

  • Might have gone overboard with tests. There's unit tests for the sharding logic, the syncer, the predicate, plus e2e for multi-shard with both no-compression and gzip. If any of them feel like noise, happy to trim.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements rule sharding to handle large rule sets that exceed the Kubernetes ConfigMap size limit. It introduces a ConfigMapSyncer in the config-reloader sidecar to aggregate rule shards identified by labels and updates the operator to manage these shards. Review feedback points out a potential path traversal vulnerability in the syncer, the risk of filenames exceeding filesystem limits, redundant RBAC definitions, and the importance of implementing atomic file writes to avoid partial rule sets during synchronization.

Comment thread cmd/config-reloader/internal/syncer.go
Comment thread cmd/config-reloader/internal/syncer.go
Comment thread charts/operator/templates/role.yaml
}
}

func (s *ConfigMapSyncer) writeFiles(files map[string][]byte) error {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The writeFiles method updates files non-atomically. If the rule evaluator or the reloader polls the directory while files are being written or removed, it may encounter a partial or inconsistent state. Consider using an atomic update pattern, such as writing to a temporary directory within the same volume and performing an atomic swap (e.g., using a symlink or renaming the directory) to ensure the downstream consumer always sees a consistent set of rules.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine in practice - the config-reloader already hashes the directory on each poll and only triggers a reload when the hash changes. If it catches a mid-write state, it just picks it up on the next cycle. Not worth the complexity of an atomic swap imho.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so it worked with one file, but now we have many. Are we sure rule-eval will not load inconsistent state of rule groups, especially during 1->3 and 3->1 e.g. causing duplicate groups or key recording rule missing? What am I missing?

One alternative is writing to a big file locally (: Another is integrating K8s API into Thanos config-reloader (we could have our copy of it, it's not a lot of code). Symlink is odd but might work.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, fair point. There's a real window when going between one and multiple shards, where we can briefly see duplicates or gaps, but it self-heals on the next sync within about 10 seconds.

Having one merged file would fix the atomicity thingy, but it adds a global group-name uniqueness constraint that we don't have today, not sure we wanna deal with that.

If a few seconds out-of-sync'ness isn't acceptable, I would propose to implement the same ..data symlink swap that K8s does for configmap mounts. Then no rule-eval changes would be needed then. It seems like the least invasive option to me. What do you think? Can we live with the self-healing?

@bernot-dev
Copy link
Copy Markdown
Collaborator

Hi @yama6a. Thanks for the PR. This is definitely an interesting feature.

I approved the workflow to run, and it looks like the new e2e test is not passing, so the other ones got cancelled. We also use the conventional commits standard for commit message. Running make conform will tell you if the commit messages are compliant.

The e2e/Kind tests are a bit tricky, because they've really be built around what's convenient for our small team of core developers. Sorry to hear they weren't working for you. It's somewhere on my list to make that barrier to entry a bit lower. Feel free to push your changes and I'll approve the workflow to run here in CI.

From a release planning perspective, I would rather not delete the old configmap in the same update as the shards are introduced. That way, if we need to do a rollback for any reason, we're a bit less likely to be broken. At first glance, it seems like they may not conflict if the old configmap doesn't have the monitoring.googleapis.com/rules-shard=true label. However, if you're actually hitting the limit, that may not be helpful. Thoughts on this?

Thanks again for this contribution. We'll do our best to make sure you get timely reviews.

@yama6a yama6a force-pushed the config-map-sharding branch from 63db833 to 040ed6e Compare May 19, 2026 09:45
@yama6a
Copy link
Copy Markdown
Contributor Author

yama6a commented May 19, 2026

Hi @yama6a. Thanks for the PR. This is definitely an interesting feature.

I approved the workflow to run, and it looks like the new e2e test is not passing, so the other ones got cancelled. We also use the conventional commits standard for commit message. Running make conform will tell you if the commit messages are compliant.

The e2e/Kind tests are a bit tricky, because they've really be built around what's convenient for our small team of core developers. Sorry to hear they weren't working for you. It's somewhere on my list to make that barrier to entry a bit lower. Feel free to push your changes and I'll approve the workflow to run here in CI.

From a release planning perspective, I would rather not delete the old configmap in the same update as the shards are introduced. That way, if we need to do a rollback for any reason, we're a bit less likely to be broken. At first glance, it seems like they may not conflict if the old configmap doesn't have the monitoring.googleapis.com/rules-shard=true label. However, if you're actually hitting the limit, that may not be helpful. Thoughts on this?

Thanks again for this contribution. We'll do our best to make sure you get timely reviews.

@bernot-dev
Good catch, pushed a change to keep the legacy rules-generated ConfigMap around for this release.

The old rule-evaluator deployment mounts it by name, so deleting it here would briefly crashloop the rule-evaluator on rollback, until the old operator recreates the CM.

Can be dropped (along with the RBAC resourceNames entry) in a follow-up release once this one has been out for a while.

I pushed changes to make it conform as well. Can you enable auto-approval for the e2e-test-runs? I think we did that in some of my previous PRs to this project, so I get a quicker feedback loop to fix the e2e tests.

Copy link
Copy Markdown
Collaborator

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! Some initial thoughts, thanks!

Comment thread cmd/config-reloader/internal/syncer.go Outdated
Comment thread cmd/config-reloader/internal/syncer.go
Comment thread cmd/config-reloader/internal/syncer.go
Comment thread cmd/config-reloader/internal/syncer.go
Comment thread cmd/config-reloader/internal/syncer.go
Comment thread cmd/config-reloader/internal/syncer_test.go Outdated
}
}

func (s *ConfigMapSyncer) writeFiles(files map[string][]byte) error {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so it worked with one file, but now we have many. Are we sure rule-eval will not load inconsistent state of rule groups, especially during 1->3 and 3->1 e.g. causing duplicate groups or key recording rule missing? What am I missing?

One alternative is writing to a big file locally (: Another is integrating K8s API into Thanos config-reloader (we could have our copy of it, it's not a lot of code). Symlink is odd but might work.

Comment thread cmd/config-reloader/README.md Outdated
Comment thread manifests/operator.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: collector
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should avoid touching collector here and switching recording rule-eval only. Should it be rule-eval?

Sharding Prometheus config would be also be interesting and possible, but switching nodes to use more K8s API is tricky.

Copy link
Copy Markdown
Contributor Author

@yama6a yama6a May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I see what you mean. But looking at how it works today, rule-eval is seems to be running under the "collector" SA, and it already uses kubernetes_sd_configs for Alertmanager discovery, so its SA isn't just leaning on configmaps from the collector ClusterRole, it also needs the endpoints/services/pods reads. So a clean split would mean a parallel ClusterRole with all of those, and if we miss one, ... not so good 😅

The only thing we'd actually be taking away from the collector SA is namespaced configmap list/watch (it already has cluster-wide configmap get anyway). Feels like a rather risky/invasive change. I can give it a shot if you want, tho. Your call, what do you think?

Comment thread manifests/operator.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: collector
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, did you mean rule-eval?

@bwplotka
Copy link
Copy Markdown
Collaborator

I changed the workflows so you don't need approve.

Comment thread cmd/rule-evaluator/main.go
@bernot-dev bernot-dev self-assigned this May 20, 2026
@yama6a yama6a requested a review from bwplotka May 22, 2026 07:14
@yama6a
Copy link
Copy Markdown
Contributor Author

yama6a commented May 29, 2026

@bwplotka @bernot-dev
Could you take a peek at my two remaining comments:

#1922 (comment)

And

#1922 (comment)

If you let me know how you'd like me to move forward, I can work on the remaining open things next week some time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants