etcd maintenance #258
k8s should be limited to 110 pods per node to prevent this. Has this limit been lifted in our cluster?
A rate limit for the k8s API is needed to protect the cluster.
@janiskemper to document recommended default rate limits.
Monitoring/alerting is needed to warn before etcd runs out of space, and to compact/defragment as needed.
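As a sketch of what such an alert could look like (assuming etcd's metrics endpoint is already scraped by a prometheus-operator setup; the rule name and threshold here are assumptions, not values from this thread):

```yaml
# Hypothetical PrometheusRule: warn before etcd hits its backend quota.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-db-space
spec:
  groups:
    - name: etcd
      rules:
        - alert: EtcdDatabaseQuotaLowSpace
          # Ratio of the current DB size to the configured quota-backend-bytes.
          expr: (etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) > 0.80
          for: 10m
          labels:
            severity: warning
          annotations:
            description: etcd database is over 80% of its backend quota; compact and defragment before writes are refused.
```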
Discussion:
I would also recommend using auto-compaction-retention. But I would suggest something like the sketch below.
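A minimal sketch of such a setting (the retention value is an illustrative assumption, not necessarily the commenter's exact suggestion):

```sh
# Illustrative only: enable time-based compaction, keeping ~1h of key history.
etcd --auto-compaction-mode=periodic --auto-compaction-retention=1h
```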
I really like the idea of having a systemd service (rather than cron) that calls etcdctl. EventRateLimit is one of the really important configurations we need for being production-ready. And no, it's not Rancher-specific.
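A minimal sketch of such a systemd unit pair (unit names are made up for this example; certificate paths assume a kubeadm-provisioned control plane):

```ini
# /etc/systemd/system/etcd-defrag.service
[Unit]
Description=Defragment the local etcd member

[Service]
Type=oneshot
ExecStart=/usr/local/bin/etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  defrag
```

```ini
# /etc/systemd/system/etcd-defrag.timer
[Unit]
Description=Run etcd defragmentation daily

[Timer]
OnCalendar=daily
RandomizedDelaySec=1h

[Install]
WantedBy=timers.target
```

For EventRateLimit, the kube-apiserver needs --enable-admission-plugins=EventRateLimit plus an --admission-control-config-file; the limits below are illustrative, not values agreed in this thread:

```yaml
# Admission control config file referenced by the kube-apiserver.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: EventRateLimit
    configuration:
      apiVersion: eventratelimit.admission.k8s.io/v1alpha1
      kind: Configuration
      limits:
        - type: Namespace   # per-namespace event budget
          qps: 50
          burst: 100
          cacheSize: 2000
        - type: User        # per-user event budget
          qps: 10
          burst: 50
```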
Our flags for etcd are set in the KubeadmControlPlane, via postKubeadmCommand, and in the node image for the control-planes.
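A hedged reconstruction of the KubeadmControlPlane part (flag names and values are inferred from the replies below, not copied from the poster's verbatim snippet):

```yaml
# Sketch only: values inferred from the discussion, not an authoritative config.
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      etcd:
        local:
          extraArgs:
            cipher-suites: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384  # restrict TLS ciphers
            auto-tls: "false"          # require explicit certificates
            peer-auto-tls: "false"
            heartbeat-interval: "250"  # ms, tuned for slower cloud storage
            election-timeout: "2500"   # ms
            snapshot-count: "5000"     # half of kubeadm's default of 10000
            auto-compaction-mode: periodic
            auto-compaction-retention: "8h"
```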
Do you enable extra cipher-suites here? Or disable some that you deemed not strong enough?
These are the defaults in kubeadm AFAICS.
trusted-ca-file and peer-client-cert-auth are defaults. Why did you disable the two auto-tls flags?
You have chosen similar values to ours (250 and 2500); it looks like we had similar experiences with the storage in the clouds we tested :-)
Half the default history (which defaults to 10000).
Which is the implicit default as well.
OK, these are new -- are they needed in addition to the snapshot-count limit? Thanks for sharing these!
Actually I wanted to do both.
Usually no reboot should happen, and if a reboot is necessary we will remediate the node. So the fact that the commands are not reboot-safe is not critical.
A lot of the args came from the CIS Benchmark. Here is a link for one of the args: https://www.tenable.com/audits/items/CIS_Kubernetes_v1.6.1_Level_1_Master.audit:73819e79f9fcb340a2c29b9efa2a8b71
As a cluster operator, I want to avoid etcd causing trouble.
etcd can become big and slow, and ultimately so large that it refuses writes.
We observed this while debugging the GXFS staging cluster.
https://input.osb-alliance.de/p/2022-scs-gxfs-cluster-debugging
To learn from this, we may want to tweak the etcd setup on the control nodes.
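If etcd does hit its space quota, the standard recovery (as described in the etcd maintenance documentation; endpoint and TLS flags omitted here for brevity) is to compact, defragment, and disarm the NOSPACE alarm:

```sh
# Find the current revision of the local member.
rev=$(etcdctl endpoint status --write-out="json" \
      | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9].*')
# Compact away all key history before that revision.
etcdctl compact "$rev"
# Defragment so the freed space is returned to the filesystem.
etcdctl defrag
# Clear the NOSPACE alarm so writes are accepted again.
etcdctl alarm disarm
```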