
etcd maintenance #258

Closed
2 of 4 tasks
garloff opened this issue Aug 24, 2022 · 11 comments
Labels: Container, enhancement, epic, on hold, Sprint Montreal (2023, cwk 40+41)

Comments


garloff commented Aug 24, 2022

As a cluster operator, I want to avoid etcd causing trouble.

etcd can become big and slow, and ultimately so large that it refuses writes.
We observed this while debugging the GXFS staging cluster:
https://input.osb-alliance.de/p/2022-scs-gxfs-cluster-debugging

To learn from this, we may want to tweak the etcd setup on the control nodes.

  • etcd parameters that limit the history and cause compaction and/or defragmentation
  • regular maintenance jobs on the node to maintain etcd's health
  • detecting trouble on the node and raising alarms
  • create documentation on how to address potential (remaining) challenges
garloff added the enhancement, Container, and epic labels on Aug 24, 2022

garloff commented Aug 29, 2022

Kubernetes limits nodes to 110 pods each by default, which should prevent this. Has this limit been lifted in our cluster?
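For reference, that cap is the kubelet's maxPods setting; a minimal sketch of the KubeletConfiguration (110 is the upstream default):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# 110 is the upstream default; raising it increases the object churn etcd has to absorb.
maxPods: 110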


garloff commented Aug 29, 2022

A rate limit for the k8s API is needed to protect the cluster:
https://kubernetes.io/docs/concepts/cluster-administration/flow-control/
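The built-in mechanism behind that page is API Priority and Fairness; a minimal sketch of throttling a noisy client (flowcontrol.apiserver.k8s.io/v1beta2 field names as current at the time; the names, shares, and matched service account are purely illustrative):

apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: PriorityLevelConfiguration
metadata:
  name: low-priority
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 5     # small slice of the apiserver's concurrency budget
    limitResponse:
      type: Reject                  # reject rather than queue once the level is saturated
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  name: noisy-workload
spec:
  priorityLevelConfiguration:
    name: low-priority
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: noisy-app          # hypothetical service account to throttle
            namespace: default
      resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          namespaces: ["*"]

The FlowSchema steers requests from that service account into a priority level with only a small concurrency share, and Reject drops excess requests instead of queueing them.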


garloff commented Aug 29, 2022

@janiskemper to document recommended default rate limits


garloff commented Aug 29, 2022

Monitoring/alerting is needed to warn before etcd runs out of space, so we can compact/defragment as needed.
Enable auto-compaction in etcd -- together with the rate limit, we should be safe in 99+% of cases.
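For the alerting part, something along these lines could work, assuming Prometheus already scrapes the etcd metrics endpoint (threshold and timings are illustrative):

groups:
  - name: etcd-maintenance
    rules:
      - alert: EtcdBackendNearlyFull
        # Fire well before etcd hits its backend quota and starts rejecting writes (NOSPACE alarm).
        expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "etcd backend on {{ $labels.instance }} is above 80% of its quota"
          description: "Compact and defragment etcd before it runs out of space."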


garloff commented Aug 31, 2022

Discussion:

  • We could add the --auto-compaction-retention=10 parameter to etcd (to keep 10 hrs of key history, removing anything older than 10 hrs every hour), BUT it seems that the kube-apiserver already performs compaction for us, so this would only be a second line of defense in case that somehow broke. Is this really advisable?
  • Have a nightly systemd.timer job (or cron job) that calls etcdctl defrag on the control-plane nodes; add some random delay so that not all nodes defragment at the same time. (Ideally, defrag runs much shorter than a re-election timeout, but you never know.)
  • For the kube-apiserver rate limit, I found some Rancher docs on adding a rate-limiting admission controller: https://rancher.com/docs/rke/latest/en/config-options/rate-limiting/ Is this Rancher-specific?
    Apart from this, we could add a QoS policy (limiting the bandwidth) to the load balancer address (in front of the kube-apiserver) at the OpenStack level, which seems somewhat suboptimal.


batistein commented Aug 31, 2022

I would also recommend using auto-compaction-retention. But I would suggest something like:

auto-compaction-mode: revision
auto-compaction-retention: 10000

I really like the idea of having a systemd service (rather than cron) that calls etcdctl.
I would also recommend adding a target for this to the node-problem-detector.

EventRateLimit is one of the really important configurations we need for being production-ready. And no, it's not Rancher-specific.
We should also set max-requests-inflight, max-mutating-requests-inflight, and min-request-timeout on the kube-apiserver.
We could also think about using FlowControl (API Priority and Fairness).
Limiting anything at the OpenStack level is a bad idea, because it is not portable to other providers.
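For reference, a rough sketch of how this could be wired up in the KubeadmControlPlane style used elsewhere in this thread (plugin list, file paths, and numbers are illustrative, and the admission config files still have to be placed on the nodes and mounted via extraVolumes):

spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          # EventRateLimit needs the plugin enabled plus an admission config file.
          enable-admission-plugins: NodeRestriction,EventRateLimit
          admission-control-config-file: /etc/kubernetes/admission-control.yaml
          max-requests-inflight: "400"            # upstream default, listed for completeness
          max-mutating-requests-inflight: "200"   # upstream default, listed for completeness
          min-request-timeout: "300"              # illustrative value

---
# /etc/kubernetes/admission-control.yaml (illustrative path)
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: EventRateLimit
    path: /etc/kubernetes/eventratelimit.yaml

---
# /etc/kubernetes/eventratelimit.yaml (illustrative limits)
apiVersion: eventratelimit.admission.k8s.io/v1alpha1
kind: Configuration
limits:
  - type: Namespace
    qps: 50
    burst: 100
    cacheSize: 2000
  - type: User
    qps: 10
    burst: 50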

batistein commented:

Our flags for etcd to be set in KubeadmControlPlane:

spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      etcd:
        local:
          extraArgs:
            cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256
            cert-file: /etc/kubernetes/pki/etcd/server.crt
            key-file: /etc/kubernetes/pki/etcd/server.key
            client-cert-auth: "true"
            auto-tls: "false"
            peer-client-cert-auth: "true"
            peer-auto-tls: "false"
            trusted-ca-file: /etc/kubernetes/pki/etcd/ca.crt
            heartbeat-interval: "300"
            election-timeout: "3000"
            snapshot-count: "5000"
            quota-backend-bytes: "2147483648" # 2×1024×1024×1024 = 2 GiB
            auto-compaction-mode: periodic
            auto-compaction-retention: 6h

postKubeadmCommands:

  - ionice -c2 -n0 -p `pgrep etcd`

In the node image for the control-planes:

ETCD_VER=v3.4.20 #https://github.com/etcd-io/etcd/releases
mkdir -p /tmp/etcd-download-test
curl -L https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
mv /tmp/etcd-download-test/etcdctl /usr/local/sbin/etcdctl && chmod +x /usr/local/sbin/etcdctl
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
rm -rf /tmp/etcd-download-test

mkdir -p /var/lib/etcd
chmod 700 /var/lib/etcd

cat > /etc/systemd/system/etcd-defrag.service <<'EOF'
[Unit]
Description=Run etcdctl defrag
Documentation=https://etcd.io/docs/v3.3.12/op-guide/maintenance/#defragmentation
After=network.target
[Service]
Type=oneshot
Environment="LOG_DIR=/var/log"
Environment="ETCDCTL_API=3"
ExecStart=/usr/local/sbin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt defrag
[Install]
WantedBy=multi-user.target
EOF


cat > /etc/systemd/system/etcd-defrag.timer <<'EOF'
[Unit]
Description=Run etcd-defrag.service every day
After=network.target
[Timer]
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=10m
[Install]
WantedBy=multi-user.target
EOF

systemctl enable etcd-defrag.service
systemctl enable etcd-defrag.timer


garloff commented Sep 18, 2022

Our flags for etcd to be set in KubeadmControlPlane:

spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      etcd:
        local:
          extraArgs:
            cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256

Do you enable extra cipher-suites here? Or disable some that you deemed not strong enough?

        cert-file: /etc/kubernetes/pki/etcd/server.crt
        key-file: /etc/kubernetes/pki/etcd/server.key
        client-cert-auth: "true"

These are defaults in kubeadm AFAICS.

        auto-tls: "false"
        peer-client-cert-auth: "true"
        peer-auto-tls: "false"
        trusted-ca-file: /etc/kubernetes/pki/etcd/ca.crt

trusted-ca-file and peer-client-cert-auth are defaults. Why did you disable the two auto-tls flags?

        heartbeat-interval: "300"
        election-timeout: "3000"

You have chosen values similar to ours (250 and 2500); looks like we had similar experiences with the storage in the clouds we tested :-)

        snapshot-count: "5000"

Half the default history (which defaults to 10000).

        quota-backend-bytes: "2147483648" # 2×1024×1024×1024 = 2 GiB

Which is the implicit default as well.

        auto-compaction-mode: periodic
        auto-compaction-retention: 6h

OK, these are new -- are they needed in addition to the snapshot-count limit?

Thanks for sharing these!


garloff commented Sep 18, 2022

postKubeadmCommands:

  - ionice -c2 -n0 -p `pgrep etcd`

Actually, I wanted to do both nice -n -10 on the CPU scheduling side (to make etcd preempt other processes and minimize latency) and ionice. Definitely a good tuning measure.
BUT: this is not reboot-safe, is it? Or have you done additional things to make it persist across reboots? Or would a rebooted node fail to rejoin the cluster anyway?
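If we did want it to persist, one option might be to ship a small oneshot unit via the KubeadmConfigSpec files field and enable it in postKubeadmCommands; a rough, untested sketch (the unit name and the wait loop are my assumptions):

spec:
  kubeadmConfigSpec:
    files:
      - path: /etc/systemd/system/etcd-ionice.service
        owner: root:root
        permissions: "0644"
        content: |
          [Unit]
          Description=Re-apply CPU/IO priorities to the etcd process after boot
          After=kubelet.service
          [Service]
          Type=oneshot
          # Wait for the etcd static pod to come up, then boost its IO and CPU priority.
          ExecStart=/bin/sh -c 'until pgrep -x etcd >/dev/null; do sleep 5; done; ionice -c2 -n0 -p $(pgrep -x etcd); renice -n -10 -p $(pgrep -x etcd)'
          [Install]
          WantedBy=multi-user.target
    postKubeadmCommands:
      - systemctl daemon-reload
      - systemctl enable --now etcd-ionice.service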

batistein commented:

Usually no reboot should happen. If a reboot is necessary, we will remediate the node. So it's not critical that these commands are not reboot-safe.

batistein commented:

A lot of the args came from the CIS Benchmark. Here is a link for one of them: https://www.tenable.com/audits/items/CIS_Kubernetes_v1.6.1_Level_1_Master.audit:73819e79f9fcb340a2c29b9efa2a8b71

jschoone added the on hold label on Oct 10, 2023
jschoone reopened this on Oct 11, 2023
jschoone closed this as not planned on Oct 11, 2023
jschoone added the Sprint Montreal (2023, cwk 40+41) label on Feb 28, 2024