
etcd maintenance #258

Closed
2 of 4 tasks
garloff opened this issue Aug 24, 2022 · 11 comments
Labels: Container, enhancement, epic, on hold, Sprint Montreal (2023, cwk 40+41)

Comments


garloff commented Aug 24, 2022

As a cluster operator, I want to avoid etcd causing trouble.

etcd can become big and slow, and ultimately so large that it refuses writes.
We observed this while debugging the GXFS staging cluster:
https://input.osb-alliance.de/p/2022-scs-gxfs-cluster-debugging

To learn from this, we may want to tweak the etcd setup on the control nodes.

  • etcd parameters that limit the history and cause compaction and/or defragmentation
  • regular maintenance jobs on the node to maintain etcd's health
  • detecting trouble on the node and raising alarms
  • create documentation on how to address potential (remaining) challenges
garloff added the enhancement, Container, and epic labels on Aug 24, 2022

garloff commented Aug 29, 2022

Kubernetes limits nodes to 110 pods each by default, which should prevent this. Has this limit been lifted in our cluster?
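For reference, that cap is the kubelet's maxPods setting; a minimal sketch of the KubeletConfiguration (110 is the upstream default):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# 110 is the upstream default; raising it increases the object churn etcd has to absorb.
maxPods: 110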


garloff commented Aug 29, 2022

A rate limit for the k8s API is needed to protect the cluster:
https://kubernetes.io/docs/concepts/cluster-administration/flow-control/
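The built-in mechanism behind that page is API Priority and Fairness; a minimal sketch of throttling a noisy client (flowcontrol.apiserver.k8s.io/v1beta2 field names as current at the time; the names, shares, and matched service account are purely illustrative):

apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: PriorityLevelConfiguration
metadata:
  name: low-priority
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 5     # small slice of the apiserver's concurrency budget
    limitResponse:
      type: Reject                  # reject rather than queue once the level is saturated
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  name: noisy-workload
spec:
  priorityLevelConfiguration:
    name: low-priority
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: noisy-app          # hypothetical service account to throttle
            namespace: default
      resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          namespaces: ["*"]

The FlowSchema steers requests from that service account into a priority level with only a small concurrency share, and Reject drops excess requests instead of queueing them.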


garloff commented Aug 29, 2022

@janiskemper to document recommended default rate limits


garloff commented Aug 29, 2022

Monitoring/alerting is needed to warn before etcd runs out of space, so we can compact/defragment as needed.
Enable auto-compaction in etcd -- together with the rate limit, we should be safe in 99+% of cases.
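For the alerting part, something along these lines could work, assuming Prometheus already scrapes the etcd metrics endpoint (threshold and timings are illustrative):

groups:
  - name: etcd-maintenance
    rules:
      - alert: EtcdBackendNearlyFull
        # Fire well before etcd hits its backend quota and starts rejecting writes (NOSPACE alarm).
        expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "etcd backend on {{ $labels.instance }} is above 80% of its quota"
          description: "Compact and defragment etcd before it runs out of space."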


garloff commented Aug 31, 2022

Discussion:

  • We could add the --auto-compaction-retention=10 parameter to etcd (to keep 10 hrs of key history, removing anything older than 10 hrs every hour), BUT it seems that the kube-apiserver already performs compaction for us, so this would only be a second line of defense in case that somehow broke. Is this really advisable?
  • Have a nightly systemd.timer job (or cron job) that calls etcdctl defrag on the control-plane nodes; add some random delay so that not all nodes defragment at the same time. (Ideally, defrag runs much shorter than a re-election timeout, but you never know.)
  • For the kube-apiserver rate limit, I found some Rancher docs on adding a rate-limiting admission controller: https://rancher.com/docs/rke/latest/en/config-options/rate-limiting/ Is this Rancher-specific?
    Apart from this, we could add a QoS policy (limiting the bandwidth) to the load balancer address (in front of the kube-apiserver) at the OpenStack level, which seems somewhat suboptimal.


batistein commented Aug 31, 2022

I would also recommend using auto-compaction-retention. But I would suggest something like:

auto-compaction-mode: revision
auto-compaction-retention: 10000

I really like the idea of having a systemd service (rather than cron) that calls etcdctl.
I would also recommend adding a target for this to the node-problem-detector.

EventRateLimit is one of the really important configurations we need for being production-ready. And no, it's not Rancher-specific.
We should also set max-requests-inflight, max-mutating-requests-inflight, and min-request-timeout on the kube-apiserver.
We could also think about using FlowControl (API Priority and Fairness).
Limiting anything at the OpenStack level is a bad idea, because it is not portable to other providers.
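For reference, a rough sketch of how this could be wired up in the KubeadmControlPlane style used elsewhere in this thread (plugin list, file paths, and numbers are illustrative, and the admission config files still have to be placed on the nodes and mounted via extraVolumes):

spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          # EventRateLimit needs the plugin enabled plus an admission config file.
          enable-admission-plugins: NodeRestriction,EventRateLimit
          admission-control-config-file: /etc/kubernetes/admission-control.yaml
          max-requests-inflight: "400"            # upstream default, listed for completeness
          max-mutating-requests-inflight: "200"   # upstream default, listed for completeness
          min-request-timeout: "300"              # illustrative value

---
# /etc/kubernetes/admission-control.yaml (illustrative path)
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: EventRateLimit
    path: /etc/kubernetes/eventratelimit.yaml

---
# /etc/kubernetes/eventratelimit.yaml (illustrative limits)
apiVersion: eventratelimit.admission.k8s.io/v1alpha1
kind: Configuration
limits:
  - type: Namespace
    qps: 50
    burst: 100
    cacheSize: 2000
  - type: User
    qps: 10
    burst: 50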

batistein commented:

Our flags for etcd to be set in KubeadmControlPlane:

spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      etcd:
        local:
          extraArgs:
            cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256
            cert-file: /etc/kubernetes/pki/etcd/server.crt
            key-file: /etc/kubernetes/pki/etcd/server.key
            client-cert-auth: "true"
            auto-tls: "false"
            peer-client-cert-auth: "true"
            peer-auto-tls: "false"
            trusted-ca-file: /etc/kubernetes/pki/etcd/ca.crt
            heartbeat-interval: "300"
            election-timeout: "3000"
            snapshot-count: "5000"
            quota-backend-bytes: "2147483648" # 2×1024×1024×1024 = 2 GiB
            auto-compaction-mode: periodic
            auto-compaction-retention: 6h

postKubeadmCommands:

  - ionice -c2 -n0 -p `pgrep etcd`

In the node image for the control-planes:

ETCD_VER=v3.4.20 #https://github.com/etcd-io/etcd/releases
mkdir -p /tmp/etcd-download-test
curl -L https://github.com/etcd-io/etcd/releases/download/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
mv /tmp/etcd-download-test/etcdctl /usr/local/sbin/etcdctl && chmod +x /usr/local/sbin/etcdctl
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
rm -rf /tmp/etcd-download-test

mkdir -p /var/lib/etcd
chmod 700 /var/lib/etcd

cat > /etc/systemd/system/etcd-defrag.service <<'EOF'
[Unit]
Description=Run etcdctl defrag
Documentation=https://etcd.io/docs/v3.3.12/op-guide/maintenance/#defragmentation
After=network.target
[Service]
Type=oneshot
Environment="LOG_DIR=/var/log"
Environment="ETCDCTL_API=3"
ExecStart=/usr/local/sbin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt defrag
[Install]
WantedBy=multi-user.target
EOF


cat > /etc/systemd/system/etcd-defrag.timer <<'EOF'
[Unit]
Description=Run etcd-defrag.service every day
After=network.target
[Timer]
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=10m
[Install]
WantedBy=multi-user.target
EOF

systemctl enable etcd-defrag.service
systemctl enable etcd-defrag.timer


garloff commented Sep 18, 2022

Our flags for etcd to be set in KubeadmControlPlane:

spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      etcd:
        local:
          extraArgs:
            cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256

Do you enable extra cipher-suites here? Or disable some that you deemed not strong enough?

        cert-file: /etc/kubernetes/pki/etcd/server.crt
        key-file: /etc/kubernetes/pki/etcd/server.key
        client-cert-auth: "true"

These are defaults in kubeadm AFAICS.

        auto-tls: "false"
        peer-client-cert-auth: "true"
        peer-auto-tls: "false"
        trusted-ca-file: /etc/kubernetes/pki/etcd/ca.crt

trusted-ca-file and peer-client-cert-auth are defaults. Why did you disable the two auto-tls flags?

        heartbeat-interval: "300"
        election-timeout: "3000"

You have chosen values similar to ours (250 and 2500); looks like we had similar experiences with the storage in the clouds we tested :-)

        snapshot-count: "5000"

Half the default history (which defaults to 10000).

        quota-backend-bytes: "2147483648" # 2×1024×1024×1024 = 2 GiB

Which is the implicit default as well.

        auto-compaction-mode: periodic
        auto-compaction-retention: 6h

OK, these are new -- are they needed in addition to the snapshot-count limit?

Thanks for sharing these!


garloff commented Sep 18, 2022

postKubeadmCommands:

  - ionice -c2 -n0 -p `pgrep etcd`

Actually, I wanted to do both nice -n -10 on the CPU scheduling side (to make etcd preempt other processes and minimize latency) and ionice. Definitely a good tuning measure.
BUT: this is not reboot-safe, is it? Or have you done additional things to make it persist across reboots? Or would a rebooted node fail to rejoin the cluster anyway?
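If we did want it to persist, one option might be to ship a small oneshot unit via the KubeadmConfigSpec files field and enable it in postKubeadmCommands; a rough, untested sketch (the unit name and the wait loop are my assumptions):

spec:
  kubeadmConfigSpec:
    files:
      - path: /etc/systemd/system/etcd-ionice.service
        owner: root:root
        permissions: "0644"
        content: |
          [Unit]
          Description=Re-apply CPU/IO priorities to the etcd process after boot
          After=kubelet.service
          [Service]
          Type=oneshot
          # Wait for the etcd static pod to come up, then boost its IO and CPU priority.
          ExecStart=/bin/sh -c 'until pgrep -x etcd >/dev/null; do sleep 5; done; ionice -c2 -n0 -p $(pgrep -x etcd); renice -n -10 -p $(pgrep -x etcd)'
          [Install]
          WantedBy=multi-user.target
    postKubeadmCommands:
      - systemctl daemon-reload
      - systemctl enable --now etcd-ionice.service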

batistein commented:

Usually no reboot should happen. If a reboot is necessary, we will remediate the node. So it's not critical that these commands are not reboot-safe.

batistein commented:

A lot of the args came from the CIS Benchmark. Here is a link for one of them: https://www.tenable.com/audits/items/CIS_Kubernetes_v1.6.1_Level_1_Master.audit:73819e79f9fcb340a2c29b9efa2a8b71

jschoone added the on hold label on Oct 10, 2023
jschoone reopened this on Oct 11, 2023
jschoone closed this as not planned on Oct 11, 2023
jschoone added the Sprint Montreal (2023, cwk 40+41) label on Feb 28, 2024