leader-elected etcd controllers not consistently functional when leader election/lease mismatches occur #10046

Closed
Oats87 opened this issue Apr 29, 2024 · 5 comments

@Oats87
Member

Oats87 commented Apr 29, 2024

Environmental Info:
K3s Version:

k3s version v1.29.4+k3s1 (94e29e2e)
go version go1.21.9

Node(s) CPU architecture, OS, and Version:
Linux <snipp> 6.2.0-39-generic #40-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 14:18:00 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
3 servers, all run with --cluster-init=true

Describe the bug:
In cases where K3s is run with etcd and leader election, there is a possibility that certain etcd-related controllers stop operating as expected in the case where lease/leader election becomes mismatched.

Steps To Reproduce:
On all 3 server nodes:

curl https://get.k3s.io | INSTALL_K3S_SKIP_START=true INSTALL_K3S_SKIP_ENABLE=true sh -
mkdir -p /etc/rancher/k3s
cat << EOF > /etc/rancher/k3s/config.yaml
cluster-init: true
token: <some-token>
EOF

On nodes 2 and 3, also point the config at the first server:
echo "server: https://<first-server-ip>:6443" >> /etc/rancher/k3s/config.yaml

Then, start k3s i.e. systemctl start k3s

Once K3s is running, in order to specifically see the problem, create a k3s-etcd-snapshot-extra-metadata configmap i.e.

kubectl create configmap -n kube-system k3s-etcd-snapshot-extra-metadata --from-literal=test=ing

Then, create snapshots on various nodes i.e. k3s etcd-snapshot save

Observe that the k3s-etcd-snapshots configmap has a corresponding number of snapshots, i.e. kubectl get configmap -n kube-system shows a DATA count matching the number of expected snapshots.

Now, force a lease overturn: kubectl edit lease k3s-etcd -n kube-system and change the holder to a node that is NOT the holder of the k3s lease (kubectl get lease k3s -n kube-system). Once this is done, log into the node you changed the k3s-etcd lease to and systemctl restart k3s. After this, kubectl get leases -n kube-system should show mismatched lease holders for k3s and k3s-etcd.
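A minimal non-interactive sketch of that lease edit, where <other-server> is a placeholder for a server node that does not currently hold the k3s lease:

# <other-server> is a placeholder; substitute a node name from kubectl get nodes
kubectl patch lease k3s-etcd -n kube-system --type merge -p '{"spec":{"holderIdentity":"<other-server>"}}'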

Try to take another snapshot k3s etcd-snapshot save and observe that k3s never adds this snapshot to the k3s-etcd-snapshots configmap.
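As an illustrative cross-check (assumes jq is installed; not part of the original steps), the snapshots known to k3s can be listed alongside the keys recorded in the configmap, which stop matching once the mismatch occurs:

k3s etcd-snapshot ls
kubectl get configmap k3s-etcd-snapshots -n kube-system -o json | jq -r '.data | keys[]'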

Expected behavior:
The controllers for etcd operate on any lease holder.

Actual behavior:
If the controllers for etcd are leased out to a different holder than the holder for k3s, the controllers will not operate correctly.

Additional context / logs:
On the new lease holder, it's possible to see the controllers' handlers being registered, but there is no reactivity on the node.

Apr 29 21:50:10 ck-ub2304-b-2 k3s[1448]: time="2024-04-29T21:50:10Z" level=info msg="Starting managed etcd node metadata controller"
Apr 29 21:50:10 ck-ub2304-b-2 k3s[1448]: time="2024-04-29T21:50:10Z" level=info msg="Starting managed etcd apiserver addresses controller"
Apr 29 21:50:10 ck-ub2304-b-2 k3s[1448]: time="2024-04-29T21:50:10Z" level=info msg="Starting managed etcd member removal controller"
Apr 29 21:50:10 ck-ub2304-b-2 k3s[1448]: time="2024-04-29T21:50:10Z" level=info msg="Starting managed etcd snapshot ConfigMap controller"

I have debugged this to what I believe is a missed call to sc.Start(ctx) in the -etcd leader election callback list. As per the comment here:

k3s/pkg/server/server.go

Lines 170 to 174 in 0981f00

// Re-run context startup after core and leader-elected controllers have started. Additional
// informer caches may need to start for the newly added OnChange callbacks.
if err := sc.Start(ctx); err != nil {
panic(errors.Wrap(err, "failed to start wranger controllers"))
}
For apiserverControllers, sc.Start is called because additional informer caches must be started for the newly registered handlers, but this only happens for the version.Program (i.e. k3s) LeaderElectedClusterControllerStarts here:

k3s/pkg/server/server.go

Lines 136 to 137 in 0981f00

controlConfig.Runtime.LeaderElectedClusterControllerStarts[version.Program] = func(ctx context.Context) {
apiserverControllers(ctx, sc, config)

Within the corresponding version.Program+"-etcd" LeaderElectedClusterControllerStarts, there is no such sc.Start call in any of the callbacks defined.

A quick workaround for this is to "follow" the k3s lease holder to the holder of k3s-etcd, i.e. kubectl edit lease -n kube-system k3s and change the holder to the current k3s-etcd holder. If on an older version of K3s where Lease objects are not in use for leader election, the same concept can be applied to the corresponding annotation on the ConfigMap object in the kube-system namespace.
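A non-interactive sketch of that workaround (assuming a working kubeconfig on the node); it reads the current k3s-etcd holder and points the k3s lease at the same node:

ETCD_HOLDER=$(kubectl get lease k3s-etcd -n kube-system -o jsonpath='{.spec.holderIdentity}')
kubectl patch lease k3s -n kube-system --type merge -p "{\"spec\":{\"holderIdentity\":\"${ETCD_HOLDER}\"}}"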

As the specific controller I am running into issues with operates off of EtcdSnapshotFile objects and is only started when there is a k3s-etcd-snapshot-extra-metadata configmap in kube-system, it is not surprising that this specific case was missed, but I believe the missing start call should be added to ensure compatibility with Rancher Provisioning.

It seems this issue was introduced along with the use of EtcdSnapshotFile CRs.

@brandond
Contributor

brandond commented Apr 29, 2024

  • registerEndpointsHandlers watches Endpoints, but does so using a core Kubernetes ListWatch with a FieldSelector, so it is not affected if the wrangler Endpoints informer doesn't get started.
  • registerMemberHandlers registers Node handlers using the wrangler framework, and will fail to handle member deletion if the informer doesn't get started.
  • registerSnapshotHandlers registers ETCDSnapshotFile handlers using the wrangler framework, and will fail to handle snapshot configmap sync if the informer doesn't get started (the issue reported here).

I think incorrect behavior in the k3s-etcd controllers probably could have happened prior to the ETCDSnapshotFile changes, although the symptoms would have been limited to just the etcd cluster membership management stuff (delete/pre-delete) not working right.

The original introduction of the missing informer start bug would have been in #6922 - not in #8064 - although that did make the snapshot configmap sync depend on the controllers being started, whereas previously it was just done ad-hoc as part of the snapshot save process.

@brandond
Contributor

Thanks for the report - linked PR should fix this for May releases.

@brandond brandond added this to the v1.30.1+k3s1 milestone Apr 29, 2024
@brandond brandond self-assigned this Apr 29, 2024
@orangedeng
Contributor

Do we really need two leases, k3s and k3s-etcd? It seems both the k3s controllers and the k3s-etcd controllers will be registered on nodes with !e.config.DisableAPIServer (i.e. API server enabled) when using the embedded DB.

Refer to:

k3s/pkg/etcd/etcd.go

Lines 617 to 623 in 1454953

if !e.config.DisableAPIServer {
e.config.Runtime.LeaderElectedClusterControllerStarts[version.Program+"-etcd"] = func(ctx context.Context) {
registerEndpointsHandlers(ctx, e)
registerMemberHandlers(ctx, e)
registerSnapshotHandlers(ctx, e)
}
}

and

k3s/pkg/server/server.go

Lines 135 to 140 in 1454953

if !controlConfig.DisableAPIServer {
controlConfig.Runtime.LeaderElectedClusterControllerStarts[version.Program] = func(ctx context.Context) {
apiserverControllers(ctx, sc, config)
}
}

@brandond
Contributor

brandond commented May 8, 2024

Yes, ideally we would continue to allow these to be split up so that the work isn't always loaded onto a single server. I don't think we're interested in merging them back into a single controller at the moment.

@VestigeJ

Environment Details
Reproduced using VERSION=v1.29.4+k3s1
Validated using COMMIT=f9130d537df6862a63120aa4ddb803f1317e0ef9

Infrastructure

  • Cloud

Node(s) CPU architecture, OS, and version:

Linux 5.14.21-150500.53-default x86_64 GNU/Linux
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP5"

Cluster Configuration:

NAME               STATUS   ROLES                       AGE   VERSION
ip-1-1-1-7         Ready    control-plane,etcd,master   15m   v1.29.4+k3s1
ip-1-1-1-3         Ready    <none>                      14m   v1.29.4+k3s1
ip-1-1-1-221       Ready    control-plane,etcd,master   15m   v1.29.4+k3s1
ip-1-1-1-5         Ready    control-plane,etcd,master   14m   v1.29.4+k3s1

Config.yaml:

node-external-ip: 1.1.1.221
token: YOUR_TOKEN_HERE
write-kubeconfig-mode: 644
debug: true
cluster-init: true
embedded-registry: true

Steps (kg and k below are shell aliases for kubectl get and kubectl)

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0 \nvm.overcommit_memory=1 \nkernel.panic=10 \nkernel.panic_on_oops=1 \n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/
$ sudo systemctl restart systemd-sysctl

$ VERSION=v1.29.4+k3s1
$ sudo INSTALL_K3S_VERSION=$VERSION INSTALL_K3S_EXEC=server ./install-k3s.sh
$ watch -n2 kubectl get no,po,svc -A
$ kubectl create configmap -n kube-system k3s-etcd-snapshot-extra-metadata --from-literal=test=ing
$ kg configmap -n kube-system
$ kg lease k3s -n kube-system
$ k edit lease k3s-etcd -n kube-system
$ kg lease k3s -n kube-system
$ kg configmap -n kube-system

Results:

Confirmed reproduction for snapshots not being recorded correctly.

$ kg configmap -n kube-system

NAME                                                   DATA   AGE
chart-content-traefik                                  0      7m14s
chart-content-traefik-crd                              0      7m14s
cluster-dns                                            2      7m19s
coredns                                                2      7m17s
extension-apiserver-authentication                     6      7m22s
k3s-etcd-snapshot-extra-metadata                       1      4m35s
k3s-etcd-snapshots                                     3      4m13s
kube-apiserver-legacy-service-account-token-tracking   1      7m22s
kube-root-ca.crt                                       1      7m5s
local-path-config                                      4      7m17s

The count in the configmap remains at 3 snapshots on the k3s lease holder node, despite the third node taking multiple snapshots after the reproduction steps moved the k3s-etcd lease to a second node.

$ kg leases -n kube-system

NAME                                   HOLDER                                                                      AGE
apiserver-npohpbu5dxydyvbow7k53fefcm   apiserver-npohpbu5dxydyvbow7k53fefcm_508ef4a6-ac1b-47c4-85e1-0515aa625466   6m20s
apiserver-tkn2lcxdsshiqkpraahwpf4i4m   apiserver-tkn2lcxdsshiqkpraahwpf4i4m_b8a9a05d-a631-43ec-9f39-6731d792272b   6m55s
apiserver-z34mnzd4bdrbimoftxiwhsmkfq   apiserver-z34mnzd4bdrbimoftxiwhsmkfq_b064cd65-bb02-4554-9632-2f7be6153fbf   6m1s
k3s                                    ip-1-1-1-221                                                            6m52s
k3s-cloud-controller-manager           ip-1-1-1-5_7d511fce-3565-49fb-9a06-3f9b89c48c05                         6m47s
k3s-etcd                               ip-1-1-1-7                                                            6m52s
kube-controller-manager                ip-1-1-1-5_6660b54b-08b3-4c30-8a53-e22eadf25605                         6m51s
kube-scheduler                         ip-1-1-1-7_ec6a89ba-c18c-46ea-b49a-d0e4b84b7986                       6m44s

$ kg configmap -n kube-system   # etcd snapshots should be 6, not 3

NAME                                                   DATA   AGE
chart-content-traefik                                  0      7m14s
chart-content-traefik-crd                              0      7m14s
cluster-dns                                            2      7m19s
coredns                                                2      7m17s
extension-apiserver-authentication                     6      7m22s
k3s-etcd-snapshot-extra-metadata                       1      4m35s
k3s-etcd-snapshots                                     3      4m13s
kube-apiserver-legacy-service-account-token-tracking   1      7m22s
kube-root-ca.crt                                       1      7m5s
local-path-config                                      4      7m17s

Latest COMMIT_ID install looking good!

$ kg cm -n kube-system

NAME                                                   DATA   AGE
chart-content-traefik                                  0      105s
chart-content-traefik-crd                              0      105s
cluster-dns                                            2      109s
coredns                                                2      108s
extension-apiserver-authentication                     6      113s
k3s-etcd-snapshot-extra-metadata                       1      11s
kube-apiserver-legacy-service-account-token-tracking   1      113s
kube-root-ca.crt                                       1      96s
local-path-config                                      4      107s

$ k edit lease k3s-etcd -n kube-system
lease.coordination.k8s.io/k3s-etcd edited

$ kg lease k3s -n kube-system

NAME   HOLDER             AGE
k3s    ip-1-1-1-221   2m35s

$ kg lease k3s-etcd -n kube-system

NAME       HOLDER           AGE
k3s-etcd   ip-1-1-1-5   2m44s

$ kg cm -n kube-system

NAME                                                   DATA   AGE
chart-content-traefik                                  0      8m7s
chart-content-traefik-crd                              0      8m7s
cluster-dns                                            2      8m11s
coredns                                                2      8m10s
extension-apiserver-authentication                     6      8m15s
k3s-etcd-snapshot-extra-metadata                       1      6m33s
k3s-etcd-snapshots                                     13     27s
kube-apiserver-legacy-service-account-token-tracking   1      8m15s
kube-root-ca.crt                                       1      7m58s
local-path-config                                      4      8m9s
