
k3s-agent fails to start with embedded registry and kills entire OS #10101

Open
ElectroshockGuy opened this issue May 15, 2024 · 7 comments

Comments

@ElectroshockGuy

Describe the bug:

I am experiencing an issue where the k3s-agent fails to start properly. During the startup process, the file /var/lib/rancher/k3s/agent/containerd/peer.key is generated but its content is empty, which is quite unusual. When I attempt to delete the /var/lib/rancher/k3s/agent/containerd/peer.key file and then restart the k3s-agent, the system immediately freezes and then reboots.

Environmental Info:
K3s Version: v1.28.9+k3s1

Node(s) CPU architecture, OS, and Version:

cpu: 16
os: ubuntu 24.04 (kairos)

Cluster Configuration:
2 servers, 1 agent

Steps To Reproduce:

  • start the servers with --embedded-registry enabled
  • add a worker
  • the worker's OS starts k3s-agent.service
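
The reproduction above can be sketched as a server-side setup. The embedded registry mirror is enabled with `k3s server --embedded-registry`, and the mirrored registries are the ones listed in registries.yaml; the two entries below match the registries named in the agent log (docker.io, registry.k8s.io). This is a minimal sketch, not the reporter's exact configuration:

```yaml
# /etc/rancher/k3s/registries.yaml (sketch; empty mirror entries mean
# the embedded registry serves as the mirror endpoint for these names)
mirrors:
  docker.io:
  registry.k8s.io:
```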

Additional context / logs:

May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Using private registry config file at /etc/rancher/k3s/registries.yaml"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Module overlay was already loaded"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Module nf_conntrack was already loaded"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Module br_netfilter was already loaded"
May 15 06:36:21 node7 k3s[1819]: E0515 06:36:21.555610    1819 remote_runtime.go:294] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused\"" filter="nil"
May 15 06:36:21 node7 k3s[1819]: E0515 06:36:21.555713    1819 kuberuntime_sandbox.go:297] "Failed to list pod sandboxes" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused\""
May 15 06:36:21 node7 k3s[1819]: E0515 06:36:21.555778    1819 generic.go:238] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused\""
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_max' to 524288"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Set sysctl 'net/ipv4/conf/all/forwarding' to 1"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=info msg="Starting distributed registry mirror at https://10.11.111.63:6443/v2 for registries [docker.io registry.k8s.io]"
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=fatal msg="failed to start embedded registry: failed to load or generate p2p private key: error loading key from /var/lib/rancher/k3s/agent/containerd/peer.key: <nil>"
May 15 06:36:21 node7 systemd[1]: k3s-agent.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ An ExecStart= process belonging to unit k3s-agent.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
May 15 06:36:21 node7 systemd[1]: k3s-agent.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ The unit k3s-agent.service has entered the 'failed' state with result 'exit-code'.
May 15 06:36:21 node7 systemd[1]: k3s-agent.service: Consumed 1.283s CPU time, 200.0M memory peak, 0B memory swap peak.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ The unit k3s-agent.service completed and consumed the indicated resources.
May 15 06:36:26 node7 systemd[1]: k3s-agent.service: Scheduled restart job, restart counter is at 1.
░░ Subject: Automatic restarting of a unit has been scheduled
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ Automatic restarting of the unit k3s-agent.service has been scheduled, as the result for
░░ the configured Restart= setting for the unit.
May 15 06:36:26 node7 systemd[1]: Starting k3s-agent.service - Lightweight Kubernetes...
░░ Subject: A start job for unit k3s-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ A start job for unit k3s-agent.service has begun execution.
░░ 
░░ The job identifier is 1173.
May 15 06:36:26 node7 sh[2387]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
May 15 06:36:27 node7 k3s[2397]: time="2024-05-15T06:36:27Z" level=info msg="Starting k3s agent v1.28.9+k3s1
@liyimeng
Contributor

I also experience the same :(
My guess is that during key generation the system rebooted for some reason, so the generated key never got a chance to be written to disk, which results in the errors that follow.

@brandond
Contributor

brandond commented May 16, 2024

During the startup process, the file /var/lib/rancher/k3s/agent/containerd/peer.key is generated but its content is empty
May 15 06:36:21 node7 k3s[1819]: time="2024-05-15T06:36:21Z" level=fatal msg="failed to start embedded registry: failed to load or generate p2p private key: error loading key from /var/lib/rancher/k3s/agent/containerd/peer.key: <nil>"

This error is coming from https://github.com/rancher/dynamiclistener/blob/e590d58b896cc8dd33dde7cec80c52e23ec08189/cert/io.go#L89 - the message suggests that the file was created by a previous startup of k3s, but for some reason the file contents have been lost. Your best bet is probably to just delete the file from disk and let it be recreated on startup. You might be able to find other errors in the logs to suggest why the file has no contents or its contents are corrupted, but given that this node is also rebooting unexpectedly, I suspect that you may have lost data from your filesystem when the system crashed.

When I attempt to delete the /var/lib/rancher/k3s/agent/containerd/peer.key file and then restart the k3s-agent, the system immediately freezes and then reboots.

That sounds like a problem with your node; K3s shouldn't be capable of doing anything that would cause it to panic and reboot. You'll need to figure that out on your own.
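
The suggestion above (delete the empty key and let k3s recreate it) can be sketched as a small guard so that only a zero-byte key is removed; the function name is hypothetical, and the systemctl line is left commented out because it only makes sense on the affected node:

```shell
# remove_if_empty: delete a file only when it exists but has zero size,
# so a valid peer key is never touched by accident.
remove_if_empty() {
  if [ -f "$1" ] && [ ! -s "$1" ]; then
    rm -f "$1"
    return 0
  fi
  return 1
}

# On the affected node (hypothetical usage):
# remove_if_empty /var/lib/rancher/k3s/agent/containerd/peer.key \
#   && systemctl restart k3s-agent   # k3s regenerates the key on startup
```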

@liyimeng
Contributor

@brandond agree with you! I managed to switch to an OpenRC system and test the same k3s version; everything works as expected. systemd seems to be playing the devil here. :(

@liyimeng
Contributor

liyimeng commented May 18, 2024

Strange: when rolling back to 1.28.6, it runs fine with no issue.

@liyimeng
Contributor

I have found another potential cause. As I understand it, when running under systemd the cgroup driver should be systemd; however, I found that k3s mistakenly detects it as cgroupfs. Not sure if this is the issue.

@brandond
Contributor

I'm not aware of any defect in k3s that would cause it to use cgroupfs instead of systemd, when using the embedded containerd on a systemd-based OS. You're not trying to use docker or another user-provided container runtime, are you?
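
One way to verify which driver the embedded containerd actually picked is to look for `SystemdCgroup = true` in its generated config. The sketch below assumes the default k3s config path (a stock install); the function name is hypothetical:

```shell
# Hedged sketch: report which cgroup driver a containerd config selects.
# /var/lib/rancher/k3s/agent/etc/containerd/config.toml is the default
# location k3s writes its generated containerd config to (assumption:
# stock install, no custom --data-dir).
check_systemd_cgroup() {
  conf="${1:-/var/lib/rancher/k3s/agent/etc/containerd/config.toml}"
  if [ ! -f "$conf" ]; then
    echo "no containerd config at $conf"
    return 1
  fi
  if grep -q 'SystemdCgroup = true' "$conf"; then
    echo systemd     # containerd is using the systemd cgroup driver
  else
    echo cgroupfs    # fallback driver; a mismatch under systemd can misbehave
  fi
}
```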

@liyimeng
Contributor

No, I use Kairos from https://github.com/kairos-io/kairos/, which should have no other runtime available. In addition, I added some extra printouts and found:

WARN[0002] isRunningInUserNS=false, cgroup controller map[cpu:true cpuset:true hugetlb:true io:true memory:true misc:true pids:true rdma:true], INVOCATION_ID= 

INVOCATION_ID is empty; something is going wrong with systemd, since it should set this value.

This is very likely a systemd issue in their distribution; I will shout out loud there. :D
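
systemd exports INVOCATION_ID into the environment of every service it starts, so an empty value suggests the process was not launched (or not recognized) as a systemd unit. A minimal sketch of that style of detection, matching the printout above (the function name is hypothetical, not k3s's actual code):

```shell
# Hedged sketch: infer the cgroup driver from the systemd marker variable.
# systemd sets $INVOCATION_ID for each unit invocation; if it is empty,
# the process likely was not started by systemd, so cgroupfs is assumed.
guess_cgroup_driver() {
  if [ -n "$INVOCATION_ID" ]; then
    echo systemd
  else
    echo cgroupfs
  fi
}
```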
