
kubelet crash loop panic with SIGSEGV #124871

Open · garymm opened this issue May 14, 2024 · 6 comments

Labels: kind/bug · needs-triage · priority/important-soon · sig/node


garymm commented May 14, 2024

What happened?

I was running some pods on my node as usual.
At some point (23:41:35.127989 in the log), the kubelet crashed and then went into a crash loop.

Log showing the first crash and several thereafter: kubelet-crash.txt

The errors I see shortly before the crash are:

projected.go:292] Couldn't get configMap default/kube-root-ca.crt: object "default"/"kube-root-ca.crt" not registered
projected.go:198] Error preparing data for projected volume kube-api-access-lwp5v for pod default/run-fluid-wtw5n-99kwf: object "default"/"kube-root-ca.crt" not registered
nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/projected/e6cfc866-a5b2-4e16-9b83-884de8552d45-kube-api-access-lwp5v podName:e6cfc866-a5b2-4e16-9b83-884de8552d45 nodeName:}" failed. No retries permitted until 2024-05-13 23:41:38.8756138 +0000 UTC m=+9176.959811707 (durationBeforeRetry 2m2s). Error: MountVolume.SetUp failed for volume "kube-api-access-lwp5v" (UniqueName: "kubernetes.io/projected/e6cfc866-a5b2-4e16-9b83-884de8552d45-kube-api-access-lwp5v") pod "run-fluid-wtw5n-99kwf" (UID: "e6cfc866-a5b2-4e16-9b83-884de8552d45") : object "default"/"kube-root-ca.crt" not registered

Stack trace from the first crash (later stack traces are a bit different):

goroutine 401 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3d005c0?, 0x6d89410})
        vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x480e338?})
        vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x3d005c0, 0x6d89410})
        /usr/local/go/src/runtime/panic.go:884 +0x213
errors.As({0x12, 0x0}, {0x3eac4c0?, 0xc0014e6590?})
        /usr/local/go/src/errors/wrap.go:109 +0x215
k8s.io/apimachinery/pkg/util/net.IsConnectionReset(...)
        vendor/k8s.io/apimachinery/pkg/util/net/util.go:45
k8s.io/client-go/rest.(*Request).request.func2(0x6d2e8a0?, {0x12, 0x0})
        vendor/k8s.io/client-go/rest/request.go:1007 +0x79
k8s.io/client-go/rest.IsRetryableErrorFunc.IsErrorRetryable(...)
        vendor/k8s.io/client-go/rest/with_retry.go:43
k8s.io/client-go/rest.(*withRetry).IsNextRetry(0xc001587f40, {0x0?, 0x0?}, 0x0?, 0xc001ef5f00, 0xc001bb1560, {0x12, 0x0}, 0x480e328)
        vendor/k8s.io/client-go/rest/with_retry.go:169 +0x170
k8s.io/client-go/rest.(*Request).request.func3(0xc001bb1560, 0xc002913af8, {0x4c44840?, 0xc001587f40?}, 0x0?, 0x0?, 0x39efc40?, {0x12?, 0x0?}, 0x480e328)
        vendor/k8s.io/client-go/rest/request.go:1042 +0xba
k8s.io/client-go/rest.(*Request).request(0xc00199f200, {0x4c42e00, 0xc00227b0e0}, 0x2?)
        vendor/k8s.io/client-go/rest/request.go:1048 +0x4e5
k8s.io/client-go/rest.(*Request).Do(0xc00199f200, {0x4c42dc8, 0xc000196010})
        vendor/k8s.io/client-go/rest/request.go:1063 +0xc9
k8s.io/client-go/kubernetes/typed/core/v1.(*nodes).Get(0xc001db6740, {0x4c42dc8, 0xc000196010}, {0x7ffe6a355c42, 0x1d}, {{{0x0, 0x0}, {0x0, 0x0}}, {0x4c03920, ...}})
        vendor/k8s.io/client-go/kubernetes/typed/core/v1/node.go:77 +0x145
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).tryUpdateNodeStatus(0xc000289400, {0x4c42dc8, 0xc000196010}, 0x44ead4?)
        pkg/kubelet/kubelet_node_status.go:561 +0xf6
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).updateNodeStatus(0xc000289400, {0x4c42dc8, 0xc000196010})
        pkg/kubelet/kubelet_node_status.go:536 +0xfc
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncNodeStatus(0xc000289400)
        pkg/kubelet/kubelet_node_status.go:526 +0x105
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002913f28?)
        vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x4c19ae0, 0xc0008f2000}, 0x1, 0xc000180360)
        vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x2540be400, 0x3fa47ae147ae147b, 0x0?, 0x0?)
        vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
created by k8s.io/kubernetes/pkg/kubelet.(*Kubelet).Run
        pkg/kubelet/kubelet.go:1606 +0x58a
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1a pc=0x47a935]
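
For orientation, the middle frames (request.go:1007 → with_retry.go → util.go:45) correspond roughly to the following condensed sketch of client-go's retry predicate, paraphrased from the vendored sources of that era (not line-exact):

package sketch

import (
    "net/http"

    utilnet "k8s.io/apimachinery/pkg/util/net"
)

// isErrRetryable paraphrases the predicate that rest.(*Request).request
// installs (request.go:1007 in the trace): only GET requests are retried,
// and only on transient connection errors. with_retry.go's IsNextRetry
// invokes it via IsErrorRetryable with the request and the transport error.
func isErrRetryable(req *http.Request, err error) bool {
    if req.Method != "GET" {
        return false
    }
    // utilnet.IsConnectionReset contains the errors.As call that
    // faults in the trace above (util.go:45).
    return utilnet.IsConnectionReset(err) || utilnet.IsProbableEOF(err)
}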

What did you expect to happen?

Ideally no crash, but at least a useful error message.

How can we reproduce it (as minimally and precisely as possible)?

Really not sure.

Anything else we need to know?

No response

Kubernetes version

Client Version: v1.28.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.6

Cloud provider

Bare metal.

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

kubespray

Container runtime (CRI) and version (if applicable)

containerd

containerd github.com/containerd/containerd v1.7.13 7c3aca7a610df76212171d200ca3811ff6096eb8

Related plugins (CNI, CSI, ...) and versions (if applicable)

CNI = calico v3.26.4

garymm added the kind/bug label on May 14, 2024
k8s-ci-robot added the needs-sig and needs-triage labels on May 14, 2024
k8s-ci-robot (Contributor) commented

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

garymm (Author) commented May 14, 2024

/sig node

k8s-ci-robot added the sig/node label and removed the needs-sig label on May 14, 2024
saschagrunert (Member) commented

Per the Go documentation, errors.As panics if target is not a non-nil pointer to either a type that implements error or to any interface type; As returns false if err is nil. But syscall.Errno clearly implements error, so the target here is valid.

// vendor/k8s.io/apimachinery/pkg/util/net/util.go
func IsConnectionReset(err error) bool {
    var errno syscall.Errno
    if errors.As(err, &errno) {
        return errno == syscall.ECONNRESET
    }
    return false
}

If we assume that you're using golang 1.20.13 to build this version, then it would panic there: https://github.com/golang/go/blob/a95136a88cb8a51ede3ec2cdca4cfa3962dcfacd/src/errors/wrap.go#L109
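
As a sanity check, here is a minimal standalone program (illustrative, not kubelet code) showing that a *syscall.Errno target is perfectly valid for errors.As, and that the documented panics concern the target argument, not err:

package main

import (
    "errors"
    "fmt"
    "syscall"
)

func main() {
    err := fmt.Errorf("read tcp: %w", syscall.ECONNRESET)

    // syscall.Errno implements error, so *syscall.Errno is a valid target.
    var errno syscall.Errno
    fmt.Println(errors.As(err, &errno))      // true
    fmt.Println(errno == syscall.ECONNRESET) // true

    // The documented panic cases are about target, e.g. a nil target.
    defer func() { fmt.Println("recovered:", recover()) }()
    errors.As(err, nil) // panics: "errors: target cannot be nil"
}

That the failure here is a SIGSEGV rather than one of those documented panics, and that err is printed as {0x12, 0x0} in the trace, suggests the error interface value itself was corrupted: 0x12 is not a plausible type-word pointer, so reflecting over err inside As would fault regardless of the target.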

Which golang version did you use to build this Kubernetes version?

garymm (Author) commented May 15, 2024

According to the changelog, Kubernetes v1.28.6 (what I'm using) is built with Go 1.21.7.
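
If you still have access to the node, the toolchain can also be read straight off the binary with the go tool (the kubelet path below is an example; adjust for your install):

go version /usr/local/bin/kubelet
# reports the Go version the executable was built with, e.g. go1.21.7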

SergeyKanzhelev added this to Triage in SIG Node Bugs on May 22, 2024
haircommander (Contributor) commented

/assign @rphillips
/priority important-soon

@garymm do you have a more succinct reproducer to make it easier to find the issue?

k8s-ci-robot added the priority/important-soon label on May 22, 2024
haircommander moved this from Triage to High Priority in SIG Node Bugs on May 22, 2024
garymm (Author) commented May 22, 2024

Not really, sorry. The only unusual thing that may be relevant: this happened on a control plane node, and I don't taint my control plane nodes, so it was running normal workload pods alongside the control plane pods.
