
NodeGracefulShutdown fail to update Pod status for system critical pods #124448

Open

avestuk opened this issue Apr 22, 2024 · 13 comments
Labels
- kind/bug: Categorizes issue or PR as related to a bug.
- needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
- sig/node: Categorizes an issue or PR as relevant to SIG Node.
- triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@avestuk

avestuk commented Apr 22, 2024

What happened?

Pods with a system-cluster-critical priority class set are not evicted by Graceful Node Shutdown when a node is slated to be shut down. Instead, they are evicted by the Taint Manager once the tolerationSeconds period expires.

What did you expect to happen?

I'd expect the presence of a priority class to be orthogonal to whether Graceful Node Shutdown evicts the pods. If Graceful Node Shutdown runs on node shutdown, then pods should be evicted before the Taint Manager runs.

How can we reproduce it (as minimally and precisely as possible)?

Run a deployment with the priority class below set in its pod spec:

priorityClassName: system-cluster-critical

Schedule a node shutdown or trigger a shutdown event using gdbus

gdbus emit --system --object-path /org/freedesktop/login1 --signal 'org.freedesktop.login1.Manager.PrepareForShutdown' true

Observe that pods from the deployment enter a "Terminating" state and have conditions indicating they were evicted by the Taint Manager:

  - lastProbeTime: null
    lastTransitionTime: "2024-04-22T14:30:26Z"
    message: 'Taint manager: deleting due to NoExecute taint'
    reason: DeletionByTaintManager
    status: "True"
    type: DisruptionTarget

The pod status is never updated to show TerminationByKubelet, and the pod stays in the Running state until the eviction by the Taint Manager.

Anything else we need to know?

Kubelet logs indicate that the node shutdown manager finished terminating the pod, but this is never reflected in the API server.

shutdown_manager_linux.go:395] "Shutdown manager finished killing pod" pod="kube-system/coredns-5687d5bddb-fqlbh"

Pod in the API server

coredns-5687d5bddb-fqlbh                  1/1     Terminating   0          89m

Kubernetes version

$ kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3

Cloud provider

Akamai Connected Cloud (The cloud formerly known as Linode)

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

$ uname -a
Linux lke-e2e-bcbc70-test-1a751f3d 6.1.0-0.deb11.17-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1~bpo11+1 (2024-01-05) x86_64 GNU/Linux

Install tools

kubeadm

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@avestuk avestuk added the kind/bug Categorizes issue or PR as related to a bug. label Apr 22, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 22, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@avestuk
Author

avestuk commented Apr 22, 2024

/sig sig-node

@k8s-ci-robot
Contributor

@avestuk: The label(s) sig/sig-node cannot be applied, because the repository doesn't have them.

In response to this:

/sig sig-node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@avestuk
Author

avestuk commented Apr 22, 2024

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 22, 2024
@saschagrunert
Member

cc @kwilczynski

I'm wondering whether we can treat this as a bug, or whether it should be handled as an enhancement to the graceful shutdown feature.

@zhifei92

This could be because the kubelet receives the node shutdown event and deletes the pod, while the KCM (kube-controller-manager) initiates pod eviction only after a 30-second delay. By the time the KCM starts the eviction process, the node has already shut down, but the API server has not been updated to record that the pod was deleted, so the pod stays in the Terminating state.

If I understand correctly, I do think it can be improved.

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Apr 24, 2024
@AnishShah
Contributor

The graceful node shutdown feature does not evict pods based on taints & tolerations. It first evicts all the regular pods and then the system pods. In this case, coredns might be running as a system pod?

@AnishShah AnishShah moved this from Triage to Needs Information in SIG Node Bugs Apr 24, 2024
@AnishShah
Contributor

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Apr 24, 2024
@zhifei92 zhifei92 removed their assignment Apr 25, 2024
@avestuk
Author

avestuk commented Apr 30, 2024

@AnishShah I found yesterday that my initial report was spurious: this does seem to be due to priority classes. CoreDNS was indeed running as a system-cluster-critical pod and so falls into the ShutdownGracePeriodCriticalPods timeout.

I'm able to consistently reproduce the failure using a minimal example deployment: when the PriorityClassName for the deployment is set to "system-cluster-critical", NodeGracefulShutdown fails.

Below are logs from a successful pod eviction due to graceful node shutdown. The pod is part of a deployment without a PriorityClassName specified.

Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621576     718 nodeshutdown_manager_linux.go:375] "Shutdown manager killing pod with gracePeriod" pod="kube-system/test-68969998d9-7nqcn" gracePeriod=20
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621746     718 pod_workers.go:867] "Pod is being removed by the kubelet, begin teardown" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" updateType="kill"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621834     718 pod_workers.go:963] "Notifying pod of pending update" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" workType="terminating"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.621926     718 pod_workers.go:970] "Cancelling current pod sync" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" workType="terminating"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622017     718 pod_workers.go:1230] "Processing pod event" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" updateType="terminating"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622101     718 pod_workers.go:1352] "Pod worker has observed request to terminate" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622188     718 kubelet.go:2011] "SyncTerminatingPod enter" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622262     718 kubelet_pods.go:1673] "Generating pod status" podIsTerminal=false pod="kube-system/test-68969998d9-7nqcn"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622415     718 kubelet_pods.go:1686] "Got phase for pod" pod="kube-system/test-68969998d9-7nqcn" oldPhase="Running" phase="Running"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622546     718 status_manager.go:687] "updateStatusInternal" version=3 podIsFinished=false pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" containers="(busybox state=running previous=<none>)"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622648     718 kubelet.go:2021] "Pod terminating with grace period" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" gracePeriod=20
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622757     718 kuberuntime_container.go:749] "Killing container with a grace period override" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" containerName="busybox" containerID="containerd://771f1240ede543dccb522e67e8bb099e79bf716f6b5a06ef117e72a220ce7d60" gracePerio>
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.622864     718 kuberuntime_container.go:770] "Killing container with a grace period" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" containerName="busybox" containerID="containerd://771f1240ede543dccb522e67e8bb099e79bf716f6b5a06ef117e72a220ce7d60" gracePeriod=20
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.623182     718 status_manager.go:227] "Syncing updated statuses"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.623306     718 status_manager.go:833] "Sync pod status" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" statusUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" version=3
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.623567     718 event.go:376] "Event occurred" object="kube-system/test-68969998d9-7nqcn" fieldPath="spec.containers{busybox}" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container busybox"

Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.630224     718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"}}"
Apr 30 09:38:16 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:16.632487     718 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-68969998d9-7nqcn" statusVersion=3

Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.729875     718 status_manager.go:984] "Pod status is inconsistent with cached status for pod, a reconciliation should be triggered" pod="kube-system/test-68969998d9-7nqcn" statusDiff=<
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:           &v1.PodStatus{
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Phase: "Running",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Phase: "Failed",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   Conditions: []v1.PodCondition{
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           ... // 3 identical elements
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           {Type: "ContainersReady", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:18 +0000 UTC"}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           {Type: "PodScheduled", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:16 +0000 UTC"}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                 {
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Type:               "DisruptionTarget",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Status:             "True",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         LastTransitionTime: s"2024-04-30 09:38:16 +0000 UTC",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Reason:             "TerminationByKubelet",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Message:            "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                 },
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   },
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Message:           "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Message:           "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Reason:            "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Reason:            "Terminated",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   NominatedNodeName: "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   HostIP:            "192.168.161.11",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ... // 2 identical fields
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   PodIPs:                     {{IP: "10.2.0.131"}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   StartTime:                  s"2024-04-30 09:37:16 +0000 UTC",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         InitContainerStatuses:      nil,
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         InitContainerStatuses:      []v1.ContainerStatus{},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ContainerStatuses:          {{Name: "busybox", State: {Running: &{StartedAt: {Time: s"2024-04-30 09:37:17 +0000 UTC"}}}, Ready: true, Image: "docker.io/library/busybox:stable-glibc", ...}},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   QOSClass:                   "BestEffort",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         EphemeralContainerStatuses: nil,
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         EphemeralContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   Resize:                     "",
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ResourceClaimStatuses:      nil,
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:           }
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]:  >


Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.730751     718 status_manager.go:833] "Sync pod status" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" statusUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" version=3
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.735733     718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"}}"
Apr 30 09:38:18 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:18.735835     718 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-68969998d9-7nqcn" statusVersion=3

Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:28.730445     718 status_manager.go:984] "Pod status is inconsistent with cached status for pod, a reconciliation should be triggered" pod="kube-system/test-68969998d9-7nqcn" statusDiff=<
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:           &v1.PodStatus{
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Phase: "Running",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Phase: "Failed",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   Conditions: []v1.PodCondition{
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           ... // 3 identical elements
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           {Type: "ContainersReady", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:18 +0000 UTC"}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                           {Type: "PodScheduled", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:37:16 +0000 UTC"}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                 {
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Type:               "DisruptionTarget",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Status:             "True",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         LastTransitionTime: s"2024-04-30 09:38:16 +0000 UTC",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Reason:             "TerminationByKubelet",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                         Message:            "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +                 },
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   },
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Message:           "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Message:           "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         Reason:            "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         Reason:            "Terminated",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   NominatedNodeName: "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   HostIP:            "192.168.161.11",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ... // 2 identical fields
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   PodIPs:                     {{IP: "10.2.0.131"}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   StartTime:                  s"2024-04-30 09:37:16 +0000 UTC",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         InitContainerStatuses:      nil,
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         InitContainerStatuses:      []v1.ContainerStatus{},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ContainerStatuses:          {{Name: "busybox", State: {Running: &{StartedAt: {Time: s"2024-04-30 09:37:17 +0000 UTC"}}}, Ready: true, Image: "docker.io/library/busybox:stable-glibc", ...}},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   QOSClass:                   "BestEffort",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         -         EphemeralContainerStatuses: nil,
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:         +         EphemeralContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   Resize:                     "",
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:                   ResourceClaimStatuses:      nil,
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:           }
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]:  >
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:28.731178     718 status_manager.go:833] "Sync pod status" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" statusUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" version=3
Apr 30 09:38:28 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:28.732865     718 kubelet.go:2505] "SyncLoop (housekeeping) end" duration="5ms"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.241307     718 kubelet.go:2885] "Container runtime status" status="Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=true reason: message:, ContainerdHasNoDeprecationWarnings=true reason: message:"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.242771     718 generic.go:224] "GenericPLEG: Relisting"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.243558     718 kuberuntime_manager.go:437] "Retrieved pods from runtime" all=true
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.362998     718 kubelet_node_status.go:538] "Updating node status"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.592638     718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"}}"
Apr 30 09:38:29 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:29.593094     718 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-68969998d9-7nqcn" statusVersion=3

Apr 30 09:38:30 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:30.727400     718 kubelet.go:2466] "SyncLoop (SYNC) pods" total=1 pods=["kube-system/test-68969998d9-7nqcn"]
Apr 30 09:38:30 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:30.727876     718 pod_workers.go:963] "Notifying pod of pending update" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" workType="terminating"

Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.047774     718 kubelet.go:2428] "SyncLoop RECONCILE" source="api" pods=["kube-system/test-68969998d9-7nqcn"]
Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.049563     718 status_manager.go:874] "Patch status for pod" pod="kube-system/test-68969998d9-7nqcn" podUID="459ad118-8ed7-476b-b92f-5fa2753b49e6" patch="{\"metadata\":{\"uid\":\"459ad118-8ed7-476b-b92f-5fa2753b49e6\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"DisruptionTarget\"},{\"type\":>
Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.049894     718 status_manager.go:883] "Status for pod updated successfully" pod="kube-system/test-68969998d9-7nqcn" statusVersion=4 status={"phase":"Failed","conditions":[{"type":"DisruptionTarget","status":"True","lastProbeTime":null,"lastTransitionTime":"2024-04-30T09:38:37Z","reason":"TerminationByKube>
Apr 30 09:38:38 lke-e2e-a77508-test-392f71e0 kubelet[718]: I0430 09:38:38.050132     718 status_manager.go:833] "Sync pod status" podUID="8999dbb6-9b04-4300-acde-41e81f01c62a" statusUID="8999dbb6-9b04-4300-acde-41e81f01c62a" version=1

Now logs from the same test but with the priority class name set to system-cluster-critical. I believe the crucial difference is that in the below example we never see a success message from the status_manager where the pod conditions are set to reflect the TerminationByKubelet state.

Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.028743    1278 nodeshutdown_manager_linux.go:375] "Shutdown manager killing pod with gracePeriod" pod="kube-system/test-7db8c76ddc-mnjmg" gracePeriod=10
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030611    1278 pod_workers.go:867] "Pod is being removed by the kubelet, begin teardown" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" updateType="kill"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030688    1278 pod_workers.go:963] "Notifying pod of pending update" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" workType="terminating"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030769    1278 pod_workers.go:970] "Cancelling current pod sync" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" workType="terminating"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030852    1278 pod_workers.go:1230] "Processing pod event" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" updateType="terminating"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.030933    1278 pod_workers.go:1352] "Pod worker has observed request to terminate" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031024    1278 kubelet.go:2011] "SyncTerminatingPod enter" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031102    1278 kubelet_pods.go:1673] "Generating pod status" podIsTerminal=false pod="kube-system/test-7db8c76ddc-mnjmg"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031201    1278 kubelet_pods.go:1686] "Got phase for pod" pod="kube-system/test-7db8c76ddc-mnjmg" oldPhase="Running" phase="Running"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031306    1278 status_manager.go:687] "updateStatusInternal" version=3 podIsFinished=false pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" containers="(busybox state=running previous=<none>)"
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031461    1278 kubelet.go:2021] "Pod terminating with grace period" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" gracePeriod=10
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031635    1278 kuberuntime_container.go:749] "Killing container with a grace period override" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" containerName="busybox" containerID="containerd://aad92529eec61d358f420531acfb67ac9213448057d41ceba090a98d08b4487a" gracePeri>
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031738    1278 kuberuntime_container.go:770] "Killing container with a grace period" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" containerName="busybox" containerID="containerd://aad92529eec61d358f420531acfb67ac9213448057d41ceba090a98d08b4487a" gracePeriod=10
Apr 30 09:08:11 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:11.031877    1278 event.go:376] "Event occurred" object="kube-system/test-7db8c76ddc-mnjmg" fieldPath="spec.containers{busybox}" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container busybox"


Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:14.846215    1278 status_manager.go:984] "Pod status is inconsistent with cached status for pod, a reconciliation should be triggered" pod="kube-system/test-7db8c76ddc-mnjmg" statusDiff=<
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:           &v1.PodStatus{
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         -         Phase: "Running",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +         Phase: "Failed",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   Conditions: []v1.PodCondition{
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                           ... // 3 identical elements
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                           {Type: "ContainersReady", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:07:58 +0000 UTC"}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                           {Type: "PodScheduled", Status: "True", LastTransitionTime: {Time: s"2024-04-30 09:07:40 +0000 UTC"}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                 {
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                         Type:               "DisruptionTarget",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                         Status:             "True",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                         LastTransitionTime: s"2024-04-30 09:08:11 +0000 UTC",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                         Reason:             "TerminationByKubelet",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                         Message:            "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +                 },
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   },
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         -         Message:           "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +         Message:           "Pod was terminated in response to imminent node shutdown.",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         -         Reason:            "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +         Reason:            "Terminated",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   NominatedNodeName: "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   HostIP:            "192.168.161.11",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   ... // 2 identical fields
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   PodIPs:                     {{IP: "10.2.0.131"}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   StartTime:                  s"2024-04-30 09:07:40 +0000 UTC",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         -         InitContainerStatuses:      nil,
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +         InitContainerStatuses:      []v1.ContainerStatus{},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   ContainerStatuses:          {{Name: "busybox", State: {Running: &{StartedAt: {Time: s"2024-04-30 09:07:57 +0000 UTC"}}}, Ready: true, Image: "docker.io/library/busybox:stable-glibc", ...}},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   QOSClass:                   "BestEffort",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         -         EphemeralContainerStatuses: nil,
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:         +         EphemeralContainerStatuses: []v1.ContainerStatus{},
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   Resize:                     "",
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:                   ResourceClaimStatuses:      nil,
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:           }
Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]:  >

Apr 30 09:08:14 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:14.848233    1278 status_manager.go:833] "Sync pod status" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" statusUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" version=3
Apr 30 09:08:15 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:15.038338    1278 request.go:629] Waited for 190.015469ms due to client-side throttling, not priority and fairness, request: GET:https://a89588f5-edcc-4788-b2ac-d1ed32f7e448.cpc-av8.avest.lkedevs.net:443/api/v1/namespaces/kube-system/pods/test-7db8c76ddc-mnjmg
Apr 30 09:08:15 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:15.043024    1278 status_manager.go:874] "Patch status for pod" pod="kube-system/test-7db8c76ddc-mnjmg" podUID="ae416f84-b43f-4490-be95-a7d1b556ae9b" patch="{\"metadata\":{\"uid\":\"ae416f84-b43f-4490-be95-a7d1b556ae9b\"}}"
Apr 30 09:08:15 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:15.043109    1278 status_manager.go:881] "Status for pod is up-to-date" pod="kube-system/test-7db8c76ddc-mnjmg" statusVersion=3

Apr 30 09:08:21 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:21.029637    1278 nodeshutdown_manager_linux.go:413] "Shutdown manager pod killing time out" gracePeriod=10 priority=2000000000
Apr 30 09:08:21 lke-e2e-a77508-test-392f71e0 kubelet[1278]: I0430 09:08:21.045581    1278 nodeshutdown_manager_linux.go:327] "Shutdown manager completed processing shutdown event, node will shutdown shortly"

My deployment definition:

// Assumed context: kubernetes is a client-go Clientset, and ctx and replicas
// are defined elsewhere in the test. Imports used by this snippet:
//
//   appsv1 "k8s.io/api/apps/v1"
//   v1     "k8s.io/api/core/v1"
//   metaV1 "k8s.io/apimachinery/pkg/apis/meta/v1"

deployment, err := kubernetes.AppsV1().Deployments("kube-system").Create(ctx, &appsv1.Deployment{
	ObjectMeta: metaV1.ObjectMeta{
		Name:   "test",
		Labels: map[string]string{"foo": "bar"},
	},
	Spec: appsv1.DeploymentSpec{
		Replicas: &replicas,
		Selector: &metaV1.LabelSelector{
			MatchLabels: map[string]string{"foo": "bar"},
		},
		Template: v1.PodTemplateSpec{
			ObjectMeta: metaV1.ObjectMeta{
				Labels: map[string]string{"foo": "bar"},
			},
			Spec: v1.PodSpec{
				PriorityClassName: "system-cluster-critical",
				Containers: []v1.Container{
					{
						Name:    "busybox",
						Image:   "busybox:stable-glibc",
						Command: []string{"sleep"},
						Args:    []string{"infinity"},
					},
				},
			},
		},
	},
}, metaV1.CreateOptions{})

@AnishShah
Contributor

What do you mean by "NodeGracefulShutdown fails"? Do you not see a pod status update? Is it possible to share the full kubelet logs?

/retitle NodeGracefulShutdown fail to update Pod status for system critical pods

@k8s-ci-robot k8s-ci-robot changed the title Setting tolerations for nodes being unreachable and/or not-ready causes Node Graceful Shutdown to not evict pods NodeGracefulShutdown fail to update Pod status for system critical pods May 1, 2024
@kwilczynski
Contributor

kwilczynski commented May 1, 2024

@avestuk, so this is no longer a problem related to setting up tolerations for some Pods?

The Graceful Node Shutdown manager does not have any means to convey that it will terminate some Pods, other than setting an appropriate node status and taints. As such, a potential race to evict and/or terminate Pods with the Taint (Eviction?) manager could take place.

The KEP and then the implementation most likely did not take such a scenario into account.

However, you sound as if this is no longer an issue?

@kwilczynski
Contributor

@avestuk, regarding your recent update:

How is your Graceful Node Shutdown configured at the moment? Would you be able to share the relevant configuration?

Also, are you testing on a worker node or a member of the control plane?

There is nothing specific about how a Pod of a certain priority is terminated. The order in which Pods are terminated differs, but the actual process is the same. Thus, I am not sure why the problem you are seeing only applies to Pods with some priority class set.

(taken from kubernetes/kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L452)

func groupByPriority(shutdownGracePeriodByPodPriority []kubeletconfig.ShutdownGracePeriodByPodPriority, pods []*v1.Pod) []podShutdownGroup {
	groups := make([]podShutdownGroup, 0, len(shutdownGracePeriodByPodPriority))
	for _, period := range shutdownGracePeriodByPodPriority {
		groups = append(groups, podShutdownGroup{
			ShutdownGracePeriodByPodPriority: period,
		})
	}

	for _, pod := range pods {
		var priority int32
		if pod.Spec.Priority != nil {
			priority = *pod.Spec.Priority
		}

		// Find the group index according to the priority.
		index := sort.Search(len(groups), func(i int) bool {
			return groups[i].Priority >= priority
		})

		// 1. Those higher than the highest priority default to the highest priority
		// 2. Those lower than the lowest priority default to the lowest priority
		// 3. Those boundary priority default to the lower priority
		// if priority of pod is:
		//   groups[index-1].Priority <= pod priority < groups[index].Priority
		// in which case we want to pick lower one (i.e index-1)
		if index == len(groups) {
			index = len(groups) - 1
		} else if index < 0 {
			index = 0
		} else if index > 0 && groups[index].Priority > priority {
			index--
		}

		groups[index].Pods = append(groups[index].Pods, pod)
	}
	return groups
}
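
For reference, below is a minimal standalone sketch (my own illustration, not the kubelet's actual code) of how that bucketing resolves pod priorities. It assumes the two groups that shutdownGracePeriod=30s / shutdownGracePeriodCriticalPods=10s translate into: 20 seconds for priority 0 and 10 seconds for the critical priority 2000000000, which matches the gracePeriod=20 and gracePeriod=10 values in the kubelet logs above.

package main

import (
	"fmt"
	"sort"
)

// shutdownBucket mirrors kubeletconfig.ShutdownGracePeriodByPodPriority,
// for illustration only.
type shutdownBucket struct {
	Priority                   int32
	ShutdownGracePeriodSeconds int64
}

// bucketFor applies the same search/adjustment logic as groupByPriority
// above to pick the grace period bucket for a single pod priority.
// buckets must be sorted by ascending Priority.
func bucketFor(buckets []shutdownBucket, podPriority int32) shutdownBucket {
	index := sort.Search(len(buckets), func(i int) bool {
		return buckets[i].Priority >= podPriority
	})
	if index == len(buckets) {
		index = len(buckets) - 1
	} else if index > 0 && buckets[index].Priority > podPriority {
		index--
	}
	return buckets[index]
}

func main() {
	// Assumed buckets derived from shutdownGracePeriod=30s and
	// shutdownGracePeriodCriticalPods=10s.
	buckets := []shutdownBucket{
		{Priority: 0, ShutdownGracePeriodSeconds: 20},
		{Priority: 2000000000, ShutdownGracePeriodSeconds: 10},
	}

	fmt.Println(bucketFor(buckets, 0))          // {0 20}: pod with no priority class
	fmt.Println(bucketFor(buckets, 2000000000)) // {2000000000 10}: system-cluster-critical pod
}

This only illustrates which grace period each pod is assigned; the actual termination and status handling is done elsewhere in the kubelet.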

What happens when you add another Pod and set the priority class to system-node-critical, which is even higher than system-cluster-critical, and then run your test again? Do both have the same issues you noticed?

What is the issue you are seeing? Are the pods' statuses not updated? Are the pods not terminated? Is the order in which things are terminated incorrect?

Related:

@avestuk
Author

avestuk commented May 2, 2024

@kwilczynski

Graceful Node Shutdown settings are:

-    shutdownGracePeriod: 30s
-    shutdownGracePeriodCriticalPods: 10s
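
For reference, a minimal sketch of how those two settings are expressed with the kubelet configuration types, assuming the k8s.io/kubelet/config/v1beta1 API (an illustration, not my actual node configuration):

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
)

func main() {
	// The same values as above: a 30s total shutdown budget, of which the
	// last 10s is reserved for critical pods, leaving 20s for regular pods.
	cfg := kubeletconfigv1beta1.KubeletConfiguration{
		ShutdownGracePeriod:             metav1.Duration{Duration: 30 * time.Second},
		ShutdownGracePeriodCriticalPods: metav1.Duration{Duration: 10 * time.Second},
	}

	fmt.Printf("shutdownGracePeriod=%s shutdownGracePeriodCriticalPods=%s\n",
		cfg.ShutdownGracePeriod.Duration, cfg.ShutdownGracePeriodCriticalPods.Duration)
}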

I am testing on a worker node.

The issue is that the pod status is not being updated to reflect TerminationByKubelet. This means the deployment controlling the pod never sees that another replica is needed, so no new pod is scheduled, and we end up waiting 300s for the Taint Manager eviction. My expectation is that when Graceful Node Shutdown is configured, all pods are evicted from the node and rescheduled elsewhere in the cluster.

I'll update the initial post to clarify what the current state of affairs is.

Projects
SIG Node Bugs
Needs Information
Development

No branches or pull requests

6 participants