Pods that have UnexpectedAdmissionError are not automatically removed. #124934

warlock2k · 2024-05-17T16:59:59Z

Background

I'm using the Smarter Device Manager to give Kubernetes (K8s) containers access to devices available on nodes (for example, /dev/kvm). However, when a node restarts, the Smarter Device Manager takes a while to initialize. As a result, pods scheduled on that node that require access to /dev/kvm fail with a status of UnexpectedAdmissionError and the following error:

...
Annotations:  <none>
Status:  Failed
Reason:  UnexpectedAdmissionError
Message:  Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices smarter-devices/kvm, which is unexpected

After a while, a new pod for the job is scheduled, but the older pod remains in the errored state and is not removed.

Kubernetes version

Client Version: v1.26.9
Kustomize Version: v4.5.7
Server Version: v1.26.11

What did I expect to happen?

I expect that pods with status UnexpectedAdmissionError are removed automatically by Kubernetes.

Steps to reproduce

Create a job that deploys a pod needing access to device drivers (say /dev/kvm through smarter-device-manager).
Restart the node on which this job is deployed.
You will see that the pod pertaining to the aforementioned job doesn't get scheduled.
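
For anyone trying to reproduce this, a minimal sketch of such a job is shown below. It builds a Job manifest as a Python dict and prints it as JSON. The resource name smarter-devices/kvm is taken from the error message above; the job name, image, and command are hypothetical placeholders, not the actual workload from this report.

```python
import json


def kvm_job_manifest(name="kvm-demo", image="busybox", resource="smarter-devices/kvm"):
    """Build a batch/v1 Job whose single container requests one unit of `resource`.

    `name`, `image`, and the container command are illustrative assumptions.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "main",
                            "image": image,
                            "command": ["ls", "-l", "/dev/kvm"],
                            # Extended resources exposed by a device plugin
                            # are requested under resources.limits.
                            "resources": {"limits": {resource: "1"}},
                        }
                    ],
                }
            }
        },
    }


# Print the manifest as JSON so it can be piped to kubectl.
print(json.dumps(kvm_job_manifest(), indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -`.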

OS

PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Linux 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux


k8s-ci-robot · 2024-05-17T17:00:09Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

neolit123 · 2024-05-18T07:51:52Z

I expect that pods with status UnexpectedAdmissionError are removed automatically by Kubernetes.

/sig node
for triage

ffromani · 2024-05-20T09:45:19Z

Pods in the Failed state, including those with UnexpectedAdmissionError, aren't automatically cleaned up, so that users can inspect the failure conditions (e.g. logs). To enable safe automatic cleanup we need clear, well-known errors, which will probably require a (minor) RFE.

warlock2k · 2024-05-21T14:26:14Z

@ffromani pods in the UnexpectedAdmissionError state actually do not have any logs, only the status that can be seen through the kubectl describe command. I do believe that enabling automatic safe cleanup is a useful feature to have and would greatly help us. Do you suggest I raise an RFE for this?

ffromani · 2024-05-21T14:33:21Z

@ffromani pods in the UnexpectedAdmissionError state actually do not have any logs, only the status that can be seen through the kubectl describe command. I do believe that enabling automatic safe cleanup is a useful feature to have and would greatly help us. Do you suggest I raise an RFE for this?

The key in this case is the Failed state, which UnexpectedAdmissionError is one of the reasons for. I think the first step is to have a clearly identifiable error, which can enable automatic cleanup later on. I'm not sure the base system should do this cleanup, but a cleaning controller would then be pretty trivial to create.
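
As a rough illustration of how small such cleanup logic could be, here is a sketch that is not an existing Kubernetes component. It assumes the pod list has the shape produced by `kubectl get pods -o json` and selects pods whose `status.phase` is Failed and whose `status.reason` is UnexpectedAdmissionError. The function names and the sample pod names are hypothetical, and the script only builds the delete commands rather than running them.

```python
def rejected_pods(pod_list, reason="UnexpectedAdmissionError"):
    """Return (namespace, name) pairs for Failed pods with the given status.reason."""
    matches = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == reason:
            meta = pod.get("metadata", {})
            matches.append((meta.get("namespace", "default"), meta["name"]))
    return matches


def delete_commands(matches):
    """Build one `kubectl delete pod` argument list per matched pod (not executed here)."""
    return [["kubectl", "delete", "pod", name, "--namespace", ns] for ns, name in matches]


# Hypothetical sample input mimicking `kubectl get pods -o json`.
sample = {
    "items": [
        {
            "metadata": {"namespace": "default", "name": "kvm-job-abc12"},
            "status": {"phase": "Failed", "reason": "UnexpectedAdmissionError"},
        },
        {
            "metadata": {"namespace": "default", "name": "kvm-job-def34"},
            "status": {"phase": "Running"},
        },
        {
            "metadata": {"namespace": "default", "name": "other-xyz99"},
            "status": {"phase": "Failed", "reason": "Evicted"},
        },
    ]
}

for cmd in delete_commands(rejected_pods(sample)):
    print(" ".join(cmd))
```

Matching on both the phase and the reason keeps other Failed pods, whose logs may still be useful, out of the cleanup. In practice the input would come from the live cluster instead of the sample dict.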

warlock2k · 2024-05-23T14:16:28Z

Alright, thanks!

k8s-ci-robot added the needs-sig and needs-triage labels May 17, 2024

k8s-ci-robot added the sig/node label and removed the needs-sig label May 18, 2024

warlock2k closed this as completed May 23, 2024
