This repository has been archived by the owner on Mar 14, 2023. It is now read-only.

[Question] Why sometimes the node-termination is not able to delete all the pods #38

Open
santinoncs opened this issue Dec 2, 2020 · 5 comments

Comments

@santinoncs

Hi,

I have preemptible nodes with more than 40 pods each.
For some reason node-termination is not able to delete all the pods. It starts, and after it has deleted around 20 pods, it stops. There are no further logs after that point.
I also tried deleting the pods at the same time as the pod listing at

eviction.go:66

takes place, but with no success either.

Thanks for your help

@toms049

toms049 commented Dec 17, 2020

Hi,
I have the same issue.

I tried to test it and checked the output logs, but with no luck.
It deletes just 6 pods, in the order they are listed.
There are no further logs after that.

Thanks for any ideas.

@toms049

toms049 commented Dec 18, 2020

I did some testing. It looks like it does the job, but only if there are fewer than 11 pods on a node. In that case it removes all of them. Otherwise it gets stuck: it processes just a few of the pods and ends suddenly, with no further logs. The remaining pods keep running until the node hardware shuts down, so it takes a long time for k8s to handle and reschedule them.
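
The thread does not establish why the handler stops partway. One possibility, which is an assumption and not confirmed here, is that deletions issued one after another do not fit into the time available before the node shuts down. A minimal sketch of issuing the deletions concurrently under a deadline follows. It is not the handler's implementation: `delete_all`, `delete_pod`, `deadline_seconds`, and `workers` are hypothetical names, and `delete_pod` stands in for whatever call actually deletes a pod.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def delete_all(pods, delete_pod, deadline_seconds=30.0, workers=16):
    """Call delete_pod(pod) for every pod concurrently, within a time budget.

    Returns (completed, unfinished) as sorted lists of pod names.
    delete_pod is a hypothetical callable standing in for the real API call.
    """
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = {pool.submit(delete_pod, pod): pod for pod in pods}

    # Wait for as many deletions as fit into the budget.
    done, _ = wait(futures, timeout=deadline_seconds)

    # Do not block on stragglers; drop work that has not started yet
    # (cancel_futures requires Python 3.9 or newer).
    pool.shutdown(wait=False, cancel_futures=True)

    completed = sorted(futures[f] for f in done if f.exception() is None)
    unfinished = sorted(
        futures[f]
        for f in futures
        if f not in done or f.exception() is not None
    )
    return completed, unfinished
```

Logging the `unfinished` list before exiting would at least show which pods were left behind, which is the information missing from the reports above.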

@laxmiprasanna-gunna

Hi, I am facing the same issue. I see from the Google docs that a preempted node gets 30 seconds before it is deleted:
This value is set to TRUE as soon as the instance is marked to be preempted but there might be some delay between the G2 signal and the instance metadata value query receiving a response with value 'TRUE'. In essence after the preempted value is set to “TRUE”, the instance would be preempted within 30 seconds.
But when I run node-termination-handler, I don't think it is capturing the right signal, because it doesn't seem to get the full 30 seconds to delete all the pods present on the node.
It was able to delete only some of the pods and then exited without any further logs.
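
For reference, the metadata value quoted above can be queried directly. The following is a minimal sketch, not the handler's actual code. It assumes the standard GCE metadata endpoint for `instance/preempted` and the required `Metadata-Flavor: Google` header. The function names and the `fetch` parameter are hypothetical, introduced so the check can be exercised outside GCE.

```python
import urllib.request

# Standard GCE metadata endpoint for the preemption flag ("TRUE" or "FALSE").
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def fetch_metadata(url: str, timeout: float = 5.0) -> str:
    """Read one metadata value; only works from inside a GCE instance."""
    req = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def is_preempted(fetch=fetch_metadata) -> bool:
    """Return True if the metadata server reports the instance as preempted.

    fetch is a hypothetical injection point so the check can be tested
    without a metadata server.
    """
    return fetch(PREEMPTED_URL).strip().upper() == "TRUE"
```

Logging a timestamp when `is_preempted()` first returns `True`, and another when the last pod deletion is attempted, would show how much of the 30 seconds is actually available to the handler.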

@santinoncs
Author

I followed the GCP article

https://cloud.google.com/solutions/running-web-applications-on-gke-using-cost-optimized-pvms-and-traffic-director#post-preemption_validations

and applied its recommendations, including the DaemonSet that creates a systemd service that blocks the shutdown of the kubelet process.

I also delegate the deletion to an external service, running in a pod in another namespace, so that all the pods are deleted from outside the machine that is being deleted/preempted. With this solution the deletion of pods is always done from outside the affected node.

But with no success.
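
As an illustration of what such an external service would send to the Kubernetes API, here is a minimal sketch that only builds the requests, without sending them. The function names and the `grace_seconds` default are hypothetical. The paths are the core API's pod list (filtered with a `spec.nodeName` field selector) and the pod `eviction` subresource. Authentication and the HTTP client are left out.

```python
import json


def pods_on_node_path(node_name: str) -> str:
    """GET path listing the pods, in all namespaces, scheduled on one node."""
    return f"/api/v1/pods?fieldSelector=spec.nodeName={node_name}"


def eviction_request(namespace: str, pod: str, grace_seconds: int = 10):
    """Return (path, JSON body) for a POST to the pod's eviction subresource."""
    path = f"/api/v1/namespaces/{namespace}/pods/{pod}/eviction"
    body = {
        # Clusters older than v1.22 use policy/v1beta1 instead.
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod, "namespace": namespace},
        "deleteOptions": {"gracePeriodSeconds": grace_seconds},
    }
    return path, json.dumps(body)
```

An eviction goes through PodDisruptionBudget checks and can be refused, whereas a plain `DELETE` on the pod does not, so which one the service uses affects what happens under time pressure.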

@santinoncs
Author

I am seeing these events from Kubernetes when node-termination tries to delete the pods:

TaintManagerEviction | Cancelling deletion of Pod yyy/xx

Do you know what this means?
