Nodes go into Not Ready state after load testing #102
Comments
A few questions.
|
It was a memory- and CPU-intensive operation. After that, I added memory & CPU limits on all the pods and the issue no longer reproduces. But in any case, once the pod is in the "Ready" state it should never go to the "NotReady" state. I don't have logs of it since it was unresponsive and I had to delete it. |
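For reference, a minimal sketch of the kind of per-container requests and limits described above. This is not the original deployment; the name, image, and values are illustrative assumptions:

```yaml
# Hypothetical Deployment snippet: capping each container's CPU/memory so a
# load test cannot starve the kubelet and system daemons on the node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-test-app               # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: load-test-app
  template:
    metadata:
      labels:
        app: load-test-app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:latest   # placeholder image
          resources:
            requests:               # what the scheduler reserves
              cpu: "250m"
              memory: "256Mi"
            limits:                 # hard ceiling enforced at runtime
              cpu: "500m"
              memory: "512Mi"
```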
I have this exact issue. It is easy to reproduce. I run AKS cluster 1 with a pod whose external IP is exposed by a Kubernetes Service. I run another cluster, AKS 2, where I run JMeter. From JMeter on AKS 2, I hit the service on AKS 1 with 1500 requests/sec and the nodes become NotReady:
When I use
I am also in the process of adding resource restrictions to the deployments, but I just thought the cluster should still recover after such a scenario. I have added the events JSON. Unable to get anything out of heapster.
kube-system status:
Hope this info helps you guys troubleshoot further if needed. |
Same issue after overloading the cluster with more replicas than would fit in the available RAM.
|
One of my nodes seems to be hitting a similar problem.
|
I'm having the same error with my nodes. Here is what I did:
Here are some logs for my issue:
Here is the description of one of my nodes:
All processes running on one of the nodes:
Disk consumption on one of the nodes:
Kernel logs from one of the nodes:
NIC logs:
journalctl records for the service:
|
Closing due to inactivity. Feel free to re-open if this is still an issue. |
We are experiencing the same behavior; the cluster is losing nodes due to load, especially with a 1-core setup (DS1 VM). |
Same here -> multiple nodes in NotReady status, presumably because we don't have any resource quotas on pods yet. kubectl describe node shows
Nodes themselves: aks-nodepool1-37134528-0 NotReady agent 4d v1.10.6. What is spooky is that the nodes go NotReady at exactly 20:05 every evening and are back in the Ready state at 08:00 in the morning. |
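If missing per-pod limits really are the trigger, a namespace-wide default can enforce them even for manifests that omit a resources block. A minimal sketch, assuming the workloads run in the default namespace; the values are illustrative:

```yaml
# Hypothetical LimitRange: applies a default request and limit to every
# container in the namespace that does not declare its own.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits    # illustrative name
  namespace: default                # assumed namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      default:
        cpu: "500m"
        memory: "512Mi"
```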
Experiencing a similar issue in a cluster built through acs-engine: all manner of instability - 504s, containers losing connectivity to the DB, random container restarts. We've applied the azure-cni-networkmonitor daemonset "patch" but are still experiencing a high level of networking issues.
|
PLEG Unhealthy is a known defect in upstream Kubernetes, with patches that look like they will land in k8s 1.16: |
We are experiencing the same issue. When I deployed a StatefulSet containing some PVCs, disk provisioning seems to have caused the problem; the node's "Resource health" page says "We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk". |
I'm fairly convinced this problem is due to disk performance getting so slow that the nodes can't write logs fast enough.
|
@mooperd you’re probably right. If you have a cluster with a 1 TB OS disk (which I think is a P4-class premium disk with a maximum IOPS of 200), the OS disk IO gets so high that disk IO contention causes this. |
I think it's more a problem of the underlying infrastructure. I have seen disk r/w down to 10 KB/s on Azure instances. Lower standard IOPS on nodes wouldn't necessarily be a problem, as the writes wouldn't be synchronous - the operating system would stream them to disk.
Anyone seeing this issue should have a look at top on all the nodes (not just the one affected) and see if any of the processes are spending a lot of time in 'wait'.
|
@mooperd I've been debugging this for the service. You're right in most normal IO cases, and top will not show the underlying throttling of the OS disk. Go back to my example: a node with a 1 TB OS disk, running Linux. A 1 TB disk on that page has a maximum throughput, but it also has a maximum IOPS - 5000 max IOPS per disk - and that is your OS disk. Now factor in the size of the containers: larger docker containers have worse disk IO transaction patterns, and the Azure system counts anything doing up to 256 KiB of IO as one IOP. On the OS disk you also have the docker daemon, the kubelet, and in-memory FS drivers (say cifs, etc.). Looking at kube-metrics data only shows the in-memory kube object view, not the OS/Docker level, which means it misses the system-level IO calls. So in addition to the normal VM limitations you also have the cache limit / max etc.: |
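A rough back-of-the-envelope calculation makes the point concrete. It assumes the 5000 IOPS cap quoted above and a ~4 KiB size for small log/metadata writes (an assumption, not a measured value):

```latex
% Throughput achievable while pinned at the IOPS cap depends on IO size:
%   T \approx \mathrm{IOPS}_{\max} \times \text{IO size}
5000\,\tfrac{\mathrm{ops}}{\mathrm{s}} \times 4\,\mathrm{KiB} \approx 20\,\mathrm{MiB/s}
\qquad \text{vs.} \qquad
5000\,\tfrac{\mathrm{ops}}{\mathrm{s}} \times 256\,\mathrm{KiB} \approx 1250\,\mathrm{MiB/s}
```

So a node issuing many small writes (container logs, docker metadata, kubelet state) can saturate the 5000 IOPS limit while moving only about 20 MiB/s, and every other consumer of the OS disk then queues behind the throttle.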
Please also see this issue for intermittent NodeNotReady, DNS latency, and other crashes related to system load: #1373 |
Action required from @Azure/aks-pm |
Issue needing attention of @Azure/aks-leads |
@Azure/aks-pm issue needs labels |
Sorry about the spam; the bot issue should be fixed now. A lot of the issues on this ticket have been solved or mitigated upstream or in recent versions of AKS. We realize that perhaps not all problems have been addressed, since this thread has run fairly long. If you open a ticket with your specific issue, we can look into it. Please also refer to recent features that will add increased stability and resilience:
|
I was doing load testing on the AKS cluster. Many times, after firing heavy load at the cluster, the nodes go into the "Not Ready" state and never return to the "Ready" state.
What is the resolution to this problem? How can I bring the nodes back?