
Increase user watches on nodes #772

Closed
mbrancato opened this issue Jan 2, 2019 · 16 comments

@mbrancato

What happened:
On a well-sized cluster, I started getting "no space left on device" issues when trying to run kubectl logs on pods.

What you expected to happen:
Logs print and tail successfully.

How to reproduce it (as minimally and precisely as possible):
Run kubectl logs

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.9.9
  • Size of cluster (how many worker nodes are in the cluster?): 4
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.): Java app, CronJobs
  • Others: Kured is deployed and the cluster regularly drains and reboots nodes. Actual filesystem quotas were not full.

It looks like the fix for me was to increase the fs.inotify.max_user_watches value. The AKS nodes are deployed with the default of 8192. Can this value be increased to avoid this issue in the future?
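
For reference, a minimal way to check the current limit and raise it on a node (this assumes shell access to the node, e.g. over SSH; the 1048576 value is just an example):

# Show the current inotify watch limit (AKS nodes defaulted to 8192 at the time).
cat /proc/sys/fs/inotify/max_user_watches

# Raise it until the next reboot, then persist it across reboots.
sudo sysctl -w fs.inotify.max_user_watches=1048576
echo 'fs.inotify.max_user_watches=1048576' | sudo tee /etc/sysctl.d/99-inotify.conf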

@kvolkovich-sc

kvolkovich-sc commented Jan 2, 2019

Have the same issue. (k8s 1.11.5)

@paulgmiller
Member

We saw this too, but concluded that inotify objects were being leaked by the OS/Docker, so raising the limit only lengthened the interval between incidents. We did a periodic rolling reboot instead (on AKS this can be az aks upgrade to the same version).
There is some thought that moving to Ubuntu 18.04 will be a better fix (it may solve the leak), but the AKS team has a decent amount of work to do before you can choose between 16.04 and 18.04.
Azure/acs-engine#4251

@mbrancato
Author

@paulgmiller Kured effectively provides periodic reboots. The systems where I noticed this had 18 days of uptime.

@rubroboletus

Same problem here, temporarily solved by a DaemonSet that sets fs.inotify.max_user_watches to a larger value using sysctl, and that also mounts the /etc/sysctl.d directory and creates a file there with fs.inotify.max_user_watches=OUR_VALUE as its content, so the setting survives reboots. A rough sketch is below.
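
A minimal sketch of what such a DaemonSet might look like (not the poster's exact manifest; the names, image, and value are illustrative, and it assumes privileged containers are permitted on the cluster):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: more-inotify-watches
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: more-inotify-watches
  template:
    metadata:
      labels:
        app: more-inotify-watches
    spec:
      # A privileged init container sets the sysctl on the host kernel,
      # then the pod parks on a pause container so the DaemonSet stays healthy.
      initContainers:
      - name: set-sysctl
        image: busybox:1.36
        securityContext:
          privileged: true
        command: ["sh", "-c", "sysctl -w fs.inotify.max_user_watches=1048576"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9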

@vaclav-dvorak

@rubroboletus we ended up with the same "hack" on our clusters. It would be nice not to have to resort to this.

@paulgmiller
Member

Azure/aks-engine#223

@xinyanmsft

We run into this limit sometimes when many of the containers are started with nodemon (these containers are used by engineers to test and debug code).

@xinyanmsft

more-watches.yaml.txt
I've been using this YAML file to increase the user watches on AKS nodes: kubectl apply -f more-watches.yaml, then restart the nodes. Hope this helps.

@andrewschmidt-a

We are experiencing this issue on our clusters. Also, when we run the above DaemonSet we get outages in our workloads.

How exactly does it do a rolling update? Do you have to ensure that there are ReplicaSets spreading pods across nodes?

@andrewschmidt-a

If anyone is interested, this is how we are solving it for now. We would love to see a resolution to this issue:

name="name_of_aks"
rg="name_of_resource_group"
version=$(az aks show -n $name -g $rg | jq -r ".kubernetesVersion" )

az aks upgrade -k $version -n $name -g $rg -y --no-wait

@palma21
Member

palma21 commented Dec 30, 2019

Increased to fs.inotify.max_user_watches = 1048576

@palma21 palma21 closed this as completed Dec 30, 2019
@hameno

hameno commented Jan 6, 2020

@palma21 Could you please expand on what changed? Has this been rolled out to Azure AKS? If so, which version? How do we get the fix?

@palma21
Member

palma21 commented Jan 6, 2020

Sure
Done in this PR: Azure/aks-engine#1801
Merged in AKS Engine Release: https://github.com/Azure/aks-engine/releases/tag/v0.42.0
Released in AKS in https://github.com/Azure/AKS/releases/tag/2019-11-18

@palma21
Member

palma21 commented Jan 6, 2020

Any upgrade done after the AKS release date (2019-11-18) will include this change (since it's a node setting change, it requires an upgrade).
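
To confirm the new value after upgrading, one quick check (an assumed convenience, not an official AKS procedure) is to read it from a throwaway pod:

# Prints the inotify watch limit of whichever node the pod lands on.
kubectl run check-watches --rm -it --restart=Never --image=busybox:1.36 -- cat /proc/sys/fs/inotify/max_user_watches

Without a nodeSelector this only samples one node, so repeat or pin the pod if you need to check every node.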

@hameno

hameno commented Jan 6, 2020

Thanks, will trigger upgrades as soon as possible

@StepanKuksenko

more-watches.yaml.txt
I've been using this YAML file to increase the user watches on AKS nodes: kubectl apply -f more-watches.yaml, then restart the nodes. Hope this helps.

@xinyanmsft why do you mount the /sys directory into the container? I tried configuring it without that and everything seems to be OK. But maybe I'm missing something...

@ghost ghost locked as resolved and limited conversation to collaborators Jul 27, 2020