
Increase user watches on nodes #772

Closed
mbrancato opened this issue Jan 2, 2019 · 16 comments

@mbrancato

What happened:
On a well-sized cluster, I started getting "no space left on device" issues when trying to run kubectl logs on pods.

What you expected to happen:
Logs print and tail successfully.

How to reproduce it (as minimally and precisely as possible):
Run kubectl logs

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.9.9
  • Size of cluster (how many worker nodes are in the cluster?): 4
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.): Java app, CronJobs
  • Others: Kured is deployed and the cluster regularly drains and reboots nodes. Actual filesystem quotas were not full.

It looks like the fix for me was to increase the fs.inotify.max_user_watches value. The AKS nodes are deployed with the default of 8192. Can this value be increased to avoid this issue in the future?
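
For reference, a minimal way to check the current limit and raise it on a node (this assumes shell access to the node, e.g. over SSH; the 1048576 value is just an example):

# Show the current inotify watch limit (AKS nodes defaulted to 8192 at the time).
cat /proc/sys/fs/inotify/max_user_watches

# Raise it until the next reboot, then persist it across reboots.
sudo sysctl -w fs.inotify.max_user_watches=1048576
echo 'fs.inotify.max_user_watches=1048576' | sudo tee /etc/sysctl.d/99-inotify.conf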

@kvolkovich-sc

kvolkovich-sc commented Jan 2, 2019

Have the same issue. (k8s 1.11.5)

@paulgmiller
Member

We saw this too, but concluded that inotify objects were being leaked by the OS/Docker, so raising the limit only lengthened the interval between incidents. We did a periodic rolling reboot instead (on AKS this can be az aks upgrade to the same version).
There is some thought that moving to Ubuntu 18.04 will be a better fix (it may solve the leak), but the AKS team has a decent amount of work to do before you can choose between 16.04 and 18.04.
Azure/acs-engine#4251

@mbrancato
Author

@paulgmiller Kured effectively provides periodic reboots. The systems where I noticed this had 18 days of uptime.

@rubroboletus

Same problem here, temporarily solved by a DaemonSet that sets fs.inotify.max_user_watches to a larger value using sysctl, and that also mounts the /etc/sysctl.d directory and creates a file there with fs.inotify.max_user_watches=OUR_VALUE as its content, so the setting survives reboots. A rough sketch is below.
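
A minimal sketch of what such a DaemonSet might look like (not the poster's exact manifest; the names, image, and value are illustrative, and it assumes privileged containers are permitted on the cluster):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: more-inotify-watches
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: more-inotify-watches
  template:
    metadata:
      labels:
        app: more-inotify-watches
    spec:
      # A privileged init container sets the sysctl on the host kernel,
      # then the pod parks on a pause container so the DaemonSet stays healthy.
      initContainers:
      - name: set-sysctl
        image: busybox:1.36
        securityContext:
          privileged: true
        command: ["sh", "-c", "sysctl -w fs.inotify.max_user_watches=1048576"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9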

@vaclav-dvorak

@rubroboletus we ended up with the same "hack" on our clusters. It would be nice not to have to resort to this.

@paulgmiller
Member

Azure/aks-engine#223

@xinyanmsft

We run into this limit sometimes when many of the containers are started with nodemon (these containers are used by engineers to test and debug code).

@xinyanmsft

more-watches.yaml.txt
I've been using this YAML file to increase the user watches on AKS nodes: kubectl apply -f more-watches.yaml, then restart the nodes. Hope this helps.

@andrewschmidt-a

We are experiencing this issue on our clusters. Also, when we run the above DaemonSet we get outages in our workloads.

How exactly does it do a rolling update? Do you have to ensure that there are ReplicaSets spreading pods across nodes?

@andrewschmidt-a

If anyone is interested, this is how we are solving it for now. We would love to see a resolution to this issue:

name="name_of_aks"
rg="name_of_resource_group"
version=$(az aks show -n $name -g $rg | jq -r ".kubernetesVersion" )

az aks upgrade -k $version -n $name -g $rg -y --no-wait

@palma21
Member

palma21 commented Dec 30, 2019

Increased to fs.inotify.max_user_watches = 1048576

@palma21 palma21 closed this as completed Dec 30, 2019
@hameno

hameno commented Jan 6, 2020

@palma21 Could you please expand on what changed? Has this been rolled out to Azure AKS? If so, which version? How do we get the fix?

@palma21
Member

palma21 commented Jan 6, 2020

Sure
Done in this PR: Azure/aks-engine#1801
Merged in AKS Engine Release: https://github.com/Azure/aks-engine/releases/tag/v0.42.0
Released in AKS in https://github.com/Azure/AKS/releases/tag/2019-11-18

@palma21
Member

palma21 commented Jan 6, 2020

Any upgrade done after the AKS release date (2019-11-18) will include this change (since it's a node setting change, it requires an upgrade).
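
To confirm the new value after upgrading, one quick check (an assumed convenience, not an official AKS procedure) is to read it from a throwaway pod:

# Prints the inotify watch limit of whichever node the pod lands on.
kubectl run check-watches --rm -it --restart=Never --image=busybox:1.36 -- cat /proc/sys/fs/inotify/max_user_watches

Without a nodeSelector this only samples one node, so repeat or pin the pod if you need to check every node.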

@hameno

hameno commented Jan 6, 2020

Thanks, will trigger upgrades as soon as possible

@StepanKuksenko

more-watches.yaml.txt
I've been using this YAML file to increase the user watches on AKS nodes: kubectl apply -f more-watches.yaml, then restart the nodes. Hope this helps.

@xinyanmsft why do you mount the /sys directory into the container? I tried configuring it without that and everything seems to be OK. But maybe I'm missing something...

@ghost ghost locked as resolved and limited conversation to collaborators Jul 27, 2020