Add an efficient, user homedirectory size prometheus reporter #2621
Merging this PR will trigger the following deployment actions. Support and Staging deployments
Production deployments
Could you add an entry, say in the support section of the infra docs, on where to look for these dashboards?
Maybe by expanding the below.
Monitoring each user's home directory size is an often-requested feature for z2jh, and important for preventing disk space related outages on shared clusters. A single user can easily and accidentally fill an entire NFS share, and without an easy way to monitor this, it is hard to warn them before the disk is full. Admins could manually run `du` on an NFS server now and then, but this is time-consuming and error-prone. It can also hammer the server hard: millions of IO operations are performed, as pretty much every file must be touched to calculate directory sizes. https://github.com/yuvipanda/prometheus-dirsize-exporter does this efficiently, primarily by letting you throttle how many IO operations per second it makes. This keeps the NFS server from becoming unusable due to IO saturation, at the cost of how up to date these stats are. As we don't need up-to-the-minute stats here, this is acceptable. Test deployment at 2i2c-org/infrastructure#2621
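To make the throttling idea above concrete, here is a minimal sketch of a directory walk that spends at most a fixed number of filesystem calls per second. It is not the code of prometheus-dirsize-exporter: the class name `ThrottledWalker`, its parameters, and the choice to count one operation per `scandir` call and one per directory entry are all assumptions made for illustration.

```python
import os
import time


class ThrottledWalker:
    """Sum file sizes under a directory while capping filesystem calls per second.

    Hypothetical illustration only; not the exporter's actual implementation.
    """

    def __init__(self, max_ops_per_second, clock=time.monotonic, sleep=time.sleep):
        # Minimum spacing between two consecutive operations.
        self.min_interval = 1.0 / max_ops_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()
        self.ops = 0

    def _spend_one_op(self):
        # Wait until the next operation slot opens, then reserve the following one.
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        self.ops += 1

    def directory_size(self, path):
        total = 0
        self._spend_one_op()  # budget for listing the directory
        with os.scandir(path) as entries:
            for entry in entries:
                self._spend_one_op()  # budget for inspecting this entry
                if entry.is_dir(follow_symlinks=False):
                    total += self.directory_size(entry.path)
                else:
                    total += entry.stat(follow_symlinks=False).st_size
        return total
```

For example, `ThrottledWalker(250).directory_size("/home/some-user")` (a hypothetical path) would take longer on large directories but never issue more than about 250 of these calls per second, which is the trade-off described above.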
@pnasrat the graph panel doesn't exist yet, I just made a PR to add it: jupyterhub/grafana-dashboards#80. I'll add some docs too.
Ah great. Basically LGTM, but I will wait for your updates.
@pnasrat given the situation with utoronto and q planning, I won't be able to write docs this week. Are you ok with me merging this now so it can start collecting data, and I'll work on docs before end of the month?
@yuvipanda As an async team, documentation for new services, including operational procedures (e.g. how to turn them off), is necessary to ensure the team can both use and troubleshoot them. If you want to merge without docs, please add a specific feature flag and test in a single staging hub cluster to prove it out, rather than turning it on everywhere.
Alright, I'll just mark this as draft and let this be until I have time to come back to it.
This deploys https://github.com/yuvipanda/prometheus-dirsize-exporter, an *efficient* per-user home directory stats (size, number of files, last modified date, etc.) collector. It is capped at no more than 250 IO operations per second, so it does not overwhelm NFS servers. Metrics are refreshed 2h after each run completes, although on large servers (like LEAP) a single run can take many hours at just 250 IO operations per second. This is perfectly fine though, as we do not need 'up to date' information. Trading off metric latency for minimal resource usage is a good deal here. Ref 2i2c-org#764
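As a rough back-of-the-envelope check on the "many hours" remark, the sketch below estimates how long one full scan takes at a capped IO rate. The operation counts are made-up round numbers, not measurements from LEAP or any other hub, and the estimate assumes roughly one IO operation per file.

```python
def estimated_scan_hours(io_operations, max_ops_per_second=250):
    """Lower bound on scan duration in hours at a capped IO rate."""
    return io_operations / max_ops_per_second / 3600


# Hypothetical operation counts, for illustration only.
for operations in (100_000, 1_000_000, 10_000_000):
    hours = estimated_scan_hours(operations)
    print(f"{operations:,} operations -> {hours:.1f} h")
```

At 250 operations per second this gives about 0.1 h, 1.1 h and 11.1 h respectively, so a home directory tree with tens of millions of files would indeed take a day or more per pass.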
We don't have anyone using that!
I think the right place for documenting this is upstream, so I've finally opened jupyterhub/grafana-dashboards#84 to set up an actual docs site for the grafana dashboard. Once that's merged, I'll add a page documenting this panel.
I've added documentation about this graph to jupyterhub/grafana-dashboards#84, so I think this is ready for review again.
LGTM
Can you make sure you announce this on the engineering channel, so folks know there is a new component and graph they might find useful?
Thanks @pnasrat, and will do once I confirm data is coming in!
🎉🎉🎉🎉 Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/5488684979