Add an efficient, user homedirectory size prometheus reporter #2621

Merged
merged 2 commits into from
Jul 7, 2023

Conversation

yuvipanda
Member

This deploys https://github.com/yuvipanda/prometheus-dirsize-exporter, an efficient collector of per-user home directory stats (size, number of files, last modified date, etc.). It is capped at 250 IO operations per second so that it does not overwhelm NFS servers. Metrics are refreshed every 2h after the previous run completes, although on large servers (like LEAP) a single run can take many hours at just 250 IO operations per second. This is perfectly fine, as we do not need up-to-date information. Trading metric latency for minimal resource usage is a good deal here.
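
To illustrate the throttling idea described above, here is a minimal sketch of a directory walk that spends one unit of an IO budget per directory listing and per entry `stat`. It is not the exporter's actual code: the names `IOBudget` and `dir_stats` are hypothetical, and counting one "IO operation" per listing and per entry is an assumption made for illustration.

```python
import os
import time


class IOBudget:
    """Pace operations so that at most `iops` of them happen per second.

    Hypothetical helper for illustration, not the exporter's real API.
    """

    def __init__(self, iops, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / iops
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def spend(self):
        # Sleep until the next operation is allowed, then schedule the one after it.
        wait = self.next_allowed - self.clock()
        if wait > 0:
            self.sleep(wait)
        self.next_allowed = max(self.clock(), self.next_allowed) + self.interval


def dir_stats(path, budget):
    """Walk `path` and return file size total, entry count, and latest mtime.

    Assumption: each directory listing and each entry stat costs one operation.
    Only file sizes are summed; the sizes of directory inodes are ignored.
    """
    total_size = 0
    entry_count = 0
    latest_mtime = 0.0
    stack = [path]
    while stack:
        current = stack.pop()
        budget.spend()  # one operation for listing the directory
        with os.scandir(current) as entries:
            for entry in entries:
                budget.spend()  # one operation for the stat call
                st = entry.stat(follow_symlinks=False)
                entry_count += 1
                latest_mtime = max(latest_mtime, st.st_mtime)
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    total_size += st.st_size
    return {
        "size_bytes": total_size,
        "entries": entry_count,
        "latest_mtime": latest_mtime,
    }
```

Calling `dir_stats("/home/some-user", IOBudget(250))` (the path is a placeholder) would walk that directory at no more than 250 operations per second. The `clock` and `sleep` parameters exist only so the pacing can be tested without real waiting.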

Ref #764

@yuvipanda yuvipanda requested a review from a team as a code owner June 6, 2023 17:13
@github-actions

github-actions bot commented Jun 6, 2023

Merging this PR will trigger the following deployment actions.

Support and Staging deployments

| Cloud Provider | Cluster Name | Upgrade Support? | Reason for Support Redeploy | Upgrade Staging? | Reason for Staging Redeploy |
| --- | --- | --- | --- | --- | --- |
| gcp | leap | No | | Yes | Core infrastructure has been modified |
| aws | smithsonian | No | | Yes | Core infrastructure has been modified |
| aws | 2i2c-aws-us | No | | Yes | Core infrastructure has been modified |
| aws | gridsst | No | | Yes | Core infrastructure has been modified |
| gcp | meom-ige | No | | Yes | Core infrastructure has been modified |
| gcp | linked-earth | No | | Yes | Core infrastructure has been modified |
| gcp | m2lines | No | | Yes | Core infrastructure has been modified |
| gcp | pangeo-hubs | No | | Yes | Core infrastructure has been modified |
| aws | jupyter-meets-the-earth | No | | Yes | Core infrastructure has been modified |
| aws | victor | No | | Yes | Core infrastructure has been modified |
| aws | carbonplan | No | | Yes | Core infrastructure has been modified |
| gcp | callysto | No | | Yes | Core infrastructure has been modified |
| gcp | 2i2c-uk | No | | Yes | Core infrastructure has been modified |
| aws | nasa-veda | No | | Yes | Core infrastructure has been modified |
| kubeconfig | utoronto | No | | Yes | Core infrastructure has been modified |
| aws | nasa-cryo | No | | Yes | Core infrastructure has been modified |
| gcp | cloudbank | No | | Yes | Core infrastructure has been modified |
| gcp | 2i2c | No | | Yes | Core infrastructure has been modified |
| aws | openscapes | No | | Yes | Core infrastructure has been modified |
| gcp | qcl | No | | Yes | Core infrastructure has been modified |
| gcp | awi-ciroh | No | | Yes | Core infrastructure has been modified |
| aws | ubc-eoas | No | | Yes | Core infrastructure has been modified |

Production deployments

| Cloud Provider | Cluster Name | Hub Name | Reason for Redeploy |
| --- | --- | --- | --- |
| gcp | leap | prod | Core infrastructure has been modified |
| aws | smithsonian | prod | Core infrastructure has been modified |
| aws | 2i2c-aws-us | researchdelight | Core infrastructure has been modified |
| aws | 2i2c-aws-us | ncar-cisl | Core infrastructure has been modified |
| aws | gridsst | prod | Core infrastructure has been modified |
| gcp | meom-ige | prod | Core infrastructure has been modified |
| gcp | linked-earth | prod | Core infrastructure has been modified |
| gcp | m2lines | prod | Core infrastructure has been modified |
| gcp | pangeo-hubs | prod | Core infrastructure has been modified |
| aws | jupyter-meets-the-earth | prod | Core infrastructure has been modified |
| aws | victor | prod | Core infrastructure has been modified |
| aws | carbonplan | prod | Core infrastructure has been modified |
| gcp | callysto | prod | Core infrastructure has been modified |
| gcp | 2i2c-uk | lis | Core infrastructure has been modified |
| aws | nasa-veda | prod | Core infrastructure has been modified |
| kubeconfig | utoronto | prod | Core infrastructure has been modified |
| kubeconfig | utoronto | r-prod | Core infrastructure has been modified |
| aws | nasa-cryo | prod | Core infrastructure has been modified |
| gcp | cloudbank | ccsf | Core infrastructure has been modified |
| gcp | cloudbank | csm | Core infrastructure has been modified |
| gcp | cloudbank | elcamino | Core infrastructure has been modified |
| gcp | cloudbank | glendale | Core infrastructure has been modified |
| gcp | cloudbank | howard | Core infrastructure has been modified |
| gcp | cloudbank | miracosta | Core infrastructure has been modified |
| gcp | cloudbank | skyline | Core infrastructure has been modified |
| gcp | cloudbank | demo | Core infrastructure has been modified |
| gcp | cloudbank | fresno | Core infrastructure has been modified |
| gcp | cloudbank | laney | Core infrastructure has been modified |
| gcp | cloudbank | lassen | Core infrastructure has been modified |
| gcp | cloudbank | sbcc | Core infrastructure has been modified |
| gcp | cloudbank | lacc | Core infrastructure has been modified |
| gcp | cloudbank | mills | Core infrastructure has been modified |
| gcp | cloudbank | palomar | Core infrastructure has been modified |
| gcp | cloudbank | pasadena | Core infrastructure has been modified |
| gcp | cloudbank | sjcc | Core infrastructure has been modified |
| gcp | cloudbank | tuskegee | Core infrastructure has been modified |
| gcp | cloudbank | avc | Core infrastructure has been modified |
| gcp | cloudbank | csu | Core infrastructure has been modified |
| gcp | 2i2c | hackanexoplanet | Core infrastructure has been modified |
| gcp | 2i2c | demo | Core infrastructure has been modified |
| gcp | 2i2c | ohw | Core infrastructure has been modified |
| gcp | 2i2c | pfw | Core infrastructure has been modified |
| gcp | 2i2c | catalyst-cooperative | Core infrastructure has been modified |
| gcp | 2i2c | aup | Core infrastructure has been modified |
| gcp | 2i2c | temple | Core infrastructure has been modified |
| gcp | 2i2c | ucmerced | Core infrastructure has been modified |
| gcp | 2i2c | cosmicds | Core infrastructure has been modified |
| gcp | 2i2c | climatematch | Core infrastructure has been modified |
| aws | openscapes | prod | Core infrastructure has been modified |
| gcp | qcl | prod | Core infrastructure has been modified |
| gcp | awi-ciroh | prod | Core infrastructure has been modified |
| aws | ubc-eoas | prod | Core infrastructure has been modified |

@yuvipanda
Member Author

Experimental grafana dashboard on LEAP cluster:

image

Lookin good.

As it churns through this, it is being CPU limited at the 0.05 we have given it, which seems alright with me.

Contributor

@pnasrat pnasrat left a comment

Could you add an entry, say in the support section of the infra docs, on where to look for these dashboards?

Maybe by expanding the page below:

https://infrastructure.2i2c.org/sre-guide/support/home-dir/

yuvipanda added a commit to yuvipanda/grafana-dashboards that referenced this pull request Jun 6, 2023
Monitoring each user's home directory size is an often requested
feature for z2jh, and kinda important to prevent disk space related
outages when using shared clusters. A single user can easily
accidentally take up the space of an entire NFS share, and without
an easy way to monitor this, it can be hard to let them know before
the disk is full.

Admins could manually run `du` on an NFS server now and then, but
this is time consuming and error prone. It can also hammer the
server hard: millions of IO operations are performed, as pretty
much every file must be touched to calculate directory sizes.

https://github.com/yuvipanda/prometheus-dirsize-exporter does
this efficiently, primarily by allowing you to throttle how many
IO operations per second it can make. This prevents the NFS
server from becoming unusable due to IO saturation, at the cost of
how up to date these stats are. As we don't need up-to-the-minute
stats here, this is acceptable.

Test deployment at 2i2c-org/infrastructure#2621
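
The latency trade-off described in this commit message can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes that a scan costs roughly one IO operation per directory entry and uses the 250 operations per second cap and 2h wait mentioned in the pull request description. The function names are hypothetical and the entry counts used with them are made-up inputs, not measurements.

```python
def estimate_scan_hours(entries, iops=250):
    """Hours needed to touch `entries` entries at `iops` operations per second.

    Assumption: roughly one IO operation per directory entry.
    """
    if iops <= 0:
        raise ValueError("iops must be positive")
    return entries / iops / 3600


def refresh_cycle_hours(entries, iops=250, wait_hours=2):
    """Hours between the start of one scan and the start of the next.

    Assumption: the next scan starts `wait_hours` after the previous one ends.
    """
    return estimate_scan_hours(entries, iops) + wait_hours
```

Under these assumptions, a share with 1,000,000 entries would take about 1.1 hours to scan, while one with 50,000,000 entries would take about 55.6 hours, which is why stats on large servers can lag well behind the current state.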
@yuvipanda
Member Author

@pnasrat the graph panel doesn't exist yet, I just made a PR to add it: jupyterhub/grafana-dashboards#80.

I'll add some docs too.

@pnasrat
Contributor

pnasrat commented Jun 6, 2023

ah great. basically lgtm but will wait for your updates

@yuvipanda
Member Author

@pnasrat given the situation with utoronto and q planning, I won't be able to write docs this week. Are you ok with me merging this now so it can start collecting data, and I'll work on docs before end of the month?

@pnasrat
Contributor

pnasrat commented Jun 12, 2023

@yuvipanda As an async team, documentation for new services, including operational procedures (e.g. how to turn them off), is necessary to ensure the team can both use and troubleshoot new services.

If you want to merge without docs, please put this behind a specific feature flag, then enable and test it on a single staging hub cluster to prove it out, rather than turning it on everywhere.

@yuvipanda
Member Author

Alright, I'll just mark this as draft and let this be until I have time to come back to it.

@yuvipanda yuvipanda marked this pull request as draft June 12, 2023 11:29
We don't have anyone using that!
@yuvipanda
Member Author

I think the right place for documenting this is upstream, so I've finally opened jupyterhub/grafana-dashboards#84 to set up an actual docs site for the grafana dashboards. Once that's merged, I'll add a page documenting this panel.

@yuvipanda yuvipanda marked this pull request as ready for review June 20, 2023 22:38
@yuvipanda
Member Author

I've added documentation about this graph to jupyterhub/grafana-dashboards#84, so I think this is ready for review again.

Contributor

@pnasrat pnasrat left a comment

LGTM

Can you make sure you announce this on the engineering channel, so folks know there is a new component and graph they might find useful?

@yuvipanda
Member Author

Thanks @pnasrat, and will do once I confirm data is coming in!

@yuvipanda yuvipanda merged commit 1aac5c8 into 2i2c-org:master Jul 7, 2023
@github-actions

github-actions bot commented Jul 7, 2023

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/5488684979

yuvipanda added a commit to yuvipanda/grafana-dashboards that referenced this pull request Jul 25, 2024
2 participants