
easy to run out of disk with prometheus #2523

Closed
ryandawsonuk opened this issue Oct 5, 2020 · 0 comments · Fixed by #2524

ryandawsonuk (Contributor) commented Oct 5, 2020

I have a cluster with ~20 models running, and Prometheus ran out of disk after 9 days. It was allocated 32GB.

Options I can see on this:

  1. Scrape less often. The 1s scrape interval in seldon-core-analytics produces a lot of data. Changing it to 5s should let us run roughly 5x longer before hitting the limit.
  2. Increase the disk allocation.
  3. Try the newer size-based retention option for Prometheus, which the Helm chart seems to support.
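To make the trade-off in option 1 concrete, here is a back-of-the-envelope estimate in Python. It assumes disk usage grows linearly with the number of samples ingested (so inversely with the scrape interval) and extrapolates from the 9 days / 32GB observed above. Prometheus compression and retention settings mean real numbers will differ; treat this as a rough sketch only.

```python
# Rough estimate, assuming disk usage scales linearly with samples ingested,
# i.e. inversely with the scrape interval. Not a model of Prometheus TSDB internals.
OBSERVED_DAYS = 9    # days until the disk filled at the 1s interval (from this issue)
DISK_GB = 32         # disk allocated (from this issue)
BASE_INTERVAL_S = 1  # current scrape interval in seldon-core-analytics


def days_until_full(interval_s: float) -> float:
    """Estimate days until DISK_GB fills, scaling linearly from the observed 9 days."""
    return OBSERVED_DAYS * interval_s / BASE_INTERVAL_S


def gb_per_day(interval_s: float) -> float:
    """Average disk growth per day implied by the linear estimate."""
    return DISK_GB / days_until_full(interval_s)


if __name__ == "__main__":
    for interval in (1, 5, 15):
        print(f"{interval:>2}s interval: ~{days_until_full(interval):.0f} days, "
              f"~{gb_per_day(interval):.2f} GB/day")
```

Under this assumption, moving from 1s to 5s stretches the observed 9 days to about 45 days on the same 32GB.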

Options 2 and 3 are up to the user, but we dictate the scrape interval, so we should change it.

The interval was set to 1s for an outlier-detection example. That example reports whether a request is an outlier using a simple gauge, so scraping less often does risk missing some outliers: if an outlier request comes in and a non-outlier request follows before the next scrape, the gauge has already been overwritten and the outlier is never recorded. But that example has been removed anyway (see f005422#diff-c70b42df21318c0cde7ea9e8f05ac093).
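A minimal Python sketch of why a gauge can miss outliers at a longer scrape interval. The request stream is made up for illustration, and scrapes are assumed to fire at exact multiples of the interval starting at 0, which simplifies how Prometheus actually schedules scrapes.

```python
import math
from typing import List, Tuple

# Hypothetical request stream: (time in seconds, is_outlier). Each request
# overwrites the gauge, mimicking a "1 = outlier, 0 = not outlier" gauge.
Request = Tuple[float, bool]


def scraped_outliers(requests: List[Request], interval_s: float, end_s: float) -> int:
    """Count outlier requests whose gauge value is seen by at least one scrape.

    Assumes scrapes happen at 0, interval_s, 2 * interval_s, ... (a simplification).
    """
    seen = 0
    for i, (t, is_outlier) in enumerate(requests):
        if not is_outlier:
            continue
        # The gauge holds this value until the next request overwrites it.
        until = requests[i + 1][0] if i + 1 < len(requests) else end_s
        # First scrape at or after the outlier was written.
        first_scrape = math.ceil(t / interval_s) * interval_s
        if first_scrape < until:
            seen += 1
    return seen


stream = [(0.5, False), (2.2, True), (3.6, False), (6.1, True), (11.0, False)]

if __name__ == "__main__":
    for interval in (1, 5):
        n = scraped_outliers(stream, interval, end_s=12.0)
        print(f"{interval}s scrape interval: {n} of 2 outliers recorded")
```

With this stream, the 1s interval records both outliers, while at 5s the first outlier (at 2.2s) is overwritten at 3.6s before the scrape at 5s and is lost.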
