easy to run out of disk with prometheus #2523

ryandawsonuk · 2020-10-05T10:58:55Z

I've a cluster with ~20 models running and it ran out of disk after 9 days. It was allocated 32GB.

Options I can see on this:

1. Scrape less often. The 1s scrape interval on seldon-core-analytics produces a lot of data. Changing that to 5s should allow us to run 5x longer before hitting the limit.
2. Increase disk allocation.
3. Try the new size-based retention option for Prometheus, which the Helm chart seems to support.

Options 2 and 3 are up to the user. But we dictate the scrape interval, so we should change it.
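As a rough sanity check on the 5x figure, here is a back-of-the-envelope sketch. It assumes disk usage grows linearly with the number of samples scraped, which ignores compression, index overhead, and series churn, so treat the result as an estimate only. The function name `days_until_full` is made up for illustration.

```python
def days_until_full(observed_days, observed_interval_s, new_interval_s):
    """Estimate days until the disk fills at a new scrape interval.

    Assumption: disk growth is proportional to sample count, so the
    time-to-full scales linearly with the scrape interval.
    """
    return observed_days * (new_interval_s / observed_interval_s)


disk_gb = 32        # allocation reported in this issue
observed_days = 9   # time to fill at a 1s scrape interval

# Approximate daily growth at the current 1s interval.
gb_per_day_at_1s = disk_gb / observed_days

print(f"growth at 1s: ~{gb_per_day_at_1s:.2f} GB/day")
print(f"estimated days until full at 5s: {days_until_full(observed_days, 1, 5)}")
```

Under that linear assumption, the same 32GB would last roughly 45 days at a 5s interval instead of 9.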

It was set to 1s for an outlier example. That example reports outlier or not with a simple gauge, so scraping less often does risk missing some outliers there: if an outlier comes in and then a non-outlier comes in before the next scrape, you've missed your opportunity to record the outlier. But that example has been removed anyway - f005422#diff-c70b42df21318c0cde7ea9e8f05ac093
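To illustrate the gauge problem described above, here is a small simulation. It does not use Prometheus at all: it just models a gauge that holds the last written value and a scraper that samples it at fixed intervals. The event timestamps and the `scraped_values` helper are hypothetical, chosen only to show the effect.

```python
def scraped_values(events, interval_s, duration_s):
    """Return the gauge value seen at each scrape.

    events: list of (timestamp_s, gauge_value), sorted by timestamp,
            where 1 means "outlier" and 0 means "not an outlier".
    The gauge keeps only the most recent value, so anything overwritten
    between two scrapes is never observed.
    """
    seen = []
    value = 0   # gauge starts at "not an outlier"
    idx = 0
    t = interval_s
    while t <= duration_s:
        # Apply every update that happened up to this scrape time.
        while idx < len(events) and events[idx][0] <= t:
            value = events[idx][1]
            idx += 1
        seen.append((t, value))
        t += interval_s
    return seen


# Hypothetical traffic: two outliers, each followed by a non-outlier.
events = [(0.5, 1), (1.5, 0), (6.2, 1), (9.8, 0)]

for interval in (1, 5):
    values = [v for _, v in scraped_values(events, interval, 10)]
    print(f"{interval}s scrape: outlier seen = {any(values)}")
```

With this made-up traffic, the 1s scrape catches both outliers, while the 5s scrape sees neither, because each outlier is overwritten before the next scrape. Note that a 1s interval has the same blind spot for an outlier and a non-outlier arriving within the same second.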


ryandawsonuk mentioned this issue Oct 5, 2020

increase scrape interval to reduce disk usage #2524

Merged

seldondev closed this as completed in #2524 Oct 5, 2020
