[ceph] Improve health metrics #2852

olivielpeau · 2016-09-16T13:53:38Z

What does this PR do?

Catch KeyError on osd_pool_stats
Improve health metrics: ceph.num_full_osds and ceph.num_near_full_osds now report values that match their names, and a new ceph.osd.pct_used metric is added (see explanation below).

Motivation

Check resilience
Improved usability of the health metrics

Testing

Updated the existing mock tests, and made them more detailed

Additional Notes

health metrics

Previous behavior: the ceph.num_near_full_osds and
ceph.num_full_osds metrics would either:

take the value 0 and be tagged with no osd tag if no osd is
reporting health issues
take a value representing the usage percentage, tagged by osd

This doesn't really make sense with respect to the name of the metrics
(num_*). To solve this, replace these metrics with the following:

ceph.num_near_full_osds and ceph.num_full_osds report the total
number of osds that are respectively near full and full. Not tagged
by osd.
when some osds report health issues, the check sends a
ceph.osd.pct_used metric which reports the usage
percentage, tagged by osd. Unfortunately we can't send 0 values
on this metric when no osd reports health issues since we can't tag by
osd in that case.

This should make these metrics more usable. Also, use gauge since
there's no reason to use count.
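
To illustrate the new semantics, here is a minimal sketch, not the actual check code. It assumes health messages of the form "osd.N is near full at P%" or "osd.N is full at P%"; the `health_metrics` helper, the message format, and the `osd:` tag format are all hypothetical.

```python
import re

# Assumed message format; the real health output may differ.
_OSD_USAGE = re.compile(r"^(osd\.\d+) is (near full|full) at (\d+)%$")


def health_metrics(messages):
    """Return (name, value, tags) gauge samples for a list of health messages."""
    near_full = 0
    full = 0
    samples = []
    for message in messages:
        match = _OSD_USAGE.match(message)
        if match is None:
            # Not an OSD usage message; ignore it.
            continue
        osd, state, pct = match.groups()
        if state == "full":
            full += 1
        else:
            near_full += 1
        # Usage percentage, tagged by osd (hypothetical tag format).
        samples.append(("ceph.osd.pct_used", int(pct), ["osd:" + osd]))
    # The counts are always sent, untagged, including when they are 0.
    samples.append(("ceph.num_near_full_osds", near_full, []))
    samples.append(("ceph.num_full_osds", full, []))
    return samples
```

With no unhealthy osd, only the two untagged counts are returned (both 0), and no ceph.osd.pct_used sample is produced, which matches the limitation described above.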


olivielpeau · 2016-09-16T14:03:00Z

cc @vagelim: Let me know what you think of these changes

vagelim · 2016-09-16T14:15:10Z

This is a much better solution! I was unsure how to send this data to DD in a useful way. Please ping me when these changes have made it to prod, I will need to update the default screenboard accordingly.

olivielpeau · 2016-09-16T16:04:02Z

Merging, thanks for the review!

@vagelim: this will be released with 5.9.0, we'll ping you when it's out. Could you also update the metrics metadata with these new metrics if you get a chance? :)

vagelim · 2016-09-23T18:03:23Z

The metadata: https://github.com/DataDog/dogweb/pull/15018

olivielpeau added 2 commits September 16, 2016 14:34

[ceph] Catch KeyError on osd_pool_stats

e6cf383

We do it for all the other keys of the raw dict, so let's do it for this one too (even though I'm not a huge fan of this "try/except everything" approach).
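
The pattern described in this commit message could look roughly like the sketch below. It assumes `raw` is a plain dict keyed by command name; the `pool_stats` helper and the empty-list fallback are hypothetical and not taken from the check itself.

```python
def pool_stats(raw):
    """Return the osd_pool_stats entry, or an empty list when it is absent."""
    try:
        return raw["osd_pool_stats"]
    except KeyError:
        # Treat a missing key as "no data" (an assumption of this sketch)
        # so that the rest of the check can still run.
        return []
```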

olivielpeau added the improvement label Sep 16, 2016

olivielpeau added this to the 5.9.0 milestone Sep 16, 2016

masci approved these changes Sep 16, 2016


olivielpeau merged commit c9a01de into master Sep 16, 2016

olivielpeau deleted the olivielpeau/ceph-improve-health-metrics branch September 16, 2016 16:04
