[ceph] Improve health metrics #2852

Merged
merged 2 commits into from Sep 16, 2016

olivielpeau
Member

What does this PR do?

  • Catch KeyError on osd_pool_stats
  • Improve health metrics: ceph.num_full_osds and ceph.num_near_full_osds now report values that match their names, and a new ceph.osd.pct_used metric is added (see explanation below).

Motivation

  • Check resilience
  • Improved usability of the health metrics

Testing

Updated the existing mock tests, and made them more detailed

Additional Notes

health metrics

Previous behavior: the ceph.num_near_full_osds and
ceph.num_full_osds metrics would either:

  • take the value 0 and be tagged with no osd tag if no osd is
    reporting health issues
  • take a value representing the usage percentage, tagged by osd

This doesn't really make sense with respect to the name of the metrics
(num_*). To solve this, replace these metrics with the following:

  • ceph.num_near_full_osds and ceph.num_full_osds report the total
    number of osds that are respectively near full and full. Not tagged
    by osd.
  • when some osds report health issues, the check sends a
    ceph.osd.pct_used metric which reports the usage
    percentage, tagged by osd. Unfortunately we can't send 0 values
    on this metric when no osd reports health issues since we can't tag by
    osd in that case.

This should make these metrics more usable. Also, use gauge since
there's no reason to use count.
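
To illustrate the new behavior, here is a minimal sketch; it is not the code from this PR. It assumes each health detail entry is a string shaped like "osd.2 is near full at 85%", and the health_metrics helper, the ceph_osd tag name, and the (metric, value, tags) sample format are all hypothetical. The real check's parsing and tag names may differ.

```python
import re

# Assumed shape of a health detail line; the real Ceph output may differ.
OSD_LINE = re.compile(r"^(osd\.\d+) is (near full|full) at (\d+)%$")


def health_metrics(detail_lines, base_tags=()):
    """Return a list of (metric, value, tags) gauge samples.

    Hypothetical helper: it only mirrors the behavior described above.
    """
    near_full = 0
    full = 0
    samples = []
    for line in detail_lines:
        match = OSD_LINE.match(line)
        if match is None:
            # Not an osd fullness line, ignore it.
            continue
        osd, state, pct = match.groups()
        if state == "near full":
            near_full += 1
        else:
            full += 1
        # The per-osd usage percentage is tagged by osd.
        samples.append(
            ("ceph.osd.pct_used", int(pct), list(base_tags) + ["ceph_osd:" + osd])
        )
    # The counts are always sent, even when they are 0, and are not tagged by osd.
    samples.append(("ceph.num_near_full_osds", near_full, list(base_tags)))
    samples.append(("ceph.num_full_osds", full, list(base_tags)))
    return samples
```

With an empty list of detail lines, this sketch yields only the two count samples with value 0 and no ceph.osd.pct_used sample, which matches the limitation noted above.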

Catch KeyError on osd_pool_stats: we already do it for all the other
keys of the raw dict, so let's do it for this one too (even though I'm
not a huge fan of this "try/except everything" approach).
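
A rough sketch of what catching the KeyError can look like, assuming raw is a plain dict of parsed command output. The pool_read_rates helper and the pool_name, client_io_rate and read_bytes_sec fields are assumptions made for illustration, not the check's actual code.

```python
import logging

log = logging.getLogger(__name__)


def pool_read_rates(raw):
    """Return {pool_name: read_bytes_sec}, tolerating missing keys.

    Hypothetical helper: field names are assumed for illustration.
    """
    rates = {}
    try:
        for pool in raw["osd_pool_stats"]:
            io_rate = pool["client_io_rate"]
            # An idle pool may report no read rate; treat that as 0.
            rates[pool["pool_name"]] = io_rate.get("read_bytes_sec", 0)
    except KeyError:
        # A missing key stops the loop; pools already processed are kept.
        log.debug("Error retrieving osd_pool_stats metrics")
    return rates
```

The trade-off of this style is that a missing key is only logged at debug level, so the rest of the check keeps running but the absence of data is easy to overlook.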
@olivielpeau
Member Author

cc @vagelim: Let me know what you think of these changes

@vagelim
Contributor

vagelim commented Sep 16, 2016

This is a much better solution! I was unsure how to send this data to DD in a useful way. Please ping me when these changes have made it to prod, I will need to update the default screenboard accordingly.

@olivielpeau olivielpeau added this to the 5.9.0 milestone Sep 16, 2016
@olivielpeau
Member Author

Merging, thanks for the review!

@vagelim: this will be released with 5.9.0, we'll ping you when it's out. Could you also update the metrics metadata with these new metrics if you get a chance? :)

@olivielpeau olivielpeau merged commit c9a01de into master Sep 16, 2016
@olivielpeau olivielpeau deleted the olivielpeau/ceph-improve-health-metrics branch September 16, 2016 16:04
@vagelim
Contributor

vagelim commented Sep 23, 2016

The metadata: https://github.com/DataDog/dogweb/pull/15018
