Add additional node metrics to monitor cpu throttling #8290

onurdialpad · 2021-01-05T17:55:45Z

What does this PR do?

Adds node metrics that let users monitor CPU throttling in an Elasticsearch cluster.

Motivation

We use Datadog to monitor and alert on our Elasticsearch cluster, which is managed by Elastic and runs on GCP. We need to see how much CPU the cgroup containing Elasticsearch uses and whether CPU throttling is increasing. These metrics give that visibility to anyone with a similar need.
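
As a rough illustration of how the two cgroup values could answer "is throttling increasing?", the sketch below computes the fraction of scheduler periods that were throttled between two samples. It assumes both values are cumulative counters, as in the cgroup cpu.stat file. The CpuStat type and the throttled_fraction helper are hypothetical names, not part of the integration.

```python
from typing import NamedTuple, Optional


class CpuStat(NamedTuple):
    """One sample of the two cgroup counters (hypothetical container type)."""
    elapsed_periods: int   # os.cgroup.cpu.stat.number_of_elapsed_periods
    times_throttled: int   # os.cgroup.cpu.stat.number_of_times_throttled


def throttled_fraction(previous: CpuStat, current: CpuStat) -> Optional[float]:
    """Fraction of periods between two samples in which the cgroup was throttled.

    Returns None when no new periods elapsed or a counter went backwards
    (for example after a restart), since no meaningful ratio exists then.
    """
    periods = current.elapsed_periods - previous.elapsed_periods
    throttled = current.times_throttled - previous.times_throttled
    if periods <= 0 or throttled < 0:
        return None
    return throttled / periods


# Made-up sample values: 100 new periods, 25 of them throttled.
ratio = throttled_fraction(CpuStat(1000, 10), CpuStat(1100, 35))
```

A ratio that keeps rising across successive samples would suggest the CPU quota is becoming a bottleneck.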

Additional Notes

No

Review checklist (to be filled by reviewers)

Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
PR title must be written as a CHANGELOG entry (see why)
File changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
PR must have changelog/ and integration/ labels attached

yzhan289

Thanks for the PR! Small nits.

elastic/datadog_checks/elastic/__about__.py

requirements-agent-release.txt

yzhan289 · 2021-01-05T22:55:28Z

elastic/datadog_checks/elastic/metrics.py

+    'elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods': (
+        'gauge',
+        'os.cgroup.cpu.stat.number_of_elapsed_periods',
+    ),
+    'elasticsearch.cgroup.cpu.stat.number_of_times_throttled': (
+        'gauge',
+        'os.cgroup.cpu.stat.number_of_times_throttled',
+    ),
+    'elasticsearch.process.cpu.percent': ('gauge', 'process.cpu.percent'),
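
Each entry above maps a Datadog metric name to a metric type and a dotted path into the node stats payload. The sketch below shows one way such a path could be resolved. The thread does not show the check's actual traversal code, so resolve_path, collect, and the sample payload are assumptions for illustration only.

```python
from typing import Any, Dict, Optional, Tuple

# The three entries added in this PR: metric name -> (metric type, dotted stats path).
CPU_METRICS: Dict[str, Tuple[str, str]] = {
    'elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods': (
        'gauge',
        'os.cgroup.cpu.stat.number_of_elapsed_periods',
    ),
    'elasticsearch.cgroup.cpu.stat.number_of_times_throttled': (
        'gauge',
        'os.cgroup.cpu.stat.number_of_times_throttled',
    ),
    'elasticsearch.process.cpu.percent': ('gauge', 'process.cpu.percent'),
}


def resolve_path(node_stats: dict, dotted_path: str) -> Optional[Any]:
    """Walk a nested dict along a dotted path; return None if any key is missing."""
    value: Any = node_stats
    for key in dotted_path.split('.'):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def collect(node_stats: dict) -> Dict[str, Tuple[str, Any]]:
    """Return {metric name: (metric type, value)} for every path found in the payload."""
    found = {}
    for name, (metric_type, path) in CPU_METRICS.items():
        value = resolve_path(node_stats, path)
        if value is not None:
            found[name] = (metric_type, value)
    return found


# Hypothetical, heavily trimmed stats for a single node.
sample_node = {
    'os': {'cgroup': {'cpu': {'stat': {
        'number_of_elapsed_periods': 1200,
        'number_of_times_throttled': 30,
    }}}},
    'process': {'cpu': {'percent': 7}},
}
collected = collect(sample_node)
```

Skipping missing paths keeps the sketch tolerant of nodes that do not report cgroup statistics.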


Please also add these three new metrics to https://github.com/DataDog/integrations-core/blob/master/elastic/metadata.csv

mgarabed

Thanks for the submission, one more nit!

mgarabed · 2021-01-06T15:22:49Z

elastic/metadata.csv

+elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods,gauge,integer, The number of reporting periods that have elapsed
+elasticsearch.cgroup.cpu.stat.number_of_times_throttled,gauge,integer, The number of times all tasks in the same cgroup as the Elasticsearch process have been throttled
+elasticsearch.process.cpu.percent,gauge,integer, CPU usage in percent, or -1 if not known at the time the stats are computed


The CI reported an error with these lines; this should fix it:

Suggested change

-elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods,gauge,integer, The number of reporting periods that have elapsed
-elasticsearch.cgroup.cpu.stat.number_of_times_throttled,gauge,integer, The number of times all tasks in the same cgroup as the Elasticsearch process have been throttled
-elasticsearch.process.cpu.percent,gauge,integer, CPU usage in percent, or -1 if not known at the time the stats are computed
+elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods,gauge,,integer,,The number of reporting periods that have elapsed,0,elasticsearch,cgroup cpu stat
+elasticsearch.cgroup.cpu.stat.number_of_times_throttled,gauge,,integer,,The number of times all tasks in the same cgroup as the Elasticsearch process have been throttled,0,elasticsearch,cpu stat throttled
+elasticsearch.process.cpu.percent,gauge,,integer,,CPU usage in percent, or -1 if not known at the time the stats are computed,0,elasticsearch,process cpu percent
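
The rows in this thread can be checked locally with a small script before pushing. The sketch below is not the CI validator: the nine-column header is assumed from the shape of the suggested rows, and KNOWN_UNITS is a hypothetical subset standing in for the real list of valid unit names.

```python
import csv
import io
from typing import List

# Assumed column layout of elastic/metadata.csv, inferred from the suggested rows.
HEADER = [
    'metric_name', 'metric_type', 'interval', 'unit_name', 'per_unit_name',
    'description', 'orientation', 'integration', 'short_name',
]

# Hypothetical subset of accepted unit names; an empty unit is treated as allowed.
KNOWN_UNITS = {'', 'percent', 'occurrence', 'second'}


def check_row(line: str) -> List[str]:
    """Return a list of problems found in a single metadata.csv row."""
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != len(HEADER):
        # An unquoted comma in the description shows up here as an extra column.
        return [f'expected {len(HEADER)} columns, got {len(row)}']
    record = dict(zip(HEADER, row))
    problems = []
    if record['unit_name'] not in KNOWN_UNITS:
        problems.append(f"'{record['unit_name']}' is not a known unit_name")
    return problems


original = ('elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods,gauge,integer,'
            ' The number of reporting periods that have elapsed')
suggested = ('elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods,gauge,,integer,,'
             'The number of reporting periods that have elapsed,0,elasticsearch,cgroup cpu stat')
unquoted_comma = ('elasticsearch.process.cpu.percent,gauge,,integer,,CPU usage in percent,'
                  ' or -1 if not known at the time the stats are computed,'
                  '0,elasticsearch,process cpu percent')
# One possible corrected form (not necessarily what was merged): quote the description.
quoted = ('elasticsearch.process.cpu.percent,gauge,,percent,,"CPU usage in percent,'
          ' or -1 if not known at the time the stats are computed",'
          '0,elasticsearch,process cpu percent')
```

Under these assumptions the original row fails on column count, the suggested row fails only on the unit name, and the unquoted comma in the description splits the last row into ten columns.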

onurdialpad

The CI reported a similar error: "integer is an invalid unit_name." Is it OK to add "integer" to VALID_UNIT_NAMES in metadata.py?

mgarabed

Correct, "integer" is not a valid unit name. I updated my suggestion and also removed an extra comma.

- add additional node metrics

41ce9a3

onurdialpad requested a review from a team as a code owner January 5, 2021 17:55

onurdialpad added 4 commits January 5, 2021 11:15

- update version for agent

beec8ce

- fix linter

bbdb9a4

- fix tests

27b53b7

- use formatter to format style

80a1009

yzhan289 reviewed Jan 5, 2021

- add metrics into metadata, undo versioning

2b2e7e3

mgarabed requested changes Jan 6, 2021

Update elastic/metadata.csv

410f403

Co-authored-by: Mike Garabedian <mike@mercuryrising.net>

mgarabed requested changes Jan 7, 2021

Update elastic/metadata.csv

e718b52

Co-authored-by: Mike Garabedian <mike@mercuryrising.net>

mgarabed added changelog/Added integration/elastic labels Jan 14, 2021

mgarabed approved these changes Jan 14, 2021

mgarabed merged commit 4157704 into DataDog:master Jan 14, 2021
