Add additional node metrics to monitor cpu throttling #8290
Thanks for the PR! Small nits.
'elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods': (
    'gauge',
    'os.cgroup.cpu.stat.number_of_elapsed_periods',
),
'elasticsearch.cgroup.cpu.stat.number_of_times_throttled': (
    'gauge',
    'os.cgroup.cpu.stat.number_of_times_throttled',
),
'elasticsearch.process.cpu.percent': ('gauge', 'process.cpu.percent'),
Please also add these three new metrics to https://github.com/DataDog/integrations-core/blob/master/elastic/metadata.csv
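For readers unfamiliar with the mapping format: each entry in the hunk above pairs a Datadog metric name with a metric type and a dotted path. The following is a minimal sketch of how such a path could be resolved against a node stats payload. `resolve_path`, `collect`, and the sample payload are hypothetical illustrations with made-up values, not the integration's actual code.

```python
# Hypothetical excerpt of one node's stats; the values are invented for illustration.
SAMPLE_NODE_STATS = {
    "os": {"cgroup": {"cpu": {"stat": {
        "number_of_elapsed_periods": 1200,
        "number_of_times_throttled": 30,
    }}}},
    "process": {"cpu": {"percent": 17}},
}

# The three entries from the hunk above: metric name -> (type, dotted path).
NEW_METRICS = {
    "elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods": (
        "gauge",
        "os.cgroup.cpu.stat.number_of_elapsed_periods",
    ),
    "elasticsearch.cgroup.cpu.stat.number_of_times_throttled": (
        "gauge",
        "os.cgroup.cpu.stat.number_of_times_throttled",
    ),
    "elasticsearch.process.cpu.percent": ("gauge", "process.cpu.percent"),
}


def resolve_path(stats, dotted_path):
    """Walk a nested dict along a dotted path; return None if any key is missing."""
    node = stats
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def collect(stats, metrics):
    """Return {metric_name: (metric_type, value)} for every path that resolves."""
    collected = {}
    for name, (metric_type, path) in metrics.items():
        value = resolve_path(stats, path)
        if value is not None:
            collected[name] = (metric_type, value)
    return collected
```

Returning `None` for a missing key lets the sketch skip nodes that do not report cgroup stats instead of raising.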
Thanks for the submission, one more nit!
elastic/metadata.csv
elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods,gauge,integer, The number of reporting periods that have elapsed
elasticsearch.cgroup.cpu.stat.number_of_times_throttled,gauge,integer, The number of times all tasks in the same cgroup as the Elasticsearch process have been throttled
elasticsearch.process.cpu.percent,gauge,integer, CPU usage in percent, or -1 if not known at the time the stats are computed
The CI reported an error with these lines; this should fix it:
- elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods,gauge,integer, The number of reporting periods that have elapsed
- elasticsearch.cgroup.cpu.stat.number_of_times_throttled,gauge,integer, The number of times all tasks in the same cgroup as the Elasticsearch process have been throttled
- elasticsearch.process.cpu.percent,gauge,integer, CPU usage in percent, or -1 if not known at the time the stats are computed
+ elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods,gauge,,integer,,The number of reporting periods that have elapsed,0,elasticsearch,cgroup cpu stat
+ elasticsearch.cgroup.cpu.stat.number_of_times_throttled,gauge,,integer,,The number of times all tasks in the same cgroup as the Elasticsearch process have been throttled,0,elasticsearch,cpu stat throttled
+ elasticsearch.process.cpu.percent,gauge,,integer,,CPU usage in percent, or -1 if not known at the time the stats are computed,0,elasticsearch,process cpu percent
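An unquoted comma inside a description shifts every later column, which is easy to miss by eye. The sketch below counts the fields in the last suggested row using Python's standard `csv` module. The expected count of nine is an assumption inferred from the other two suggested rows, and quoting the description is only one possible fix (rewording it to drop the comma is another). The sketch checks field counts only, not whether a unit name is valid.

```python
import csv
import io

# Assumed column count, inferred from the nine-field rows suggested above.
EXPECTED_COLUMNS = 9


def count_fields(row_text):
    """Parse a single CSV row and return how many fields it contains."""
    return len(next(csv.reader(io.StringIO(row_text))))


# The last suggested row as written: the description contains a bare comma.
unquoted = (
    "elasticsearch.process.cpu.percent,gauge,,integer,,"
    "CPU usage in percent, or -1 if not known at the time the stats are computed,"
    "0,elasticsearch,process cpu percent"
)

# Same row with the description wrapped in double quotes.
quoted = (
    "elasticsearch.process.cpu.percent,gauge,,integer,,"
    '"CPU usage in percent, or -1 if not known at the time the stats are computed",'
    "0,elasticsearch,process cpu percent"
)

for label, row in (("unquoted", unquoted), ("quoted", quoted)):
    n = count_fields(row)
    status = "ok" if n == EXPECTED_COLUMNS else "column mismatch"
    print(f"{label}: {n} fields ({status})")
```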
CI reported a similar error: "integer is an invalid unit_name." Is it OK to add "integer" to VALID_UNIT_NAMES in metadata.py?
Co-authored-by: Mike Garabedian <mike@mercuryrising.net>
Correct, `integer` is an invalid unit type. I updated my suggestion and also corrected an extra comma.
What does this PR do?
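Adds three node metrics (elasticsearch.cgroup.cpu.stat.number_of_elapsed_periods, elasticsearch.cgroup.cpu.stat.number_of_times_throttled, and elasticsearch.process.cpu.percent) that help monitor CPU throttling in the cluster.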
Motivation
We use Datadog to monitor and alert on our Elasticsearch cluster, which is managed by Elastic and runs on GCP. We need to see how much CPU the cgroup containing Elasticsearch uses and whether CPU throttling is increasing. These metrics give that visibility to anyone with a similar need.
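As a rough illustration of how the two cgroup metrics could answer the "is throttling increasing" question, the sketch below computes the share of elapsed periods that were throttled between two samples. It assumes both values are cumulative counters, as they are in the kernel's cgroup `cpu.stat`. `throttle_ratio` is a hypothetical helper and the sample numbers are made up.

```python
def throttle_ratio(prev, curr):
    """Fraction of elapsed periods that were throttled between two samples.

    Each sample is a tuple:
        (number_of_elapsed_periods, number_of_times_throttled)

    Returns None when no period elapsed or a counter went backwards
    (for example after a node restart), since no ratio is meaningful then.
    """
    delta_periods = curr[0] - prev[0]
    delta_throttled = curr[1] - prev[1]
    if delta_periods <= 0 or delta_throttled < 0:
        return None
    return delta_throttled / delta_periods


# Made-up samples taken some interval apart.
earlier = (1000, 10)
later = (1200, 60)
print(throttle_ratio(earlier, later))  # 50 throttled out of 200 periods
```

Comparing this ratio across successive intervals shows a trend more directly than either raw counter alone.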
Additional Notes
No
Review checklist (to be filled by reviewers)
changelog/ and integration/ labels attached