elastic - add missing metrics #2758
@@ -227,14 +244,75 @@ class ESCheck(AgentCheck):
}

ADDITIONAL_METRICS_POST_1_4_0 = {
"elasticsearch.indices.indexing.throttle_time_in_millis": ("guage", "indices.indexing.throttle_time_in_millis", lambda v: float(v)/1000),
typos everywhere here :) guage -> gauge
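A misspelled metric type like this could be caught before review with a small sanity check over the metric dictionaries. The sketch below is only illustrative: `ALLOWED_TYPES` and `find_bad_metric_types` are hypothetical names that do not exist in the check, and the set of allowed types is an assumption.

```python
# Hypothetical helper, not part of the elastic check.
# Assumption: each metric definition is a tuple whose first element is the
# submission type, as in the diff above.
ALLOWED_TYPES = {"gauge", "rate"}  # assumed set; extend as needed


def find_bad_metric_types(metrics):
    """Return {metric_name: type} for entries whose type is not allowed."""
    return {
        name: definition[0]
        for name, definition in metrics.items()
        if definition[0] not in ALLOWED_TYPES
    }


sample = {
    "elasticsearch.indices.indexing.throttle_time_in_millis": (
        "guage",  # the typo from the diff, kept on purpose
        "indices.indexing.throttle_time_in_millis",
        lambda v: float(v) / 1000,
    ),
    "elasticsearch.breakers.fielddata.tripped": (
        "gauge",
        "breakers.fielddata.tripped",
    ),
}

# Lists only the entry with the misspelled type.
print(find_bad_metric_types(sample))
```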
Thanks! Looks like tests are passing on Travis. Could you fix the typos, by the way?
Force-pushed: a30d3e0 to bf9ef19
gah, sorry about the typos! Fixed :)
@mdelaney there's a couple
"elasticsearch.indices.segments.doc_values_memory_in_bytes": ("gauge", "indices.segments.doc_values_memory_in_bytes"),
"elasticsearch.indices.segments.norms_memory_in_bytes": ("gauge", "indices.segments.norms_memory_in_bytes"),
"elasticsearch.indices.segments.stored_fields_memory_in_bytes": ("gauge", "indices.segments.stored_fields_memory_in_bytes"),
"elasticsearch.indices.segments.term_memory_in_bytes": ("gauge", "indices.segments.term_memory_in_bytes"),
This stat, term_memory_in_bytes, is unavailable; it might be a typo that was forgotten here. This is what's causing the two 2.0+ tests to fail. Removing the metric works fine for me locally.
gah, should have been term_vectors_memory_in_bytes
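One way to spot an unavailable stat like this ahead of the test run is to resolve each dotted path against a stats payload and list the ones that do not resolve. This is a rough sketch, not the check's own code: `lookup` is a hypothetical helper and `node_stats` is a made-up fragment, not real Elasticsearch output.

```python
def lookup(stats, dotted_path):
    """Walk a nested dict along a dotted path; return None if a key is missing."""
    node = stats
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Made-up fragment for illustration only (values are invented).
node_stats = {
    "indices": {
        "segments": {
            "norms_memory_in_bytes": 128,
            "term_vectors_memory_in_bytes": 0,
        }
    }
}

paths = [
    "indices.segments.norms_memory_in_bytes",
    "indices.segments.term_memory_in_bytes",  # the mistyped path
    "indices.segments.term_vectors_memory_in_bytes",
]

# Paths that cannot be resolved in the payload.
missing = [p for p in paths if lookup(node_stats, p) is None]
print(missing)
```

Note that a value of `0` still counts as present here, because the comparison is against `None` rather than truthiness.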
@mdelaney the issue is with the
"elasticsearch.breakers.fielddata.estimated_size_in_bytes": ("gauge", "breakers.fielddata.estimated_size_in_bytes"),
"elasticsearch.breakers.fielddata.overhead": ("gauge", "breakers.fielddata.overhead"),
"elasticsearch.breakers.fielddata.tripped": ("gauge", "breakers.fielddata.tripped"),
"elasticsearch.breakers.parent.estimated_size_in_bytes": ("gauge", "breakers.fielddata.estimated_size_in_bytes"),
I believe these should be:
"elasticsearch.breakers.parent.estimated_size_in_bytes": ("gauge", "breakers.parent.estimated_size_in_bytes"),
"elasticsearch.breakers.parent.overhead": ("gauge", "breakers.parent.overhead"),
"elasticsearch.breakers.parent.tripped": ("gauge", "breakers.parent.tripped"),
"elasticsearch.breakers.request.estimated_size_in_bytes": ("gauge", "breakers.request.estimated_size_in_bytes"),
"elasticsearch.breakers.request.overhead": ("gauge", "breakers.request.overhead"),
"elasticsearch.breakers.request.tripped": ("gauge", "breakers.request.tripped"),
Good catch!
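Copy/paste slips of this kind (a `parent` metric name pointing at a `fielddata` path) could be flagged mechanically. The sketch below assumes the convention that the metric name is the `elasticsearch.` prefix followed by the stats path. The entries in this diff follow that convention, but it may not hold for every metric in the check, so treat hits as hints rather than errors. `find_name_path_mismatches` is a hypothetical helper.

```python
PREFIX = "elasticsearch."


def find_name_path_mismatches(metrics):
    """Return (name, path) pairs where name != PREFIX + path.

    Heuristic only: assumes names mirror their stats paths.
    """
    mismatches = []
    for name, definition in sorted(metrics.items()):
        path = definition[1]
        if name.startswith(PREFIX) and name[len(PREFIX):] != path:
            mismatches.append((name, path))
    return mismatches


breaker_metrics = {
    "elasticsearch.breakers.fielddata.tripped": (
        "gauge",
        "breakers.fielddata.tripped",
    ),
    # The copy/paste slip from the diff: name says parent, path says fielddata.
    "elasticsearch.breakers.parent.estimated_size_in_bytes": (
        "gauge",
        "breakers.fielddata.estimated_size_in_bytes",
    ),
}

for name, path in find_name_path_mismatches(breaker_metrics):
    print("%s -> %s" % (name, path))
```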
@mdelaney added some more feedback about a few copy/paste typos.
Thanks for the feedback. I'll address these today.
Force-pushed: bf9ef19 to 502f7d4
The tests related to the added metrics are passing locally now, so I assume they should pass the CI tests here as well.
@mdelaney thanks a lot for addressing those. We're almost ready to go... But I do have to pester you with one more thing. I'm so sorry 😅
So the deal is, we have gauges for all new metrics, and I'm not sure if that should always be the case. For example:
On the other hand, there are metrics, such as
Let me know if you need any help with that, or you have any doubts with specific metrics - I know I often do. Thanks!
@mdelaney appreciate your work on this!
Just in case there are doubts: it doesn't really matter if a counter is reset on a node restart, as the rate will reset with it. The main thing here is being able to extract meaning from an otherwise less insightful, ever-growing (resets aside) metric.
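To make the gauge-versus-rate point concrete, here is a minimal sketch of how a per-second rate can be derived from an ever-growing counter, including what happens on a reset. It is not the agent's implementation; `per_second_rate` is a hypothetical helper and the samples are invented.

```python
def per_second_rate(prev_value, prev_ts, value, ts):
    """Return the per-second rate between two samples.

    Returns None when no rate can be computed: non-positive elapsed time,
    or the counter went backwards (for example after a node restart).
    """
    elapsed = ts - prev_ts
    if elapsed <= 0 or value < prev_value:
        return None
    return (value - prev_value) / float(elapsed)


# Invented (timestamp_seconds, counter_value) samples; the counter
# resets between t=15 and t=30.
samples = [(0, 1000), (15, 1300), (30, 40), (45, 100)]

rates = [
    per_second_rate(prev_v, prev_t, v, t)
    for (prev_t, prev_v), (t, v) in zip(samples, samples[1:])
]
print(rates)
```

Submitted as a gauge, the raw counter only ever climbs. The derived rate shows how fast it is climbing, and a reset simply costs one sample.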
Sorry for the delay, I had gotten a bit busy. I should have time to finish this up today.
metrics are added for various versions of elasticsearch
Force-pushed: 502f7d4 to 3c74140
I think I've caught all the stats I added that should not be gauges, and I also caught a small naming mistake for throttle times. It's worth mentioning that there are almost certainly existing metrics that should not be gauges as well. I'll likely submit another PR in the future to address that if no one else has. Also, there are definitely other metrics that are still missing, but those will be left to a future PR as well :-)
Thanks so much @mdelaney! I agree with the choice of metric types. And yeah, I wouldn't be surprised if we have some metrics in there that are submitted as gauges when they really shouldn't be. Tests are green, and all comments have been addressed, so let's 🎉
What does this PR do?
This adds a bunch of metrics that are missing. I've tracked down the version of Elasticsearch in which each new metric was added, so this should be useful regardless of what version of ES you run.
It's worth noting that there are definitely more metrics missing (primary shard level metrics in particular). If we end up needing these metrics I'll create another pull request.
Motivation
We need these metrics to accurately see what's happening inside our production clusters. DataDog support advised me that there was no specific timeline for getting these metrics added, so I decided to fix it myself and contribute so others can benefit.
Additional Notes
I wasn't able to get the tests to pass locally. I've set up a single Elasticsearch node locally running on the correct port, and the tests fail with or without my changes. There seem to be some issues in the tests for the elastic check that should probably be addressed at some point (among them, expecting a cluster to be yellow, which seems wrong). At any rate, I've confirmed these metrics work in a live environment (we're actively using this in our production systems now). I don't really have time to debug the full test script, but if there is any advice on how to get these tests passing I'd appreciate it. :-)