Cassandra dashboards fixes and improvements #726
Conversation
Everything else looks good.
Really liked not having to change the file for Cassandra 2.0 support. 👍
cassandra/conf.yaml.example
metric_type: counter
alias: jmx.gc.major_collection_time
# Deprecated metrics for pre Cassandra 3.0 versions compatibility.
# If you are using cassandra 2, the metrics below will be user, otherwise ignored.
*will be used
@joaquincasares @arodrime thanks for digging into this. Our team is QA'ing this in our C* environments, but we hope to have more updates for you soon.
Hi @arodrime, thanks for your PR. Regarding the orientation field, you should have a look at this: https://docs.datadoghq.com/guides/integration_sdk/#metadata-csv
@zippolyte thanks for the pointer! I'll add the orientation as soon as possible. Could you please let me know if the other fields are filled in correctly?
Just a couple of things in the csv; other than those and the orientation, it looks good.
cassandra/metadata.csv
cassandra.waiting_on_free_memtable_space.95th_percentile,gauge,10,microsecond,,The time spent waiting for free memtable space either on- or off-heap - p95.,,cassandra,waiting memtable p95
cassandra.write_latency.75th_percentile,gauge,10,microsecond,,The local read latency - p75.,,cassandra,local write latency p75
cassandra.write_latency.95th_percentile,gauge,10,microsecond,,The local read latency - p95.,,cassandra,local write latency p95
cassandra.write_latency.99th_percentile,gauge,10,microsecond,,The local read latency - p99.,,cassandra,local write latency p99
Here and above it should be `write latency` in the `description` field.
cassandra/metadata.csv
cassandra.write_latency.75th_percentile,gauge,10,microsecond,,The local read latency - p75.,,cassandra,local write latency p75
cassandra.write_latency.95th_percentile,gauge,10,microsecond,,The local read latency - p95.,,cassandra,local write latency p95
cassandra.write_latency.99th_percentile,gauge,10,microsecond,,The local read latency - p99.,,cassandra,local write latency p99
cassandra.write_latency.one_minute_rate,gauge,10,write,second,The number of local write requets.,,cassandra,local write count
Small typo in the `description`: requests.
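For reference, with both corrections applied (`write latency` in the descriptions, and the "requests" typo), the affected metadata.csv rows would read:

```csv
cassandra.write_latency.75th_percentile,gauge,10,microsecond,,The local write latency - p75.,,cassandra,local write latency p75
cassandra.write_latency.95th_percentile,gauge,10,microsecond,,The local write latency - p95.,,cassandra,local write latency p95
cassandra.write_latency.99th_percentile,gauge,10,microsecond,,The local write latency - p99.,,cassandra,local write latency p99
cassandra.write_latency.one_minute_rate,gauge,10,write,second,The number of local write requests.,,cassandra,local write count
```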
Thanks for the feedback @zippolyte, commit added to this PR.
Thanks for the update, but you've also replaced all the commas with semicolons; could you change that back please?
Force-pushed from 28dcfa6 to 4277dc7.
Oops, my bad. I was working from a spreadsheet and the CSV export used semicolons... I fixed it. Thanks.
Hey @arodrime,
I've finished reviewing the config while making the changes to the dashboards on our side. I have just a few comments, nothing major; the rest looks good to me.
cassandra/conf.yaml.example
@@ -2,6 +2,9 @@ instances:
  - host: localhost
    port: 7199
    cassandra_aliasing: true
    tags:
      environment: default_environment
      datacenter: default_datacenter
Could you add a comment to explain the use of these tags and how they are useful in the dashboards? And since we cannot force the timeboards to display in 3 columns (it is that way by default, though), maybe you could also add a note here about that, as you suggested.
I am currently writing an article for the Datadog blog; this will most definitely be in there. Adding a short comment though, as this indeed matters a lot. Good catch.
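A sketch of what the commented `tags` section could look like after this change (the instance values come from the diff above; the comment wording is illustrative, not the final text):

```yaml
instances:
  - host: localhost
    port: 7199
    cassandra_aliasing: true
    # Custom tags used by the TLP dashboards: template variables such as
    # $environment and $datacenter rely on these to filter and group graphs.
    # Note: timeboards display in 3 columns by default; this cannot be forced.
    tags:
      environment: default_environment
      datacenter: default_datacenter
```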
cassandra/conf.yaml.example
- SSTablesPerReadHistogram
- TombstoneScannedHistogram
- WaitingOnFreeMemtableSpace
attribute:
The indentation here is wrong, which is why you get more histogram parts than specified for these mbeans.
Wow I really messed up with indentation. I fixed it for both the 'table' and 'columnfamily' metrics. Thanks!
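For reference, a minimal sketch of the intended nesting in the JMX check config; the mbean names are the ones from the diff above, while the `75thPercentile`/`95thPercentile` attribute entries are illustrative placeholders, not values from this PR:

```yaml
- include:
    domain: org.apache.cassandra.metrics
    type: Table
    name:
      - SSTablesPerReadHistogram
      - TombstoneScannedHistogram
      - WaitingOnFreeMemtableSpace
    # attribute must be indented as a key of include; if it is outdented,
    # no attribute filter applies and every histogram part is collected.
    attribute:
      - 75thPercentile
      - 95thPercentile
```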
cassandra/conf.yaml.example
metric_type: counter
alias: jmx.gc.major_collection_time
metric_type: counter
alias: jmx.gc.major_collection_time
Here and for all the GC `metric_type` and `alias` above, there are two spaces too many in the indentation. It seems to work anyway in this case, but the indentation everywhere else is two spaces, so let's have that here also :)
Makes sense, not sure what happened there 🙃. I am fixing it.
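A sketch of the GC section with the two-space convention applied; the `CollectionTime` attribute name is an assumption about the surrounding context (only the `metric_type`/`alias` lines appear in this diff):

```yaml
attribute:
  CollectionTime:
    # two-space steps all the way down, matching the rest of the file
    metric_type: counter
    alias: jmx.gc.major_collection_time
```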
@zippolyte I pushed the latest changes, let me know if that works for you. Also, have you managed to tackle these 2 points?
I would say these 2 fixes (node status and disk space top list) are really the most important changes needed. If you need more details I am happy to discuss them.
Hi @arodrime thanks for the changes. Just a couple of issues with the indentation again.
Regarding the node status, I'm not sure I understand what the problem is; I'll investigate.
For the other point, since the JSON allows it, I was able to break down by two variables. You can do it by saving the graph from the JSON tab if you want to see how it looks.
cassandra/conf.yaml.example
environment: default_environment
datacenter: default_datacenter
environment: default_environment
datacenter: default_datacenter
The indentation was good before
Ok fixing this.
cassandra/conf.yaml.example
- system_auth
- system_distributed
- system_schema
- system_traces
This `exclude` section was indented correctly before: it should be at the same level as the `include` section.
Oh, of course. ok, fixing this as well.
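A sketch of the intended alignment, assuming the filter uses a `keyspace` bean parameter to skip the system keyspaces (the `include` keys shown are illustrative, not copied from this PR):

```yaml
- include:
    domain: org.apache.cassandra.metrics
    type: Table
  # exclude sits at the same level as include, inside the same list item
  exclude:
    keyspace:
      - system
      - system_auth
      - system_distributed
      - system_schema
      - system_traces
```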
cassandra/conf.yaml.example
- system_auth
- system_distributed
- system_schema
- system_traces
Same for this `exclude` section.
I'll have a look at the whole file carefully... I am obviously terrible at managing indents 🙃.
Hey @arodrime, sorry for the delay.
The changes look good to me now, so I'm approving this PR.
Regarding the node status, what's happening is that `nodetool` gives information about the whole cluster. So each agent where the `cassandra_nodetool` integration is enabled will pick up a `nodetool.status.status` metric for each node of the cluster, tagged with `node_id`, `node_address`, `datacenter` and `rack`. So you can get the 0 or 1 value if you select another visualization and break down by `node_id` or `node_address`. Unfortunately it's not possible in the host map visualization, because it breaks the metrics down by `host` automatically, which is the host on which the agent sending the metric is installed.
That's good news 👍. On the dashboard (UI) side of things, while writing about dashboards I made some fixes and gave the changelog to @irabinovitch. @zippolyte here is a solution, as you suggested, to have the status chart working. Let me know your thoughts. I am not sure I understand why the down node is not showing in red, though. Let me know if you are still missing something or if we are good to go like this for this version :-). Thanks for all the work and the nice communication on this over the last weeks.
I think using the
What does this PR do?
Motivation
This request is part of the work TLP is doing to have a nice set of dashboards out of the box for Cassandra / Datadog users.
Testing
This code needs to be tested with Cassandra (2+) and the TLP - * Dashboards from the TLP Datadog account. On our end, filtering was working; we found no way to test the metadata.csv file though.
Additional Notes
conf.yaml.example
For some reason, this filtering is not working correctly and we are accepting more measurements (histogram parts) than we actually need for these 3 attributes. I am not sure why. It also happens with "column family" in older versions of Cassandra.
metadata.csv: I am not sure about the "orientation" column and left it empty for all the metrics that were added. Also, could you make sure the format / syntax is valid there? I am not sure if percentiles need to be detailed, etc.
About dashboards
The changes mentioned above are made to support a specific version of the Cassandra dashboards, for completeness' sake, please find dashboards "change log" below.
@irabinovitch here is a summary of most of what was done lately; it's been hard for me to keep track of everything, but I wanted to at least explain why I changed some stuff added by Datadog teams. Happy to discuss the design and adjust files in this PR accordingly. I recommend QA teams take it as it is now (I merged their past work in there) and see if that suits the needs.
Important / Design:
Possible improvements:
Hand over:
List of dashboard changes:
Overview Dashboard:
Read Path / Write Path Dashboards
We reworked the 2 dashboards, adding a few charts and improving some existing charts as well as the layout.
We saw no changes from what we offered in the first version of the dashboards so no merge from Datadog default chart is needed there. Taking the new dashboard from our TLP account is probably the easiest way to go. Then check it to see if it still suits you :-).
SSTable management dashboard
This dashboard being new, we should have no merge issues there.