
Cassandra dashboards fixes and improvements #726

Merged
merged 6 commits into DataDog:master from tlp-dashboards-quick-fixes on Nov 3, 2017

Conversation

@arodrime (Contributor) commented Sep 1, 2017

What does this PR do?

  • Update the filtering of metrics in the Cassandra integration to match the latest version of the dashboards we are proposing from the dev environments. @irabinovitch is the person we are in contact with, and he knows where to take the dashboards from so they can go through Datadog QA.
  • Add by default the tags we need to use as template variables in the various dashboards.
  • Make the default configuration compatible with C* 2.0.
  • Make the default configuration compatible with C* 2.1.
  • Make the default configuration compatible with C* 2.2.
  • Keep compatibility with C* 3+.
  • Make the filtering exhaustive, so we use exactly the metrics we display.
  • Update the corresponding metadata CSV so every metric has a unit and a short description.

Motivation

This request is part of the work TLP is doing to provide a nice set of out-of-the-box dashboards for Cassandra / Datadog users.

Testing

This code needs to be tested with Cassandra (2+) and the TLP - * dashboards from the TLP Datadog account. On our end the filtering was working; we found no way to test the metadata.csv file, though.

Additional Notes

conf.yaml.example

For some reason, this filtering is not working correctly and we are accepting more measurements (histogram parts) than we actually need for these 3 attributes. I am not sure why. It also happens with "column family" in older versions of Cassandra.

- include:
        domain: org.apache.cassandra.metrics
        type: Table
        bean_regex:
          - .*keyspace=.*
        name:
          - SSTablesPerReadHistogram
          - TombstoneScannedHistogram
          - WaitingOnFreeMemtableSpace
        attribute:
          - 75thPercentile
          - 95thPercentile
      exclude:
        keyspace:
          - system
          - system_auth
          - system_distributed
          - system_schema
          - system_traces
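
For reference, zippolyte diagnoses this further down the thread as an indentation problem: the attribute list must be aligned with the other include keys, and exclude must sit at the same level as include. A minimal corrected sketch of the same block:

- include:
    domain: org.apache.cassandra.metrics
    type: Table
    bean_regex:
      - .*keyspace=.*
    name:
      - SSTablesPerReadHistogram
      - TombstoneScannedHistogram
      - WaitingOnFreeMemtableSpace
    attribute:
      - 75thPercentile
      - 95thPercentile
  exclude:
    keyspace:
      - system
      - system_auth
      - system_distributed
      - system_schema
      - system_traces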

metadata.csv: I am not sure about the "orientation" column, so I left it empty for all the metrics that were added. Could you also make sure the format / syntax is valid there? I am not sure whether percentiles need to be detailed, etc.

About dashboards

The changes mentioned above are made to support a specific version of the Cassandra dashboards; for completeness' sake, please find the dashboards' "change log" below.

@irabinovitch here are the comments on most of what was done lately. It's been hard for me to keep track of everything, but I wanted to at least explain why I changed some of the things added by the Datadog teams. Happy to discuss the design and adjust the files in this PR accordingly. I recommend the QA teams take the dashboards as they are now (I merged their past work in there) and see if that suits the needs.

Important / Design:

  • Could we force these dashboards to display with 3 charts per row (at least by default)? They were designed to be displayed that way, and we believe it makes things much clearer for the user. If not, it might be worth mentioning that somewhere alongside the tag configuration.
  • Also, all the charts were built with filters using template variables (which you can see in the TLP Datadog account).

Possible improvements:

  • Top list: it would be nice to make top lists that can be broken down by multiple variables, not just one. The JSON allows it, but a save from the UI breaks it, as the UI only accepts one variable there (for example, a top list of disk space used broken down by host AND device would be awesome).

Hand over:

  • A lot of charts changed (using better metrics, specifying when a chart is available on C* 3+ only, and adding metrics and charts for C* 2.1 and C* 2.2 compatibility).
  • Added markers as guidelines for new Cassandra operators.
  • Some charts were added to improve existing charts by tracking new information.
  • The changes listed below were also made when merging your (Datadog teams') work with what we did in the meantime. We tried to detail them so you can have the big picture of the design we used to create the dashboards. We hope you will like the changes, but we are happy to discuss anything.
  • For all those reasons, it might be easier to take the dashboards as they are now in our development environment, check whether that version works for you, and make changes as you see fit from there.
  • We provide this pull request for the metrics in use in those dashboards, compatible with C* 2.1+ (maybe C* 2.0, not tested), and the corresponding metadata, so it should hopefully be easy to plug in this new version on your side.

List of dashboard changes:

Overview Dashboard:


  • We need to use information about the node itself being up or down, disregarding the status of the other nodes. As of now, a down node will not be shown. I recommend you simulate a node going down and look at this chart in a 3+ node cluster if I am being unclear.
  • We removed Live space growth: this information is important, we agree. Yet the overview dashboard is made exclusively for anomaly detection. The on-disk data growth is not likely to help detect or prevent an anomaly, but it is indeed good information to have, so we placed it in the new “SSTable Management” dashboard that gathers information about SSTables, flushes, compactions, and disk.
  • We split read / write counts and added “other operations” counts and latencies. We believe that when the workload between reads and writes is unbalanced, the smaller one looks “flat” in the chart even if it drops 50% of its load. So it is good to split them, since we want to detect anomalies, and the ratio between reads and writes is still obvious to the operator as both charts sit next to each other.
  • We put ‘dropped messages’ at the top, as it is a key factor in anomaly detection. Also, a database is supposed to store and deliver messages successfully; if it doesn’t, we want to know about it. We also removed the filtering here, so ANY dropped message shows and no kind of issue slips through the cracks.
  • Memtable count, Memtable data size: similarly, we believe this is too “low level” for this dashboard, which has to remain simple and monitor just enough to never miss that “something happened” or to say the cluster is healthy. These charts would fit in a “Hardware and JVM usage” dashboard that is not yet designed but that we aim to build for a V2 of these dashboards. For now, we dropped these 2 charts from the overview.
  • Max partition size became a top list, because it fits better: at this point we do not care much about the evolution of the biggest partition, but we do want to see clearly how big the partitions are.
  • System memory: we missed that one, thanks for adding it. It now uses avg instead of sum, though, to be more consistent with CPU and for readability.
  • I/O wait (%): this is definitely a good fit, as many Cassandra issues are related to disk one way or another. We broke it down per host / device though, so we know precisely which disk is having trouble.
  • We also removed the non-working charts called ‘Read and write rate’, ‘Write latency (ms)’ and ‘Key/Row cache hit rate’. They try to use metrics that are currently not available in https://github.com/DataDog/integrations-core/blob/master/cassandra/conf.yaml.example. These charts mostly exist in “TLP - Read path” and use metrics from org.apache.cassandra.metrics rather than org.apache.cassandra.db.

Read Path / Write Path Dashboards

We reworked the 2 dashboards to add a few charts and improve some existing charts, as well as the layout.

We saw no changes from what we offered in the first version of the dashboards, so no merge from the Datadog default charts is needed there. Taking the new dashboards from our TLP account is probably the easiest way to go. Then check them to see if they still suit you :-).

SSTable management dashboard

This dashboard being new, we should have no merge issues there.

@bits-bot (Collaborator) commented Sep 1, 2017

@arodrime, thanks for your PR! By analyzing the history of the files in this pull request, we identified @gmmeyer to be a potential reviewer.

@joaquincasares left a comment

Everything else looks good.

Really liked not having to change the file for Cassandra 2.0 support. 👍

metric_type: counter
alias: jmx.gc.major_collection_time
# Deprecated metrics for pre Cassandra 3.0 versions compatibility.
# If you are using cassandra 2, the metrics below will be user, otherwise ignored.

@joaquincasares:

*will be used

@irabinovitch (Contributor) commented

@joaquincasares @arodrime thanks for digging into this. Our team is QA'ing this in our C* environments, but we hope to have more updates for you soon.

@zippolyte (Contributor) commented

Hi @arodrime, thanks for your PR.

Regarding the orientation field, you should have a look at this: https://docs.datadoghq.com/guides/integration_sdk/#metadata-csv
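
For reference, per those docs the orientation column takes -1, 0, or 1, depending on whether a lower or a higher value is better. A hypothetical filled-in row for one of the latency metrics reviewed below, where lower is better:

cassandra.write_latency.95th_percentile,gauge,10,microsecond,,The local write latency - p95.,-1,cassandra,local write latency p95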

@arodrime (Contributor, Author) commented

@zippolyte thanks for the pointer! I'll add the orientation as soon as possible.

Could you please let me know if the other fields are correctly filled in?

@zippolyte (Contributor) left a comment

Just a couple of things in the csv; other than that and the orientation, the csv looks good.

cassandra.waiting_on_free_memtable_space.95th_percentile,gauge,10,microsecond,,The time spent waiting for free memtable space either on- or off-heap - p95.,,cassandra,waiting memtable p95
cassandra.write_latency.75th_percentile,gauge,10,microsecond,,The local read latency - p75.,,cassandra,local write latency p75
cassandra.write_latency.95th_percentile,gauge,10,microsecond,,The local read latency - p95.,,cassandra,local write latency p95
cassandra.write_latency.99th_percentile,gauge,10,microsecond,,The local read latency - p99.,,cassandra,local write latency p99
@zippolyte:

Here and above should be write latency in the description field

cassandra.write_latency.75th_percentile,gauge,10,microsecond,,The local read latency - p75.,,cassandra,local write latency p75
cassandra.write_latency.95th_percentile,gauge,10,microsecond,,The local read latency - p95.,,cassandra,local write latency p95
cassandra.write_latency.99th_percentile,gauge,10,microsecond,,The local read latency - p99.,,cassandra,local write latency p99
cassandra.write_latency.one_minute_rate,gauge,10,write,second,The number of local write requets.,,cassandra,local write count
@zippolyte:

small typo: requests in the description

@arodrime (Contributor, Author) commented

Thanks for the feedback @zippolyte, commit added to this PR.

@zippolyte (Contributor) commented

Thanks for the update, but you've also replaced all the commas with semicolons; could you change that back, please?

@arodrime (Contributor, Author) commented

Oops, my bad. I was working from a spreadsheet and the CSV export used semicolons... I fixed it. Thanks.

@zippolyte (Contributor) left a comment

Hey @arodrime,

I've finished reviewing the config while making the changes to the dashboards on our side. I have just a few comments, nothing major; the rest looks good to me.

@@ -2,6 +2,9 @@ instances:
- host: localhost
port: 7199
cassandra_aliasing: true
tags:
environment: default_environment
datacenter: default_datacenter
@zippolyte:

Could you add a comment explaining the use of these tags and how they are useful in the dashboards? And since we cannot force the timeboards to display in 3 columns (they display that way by default, though), maybe you could also add a note about that here, as you suggested.

@arodrime:

I am currently writing an article for the Datadog blog; this will most definitely be in there. Adding a short comment here too, though, as this indeed matters a lot. Good catch.
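
A sketch of what that comment could look like in conf.yaml.example, based on the diff above (the exact wording is illustrative):

instances:
  - host: localhost
    port: 7199
    cassandra_aliasing: true
    # The tags below are used as template variables by the Cassandra dashboards.
    # Note: the dashboards are designed to be displayed with 3 charts per row,
    # which is the default timeboard layout.
    tags:
      environment: default_environment
      datacenter: default_datacenter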

- SSTablesPerReadHistogram
- TombstoneScannedHistogram
- WaitingOnFreeMemtableSpace
attribute:
@zippolyte commented Sep 20, 2017:

The indentation here is wrong, which is why you get more histogram parts than specified for these mbeans.

@arodrime:

Wow, I really messed up the indentation. I fixed it for both the 'table' and 'columnfamily' metrics. Thanks!

metric_type: counter
alias: jmx.gc.major_collection_time
metric_type: counter
alias: jmx.gc.major_collection_time
@zippolyte:

Here, and for all the GC metric_type and alias entries above, the indentation has two extra spaces.
It seems to work anyway in this case, but the indentation everywhere else is two spaces, so let's have that here as well :)
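
For reference, a sketch of a GC attribute block with consistent two-space indentation throughout (the surrounding include keys and the CollectionCount alias are assumptions for illustration, by analogy with the major_collection_time alias shown above):

- include:
    domain: java.lang
    type: GarbageCollector
    attribute:
      CollectionCount:
        metric_type: counter
        alias: jmx.gc.major_collection_count
      CollectionTime:
        metric_type: counter
        alias: jmx.gc.major_collection_time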

@arodrime:

Makes sense, not sure what happened there 🙃. I am fixing it.

@arodrime (Contributor, Author) commented

@zippolyte I pushed the latest changes; let me know if that works for you.

Also, have you managed to tackle these 2 points?

  • We need to use information about the node itself being up or down, disregarding the status of the other nodes. As of now, a down node will not be shown. I recommend you simulate a node going down and look at this chart in a 3+ node cluster if I am being unclear.
  • Top list: it would be nice to make top lists that can be broken down by multiple variables, not just one. The JSON allows it, but a save from the UI breaks it, as the UI only accepts one variable there (for example, a top list of disk space used broken down by host AND device would be awesome).

I would say these 2 fixes (node status and disk space top list) are really the most important changes needed. If you need more details, I am happy to discuss them.

@zippolyte (Contributor) left a comment

Hi @arodrime thanks for the changes. Just a couple of issues with the indentation again.

Regarding the node status, I'm not sure I understand what the problem is; I'll investigate.
For the other point, since the JSON allows it, I was able to break down by two variables. You can do it by saving the graph from the JSON tab if you want to see how it looks.
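
For reference, a sketch of the graph JSON for the disk-space example, broken down by both host and device (the metric name and limit here are illustrative):

{
  "viz": "toplist",
  "requests": [
    {
      "q": "top(avg:system.disk.in_use{*} by {host,device}, 10, 'mean', 'desc')"
    }
  ]
}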

environment: default_environment
datacenter: default_datacenter
environment: default_environment
datacenter: default_datacenter
@zippolyte:

The indentation was good before

@arodrime:

Ok fixing this.

- system_auth
- system_distributed
- system_schema
- system_traces
@zippolyte:

This exclude section was indented correctly before: it should be at the same level as the include section.

@arodrime:

Oh, of course. ok, fixing this as well.

- system_auth
- system_distributed
- system_schema
- system_traces
@zippolyte:

Same for this exclude section

@arodrime:

I'll have a careful look at the whole file... I am obviously terrible at managing indentation 🙃.

@arodrime (Contributor, Author) commented

Regarding the node status, I'm not sure I understand what the problem is; I'll investigate.

I believe we need to test the node status directly (instead of relying on each node's perspective on the cluster state). We need a value of 1 (up) or 0 (down), not a number that depends on the cluster size, to be able to monitor nodes going down properly.

When nodes are up, all good.

[screenshot: nodes up]

When a node is down, the numbers shown are counter-intuitive, but you can still guess what's happening. Not ideal, though.

[screenshot: 1 node down]

When 2 nodes are down (the moment you really need monitoring), it's hard to read. Try to guess which 2 nodes were down below.

[screenshot: 2 nodes down]

Playing with the palette and the aggregation method (avg, sum, min, max) does not help much. That's why I think we need to know the status of each node, by trying to connect to it somehow. The status value should be 1 or 0, independently of the number of nodes and the state of the other nodes.

It's a bit tricky to explain in writing; I hope the screenshots help. Testing it as you suggested is probably the best way to go. We can also get on a call if need be.

@truthbk modified the milestones: 5.18, 5.19 — Oct 4, 2017
@zippolyte (Contributor) left a comment

Hey @arodrime, sorry for the delay.

The changes look good to me now, so I'm approving this PR.

Regarding the node status, what's happening is that nodetool gives information about the whole cluster. So each agent where the cassandra_nodetool integration is enabled will pick up a nodetool.status.status metric for each node of the cluster, tagged with node_id, node_address, datacenter and rack. You can therefore get the 0-or-1 value if you select another visualization and break down by node_id or node_address. Unfortunately, it's not possible in the host map visualization, because it automatically breaks down metrics by host, which is the host the agent sending the metric is installed on.
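
So, using the metric and tags described above, a toplist or timeseries query along these lines surfaces the per-node 0-or-1 value (a sketch):

min:nodetool.status.status{*} by {node_address}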

@arodrime (Contributor, Author) commented

The changes look good to me now, so I'm approving this PR.

That's good news 👍 .

On the dashboard (UI) side of things: while writing about the dashboards, I made some fixes and gave the changelog to @irabinovitch.

@zippolyte here is a solution, as you suggested, to get the status chart working. Let me know your thoughts.

[screenshot: node_status]

I am not sure I understand why the down node is not showing in red, though.

Let me know if you are still missing something or if we are good to go like this for this version :-).

Thanks for all the work and the nice communication on this over the last few weeks.

@zippolyte (Contributor) commented

I think using the top list for the status is the best alternative, even though it's not ideal. We should be good to go after I integrate your dashboard fixes into our own.
Thanks a lot again for your great work on this, @arodrime!

@zippolyte modified the milestones: 5.19, 5.20 — Nov 3, 2017
@zippolyte merged commit 29b848b into DataDog:master — Nov 3, 2017
@arodrime deleted the tlp-dashboards-quick-fixes branch — April 7, 2021