
Cassandra dashboards fixes and improvements #726

Merged
merged 6 commits into DataDog:master from tlp-dashboards-quick-fixes on Nov 3, 2017

Conversation

@arodrime (Contributor) commented Sep 1, 2017

What does this PR do?

  • Update the filtering of metrics in the Cassandra integration to match the latest version of the dashboards we are proposing from the dev environments. @irabinovitch is the person we are in contact with, and he knows where to take the dashboards from so they can go through Datadog QA.
  • Add by default the tags we need to use as template variables in the various dashboards.
  • Make the default configuration compatible with C* 2.0.
  • Make the default configuration compatible with C* 2.1.
  • Make the default configuration compatible with C* 2.2.
  • Keep compatibility with C* 3+.
  • Make the filtering exhaustive, so we use exactly the metrics we display.
  • Update the corresponding metadata CSV so every metric has a unit and a short description.

Motivation

This request is part of the work TLP is doing to provide a nice set of out-of-the-box dashboards for Cassandra / Datadog users.

Testing

This code needs to be tested with Cassandra (2+) and the TLP - * dashboards from the TLP Datadog account. On our end the filtering was working; we found no way to test the metadata.csv file, though.

Additional Notes

conf.yaml.example

For some reason, this filtering is not working correctly and we are accepting more measurements (histogram parts) than we actually need for these 3 attributes. I am not sure why. It also happens with "column family" in older versions of Cassandra.

- include:
        domain: org.apache.cassandra.metrics
        type: Table
        bean_regex:
          - .*keyspace=.*
        name:
          - SSTablesPerReadHistogram
          - TombstoneScannedHistogram
          - WaitingOnFreeMemtableSpace
        attribute:
          - 75thPercentile
          - 95thPercentile
      exclude:
        keyspace:
          - system
          - system_auth
          - system_distributed
          - system_schema
          - system_traces
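
For reference, zippolyte diagnoses this further down the thread as an indentation problem: the attribute list must be aligned with the other include keys, and exclude must sit at the same level as include. A minimal corrected sketch of the same block:

- include:
    domain: org.apache.cassandra.metrics
    type: Table
    bean_regex:
      - .*keyspace=.*
    name:
      - SSTablesPerReadHistogram
      - TombstoneScannedHistogram
      - WaitingOnFreeMemtableSpace
    attribute:
      - 75thPercentile
      - 95thPercentile
  exclude:
    keyspace:
      - system
      - system_auth
      - system_distributed
      - system_schema
      - system_traces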

metadata.csv: I am not sure about the "orientation" column, so I left it empty for all the metrics that were added. Could you also make sure the format / syntax is valid there? I am not sure whether percentiles need to be detailed, etc.

About dashboards

The changes mentioned above are made to support a specific version of the Cassandra dashboards; for completeness' sake, please find the dashboards' "change log" below.

@irabinovitch here are the comments on most of what was done lately. It's been hard for me to keep track of everything, but I wanted to at least explain why I changed some of the things added by the Datadog teams. Happy to discuss the design and adjust the files in this PR accordingly. I recommend the QA teams take the dashboards as they are now (I merged their past work in there) and see if that suits the needs.

Important / Design:

  • Could we force these dashboards to display with 3 charts per row (at least by default)? They were designed to be displayed that way, and we believe it makes things much clearer for the user. If not, it might be worth mentioning that somewhere alongside the tag configuration.
  • Also, all the charts were built with filters using template variables (which you can see in the TLP Datadog account).

Possible improvements:

  • Top list: it would be nice to make top lists that can be broken down by multiple variables, not just one. The JSON allows it, but a save from the UI breaks it, as the UI only accepts one variable there (for example, a top list of disk space used broken down by host AND device would be awesome).

Hand over:

  • A lot of charts changed (using better metrics, specifying when a chart is available on C* 3+ only, and adding metrics and charts for C* 2.1 and C* 2.2 compatibility).
  • Added markers as guidelines for new Cassandra operators.
  • Some charts were added to improve existing charts by tracking new information.
  • The changes listed below were also made when merging your (Datadog teams') work with what we did in the meantime. We tried to detail them so you can have the big picture of the design we used to create the dashboards. We hope you will like the changes, but we are happy to discuss anything.
  • For all those reasons, it might be easier to take the dashboards as they are now in our development environment, check whether that version works for you, and make changes as you see fit from there.
  • We provide this pull request for the metrics in use in those dashboards, compatible with C* 2.1+ (maybe C* 2.0, not tested), and the corresponding metadata, so it should hopefully be easy to plug in this new version on your side.

List of dashboard changes:

Overview Dashboard:


  • We need to use information about the node itself being up or down, disregarding the status of the other nodes. As of now, a down node will not be shown. I recommend you simulate a node going down and look at this chart in a 3+ node cluster if I am being unclear.
  • We removed Live space growth: this information is important, we agree. Yet the overview dashboard is made exclusively for anomaly detection. The on-disk data growth is not likely to help detect or prevent an anomaly, but it is indeed good information to have, so we placed it in the new “SSTable Management” dashboard that gathers information about SSTables, flushes, compactions, and disk.
  • We split read / write counts and added “other operations” counts and latencies. We believe that when the workload between reads and writes is unbalanced, the smaller one looks “flat” in the chart even if it drops 50% of its load. So it is good to split them, since we want to detect anomalies, and the ratio between reads and writes is still obvious to the operator as both charts sit next to each other.
  • We put ‘dropped messages’ at the top, as it is a key factor in anomaly detection. Also, a database is supposed to store and deliver messages successfully; if it doesn’t, we want to know about it. We also removed the filtering here, so ANY dropped message shows and no kind of issue slips through the cracks.
  • Memtable count, Memtable data size: similarly, we believe this is too “low level” for this dashboard, which has to remain simple and monitor just enough to never miss that “something happened” or to say the cluster is healthy. These charts would fit in a “Hardware and JVM usage” dashboard that is not yet designed but that we aim to build for a V2 of these dashboards. For now, we dropped these 2 charts from the overview.
  • Max partition size became a top list, because it fits better: at this point we do not care much about the evolution of the biggest partition, but we do want to see clearly how big the partitions are.
  • System memory: we missed that one, thanks for adding it. It now uses avg instead of sum, though, to be more consistent with CPU and for readability.
  • I/O wait (%): this is definitely a good fit, as many Cassandra issues are related to disk one way or another. We broke it down per host / device though, so we know precisely which disk is having trouble.
  • We also removed the non-working charts called ‘Read and write rate’, ‘Write latency (ms)’ and ‘Key/Row cache hit rate’. They try to use metrics that are currently not available in https://github.com/DataDog/integrations-core/blob/master/cassandra/conf.yaml.example. These charts mostly exist in “TLP - Read path” and use metrics from org.apache.cassandra.metrics rather than org.apache.cassandra.db.

Read Path / Write Path Dashboards

We reworked the 2 dashboards to add a few charts and improve some existing charts, as well as the layout.

We saw no changes from what we offered in the first version of the dashboards, so no merge from the Datadog default charts is needed there. Taking the new dashboards from our TLP account is probably the easiest way to go. Then check them to see if they still suit you :-).

SSTable management dashboard

This dashboard being new, we should have no merge issues there.

@bits-bot (Collaborator) commented Sep 1, 2017

@arodrime, thanks for your PR! By analyzing the history of the files in this pull request, we identified @gmmeyer to be a potential reviewer.

@joaquincasares left a comment

Everything else looks good.

Really liked not having to change the file for Cassandra 2.0 support. 👍

metric_type: counter
alias: jmx.gc.major_collection_time
# Deprecated metrics for pre Cassandra 3.0 versions compatibility.
# If you are using cassandra 2, the metrics below will be user, otherwise ignored.

@joaquincasares:

*will be used

@irabinovitch (Contributor) commented

@joaquincasares @arodrime thanks for digging into this. Our team is QA'ing this in our C* environments, but we hope to have more updates for you soon.

@zippolyte (Contributor) commented

Hi @arodrime, thanks for your PR.

Regarding the orientation field, you should have a look at this: https://docs.datadoghq.com/guides/integration_sdk/#metadata-csv
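
For reference, per those docs the orientation column takes -1, 0, or 1, depending on whether a lower or a higher value is better. A hypothetical filled-in row for one of the latency metrics reviewed below, where lower is better:

cassandra.write_latency.95th_percentile,gauge,10,microsecond,,The local write latency - p95.,-1,cassandra,local write latency p95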

@arodrime (Contributor, Author) commented

@zippolyte thanks for the pointer! I'll add the orientation as soon as possible.

Could you please let me know if the other fields are correctly filled in?

@zippolyte (Contributor) left a comment

Just a couple of things in the csv; other than that and the orientation, the csv looks good.

cassandra.waiting_on_free_memtable_space.95th_percentile,gauge,10,microsecond,,The time spent waiting for free memtable space either on- or off-heap - p95.,,cassandra,waiting memtable p95
cassandra.write_latency.75th_percentile,gauge,10,microsecond,,The local read latency - p75.,,cassandra,local write latency p75
cassandra.write_latency.95th_percentile,gauge,10,microsecond,,The local read latency - p95.,,cassandra,local write latency p95
cassandra.write_latency.99th_percentile,gauge,10,microsecond,,The local read latency - p99.,,cassandra,local write latency p99
@zippolyte:

Here and above should be write latency in the description field

cassandra.write_latency.75th_percentile,gauge,10,microsecond,,The local read latency - p75.,,cassandra,local write latency p75
cassandra.write_latency.95th_percentile,gauge,10,microsecond,,The local read latency - p95.,,cassandra,local write latency p95
cassandra.write_latency.99th_percentile,gauge,10,microsecond,,The local read latency - p99.,,cassandra,local write latency p99
cassandra.write_latency.one_minute_rate,gauge,10,write,second,The number of local write requets.,,cassandra,local write count
@zippolyte:

small typo: requests in the description

@arodrime (Contributor, Author) commented

Thanks for the feedback @zippolyte, commit added to this PR.

@zippolyte (Contributor) commented

Thanks for the update, but you've also replaced all the commas with semicolons; could you change that back, please?

@arodrime (Contributor, Author) commented

Oops, my bad. I was working from a spreadsheet and the CSV export used semicolons... I fixed it. Thanks.

@zippolyte (Contributor) left a comment

Hey @arodrime,

I've finished reviewing the config while making the changes to the dashboards on our side. I have just a few comments, nothing major; the rest looks good to me.

@@ -2,6 +2,9 @@ instances:
- host: localhost
port: 7199
cassandra_aliasing: true
tags:
environment: default_environment
datacenter: default_datacenter
@zippolyte:

Could you add a comment explaining the use of these tags and how they are useful in the dashboards? And since we cannot force the timeboards to display in 3 columns (they display that way by default, though), maybe you could also add a note about that here, as you suggested.

@arodrime:

I am currently writing an article for the Datadog blog; this will most definitely be in there. Adding a short comment here too, though, as this indeed matters a lot. Good catch.
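
A sketch of what that comment could look like in conf.yaml.example, based on the diff above (the exact wording is illustrative):

instances:
  - host: localhost
    port: 7199
    cassandra_aliasing: true
    # The tags below are used as template variables by the Cassandra dashboards.
    # Note: the dashboards are designed to be displayed with 3 charts per row,
    # which is the default timeboard layout.
    tags:
      environment: default_environment
      datacenter: default_datacenter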

- SSTablesPerReadHistogram
- TombstoneScannedHistogram
- WaitingOnFreeMemtableSpace
attribute:
@zippolyte commented Sep 20, 2017:

The indentation here is wrong, which is why you get more histogram parts than specified for these mbeans.

@arodrime:

Wow, I really messed up the indentation. I fixed it for both the 'table' and 'columnfamily' metrics. Thanks!

metric_type: counter
alias: jmx.gc.major_collection_time
metric_type: counter
alias: jmx.gc.major_collection_time
@zippolyte:

Here, and for all the GC metric_type and alias entries above, the indentation has two extra spaces.
It seems to work anyway in this case, but the indentation everywhere else is two spaces, so let's have that here as well :)
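
For reference, a sketch of a GC attribute block with consistent two-space indentation throughout (the surrounding include keys and the CollectionCount alias are assumptions for illustration, by analogy with the major_collection_time alias shown above):

- include:
    domain: java.lang
    type: GarbageCollector
    attribute:
      CollectionCount:
        metric_type: counter
        alias: jmx.gc.major_collection_count
      CollectionTime:
        metric_type: counter
        alias: jmx.gc.major_collection_time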

@arodrime:

Makes sense, not sure what happened there 🙃. I am fixing it.

@arodrime (Contributor, Author) commented

@zippolyte I pushed the latest changes; let me know if that works for you.

Also, have you managed to tackle these 2 points?

  • We need to use information about the node itself being up or down, disregarding the status of the other nodes. As of now, a down node will not be shown. I recommend you simulate a node going down and look at this chart in a 3+ node cluster if I am being unclear.
  • Top list: it would be nice to make top lists that can be broken down by multiple variables, not just one. The JSON allows it, but a save from the UI breaks it, as the UI only accepts one variable there (for example, a top list of disk space used broken down by host AND device would be awesome).

I would say these 2 fixes (node status and disk space top list) are really the most important changes needed. If you need more details, I am happy to discuss them.

@zippolyte (Contributor) left a comment

Hi @arodrime thanks for the changes. Just a couple of issues with the indentation again.

Regarding the node status, I'm not sure I understand what the problem is; I'll investigate.
For the other point, since the JSON allows it, I was able to break down by two variables. You can do it by saving the graph from the JSON tab if you want to see how it looks.
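
For reference, a sketch of the graph JSON for the disk-space example, broken down by both host and device (the metric name and limit here are illustrative):

{
  "viz": "toplist",
  "requests": [
    {
      "q": "top(avg:system.disk.in_use{*} by {host,device}, 10, 'mean', 'desc')"
    }
  ]
}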

environment: default_environment
datacenter: default_datacenter
environment: default_environment
datacenter: default_datacenter
@zippolyte:

The indentation was good before

@arodrime:

Ok fixing this.

- system_auth
- system_distributed
- system_schema
- system_traces
@zippolyte:

This exclude section was indented correctly before: it should be at the same level as the include section.

@arodrime:

Oh, of course. ok, fixing this as well.

- system_auth
- system_distributed
- system_schema
- system_traces
@zippolyte:

Same for this exclude section

@arodrime:

I'll have a careful look at the whole file... I am obviously terrible at managing indentation 🙃.

@arodrime (Contributor, Author) commented

Regarding the node status, I'm not sure I understand what the problem is; I'll investigate.

I believe we need to test the node status directly (instead of relying on each node's perspective on the cluster state). We need a value of 1 (up) or 0 (down), not a number that depends on the cluster size, to be able to monitor nodes going down properly.

When nodes are up, all good.

[screenshot: nodes up]

When a node is down, the numbers shown are counter-intuitive, but you can still guess what's happening. Not ideal, though.

[screenshot: 1 node down]

When 2 nodes are down (the moment you really need monitoring), it's hard to read. Try to guess which 2 nodes were down below.

[screenshot: 2 nodes down]

Playing with the palette and the aggregation method (avg, sum, min, max) does not help much. That's why I think we need to know the status of each node, by trying to connect to it somehow. The status value should be 1 or 0, independently of the number of nodes and the state of the other nodes.

It's a bit tricky to explain in writing; I hope the screenshots help. Testing it as you suggested is probably the best way to go. We can also get on a call if need be.

@truthbk modified the milestones: 5.18, 5.19 — Oct 4, 2017
@zippolyte (Contributor) left a comment

Hey @arodrime, sorry for the delay.

The changes look good to me now, so I'm approving this PR.

Regarding the node status, what's happening is that nodetool gives information about the whole cluster. So each agent where the cassandra_nodetool integration is enabled will pick up a nodetool.status.status metric for each node of the cluster, tagged with node_id, node_address, datacenter and rack. You can therefore get the 0-or-1 value if you select another visualization and break down by node_id or node_address. Unfortunately, it's not possible in the host map visualization, because it automatically breaks down metrics by host, which is the host the agent sending the metric is installed on.
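
So, using the metric and tags described above, a toplist or timeseries query along these lines surfaces the per-node 0-or-1 value (a sketch):

min:nodetool.status.status{*} by {node_address}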

@arodrime (Contributor, Author) commented

The changes look good to me now, so I'm approving this PR.

That's good news 👍 .

On the dashboard (UI) side of things: while writing about the dashboards, I made some fixes and gave the changelog to @irabinovitch.

@zippolyte here is a solution, as you suggested, to get the status chart working. Let me know your thoughts.

[screenshot: node_status]

I am not sure I understand why the down node is not showing in red, though.

Let me know if you are still missing something or if we are good to go like this for this version :-).

Thanks for all the work and the nice communication on this over the last few weeks.

@zippolyte (Contributor) commented

I think using the top list for the status is the best alternative, even though it's not ideal. We should be good to go after I integrate your dashboard fixes into our own.
Thanks a lot again for your great work on this, @arodrime!

@zippolyte modified the milestones: 5.19, 5.20 — Nov 3, 2017
@zippolyte merged commit 29b848b into DataDog:master — Nov 3, 2017
@arodrime deleted the tlp-dashboards-quick-fixes branch — April 7, 2021