Cassandra Monitoring

This section describes some of the metrics {page-component-title} collects from a Cassandra cluster.

JMX must be enabled on the Cassandra nodes and made accessible from {page-component-title} in order to collect these metrics. See Enabling JMX authentication and authorization for details.

The data collection is bound to the agent IP interface with the service name JMX-Cassandra. The JMXCollector retrieves the MBean entities from the Cassandra node.

Client connections

Collects the number of active client connections from org.apache.cassandra.metrics.Client:

Name	Description
connectedNativeClients	Metrics for connected native clients
connectedThriftClients	Metrics for connected thrift clients

Compaction bytes

Collects the following compaction manager metrics from org.apache.cassandra.metrics.Compaction:

Name	Description
BytesCompacted	Number of bytes compacted since node started.

Compaction tasks

Collects the following compaction manager metrics from org.apache.cassandra.metrics.Compaction:

Name	Description
CompletedTasks	Estimated number of completed compaction tasks.
PendingTasks	Estimated number of pending compaction tasks.

Storage load

Collects the following storage load metrics from org.apache.cassandra.metrics.Storage:

Name	Description
Load	Total disk space (in bytes) this node uses.

Storage exceptions

Collects the following storage exception metrics from org.apache.cassandra.metrics.Storage:

Name	Description
Exceptions	Number of unhandled exceptions since start of this Cassandra instance.

Dropped messages

Measurement of messages that were DROPPABLE. These ran after a given timeout set per message type so were discarded. In JMX, access these via org.apache.cassandra.metrics.DroppedMessage. The number of dropped messages in the different message queues are good indicators whether a cluster can handle its load.

Name	Stage	Description
Mutation	MutationStage	If a write message is processed after its timeout (write_request_timeout_in_ms), it either sent a failure to the client or it met its requested consistency level and will relay on hinted handoff and read repairs to do the mutation if it succeeded.
Counter_Mutation	MutationStage	If a write message is processed after its timeout (write_request_timeout_in_ms), it either sent a failure to the client or it met its requested consistency level and will relay on hinted handoff and read repairs to do the mutation if it succeeded.
Read_Repair	MutationStage	Times out after write_request_timeout_in_ms.
Read	ReadStage	Times out after read_request_timeout_in_ms. No point in servicing reads after that point since it would have returned an error to the client.
Range_Slice	ReadStage	Times out after range_request_timeout_in_ms.
Request_Response	RequestResponseStage	Times out after request_timeout_in_ms. Response was completed and sent back but not before the timeout.

Thread pools

Apache Cassandra is based on a staged event-driven architecture (SEDA). This separates different operations in stages. These stages are loosely coupled using a messaging service. Each of these components uses queues and thread pools to group and execute its tasks. The documentation for Cassandra thread pool monitoring originated from the Pythian Guide to Cassandra Thread Pools.

Table 1. Collected metrics for Thread Pools

Name	Description
ActiveTasks	Tasks that are currently running.
CompletedTasks	Tasks that have been completed.
CurrentlyBlockedTasks	Tasks that have been blocked due to a full queue.
PendingTasks	Tasks queued for execution.

Memtable FlushWriter

Sort and write memtables to disk from org.apache.cassandra.metrics.ThreadPools. A majority of time this backing up is from overrunning disk capability. Sorting can cause issues as well, usually accompanied with high load but a small amount of actual flushes (seen in cfstats). The cause can be from huge rows with large column names; in other words, something inserting many large values into a CQL collection. If overrunning disk capabilities, add nodes or tune the configuration.

Tip	Alerts: pending > 15 \|\| blocked > 0

Memtable post flusher

Operations after flushing the memtable. Discard commit log files that have had all data in them in sstables. Flushing non-cf backed secondary indexes.

Tip	Alerts: pending > 15 \|\| blocked > 0

Anti-entropy stage

Repairing consistency. Handle repair messages like merkle tree transfer (from validation compaction) and streaming.

Tip	Alerts: pending > 15 \|\| blocked > 0

Gossip stage

If you see issues with pending tasks, monitor logs for a message:

Gossip stage has {} pending tasks; skipping status check ...

Check NTP working correctly and attempt nodetool resetlocalschema or the more drastic deleting of system column family folder.

Tip	Alerts: pending > 15 \|\| blocked > 0

Migration stage

Making schema changes

Tip	Alerts: pending > 15 \|\| blocked > 0

MiscStage

Snapshotting, replicating data after node remove completed.

Tip	Alerts: pending > 15 \|\| blocked > 0

Mutation stage

Performing a local including:

insert/updates
schema merges
commit log replays
hints in progress

Similar to ReadStage, an increase in pending tasks here can be caused by disk issues, overloading a system, or poor tuning. If messages are backed up in this stage, you can add nodes, tune hardware and configuration, or update the data model and use case.

Tip	Alerts: pending > 15 \|\| blocked > 0

Read stage

Performing a local read. Also includes deserializing data from row cache. Pending values can cause increased read latency. This can spike due to disk problems, poor tuning, or overloading your cluster. In many cases (not disk failure) resolve this by adding nodes or tuning the system.

Tip	Alerts: pending > 15 \|\| blocked > 0

Request response stage

When a response to a request is received this is the stage used to execute any callbacks that were created with the original request.

Tip	Alerts: pending > 15 \|\| blocked > 0

Read repair stage

Performing read repairs. Chance of them occurring is configurable per column family with read_repair_chance. More likely to back up if using CL.ONE (and to lesser possibly other non-CL.ALL queries) for reads and using multiple data centers. It will then be kicked off asynchronously outside of the queries feedback loop. Note that this is not likely to be a problem since it does not happen on all queries and quickly provides good connectivity between replicas. The repair being droppable also means that after write_request_timeout_in_ms it will be discarded, which further mitigates this. If pending grows, attempt to lower the rate for high-read CFs.

Tip	Alerts: pending > 15 \|\| blocked > 0

JVM metrics

Also collects some key metrics from the running Java virtual machine:

java.lang:type=Memory: The memory system of the Java virtual machine. This includes heap and non-heap memory.
java.lang:type=GarbageCollector,name=ConcurrentMarkSweep: Metrics for the garbage collection process of the Java virtual machine

Tip	If you use Apache Cassandra for running Newts you can also enable additional metrics for the Newts keyspace.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cassandra-jmx.adoc

cassandra-jmx.adoc

Cassandra Monitoring

Client connections

Compaction bytes

Compaction tasks

Storage load

Storage exceptions

Dropped messages

Thread pools

Memtable FlushWriter

Memtable post flusher

Anti-entropy stage

Gossip stage

Migration stage

MiscStage

Mutation stage

Read stage

Request response stage

Read repair stage

JVM metrics

Files

cassandra-jmx.adoc

Latest commit

History

cassandra-jmx.adoc

File metadata and controls

Cassandra Monitoring

Client connections

Compaction bytes

Compaction tasks

Storage load

Storage exceptions

Dropped messages

Thread pools

Memtable FlushWriter

Memtable post flusher

Anti-entropy stage

Gossip stage

Migration stage

MiscStage

Mutation stage

Read stage

Request response stage

Read repair stage

JVM metrics