
BigGraphite Announcement


Graphite

People love or hate Graphite (https://graphiteapp.org/, https://github.com/graphite-project), and whatever your take on it, it's a huge part of the open-source monitoring ecosystem. It comes with a Django-based real-time graphing system (Graphite Web), a metric relay and aggregator (Carbon), and two file-based TSDBs (Whisper and Ceres).

Graphite uses a push model, where applications periodically send points to a metric receiver. It also has fixed-period retention and doesn't allow (or work well with) dynamic resolutions. This means that applications usually send their metrics every minute or so.

Graphs are then rendered using Graphite Web directly, or using a frontend such as Grafana.

Distributed Time Series Databases

A Time Series Database, also known as a TSDB, is a database optimized to handle time series. Time series can be seen as a sort of Map<String, SortedMap<DateTime, Double>>: in other words, a two-dimensional map indexed by a string (the metric name) and a time, whose values are doubles. Each metric is basically associated with a series of points. This is pretty useful to represent the evolution of counters and gauges over time.
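
As a toy illustration of that data model (not actual BigGraphite code), a time series store boils down to something like this, with the inner map sorted by time when it is read:

from collections import defaultdict
from typing import Dict

# Illustration only: Map<String, SortedMap<DateTime, Double>>, with timestamps
# as plain numbers and the inner map sorted on demand.
series: Dict[str, Dict[float, float]] = defaultdict(dict)

def record(metric: str, timestamp: float, value: float) -> None:
    """Store one point for a metric."""
    series[metric][timestamp] = value

def fetch(metric: str, start: float, end: float):
    """Return the points of a metric in [start, end), sorted by time."""
    points = series.get(metric, {})
    return sorted((ts, v) for ts, v in points.items() if start <= ts < end)

record("servers.web01.cpu.user", 1480940400, 12.5)
record("servers.web01.cpu.user", 1480940460, 13.1)
print(fetch("servers.web01.cpu.user", 1480940400, 1480941000))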

A distributed TSDB is useful when the load can't be handled by a single machine, or when one wants to improve availability in case of a machine outage. It's the same concept as a regular distributed database.

For us, the need was obvious: we use Graphite as our main monitoring system, and we don't want the loss of a single machine to affect anyone. We also want to be able to simply add new machines if we need more capacity, with minimal operational cost.

Graphite clustering

The first limit you'll hit with Graphite is likely to be storage capacity. Today's machines have a few TiB of storage, but that is a limit that is easy to reach. Carbon comes with ways to shard metrics across multiple carbon-cache instances: it basically uses consistent hashing (see the sketch after the diagram below).

When retrieving the points to display a graph, graphite-web will simply query every machine storing points and merge the results together.

This basically looks like this:

                              +------------------+  +---+
                              |                  |  |   |
                     +-------->  carbon-cache 0  |  |web<-----+
+----------------+   |        |                  |  |   |     | +----------------+
|                |   |        +------------------+  +---+     | |                |
|  carbon-relay  +---+                                        +-+  graphite-web  |
|                |   |        +------------------+  +---+     | |                |
+-------^--------+   |        |                  |  |   |     | +---------^------+
        |            +-------->  carbon-cache 1  |  |web<-----+           |
        |                     |                  |  |   |                 |
        |                     +------------------+  +---+                 |
        +                                                                 +
     points                                                             graphs
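
A minimal sketch of the sharding idea mentioned above: each carbon-cache instance owns several positions on a hash ring, and a metric is routed to the instance owning the first position after the metric's hash. This is an illustration only, not Carbon's actual ConsistentHashRing implementation.

import hashlib
from bisect import bisect

class HashRing:
    """Simplified consistent hash ring routing metric names to instances."""

    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` positions on the ring to smooth the distribution.
        self.ring = sorted(
            (self._hash("%s:%d" % (node, i)), node)
            for node in nodes
            for i in range(replicas)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def get_node(self, metric):
        # First ring position at or after the metric's hash, wrapping around.
        index = bisect(self.ring, (self._hash(metric),)) % len(self.ring)
        return self.ring[index][1]

ring = HashRing(["carbon-cache-0", "carbon-cache-1"])
print(ring.get_node("servers.web01.cpu.user"))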


This works, and even has replication features that can be used to make the loss of any machine a no-op. There are even a few tools, like carbonate and carbonate-utils, to make it easier to add, remove or resync machines. We've been using that for a few years now, and we found the following drawbacks to the current solution:

  • The infrastructure is complex to configure and understand:
    • with three kinds of carbon daemons (relay, aggregator, cache)
    • graphite-web on the cache machines, and graphite-web as a frontend
    • configuration must be kept in sync, even the order of the machines matters
  • The clustering code in Graphite-Web isn't particularly efficient and is rather fragile
    • If any of the Web instances on the storage machines misbehaves, it's likely to affect the system as a whole
  • Adding carbon storage machines is painful to do without downtime
  • Re-synchronizing replicated nodes takes days

We need something better

We can't afford to keep putting more man-hours into maintenance of the current system. Because of that, we started looking into something that could fit our needs better. We came up with the following requirements:

  • Perfect Graphite compatibility on the write and read paths: we have hundreds of users, thousands of dashboards and we didn't want to break any of this.
  • Easy to configure and administrate: as few settings as possible.
  • Easy to scale up: adding storage capacity should take minutes.
  • Highly available: the loss of any 2 machines should not affect the system.
  • Smooth migration path: importing legacy data must be easy.
  • Fully open-source

Existing solutions, and why they didn't work for us

Initially we looked at existing open-source solutions to see if any would work for us.

Cyanite

Cyanite is a daemon which provides services to store and retrieve timeseries data. It aims to be compatible with the Graphite ecosystem. Cyanite stores timeseries data in Apache Cassandra. See http://cyanite.io/concepts.html to learn more about Cyanite's infrastructure.

On the write path, Cyanite supports the Carbon protocols, with notable differences for retention policies and aggregations. On the read path, Cyanite has its own API with a basic Graphite plugin to make it Graphite compatible.

We chose not to go with Cyanite for the following reasons:

  • All the points are stored in the same table, which makes compaction inefficient when you have different resolutions.
  • The current index implementation depends on ElasticSearch, which means that two database systems are needed.
  • The Cassandra schema will generate wide-rows for metrics with many points


InfluxDB

InfluxDB is another TSDB. The fact that the clustering feature wasn't available at the time, and would only be available in the paid version, was a no-go for us: https://www.influxdata.com/update-on-influxdb-clustering-high-availability-and-monetization/.


OpenTSDB

OpenTSDB is built on top of HBase and Hadoop. See http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html for a design overview. OpenTSDB supports rollups and tags, and can have a metric hierarchy. There is some Graphite compatibility support (https://github.com/mikebryant/graphite-opentsdb-finder).

The main issue with OpenTSDB is that it requires a Hadoop cluster, and while we already have a few of those, we did not want to depend on such a complex system for our monitoring infrastructure. We also wanted to be able to deploy this in smaller datacenters if needed.

KairosDB

KairosDB is a TSDB similar to OpenTSDB but built on top of Cassandra. It also has some basic Graphite compatibility layers: kairos-carbon, a re-implementation of Carbon in Java that feeds KairosDB, and graphite-kairosdb, a Graphite Web plugin reading from KairosDB (with a dependency on ElasticSearch).

We did not choose KairosDB because its Graphite plugin needs ElasticSearch, because of the outdated Cassandra driver, and because rollups didn't exist at the time. It seems that some of these issues are fixed now, but a lot of work would still have to be done to get perfect Graphite compatibility.

BigGraphite

Graphite plugin API

We decided that the best way to integrate with the Graphite ecosystem was to use the Graphite and Carbon plugin systems.

Both of these plugin systems have very straightforward APIs.
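
The rough shape of the two plugin APIs looks like the sketch below: a graphite-web "finder" on the read path and a carbon "database" plugin on the write path. Class and method names here are approximate and for illustration only; they are not BigGraphite's actual classes, see the graphite-web and carbon documentation for the exact interfaces.

class CassandraReader(object):
    """Returns the points of a single metric to graphite-web."""

    def get_intervals(self):
        # Time ranges for which this metric has data.
        ...

    def fetch(self, start_time, end_time):
        # Returns ((start, end, step), values) for the requested window.
        ...

class CassandraFinder(object):
    """Resolves glob queries such as "servers.*.cpu.user" to metric nodes."""

    def find_nodes(self, query):
        # Yields LeafNode(path, reader) and BranchNode(path) objects
        # matching query.pattern.
        ...

class CassandraDatabase(object):
    """Receives points from carbon on the write path."""

    def exists(self, metric):
        ...

    def create(self, metric, retentions, xfilesfactor, aggregation_method):
        ...

    def write(self, metric, datapoints):
        # datapoints is a list of (timestamp, value) pairs.
        ...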

Choosing a distributed database

Then we had to choose which database we would use. Since we need to be able to store hundreds of TB of data and millions of points per second, we obviously had to go with a distributed database. We had multiple options, including Riak (and Riak TS). We spoke with people from Hosted Graphite and Raintank about what they use internally, and we ended up with Cassandra, mostly because the outcome was that "anything would work" and we already had some Cassandra clusters at Criteo.

Time Series with Cassandra

Cassandra isn't as easy to use as a standard MariaDB would be, and the way you store your data will affect your performance a lot.

Naive approach

The naïve approach to timeseries in Cassandra would be:

CREATE TABLE IF NOT EXISTS datapoints (
  metric text,
  timestamp int,
  value double,
  PRIMARY KEY (metric, timestamp)
)  WITH CLUSTERING ORDER BY (timestamp DESC);

This would have the following issues:

  • You usually can't keep all your metrics at the highest resolution: with such a schema, how would you handle downsampling? "Just run a job on the table" isn't an option: rewriting and erasing data would be very costly and would reduce read performance due to tombstones.
  • Partitions usually don't perform well if they have more than 200k values or are bigger than a few hundred MB. This basically means that a metric with a one-second resolution (86,400 points per day) could not have more than about 3 days of points before being too "big".

Final schema

We eventually came up with the following schema:

CREATE TABLE IF NOT EXISTS datapoints_<stage.duration>_<stage.resolution>s (
  metric uuid,           // Metric UUID.
  time_start_ms bigint,  // Lower bound for this row.
  offset smallint,       // time_start_ms + offset * precision = timestamp
  value double,          // Value for the point.
  count int,             // If value is a sum, divide by count to get the avg.
  PRIMARY KEY ((metric, time_start_ms), offset)
)  WITH CLUSTERING ORDER BY (offset DESC)
   AND default_time_to_live = <stage.duration>
   AND memtable_flush_period_in_ms = 300000;

If you want to know more, it is explained in depth in CASSANDRA_DESIGN.md. The main idea is that it creates partitions holding no more than 25k points at once, and tries to optimize things so that a single read is enough for a typical query.
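
As a back-of-the-envelope illustration of the time_start_ms / offset scheme, assuming a flat limit of 25,000 points per partition (the exact constants and edge cases live in CASSANDRA_DESIGN.md and the BigGraphite source):

MAX_POINTS_PER_PARTITION = 25000

def point_coordinates(timestamp_ms, resolution_ms):
    """Map a point timestamp to its (time_start_ms, offset) coordinates."""
    row_size_ms = MAX_POINTS_PER_PARTITION * resolution_ms
    time_start_ms = timestamp_ms - (timestamp_ms % row_size_ms)
    offset = (timestamp_ms - time_start_ms) // resolution_ms
    return time_start_ms, offset

# With one point per minute (60,000 ms resolution) a partition spans
# 25,000 minutes, i.e. a bit more than 17 days.
print(point_coordinates(1480940460000, 60000))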

We also put quite a lot of thought into the settings of this table, including:

  • The compaction policy, which uses either TWCS or DTCS depending on the version of Cassandra (https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsConfigureCompaction.html)
  • The memtable flush period: under heavy load it would not be possible to push every point through the commit log. Instead we simply bypass the commit log and only flush the memtable once in a while. With the default settings, only 5 minutes of data are at risk of being lost.

Another important thing is that there is one dynamically created table per retention policy. This allows us to store points with the same expiration time in different tables and to get the full benefit of the custom compaction policies. It also makes it easier to manage different resolutions separately (repairs, compactions, etc.).
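
For illustration, here is how table names could be derived from a Graphite-style retention policy, following the datapoints_<stage.duration>_<stage.resolution>s pattern of the schema above. This is a simplified sketch, not BigGraphite's actual naming code.

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def to_seconds(spec):
    """Convert a Graphite duration such as '30d' or '60s' to seconds."""
    return int(spec[:-1]) * UNITS[spec[-1]]

def table_names(retention):
    """One table per stage of a retention policy like '60s:30d,1h:2y'."""
    names = []
    for stage in retention.split(","):
        resolution, duration = stage.split(":")
        names.append("datapoints_%d_%ds" % (to_seconds(duration), to_seconds(resolution)))
    return names

print(table_names("60s:30d,1h:2y"))
# ['datapoints_2592000_60s', 'datapoints_63072000_3600s']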


But, does it work?

BigGraphite is currently running in our test environment, and we have a setup in production that isn't fully used yet. Here are some numbers:

Environment   Carbon servers   Cassandra servers   Points/sec   Queries/sec   Storage   Metrics
test          3                4                   100k         10            500 GiB   2M
prod          8                20                  500k         100           5 TiB     20M

Servers have 20 cores, 100GiB of RAM and 2TiB of HDD each. The carbon servers do not have disks.

So far it has proved quite stable, apart from a few Cassandra hiccups that we are currently debugging. We aren't fully happy with the read performance yet, mostly for queries trying to list metrics based on globs (see CASSANDRA-12915 for the details), but it is honestly quite good for a first release (most sane queries take less than 500ms).

We would also like to reduce the resource usage for writes: currently we top out at ~60k writes/sec per server, and ideally we would like to do more. The bottleneck so far seems to be the CPU (writes are completely asynchronous), and we haven't seriously tried to profile Cassandra yet. We will publish our results later if we make any progress.

But overall, this is a good replacement for Graphite as soon as you want some redundancy and have more than a few hundred GB of data. You only need three machines hosting Cassandra, Carbon and Graphite-Web and you're good to go! We would be very happy to hear your feedback.

The future

Here is what we will be trying to build in the future:

  • Administration: work on continuous fsck/repairs/cleanups.
  • Native API: build a native API to access BigGraphite internals and integrate with other tools.
  • Compatibility: add OpenTSDB and InfluxDB writers (and maybe readers?)
  • Drivers: RiakTS would be a good candidate to write another database driver.