Reimann

Reimann is a network event stream processor. It is designed for monitoring, analytics, and alerts for events from multiple services. Reimann listens on port 5555 for protocol buffer messages containing events and processes them through various streams.

You can use Reimann to graph the average rate of requests in your application, email responsible parties every time an exception is thrown, and plot the 50th, 95th, and 99th percentile latencies for your HTTP service. It is a tool to make writing comprehensive, site-specific analytics easy.

Configuration Guide
API docs
Clojars
Clients: Ruby, Clojure
Tools
Dashboard

Installation

Tarball

wget http://aphyr.com/reimann/reimann-0.0.3.tar.bz2
tar xvfj reimann-0.0.3.tar.bz2
cd reimann-0.0.3
sudo $EDITOR etc/reimann.config
bin/reimann

Debian Package

wget http://aphyr.com/media/reimann_0.0.3.deb
sudo dpkg -i reimann_0.0.3.deb
sudo $EDITOR /etc/reimann/reimann.config
reimann

Warning: the .deb will overwrite /etc/reimann/reimann.config. I haven't figured out how to get lein-deb to play nicely with conffiles yet.

Events

Events are generated by various programs and sent to Reimann servers over protocol buffers. They might be indications of state, notifications about a daemon humming along, or transient alerts. Reimann represents events as structs. For instance:

"DB2 is down!"

{:host "db2" :state "down"}

"The rails app is OK."

{:service "rails" :state "ok"}

"db5.tx is using 30% of its memory."

{:host "db5.tx" :service "memory" :metric 0.3}

"The rails app on web1 just caught an exception!"

{:host "web1"
 :service "rails" 
 :state "exception"
 :description "Errno::ENOENT at some_long_stacktrace..."}

"The feed processor just finished 16 items for these two users."

{:service "feed processor"
 :state "ok"
 :metric 16
 :tags ["jamal" "deidre"]}

"There are 25000 active users. Consider this number valid for an hour."

{:service "active users"
 :metric 25000
 :ttl 3600}

In full:

event {
  host: A hostname, e.g. "api1", "foo.com"
  service: e.g. "API port 8000 reqs/sec"
  state: Any string less than 255 bytes, e.g. "ok", "warning", "critical"
  time: The time that the service entered this state, in unix time
  description: Freeform text
  tags: Freeform list of strings, e.g. ["rate", "fooproduct", "transient"]
  ttl: A floating-point time, in seconds, that this event is considered
       valid for. Expired states may be removed from the index.
  metric: A number associated with this event, e.g. the 
          number of reqs/sec.
}

All fields are optional.

Streams

A stream is a function that accepts a single event. Many streams accept children, to which they forward events under certain conditions. Together, the streams form a directed graph along which events flow.

A stream can filter the events it receives, passing on only those that match some predicate. It can pass a changed event on to its children, or fork into several distinct substreams. Streams can compute percentiles, rates, or moving averages. They can send email about the events they receive, forward events to other Reimann servers, or send them to Graphite. Any Clojure function accepting an event map can be a stream.
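To make the "function with children" idea concrete, here is a minimal sketch in Python rather than Clojure. It is only a model of the concept: the names where, with_fields, and collect are hypothetical helpers invented for this example and are not taken from Reimann's API.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Stream = Callable[[Event], None]


def where(predicate: Callable[[Event], bool], *children: Stream) -> Stream:
    """Filtering stream: forward an event to every child only if it matches."""
    def stream(event: Event) -> None:
        if predicate(event):
            for child in children:
                child(event)
    return stream


def with_fields(changes: Event, *children: Stream) -> Stream:
    """Forward a changed copy of each event; the original is left untouched."""
    def stream(event: Event) -> None:
        changed = {**event, **changes}
        for child in children:
            child(changed)
    return stream


def collect(sink: List[Event]) -> Stream:
    """Leaf stream: record events in a list (a stand-in for email or Graphite)."""
    return sink.append


all_events: List[Event] = []
alerts: List[Event] = []

# Fork into two substreams: one records everything, the other tags and
# records only events whose state is "down".
top = where(
    lambda e: True,
    collect(all_events),
    where(lambda e: e.get("state") == "down",
          with_fields({"tags": ["alert"]}, collect(alerts))),
)

top({"host": "db2", "state": "down"})
top({"service": "rails", "state": "ok"})
```

After the two calls, all_events holds both events and alerts holds only the tagged copy of the db2 event. Composing plain functions this way is what lets the streams form a directed graph.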

The Index

A special type of stream updates the index: a table of the current state of all services tracked by Reimann. Events entered into the index have a :ttl field; states that sit in the index longer than their :ttl are removed from it and reinserted into the event streams with state :expired. This means that services which fail to check in regularly enough can trigger alerts.
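The expiry behaviour can be sketched as a small table keyed by host and service. This is a toy model in Python, not Reimann's index: the Index class, its update and expire methods, and the default_ttl fallback are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple

Event = Dict[str, Any]
Key = Tuple[Optional[str], Optional[str]]


class Index:
    """Toy index: the latest event per (host, service), expired by TTL."""

    def __init__(self, default_ttl: float = 60.0) -> None:
        # default_ttl is an assumption, used only when an event has no ttl.
        self.default_ttl = default_ttl
        self.states: Dict[Key, Event] = {}

    def update(self, event: Event, now: float) -> None:
        """Store the event, replacing any earlier state for the same key."""
        stored = dict(event)
        stored.setdefault("time", now)
        self.states[(event.get("host"), event.get("service"))] = stored

    def expire(self, now: float) -> List[Event]:
        """Remove states older than their ttl and return 'expired' events."""
        expired = []
        for key, state in list(self.states.items()):
            ttl = state.get("ttl", self.default_ttl)
            if now - state["time"] > ttl:
                del self.states[key]
                expired.append({"host": key[0], "service": key[1],
                                "state": "expired", "time": now})
        return expired


index = Index()
index.update({"service": "active users", "metric": 25000, "ttl": 3600}, now=0)
index.update({"host": "db2", "state": "ok", "ttl": 10}, now=0)

# db2 has not checked in for 11 seconds, so only its state expires.
expired = index.expire(now=11)
```

In a real deployment the events returned by expire would be fed back into the streams, which is how a service that stops reporting can end up triggering an alert.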

Querying

Clients can query the index for particular states.

# Simple equality
state = "ok"

# Wildcards
(service =~ "disk%") or (state != "critical" and host =~ "%.trioptimum.com")

# Standard operator precedence applies
metric_f > 2.0 and not host = nil

# All states
true

# No states
false

Query messages return a list of matching states.
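As a rough illustration of how the wildcard query above selects states, the sketch below evaluates it by hand in Python. It assumes that % in a =~ pattern matches any run of characters, in the style of SQL LIKE; the like and matches helpers are hypothetical and say nothing about how the real query grammar is implemented.

```python
import re
from typing import Any, Dict, Optional

State = Dict[str, Any]


def like(pattern: str, value: Optional[str]) -> bool:
    """Assumed =~ semantics: '%' matches any run of characters."""
    if value is None:
        return False
    # Escape the literal parts and join them with '.*' where '%' appeared.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def matches(state: State) -> bool:
    """Hand translation of:
    (service =~ "disk%") or (state != "critical" and host =~ "%.trioptimum.com")
    """
    return (like("disk%", state.get("service"))
            or (state.get("state") != "critical"
                and like("%.trioptimum.com", state.get("host"))))


states = [
    {"host": "db1.example.com", "service": "disk /", "state": "critical"},
    {"host": "web1.trioptimum.com", "service": "rails", "state": "ok"},
    {"host": "web2.trioptimum.com", "service": "rails", "state": "critical"},
    {"host": "db5.tx", "service": "memory", "state": "ok"},
]

results = [matches(s) for s in states]
```

Under that assumption the first state matches on its service, the second on its host, and the last two do not match at all.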

Configuration

When run, reimann loads a configuration file: either the first command-line argument or a file in the current directory called reimann.config. This file is a Clojure program evaluated in the context of the reimann.config namespace.

The configuration guide is here:

https://github.com/aphyr/reimann/blob/master/reimann.config.guide

Protocol

A connection to Reimann is a stream of messages. Each message is a 4-byte network-endian (big-endian) integer length, followed by a Protocol Buffer message of that many bytes. See proto/reimann/proto.proto for the protobuf particulars.
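The length-prefixed framing can be sketched in a few lines of Python. The payload is treated as opaque bytes here; in practice it would be a serialized message built from proto/reimann/proto.proto. The helper names frame, read_exactly, and read_frame are made up for this example, and the length is packed as an unsigned integer, which is an assumption that makes no difference for ordinary message sizes.

```python
import io
import struct
from typing import BinaryIO


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length as a 4-byte big-endian integer."""
    return struct.pack("!I", len(payload)) + payload


def read_exactly(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes, looping because reads may return short."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("connection closed in the middle of a message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_frame(stream: BinaryIO) -> bytes:
    """Read one length-prefixed message and return its payload."""
    (length,) = struct.unpack("!I", read_exactly(stream, 4))
    return read_exactly(stream, length)


# Two placeholder payloads back to back, as they would appear on a connection.
wire = frame(b"first") + frame(b"second message")
buffer = io.BytesIO(wire)
first = read_frame(buffer)
second = read_frame(buffer)
```

With a real TCP connection, the file-like object returned by a socket's makefile("rb") method could be passed to read_frame in place of the BytesIO buffer.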

The server will accept a repeated list of Events, and respond with a confirmation message with either an acknowledgement or an error. Check the ok boolean in the message; if false, message.error will be a descriptive string.

Because protocol buffers are strongly typed, the metric field is represented on the wire as metric_f. At some point I'll add an int64 as well.

Events are uniquely identified by host and service. Both allow null. Event.time is the time in unix epoch seconds and is optional. The server will generate a time for each event when received if you do not provide one.

You can also query states from the index using a basic query language. The grammar is specified in src/reimann/query.g. Just submit a Message with your query in message.query.string. Search queries will return a message with repeated States matching that expression. A null expression will return no states.

Plan

I built Reimann with the goal of getting it out the door as quickly as possible. There are many slow or kludgy parts, but they should all be readily replaceable as I find the time. At the top of my list:

  • Add a raw Netty UDP listener for accepting events. We lose a lot of time to aleph.tcp.
  • Use Korma/HSQL to implement a faster index for query-heavy installations.
  • Think more carefully about time-partitioning functions.
  • Reservoir sampling
  • Event pubsub
