# Observing IT Systems

## Data Sources

Provided by the Application:
* Logs
* Status endpoints (`/stats.json`, `/proc`)
* Instrumentation (`StatsD.gauge('Base.queued', 12)`, Zipkin)

Provided by the Runtime:
* JMX, PyMX
* Debuggers (gdb, jdb, pdb)
* Profilers

Provided by the Kernel:
* Dynamic tracing (DTrace, eBPF) 
* System profilers (perf)
* Network sniffing (pcap)

Provided by Hardware:
* PCM/MSR registers (Intel)
* SMART disk interface
* RAID card interfaces

# Common Denominator: Events

All the above data sources can be regarded as emitting events:

```
{ event="HTTP_START", reqid=2398721, endpoint="/", srcip="10.8.20.132", t=1535120313.123 }
{ event="HTTP_END", reqid=2398721, status="OK", bytes=125232, t=1535120313.123  }
{ event="HTTP", endpoint="/", duration[ms]=1.23, status="200", srcip="10.8.20.132", t=1535120313.123 }
{ event="funccall", name="Customer.payment", args=[120, "USD"], t=1535120313.123 }
{ event="cpu_idle", value=1231522, unit=jiffy, t=1535120313 }
{ event="syscall", name="read", pid=1231, duration[us]=18.1, t=1535120313.123123152 }
{ event="schedule_on_cpu", pid=1231, t=1535120313.123123152 }
{ event="schedule_off_cpu", pid=1231, t=1535120313.123123152 }
```

<!--

Remarks:
- Events seem to be used as a term for state changes.
- Some of the data sources are not attached to state changes, e.g. "disk free %", but more to the state the system is in. We can introduce measurement "events", to make that fit our description here.
- Events as K-V pairs is relatively arbitrary.

-->

# Geometric View on Events

Events are points in a high dimensional space:

- Each attribute name defines an axis.
- Each attribute value determines the location on that axis.
- Attributes that are not set are assigned a special value `undef`.

![](../img/events.png)
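
The geometric view can be sketched in a few lines of Python. This is only an illustration: the two events are hypothetical, modeled on the examples above, and Python's `None` stands in for `undef`.

```python
# Hypothetical events, modeled on the examples above.
events = [
    {"event": "HTTP", "endpoint": "/", "duration_ms": 1.23, "status": "200"},
    {"event": "syscall", "name": "read", "pid": 1231, "duration_us": 18.1},
]

# Every attribute name that occurs in any event becomes one axis.
axes = sorted({name for e in events for name in e})

# Each event becomes one point; attributes that are not set map to None ("undef").
points = [tuple(e.get(axis) for axis in axes) for e in events]

print(axes)
for p in points:
    print(p)
```

Two small events already span seven axes, which hints at how quickly dimensionality grows.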

# Challenges

- High Dimensionality: Lots of axes
- High Cardinality: Lots of values on a single discrete axis (userid, pid)
- High Volume: Lots of events

# Coping Strategies

* Forget attributes (reduce dimensionality)  
  E.g. forget the "srcip" and "userid" attributes

* Filter by attribute values (reduce dimensionality, volume)  
  E.g. only record events with user_id=25

* Group values (reduce cardinality)  
  E.g. ip => ip range (192.*) or code=204 => 2xx

* **Sampling** (reduce volume)  
  E.g. only record every 100th HTTP request.

* **Aggregation** (reduce volume)  
  over time (within this minute) and across dimensions (e.g. users)
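
A minimal sketch of the strategies above, applied to synthetic data. The event list, the attribute names, and the helper functions `forget` and `group_code` are made up for illustration.

```python
from collections import Counter

# Hypothetical HTTP events; attribute names follow the examples above.
events = [
    {
        "event": "HTTP",
        "srcip": "10.8.20.{}".format(i % 50),
        "user_id": i % 30,
        "code": 204 if i % 10 == 0 else 200,
        "duration_ms": 1.0 + (i % 7),
    }
    for i in range(1000)
]

def forget(event, names):
    """Drop the named attributes (reduces dimensionality)."""
    return {k: v for k, v in event.items() if k not in names}

def group_code(code):
    """Group status codes into classes, e.g. 204 -> '2xx' (reduces cardinality)."""
    return "{}xx".format(code // 100)

slim = [forget(e, {"srcip", "user_id"}) for e in events]   # forget attributes
only_user_25 = [e for e in events if e["user_id"] == 25]   # filter by value
sampled = events[::100]                                    # keep every 100th event
by_class = Counter(group_code(e["code"]) for e in events)  # group, then count

print(len(slim[0]), len(only_user_25), len(sampled), dict(by_class))
```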

# Data Models

(1) Documents
- Events are naturally expressed as JSON documents

- Used by Log Analysis and APM tools: ELK (ElasticSearch/Lucene), Splunk, Honeycomb, Dynatrace ([Cassandra](https://www.dynatrace.com/platform/dynatrace-architecture/))

(2) Tables
- Pre-selecting attributes makes it possible to impose a table structure

- Used by APM tools: NewRelic ([MySQL](http://highscalability.com/blog/2011/7/18/new-relic-architecture-collecting-20-billion-metrics-a-day.html)), VividCortex ([MySQL](http://highscalability.com/blog/2015/3/30/how-we-scale-vividcortexs-backend-systems.html))

(3) Time Series
- Record a single attribute changing over time
- Works like a three column table `(KEY, TIME, VALUE)`
- Discrete TIME axis (10s, 60s) typical
- Continuous VALUE axis (float) typical
- More and more structure is allowed in the KEY field (uuid ~> tag set)

- Used by Monitoring systems, APM tools: Graphite, Prometheus, Influx, Circonus/IRONdb, ...
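
As a rough illustration of the `(KEY, TIME, VALUE)` model with a tag set as key, the sketch below builds a canonical key string from a metric name and tags. The key format and the helper `make_key` are assumptions made for this example, not the format of any of the tools listed above.

```python
from collections import defaultdict

def make_key(name, **tags):
    """Build a canonical series key from a metric name and a tag set."""
    return name + "".join("|{}={}".format(k, v) for k, v in sorted(tags.items()))

# Rows of the three-column table (KEY, TIME, VALUE), on a 60s time grid.
rows = [
    (make_key("http.duration", host="www1", dc="dub"), 1535120280, 1.23),
    (make_key("http.duration", dc="dub", host="www1"), 1535120340, 1.31),
    (make_key("http.duration", host="www2", dc="dub"), 1535120280, 0.98),
]

# Rows with the same key form one time series.
series = defaultdict(list)
for key, time, value in rows:
    series[key].append((time, value))

for key, samples in series.items():
    print(key, samples)
```

Sorting the tags makes the key independent of the order in which tags are given, so the first two rows land in the same series.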

# Tradeoff

* Metrics are compact, aggregated, aggregatable, efficient, but can’t be disaggregated.

* Logs/events are full fidelity but relatively expensive at full capture.

Source: https://www.xaprb.com/slides/devopsdc-what-is-observability/#9

```python
# Numeric Example: Storage Volume Logs vs. Metrics

# Access logs
req_per_second = 3
log_lines_per_req = 1
bytes_per_log_line = 1000
seconds_per_day = 24*60*60
bytes_per_kb = 1000
bytes_per_mb = 1000 * bytes_per_kb
log_mb_per_day = bytes_per_log_line * log_lines_per_req * req_per_second * seconds_per_day / bytes_per_mb
print("Log volume per day: {:,.1f} MB".format(log_mb_per_day))
```

```
Log volume per day: 259.2 MB
```


```python
# Metric Storage: request-rate, error-rate, p50, p95, p99
number_of_metrics = 5
bytes_per_value = 8 + 8 # value + timestamp
aggregation_period_sec = 60
values_per_day = number_of_metrics * seconds_per_day / aggregation_period_sec
metric_bytes_per_day = bytes_per_value * values_per_day
print("Metric volume per day: {:.3f} MB".format(metric_bytes_per_day/bytes_per_mb))
```

```
Metric volume per day: 0.115 MB
```
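
To put the two numbers side by side, here is a self-contained sketch that reuses the same assumptions (1000 bytes per log line, 5 metrics, 16 bytes per value, 60s period) and varies the request rate. The function names are chosen for this example only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def log_bytes_per_day(req_per_second, bytes_per_log_line=1000, log_lines_per_req=1):
    """Log volume grows linearly with the request rate."""
    return bytes_per_log_line * log_lines_per_req * req_per_second * SECONDS_PER_DAY

def metric_bytes_per_day(number_of_metrics=5, bytes_per_value=16, period_sec=60):
    """Metric volume depends only on the metric count and the period."""
    return bytes_per_value * number_of_metrics * SECONDS_PER_DAY / period_sec

for rps in (3, 30, 300):
    ratio = log_bytes_per_day(rps) / metric_bytes_per_day()
    print("{:>4} req/s: logs need {:,.0f}x the storage of metrics".format(rps, ratio))
```

Under these assumptions the metric volume stays constant while the log volume scales with traffic, so the gap widens as load increases.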


# Logs

* Capture full information about an event that occurred, including all attributes
* Log lines should be machine readable
* Various levels of indexing seen in practice
* Very expensive to store in the medium term

Coping strategies:
* Denormalize events
* Sampling

# Metrics
We say that:

> "Events are rolled-up into metrics."

It's a three step process:

* (1) Select Axes
* (2) Group by Time
* (3) Aggregation

### (1) Select Axes

A metric has a single numeric value axis (e.g. "duration").
In addition, it can take arbitrary attributes as a metric key.

For each event attribute we have the choice to:

* Filter by attribute value (e.g. event=HTTP)

* Keep attribute and make it part of the metric key (e.g. `{host=www1,dc=ir/dub}`)

* Forget attribute (e.g. ignore user_id)

Equivalent SQL:

```
SELECT t, (host, dc) as key, duration as value FROM events WHERE event='HTTP';
```

**Caution:** Keeping high-cardinality attributes (e.g. user_id) lets the metric count explode.
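
The same selection can be sketched in Python. The events and the helper `select_axes` are hypothetical; the point is to show how the number of distinct metric keys changes once `user_id` is kept.

```python
# Hypothetical events: 100 HTTP requests from 100 users on 2 hosts, plus one syscall.
events = [
    {
        "event": "HTTP",
        "t": 1535120313.0 + i,
        "host": "www{}".format(i % 2 + 1),
        "dc": "dub",
        "user_id": i,
        "duration": 1.0 + i % 5,
    }
    for i in range(100)
] + [{"event": "syscall", "t": 1535120313.5, "name": "read", "duration": 0.018}]

def select_axes(events, key_attrs):
    """Filter to HTTP events, keep key_attrs as the metric key, forget the rest."""
    return [
        (e["t"], tuple(e[a] for a in key_attrs), e["duration"])
        for e in events
        if e["event"] == "HTTP"
    ]

rows = select_axes(events, ("host", "dc"))
keys = {key for _, key, _ in rows}
keys_with_user = {key for _, key, _ in select_axes(events, ("host", "dc", "user_id"))}

print("metrics keyed by (host, dc):", len(keys))
print("metrics keyed by (host, dc, user_id):", len(keys_with_user))
```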

### (2) Group by Time

If you measure time precisely enough, all events will have different timestamps.

We need to group time values into discrete windows. Typical "periods" are 10s, 60s, and 5m.

Equivalent SQL:

```
SELECT floor(t/period) AS T, duration FROM events WHERE event='HTTP' AND host='www1-eu-dus';
```
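
A small Python sketch of the same windowing step. The function names `window_index` and `window_start` are chosen for this example, and the timestamps are made up.

```python
import math

def window_index(t, period=60):
    """Map a timestamp to the index of its time window, like floor(t/period)."""
    return math.floor(t / period)

def window_start(t, period=60):
    """Timestamp at which the window containing t begins."""
    return window_index(t, period) * period

# Distinct timestamps; the first three fall into the same 60s window.
timestamps = [1535120313.123, 1535120313.124, 1535120339.9, 1535120340.0]
for t in timestamps:
    print(t, window_index(t), window_start(t))
```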

### (3) Aggregate Events

All durations in a given time window need to be summarized into a single aggregate.
Typical aggregation functions are:

* count
* mean
* sum
* min/max
* percentile
* histogram

Equivalent SQL:

```
SELECT 
   floor(t/period) as T, 
   aggregate(duration)
FROM events 
WHERE 
  event='HTTP' AND host='www1-eu-dus'
GROUP BY 1; -- first field
```
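
The grouping and aggregation can be sketched in plain Python as follows. The `(t, duration)` pairs are made-up sample data, and only a few of the aggregation functions listed above are shown.

```python
import math
import statistics
from collections import defaultdict

period = 60

# Hypothetical (t, duration) pairs, already filtered to event='HTTP'.
events = [(0.5, 1.0), (10.0, 3.0), (59.9, 2.0), (60.0, 10.0), (119.0, 20.0)]

# GROUP BY floor(t/period)
windows = defaultdict(list)
for t, duration in events:
    windows[math.floor(t / period)].append(duration)

# One aggregate record per time window.
aggregates = {
    T: {
        "count": len(ds),
        "sum": sum(ds),
        "mean": statistics.mean(ds),
        "min": min(ds),
        "max": max(ds),
    }
    for T, ds in windows.items()
}

for T, agg in sorted(aggregates.items()):
    print(T, agg)
```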

## Summary: Creating Metrics from Events

<img src="../img/metrics.png" style="width:800px"/>

## Higher Rollups

So far we were only concerned with rolling-up metrics to a base period of 60s.
For graphing, indexing, and long-term storage, rollups over higher periods are also used:

<img src="../img/rollup.png" style="height:600px"></img>

# Key Properties for Rollup Aggregation

## Robustness 

Data is noisy. We don't want a single outlier to skew the aggregate.

An aggregation method is *robust* if a small number of (extreme) outliers affects the aggregate only slightly.
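
A quick illustration with made-up durations: the mean is not robust, while the median is. The single outlier of 1000 is an arbitrary choice.

```python
import statistics

durations = [1.0, 1.1, 0.9, 1.2, 1.0]
with_outlier = durations + [1000.0]  # one extreme measurement

mean_before = statistics.mean(durations)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(durations)
median_after = statistics.median(with_outlier)

print("mean:   {:.2f} -> {:.2f}".format(mean_before, mean_after))
print("median: {:.2f} -> {:.2f}".format(median_before, median_after))
```

One outlier moves the mean by more than two orders of magnitude, while the median barely changes.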

## Mergeability

Mergeability is the property of an aggregation method that lets you aggregate aggregates.

* Critical for computing higher rollups, and graphing.
* Critical for cross metric aggregation
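
A minimal sketch with two made-up time windows: count and sum merge by addition, and the mean merges if sum and count are carried along, whereas averaging means or taking the median of medians gives a different result than aggregating the raw data.

```python
import statistics

# Raw durations in two adjacent time windows (hypothetical data).
w1 = [1.0, 2.0, 3.0]
w2 = [10.0]

# Mergeable: count and sum of the merged window follow from the per-window values.
merged_count = len(w1) + len(w2)
merged_sum = sum(w1) + sum(w2)
merged_mean = merged_sum / merged_count  # mean is mergeable via (sum, count)

# Not mergeable on their own: mean of means and median of medians.
mean_of_means = (statistics.mean(w1) + statistics.mean(w2)) / 2
median_of_medians = statistics.median([statistics.median(w1), statistics.median(w2)])

true_mean = statistics.mean(w1 + w2)
true_median = statistics.median(w1 + w2)

print("mean:  ", merged_mean, "vs mean of means", mean_of_means)
print("median:", true_median, "vs median of medians", median_of_medians)
```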


# Statistics for Engineers

What are we doing here today?

* Visualizing Events and Metrics
* Aggregating Events and Metrics
* Sampling Events
* Modeling and predicting system behavior