# Measuring Systems

Many terms floating around:

- Monitoring
- Observability
- Instrumentation
- Log analysis
- High Cardinality
- Tracing

What do they mean? And why should I care?

Inspired by: Baron Schwartz (2018) https://www.xaprb.com/slides/devopsdc-what-is-observability

## Observability is a Property of a System

... just like:

* Usability -- The user can find their way around the front end
* Efficiency -- The application does not waste system resources
* Maintainability -- A software engineer can find their way around the source code
* Testability -- A software engineer could write effective tests for the system (if they wanted to)
* Debuggability -- An SRE/SE can understand pathological behavior (Related: [Cantrill (2018)](https://www.slideshare.net/bcantrill/debugging-under-fire-keeping-your-head-when-systems-have-lost-their-mind))

* Observability -- The operator can inspect the application's behavior while it is running.

We all want Observability. Otherwise we would not be here.

## Data Sources for Observability

Provided by the Application:
* Logs
* Status endpoints (`/stats.json`, `/proc`)
* Instrumentation (`StatsD.gauge('Base.queued', 12)`, Zipkin)

Provided by the Runtime:
* JMX, PyMX
* Debuggers (gdb, jdb, pdb)
* Profilers

Provided by the Kernel:
* Dynamic tracing (DTrace, eBPF) 
* System profilers (perf)
* Network sniffing (pcap)

Provided by Hardware:
* PCM/MSR registers (Intel)
* SMART disk interface

# Common Denominator: Events

All the above data sources can be regarded as emitting events:

```
{ event="HTTP_START", reqid=2398721, endpoint="/", srcip="10.8.20.132", t=1535120313.123 }
{ event="HTTP_END", reqid=2398721, status="OK", bytes=125232, t=1535120313.123  }
{ event="HTTP", endpoint="/", duration[ms]=1.23, status="200", srcip="10.8.20.132", t=1535120313.123 }
{ event="funccall", name="Customer.payment", args=[120, "USD"], t=1535120313.123 }
{ event="cpu_idle", value=1231522, unit=jiffy, t=1535120313 }
{ event="syscall", name="read", pid=1231, duration[us]=18.1, t=1535120313.123123152 }
{ event="schedule_on_cpu", pid=1231, t=1535120313.123123152 }
{ event="schedule_off_cpu", pid=1231, t=1535120313.123123152 }
```

<!--

Remarks:
- Events seem to be used as a term for state changes.
- Some of the data sources are not attached to state changes, e.g. "disk free %", but more to the state the system is in. We can introduce measurement "events", to make that fit our description here.
- Events as K-V pairs is relatively arbitrary.

-->

# Geometric View on Events

Events are points in a high dimensional space:

- Each attribute name gives an axis.
- Each attribute value determines the location on that axis.
- Attributes that are not set are assigned a special value `undef`.

![](../img/events.png)
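
The geometric view can be sketched in a few lines of Python. This is a minimal illustration, assuming events are plain dicts; `UNDEF`, `to_point`, and the attribute names `duration_ms` and `duration_us` are made up for this example.

```python
# Sketch: events as points in a shared attribute space.
UNDEF = None  # stand-in for the special value `undef`

events = [
    {"event": "HTTP", "endpoint": "/", "duration_ms": 1.23, "status": "200"},
    {"event": "syscall", "name": "read", "pid": 1231, "duration_us": 18.1},
]

# Every attribute name seen in any event becomes one axis.
axes = sorted({key for e in events for key in e})


def to_point(event, axes):
    """Return one coordinate per axis; attributes that are not set map to UNDEF."""
    return tuple(event.get(axis, UNDEF) for axis in axes)


points = [to_point(e, axes) for e in events]
print(axes)
for p in points:
    print(p)
```

Two small events already span seven axes, and most coordinates are `undef`. This is the sparsity that makes high dimensionality hard to handle.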

# Challenges

- High Dimensionality: Lots of axes
- High Cardinality: Lots of values on a single discrete axis (userid, pid)
- High Volume: Lots of Events

# Coping Strategies

* Select event sources (reduce dimensionality, volume)  
  E.g. only look at log data

* Select attributes (reduce dimensionality)  
  E.g. only look at (duration, url, host)

* Filter attribute value (reduce dimensionality, volume)  
  E.g. only record events with user_id=25

* Sample Events (reduce volume)  
  E.g. only record every 100th HTTP request.

* Group values (reduce cardinality)  
  E.g. ip => ip range (192.*)

* Aggregate events (reduce volume)  
  over time (within this minute) and across dimensions (e.g. users)
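
As a rough sketch, most of these strategies are one-liners over a list of event dicts. The synthetic event stream and the helper `ip_range` are assumptions made up for illustration; real pipelines would apply the same ideas to a stream rather than an in-memory list.

```python
from collections import defaultdict

# Synthetic events, one per second (hypothetical attribute values).
events = [
    {"t": 1535120313.0 + i, "url": "/", "host": "www1", "user_id": i % 50,
     "srcip": "192.168.0.{}".format(i % 200), "duration": 1.0 + (i % 7)}
    for i in range(1000)
]

# Select attributes: keep only (duration, url, host).
selected = [{k: e[k] for k in ("duration", "url", "host")} for e in events]

# Filter attribute value: keep only events with user_id=25.
filtered = [e for e in events if e["user_id"] == 25]

# Sample events: keep every 100th event.
sampled = events[::100]


# Group values: map each IP to a coarse range (first octet only).
def ip_range(ip):
    return ip.split(".")[0] + ".*"


grouped = {ip_range(e["srcip"]) for e in events}

# Aggregate events: count and mean duration per 60-second window.
windows = defaultdict(list)
for e in events:
    windows[int(e["t"] // 60)].append(e["duration"])
aggregated = {w: (len(d), sum(d) / len(d)) for w, d in windows.items()}

print(len(events), len(filtered), len(sampled), sorted(grouped), len(aggregated))
```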

# Data Models

(1) Documents (e.g. JSON)
- Events are naturally expressed as JSON documents
- Used by Log aggregators, APM tools: ELK, Splunk, Honeycomb, Dynatrace

(2) Tables (Relational Data)
- Pre-selecting attributes allows imposing a table structure
- Used by some APM tools: New Relic, ...?

(3) Metrics / Time Series (TSDB)
- Focus on a single attribute over time
- Like a two-column table (TIME, ATTRIBUTE)
- Discrete time axis (10s, 60s) typical
- Continuous value axis (float) typical
- Used by Monitoring systems, APM tools: Graphite, Prometheus, Influx, Circonus/IRONdb, ...
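
To make the three models concrete, here is a minimal sketch that stores the same HTTP event in each of them. The column selection and the 60-second period are arbitrary choices for illustration, not how any of the tools listed above lay out their data.

```python
import json

event = {"event": "HTTP", "endpoint": "/", "duration_ms": 1.23,
         "status": "200", "srcip": "10.8.20.132", "t": 1535120313.123}

# (1) Document: the event is serialized as-is, with all attributes.
document = json.dumps(event)

# (2) Table row: a fixed, pre-selected list of columns.
columns = ("t", "endpoint", "status", "duration_ms")
row = tuple(event[c] for c in columns)

# (3) Metric sample: one attribute over time, on a discrete time axis.
period = 60  # seconds
sample = (int(event["t"] // period) * period, event["duration_ms"])

print(document)
print(row)
print(sample)
```

Each step drops information: the row loses `srcip`, and the metric sample keeps only a time window and one value.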

# Tradeoff

* Documents/Tables are high "fidelity" but relatively expensive at full capture.

* Metrics are compact, aggregated, efficient, but can’t be disaggregated.

Source: https://www.xaprb.com/slides/devopsdc-what-is-observability/#9

```python
# Numeric Example: Storage Volume Logs vs. Metrics

# Access logs
req_per_second = 5
log_lines_per_req = 1
bytes_per_log_line = 1000
seconds_per_day = 24*60*60
bytes_per_kb = 1000
bytes_per_mb = 1000 * bytes_per_kb
log_mb_per_day = bytes_per_log_line * log_lines_per_req * req_per_second * seconds_per_day / bytes_per_mb
print("Log storage per day {} MB".format(log_mb_per_day))
```

```
Log storage per day 432.0 MB
```

```python
# Metric Storage: rate, p50, p95, p99
number_of_metrics = 4
bytes_per_value = 8 # with compression this goes down to 1-4 bytes
aggregation_period_sec = 60
values_per_day = number_of_metrics * seconds_per_day / aggregation_period_sec
metric_kb_per_day = bytes_per_value * values_per_day / bytes_per_kb
print("Metric storage per day {} kb".format(metric_kb_per_day))
```

```
Metric storage per day 46.08 kb
```


# Metrics

How can we map events to metric data?

### (1) Select Value Axis

* Select a single value axis for the metric (e.g. duration)

* Filter on attributes (e.g. event=HTTP, host=www1-eu-dus)

```
SELECT t, duration FROM events WHERE event='HTTP' AND host='www1-eu-dus';
```

Remark: Filtering on "High Cardinality" leads to "metric explosion".
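
A back-of-the-envelope sketch of why this happens, assuming every distinct combination of filter values becomes its own metric. The cardinalities below are invented for illustration.

```python
from math import prod

# Hypothetical number of distinct values per filter attribute.
cardinality = {"event": 5, "host": 100, "endpoint": 50}

# Upper bound on the number of metrics: one per combination of values.
print(prod(cardinality.values()))  # 25000

# Adding a single high-cardinality attribute multiplies the bound.
cardinality["user_id"] = 1_000_000
print(prod(cardinality.values()))  # 25000000000
```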

### (2) Group by Time

If you measure time precisely enough, all events will have different timestamps.

We need to group time values into discrete windows. Typical "periods" are 10s, 60s, 1h, 3h, 1d.

```
SELECT floor(t/period) AS T, duration FROM events WHERE event='HTTP' AND host='www1-eu-dus';
```

### (3) Aggregate Events

All durations in a given time window need to be summarized into an aggregate.
Typical aggregation functions are:

* count
* mean
* sum
* min/max
* percentile
* histogram

... or multiple at the same time.

This step is often called "rollup".

```
SELECT floor(t/period) AS T, aggregate(duration) FROM events WHERE event='HTTP' AND host='www1-eu-dus' GROUP BY T;
```
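
The three steps can also be written out in plain Python. This is a minimal sketch over a handful of made-up events; the function name `rollup` and the choice of count, mean and max as aggregates are illustrative.

```python
from collections import defaultdict
from statistics import mean

events = [
    {"event": "HTTP", "host": "www1-eu-dus", "t": 1535120313.1, "duration": 1.0},
    {"event": "HTTP", "host": "www1-eu-dus", "t": 1535120320.5, "duration": 3.0},
    {"event": "HTTP", "host": "www2-eu-dus", "t": 1535120321.0, "duration": 9.0},
    {"event": "syscall", "host": "www1-eu-dus", "t": 1535120322.0, "duration": 0.1},
    {"event": "HTTP", "host": "www1-eu-dus", "t": 1535120345.0, "duration": 5.0},
]


def rollup(events, period):
    # (1) Select the value axis (duration) and filter on attributes.
    selected = [(e["t"], e["duration"]) for e in events
                if e["event"] == "HTTP" and e["host"] == "www1-eu-dus"]
    # (2) Group by time: floor(t / period) identifies the window.
    windows = defaultdict(list)
    for t, duration in selected:
        windows[int(t // period)].append(duration)
    # (3) Aggregate all durations in each window.
    return {T: {"count": len(d), "mean": mean(d), "max": max(d)}
            for T, d in sorted(windows.items())}


metrics = rollup(events, period=60)
for T, aggregates in metrics.items():
    print(T, aggregates)
```

Of the five events, three pass the filter and fall into two 60-second windows, so the result contains two data points per aggregate.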

## Summary: Creating Metrics from Events
![](../img/metrics.png)

## Higher Rollups


![](../img/rollup.png)
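
One way to read a higher rollup is as re-aggregating existing rollups into coarser windows. The sketch below assumes each 60-second rollup stores a count and a sum; the data and the function name `higher_rollup` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical 60-second rollups: window start (epoch seconds) -> (count, sum).
minute_rollups = {
    1535119200: (10, 25.0),
    1535119260: (30, 45.0),
    1535122800: (20, 40.0),
}


def higher_rollup(rollups, period):
    """Merge (count, sum) rollups into coarser windows of `period` seconds."""
    merged = defaultdict(lambda: [0, 0.0])
    for start, (count, total) in rollups.items():
        window = start // period * period
        merged[window][0] += count
        merged[window][1] += total
    return {w: (c, s) for w, (c, s) in sorted(merged.items())}


hourly = higher_rollup(minute_rollups, 3600)
for start, (count, total) in hourly.items():
    print(start, count, total / count)
```

Counts and sums merge exactly, so the hourly mean of the first window is 70 / 40 = 1.75. Averaging the two per-minute means (2.5 and 1.5) would give 2.0 instead, which is why the sketch carries counts and sums rather than means.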