# [Site Reliability Engineering](https://landing.google.com/sre/book/)

## Service Level Objectives ([Chapter 4](https://landing.google.com/sre/book/chapters/service-level-objectives.html))
---

Knowing which behaviors matter for a service, and how to measure and evaluate them, is critical to managing that service.

**Service level indicators** (SLIs), **objectives** (SLOs), and **agreements** (SLAs) describe the basic properties of the metrics that matter, what values those metrics should have, and how to react if the expected service can't be provided.

### Indicators
A **quantitative measure** of some aspect of the level of service that is provided, often aggregated into rate, average or percentile.  
Some examples are *Request latency*, *Error rate*, *System throughput*.

An SLI should ideally measure a service level of interest directly. In practice, an SLI may instead measure a proxy for it, e.g. server-side latency as a proxy for client-side latency.

**Availability** is the fraction of the time that a service is usable. It is often expressed in "nines" notation, e.g. 99.9% availability is "three nines".
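
To make the "nines" notation concrete, the sketch below converts an availability target into the downtime it permits. The function names and the one-year window are illustrative assumptions, not something taken from the book.

```python
def availability(uptime_seconds: float, total_seconds: float) -> float:
    """Fraction of the measurement window during which the service was usable."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return uptime_seconds / total_seconds


def allowed_downtime_seconds(target: float, window_seconds: float) -> float:
    """Downtime permitted by an availability target over a window."""
    return (1.0 - target) * window_seconds


YEAR_SECONDS = 365 * 24 * 60 * 60  # assumption: a non-leap year

for label, target in [("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)]:
    minutes = allowed_downtime_seconds(target, YEAR_SECONDS) / 60
    print(f"{label} ({target:.2%}): about {minutes:.0f} minutes of downtime per year")
```

Each extra nine cuts the permitted downtime by a factor of ten, which is why the notation is a convenient shorthand.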

#### Possible Indicators
- **correctness, availability, latency, throughput, durability,** end-to-end latency

#### Collecting Indicators
- Both server and client side

#### Aggregation
- Most metrics are better thought of as **distributions** rather than averages.
    - Averages may miss **instantaneous** load and obscure **tail** latency
- Using **percentiles** helps account for the shape of the distribution.
    - 99th, 99.9th $\to$ worst case
    - 50th (median) $\to$ typical case
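
A small sketch of why percentiles say more than the average. The latency values are made up, and `percentile` is a hypothetical helper that uses the nearest-rank method (one of several common percentile definitions).

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20.0] * 98 + [2000.0] * 2

print("mean:", mean(latencies_ms))
print("p50: ", percentile(latencies_ms, 50))
print("p99: ", percentile(latencies_ms, 99))
```

With this data the mean lands between the two groups and describes neither, while the median shows the typical request and the 99th percentile exposes the slow tail.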

### Objectives
A **target value** or **range of values** for a service level that is measured by an SLI. e.g. $\textbf{SLI}\le\textbf{target}$, or $\textbf{lower bound}\le\textbf{SLI}\le\textbf{upper bound}$
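
The two forms above (a single bound, or a lower and an upper bound) could be represented as follows. This is a minimal sketch; the `SLO` class and the example targets are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    """A target range for one SLI; either bound may be omitted."""
    name: str
    lower: Optional[float] = None
    upper: Optional[float] = None

    def is_met(self, sli_value: float) -> bool:
        """True when the measured SLI lies within the configured bounds (inclusive)."""
        if self.lower is not None and sli_value < self.lower:
            return False
        if self.upper is not None and sli_value > self.upper:
            return False
        return True


# Hypothetical targets, for illustration only.
latency_slo = SLO(name="p99 latency (ms)", upper=100.0)
availability_slo = SLO(name="availability", lower=0.999)

print(latency_slo.is_met(85.0))
print(availability_slo.is_met(0.9985))
```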

Complexities when choosing SLOs:
- An SLO cannot be set for some metrics, e.g. queries per second (QPS) is determined by users.
- Other SLIs can have an SLO, e.g. average latency per request.
- QPS and latency are related: higher QPS often leads to higher latency.

SLOs help set user expectations about the service.

#### Control Measures
SLOs serve as a reference for deciding when to act on a degrading SLI:
1. Monitor and measure the system’s SLIs.
2. Compare the SLIs to the SLOs, and decide whether or not action is needed.
3. If action is needed, figure out what needs to happen in order to meet the target.
4. Take that action.  
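
The four steps above can be sketched as one pass of a loop. Everything here is a stand-in: the SLI names, the thresholds, and the `measure`, `plan`, and `act` callables are assumptions made for illustration, and a real system would plug in its own monitoring and remediation.

```python
from typing import Callable, Dict, List

# Hypothetical SLO targets: SLI name -> maximum acceptable value.
SLO_UPPER_BOUNDS: Dict[str, float] = {
    "p99_latency_ms": 100.0,
    "error_rate": 0.001,
}


def find_violations(slis: Dict[str, float], slos: Dict[str, float]) -> List[str]:
    """Step 2: list the measured SLIs that exceed their SLO."""
    return [name for name, limit in slos.items() if name in slis and slis[name] > limit]


def control_step(
    measure: Callable[[], Dict[str, float]],
    plan: Callable[[str], str],
    act: Callable[[str], None],
) -> List[str]:
    """Run one pass of the measure / compare / decide / act loop."""
    slis = measure()                                       # 1. monitor and measure
    violations = find_violations(slis, SLO_UPPER_BOUNDS)   # 2. compare to the SLOs
    for name in violations:
        action = plan(name)                                # 3. decide what needs to happen
        act(action)                                        # 4. take that action
    return violations


# Usage with stand-in callables.
taken: List[str] = []
violations = control_step(
    measure=lambda: {"p99_latency_ms": 140.0, "error_rate": 0.0004},
    plan=lambda name: f"investigate {name}",
    act=taken.append,
)
print(violations, taken)
```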

#### Safety Margin
Set internal SLOs that are tighter than the ones advertised to users, leaving room to respond to SLI degradation before it becomes visible externally.

### Agreements
A **contract** with users that includes consequences of meeting (or missing) the SLOs.

## Monitoring Distributed Systems ([Chapter 6](https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html))
---

- Monitoring systems should address both symptoms ('what's broken?') and causes ('why?').
- Black-box monitoring is symptom-oriented and represents active, not predicted, problems.
- White-box monitoring inspects system internals through instrumentation, so it can detect imminent problems, failures masked by retries, and so forth.

### The Four Golden Signals
- Latency - The time it takes to service a request.
    - Separate latency of successful requests and failed requests.
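- Traffic - How much demand is being placed on the system.
- Errors - The rate of requests that fail.
- Saturation - How "full" the service is.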
    - Set utilization target on most constrained resources
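
As a rough illustration, the four signals could be derived from a window of request records as shown below. The `Request` type, the field names, and the use of CPU utilization as the saturation measure are assumptions for this sketch. Latency is reported separately for successful and failed requests, as noted above.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests: List[Request], window_s: float, cpu_utilization: float) -> dict:
    """Summarize one observation window into the four golden signals."""
    ok_latencies = [r.latency_ms for r in requests if r.ok]
    failed_latencies = [r.latency_ms for r in requests if not r.ok]
    return {
        "traffic_rps": len(requests) / window_s,
        "error_rate": len(failed_latencies) / len(requests) if requests else 0.0,
        # Kept separate: fast failures would otherwise make latency look better than it is.
        "median_latency_ok_ms": median(ok_latencies) if ok_latencies else None,
        "median_latency_failed_ms": median(failed_latencies) if failed_latencies else None,
        # Assumption: CPU is the most constrained resource for this service.
        "saturation": cpu_utilization,
    }


# Made-up window: three successes and one fast failure over two seconds.
window = [Request(20.0, True), Request(30.0, True), Request(40.0, True), Request(5.0, False)]
signals = golden_signals(window, window_s=2.0, cpu_utilization=0.6)
print(signals)
```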

### Instrumentation and Performance
- Avoid relying on the mean alone. Mean values hide imbalanced or skewed distributions.
- For latency, collect request counts bucketed by latency rather than the actual latency values.
- For CPU utilization,
    1. Record the current CPU utilization each second.
    2. Using buckets of 5% granularity, increment the appropriate CPU utilization bucket each second.
    3. Aggregate those values every minute.
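
The CPU steps above can be sketched as a per-minute histogram. The sample values are made up and the helper names are hypothetical. The same counting approach would work for request latency with bucket edges expressed in milliseconds.

```python
from collections import Counter
from typing import Iterable

BUCKET_WIDTH_PCT = 5  # 5% granularity, as in the steps above


def bucket_for(utilization_pct: float) -> int:
    """Lower edge of the 5%-wide bucket holding the sample; 100% falls in the top bucket."""
    clamped = min(max(utilization_pct, 0.0), 100.0)
    index = min(int(clamped // BUCKET_WIDTH_PCT), 100 // BUCKET_WIDTH_PCT - 1)
    return index * BUCKET_WIDTH_PCT


def aggregate_minute(per_second_samples: Iterable[float]) -> Counter:
    """Count how many one-second samples fell into each bucket."""
    return Counter(bucket_for(sample) for sample in per_second_samples)


# Made-up minute: mostly quiet, with a six-second spike near full utilization.
samples = [12.0] * 54 + [97.0] * 6
histogram = aggregate_minute(samples)

for lower in sorted(histogram):
    print(f"{lower:3d}-{lower + BUCKET_WIDTH_PCT}%: {histogram[lower]} s")
```

The mean of these samples is 20.5%, which would hide the spike entirely, whereas the histogram shows six seconds spent in the 95-100% bucket.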

## Other Chapters
---
Chapters beyond Chapter 6 should be followed up on and reviewed when a specific topic arises in Overwatch.
- Chapter 16 - Tracking Outages
- Chapter 21 - Handling Overload
- Chapter 22 - Addressing Cascading Failures