# **Chapter 19: Observability & Monitoring**

You cannot fix what you cannot see. In distributed systems, where a single request might traverse fifty microservices across three availability zones, understanding system behavior requires more than checking if a server is "up." Observability—the ability to infer internal system state from external outputs—is the foundation of reliability engineering.

This chapter covers the Three Pillars of Observability, modern tooling stacks, and the Site Reliability Engineering (SRE) practices that separate amateur operations from enterprise-grade reliability.

---

## **19.1 The Three Pillars of Observability**

Observability rests on three distinct data types: **Metrics** (numbers over time), **Logs** (discrete events), and **Traces** (request flows). Used together, they transform opaque distributed systems into debuggable, optimizable infrastructure.

### **Pillar 1: Metrics—The Numerical Backbone**

**What they are**: Time-series numerical data points (e.g., "CPU usage was 45% at 10:30 AM").

**Key Characteristics**:
- **Aggregatable**: You can sum, average, or percentile them
- **Low cardinality**: Limited distinct values per label (a few dozen status codes, not millions of user IDs)
- **Efficient**: Cheap to store and query at scale
- **Dimensional**: Tagged with metadata (e.g., `http_requests_total{method="POST",status="500",endpoint="/api/users"}`)

**The RED Method** (For services—what users experience):
```
Rate: Requests per second
Errors: Percentage of failed requests  
Duration: Distribution of request latencies (p50, p95, p99)
```

**The USE Method** (For resources—what infrastructure experiences):
```
Utilization: Percentage of resource busy (CPU, memory, disk)
Saturation: Queue length or work backlog (threads waiting)
Errors: Count of error events (disk failures, network drops)
```
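
As a rough illustration of the RED method, the three numbers can be derived from a batch of request records. The sketch below assumes a hypothetical `Request` record and a nearest-rank percentile; real systems usually compute these from counters and histograms rather than raw request lists.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took
    status: int        # HTTP status code


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (q in 0..100)."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests, window_s):
    """Summarize Rate, Errors, and Duration for requests seen in one window."""
    durations = sorted(r.duration_s for r in requests)
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_pct": 100 * failed / len(requests),
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }


# Hypothetical sample: 100 requests in a 10-second window, 2 of them failing.
sample = [Request(duration_s=0.01 * i, status=200) for i in range(1, 99)]
sample += [Request(duration_s=2.0, status=500), Request(duration_s=3.0, status=503)]
summary = red_summary(sample, window_s=10)
print(summary)
```

Note how the two slow failures barely move p50 and p95 but dominate p99, which is why the Duration signal is reported as a distribution rather than an average.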

**Metric Types**:
1. **Counters**: Monotonically increasing (total requests, bytes sent)
   ```prometheus
   # HELP http_requests_total Total HTTP requests
   # TYPE http_requests_total counter
   http_requests_total{method="GET",status="200"} 1425
   http_requests_total{method="POST",status="500"} 3
   ```

2. **Gauges**: Arbitrary values that go up and down (temperature, queue depth, memory usage)
   ```prometheus
   # HELP queue_length Current items in processing queue
   # TYPE queue_length gauge
   queue_length{queue="payment"} 42
   ```

3. **Histograms**: Distribution of values into buckets (request durations)
   ```prometheus
   # HELP http_request_duration_seconds Request latency
   # TYPE http_request_duration_seconds histogram
   http_request_duration_seconds_bucket{le="0.1"} 240
   http_request_duration_seconds_bucket{le="0.5"} 489
   http_request_duration_seconds_bucket{le="1.0"} 567
   http_request_duration_seconds_bucket{le="+Inf"} 600
   http_request_duration_seconds_sum 186.4
   http_request_duration_seconds_count 600
   ```
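
Because histogram buckets are cumulative, a quantile can only be estimated, not read off directly. The following sketch shows one way to do that with linear interpolation inside the matching bucket, in the spirit of PromQL's `histogram_quantile`. It reuses the bucket counts from the sample above; the function name `estimate_quantile` is illustrative.

```python
import math

# Cumulative bucket counts from the histogram sample above, plus the +Inf
# bucket, which always equals the total observation count (_count).
BUCKETS = [(0.1, 240), (0.5, 489), (1.0, 567), (math.inf, 600)]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0..1) by interpolating inside one bucket."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # Nothing is known beyond the last finite bound, so report it.
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


print(round(estimate_quantile(0.50, BUCKETS), 3))  # 0.196
print(estimate_quantile(0.95, BUCKETS))            # 1.0
print(round(186.4 / 600, 3))                       # mean latency from _sum / _count
```

The p95 result is capped at 1.0 because 33 observations fall above the highest finite bucket. This is a reminder to choose bucket boundaries that bracket the latencies you care about.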

**Cardinality—The Hidden Killer**:
Cardinality = number of unique time series. Each unique combination of labels creates a new series.

```
Good: http_requests_total{status="200",method="GET"}  // 10s of series
Bad:  http_requests_total{user_id="12345",session_id="abc"} // Millions of series
```

High cardinality (millions of unique label combinations) explodes storage costs and query latency. Never put unbounded dimensions (user IDs, order IDs) in metric labels.
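
A quick way to reason about cardinality is to multiply the number of distinct values each label can take. The sketch below does exactly that; the label counts are made-up numbers, and the result is a worst case because not every combination necessarily occurs.

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case number of time series: product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = {"method": 4, "status": 8, "endpoint": 25}
print(series_count(bounded))    # 800

# Adding one unbounded label multiplies everything that came before it.
unbounded = {**bounded, "user_id": 250_000}
print(series_count(unbounded))  # 200000000
```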

---

### **Pillar 2: Logs—The Event Narrative**

**What they are**: Immutable, timestamped records of discrete events. Logs tell the story of what happened.

**Structured vs. Unstructured**:
```
Unstructured (Hard to query):
127.0.0.1 - - [15/Jan/2024:10:30:00 +0000] "GET /api/users HTTP/1.1" 200 42

Structured (JSON—queryable):
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "service": "user-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "http_method": "GET",
  "path": "/api/users",
  "status_code": 200,
  "duration_ms": 42,
  "user_agent": "Mozilla/5.0...",
  "message": "Request processed successfully"
}
```
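
One possible way to emit structured logs like the JSON example above is a custom formatter for Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class and the list of forwarded fields are assumptions made for illustration, and a dedicated JSON logging library may be preferable in practice.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Optional context fields copied from the record when present.
    CONTEXT_FIELDS = ("trace_id", "span_id", "http_method", "path",
                      "status_code", "duration_ms")

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                event[key] = getattr(record, key)
        return json.dumps(event)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="user-service"))
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Request processed successfully",
    extra={"trace_id": "abc123", "http_method": "GET", "path": "/api/users",
           "status_code": 200, "duration_ms": 42},
)
```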

**Log Levels** (Severity hierarchy):
- **DEBUG**: Detailed diagnostic information (development only)
- **INFO**: Normal operational messages (requests handled, state changes)
- **WARN**: Unexpected but handled conditions (retries, degraded performance)
- **ERROR**: Failed operations that didn't crash the service (DB connection lost, handled exception)
- **FATAL**: Service cannot continue (out of memory, corrupted state)

**Correlation IDs** (Distributed Tracing Precursor):
Every request gets a unique ID propagated through all services:
```
Client Request → API Gateway → User Service → Database
   [req_abc]      [req_abc]      [req_abc]    [req_abc]
   
All logs contain: "trace_id": "req_abc"
Query: `trace_id="req_abc"` returns complete request flow across services
```
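
The propagation rule is simple: reuse the caller's ID if one arrived, otherwise mint one, and attach it to every log line and downstream call. A minimal sketch using `contextvars` follows. The header name `X-Request-ID` and the helper names are assumptions, not a standard mandated by any particular framework.

```python
import contextvars
import uuid

# Holds the correlation ID for whichever request is being handled right now.
_trace_id = contextvars.ContextVar("trace_id", default=None)

HEADER = "X-Request-ID"  # assumed header name; teams pick their own convention


def begin_request(incoming_headers):
    """Reuse the caller's ID if present, otherwise mint a new one."""
    trace_id = incoming_headers.get(HEADER) or f"req_{uuid.uuid4().hex[:12]}"
    _trace_id.set(trace_id)
    return trace_id


def outgoing_headers():
    """Headers to attach to every downstream call made during this request."""
    return {HEADER: _trace_id.get()}


def log(message):
    """Emit a log line that always carries the current correlation ID."""
    line = f'trace_id="{_trace_id.get()}" {message}'
    print(line)
    return line


# The gateway mints an ID; the user service receives and reuses it.
gateway_id = begin_request({})
downstream = outgoing_headers()
service_id = begin_request(downstream)
log("user lookup complete")
```

A context variable is used instead of a global so that concurrent requests handled by threads or `asyncio` tasks each see their own ID.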

---

### **Pillar 3: Traces—The Request Journey**

**What they are**: End-to-end latency breakdowns showing exactly where time is spent in distributed systems.

**Anatomy of a Trace**:
```
Trace (User Checkout Request)
├── Span 1: API Gateway (Total: 250ms)
│   ├── Span 2: Auth Service (15ms)
│   ├── Span 3: Cart Service (80ms)
│   │   ├── Span 4: Redis (5ms)
│   │   └── Span 5: Database (70ms)
│   ├── Span 6: Payment Service (120ms)
│   │   ├── Span 7: Stripe API (100ms)
│   │   └── Span 8: Database Update (15ms)
│   └── Span 9: Email Service (30ms)
```

**Key Concepts**:
- **Trace ID**: Unique identifier for the entire request tree
- **Span**: A single operation within the trace (has start time, duration, parent reference)
- **Baggage**: Context propagated with the trace (user tier, A/B test group)
- **Sampling**: Only recording 1% or 0.1% of traces to manage volume

**Critical Path Analysis**:
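In the trace above, the critical path runs through the Gateway, the Cart Service (dominated by its 70ms database call), and the Payment Service (dominated by the 100ms Stripe call). If the Email Service (30ms) runs asynchronously or in parallel with other spans, optimizing it won't reduce total latency, whereas optimizing the Stripe call or the Cart database query will. Durations alone cannot confirm this: compare span start and end times to see which calls overlap.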
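
Critical path analysis can be sketched as a backwards walk over span timestamps: starting from the parent's end time, the child that finished last is what the parent was waiting on, and any child overlapping it ran in parallel. The example below applies that idea to the checkout trace. The start and end offsets are hypothetical (the trace above lists only durations), and the Email Service is assumed to run in parallel with Payment.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start: float  # ms since the trace began
    end: float
    children: list = field(default_factory=list)


def critical_path(span):
    """Names of spans the parent actually waited on, in chronological order."""
    waited_on = []
    cursor = span.end
    # Walk backwards in time: the child that finished last before the cursor
    # blocked the parent; children that overlap it ran in parallel.
    for child in sorted(span.children, key=lambda c: c.end, reverse=True):
        if child.end <= cursor:
            waited_on.append(child)
            cursor = child.start
    path = [span.name]
    for child in reversed(waited_on):
        path.extend(critical_path(child))
    return path


trace = Span("API Gateway", 0, 250, [
    Span("Auth Service", 0, 15),
    Span("Cart Service", 15, 95, [Span("Redis", 15, 20), Span("Database", 20, 90)]),
    Span("Payment Service", 100, 220, [
        Span("Stripe API", 100, 200), Span("Database Update", 200, 215)]),
    Span("Email Service", 100, 130),  # assumed to overlap with payment
])
print(" -> ".join(critical_path(trace)))
```

With these timings the Email Service never appears on the path, so shaving time off it would not change the 250ms total.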

---

## **19.2 Monitoring Infrastructure**

### **Prometheus—The Metrics Standard**

Prometheus is the de facto standard for cloud-native monitoring, using a pull-based model and powerful query language (PromQL).

**Architecture**:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Targets    │     │  Prometheus  │     │   Grafana    │
│ (Your Apps)  │◄────│    Server    │────►│ (Dashboards) │
└──────────────┘     └──────────────┘     └──────────────┘
                            │
                            ▼
                     ┌──────────────┐
                     │ Alertmanager │────► Notifications
                     │              │      (email, chat, pager)
                     └──────────────┘
```

**Instrumentation** (Code example):
```python
import time

from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)

# Metrics definitions
REQUEST_COUNT = Counter('http_requests_total', 'Total requests', ['method', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'Request latency', ['endpoint'])

# Application code (`db` stands in for your own data-access layer)
@app.route('/api/users')
def get_users():
    start = time.perf_counter()
    try:
        users = db.query_users()
        REQUEST_COUNT.labels(method='GET', status='200').inc()
        return jsonify(users)
    except Exception:
        REQUEST_COUNT.labels(method='GET', status='500').inc()
        raise
    finally:
        # Record latency for successes and failures alike
        duration = time.perf_counter() - start
        REQUEST_DURATION.labels(endpoint='/api/users').observe(duration)

if __name__ == '__main__':
    # Expose metrics on port 8000 (scraped at /metrics), then serve the app
    start_http_server(8000)
    app.run(port=5000)
```

**Service Discovery**:
Prometheus supports static target lists (`static_configs`), but in dynamic environments it typically discovers targets via:
- **Kubernetes**: Auto-discover pods with specific annotations
- **Consul**: Service registry integration
- **AWS/GCP/Azure**: Cloud API discovery of instances

**PromQL Examples**:
```promql
# Request rate per second over last 5 minutes
rate(http_requests_total[5m])

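# 95th percentile latency (aggregate buckets across instances, keeping `le`)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Error rate percentage (aggregate first; otherwise each 5xx series is divided by itself)
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))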

# Memory usage percentage
100 * (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
```
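
To build intuition for what `rate()` computes, here is a simplified sketch of the idea: sum the increases between consecutive counter samples, treat any drop as a counter reset, and divide by elapsed time. It is an approximation for teaching purposes; the real PromQL function also extrapolates to the edges of the range window, which this version omits.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_s, value) samples.

    A drop in value is treated as a counter reset (for example a process
    restart), so the post-reset value counts as new increase.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        increase += (current - previous) if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes every 15 s; the counter resets between t=30 and t=45.
scrapes = [(0, 1000), (15, 1030), (30, 1060), (45, 20), (60, 50)]
print(simple_rate(scrapes))  # 110 requests over 60 s
```

Reset handling is the reason counters should always be queried through `rate()` or `increase()` rather than graphed raw.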

---

### **Grafana—Visualization & Alerting**

Grafana transforms Prometheus metrics into actionable dashboards.

**Dashboard Design Principles**:
1. **The RED Dashboard**: Rate, Errors, Duration for every service
2. **The USE Dashboard**: Utilization, Saturation, Errors for infrastructure
3. **The Four Golden Signals** (Google SRE):
   - Latency (time to serve requests)
   - Traffic (demand on system)
   - Errors (rate of failed requests)
   - Saturation (how "full" the service is)

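**Alerting** (shown in Prometheus rule-file format; Grafana-managed alerts can evaluate the same PromQL expression):
```yaml
# Example alert rule
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, per job
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }}"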
```
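
The `for: 5m` clause is what keeps a brief spike from paging anyone: the condition must hold continuously for the whole period before the alert fires. The sketch below mimics that behavior with a small state machine. The function name and the evaluation data are invented for illustration.

```python
def alert_states(samples, threshold, for_s):
    """Return (timestamp, state) per evaluation, mimicking a `for:` hold period."""
    breach_started = None
    states = []
    for ts, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = ts
            state = "firing" if ts - breach_started >= for_s else "pending"
        else:
            # Any evaluation below the threshold resets the hold timer.
            breach_started = None
            state = "inactive"
        states.append((ts, state))
    return states


# Hypothetical error ratios, evaluated once per minute.
ratios = [0.01, 0.08, 0.09, 0.02] + [0.07] * 7
timeline = alert_states([(i * 60, r) for i, r in enumerate(ratios)],
                        threshold=0.05, for_s=300)
for ts, state in timeline:
    print(f"t={ts:>3}s {state}")
```

The two-minute spike at the start only reaches `pending`; the sustained breach later becomes `firing` once it has lasted five minutes.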

---

### **Jaeger/Zipkin—Distributed Tracing**

**Jaeger Architecture**:
```
┌─────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│   App   │───►│  Agent   │───►│ Collector│───►│ Storage  │
│(Library)│    │  (Daemon)│    │          │    │(Cassandra│
└─────────┘    └──────────┘    └──────────┘    │ or ES)   │
                                               └──────────┘
                                                    │
                                               ┌──────────┐
                                               │  Jaeger  │
                                               │   UI     │
                                               └──────────┘
```

**OpenTelemetry—The Standard**:
Modern applications use OpenTelemetry (OTel) as the vendor-neutral instrumentation standard, exporting to Jaeger, Zipkin, or cloud vendors (AWS X-Ray, Google Cloud Trace).

**Instrumentation**:
```python
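from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize the tracer and attach an exporter; without a span processor,
# spans are created but never leave the process.
# OTLP is used here because recent Jaeger versions ingest it directly and the
# Jaeger Thrift exporter is deprecated. Endpoint and service name are placeholders.
provider = TracerProvider(resource=Resource.create({"service.name": "payment-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans
with tracer.start_as_current_span("process_payment") as span:
    span.set_attribute("payment.id", "pay_123")
    span.set_attribute("user.id", "user_456")

    with tracer.start_as_current_span("validate_card"):
        # Validation logic
        pass

    with tracer.start_as_current_span("charge_stripe"):
        # Stripe API call
        pass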
```

**Sampling Strategies**:
- **Head-based**: Decide at request start (simple, but might sample uninteresting requests)
- **Tail-based**: Collect all spans, decide after completion (catches rare errors, expensive)
- **Probabilistic**: Keep a fixed percentage (e.g., 1% of all requests); the most common head-based policy
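
The difference between the strategies can be sketched in a few lines. The head-based function below derives its decision from the trace ID, so every service reaches the same verdict without coordination; the tail-based function inspects a finished trace. Both are simplified illustrations with assumed names and span fields, not the API of any tracing library.

```python
import random


def head_sample(trace_id_hex, ratio):
    """Deterministic probabilistic decision made at request start.

    Every service derives the same answer from the trace ID, so a trace is
    either recorded everywhere or nowhere.
    """
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < ratio * 2**64


def tail_sample(spans, slow_ms=1000):
    """Decision made after the trace completes: keep errors and slow requests."""
    return any(s["error"] or s["duration_ms"] > slow_ms for s in spans)


rng = random.Random(42)  # fixed seed so the run is repeatable
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(100_000)]
kept = sum(head_sample(t, 0.01) for t in trace_ids)
print(f"head-sampled {kept} of {len(trace_ids)} traces")

print(tail_sample([{"error": False, "duration_ms": 20}]))   # False
print(tail_sample([{"error": True, "duration_ms": 20}]))    # True
```

Tail-based sampling keeps the rare failing trace that a 1% head sampler would usually drop, at the cost of buffering every span until the trace finishes.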

---

## **19.3 Alerting Strategies**

### **SLO-Based Alerting (The Google Way)**

**Definitions**:
- **SLI** (Service Level Indicator): What you measure (e.g., "latency of homepage requests")
- **SLO** (Service Level Objective): The target (e.g., "99.9% of requests < 200ms")
- **SLA** (Service Level Agreement): The contract with consequences (e.g., "99.9% uptime or refund")

**Error Budgets**:
If your SLO is 99.9% availability, your **error budget** is the remaining 0.1%: about 43 minutes of downtime in a 30-day month.
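
The arithmetic behind that figure is worth having at hand. A minimal sketch, assuming a 30-day window and treating the budget as minutes of complete downtime:

```python
def error_budget_minutes(slo_pct, days=30):
    """Minutes of full downtime allowed by an availability SLO over a window."""
    return (1 - slo_pct / 100) * days * 24 * 60


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo):.1f} min per 30 days")
```

Each additional nine cuts the budget by a factor of ten, which is why SLO targets should be chosen deliberately rather than rounded up.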

**Alerting Rules**:
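1. **Fast Burn**: Budget is being spent far faster than sustainable; at this rate it is gone in about two days (page immediately)
   ```promql
   # Burn rate > 14.4x (consumes 2% of a 30-day budget in 1 hour)
   # Assumes a recording rule that tracks the error ratio over the last hour
   job:slo_errors_per_request:ratio_rate1h{job="api"} > 14.4 * 0.001
   ```

2. **Slow Burn**: Budget is draining steadily and will run out before the period ends (ticket, not page)
   ```promql
   # Burn rate > 1x (consumes 10% of a 30-day budget in 3 days)
   job:slo_errors_per_request:ratio_rate3d{job="api"} > 1 * 0.001
   ```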
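
Burn-rate thresholds follow from simple proportions: a burn rate of 1x spends exactly one period's budget in one period. The helper functions below are an illustrative sketch (their names are made up) for checking how much budget a given rate consumes over an alert window, assuming a 30-day SLO period.

```python
def burn_rate(observed_error_ratio, slo_pct):
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_ratio / (1 - slo_pct / 100)


def budget_consumed_pct(rate, window_hours, period_days=30):
    """Share of the period's error budget spent if `rate` holds for the window."""
    return 100 * rate * window_hours / (period_days * 24)


def hours_to_exhaustion(rate, period_days=30):
    """Hours until the whole budget is gone at a constant burn rate."""
    return period_days * 24 / rate


# 1.44% of requests failing against a 99.9% SLO is a 14.4x burn.
rate = burn_rate(0.0144, 99.9)
print(round(rate, 1))                               # 14.4
print(round(budget_consumed_pct(14.4, 1), 2))       # 2.0 (% of budget in 1 hour)
print(round(hours_to_exhaustion(14.4), 1))          # 50.0
print(round(budget_consumed_pct(1, 72), 2))         # 10.0 (% of budget in 3 days)
```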

**Alert Fatigue Prevention**:
- **Actionable alerts**: If it pages at 3 AM, you must be able to do something
- **Symptom-based**: Alert on user pain (latency/errors), not causes (disk usage)
- **Severity levels**:
  - **P1 (Page)**: Revenue-impacting outage
  - **P2 (Ticket)**: Degradation, workaround exists
  - **P3 (Log/Monitor)**: Anomaly, investigate during business hours

**The "Wakes You Up" Test**:
Before adding an alert, ask: "If this wakes me up at 3 AM, will I fix it or just mute it?" If the latter, don't add it.

---

## **19.4 Log Aggregation**

### **The ELK Stack**

**Elasticsearch + Logstash + Kibana** (or **EFK**: Elasticsearch + Fluentd + Kibana)

**Architecture**:
```
┌─────────┐    ┌──────────┐    ┌───────────────┐    ┌──────────┐
│   App   │───►│ Logstash │───►│ Elasticsearch │───►│  Kibana  │
│ (Logs)  │    │ (Parse)  │    │    (Index)    │    │ (Search) │
└─────────┘    └──────────┘    └───────────────┘    └──────────┘
                                       │
                                 ┌──────────┐
                                 │ Archive  │
                                 │   (S3)   │
                                 └──────────┘
```

**Logstash Parsing** (Grok patterns):
```ruby
# Parse Apache logs
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
```
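
For readers who want to see what the grok and date filters are doing, here is a rough Python equivalent that parses the access-log line shown earlier in the chapter. The regular expression covers only the common-log prefix (client, timestamp, request, status, bytes), so it is a simplified stand-in for `COMBINEDAPACHELOG`, and the field names are chosen for illustration.

```python
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_log(line):
    """Turn one access-log line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    event["bytes"] = 0 if event["bytes"] == "-" else int(event["bytes"])
    # Same layout as the Logstash date pattern dd/MMM/yyyy:HH:mm:ss Z
    event["timestamp"] = datetime.strptime(
        event["timestamp"], "%d/%b/%Y:%H:%M:%S %z").isoformat()
    return event


line = '127.0.0.1 - - [15/Jan/2024:10:30:00 +0000] "GET /api/users HTTP/1.1" 200 42'
print(parse_access_log(line))
```

Returning `None` for unparseable lines mirrors the way a failed grok match is flagged rather than silently dropped, so malformed input stays visible.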

**Elasticsearch Index Strategy**:
- **Time-based indices**: `logs-2024.01.15`, `logs-2024.01.16`
- **Retention**: Hot (SSD, 7 days), Warm (HDD, 30 days), Cold (Archive, 7 years)
- **Sharding**: One shard per 20-50 GB of data

**Structured Logging Best Practices**:
1. **Use JSON**: Machine-readable, queryable
2. **Consistent Schema**: Same field names across services (`user_id`, not `userId` or `user-id`)
3. **Contextual Fields**: Every log includes service name, trace ID, version, host
4. **Sensitive Data**: Never log passwords, tokens, or PII (use masking)
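
Masking is easiest to enforce at a single choke point, just before an event is serialized. The sketch below shows one simple approach based on a key deny-list. The key names are assumptions, and a deny-list will not catch secrets embedded in free-text messages.

```python
# Assumed deny-list; extend it to match the fields your services actually use.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn", "credit_card"}


def redact(event):
    """Return a copy of a log event with sensitive values masked, recursively."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "***REDACTED***"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


event = {
    "user_id": "user_456",
    "password": "hunter2",
    "headers": {"Authorization": "Bearer abc", "Accept": "application/json"},
}
print(redact(event))
```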

---

### **Splunk—Enterprise Alternative**

Splunk dominates enterprise due to its powerful search language and indexing, but at significant cost.

**Splunk Search Processing Language (SPL)**:
```
# Find URIs that returned more than 100 server errors
index=web status=500 
| stats count by uri 
| where count > 100 
| sort -count

# Transaction tracing
index=app trace_id="abc123" 
| transaction trace_id 
| table _time, service, duration, message
```

---

## **19.5 Health Checks & Readiness Probes**

**Kubernetes Health Probes**:

1. **Liveness Probe**: "Is the application running?"
   - If fails: Kubernetes restarts the container
   - Simple: HTTP GET `/health` returns 200
   
2. **Readiness Probe**: "Is the application ready to receive traffic?"
   - If fails: Removed from service endpoints (no traffic)
   - Checks: DB connections, cache warmth, feature flags loaded
   
3. **Startup Probe**: "Has the application finished starting?"
   - Disables liveness/readiness until successful
   - For slow-starting Java apps

**Implementation**:
```python
from flask import Flask

app = Flask(__name__)

# `db` and `cache` stand in for your own dependency clients.

@app.route('/health')
def health():
    # Liveness: just return 200 if the process is up
    return {'status': 'alive'}, 200

@app.route('/ready')
def ready():
    # Readiness: check the dependencies needed to serve traffic
    if not db.is_connected():
        return {'status': 'not ready', 'reason': 'db_connection'}, 503
    if not cache.is_warm():
        return {'status': 'not ready', 'reason': 'cache_cold'}, 503
    return {'status': 'ready'}, 200
```

**Deep Health Checks**:
Don't check external dependencies in liveness probes (you'll restart the whole service if the DB is briefly unavailable). Use readiness probes for dependency checks.

---

## **19.6 Observability in Practice**

### **The Observability Pipeline**

```
┌──────────────┐    ┌──────────────┐    ┌────────────────┐
│    Agents    │───►│   Pipeline   │───►│    Storage     │
│ (OTel,       │    │ (Kafka,      │    │ (Prometheus,   │
│  Fluentd)    │    │  Vector)     │    │ Elasticsearch) │
└──────────────┘    └──────────────┘    └────────────────┘
                                                │
                                       ┌────────┴────────┐
                                       ▼                 ▼
                                ┌─────────────┐   ┌────────────┐
                                │  Alerting   │   │ Dashboards │
                                │ (PagerDuty) │   │ (Grafana)  │
                                └─────────────┘   └────────────┘
```

### **Correlation—The Holy Grail**

Link metrics, logs, and traces:
```
Dashboard: High error rate spike at 10:30 AM
    ↓ Click
Trace: Representative failed trace showing timeout at Payment Service
    ↓ Click
Logs: Exact error message "Connection timeout to payment-db:5432"
    ↓ Click
Metric: Database connection pool exhausted at 10:29 AM
```

**Implementation**:
- Trace ID in all logs
- Metric timestamps align with log timestamps
- Exemplars: Specific trace IDs attached to metric data points

---

## **19.7 Chapter Summary**

Observability transforms debugging from "print statements and prayer" into data-driven engineering:

1. **Metrics** (RED/USE) tell you *that* something is wrong
2. **Logs** tell you *what* happened in detail
3. **Traces** tell you *where* the problem originated

**Key Takeaways**:
- Monitor symptoms (user pain), not causes (disk usage)
- Use SLOs and error budgets to balance reliability with feature velocity
- High cardinality kills metrics; use logs/traces for high-cardinality data
- Alert sparingly—alert fatigue kills teams and systems
- Instrument everything with OpenTelemetry for vendor portability

**The Observability Maturity Model**:
- **Level 1**: Reactive (check logs when users complain)
- **Level 2**: Proactive (dashboards, basic alerts)
- **Level 3**: Instrumented (distributed tracing, SLOs)
- **Level 4**: Intelligent (anomaly detection, automated remediation)

---

**Exercises**:

1. **Cardinality Calculation**: If you have 10 endpoints, 5 HTTP methods, and 10 status codes, how many time series does `http_requests_total` create? What if you add `user_id` (1 million users) as a label?

2. **SLO Math**: If your SLO is 99.95% availability, how much downtime is your error budget for a 30-day month? If you have a 2-hour outage on day 1, what percentage of your budget is consumed?

3. **Trace Analysis**: A trace shows API Gateway (10ms) → Auth Service (200ms) → Database (5ms). Where is the latency problem? How would you verify?

4. **Alert Design**: Design an alert rule for "Database connection pool exhaustion" that pages during business hours but creates a ticket at night (assuming manual intervention isn't possible at 3 AM).

5. **Log Cost**: Calculate monthly storage costs for 1 TB/day of logs with 90-day retention, comparing all-hot storage ($0.023 per GB-month) against moving 80% of the data to archive ($0.004 per GB-month).

---

The next chapter covers **Performance Optimization**—profiling techniques, database tuning, caching strategies, and the systematic approach to making fast systems faster.