# Chapter 44: Metrics and Monitoring

Metrics provide quantitative insight into system behavior, transforming qualitative observations ("the system seems slow") into measurable data points that drive automated alerting and capacity planning. Unlike logs that describe discrete events, metrics represent time-series data—numerical values collected at regular intervals that reveal trends, patterns, and anomalies across distributed systems. Effective monitoring strategies distinguish between infrastructure metrics (CPU, memory, disk), application metrics (request latency, error rates, business KPIs), and client-side metrics (real user monitoring), establishing service level objectives (SLOs) that define acceptable reliability thresholds.

This chapter establishes quantitative observability practices using Prometheus as the foundational metrics system, covering metric instrumentation, service discovery for dynamic environments, query language (PromQL) for operational analysis, and alerting strategies that balance sensitivity with noise reduction.

## 44.1 Prometheus Fundamentals

Prometheus is an open-source systems monitoring and alerting toolkit built around a dimensional data model, flexible query language, and pull-based architecture. Originally developed at SoundCloud, it became a Cloud Native Computing Foundation (CNCF) graduated project and serves as the de facto standard for Kubernetes monitoring.

### Architecture Components

**Prometheus Server**
The core component that scrapes metrics from targets, stores them in a time-series database, and evaluates alerting rules. It operates on a pull model, periodically fetching metrics via HTTP endpoints.

**Client Libraries**
Instrumentation libraries for application metrics in various languages (Go, Java, Python, Ruby, etc.).

**Exporters**
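Helper processes that translate metrics from third-party systems into the Prometheus format (Node Exporter for hardware/OS metrics, MySQL Exporter, Blackbox Exporter for probing). They run alongside the system they monitor, for example as a sidecar container.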

**Alertmanager**
Handles alerts sent by Prometheus, deduplicating, grouping, and routing them to notification channels (PagerDuty, Slack, email).

**Pushgateway**
Accepts push-based metrics for short-lived jobs that cannot be scraped (batch jobs, CI/CD pipelines).

### Deployment in Kubernetes

```yaml
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v2.48.0
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus/'
            - '--storage.tsdb.retention.time=15d'
            - '--storage.tsdb.retention.size=50GB'
            - '--web.console.libraries=/usr/share/prometheus/console_libraries'
            - '--web.console.templates=/usr/share/prometheus/consoles'
            - '--web.enable-lifecycle'
            - '--web.enable-admin-api'
          ports:
            - containerPort: 9090
              name: web
          resources:
            requests:
              memory: "4Gi"
              cpu: "1000m"
            limits:
              memory: "8Gi"
              cpu: "2000m"
          volumeMounts:
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-storage
              mountPath: /prometheus
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
            initialDelaySeconds: 5
            periodSeconds: 5
      volumes:
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: prometheus-storage
          persistentVolumeClaim:
            claimName: prometheus-storage
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: production
        replica: prometheus-0  # static value; external_labels are not Go-templated
    
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093
    
    rule_files:
      - /etc/prometheus/rules/*.yml
    
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      
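      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        # The kubelet serves metrics over HTTPS and requires authentication
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)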
      
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
```

**Explanation:**
The Prometheus configuration defines scrape targets:
- **kubernetes-pods**: Discovers Pods with `prometheus.io/scrape: "true"` annotation, extracting port and path from annotations. The `relabel_configs` transform Kubernetes metadata into Prometheus labels.
- **kubernetes-nodes**: Scrapes node-level metrics (kubelet).
- **kubernetes-apiservers**: Scrapes Kubernetes API server metrics.

## 44.2 Metric Types Deep Dive

Prometheus defines four core metric types, each suited to different measurement scenarios.

### Counter

Counters represent cumulative values that only increase (or reset to zero on restart). Use for: request counts, error counts, tasks completed.

```java
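import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class PaymentMetrics {
    private final MeterRegistry registry;
    private final Counter paymentCounter;
    private final Counter errorCounter;
    
    public PaymentMetrics(MeterRegistry registry) {
        // Keep a reference so tagged counters can be looked up at record time
        this.registry = registry;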
        this.paymentCounter = Counter.builder("payments_processed_total")
            .description("Total payments processed")
            .tag("service", "payment-service")
            .register(registry);
            
        this.errorCounter = Counter.builder("payments_failed_total")
            .description("Total payment failures")
            .tag("service", "payment-service")
            .register(registry);
    }
    
    public void recordPayment(PaymentResult result) {
        paymentCounter.increment();
        
        // Dimensional labels for analysis
        Counter.builder("payments_by_method_total")
            .tag("method", result.getPaymentMethod())
            .tag("currency", result.getCurrency())
            .tag("status", result.getStatus())
            .register(registry)
            .increment();
    }
    
    public void recordError(String errorType, boolean retryable) {
        errorCounter.increment();
        
        Counter.builder("payment_errors_total")
            .tag("error_type", errorType)
            .tag("retryable", String.valueOf(retryable))
            .register(registry)
            .increment();
    }
}
```

**PromQL Queries:**
```promql
# Per-second payment rate, averaged over the last 5 minutes
# (use increase(payments_processed_total[5m]) for the total count)
rate(payments_processed_total[5m])

# Error rate by type
sum(rate(payment_errors_total[5m])) by (error_type)

# Success ratio (assumes payments_processed_total counts only successful payments)
rate(payments_processed_total[5m]) 
/ 
(rate(payments_processed_total[5m]) + rate(payments_failed_total[5m]))
```

**Explanation:**
Counters are monotonic (only increase). The `rate()` function calculates per-second increase over the time range, handling counter resets automatically. Labels (`method`, `currency`) enable dimensional analysis—aggregating by different dimensions without separate metric names.
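
The reset handling described above can be illustrated with a small standalone sketch. It is not Prometheus's implementation (the real `rate()` also extrapolates to the edges of the range), and the class and method names are hypothetical; it only shows the idea that a drop in a counter's value is treated as a restart from zero.

```java
/** Illustrative only: approximates counter increase and rate across resets. */
final class CounterRate {

    /** Total increase across consecutive samples, treating any drop as a restart from zero. */
    static double increase(double[] values) {
        double total = 0.0;
        for (int i = 1; i < values.length; i++) {
            double prev = values[i - 1];
            double curr = values[i];
            // A decrease means the process restarted and the counter began again at 0,
            // so the whole current value counts as new increase.
            total += (curr >= prev) ? (curr - prev) : curr;
        }
        return total;
    }

    /** Per-second rate between the first and last sample (no extrapolation). */
    static double ratePerSecond(double[] timestampsSeconds, double[] values) {
        if (values.length < 2) {
            return Double.NaN; // at least two samples are needed
        }
        double elapsed = timestampsSeconds[values.length - 1] - timestampsSeconds[0];
        return increase(values) / elapsed;
    }

    public static void main(String[] args) {
        double[] timestamps = {0, 15, 30, 45};
        double[] values = {100, 130, 10, 40}; // the counter reset before the third scrape
        System.out.println(increase(values));                  // 30 + 10 + 30
        System.out.println(ratePerSecond(timestamps, values)); // increase divided by 45 seconds
    }
}
```

Without the reset check, the drop from 130 to 10 would register as a negative increase and distort the rate.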

### Gauge

Gauges represent values that can arbitrarily go up and down. Use for: temperatures, current memory usage, queue depths, in-progress requests.

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.stereotype.Component;

@Component
public class SystemMetrics {
    // Micrometer gauges have no set() method: they sample a state object
    // through a function each time the registry is scraped.
    private final AtomicInteger activeConnections = new AtomicInteger(0);
    private final AtomicInteger queueDepth = new AtomicInteger(0);
    
    public SystemMetrics(MeterRegistry registry) {
        // Connection gauge
        Gauge.builder("db_active_connections", activeConnections, AtomicInteger::get)
            .description("Current active database connections")
            .register(registry);
        
        // Memory gauge, computed on each scrape (no background updater needed).
        // Micrometer's built-in JvmMemoryMetrics binder publishes the same
        // metric name, so register only one of the two.
        Gauge.builder("jvm_memory_used_bytes", Runtime.getRuntime(),
                runtime -> runtime.totalMemory() - runtime.freeMemory())
            .description("JVM memory used")
            .tag("area", "heap")
            .register(registry);
        
        // Queue depth
        Gauge.builder("payment_queue_depth", queueDepth, AtomicInteger::get)
            .description("Pending payments in queue")
            .register(registry);
    }
    
    public void connectionOpened() {
        activeConnections.incrementAndGet();
    }
    
    public void connectionClosed() {
        activeConnections.decrementAndGet();
    }
    
    public void updateQueueDepth(int depth) {
        queueDepth.set(depth);
    }
}
```

**PromQL Queries:**
```promql
# Current connections
db_active_connections

# Memory usage trend: average per-second change over the last hour
deriv(jvm_memory_used_bytes[1h])

# Queue depth alerts
payment_queue_depth > 1000
```

**Explanation:**
Gauges report point-in-time values. Because gauges can decrease, `rate()` and `increase()` do not apply; use `delta()`, `deriv()`, or the `*_over_time` functions instead. Alert on thresholds (queue depth > 1000) or visualize trends over time.

### Histogram

Histograms sample observations (request durations, response sizes) into configurable buckets, providing distribution data.

```java
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.math.BigDecimal;
import java.time.Duration;

@Component
public class RequestMetrics {
    // Micrometer has no Histogram class: Timer (durations) and
    // DistributionSummary (other values) publish Prometheus histograms
    // when bucket boundaries are configured.
    private final MeterRegistry registry;
    private final DistributionSummary paymentAmount;
    
    public RequestMetrics(MeterRegistry registry) {
        this.registry = registry;
        
        // Payment amount distribution. No baseUnit is set because the
        // Prometheus naming convention would append it to the metric name.
        this.paymentAmount = DistributionSummary.builder("payment_amount_usd")
            .description("Payment amount distribution")
            .tag("currency", "USD")
            .serviceLevelObjectives(10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000)
            .register(registry);
    }
    
    public void recordRequest(HttpServletRequest request, 
                              HttpServletResponse response,
                              long durationMs) {
        // Tags are fixed at registration; register() returns the existing
        // timer when the same name and tags were seen before. Prefer a route
        // template over the raw URI to keep label cardinality bounded.
        Timer.builder("http_request_duration")  // exported as http_request_duration_seconds
            .description("HTTP request duration in seconds")
            .tags("service", "payment-service",
                  "method", request.getMethod(),
                  "path", request.getRequestURI(),
                  "status", String.valueOf(response.getStatus()))
            // Explicit buckets for SLA boundaries
            .serviceLevelObjectives(
                Duration.ofMillis(100),  // 100ms
                Duration.ofMillis(250),  // 250ms
                Duration.ofMillis(500),  // 500ms
                Duration.ofSeconds(1),   // 1s
                Duration.ofSeconds(2),   // 2s
                Duration.ofSeconds(5)    // 5s
            )
            .register(registry)
            .record(Duration.ofMillis(durationMs));
    }
    
    public void recordPayment(BigDecimal amount) {
        paymentAmount.record(amount.doubleValue());
    }
}
```

**PromQL Queries:**
```promql
# 95th percentile latency
histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# Average payment amount
sum(payment_amount_usd_sum) / sum(payment_amount_usd_count)

# Request rate by status
sum(rate(http_request_duration_seconds_count[5m])) by (status)
```

**Explanation:**
Histograms expose `_bucket` (cumulative counters per bucket), `_sum` (sum of all observations), and `_count` (total observations). `histogram_quantile` calculates percentiles from buckets. The `le` (less than or equal) label distinguishes buckets.
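
To make the bucket-based estimate concrete, the following sketch interpolates a quantile from cumulative bucket counts. It is a simplified illustration of the approach, not Prometheus's code: the class name, the sample counts, and the assumption that all observations are non-negative (so the first bucket starts at 0) are ours.

```java
/** Illustrative only: estimates a quantile from cumulative histogram buckets. */
final class HistogramQuantile {

    /**
     * @param q                quantile between 0 and 1 (exclusive of 0)
     * @param upperBounds      bucket upper bounds (the `le` values), ascending, last one +Infinity
     * @param cumulativeCounts number of observations less than or equal to each upper bound
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        if (upperBounds.length < 2) {
            return Double.NaN; // need at least one finite bucket plus +Inf
        }
        int last = upperBounds.length - 1;
        double total = cumulativeCounts[last];
        if (total == 0) {
            return Double.NaN; // no observations
        }
        double rank = q * total;

        // Find the first bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < last && cumulativeCounts[b] < rank) {
            b++;
        }
        if (b == last) {
            // The rank falls into the +Inf bucket: the best available answer
            // is the highest finite bound.
            return upperBounds[last - 1];
        }

        // Assume observations are spread evenly inside the bucket and interpolate.
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double countBelow = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - countBelow;
        return lower + (upperBounds[b] - lower) * (rank - countBelow) / inBucket;
    }

    public static void main(String[] args) {
        double[] le = {0.1, 0.25, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 80, 95, 99, 100};
        System.out.println(quantile(0.95, le, counts)); // lands at the top of the 0.25..0.5 bucket
        System.out.println(quantile(0.90, le, counts)); // interpolated inside the 0.25..0.5 bucket
    }
}
```

Because the result is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries bracket the values of interest, which is why boundaries are usually placed at SLO thresholds.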

### Summary

Summaries calculate configurable quantiles (percentiles) over a sliding time window, using less memory than histograms but without aggregation across instances.

```java
@Component
public class LatencySummary {
    private final DistributionSummary responseTime;
    
    public LatencySummary(MeterRegistry registry) {
        this.responseTime = DistributionSummary.builder("response_time_ms")
            .description("Response time in milliseconds")
            .baseUnit("milliseconds")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
    }
    
    public void record(long milliseconds) {
        responseTime.record(milliseconds);
    }
}
```

**Key Difference:**
- **Histogram**: Pre-defined buckets, aggregatable across instances, quantiles calculated server-side at query time
- **Summary**: Pre-computed quantiles (0.5, 0.95, 0.99), not aggregatable across instances, quantiles calculated client-side in the application

Use histograms when you need to aggregate across services (e.g., cluster-wide latency percentiles). Use summaries when you need precise percentiles for a single instance and memory efficiency matters.
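
A small sketch with made-up latencies shows why per-instance quantiles cannot simply be averaged. The class name, the two instances, and their request counts are hypothetical; the quantile uses the nearest-rank method on raw observations for simplicity.

```java
import java.util.Arrays;

/** Illustrative only: averaging per-instance quantiles vs. computing one over all data. */
final class QuantileAggregation {

    /** Nearest-rank quantile of raw observations. */
    static double quantile(double q, double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Concatenates two observation arrays, as if both instances reported to one place. */
    static double[] concat(double[] a, double[] b) {
        double[] all = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, all, a.length, b.length);
        return all;
    }

    /** Builds an array of {@code n} identical observations. */
    static double[] repeat(double value, int n) {
        double[] out = new double[n];
        Arrays.fill(out, value);
        return out;
    }

    public static void main(String[] args) {
        double[] instanceA = repeat(10.0, 990);   // busy instance, fast requests (ms)
        double[] instanceB = repeat(1000.0, 10);  // nearly idle instance, slow requests (ms)

        double averaged = (quantile(0.95, instanceA) + quantile(0.95, instanceB)) / 2;
        double combined = quantile(0.95, concat(instanceA, instanceB));

        System.out.println("average of per-instance p95: " + averaged);
        System.out.println("p95 over all requests:       " + combined);
    }
}
```

The averaged figure gives the idle instance the same weight as the busy one, so it overstates the latency most users saw. Summing histogram buckets by `le` before calling `histogram_quantile` avoids this because bucket counts carry the request volume with them.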

## 44.3 Service Discovery

Prometheus discovers targets dynamically in Kubernetes environments.

### Kubernetes SD Configuration

```yaml
# prometheus-kubernetes-sd.yml
scrape_configs:
  # API servers
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Nodes (kubelet)
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # Pods (application metrics)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods with annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      
      # Set metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      
      # Add labels
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: kubernetes_pod_name
```

**Explanation:**
- **role: pod**: Discovers all Pods in the cluster
- **relabel_configs**: Transform discovered metadata into scrape configuration
- **keep**: Only scrape Pods with `prometheus.io/scrape: "true"` annotation
- **replace**: Construct target address from Pod IP and annotation-specified port
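
The `replace` step for `__address__` can be mimicked outside Prometheus to see what the regex does. The sketch below is an approximation under stated assumptions: source label values are joined with `;`, the regex must match the whole joined string, and a missing annotation is treated as an empty string. The class name and addresses are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: mimics the __address__ rewrite performed by the relabel rule. */
final class AddressRelabel {
    // Same expression as in the scrape config: host, optional existing port, ';', annotated port.
    private static final Pattern REGEX = Pattern.compile("([^:]+)(?::\\d+)?;(\\d+)");

    /**
     * @param address        discovered target address, e.g. "10.1.2.3:9102" or "10.1.2.3"
     * @param portAnnotation value of the prometheus.io/port annotation, or "" if absent
     * @return the rewritten address, or the original one if the regex does not match
     */
    static String relabel(String address, String portAnnotation) {
        // Source label values are concatenated with ';' before matching.
        String joined = address + ";" + portAnnotation;
        Matcher m = REGEX.matcher(joined);
        if (!m.matches()) {
            return address; // no match: the target label is left unchanged
        }
        return m.group(1) + ":" + m.group(2); // replacement "$1:$2"
    }

    public static void main(String[] args) {
        System.out.println(relabel("10.1.2.3:9102", "8080")); // port replaced by the annotation
        System.out.println(relabel("10.1.2.3", "8080"));      // port appended
        System.out.println(relabel("10.1.2.3:9102", ""));     // no annotation: unchanged
    }
}
```

This also shows why both `__address__` and the port annotation must appear in `source_labels`: with only the annotation, the joined string contains no `;` and the expression never matches.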

### Pod Annotations

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/actuator/prometheus"
    spec:
      containers:
        - name: payment
          image: payment-service:v2.1.0
          ports:
            - containerPort: 8080
```

## 44.4 PromQL

Prometheus Query Language (PromQL) retrieves and manipulates time-series data.

### Basic Selectors

```promql
# Select all time series with metric name
http_requests_total

# Select with label matchers
http_requests_total{service="payment-service", status="200"}

# Regex matchers
http_requests_total{service=~"payment.*", status!~"4..|5.."}

# Range vectors (last 5 minutes)
http_requests_total[5m]

# Offset (1 hour ago)
http_requests_total offset 1h
```

### Aggregation Operators

```promql
# Sum across all instances
sum(http_requests_total)

# Sum by label
sum(http_requests_total) by (service, status)

# Average request duration
sum(rate(http_request_duration_seconds_sum[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)

# Percentiles (histograms)
histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# Top 3 services by request rate
topk(3, 
  sum(rate(http_requests_total[5m])) by (service)
)

# Rate of change
rate(http_requests_total[5m])

# Increase over time range
increase(http_requests_total[1h])
```

### Advanced Queries

```promql
# Error rate calculation
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ 
sum(rate(http_requests_total[5m]))

# Latency SLO: 99th percentile < 200ms
histogram_quantile(0.99, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) < 0.2

# Predict disk full in 4 hours
predict_linear(
  node_filesystem_avail_bytes{mountpoint="/"}[1h], 
  4 * 3600
) < 0

# Joining metrics
node_cpu_seconds_total * on(instance) group_left(nodename) 
  node_uname_info
```
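
For intuition about the `predict_linear` query above, here is a minimal sketch of the underlying idea: fit a straight line to the samples in the range with least squares, then extend it a given number of seconds past the evaluation time. The class name and the disk-usage numbers are hypothetical, and the sketch ignores details of Prometheus's own implementation.

```java
/** Illustrative only: least-squares line fit extended into the future. */
final class PredictLinear {

    /**
     * @param timestamps   sample times in seconds
     * @param values       sample values (for example, free bytes)
     * @param evalTime     time at which the query is evaluated, in seconds
     * @param secondsAhead how far beyond evalTime to predict
     */
    static double predict(double[] timestamps, double[] values,
                          double evalTime, double secondsAhead) {
        int n = timestamps.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            double x = timestamps[i] - evalTime; // measure time relative to "now"
            sumX += x;
            sumY += values[i];
            sumXY += x * values[i];
            sumXX += x * x;
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n; // fitted value at evalTime
        return intercept + slope * secondsAhead;
    }

    public static void main(String[] args) {
        // Free space shrinking by 1 MB per second over the last hour.
        double[] t = {0, 900, 1800, 2700, 3600};
        double[] free = {10.0e9, 9.1e9, 8.2e9, 7.3e9, 6.4e9};
        double inFourHours = predict(t, free, 3600, 4 * 3600);
        System.out.println(inFourHours < 0 ? "disk predicted full" : "disk OK");
    }
}
```

A negative prediction is what makes the `< 0` comparison in the query fire: the fitted line crosses zero free bytes before the four hours are up.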

## 44.5 Grafana Dashboards

### Dashboard as Code

File `dashboards/payment-service.json` (simplified):

```json
{
  "dashboard": {
    "title": "Payment Service",
    "tags": ["microservice", "payment", "production"],
    "timezone": "UTC",
    "schemaVersion": 36,
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"payment-service\"}[5m])) by (status)",
            "legendFormat": "{{status}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "min": 0,
            "color": {
              "mode": "palette-classic"
            }
          }
        }
      },
      {
        "id": 2,
        "title": "Latency Distribution",
        "type": "heatmap",
        "targets": [
          {
            "expr": "sum(rate(http_request_duration_seconds_bucket{service=\"payment-service\"}[5m])) by (le)",
            "format": "heatmap"
          }
        ],
        "dataFormat": "tsbuckets",
        "heatmap": {
          "color": {
            "mode": "opacity",
            "fill": "dark-orange"
          }
        }
      },
      {
        "id": 3,
        "title": "Error Budget",
        "type": "stat",
        "targets": [
          {
            "expr": "1 - (sum(rate(http_requests_total{service=\"payment-service\",status=~\"5..\"}[30d])) / sum(rate(http_requests_total{service=\"payment-service\"}[30d])))",
            "legendFormat": "Availability"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "thresholds": {
              "steps": [
                { "color": "red", "value": 0 },
                { "color": "yellow", "value": 0.99 },
                { "color": "green", "value": 0.999 }
              ]
            }
          }
        }
      }
    ]
  }
}
```

**Explanation:**
- **Timeseries**: Line graphs showing request rates over time, colored by status code.
- **Heatmap**: Latency distribution showing frequency of different response times (bucketed).
- **Stat**: Single-value panel showing availability over the last 30 days, colored against the 99.9% target. The remaining error budget is the margin by which this value exceeds 0.999.

## 44.6 SLOs and Error Budgets

Service Level Objectives (SLOs) define target reliability levels. Error budgets quantify how much unreliability is acceptable.

### Defining SLIs and SLOs

```yaml
# slo-definitions.yml
slos:
  - name: payment-service-availability
    description: "Successful payment requests"
    service: payment-service
    type: availability
    
    # SLI: What we measure
    sli:
      numerator: |
        sum(rate(http_requests_total{service="payment-service",status!~"5.."}[{{.window}}]))
      denominator: |
        sum(rate(http_requests_total{service="payment-service"}[{{.window}}]))
    
    # SLO: Target threshold
    target: 0.999  # 99.9%
    
    # Windows for evaluation
    windows:
      - name: "30d"
        duration: 30d
        burn_rates:
          - name: "fast"
            factor: 14.4  # 2% budget in 1 hour
            alert: page
          - name: "slow"
            factor: 2  # ~1.7% budget in 6 hours
            alert: ticket

  - name: payment-service-latency
    description: "Fast payment processing"
    service: payment-service
    type: latency
    
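    sli:
      # Fraction of requests completing within 500ms (the le="0.5" bucket)
      query: |
        sum(rate(http_request_duration_seconds_bucket{service="payment-service",le="0.5"}[{{.window}}]))
        /
        sum(rate(http_request_duration_seconds_count{service="payment-service"}[{{.window}}]))
    
    target: 0.99  # 99% of requests under 500ms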
```

**Explanation:**
- **SLI (Service Level Indicator)**: The query that measures service behavior. For availability: good requests / total requests.
- **SLO (Service Level Objective)**: The target percentage (99.9% availability).
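- **Error Budget**: The allowed failure fraction, 1 − target. For availability, 0.1% of requests can fail; for latency, 1% of requests can take longer than 500ms.
- **Burn Rates**: How fast the error budget is consumed, relative to a rate that would spend it exactly over the 30-day window. A 14.4x burn rate exhausts 2% of the monthly budget in 1 hour, so page immediately. A 2x burn rate consumes about 1.7% in 6 hours (5% in 6 hours would correspond to 6x), so create a ticket.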
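
The arithmetic behind burn rates can be checked with a short calculation. This sketch assumes a 30-day (720-hour) SLO period; the class and method names are ours and do not come from any SLO tool.

```java
import java.util.Locale;

/** Illustrative only: burn-rate arithmetic for a ratio-based SLO. */
final class BurnRate {

    /** Error rate above which an alert with the given burn rate should fire. */
    static double errorRateThreshold(double burnRate, double sloTarget) {
        return burnRate * (1.0 - sloTarget);
    }

    /** Fraction of the period's error budget used when burning at burnRate for windowHours. */
    static double budgetConsumed(double burnRate, double windowHours, double periodHours) {
        return burnRate * windowHours / periodHours;
    }

    public static void main(String[] args) {
        double periodHours = 30 * 24; // 30-day SLO window
        double target = 0.999;

        System.out.printf(Locale.ROOT, "fast burn threshold: %.4f%n",
                errorRateThreshold(14.4, target));
        System.out.printf(Locale.ROOT, "fast burn budget used in 1h: %.4f%n",
                budgetConsumed(14.4, 1, periodHours));
        System.out.printf(Locale.ROOT, "slow burn threshold: %.4f%n",
                errorRateThreshold(2, target));
        System.out.printf(Locale.ROOT, "slow burn budget used in 6h: %.4f%n",
                budgetConsumed(2, 6, periodHours));
    }
}
```

With a 720-hour period, 14.4x for 1 hour consumes 2% of the budget and 2x for 6 hours consumes about 1.7%. The thresholds are 1.44% and 0.2% error rate respectively for a 99.9% target.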

### Prometheus Alerting Rules

```yaml
# rules/slo-alerts.yml
groups:
  - name: slo-alerts
    interval: 30s
    rules:
      # Fast burn alert - page immediately
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) 
            / 
            sum(rate(http_requests_total[1h]))
          ) 
          > 14.4 * (1 - 0.999)  # 14.4x burn rate
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error rate is {{ $value | humanizePercentage }} over last hour"
          
      # Slow burn alert - ticket
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h])) 
            / 
            sum(rate(http_requests_total[6h]))
          ) 
          > 2 * (1 - 0.999)  # 2x burn rate
        for: 30m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Slow error budget burn detected"
          
      # Latency SLO violation
      - alert: HighLatency99th
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "99th percentile latency exceeds 500ms"
```

**Explanation:**
- **Fast burn**: Triggers if error rate exceeds 14.4 × 0.001 = 1.44% over 1 hour. This would consume 2% of monthly budget in 1 hour.
- **Slow burn**: Triggers if error rate exceeds 2 × 0.001 = 0.2% over 6 hours.
- **histogram_quantile**: Calculates the 99th percentile from bucket data. If > 0.5 (500ms), alert.

## 44.7 Alerting

### Alertmanager Configuration

```yaml
# alertmanager.yml
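# Note: Alertmanager does not expand ${VAR} placeholders itself. Render them
# with your deployment tooling, or use the *_file options
# (for example smtp_auth_password_file) to load secrets from files.
global: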
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alerts@company.com'
  smtp_auth_username: 'alerts@company.com'
  smtp_auth_password: '${SMTP_PASSWORD}'
  slack_api_url: '${SLACK_WEBHOOK_URL}'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  opsgenie_api_url: 'https://api.opsgenie.com/'
  resolve_timeout: 5m

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  
  routes:
    # Critical alerts -> PagerDuty immediately
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
      
    # Platform team alerts
    - match_re:
        team: platform|sre
      receiver: 'slack-platform'
      group_by: ['alertname', 'namespace']
      
    # Payment service alerts -> dedicated channel
    - match:
        service: payment-service
      receiver: 'slack-payments'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-payments'

inhibit_rules:
  # Inhibit warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
    
  # Inhibit node alerts if cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: 'Node.*'
    equal: ['cluster']

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
        send_resolved: true
        
  - name: 'slack-platform'
    slack_configs:
      - channel: '#platform-alerts'
        title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        fields:
          - title: Severity
            value: '{{ .CommonLabels.severity }}'
          - title: Service
            value: '{{ .CommonLabels.service }}'
            
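  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_SERVICE_KEY}'
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" . }}'
          runbook_url: '{{ .CommonAnnotations.runbook_url }}'

  # Receivers referenced by the payment-service routes; every receiver named
  # in the routing tree must be defined (channel and key names are placeholders)
  - name: 'slack-payments'
    slack_configs:
      - channel: '#payments-alerts'
        send_resolved: true

  - name: 'pagerduty-payments'
    pagerduty_configs:
      - service_key: '${PAGERDUTY_PAYMENTS_SERVICE_KEY}'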
          
  - name: 'email-sre'
    email_configs:
      - to: 'sre@company.com'
        from: 'alerts@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alerts@company.com'
        auth_password: '${EMAIL_PASSWORD}'
        headers:
          Subject: '{{ .CommonAnnotations.summary }}'
        html: '{{ template "email.default.html" . }}'
```

**Explanation:**
- **route**: Tree-based routing. Alerts traverse the tree, matching conditions. `continue: true` allows matching multiple routes.
- **inhibit_rules**: Suppresses notifications. If "ClusterDown" fires, inhibit all "NodeDown" alerts (they're symptoms, not causes).
- **group_by**: Groups alerts with same labels into single notification (prevents spam).
- **group_wait**: Wait 30s for additional alerts to arrive before sending first notification.
- **repeat_interval**: Re-alert every 12 hours if condition persists.
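
The effect of `group_by` can be pictured as bucketing alerts by the values of the listed labels, with one notification per bucket. The sketch below illustrates that idea only; it is not Alertmanager's implementation, and the class name and sample alerts are hypothetical. A label that is absent is treated as an empty value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: groups alerts by the values of the group_by labels. */
final class AlertGrouping {

    /**
     * @param alerts  each alert represented by its label set
     * @param groupBy label names to group on, as in the route's group_by
     * @return alerts bucketed by the values of the groupBy labels, in first-seen order
     */
    static Map<List<String>, List<Map<String, String>>> group(
            List<Map<String, String>> alerts, List<String> groupBy) {
        Map<List<String>, List<Map<String, String>>> groups = new LinkedHashMap<>();
        for (Map<String, String> labels : alerts) {
            List<String> key = new ArrayList<>();
            for (String name : groupBy) {
                key.add(labels.getOrDefault(name, "")); // missing label counts as empty
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(labels);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map<String, String>> alerts = List.of(
            Map.of("alertname", "HighLatency99th", "cluster", "production",
                   "service", "payment-service", "instance", "pod-a"),
            Map.of("alertname", "HighLatency99th", "cluster", "production",
                   "service", "payment-service", "instance", "pod-b"),
            Map.of("alertname", "ErrorBudgetFastBurn", "cluster", "production",
                   "service", "payment-service"));

        // Each entry corresponds to one notification.
        group(alerts, List.of("alertname", "cluster", "service"))
            .forEach((key, members) ->
                System.out.println(key + " -> " + members.size() + " alert(s)"));
    }
}
```

Here the two latency alerts from different pods share a group and would arrive as one notification, while the burn-rate alert forms its own group.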

## 44.8 CI/CD Integration

### Deployment Metrics

Track deployment frequency and success:

```yaml
# CI pipeline metric push
- name: Record Deployment
  run: |
    cat <<EOF | curl -X POST http://pushgateway:9091/metrics/job/ci-pipeline \
      --data-binary @-
    # HELP deployment_timestamp_seconds Unix timestamp of deployment
    # TYPE deployment_timestamp_seconds gauge
    deployment_timestamp_seconds{service="payment-service",version="${{ github.sha }}",environment="production",status="success"} $(date +%s)
    
    # HELP deployment_duration_seconds Duration of deployment pipeline
    # TYPE deployment_duration_seconds gauge
    deployment_duration_seconds{service="payment-service",environment="production"} ${{ steps.deploy.outputs.duration }}
    EOF
```

### Canary Analysis Metrics

Flagger and Argo Rollouts use metrics for automated promotion:

```yaml
# Flagger MetricTemplate for canary analysis
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: deployment-success
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    sum(
      rate(
        http_requests_total{
          service="{{ service }}",
          version="canary",
          status!~"5.."
        }[1m]
      )
    )
    /
    sum(
      rate(
        http_requests_total{
          service="{{ service }}",
          version="canary"
        }[1m]
      )
    )
```

---

## Chapter Summary and Preview

This chapter established metrics and monitoring as quantitative foundations for reliability engineering, complementing the qualitative insights of logging with measurable, actionable data. We examined Prometheus as the central metrics system, exploring its pull-based architecture, dimensional data model using key-value labels, and the four core metric types: counters for cumulative events, gauges for point-in-time values, histograms for distribution analysis, and summaries for configurable quantiles. The distinction between these types—particularly the aggregability of histograms versus the precise quantiles of summaries—guides instrumentation decisions based on query requirements and resource constraints.

PromQL query language enables sophisticated operational analysis, from simple selectors to complex aggregations, percentile calculations using `histogram_quantile`, and rate calculations that handle counter resets. Service discovery mechanisms automatically detect Kubernetes targets, using relabeling to transform pod metadata into scrape configurations without manual maintenance. The Alertmanager routing tree provides sophisticated notification management, with inhibition rules that suppress symptomatic alerts during root cause events, and grouping that prevents alert storms.

Service Level Objectives (SLOs) and error budgets translate business requirements into technical targets, defining acceptable unreliability levels and burn rates that trigger alerts before budget exhaustion. CI/CD integration ensures deployment events are tracked as metrics, enabling correlation between releases and metric changes, while canary analysis uses automated metric evaluation to determine promotion or rollback without human judgment.

**Key Takeaways:**
- Instrument applications with Prometheus client libraries using appropriate metric types: counters for monotonic events, gauges for fluctuating values, histograms for latency distributions with SLO-aligned buckets, and summaries only when precise quantiles without aggregation are required.
- Use histograms with explicit buckets aligned to SLO boundaries (e.g., 100ms, 250ms, 500ms) rather than default buckets, enabling accurate SLO measurement via `histogram_quantile`.
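- Keep high-cardinality identifiers such as trace_id and span_id out of metric labels, since every distinct label value creates a new time series; correlate metrics with logs and traces through shared low-cardinality labels (service, version) or exemplars instead.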
- Configure Alertmanager with inhibition rules to suppress symptomatic alerts when root cause alerts fire, and use grouping to batch related alerts into single notifications.
- Define SLOs with explicit error budgets and burn rate alerts (fast burn at 14.4x, slow burn at 2x) to detect reliability degradation before budget exhaustion.
- Integrate deployment markers into metrics (deployment_timestamp gauge) to correlate metric changes with releases, enabling quick identification of problematic deployments.

**Next Chapter Preview:**
Chapter 45: Distributed Tracing completes the observability triad by examining request flow across microservices. We will explore OpenTelemetry as the unified instrumentation standard, trace context propagation (W3C Trace Context, B3), span creation and attributes, sampling strategies (head-based, tail-based, probabilistic), and trace visualization in Jaeger and Zipkin. The chapter covers baggage for cross-service context propagation, correlation with logs and metrics using trace IDs, and performance considerations for high-throughput services, establishing the final observability pillar that enables understanding of request latency decomposition across distributed architectures.