# Chapter 68: Monitoring and Alerting

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the importance of monitoring and alerting for production machine learning systems
- Define a monitoring strategy covering system, application, and business metrics for the NEPSE prediction system
- Instrument a Python application to expose metrics (counters, gauges, histograms) using Prometheus
- Set up Prometheus to scrape metrics and store them as time‑series data
- Build dashboards in Grafana to visualise key performance indicators
- Implement structured logging and aggregate logs using the ELK stack or Loki
- Use distributed tracing to debug latency issues in microservices
- Design alerting rules to notify the team of anomalies and incidents
- Define Service Level Objectives (SLOs) and track error budgets
- Apply best practices for on‑call rotations and incident response

---

## Introduction

Deploying the NEPSE prediction system is only the beginning. Once live, we need to ensure it continues to function correctly, perform within acceptable limits, and deliver value. **Monitoring** provides visibility into the system's health and behaviour, while **alerting** notifies us when something requires attention. Together, they form the eyes and ears of a production system.

In traditional software, monitoring focuses on metrics like CPU usage, request latency, and error rates. For machine learning systems, we must also monitor model‑specific concerns: prediction drift, data drift, and feature distribution shifts. A model that is technically responding may still be producing poor predictions because the underlying data has changed.

In this chapter, we will build a comprehensive monitoring stack for the NEPSE system. We'll use **Prometheus** for metrics, **Grafana** for dashboards, **Loki** for logs, and **Jaeger** for tracing. We'll also discuss alerting with **Alertmanager** and define SLOs to guide our reliability efforts.

---

## 68.1 Monitoring Strategy

A good monitoring strategy answers three questions:

1. **What to monitor?** – Identify the key indicators of system health and business performance.
2. **How to collect data?** – Instrument the application, set up exporters, and aggregate logs.
3. **How to respond?** – Define alerts, escalation policies, and runbooks.

For the NEPSE system, we categorise metrics into three layers:

### 68.1.1 System Metrics

- **Infrastructure**: CPU, memory, disk, network I/O (per server/container).
- **Kubernetes**: Pod status, resource usage, restart counts.
- **Database**: Connection pool usage, query latency, replication lag.

### 68.1.2 Application Metrics

- **Request rate**: Number of prediction requests per second, per endpoint.
- **Latency**: Distribution of response times (p50, p95, p99).
- **Error rate**: Percentage of failed requests (4xx, 5xx).
- **Model inference time**: Time spent in the model itself.
- **Feature retrieval time**: Time to fetch features from the online store.

### 68.1.3 Business Metrics

- **Prediction volume**: Total predictions made, per symbol.
- **Prediction distribution**: Histogram of predicted probabilities (for classification).
- **Model performance**: When ground truth becomes available (e.g., next day), track accuracy, precision, recall.
- **Drift metrics**: Data drift scores (PSI) for each feature.
- **User engagement**: If exposed to users, track number of active users, requests per user.
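The drift metric above relies on the Population Stability Index (PSI), which compares how a feature's values are distributed across bins in a reference sample and in a current sample: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the expected and actual bin proportions. The following is a minimal standard-library sketch, not the system's actual drift job; the function name `psi`, the bin edges, the sample values, and the `eps` floor for empty bins are all assumptions made for illustration.

```python
import bisect
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a reference and a current sample.

    bin_edges are sorted cut points that define len(bin_edges) + 1 bins.
    eps floors empty bins so the logarithm stays defined (an assumption of
    this sketch; other smoothing choices exist).
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # bisect_right returns the index of the bin that v falls into
            counts[bisect.bisect_right(bin_edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


reference = [1, 2, 3, 4, 5, 6, 7, 8]       # hypothetical training-time feature values
current = [5, 6, 7, 8, 9, 10, 11, 12]      # hypothetical production values, shifted upwards
edges = [3, 6, 9]

print(psi(reference, reference, edges))    # identical samples score zero
print(psi(reference, current, edges))      # a clearly shifted sample scores high
```

A common rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a significant shift, which is why 0.25 is a natural alert threshold.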

### 68.1.4 Alerting Philosophy

We follow the **"alert on symptoms, not causes"** principle. For example, instead of alerting on high CPU (a cause), we alert on high latency or error rate (symptoms that users experience). However, we still monitor causes for debugging.

Alerts should be:

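- **Actionable**: The person who receives it can do something about it. If no action is needed, it should not be an alert.
- **Timely**: It fires early enough to act before users are seriously affected.
- **Relevant**: It points to a real problem, with few false positives, so the team does not learn to ignore it.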

---

## 68.2 Instrumenting the Application with Prometheus

Prometheus is a popular open‑source monitoring system that collects metrics via HTTP scraping. We'll instrument our FastAPI prediction service to expose metrics.

### 68.2.1 Installing Prometheus Client

```bash
pip install prometheus-client
```

### 68.2.2 Adding Metrics to FastAPI

We'll create a `metrics.py` module that defines our metrics and a `/metrics` endpoint.

```python
# app/metrics.py
from fastapi import APIRouter, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

# Router that carries the /metrics endpoint. This module has no FastAPI app of
# its own, so register the router in main.py with app.include_router(router).
router = APIRouter()

# Define metrics
PREDICTIONS = Counter('predictions_total', 'Total number of predictions', ['symbol', 'model_version'])
PREDICTION_LATENCY = Histogram('prediction_latency_seconds', 'Prediction latency', buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5])
ERRORS = Counter('prediction_errors_total', 'Total prediction errors', ['error_type'])
FEATURE_FETCH_LATENCY = Histogram('feature_fetch_latency_seconds', 'Feature fetch latency')
MODEL_VERSION = Gauge('model_version_info', 'Model version', ['version'])
FEATURE_DRIFT = Gauge('feature_drift_psi', 'Feature drift PSI', ['feature'])

# For tracking the current model version
def set_model_version(version):
    MODEL_VERSION.clear()  # drop the previous version's series so only one is exposed
    MODEL_VERSION.labels(version=version).set(1)

# For updating drift metrics (called periodically)
def update_drift_metrics(drift_scores):
    for feature, score in drift_scores.items():
        FEATURE_DRIFT.labels(feature=feature).set(score)

# FastAPI endpoint for Prometheus to scrape
@router.get("/metrics")
async def get_metrics():
    # CONTENT_TYPE_LATEST is the content type Prometheus expects for this format
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

### 68.2.3 Instrumenting the Prediction Endpoint

Now we use these metrics in our prediction endpoint.

```python
# app/main.py (simplified)
import time

from fastapi import FastAPI, HTTPException

from .metrics import PREDICTIONS, PREDICTION_LATENCY, ERRORS, FEATURE_FETCH_LATENCY

app = FastAPI()
# The /metrics route must also be registered on this app (for example with
# app.include_router) so that Prometheus can scrape it.

@app.post("/predict")
async def predict(symbol: str, features: dict):
    # perf_counter is a monotonic clock, which suits measuring durations
    start_time = time.perf_counter()
    try:
        # Measure feature fetch time
        fetch_start = time.perf_counter()
        # ... fetch features from online store ...
        fetch_time = time.perf_counter() - fetch_start
        FEATURE_FETCH_LATENCY.observe(fetch_time)

        # Model inference
        inference_start = time.perf_counter()
        # ... run model ...
        prob = 0.75  # dummy
        # Not exported yet; a dedicated histogram could record this value
        inference_time = time.perf_counter() - inference_start

        # Record metrics (only reached when the prediction succeeds)
        PREDICTIONS.labels(symbol=symbol, model_version="v1.2").inc()
        PREDICTION_LATENCY.observe(time.perf_counter() - start_time)

        return {"probability": prob}
    except Exception as e:
        # Count the failure by exception type, then surface it as a 500
        ERRORS.labels(error_type=type(e).__name__).inc()
        raise HTTPException(status_code=500, detail=str(e)) from e
```

**Explanation:**  
- `PREDICTIONS` counts each successful prediction, labelled by symbol and model version.
- `PREDICTION_LATENCY` records the total latency.
- `FEATURE_FETCH_LATENCY` measures how long it takes to retrieve features.
- `ERRORS` counts failures by exception type.
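Timing code like this tends to be repeated in every handler. One way to keep it in one place is a small context manager that reports the elapsed time whether or not the wrapped block raises. The sketch below uses only the standard library; the name `timed` and the `observe` callback are illustrative assumptions (in the service, `observe` could be something like `PREDICTION_LATENCY.observe`).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(observe):
    """Call observe(elapsed_seconds) when the block exits, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        observe(time.perf_counter() - start)


samples = []  # stands in for a histogram in this sketch

# Normal completion: one duration is recorded
with timed(samples.append):
    time.sleep(0.01)

# Failure: the duration is still recorded before the exception propagates
try:
    with timed(samples.append):
        raise ValueError("simulated model failure")
except ValueError:
    pass

print(len(samples))
```

The `prometheus_client` histograms also offer a built-in `time()` helper that can be used as a context manager or decorator, which may be preferable when no custom behaviour is needed.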

---

## 68.3 Setting Up Prometheus

Prometheus scrapes metrics endpoints at regular intervals. We need to configure it to target our service.

### 68.3.1 Prometheus Configuration

Create a `prometheus.yml` file:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'nepse-predictor'
    static_configs:
      - targets: ['localhost:8000']  # if running locally
    metrics_path: /metrics

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']  # node_exporter for system metrics

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```

**Explanation:**  
- The first job scrapes our prediction service directly.
- The second job scrapes `node_exporter` for host‑level metrics (CPU, memory, etc.).
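- The third job demonstrates auto‑discovery of Kubernetes pods annotated with `prometheus.io/scrape: "true"` (annotation values are strings, so the quotes are required in the pod manifest). The optional `prometheus.io/path` and `prometheus.io/port` annotations override the metrics path and port. This is useful in a Kubernetes deployment.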
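The last relabel rule in the Kubernetes job is the least obvious one. Prometheus joins the values of `source_labels` with `;`, matches the result against the fully anchored regex, and writes the replacement into `__address__`. The following standard-library sketch mimics that behaviour for illustration only; `relabel_address` is a hypothetical helper and the addresses are made up.

```python
import re

# Same pattern as in the relabel rule: host, optional ":port", ";", annotation port
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address, port_annotation):
    """Return the scrape address after applying the port annotation."""
    joined = f"{address};{port_annotation}"   # source labels are joined with ';'
    match = ADDRESS_RE.fullmatch(joined)      # the regex must match the whole value
    if match is None:
        return address                        # no match: the label is left unchanged
    return f"{match.group(1)}:{match.group(2)}"


print(relabel_address("10.0.0.5:8080", "9102"))  # discovered port replaced by the annotation
print(relabel_address("10.0.0.5", "9102"))       # port appended when none was discovered
print(relabel_address("10.0.0.5:8080", ""))      # no annotation: address unchanged
```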

### 68.3.2 Running Prometheus

Using Docker:

```bash
docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
```

Then access the UI at `http://localhost:9090`.

### 68.3.3 Key Prometheus Queries

Some useful queries for the NEPSE system:

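- Prediction rate (successful predictions per second, per label combination): `rate(predictions_total[5m])`
- Errors per second, by error type: `rate(prediction_errors_total[5m])`
- Error ratio (failed requests / all requests): `sum(rate(prediction_errors_total[5m])) / (sum(rate(predictions_total[5m])) + sum(rate(prediction_errors_total[5m])))`
- 95th percentile latency: `histogram_quantile(0.95, sum(rate(prediction_latency_seconds_bucket[5m])) by (le))`
- Average feature fetch latency: `sum(rate(feature_fetch_latency_seconds_sum[5m])) / sum(rate(feature_fetch_latency_seconds_count[5m]))`
- CPU usage (if node_exporter is running): `100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`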
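The percentile query deserves a closer look, because `histogram_quantile` does not see individual request durations. It only sees cumulative bucket counts and estimates the quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that idea for classic histograms in plain Python; it is a simplified illustration rather than Prometheus's implementation, and the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with at least two entries and ending in (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the largest finite bucket; report that bound
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    # Assume observations are spread evenly inside the bucket
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical counts: 60 requests <= 0.1s, 90 <= 0.25s, 98 <= 0.5s, 100 in total
buckets = [(0.1, 60), (0.25, 90), (0.5, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # falls between 0.25s and 0.5s
print(histogram_quantile(0.99, buckets))  # capped at the largest finite bound
```

The practical consequence is that percentile accuracy depends on bucket boundaries: if most requests land in one wide bucket, or above the largest finite bound, the estimate becomes coarse, so buckets should bracket the latency targets you care about.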

---

## 68.4 Visualising with Grafana

Grafana connects to Prometheus and provides rich dashboards.

### 68.4.1 Installing Grafana

```bash
docker run -d -p 3000:3000 --name=grafana grafana/grafana
```

Default login: admin/admin.

### 68.4.2 Adding Prometheus Data Source

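1. Go to Configuration → Data Sources → Add data source (in recent Grafana releases the menu is Connections → Data sources).
2. Select Prometheus.
3. Set URL to the Prometheus server address. `http://localhost:9090` works only when Grafana can reach Prometheus on its own loopback interface; if Grafana runs in its own Docker container, `localhost` refers to that container, so use an address reachable from inside it (for example, the Prometheus container name on a shared Docker network).
4. Save & Test.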

### 68.4.3 Building a Dashboard

Create a new dashboard and add panels. Example panels:

- **Request Rate** – Graph of `rate(predictions_total[5m])` by `symbol`.
- **Error Rate** – Graph of `rate(prediction_errors_total[5m])`.
- **Latency (p95)** – Graph using the histogram quantile query.
- **Model Version** – Stat panel showing the current model version (using `model_version_info`).
- **Drift Scores** – Table or gauge of `feature_drift_psi`.
- **System Metrics** – CPU, memory of the pods.

Grafana also supports alerting (though we'll use Alertmanager for production).

---

## 68.5 Log Aggregation

Metrics give aggregate numbers, but logs provide detailed events. For the NEPSE system, we want to log:

- Every prediction request (with anonymised features, if necessary).
- Errors and stack traces.
- Model version changes.
- Anomaly detection events.

### 68.5.1 Structured Logging with JSON

We'll use the `structlog` library to output JSON logs, which are easy to ingest into log aggregation systems.

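```python
# app/logging_config.py
import logging
import sys

import structlog

# structlog hands records to the standard library here, so stdlib logging must
# be configured too; its default WARNING level would silently drop info() calls.
logging.basicConfig(format="%(message)s", stream=sys.stdout, level=logging.INFO)

structlog.configure(
    processors=[
        structlog.processors.add_log_level,           # adds "level": "info"
        structlog.processors.TimeStamper(fmt="iso"),  # adds an ISO 8601 timestamp
        structlog.processors.JSONRenderer(),          # renders the event as JSON
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()

# Called from the prediction endpoint with the request's symbol and features
def log_prediction_request(symbol, features, model_version="v1.2"):
    logger.info("prediction_request", symbol=symbol, features=features, model_version=model_version)
```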
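To make concrete what a structured log line contains, here is a standard-library sketch that produces similar JSON output without `structlog`. It is meant as an illustration of the format, not as a replacement for the configuration above; the `JsonFormatter` class and the `fields` attribute are assumptions of this sketch.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Extra key/value pairs are passed via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
demo_logger = logging.getLogger("nepse.demo")  # hypothetical logger name
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)

demo_logger.info(
    "prediction_request",
    extra={"fields": {"symbol": "NABIL", "model_version": "v1.2"}},
)
```

Because every line is a self-describing JSON object, a log aggregator can filter on fields such as `symbol` or `model_version` without fragile regular expressions.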

### 68.5.2 Log Aggregation with Loki

Loki is a log aggregation system by Grafana Labs, designed to be lightweight and integrated with Grafana. It indexes only metadata, not the full log text, making it cost‑effective.

**Running Loki with Docker:**

```bash
docker run -d --name=loki -p 3100:3100 grafana/loki:latest
```

**Configuring Promtail** (the log collector) to ship logs from our service:

```yaml
# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: nepse-predictor
    static_configs:
      - targets: [localhost]
        labels:
          job: nepse-predictor
          __path__: /var/log/nepse/*.log   # where our app writes logs
```

**Explanation:**  
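Promtail tails the log files matched by `__path__` and sends them to Loki, so the application (or its process manager) must write its JSON logs into `/var/log/nepse/`. In Kubernetes, you would run Promtail as a DaemonSet to collect logs from all pods.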

### 68.5.3 Alternative: ELK Stack

If you need full‑text search and more advanced analytics, use Elasticsearch, Logstash, and Kibana. Filebeat ships logs to Logstash or directly to Elasticsearch.

---

## 68.6 Distributed Tracing

In a microservices architecture (e.g., prediction service, feature store, model service), a single request may traverse multiple services. Distributed tracing helps understand where time is spent.

### 68.6.1 Instrumenting with OpenTelemetry

OpenTelemetry is the industry standard for tracing. We'll instrument our FastAPI app.

```bash
pip install opentelemetry-distro opentelemetry-exporter-jaeger opentelemetry-instrumentation-fastapi
```

**Setup:**

```python
# app/tracing.py
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_tracing(app):
    resource = Resource(attributes={
        SERVICE_NAME: "nepse-predictor"
    })
    provider = TracerProvider(resource=resource)
    processor = BatchSpanProcessor(
        JaegerExporter(
            agent_host_name="localhost",
            agent_port=6831,
        )
    )
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

    FastAPIInstrumentor.instrument_app(app)

# In main.py
from .tracing import setup_tracing
setup_tracing(app)
```

**Explanation:**  
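- The tracer provider is configured to export spans to Jaeger (running locally or in the cluster). `BatchSpanProcessor` buffers spans and sends them in batches, so exporting does not block request handling.
- `FastAPIInstrumentor` automatically creates spans for each request, capturing timing and metadata.
- The dedicated Jaeger exporter has been deprecated upstream in favour of OTLP, which recent Jaeger versions accept natively. The code above still illustrates the setup, but for new projects an OTLP exporter is likely the more future‑proof choice.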

### 68.6.2 Running Jaeger

```bash
docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 14250:14250 \
  -p 9411:9411 \
  jaegertracing/all-in-one:latest
```

Access UI at `http://localhost:16686`.

### 68.6.3 Creating Custom Spans

You can also create custom spans for specific operations, like fetching features:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("fetch_features") as span:
    # fetch features
    span.set_attribute("symbol", symbol)
    span.set_attribute("feature_count", len(features))
```

---

## 68.7 Alerting with Alertmanager

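Prometheus evaluates alert rules itself but delegates notification handling to Alertmanager, a separate component of the Prometheus ecosystem. Alertmanager deduplicates, groups, inhibits, and routes alerts to various receivers (email, Slack, PagerDuty).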

### 68.7.1 Defining Alert Rules

Create a file `alerts.yml`:

```yaml
groups:
  - name: nepse_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(prediction_errors_total[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | printf \"%.2f\" }} errors/s for job {{ $labels.job }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (job, le) (rate(prediction_latency_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for predictions"
          description: "P95 latency is {{ $value | printf \"%.2f\" }}s for job {{ $labels.job }}"

      - alert: FeatureDriftHigh
        expr: feature_drift_psi > 0.25
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Feature drift detected"
          description: "Feature {{ $labels.feature }} has PSI {{ $value | printf \"%.2f\" }}"
```

**Explanation:**  
- The `HighErrorRate` alert fires if the error rate exceeds 0.01 per second for 2 minutes.
- `HighLatency` fires if p95 latency > 0.5s for 5 minutes.
- `FeatureDriftHigh` fires if any feature drift PSI exceeds 0.25 for an hour.
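The `for` clause is what keeps brief spikes from paging anyone: a rule whose condition becomes true is first *pending*, and only *fires* once the condition has held at every evaluation for the full duration. The sketch below models that state machine in plain Python as an illustration of the semantics; `alert_states` and the sample values are hypothetical.

```python
def alert_states(samples, threshold, for_seconds, eval_interval=15):
    """Return the alert state after each rule evaluation.

    samples holds the expression value at successive evaluations, spaced
    eval_interval seconds apart.
    """
    states = []
    active_for = None  # seconds the condition has held; None while inactive
    for value in samples:
        if value > threshold:
            active_for = 0 if active_for is None else active_for + eval_interval
            states.append("firing" if active_for >= for_seconds else "pending")
        else:
            active_for = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# Errors/s at six consecutive evaluations, with a 0.01 threshold and `for: 30s`
errors_per_second = [0.02, 0.02, 0.0, 0.02, 0.02, 0.02]
print(alert_states(errors_per_second, threshold=0.01, for_seconds=30))
```

In this run the first spike never fires because it is interrupted after two evaluations, while the second one fires at its third consecutive evaluation.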

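Reference this file under `rule_files` in `prometheus.yml` so that Prometheus loads the rules. When running Prometheus in Docker, mount the rules file into the container as well.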

### 68.7.2 Configuring Alertmanager

Create `alertmanager.yml`:

```yaml
route:
  group_by: ['alertname', 'job']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
        send_resolved: true
```
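The `group_by` setting determines which alerts are bundled into a single notification: alerts that share the same values for the listed labels end up in the same group. A rough standard-library sketch of that grouping step follows; `group_alerts` and the alert label sets are invented for illustration and ignore timing settings such as `group_wait`.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "job")):
    """Bucket alert label sets by the values of the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # Missing labels are treated as empty strings in this sketch
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighErrorRate", "job": "nepse-predictor", "error_type": "ValueError"},
    {"alertname": "HighErrorRate", "job": "nepse-predictor", "error_type": "TimeoutError"},
    {"alertname": "FeatureDriftHigh", "job": "nepse-predictor", "feature": "volume"},
]

for key, members in group_alerts(alerts).items():
    print(key, len(members))
```

Here the two `HighErrorRate` alerts would arrive as one Slack message instead of two, which is the main lever for reducing notification noise during an incident.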

Start Alertmanager:

```bash
docker run -p 9093:9093 -v /path/to/alertmanager.yml:/etc/alertmanager/alertmanager.yml prom/alertmanager
```

Then configure Prometheus to send alerts to Alertmanager by adding:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```

---

## 68.8 Service Level Objectives (SLOs) and Error Budgets

An SLO is a target for a service level indicator (SLI), such as availability or latency. For the NEPSE system, we might define:

- **SLI: Availability** – Proportion of successful prediction requests (HTTP 200) over a 30‑day window.
- **SLO**: 99.9% availability.
- **SLI: Latency** – Proportion of requests served under 200ms.
- **SLO**: 95% of requests under 200ms.

An **error budget** is 100% minus the SLO. For a 99.9% availability SLO, the error budget is 0.1%. If we exceed this budget (i.e., availability drops below 99.9%), we should stop deploying new features and focus on reliability.
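To get a feel for how small this budget is, it helps to convert it into time and into requests. The following sketch does the arithmetic; the function name and the monthly request volume are assumptions chosen for illustration.

```python
def error_budget(slo, window_days=30):
    """Return (budget_fraction, budget_minutes) for an availability SLO."""
    budget_fraction = 1 - slo
    window_minutes = window_days * 24 * 60
    return budget_fraction, budget_fraction * window_minutes


fraction, minutes = error_budget(0.999)
print(f"budget: {fraction:.3%} of requests, about {minutes:.1f} minutes of full outage")

# With a hypothetical volume of 1,000,000 requests per 30 days:
monthly_requests = 1_000_000
print(f"allowed failures: {fraction * monthly_requests:.0f}")
```

For a 99.9% SLO over 30 days this comes to roughly 43 minutes of complete downtime, or about one failed request in every thousand.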

### 68.8.1 Calculating SLOs in Prometheus

Because `predictions_total` is incremented only for successful predictions and `prediction_errors_total` counts the failures, the total number of requests is the sum of the two. Availability is successful requests divided by total requests, computed here over a rolling 30‑day window:

```promql
sum(rate(predictions_total[30d])) / (sum(rate(predictions_total[30d])) + sum(rate(prediction_errors_total[30d])))
```

A 30‑day range only works if Prometheus retains at least 30 days of data. The default retention is 15 days, so raise `--storage.tsdb.retention.time` accordingly.

We can create a recording rule to pre‑compute this metric.

### 68.8.2 Alerting on Error Budget Burn

Instead of alerting when we drop below the SLO (by which point the budget is already exhausted), we alert when we are burning the budget too fast. For example, with a 30‑day error budget, we can alert if 10% of it has been consumed in the last hour. This gives early warning.

**Burn rate alert example:**

```yaml
- alert: HighErrorBudgetBurn
  expr: (1 - (sum(rate(predictions_total[1h])) / (sum(rate(predictions_total[1h])) + sum(rate(prediction_errors_total[1h]))))) > (72 * 0.001)
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "High error budget burn rate"
```

Here `0.001` is the error budget (0.1%) and `72` is the burn rate. A 30‑day window contains 720 hours, so consuming 10% of the budget within one hour means burning it 0.1 × 720 = 72 times faster than the SLO allows. The alert therefore fires when the error ratio over the last hour exceeds 7.2%. The `for` clause is kept short because the one‑hour range already smooths out brief spikes. A lower burn rate, such as 14.4 (2% of the budget in one hour), is a commonly used and more sensitive threshold.
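Burn rate arithmetic is easy to get wrong, so a small calculator can help when choosing thresholds. Burn rate is the observed error ratio divided by the error budget: a burn rate of 1 uses up exactly the whole budget over the SLO window, and consuming a fraction f of the budget within w hours of a 30‑day window corresponds to a burn rate of f × 720 / w. The helper names below are hypothetical.

```python
def burn_rate_for(budget_fraction_consumed, alert_window_hours, slo_window_days=30):
    """Burn rate at which the given share of the budget is used up in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction_consumed * slo_window_hours / alert_window_hours


def error_ratio_threshold(burn_rate, slo):
    """Error ratio that corresponds to a burn rate for a given SLO."""
    return burn_rate * (1 - slo)


# 10% of a 30-day budget consumed within one hour
rate_10pct = burn_rate_for(0.10, alert_window_hours=1)
print(rate_10pct, error_ratio_threshold(rate_10pct, slo=0.999))

# 2% of the budget within one hour, a more sensitive setting
rate_2pct = burn_rate_for(0.02, alert_window_hours=1)
print(rate_2pct, error_ratio_threshold(rate_2pct, slo=0.999))
```

For a 99.9% SLO, the first case corresponds to a burn rate of about 72 and an error ratio of roughly 7.2%; the second to a burn rate of about 14.4 and an error ratio of roughly 1.44%.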

---

## 68.9 Incident Management

When an alert fires, an incident begins. A good incident management process includes:

- **Detection**: Alert triggers.
- **Response**: On‑call engineer acknowledges.
- **Investigation**: Determine root cause.
- **Mitigation**: Fix or workaround.
- **Resolution**: Service restored.
- **Post‑mortem**: Document what happened, why, and how to prevent recurrence.

Tools like **PagerDuty** or **Opsgenie** handle on‑call rotations, escalations, and incident tracking.

### 68.9.1 Runbooks

For common alerts, create **runbooks** – step‑by‑step guides for investigation and mitigation. For example:

**Alert: HighErrorRate**
1. Check the logs for recent errors (`kubectl logs ...`).
2. Check if the model serving pod is running out of memory.
3. Verify the database connection.
4. If feature store is down, fall back to a cached model.
5. Escalate to the on‑call data scientist if needed.

---

## 68.10 Best Practices for Monitoring and Alerting

1. **Start with the four golden signals**: Latency, traffic, errors, saturation.
2. **Monitor both infrastructure and application**.
3. **Use labels wisely** in Prometheus to allow slicing by service, version, etc.
4. **Avoid alert fatigue** – tune thresholds, use `for` clauses, and resolve quickly.
5. **Write runbooks** for every alert.
6. **Regularly review alerts** – archive unused ones, adjust thresholds.
7. **Use SLOs to guide priorities** – reliability vs. feature development.
8. **Test your monitoring** – simulate failures to ensure alerts fire.
9. **Secure monitoring endpoints** – do not expose `/metrics` publicly without authentication.
10. **Retain metrics and logs** for at least 30 days (or longer for compliance).

---

## Chapter Summary

In this chapter, we built a comprehensive monitoring and alerting stack for the NEPSE prediction system. We covered:

- A monitoring strategy covering system, application, and business metrics.
- Instrumenting a FastAPI application with Prometheus metrics (counters, histograms, gauges).
- Setting up Prometheus to scrape metrics and query them.
- Building dashboards in Grafana to visualise key indicators.
- Structured logging with JSON and aggregating logs with Loki.
- Distributed tracing with OpenTelemetry and Jaeger.
- Defining alerting rules and routing through Alertmanager.
- Introducing SLOs and error budgets to guide reliability efforts.
- Incident management and runbook best practices.

With monitoring and alerting in place, we can sleep soundly knowing that the NEPSE system is being watched, and that we'll be woken only when it truly matters. In the next chapter, we will discuss **Model Drift Detection**, which extends monitoring to the model's performance over time.

