# Chapter 44: Monitoring and Observability

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the three pillars of observability—metrics, logs, and traces—and how they apply to a prediction system
- Set up monitoring for system‑level metrics such as CPU, memory, latency, and throughput of your NEPSE prediction service
- Instrument your application to emit custom metrics (e.g., number of predictions, model inference time, prediction distribution)
- Implement structured logging to capture detailed events for debugging and audit trails
- Use distributed tracing to follow a request through multiple microservices
- Configure alerting rules to notify you of anomalies or service degradation
- Build real‑time dashboards with Grafana to visualise system health and model performance
- Detect and respond to model drift (data drift and concept drift) using statistical tests and monitoring tools

---

## Introduction

Once your NEPSE stock prediction model is deployed into production, the journey is far from over. In fact, it has just begun. A model that performs well today may degrade tomorrow because of changing market conditions, data distribution shifts, or infrastructure issues. **Monitoring and observability** are what allow you to detect, diagnose, and remediate these problems before they affect users.

Observability goes beyond traditional monitoring. It gives you the ability to ask arbitrary questions about your system’s internal state based on the data it emits: metrics, logs, and traces. In this chapter, we will build a comprehensive observability stack for our real‑time prediction system. We will instrument the Python service, collect metrics with Prometheus, visualise them in Grafana, emit structured logs and aggregate them with Elasticsearch and Kibana (or a lighter stack such as Loki), and add distributed tracing with OpenTelemetry and Jaeger. Finally, we will discuss how to monitor for model drift, a critical concern for any machine learning system.

---

## 44.1 The Three Pillars of Observability

Observability is built on three complementary data types:

1. **Metrics** – Numerical measurements aggregated over time (e.g., requests per second, error rate, latency percentiles). Metrics are lightweight and ideal for alerting and dashboards.
2. **Logs** – Discrete events with timestamps and structured or unstructured messages. Logs provide detailed context for debugging.
3. **Traces** – Records of a request’s journey through distributed services, showing where time is spent and where errors occur.

A mature observability practice uses all three in concert. For our NEPSE system, we will:

- **Metrics**: Count predictions per symbol, measure inference latency, track CPU and memory usage.
- **Logs**: Log each prediction request with input features, predicted probability, and any warnings.
- **Traces**: Trace a prediction request from the API gateway through the model inference to the database.
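
To make the distinction concrete, here is a minimal standard-library sketch in which a single prediction produces all three signals. It is not the stack built later in this chapter; the names `prediction_counts`, `log_lines`, `spans`, and `handle_prediction` are illustrative assumptions.

```python
import json
import time
import uuid
from collections import Counter

prediction_counts = Counter()  # metric: one aggregated number per symbol
log_lines = []                 # logs: one structured event per occurrence
spans = []                     # traces: timed steps that share a trace_id


def handle_prediction(symbol: str, probability: float) -> str:
    """Record a metric, a log line, and a span for one prediction."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()

    prediction_counts[symbol] += 1
    log_lines.append(json.dumps({
        "event": "prediction",
        "symbol": symbol,
        "probability": probability,
        "trace_id": trace_id,
    }))
    spans.append({
        "trace_id": trace_id,
        "name": "predict",
        "duration_s": time.perf_counter() - start,
    })
    return trace_id


trace_id = handle_prediction("NEPSE", 0.87)
```

The counter answers "how many?", the log line answers "what exactly happened?", and the span answers "how long did each step take?". The shared `trace_id` is what lets you move from one signal to the others.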

---

## 44.2 System Metrics

System metrics tell us about the health of the infrastructure: are the servers overloaded? Is the network saturated? These metrics are typically collected by agents like **Prometheus Node Exporter** (for host metrics) and **cAdvisor** (for container metrics). For our Python service, we can expose custom application metrics via the Prometheus client library.

### 44.2.1 Exposing Metrics from a Python Service

First, install the Prometheus client:

```bash
pip install prometheus-client
```

Then, in your FastAPI application, create a metrics endpoint that Prometheus can scrape.

```python
# app/metrics.py
import time

from fastapi import FastAPI, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST, REGISTRY, Counter, Histogram, generate_latest,
)

app = FastAPI()

# Define metrics
PREDICTIONS = Counter('predictions_total', 'Total number of predictions', ['symbol'])
LATENCY = Histogram(
    'prediction_latency_seconds',
    'Prediction latency in seconds',
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
)
ERRORS = Counter('prediction_errors_total', 'Total prediction errors', ['error_type'])

# Instrument the prediction function
@app.post("/predict")
async def predict(symbol: str, features: dict):
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    try:
        # ... actual prediction logic ...
        prob = run_model(symbol, features)  # hypothetical helper standing in for that logic
        PREDICTIONS.labels(symbol=symbol).inc()
        latency = time.perf_counter() - start
        LATENCY.observe(latency)
        return {"probability": prob}
    except Exception as e:
        # Count failures by exception class, then let FastAPI handle the error
        ERRORS.labels(error_type=type(e).__name__).inc()
        raise

# Expose metrics endpoint
@app.get("/metrics")
async def get_metrics():
    # CONTENT_TYPE_LATEST is the content type Prometheus expects for this format
    return Response(content=generate_latest(REGISTRY), media_type=CONTENT_TYPE_LATEST)
```

**Explanation:**  
- We define a `Counter` for predictions, labelled by stock symbol. This allows us to see prediction volume per symbol.  
- A `Histogram` tracks latency distribution. The buckets are chosen to capture the typical range (from a few milliseconds to a couple of seconds).  
- An error counter helps track failure rates.  
- The `/metrics` endpoint returns all metrics in the format Prometheus expects. Prometheus will scrape this endpoint periodically.
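
It helps to know what a histogram actually stores. The following is a simplified, dependency-free sketch (the class `ToyHistogram` is an illustration, not the `prometheus_client` implementation): each observation lands in the first bucket whose upper bound is greater than or equal to the value, and the exposed `_bucket` series are cumulative counts, alongside a running `_sum` and `_count`.

```python
from bisect import bisect_left

BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5]


class ToyHistogram:
    """Simplified model of how a Prometheus histogram aggregates observations."""

    def __init__(self, bounds):
        self.bounds = list(bounds) + [float("inf")]  # implicit +Inf bucket
        self.counts = [0] * len(self.bounds)         # per-bucket, non-cumulative
        self.total = 0.0                             # exposed as <name>_sum
        self.n = 0                                   # exposed as <name>_count

    def observe(self, value):
        # First bucket whose upper bound is >= value ("le" = less than or equal)
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value
        self.n += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, as exposed in <name>_bucket."""
        running, out = 0, []
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out.append((bound, running))
        return out


hist = ToyHistogram(BUCKETS)
for latency in [0.004, 0.01, 0.03, 0.03, 3.0]:
    hist.observe(latency)
```

Because only bucket counts are kept, individual latencies cannot be recovered later. This is why bucket boundaries should bracket the latencies you care about: the 3.0 s observation above is only known to be "more than 2.5 s".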

### 44.2.2 Running Prometheus

Prometheus is a time‑series database that scrapes metrics from configured targets. A basic `prometheus.yml` configuration:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'nepse-predictor'
    static_configs:
      - targets: ['localhost:8000']
```

Start Prometheus with:

```bash
prometheus --config.file=prometheus.yml
```

### 44.2.3 Visualising with Grafana

Grafana connects to Prometheus and builds dashboards. After installing Grafana, add Prometheus as a data source. Then create a dashboard with panels for:

- Prediction rate: `rate(predictions_total[5m])`
- Error rate: `rate(prediction_errors_total[5m])`
- Latency p99: `histogram_quantile(0.99, sum(rate(prediction_latency_seconds_bucket[5m])) by (le))`
- CPU and memory (from Node Exporter)

**Example query for latency p99:**  
```
histogram_quantile(0.99, sum(rate(prediction_latency_seconds_bucket[5m])) by (le))
```

This gives the 99th percentile latency over the last 5 minutes.
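
The value is an estimate, because Prometheus only knows the bucket counts. A simplified sketch of the interpolation behind `histogram_quantile` is shown below; it assumes a classic histogram with non-negative bucket bounds and ignores edge cases the real function handles. In the query above the inputs are per-second rates rather than raw counts, but the interpolation works the same way.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank
    for i, (upper, cum) in enumerate(buckets):
        if cum >= rank:
            break

    if math.isinf(upper):
        # The quantile lies beyond the largest finite bound; report that bound
        return buckets[-2][0]

    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    # Assume observations are spread evenly inside the bucket
    return lower + (upper - lower) * (rank - below) / (cum - below)


buckets = [(0.05, 90), (0.1, 98), (0.25, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)  # about 0.175
```

Here the 99th observation falls in the 0.1 to 0.25 s bucket, halfway through its two observations, so the estimate is 0.1 + 0.15 × 0.5 = 0.175 s. The true p99 could be anywhere in that bucket, which is another reason to choose bucket bounds close to your latency targets.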

---

## 44.3 Application Logging

While metrics give aggregate numbers, logs provide individual events. For a prediction service, you might log:

- Each prediction request and response
- Model version used
- Any anomalies (e.g., input features out of expected range)
- Errors with stack traces

### 44.3.1 Structured Logging with Python

Using the `structlog` library, we can output JSON‑formatted logs that are easy to ingest into systems like Elasticsearch.

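```python
import logging

import structlog

# structlog hands events to the standard library here, whose default level is
# WARNING; without this line the INFO events below would be dropped.
logging.basicConfig(format="%(message)s", level=logging.INFO)

# Configure structlog to output JSON
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.format_exc_info,  # turn exc_info=True into a traceback string
        structlog.processors.JSONRenderer(),
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()

# `app` (the FastAPI instance) and `model` are assumed to be defined elsewhere
@app.post("/predict")
async def predict(symbol: str, features: dict):
    log = logger.bind(symbol=symbol, model_version="v1.2")
    log.info("prediction_start", features=features)
    try:
        # Cast to a plain float so the value serialises cleanly to JSON
        prob = float(model.predict_proba([features])[0][1])
        log.info("prediction_success", probability=prob)
        return {"probability": prob}
    except Exception as e:
        log.error("prediction_error", error=str(e), exc_info=True)
        raise
```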

**Explanation:**  
- `structlog` enriches logs with timestamps and allows key‑value pairs.  
- We bind common fields (symbol, model version) to a logger instance so they appear in every log line from that request.  
- The final output is a JSON line, e.g.,  
  `{"event": "prediction_success", "timestamp": "2025-03-15T10:30:00Z", "symbol": "NEPSE", "model_version": "v1.2", "probability": 0.87}`.
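
If you would rather avoid an extra dependency, a similar JSON line can be produced with the standard library alone. The sketch below is an assumption-laden illustration rather than a replacement for `structlog`: `JsonFormatter` and `bind` are hypothetical helpers, and an in-memory stream stands in for stdout so the output can be inspected.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
        }
        payload.update(getattr(record, "context", {}))  # key-value pairs, if any
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout in this sketch
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

base_logger = logging.getLogger("nepse.predictor")
base_logger.setLevel(logging.INFO)
base_logger.addHandler(handler)
base_logger.propagate = False  # avoid duplicate output via the root logger


def bind(logger, **bound):
    """Return an emit function that adds `bound` fields to every event."""
    def emit(event, **fields):
        logger.info(event, extra={"context": {**bound, **fields}})
    return emit


log = bind(base_logger, symbol="NEPSE", model_version="v1.2")
log("prediction_success", probability=0.87)
```

The bound fields (`symbol`, `model_version`) appear on every line emitted through `log`, which mirrors what `logger.bind(...)` does in the `structlog` example.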

### 44.3.2 Log Aggregation

For a production system, logs should be collected centrally. A common stack is **Elasticsearch, Logstash, and Kibana (ELK)** or the lighter **Loki** from Grafana Labs.

**Fluentd** or **Fluent Bit** can be deployed as a DaemonSet in Kubernetes, so that one collector runs on every node and ships container logs to Elasticsearch or Loki.

With Loki, you can query logs using LogQL and correlate them with metrics in Grafana.

---

## 44.4 Distributed Tracing

When your system consists of multiple services (e.g., prediction API, feature store, database), a single user request may span several components. Distributed tracing helps you understand the end‑to‑end flow and pinpoint bottlenecks.

**OpenTelemetry** is a vendor‑neutral standard for instrumenting applications. It can export to multiple backends, such as Jaeger, Zipkin, and Tempo.

### 44.4.1 Instrumenting with OpenTelemetry

Install the required packages:

```bash
pip install opentelemetry-distro opentelemetry-instrumentation-fastapi opentelemetry-exporter-jaeger
```

Set up tracing in your FastAPI app:

```python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracer provider
resource = Resource(attributes={
    SERVICE_NAME: "nepse-predictor"
})
provider = TracerProvider(resource=resource)

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

# Add span processor
provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(provider)

# Instrument FastAPI
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
```

**Explanation:**  
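- The tracer provider holds the tracing configuration (resource, span processors) and hands out tracers, which create the spans for each request.  
- The Jaeger exporter sends spans to a Jaeger agent (running locally or in the cluster). Recent OpenTelemetry Python releases deprecate this exporter in favour of OTLP, which Jaeger can ingest directly, so check the versions you install.  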
- `FastAPIInstrumentor` automatically traces incoming requests and creates spans for each endpoint.

Now you can view traces in the Jaeger UI, seeing how long each part of the request took.

---

## 44.5 Alerting

Metrics are useless if no one looks at them. Alerting notifies you when something goes wrong. With Prometheus, you define alerting rules, and **Alertmanager** handles routing to channels such as email, Slack, or PagerDuty.

### 44.5.1 Defining Alert Rules

Create a file `alerts.yml`:

```yaml
groups:
  - name: nepse_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(prediction_errors_total[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value }} errors/s for job {{ $labels.job }}"

      - alert: HighLatency
        expr: histogram_quantile(0.99, sum(rate(prediction_latency_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for predictions"
```

Then include this file in your Prometheus configuration under `rule_files`.
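
The `for:` clause is what keeps brief spikes from paging you: the expression must stay true for the whole duration before the alert moves from pending to firing. The simulation below sketches that behaviour under the assumption that rules are evaluated every 60 seconds; `alert_states` is an illustrative function, not part of Prometheus.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each evaluation.

    `samples` is a list of (timestamp_seconds, value) pairs, oldest first.
    """
    states = []
    active_since = None  # time at which the expression first became true
    for t, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = t
            held_for = t - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# Error rate (errors/s) at each 60 s evaluation, checked against HighErrorRate
error_rates = [0.0, 0.02, 0.03, 0.0, 0.02, 0.02, 0.05, 0.04]
samples = [(i * 60, rate) for i, rate in enumerate(error_rates)]
states = alert_states(samples, threshold=0.01, for_seconds=120)
```

The first spike lasts two evaluations and never fires, because the dip at 180 s resets the timer. The second stays above the threshold for 120 s and starts firing at 360 s.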

### 44.5.2 Configuring Alertmanager

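Alertmanager receives alerts from Prometheus and sends notifications. For this to work, Prometheus must list the Alertmanager address under `alerting.alertmanagers` in `prometheus.yml`. A minimal Alertmanager configuration to send to Slack: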

```yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
```

---

## 44.6 Dashboard Design

Dashboards should provide a high‑level overview of system health and drill‑down capabilities. For the NEPSE system, consider these panels:

- **Request Rate** – total predictions per second, coloured by symbol.
- **Latency** – p50, p95, p99 latency over time.
- **Error Rate** – percentage of failed predictions.
- **Model Drift** – plots of feature distributions over time (from logs or feature store).
- **Resource Usage** – CPU, memory of each pod.

Grafana allows you to create variables (e.g., `$symbol`) to filter panels dynamically.

**Example panel for prediction rate by symbol:**  
Query: `sum(rate(predictions_total[$__rate_interval])) by (symbol)`

**Example layout:**  
A line chart with one line per symbol, a latency heatmap, and a table of recent errors.

---

## 44.7 Model Drift Detection

Beyond system health, we must monitor the model itself. **Drift** occurs when the statistical properties of the input data or the relationship between inputs and outputs change over time. Two main types:

- **Data drift**: The distribution of input features shifts (e.g., average traded volume changes).
- **Concept drift**: The relationship between features and target changes (e.g., previously strong indicators become weak).

### 44.7.1 Detecting Data Drift

We can compare the current feature distribution with a reference distribution (e.g., training data) using statistical tests.

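**Example using the `evidently` library** (the imports follow its older `Report` API; newer releases have reorganised these modules, so adjust them to your installed version):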

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Reference data (training set)
reference = pd.read_csv('nepse_training_features.csv')

# Current production data (last 1000 predictions)
# get_production_features is a placeholder for your own data-access helper
current = get_production_features(1000)

# Generate drift report
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html('drift_report.html')
```

Evidently computes drift for each feature using statistical tests (e.g., Kolmogorov‑Smirnov for numerical, chi‑square for categorical) and produces an HTML report.
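
To show what such a test measures, here is a dependency-free sketch of the two-sample Kolmogorov‑Smirnov statistic for a single numerical feature. It is not Evidently's implementation, and the decision rule uses the large-sample critical value for a 5% significance level (coefficient 1.358), which is an approximation.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical cumulative distribution functions."""
    ref, cur = sorted(reference), sorted(current)
    d = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at a sample point
    for x in ref + cur:
        f_ref = bisect_right(ref, x) / len(ref)  # share of reference values <= x
        f_cur = bisect_right(cur, x) / len(cur)  # share of current values <= x
        d = max(d, abs(f_ref - f_cur))
    return d


def drift_detected(reference, current, c_alpha=1.358):
    """Flag drift when the KS statistic exceeds the approximate critical value."""
    n, m = len(reference), len(current)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


# Synthetic "traded volume" values: the current sample is shifted upwards
reference_volume = [float(v) for v in range(100)]
current_volume = [v + 50.0 for v in reference_volume]
shifted = drift_detected(reference_volume, current_volume)
```

A statistic of 0 means the two samples have identical empirical distributions and 1 means they do not overlap at all. With large samples even tiny shifts become statistically significant, so in practice it is worth pairing the test with a minimum effect size before raising an alert.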

### 44.7.2 Detecting Concept Drift

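Concept drift is harder to detect because it requires ground truth labels, which may be delayed (e.g., next day’s actual price). A fixed held‑out validation set cannot reveal a relationship that changes after training, so the practical signal is prediction error on production data, computed as labels become available and compared against the error measured at validation time.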

**Implementation with Prometheus:**  
Emit a metric `prediction_error` that records the absolute error when the true value is known. Then alert if the rolling average exceeds a threshold.

```python
from prometheus_client import Histogram

ERROR = Histogram('prediction_error', 'Prediction absolute error', buckets=[0.01, 0.05, 0.1, 0.2, 0.5])

# ... later, when the true value arrives (e.g., next day) ...
error = abs(true_value - predicted_value)
ERROR.observe(error)
```

Then create an alert if the mean absolute error over the last day is too high:

```yaml
- alert: HighPredictionError
  expr: rate(prediction_error_sum[1d]) / rate(prediction_error_count[1d]) > 0.1
  for: 1h
```
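
Dividing the rate of `_sum` by the rate of `_count` yields the mean of the observations made inside the window, because both series are ever-growing totals. The sketch below reproduces that arithmetic on hypothetical scrape snapshots; `windowed_mean_error` is an illustrative helper, and it ignores counter resets and the extrapolation that `rate()` performs.

```python
def windowed_mean_error(snapshots, window_seconds):
    """Mean absolute error of observations made within the window.

    `snapshots` is a list of (timestamp, error_sum, error_count), oldest first,
    mirroring the prediction_error_sum and prediction_error_count series.
    """
    t_end, sum_end, count_end = snapshots[-1]
    # Earliest snapshot that still falls inside the window
    _, sum_start, count_start = next(
        s for s in snapshots if s[0] >= t_end - window_seconds
    )
    new_observations = count_end - count_start
    if new_observations == 0:
        return None  # no labels arrived; PromQL would yield NaN here
    return (sum_end - sum_start) / new_observations


# Hypothetical scrapes taken one hour apart
snapshots = [(0, 1.0, 20), (3600, 1.5, 30), (7200, 3.5, 40)]
last_hour = windowed_mean_error(snapshots, 3600)   # (3.5 - 1.5) / 10
last_two = windowed_mean_error(snapshots, 7200)    # (3.5 - 1.0) / 20
```

The jump from 0.125 over two hours to 0.2 over the last hour is the kind of trend the alert is meant to catch. When no labels arrive during the window the comparison is never true, so you may want a separate alert for missing ground truth.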

### 44.7.3 Automated Response to Drift

When drift is detected, you may want to trigger actions:

- Log a warning
- Send an alert
- Automatically trigger model retraining
- Roll back to a previous model version

This can be orchestrated with a tool like **Apache Airflow** or **Kubeflow**.
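
Whichever orchestrator you choose, it helps to keep the decision logic in one small, testable function. The sketch below shows one possible escalation policy; the `DriftSignal` fields, the action names, and all thresholds are hypothetical and should be tuned to your own system.

```python
from dataclasses import dataclass


@dataclass
class DriftSignal:
    drifted_feature_share: float  # fraction of features flagged by the drift report
    mean_abs_error: float         # recent mean absolute prediction error


def choose_actions(signal, share_limit=0.3, error_limit=0.1, rollback_limit=0.25):
    """Map drift measurements to an ordered list of actions (illustrative policy)."""
    actions = []
    if signal.drifted_feature_share > share_limit:
        # Inputs have shifted, but accuracy may still be acceptable
        actions.append("log_warning")
    if signal.mean_abs_error > error_limit:
        # Accuracy has degraded: notify people and refresh the model
        actions += ["send_alert", "trigger_retraining"]
    if signal.mean_abs_error > rollback_limit:
        # Severe degradation: stop serving the current model version
        actions.append("rollback_model")
    return actions


actions = choose_actions(DriftSignal(drifted_feature_share=0.5, mean_abs_error=0.15))
```

An orchestrator task can call `choose_actions` after each drift report and dispatch on the returned names, which keeps the policy easy to review and unit-test independently of the pipeline.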

---

## 44.8 Putting It All Together

A complete observability stack for the NEPSE prediction system might look like this:

- **Prometheus** for metrics (application and infrastructure) and alert rules, with **Alertmanager** for routing notifications.
- **Grafana** for dashboards (visualisation).
- **Loki** for log aggregation (or Elasticsearch).
- **Jaeger** for distributed tracing.
- **Evidently** (or custom scripts) for periodic drift analysis.

All components can run in Kubernetes using Helm charts (e.g., kube‑prometheus‑stack, loki‑stack, jaeger‑operator).

---

## Chapter Summary

In this chapter, we built a comprehensive observability framework for our NEPSE real‑time prediction system. We covered:

- The three pillars of observability—metrics, logs, traces—and why each is necessary.
- Instrumenting a Python FastAPI service with Prometheus metrics, including custom counters and histograms.
- Configuring Prometheus to scrape metrics and Grafana to visualise them.
- Implementing structured logging with `structlog` and sending logs to a central aggregator.
- Adding distributed tracing with OpenTelemetry and Jaeger.
- Setting up alerting rules in Prometheus and routing notifications via Alertmanager.
- Detecting data drift and concept drift using statistical tests and error monitoring.
- Designing dashboards that give both operational and business visibility.

With these tools in place, you can ensure that your NEPSE prediction system remains reliable, performant, and accurate over time. In the next chapter, we will explore **Model Drift Detection** in greater depth, focusing on automated retraining strategies to keep your model fresh.

---

**End of Chapter 44**