A production-grade, fully containerised infrastructure observability system that ingests, processes, and visualises real-time server metrics using Kafka, PySpark Structured Streaming, TimescaleDB, and a FastAPI dashboard.
┌─────────────────────────────────────────────────────────────────────────┐
│ Docker Network: observability-net │
│ │
│ ┌──────────────────┐ ┌─────────────────────────────────────────┐ │
│ │ metrics-producer│ │ Kafka Cluster │ │
│ │ (FastAPI + bg │─────▶│ Broker (bitnami/kafka) │ │
│ │ thread) │ │ Zookeeper (bitnami/zookeeper) │ │
│ │ port: 8000 │ │ Topic: infrastructure-metrics │ │
│ └──────────────────┘ └───────────────────┬─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ spark-consumer │ │
│ │ PySpark Structured Streaming │ │
│ │ • Raw metrics → TimescaleDB │ │
│ │ • 5-min windowed aggregations │ │
│ │ • Z-score anomaly detection │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ TimescaleDB │ │
│ │ (PostgreSQL + TimescaleDB ext) │ │
│ │ • raw_metrics (hypertable) │ │
│ │ • aggregated_metrics (hypertable) │ │
│ │ • anomalies │ │
│ └──────────────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ dashboard (FastAPI) │ │
│ │ REST APIs + Jinja2 template │ │
│ │ Chart.js line charts │ │
│ │ Auto-refresh every 5 seconds │ │
│ │ port: 8080 │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
| Requirement | Minimum | Notes |
|---|---|---|
| Docker | 24+ | Docker Desktop or Docker CE |
| Docker Compose | v2.20+ | Included with Docker Desktop |
| RAM | ≥ 6 GB | Spark needs headroom |
| Disk | ≥ 3 GB free | For Docker images |
# 1. Clone / create the project directory
cd cloud-observability
# 2. Build and start all services
docker-compose up --build
# 3. Open the dashboard
open http://localhost:8080

Note: The Spark consumer waits ~45 seconds before connecting to allow Kafka to fully elect leaders. The first metrics appear on the dashboard within ~90 seconds of startup.
| Service | URL / Address |
|---|---|
| Dashboard | http://localhost:8080 |
| Metrics Producer | http://localhost:8000/health |
| Kafka | kafka:9092 (internal network only) |
| TimescaleDB | localhost:5432 (user: admin) |
A FastAPI application that runs a background thread publishing simulated metrics every 2 seconds. It models 5 servers with distinct CPU/memory/disk baselines, sine wave load patterns, and Gaussian noise. Every ~2 minutes it injects a deliberate anomaly spike (2.5× normal) on a randomly chosen server and metric.
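
The signal described above (baseline plus sine-wave load plus Gaussian noise, with an occasional 2.5× spike) can be sketched roughly as follows. This is a minimal illustration, not the producer's actual code: the function name `simulate_reading`, the 10-minute sine period, the amplitude, the noise level, and the clamping to 0–100 are all assumptions made for the example.

```python
import math
import random

ANOMALY_MULTIPLIER = 2.5  # spike size stated in the description above


def simulate_reading(baseline, t_seconds, rng, period_s=600.0,
                     amplitude=10.0, noise_sd=2.0, anomaly=False):
    """Return one simulated utilisation percentage, clamped to 0-100.

    period_s, amplitude and noise_sd are illustrative defaults only.
    """
    if anomaly:
        # Injected spike: 2.5x the server's baseline for this metric.
        value = baseline * ANOMALY_MULTIPLIER
    else:
        wave = amplitude * math.sin(2 * math.pi * t_seconds / period_s)
        value = baseline + wave + rng.gauss(0.0, noise_sd)
    return round(min(max(value, 0.0), 100.0), 2)


rng = random.Random(42)  # seeded so repeated runs give the same values
normal = simulate_reading(30.0, t_seconds=150, rng=rng)
spike = simulate_reading(30.0, t_seconds=150, rng=rng, anomaly=True)
```

With a hypothetical CPU baseline of 30%, the spike value is 75%. A baseline above 40% would be clamped at 100% under this sketch.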
Apache Kafka acts as the message bus. All metric events flow through the infrastructure-metrics topic, and Zookeeper handles cluster coordination. Both services run from the Bitnami images (bitnami/kafka and bitnami/zookeeper).
A PySpark Structured Streaming job that micro-batches every 5 seconds:
- Raw persistence: each JSON event is written directly to raw_metrics.
- Windowed aggregation: 5-minute windows are accumulated per server; averages are flushed to aggregated_metrics.
- Anomaly detection: Z-scores are computed per server+metric using the last 50 readings.
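
The windowed-aggregation step can be illustrated without Spark. The sketch below is plain Python, not the PySpark job itself. It assumes tumbling (non-overlapping) windows aligned to the Unix epoch, and event fields named `server_id`, `timestamp` (epoch seconds), `cpu` and `memory`; these names are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute windows, as described above


def window_start(epoch_seconds):
    """Floor a timestamp to the start of its 5-minute tumbling window."""
    return int(epoch_seconds) - int(epoch_seconds) % WINDOW_SECONDS


def aggregate(events):
    """Average CPU and memory per (server, window). Field names are assumed."""
    sums = defaultdict(lambda: {"cpu": 0.0, "memory": 0.0, "count": 0})
    for event in events:
        key = (event["server_id"], window_start(event["timestamp"]))
        bucket = sums[key]
        bucket["cpu"] += event["cpu"]
        bucket["memory"] += event["memory"]
        bucket["count"] += 1
    return {
        key: {
            "avg_cpu": b["cpu"] / b["count"],
            "avg_memory": b["memory"] / b["count"],
        }
        for key, b in sums.items()
    }


events = [
    {"server_id": "s1", "timestamp": 0, "cpu": 10.0, "memory": 20.0},
    {"server_id": "s1", "timestamp": 299, "cpu": 30.0, "memory": 40.0},
    {"server_id": "s1", "timestamp": 300, "cpu": 50.0, "memory": 60.0},
]
result = aggregate(events)
```

The first two events fall in the window starting at 0 and are averaged together. The third starts a new window at 300.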
PostgreSQL 15 with the TimescaleDB extension. The raw_metrics and aggregated_metrics tables are hypertables chunked by day for fast time-series queries. Seed data is inserted at init so the dashboard is never empty on first load.
FastAPI serving a single-page dark-theme dashboard:
- Server cards — live CPU/Memory/Disk bars, colour-coded by severity (green/yellow/red)
- CPU chart — Chart.js line chart, all 5 servers, last 30 minutes
- Anomaly feed — scrollable list of recent anomalies with severity badges
- Cost panel — estimated hourly cost per server (avg metrics × unit price)
The Z-score measures how many standard deviations a reading is from the recent mean:
z = (x - μ) / σ
Where:
- x = current metric value
- μ = mean of the last 50 readings for that server+metric
- σ = standard deviation of the last 50 readings

| Absolute z-score range | Severity |
|---|---|
| 2.5 – 3.5 | WARNING |
| > 3.5 | CRITICAL |
At least 10 readings are required before anomaly detection activates (cold-start protection). Injected anomalies (2.5× base value) reliably push z-scores into the CRITICAL range.
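
The rules above (50-reading window, 10-reading cold start, WARNING and CRITICAL thresholds) could be expressed as in the sketch below. It is an illustration under stated assumptions rather than the consumer's actual code: it uses the population standard deviation, compares each new reading against the window before adding it, and skips detection when σ is zero. The names `check` and `classify` are hypothetical.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

WINDOW_SIZE = 50    # readings kept per server+metric
MIN_READINGS = 10   # cold-start protection
WARNING_Z = 2.5
CRITICAL_Z = 3.5

# One bounded history per (server, metric); old readings drop off automatically.
history = defaultdict(lambda: deque(maxlen=WINDOW_SIZE))


def classify(z):
    """Map a z-score to a severity label, or None if it is within bounds."""
    magnitude = abs(z)
    if magnitude > CRITICAL_Z:
        return "CRITICAL"
    if magnitude >= WARNING_Z:
        return "WARNING"
    return None


def check(server, metric, value):
    """Return (z, severity) if the reading is anomalous, otherwise None."""
    window = history[(server, metric)]
    result = None
    if len(window) >= MIN_READINGS:
        mu = mean(window)
        sigma = pstdev(window)  # assumption: population standard deviation
        if sigma > 0:           # a flat history gives no usable z-score
            z = (value - mu) / sigma
            severity = classify(z)
            if severity is not None:
                result = (z, severity)
    window.append(value)
    return result
```

For example, after ten readings alternating between 49 and 51 (μ = 50, σ = 1), a reading of 53 gives z = 3 and is flagged WARNING, while a reading of 60 gives z = 10 and is flagged CRITICAL.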
hourly_cost = (avg_cpu% × $0.048) + (avg_memory% × $0.006)
This is a simplified model treating CPU utilisation percentage as a billable unit (e.g., vCPU-hours) and memory percentage as a proxy for GB/hour. Real implementations would map these to instance types, reserved vs on-demand pricing, etc.
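
As a worked example of the formula, the sketch below assumes the percentages are expressed on a 0–100 scale (so 40% is passed as 40, not 0.40). The README does not state the scale explicitly, and the function name `hourly_cost` is chosen for illustration.

```python
CPU_RATE_USD = 0.048     # per CPU percentage point, from the formula above
MEMORY_RATE_USD = 0.006  # per memory percentage point, from the formula above


def hourly_cost(avg_cpu_pct, avg_memory_pct):
    """Estimated hourly cost in USD; inputs assumed to be on a 0-100 scale."""
    cost = avg_cpu_pct * CPU_RATE_USD + avg_memory_pct * MEMORY_RATE_USD
    return round(cost, 4)


# 40 * 0.048 + 60 * 0.006 = 1.92 + 0.36 = 2.28
example = hourly_cost(40.0, 60.0)
```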
| Layer | Technology |
|---|---|
| Message Bus | Apache Kafka + Zookeeper (Bitnami) |
| Stream Processing | Apache Spark 3.5 (PySpark) |
| Time-Series DB | TimescaleDB (PostgreSQL 15) |
| API Backend | FastAPI + Uvicorn |
| Templating | Jinja2 |
| Frontend Charts | Chart.js 4.4 |
| DB Driver | psycopg2-binary |
| Containerisation | Docker + Docker Compose v2 |
All credentials and connection settings are set via environment variables in docker-compose.yml:
| Variable | Default |
|---|---|
| POSTGRES_USER | admin |
| POSTGRES_PASSWORD | admin123 |
| POSTGRES_DB | observability |
| KAFKA_BOOTSTRAP_SERVERS | kafka:9092 |
| KAFKA_TOPIC | infrastructure-metrics |
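
A service could read these variables with the defaults from the table along the following lines. This is a hedged sketch: `load_settings` is a hypothetical helper, and the services in this project may read their configuration differently.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "POSTGRES_USER": "admin",
    "POSTGRES_PASSWORD": "admin123",
    "POSTGRES_DB": "observability",
    "KAFKA_BOOTSTRAP_SERVERS": "kafka:9092",
    "KAFKA_TOPIC": "infrastructure-metrics",
}


def load_settings(env=None):
    """Return settings, preferring values in env over the defaults.

    env defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


settings = load_settings({"POSTGRES_PASSWORD": "s3cret"})
```

Here only the password is overridden. Every other key falls back to its default.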
# Stop all containers
docker-compose down
# Stop AND remove volumes (wipes DB data)
docker-compose down -v

Dashboard shows "No data yet"
Wait 90–120 seconds after startup for the full pipeline to warm up.

Spark consumer keeps restarting
It will retry automatically. Kafka sometimes takes 30–60s to elect leaders. The consumer has built-in retry logic.

Port 5432 conflict
If you have a local PostgreSQL running, either stop it or change the TimescaleDB port mapping in docker-compose.yml.