⬡ ObserveOps — Real-Time Cloud Observability & Cost Platform

A production-grade, fully containerised infrastructure observability system that ingests, processes, and visualises real-time server metrics using Kafka, PySpark Structured Streaming, TimescaleDB, and a FastAPI dashboard.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         Docker Network: observability-net                │
│                                                                         │
│  ┌──────────────────┐      ┌─────────────────────────────────────────┐  │
│  │  metrics-producer│      │              Kafka Cluster              │  │
│  │  (FastAPI + bg   │─────▶│   Broker (bitnami/kafka)                │  │
│  │   thread)        │      │   Zookeeper (bitnami/zookeeper)         │  │
│  │  port: 8000      │      │   Topic: infrastructure-metrics         │  │
│  └──────────────────┘      └───────────────────┬─────────────────────┘  │
│                                                │                        │
│                                                ▼                        │
│                             ┌──────────────────────────────────────┐    │
│                             │         spark-consumer               │    │
│                             │  PySpark Structured Streaming        │    │
│                             │  • Raw metrics → TimescaleDB         │    │
│                             │  • 5-min windowed aggregations       │    │
│                             │  • Z-score anomaly detection         │    │
│                             └──────────────────┬───────────────────┘    │
│                                                │                        │
│                                                ▼                        │
│                             ┌──────────────────────────────────────┐    │
│                             │          TimescaleDB                 │    │
│                             │  (PostgreSQL + TimescaleDB ext)      │    │
│                             │  • raw_metrics       (hypertable)    │    │
│                             │  • aggregated_metrics (hypertable)   │    │
│                             │  • anomalies                         │    │
│                             └──────────────────┬───────────────────┘    │
│                                                │                        │
│                                                ▼                        │
│                             ┌──────────────────────────────────────┐    │
│                             │         dashboard (FastAPI)          │    │
│                             │  REST APIs + Jinja2 template         │    │
│                             │  Chart.js line charts                │    │
│                             │  Auto-refresh every 5 seconds        │    │
│                             │  port: 8080                          │    │
│                             └──────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘

Prerequisites

| Requirement | Version / Size | Notes |
|---|---|---|
| Docker | 24+ | Docker Desktop or Docker CE |
| Docker Compose | v2.20+ | Included with Docker Desktop |
| RAM | ≥ 6 GB | Spark needs headroom |
| Disk | ≥ 3 GB free | For Docker images |

Quick Start

# 1. Clone / create the project directory
cd cloud-observability

# 2. Build and start all services
docker compose up --build

# 3. Open the dashboard
open http://localhost:8080   # macOS; on Linux use xdg-open, or paste the URL into a browser

Note: The Spark consumer waits ~45 seconds before connecting to allow Kafka to fully elect leaders. The first metrics appear on the dashboard within ~90 seconds of startup.
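Rather than waiting a fixed time, a script can poll the producer's health endpoint until it responds. The following is a minimal sketch, not part of the project: the helper name `wait_until_healthy`, its parameters, and the assumption that `/health` answers with HTTP 200 once the service is up are all illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url, timeout=120.0, interval=3.0, fetch=urllib.request.urlopen):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse.

    `fetch` is injectable so the helper can be exercised without a live service.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with fetch(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service is not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_healthy("http://localhost:8000/health")` after starting the stack returns `True` as soon as the producer answers, or `False` if it has not come up within two minutes.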

Accessing the Services

| Service | URL / Address |
|---|---|
| Dashboard | http://localhost:8080 |
| Metrics Producer | http://localhost:8000/health |
| Kafka | kafka:9092 (internal network only) |
| TimescaleDB | localhost:5432 (user: admin) |

Service Descriptions

metrics-producer

A FastAPI application that runs a background thread publishing simulated metrics every 2 seconds. It models 5 servers with distinct CPU/memory/disk baselines, sine wave load patterns, and Gaussian noise. Every ~2 minutes it injects a deliberate anomaly spike (2.5× normal) on a randomly chosen server and metric.
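The shape of one simulated reading can be sketched as follows. This is an illustration of the baseline + sine wave + Gaussian noise model described above, not the producer's actual code: the function name, the wave period and amplitude, the noise level, and the clamping to 0–100 are all assumptions.

```python
import math
import random


def simulate_reading(baseline, t_seconds, rng, period_s=600.0,
                     amplitude=10.0, noise_sd=2.0, anomaly=False):
    """Return one simulated utilisation percentage for a server metric."""
    # Slow sine-wave load pattern around the server's baseline (assumed period).
    wave = amplitude * math.sin(2 * math.pi * t_seconds / period_s)
    # Gaussian noise on top of the deterministic part (assumed spread).
    value = baseline + wave + rng.gauss(0.0, noise_sd)
    if anomaly:
        value *= 2.5  # injected spike: 2.5x the normal value
    # Clamping to a valid percentage is an assumption of this sketch.
    return max(0.0, min(100.0, value))


# Usage: one reading every 2 seconds for a hypothetical server with a 35% CPU baseline.
rng = random.Random(42)
readings = [simulate_reading(35.0, t, rng) for t in range(0, 20, 2)]
```

Passing the random generator in explicitly keeps the sketch reproducible, which helps when tuning thresholds against known spikes.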

kafka (+ zookeeper)

Apache Kafka acts as the message bus. All metric events flow through the infrastructure-metrics topic, and Zookeeper manages cluster coordination. Both services run from the Bitnami images (bitnami/kafka and bitnami/zookeeper).

spark-consumer

A PySpark Structured Streaming job that micro-batches every 5 seconds:

- Raw persistence: each JSON event is written directly to raw_metrics.
- Windowed aggregation: 5-minute windows are accumulated per server; averages are flushed to aggregated_metrics.
- Anomaly detection: Z-scores are computed per server+metric using the last 50 readings.
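To make the windowed aggregation step concrete, here is a plain-Python illustration of what a 5-minute tumbling-window average computes. It is not the Spark job itself (in PySpark this grouping is typically expressed with `pyspark.sql.functions.window`), and the event field names `server_id`, `timestamp`, and `cpu` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 300  # 5-minute tumbling windows


def window_start(ts):
    """Floor a timezone-aware timestamp to the start of its 5-minute window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def aggregate(events):
    """Average CPU per (server, window start) over an iterable of event dicts."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for event in events:
        key = (event["server_id"], window_start(event["timestamp"]))
        sums[key][0] += event["cpu"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Each event falls into exactly one window, so a reading at 12:04:58 contributes to the 12:00 window and a reading at 12:05:00 starts the next one.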

timescaledb

PostgreSQL 15 with the TimescaleDB extension. The raw_metrics and aggregated_metrics tables are hypertables chunked by day for fast time-series queries. Seed data is inserted at init so the dashboard is never empty on first load.

dashboard

FastAPI serving a single-page dark-theme dashboard:

- Server cards: live CPU/Memory/Disk bars, colour-coded by severity (green/yellow/red)
- CPU chart: Chart.js line chart, all 5 servers, last 30 minutes
- Anomaly feed: scrollable list of recent anomalies with severity badges
- Cost panel: estimated hourly cost per server (avg metrics × unit price)

How Anomaly Detection Works

The Z-score measures how many standard deviations a reading is from the recent mean:

z = (x - μ) / σ

Where:

x = current metric value
μ = mean of last 50 readings for that server+metric
σ = standard deviation of last 50 readings

| \|z\| range | Severity |
|---|---|
| 2.5 – 3.5 | WARNING |
| > 3.5 | CRITICAL |

At least 10 readings are required before anomaly detection activates (cold-start protection). Injected anomalies (2.5× base value) reliably push z-scores into the CRITICAL range.
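The rules above (50-reading window, 10-reading cold start, WARNING and CRITICAL thresholds) can be sketched in a few lines. This is an illustrative version, not the consumer's code: it assumes the population standard deviation, compares each new reading against the readings that preceded it, and keeps anomalous readings in the history. The function and constant names are hypothetical.

```python
import statistics
from collections import defaultdict, deque

WINDOW = 50         # readings kept per server+metric
MIN_READINGS = 10   # cold-start protection
WARNING_Z = 2.5
CRITICAL_Z = 3.5

_history = defaultdict(lambda: deque(maxlen=WINDOW))


def classify(server, metric, value, history=_history):
    """Return 'WARNING', 'CRITICAL', or None for a new reading."""
    window = history[(server, metric)]
    severity = None
    if len(window) >= MIN_READINGS:
        mean = statistics.fmean(window)
        sd = statistics.pstdev(window)
        if sd > 0:  # a perfectly flat history has no meaningful z-score
            z = abs(value - mean) / sd
            if z > CRITICAL_Z:
                severity = "CRITICAL"
            elif z >= WARNING_Z:
                severity = "WARNING"
    window.append(value)  # deque(maxlen=50) drops the oldest reading automatically
    return severity
```

For example, with a history whose mean is 41 and standard deviation is 1, a reading of 44 gives z = 3.0 (WARNING), while an injected spike of 2.5 × 41 = 102.5 gives z = 61.5 (CRITICAL).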

How Cost Estimation Works

hourly_cost = (avg_cpu% × $0.048) + (avg_memory% × $0.006)

This is a simplified model treating CPU utilisation percentage as a billable unit (e.g., vCPU-hours) and memory percentage as a proxy for GB/hour. Real implementations would map these to instance types, reserved vs on-demand pricing, etc.
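As a worked example of the formula, a server averaging 50% CPU and 60% memory costs 50 × 0.048 + 60 × 0.006 = 2.40 + 0.36 = 2.76 dollars per hour. A minimal sketch of the same calculation follows; the function and constant names are illustrative, and only the two rates come from the formula above.

```python
CPU_RATE = 0.048     # dollars per CPU percentage point per hour
MEMORY_RATE = 0.006  # dollars per memory percentage point per hour


def hourly_cost(avg_cpu_pct, avg_memory_pct):
    """Estimated hourly cost in dollars, rounded to 4 decimal places."""
    return round(avg_cpu_pct * CPU_RATE + avg_memory_pct * MEMORY_RATE, 4)
```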

Tech Stack

| Layer | Technology |
|---|---|
| Message Bus | Apache Kafka + Zookeeper (Bitnami) |
| Stream Processing | Apache Spark 3.5 (PySpark) |
| Time-Series DB | TimescaleDB (PostgreSQL 15) |
| API Backend | FastAPI + Uvicorn |
| Templating | Jinja2 |
| Frontend Charts | Chart.js 4.4 |
| DB Driver | psycopg2-binary |
| Containerisation | Docker + Docker Compose v2 |

Configuration

Credentials and connection settings are set via environment variables in docker-compose.yml:

| Variable | Default |
|---|---|
| POSTGRES_USER | admin |
| POSTGRES_PASSWORD | admin123 |
| POSTGRES_DB | observability |
| KAFKA_BOOTSTRAP_SERVERS | kafka:9092 |
| KAFKA_TOPIC | infrastructure-metrics |
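A service could read these variables with the documented defaults as fallbacks, roughly as below. This is a sketch under assumptions: the helper names are hypothetical, and the database host `timescaledb` is assumed to match the service name on the Docker network. Only the variable names, defaults, and port 5432 come from this README.

```python
import os
from urllib.parse import quote

DEFAULTS = {
    "POSTGRES_USER": "admin",
    "POSTGRES_PASSWORD": "admin123",
    "POSTGRES_DB": "observability",
    "KAFKA_BOOTSTRAP_SERVERS": "kafka:9092",
    "KAFKA_TOPIC": "infrastructure-metrics",
}


def load_settings(env=None):
    """Read each variable from the environment, falling back to its default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


def postgres_dsn(settings, host="timescaledb", port=5432):
    """Build a PostgreSQL connection URI; `host` is an assumed service name."""
    user = quote(settings["POSTGRES_USER"], safe="")
    password = quote(settings["POSTGRES_PASSWORD"], safe="")  # escape special characters
    return f"postgresql://{user}:{password}@{host}:{port}/{settings['POSTGRES_DB']}"
```

Overriding a value is then a matter of setting the variable in docker-compose.yml or the shell before the service starts.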

Stopping the Platform

# Stop all containers
docker compose down

# Stop AND remove volumes (wipes DB data)
docker compose down -v

Troubleshooting

Dashboard shows "No data yet": wait 90–120 seconds after startup for the full pipeline to warm up.

Spark consumer keeps restarting: Kafka sometimes takes 30–60 seconds to elect leaders. The consumer has built-in retry logic and reconnects automatically once the broker is ready.

Port 5432 conflict: if a local PostgreSQL instance is already running, either stop it or change the TimescaleDB port mapping in docker-compose.yml.
