AsMetOP/cloud-observability-platform

⬡ ObserveOps — Real-Time Cloud Observability & Cost Platform

A production-grade, fully containerised infrastructure observability system that ingests, processes, and visualises real-time server metrics using Kafka, PySpark Structured Streaming, TimescaleDB, and a FastAPI dashboard.


Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         Docker Network: observability-net                │
│                                                                         │
│  ┌──────────────────┐      ┌─────────────────────────────────────────┐  │
│  │  metrics-producer│      │              Kafka Cluster              │  │
│  │  (FastAPI + bg   │─────▶│   Broker (bitnami/kafka)                │  │
│  │   thread)        │      │   Zookeeper (bitnami/zookeeper)         │  │
│  │  port: 8000      │      │   Topic: infrastructure-metrics         │  │
│  └──────────────────┘      └───────────────────┬─────────────────────┘  │
│                                                │                        │
│                                                ▼                        │
│                             ┌──────────────────────────────────────┐    │
│                             │         spark-consumer               │    │
│                             │  PySpark Structured Streaming        │    │
│                             │  • Raw metrics → TimescaleDB         │    │
│                             │  • 5-min windowed aggregations       │    │
│                             │  • Z-score anomaly detection         │    │
│                             └──────────────────┬───────────────────┘    │
│                                                │                        │
│                                                ▼                        │
│                             ┌──────────────────────────────────────┐    │
│                             │          TimescaleDB                 │    │
│                             │  (PostgreSQL + TimescaleDB ext)      │    │
│                             │  • raw_metrics       (hypertable)    │    │
│                             │  • aggregated_metrics (hypertable)   │    │
│                             │  • anomalies                         │    │
│                             └──────────────────┬───────────────────┘    │
│                                                │                        │
│                                                ▼                        │
│                             ┌──────────────────────────────────────┐    │
│                             │         dashboard (FastAPI)          │    │
│                             │  REST APIs + Jinja2 template         │    │
│                             │  Chart.js line charts                │    │
│                             │  Auto-refresh every 5 seconds        │    │
│                             │  port: 8080                          │    │
│                             └──────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘

Prerequisites

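| Tool | Version | Notes |
|----------------|-------------|------------------------------|
| Docker | 24+ | Docker Desktop or Docker CE |
| Docker Compose | v2.20+ | Included with Docker Desktop |
| RAM | ≥ 6 GB | Spark needs headroom |
| Disk | ≥ 3 GB free | For Docker images |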

Quick Start

# 1. Clone / create the project directory
cd cloud-observability

# 2. Build and start all services
docker compose up --build

# 3. Open the dashboard (open is macOS-only; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:8080

Note: The Spark consumer waits ~45 seconds before connecting to allow Kafka to fully elect leaders. The first metrics appear on the dashboard within ~90 seconds of startup.


Accessing the Services

| Service | URL / Address |
|------------------|------------------------------------|
| Dashboard | http://localhost:8080 |
| Metrics Producer | http://localhost:8000/health |
| Kafka | kafka:9092 (internal network only) |
| TimescaleDB | localhost:5432 (user: admin) |

Service Descriptions

metrics-producer

A FastAPI application that runs a background thread publishing simulated metrics every 2 seconds. It models 5 servers with distinct CPU/memory/disk baselines, sine wave load patterns, and Gaussian noise. Every ~2 minutes it injects a deliberate anomaly spike (2.5× normal) on a randomly chosen server and metric.
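The generator described above can be sketched roughly as follows. This is an illustration only, not the producer's actual code: the server names, baselines, sine period, amplitude, noise level, event field names, and the clamping to 0-100 are all assumptions. Only the 5 servers and the 2.5× spike factor come from the description.

```python
import math
import random

SERVERS = [f"server-{i}" for i in range(1, 6)]  # 5 servers; names are assumed
BASELINES = {  # assumed per-server baselines, in percent
    name: {"cpu": 20.0 + 8 * i, "memory": 35.0 + 6 * i, "disk": 40.0 + 5 * i}
    for i, name in enumerate(SERVERS)
}
PERIOD_S = 600.0    # assumed length of one sine-wave load cycle, in seconds
AMPLITUDE = 10.0    # assumed swing around the baseline, in percentage points
NOISE_SIGMA = 2.0   # assumed Gaussian noise, in percentage points
SPIKE_FACTOR = 2.5  # from the description: spikes are 2.5x normal


def simulate_reading(server, metric, t, rng, spike=False):
    """Return one simulated value (0-100 %) for a server/metric at time t seconds."""
    base = BASELINES[server][metric]
    wave = AMPLITUDE * math.sin(2 * math.pi * t / PERIOD_S)
    value = base + wave + rng.gauss(0.0, NOISE_SIGMA)
    if spike:
        value = base * SPIKE_FACTOR  # deliberate anomaly replaces the normal value
    return round(min(max(value, 0.0), 100.0), 2)  # clamping is an assumption


def build_event(server, t, rng, spike_metric=None):
    """Assemble one JSON-serialisable event; the field names are assumptions."""
    event = {"server_id": server, "timestamp": t}
    for metric in ("cpu", "memory", "disk"):
        event[metric] = simulate_reading(
            server, metric, t, rng, spike=(metric == spike_metric)
        )
    return event
```

Calling `build_event("server-1", 0.0, random.Random(0), spike_metric="cpu")` yields an event whose `cpu` field is the spiked value, while `memory` and `disk` follow the baseline-plus-noise pattern.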

kafka (+ zookeeper)

Apache Kafka acts as the message bus. All metric events flow through the infrastructure-metrics topic, and Zookeeper manages cluster coordination. Both services run from the Bitnami images.

spark-consumer

A PySpark Structured Streaming job that micro-batches every 5 seconds:

  1. Raw persistence — each JSON event is written directly to raw_metrics.
  2. Windowed aggregation — 5-minute windows are accumulated per server; averages are flushed to aggregated_metrics.
  3. Anomaly detection — Z-scores are computed per server+metric using the last 50 readings.
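The windowed aggregation in step 2 amounts to bucketing events into 5-minute tumbling windows per server and averaging each metric. The sketch below shows those semantics in plain Python rather than PySpark, so it is not the consumer's code. The event field names (`server_id`, `timestamp` in epoch seconds, `cpu`, `memory`, `disk`) and the output column names are assumptions.

```python
from collections import defaultdict

WINDOW_S = 300  # 5-minute tumbling window, as described above


def window_start(ts):
    """Floor an epoch-seconds timestamp to the start of its 5-minute window."""
    return int(ts // WINDOW_S) * WINDOW_S


def aggregate(events, metrics=("cpu", "memory", "disk")):
    """Average each metric per (server, window). Field names are assumed."""
    buckets = defaultdict(list)
    for event in events:
        key = (event["server_id"], window_start(event["timestamp"]))
        buckets[key].append(event)

    rows = []
    for (server, start), group in sorted(buckets.items()):
        row = {"server_id": server, "window_start": start, "count": len(group)}
        for metric in metrics:
            row[f"avg_{metric}"] = sum(e[metric] for e in group) / len(group)
        rows.append(row)
    return rows
```

Events with timestamps 10 and 299 land in the window starting at 0, while an event at 300 opens the next window. In Spark the same grouping would normally be expressed with a window function over the event-time column.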

timescaledb

PostgreSQL 15 with the TimescaleDB extension. The raw_metrics and aggregated_metrics tables are hypertables chunked by day for fast time-series queries. Seed data is inserted at init so the dashboard is never empty on first load.

dashboard

FastAPI serving a single-page dark-theme dashboard:

  • Server cards — live CPU/Memory/Disk bars, colour-coded by severity (green/yellow/red)
  • CPU chart — Chart.js line chart, all 5 servers, last 30 minutes
  • Anomaly feed — scrollable list of recent anomalies with severity badges
  • Cost panel — estimated hourly cost per server (avg metrics × unit price)

How Anomaly Detection Works

The Z-score measures how many standard deviations a reading is from the recent mean:

z = (x - μ) / σ

Where:

  • x = current metric value
  • μ = mean of last 50 readings for that server+metric
  • σ = standard deviation of last 50 readings

| Absolute z-score | Severity |
|------------------|----------|
| 2.5 – 3.5 | WARNING |
| > 3.5 | CRITICAL |

At least 10 readings are required before anomaly detection activates (cold-start protection). Injected anomalies (2.5× base value) reliably push z-scores into the CRITICAL range.
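A minimal sketch of this rule, using only the Python standard library, might look like the following. It is not the consumer's implementation. It assumes the population standard deviation, compares the new reading against the window before adding it, and treats a score of exactly 3.5 as WARNING. The description does not specify any of these details.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

WINDOW = 50        # last 50 readings, as described above
MIN_READINGS = 10  # cold-start protection, as described above
WARNING_Z = 2.5
CRITICAL_Z = 3.5

# One rolling window per (server, metric) pair.
history = defaultdict(lambda: deque(maxlen=WINDOW))


def check(server, metric, value):
    """Return (z, severity) for a new reading; both are None when nothing is scored."""
    readings = history[(server, metric)]
    z, severity = None, None
    if len(readings) >= MIN_READINGS:
        mu, sigma = mean(readings), pstdev(readings)
        if sigma > 0:  # a flat series has no meaningful z-score
            z = (value - mu) / sigma
            if abs(z) > CRITICAL_Z:
                severity = "CRITICAL"
            elif abs(z) >= WARNING_Z:
                severity = "WARNING"
    readings.append(value)  # the new reading joins the window after scoring
    return z, severity
```

With ten readings alternating between 49 and 51 (mean 50, standard deviation 1), a reading of 53 scores z = 3.0 and is flagged WARNING, while a reading of 60 scores z = 10.0 and is flagged CRITICAL.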


How Cost Estimation Works

hourly_cost = (avg_cpu% × $0.048) + (avg_memory% × $0.006)

This is a simplified model treating CPU utilisation percentage as a billable unit (e.g., vCPU-hours) and memory percentage as a proxy for GB/hour. Real implementations would map these to instance types, reserved vs on-demand pricing, etc.
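As a worked example, the formula can be written as a small function. The sketch assumes the averages are expressed on a 0-100 percentage scale, which the formula does not state explicitly. The function and constant names are hypothetical.

```python
CPU_RATE = 0.048     # dollars per CPU percentage point, from the formula above
MEMORY_RATE = 0.006  # dollars per memory percentage point, from the formula above


def hourly_cost(avg_cpu_pct, avg_memory_pct):
    """Estimated hourly cost in dollars; inputs are assumed to be on a 0-100 scale."""
    return round(avg_cpu_pct * CPU_RATE + avg_memory_pct * MEMORY_RATE, 4)


# 40 % CPU and 60 % memory: 40 * 0.048 + 60 * 0.006 = 1.92 + 0.36
print(hourly_cost(40, 60))  # 2.28
```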


Tech Stack

| Layer | Technology |
|-------------------|------------------------------------|
| Message Bus | Apache Kafka + Zookeeper (Bitnami) |
| Stream Processing | Apache Spark 3.5 (PySpark) |
| Time-Series DB | TimescaleDB (PostgreSQL 15) |
| API Backend | FastAPI + Uvicorn |
| Templating | Jinja2 |
| Frontend Charts | Chart.js 4.4 |
| DB Driver | psycopg2-binary |
| Containerisation | Docker + Docker Compose v2 |

Configuration

Credentials and connection settings are set via environment variables in docker-compose.yml:

| Variable | Default |
|-------------------------|------------------------|
| POSTGRES_USER | admin |
| POSTGRES_PASSWORD | admin123 |
| POSTGRES_DB | observability |
| KAFKA_BOOTSTRAP_SERVERS | kafka:9092 |
| KAFKA_TOPIC | infrastructure-metrics |
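A service could read these variables along the following lines. This is a sketch, not the project's code: the function name `load_settings` and the dictionary keys are hypothetical, and the fallback values simply mirror the defaults listed above.

```python
import os


def load_settings(env=None):
    """Read connection settings from the environment, falling back to the listed defaults."""
    env = os.environ if env is None else env
    return {
        "postgres_user": env.get("POSTGRES_USER", "admin"),
        "postgres_password": env.get("POSTGRES_PASSWORD", "admin123"),
        "postgres_db": env.get("POSTGRES_DB", "observability"),
        "kafka_bootstrap_servers": env.get("KAFKA_BOOTSTRAP_SERVERS", "kafka:9092"),
        "kafka_topic": env.get("KAFKA_TOPIC", "infrastructure-metrics"),
    }
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in isolation, for example `load_settings({"KAFKA_TOPIC": "test-metrics"})`.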

Stopping the Platform

# Stop all containers
docker compose down

# Stop AND remove volumes (wipes DB data)
docker compose down -v

Troubleshooting

Dashboard shows "No data yet"

Wait 90–120 seconds after startup for the full pipeline to warm up.

Spark consumer keeps restarting

Kafka sometimes takes 30–60 seconds to elect leaders. The consumer has built-in retry logic and will reconnect automatically.

Port 5432 conflict

If you have a local PostgreSQL running, either stop it or change the TimescaleDB port mapping in docker-compose.yml.
