feature request: Add proper monitoring stack #122

@simonvanlierde

Feature: Observability & Monitoring Stack

Background

RELab has no active monitoring. Issues are discovered reactively — someone notices the app is broken, then we SSH in and run docker logs -f. There is no visibility into performance degradation, rising error rates, or infrastructure pressure before they become user-facing problems.

Goal

Implement observability covering logs, metrics, and traces, with alerting, so that production health is visible at a glance and problems are surfaced proactively.

This is split into two phases. V1 is fast to ship: two lightweight self-hosted tools with minimal infrastructure. V2 extends or replaces it with a full self-hosted Grafana stack.


V1 — Lightweight self-hosted (two new Docker services)

Stack: Dozzle · Uptime Kuma

Both are open-source, single-container, near-zero config, and use negligible resources.

Dozzle — log viewer

A read-only web UI for Docker container logs. No storage, no agents, no config — just mount the Docker socket. Replaces docker logs -f with a browser UI that supports multi-container tailing, search, and filtering.

Uptime Kuma — uptime & alerting

Point it at the existing /live healthcheck endpoint of each service. This gives:

  • Uptime monitoring with a status page
  • Alerts via email, Slack, Discord, ntfy, or webhook on downtime

Both services should be added to compose.prod.yml and exposed via the Cloudflare Tunnel with access control.
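
For intuition about what the uptime monitor does with /live, here is a minimal sketch in Python. It is not Uptime Kuma's code: the function names, the 5-second timeout, and the three-failure threshold are assumptions chosen for illustration.

```python
import urllib.error
import urllib.request


def is_live(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP 4xx/5xx, connection refused, DNS failures and timeouts.
        return False


def should_alert(history: list[bool], failures_needed: int = 3) -> bool:
    """Alert only when the last `failures_needed` checks all failed.

    Requiring consecutive failures keeps one dropped request from paging anyone.
    """
    if len(history) < failures_needed:
        return False
    return not any(history[-failures_needed:])
```

Calling `is_live` on a fixed interval and feeding the results to `should_alert` is roughly the behaviour the V1 acceptance criteria ask for; the real tool adds the status page and the notification channels on top.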

V1 Acceptance Criteria

  • Dozzle is accessible in production and shows live logs for all containers
  • Uptime Kuma monitors the backend /live endpoint and sends an alert on failure
  • No SSH required to see what is happening in production

V2 — Self-hosted Grafana LGTM Stack (full observability)

Stack: Prometheus · Loki · Tempo · Grafana · Grafana Alloy

All components are open-source and run on the existing server as additional Docker Compose services. Grafana is exposed via the existing Cloudflare Tunnel (with access control).

Backend instrumentation

  • OpenTelemetry auto-instrumentation for FastAPI, SQLAlchemy, and Redis. Export traces to Tempo via OTLP.
  • Prometheus /metrics endpoint (request rate, latency histograms, error rate per endpoint).
  • Trace IDs injected into loguru records for log-trace correlation.
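
A minimal sketch of the log-trace correlation bullet, assuming the `loguru` and `opentelemetry-api` packages are installed. The helper names `trace_fields` and `install_trace_patcher` are hypothetical, and the real backend may wire this differently.

```python
def trace_fields(trace_id: int, span_id: int) -> dict[str, str]:
    """Render OpenTelemetry integer IDs as lowercase hex (32 and 16 characters)."""
    return {"trace_id": f"{trace_id:032x}", "span_id": f"{span_id:016x}"}


def install_trace_patcher() -> None:
    """Attach the active trace and span IDs to every loguru record's `extra` dict."""
    # Imported lazily so trace_fields stays usable without these packages.
    from loguru import logger
    from opentelemetry import trace

    def add_trace_context(record) -> None:
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:  # False when no span is active
            record["extra"].update(trace_fields(ctx.trace_id, ctx.span_id))

    # Empty defaults keep a "{extra[trace_id]}" format string safe outside a span.
    logger.configure(
        patcher=add_trace_context,
        extra={"trace_id": "", "span_id": ""},
    )
```

With the IDs present in each log record, Grafana can be configured to link a Loki log line to its trace in Tempo, and the reverse.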

New Compose services

Prometheus, Loki, Tempo, Grafana, Grafana Alloy, postgres_exporter — all on an internal network.

Alloy tails Docker log streams → Loki and scrapes host metrics (CPU, memory, disk, network).

Dashboards (provisioned as JSON, no manual setup)

  • Service health — request rate, error rate, p95/p99 latency per endpoint
  • Infrastructure — CPU, memory, disk, network per container and host
  • Database & cache — Postgres connection pool, Redis hit/miss ratio
  • Logs — Loki panels per service with level breakdown
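
The p95/p99 panels would be derived from latency histograms rather than raw samples. The sketch below shows how a quantile can be estimated from cumulative buckets by linear interpolation, which is the general idea behind Prometheus's `histogram_quantile`. It is an illustration, not Prometheus's implementation, and the bucket boundaries in the usage example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile `q` from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][1] == 0:
        return math.nan  # no observations yet

    rank = q * buckets[-1][1]  # position of the target observation
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Target lies beyond the last finite bucket: report that bound.
                return prev_bound
            span = count - prev_count
            if span == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable for well-formed input


# Hypothetical latency buckets in seconds: (upper bound, cumulative count).
example = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

The estimate is only as precise as the bucket boundaries, so buckets near the 2s alert threshold matter more than the exact choice elsewhere.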

Alerting

Notify via email or webhook on:

  • Any service healthcheck failing > 1 min
  • HTTP error rate > 1% over 5 min
  • p99 latency > 2s over 5 min
  • Disk > 80% or memory > 90% sustained
  • No logs from a service for > 5 min (silent crash detection)
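
The thresholds above can be read as a set of independent conditions over a recent window. A minimal sketch in Python follows, assuming a hypothetical `WindowStats` snapshot. In the actual stack these would be Grafana or Prometheus alert rules, not application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Hypothetical snapshot of one service over the evaluation window."""
    healthcheck_failing_s: float   # how long the healthcheck has been failing
    total_requests: int            # requests in the last 5 min
    error_requests: int            # failed requests in the last 5 min
    p99_latency_s: float           # p99 latency over the last 5 min
    disk_used_pct: float
    memory_used_pct: float
    seconds_since_last_log: float


def firing_alerts(s: WindowStats) -> list[str]:
    """Return the names of all alert conditions that currently hold."""
    alerts = []
    if s.healthcheck_failing_s > 60:
        alerts.append("healthcheck_failing")
    # Skip the ratio when there was no traffic, to avoid dividing by zero.
    if s.total_requests > 0 and s.error_requests / s.total_requests > 0.01:
        alerts.append("high_error_rate")
    if s.p99_latency_s > 2.0:
        alerts.append("high_p99_latency")
    if s.disk_used_pct > 80:
        alerts.append("disk_pressure")
    if s.memory_used_pct > 90:
        alerts.append("memory_pressure")
    if s.seconds_since_last_log > 300:
        alerts.append("silent_service")
    return alerts
```

This sketch omits the "sustained" qualifier on disk and memory, because the issue does not say how long the condition must hold.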

V2 Acceptance Criteria

  • All observability services start via docker compose -f compose.yml -f compose.prod.yml up
  • Grafana loads with dashboards intact on a fresh deploy — no manual configuration
  • A trace from an API request links to correlated log lines in Loki
  • Prometheus scrapes metrics from the backend, Postgres, and Redis
  • At least one alert fires end-to-end (e.g. stop the backend container, alert fires within 2 min)

Out of Scope

  • Frontend RUM
  • CI observability (test timing, flake detection)

Labels: infra (changes to the technical infrastructure)
