feature request: Add proper monitoring stack #122

# Feature: Observability & Monitoring Stack

## Background

RELab has no active monitoring. Issues are discovered reactively — someone notices the app is broken, then we SSH in and run `docker logs -f`. There is no visibility into performance degradation, rising error rates, or infrastructure pressure before they become user-facing problems.

## Goal

Implement observability covering logs, metrics, and traces, with alerting, so that production health is visible at a glance and problems are surfaced proactively.

This is split into two phases. V1 is fast to ship: two lightweight self-hosted tools with minimal infrastructure. V2 replaces or extends it with a full self-hosted observability stack.

---

## V1 — Lightweight self-hosted (two new Docker services)

**Stack:** Dozzle · Uptime Kuma

Both are open-source, single-container, near-zero config, and use negligible resources.

### Dozzle — log viewer

A read-only web UI for Docker container logs. No storage, no agents, no config — just mount the Docker socket. Replaces `docker logs -f` with a browser UI that supports multi-container tailing, search, and filtering.

### Uptime Kuma — uptime & alerting

Point it at the existing `/live` healthcheck endpoint of each service. This gives:

- Uptime monitoring with a status page
- Alerts via email, Slack, Discord, ntfy, or webhook on downtime

Both services should be added to `compose.prod.yml` and exposed via the Cloudflare Tunnel with access control.
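
To make the `/live` check concrete, here is a minimal Python sketch of the kind of probe an uptime monitor performs: an HTTP GET that counts as "up" only on a 2xx response, plus a small wrapper that notifies on state changes rather than on every check. This is not how Uptime Kuma is implemented; `is_live`, `LivenessMonitor`, and the notification strings are hypothetical names chosen for illustration.

```python
import http.client
import urllib.request
from typing import Callable


def is_live(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts, and 4xx/5xx responses all count as "down".
        return False


class LivenessMonitor:
    """Calls `notify` only when the service changes state, not on every check."""

    def __init__(self, probe: Callable[[], bool], notify: Callable[[str], None]) -> None:
        self.probe = probe
        self.notify = notify
        self.was_up = True  # assume healthy until the first failed check

    def check_once(self) -> bool:
        up = self.probe()
        if up != self.was_up:
            self.notify("recovered" if up else "down")
        self.was_up = up
        return up
```

A caller would run something like `LivenessMonitor(lambda: is_live("http://backend:8000/live"), print).check_once()` on a timer, where the hostname and port are placeholders. Alerting on transitions keeps a prolonged outage from producing one notification per check.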

### V1 Acceptance Criteria

- [ ] Dozzle is accessible in production and shows live logs for all containers
- [ ] Uptime Kuma monitors the backend `/live` endpoint and sends an alert on failure
- [ ] No SSH required to see what is happening in production

---

## V2 — Self-hosted Grafana LGTM Stack (full observability)

**Stack:** Prometheus · Loki · Tempo · Grafana · Grafana Alloy

All components are open-source and run on the existing server as additional Docker Compose services. Grafana is exposed via the existing Cloudflare Tunnel (with access control).

### Backend instrumentation

- OpenTelemetry auto-instrumentation for FastAPI, SQLAlchemy, and Redis. Export traces to Tempo via OTLP.
- Prometheus `/metrics` endpoint (request rate, latency histograms, error rate per endpoint).
- Trace IDs injected into loguru records for log-trace correlation.
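
The log-trace correlation idea can be sketched with the standard library alone. The example below is a stand-in, not the proposed implementation: it uses `logging` and a `contextvars` variable where the backend would use loguru and the active OpenTelemetry span, and every name in it (`current_trace_id`, `TraceIdFilter`, `build_logger`, the logger name) is hypothetical. The pattern it shows is the one that matters: every record emitted during a request carries that request's trace ID, so a trace in Tempo can be matched to its log lines in Loki.

```python
import contextvars
import io
import logging
import secrets
from typing import Optional

# Trace ID of the request currently being handled (None outside a request).
current_trace_id: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "current_trace_id", default=None
)


class TraceIdFilter(logging.Filter):
    """Copies the active trace ID onto every record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get() or "-"
        return True  # never drop the record, only annotate it


def new_trace_id() -> str:
    """Random 128-bit ID as 32 hex characters, the shape W3C Trace Context uses."""
    return secrets.token_hex(16)


def build_logger(stream: io.TextIOBase) -> logging.Logger:
    """Return a logger whose output lines include the current trace ID."""
    logger = logging.getLogger("relab.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

In the real backend the ID would not be generated locally; it would be read from the span that the OpenTelemetry instrumentation creates for each request, and a loguru patcher would likely take the place of the filter.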

### New Compose services

Prometheus, Loki, Tempo, Grafana, Grafana Alloy, postgres\_exporter, and a Redis exporter (implied by the Redis metrics in the dashboards and acceptance criteria), all on an internal network.

Alloy tails Docker log streams and forwards them to Loki, and scrapes host metrics (CPU, memory, disk, network).

### Dashboards (provisioned as JSON, no manual setup)

- **Service health** — request rate, error rate, p95/p99 latency per endpoint
- **Infrastructure** — CPU, memory, disk, network per container and host
- **Database & cache** — Postgres connection pool, Redis hit/miss ratio
- **Logs** — Loki panels per service with level breakdown

### Alerting

Notify via email or webhook on:

- Any service healthcheck failing > 1 min
- HTTP error rate > 1% over 5 min
- p99 latency > 2s over 5 min
- Disk > 80% or memory > 90% sustained
- No logs from a service for > 5 min (silent crash detection)
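
To pin down what two of these thresholds mean, the sketch below evaluates the error-rate and p99 conditions over a 5-minute window of request samples. In the actual stack these would be alert rules evaluated by Prometheus or Grafana, not Python; the code only spells out the arithmetic. It assumes that "HTTP error" means a 5xx response and uses a nearest-rank p99, and the names `Request`, `p99`, and `evaluate_alerts` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Request:
    timestamp: float  # seconds since epoch
    status: int       # HTTP status code
    duration: float   # seconds


def p99(durations: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sequence."""
    ordered = sorted(durations)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def evaluate_alerts(
    requests: Sequence[Request],
    now: float,
    window: float = 300.0,         # 5 min
    max_error_rate: float = 0.01,  # 1%
    max_p99: float = 2.0,          # seconds
) -> List[str]:
    """Return the names of the alert conditions that hold for the last `window` seconds."""
    recent = [r for r in requests if now - window <= r.timestamp <= now]
    if not recent:
        return []  # no traffic in the window: nothing to evaluate here
    firing = []
    errors = sum(1 for r in recent if r.status >= 500)  # assumption: 5xx only
    if errors / len(recent) > max_error_rate:
        firing.append("error_rate")
    if p99([r.duration for r in recent]) > max_p99:
        firing.append("p99_latency")
    return firing
```

With 200 requests in the window, three 5xx responses (1.5%) trip the error-rate condition, while a single one (0.5%) does not. The empty-window case returns no alerts on purpose, since silence is covered separately by the "no logs for > 5 min" rule.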

### V2 Acceptance Criteria

- [ ] All observability services start via `docker compose -f compose.yml -f compose.prod.yml up`
- [ ] Grafana loads with dashboards intact on a fresh deploy — no manual configuration
- [ ] A trace from an API request links to correlated log lines in Loki
- [ ] Prometheus scrapes metrics from the backend, Postgres, and Redis
- [ ] At least one alert fires end-to-end (e.g. stop the backend container, alert fires within 2 min)

---

## Out of Scope

- Frontend RUM
- CI observability (test timing, flake detection)

