Context
The only health surface is GET /health returning a static { status: "ok" } (src/routes/health.ts). There are no counters for login failures, no histograms for OPAQUE latency, no gauges for sessions/devices/users-by-status.
Problem / Observation
Self-hosters who run keyfount alongside Prometheus/Grafana, or who use Posthog/Loki for visibility, have nothing to scrape. Their only visibility is grepping logs (already redacted to remove bodies) and direct SQLite queries. Operators won't notice a brute-force campaign in progress; they won't see registration backlogs growing; they won't see request latency creep.
/health itself is a static OK — it doesn't actually probe the database, so a DB-locked instance returns 200 to a healthcheck and only fails when a real request comes in.
Proposed approach
- Make /health probe the DB (e.g. SELECT 1) and return { status, db_ok, uptime_s }.
- Add /metrics (Prometheus exposition format) gated behind the admin instance: counters for login_total{result=success|failure|rate_limited}, register_total{status=pending|approved|rejected}, events_appended_total, bytes_stored_total{user="…"} (or just total), and a histogram for request_duration_seconds{route}.
- Decide if metrics live on the admin port (private) or behind a separate METRICS_TOKEN for scraping from outside.
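A rough sketch of the first two items, to make the shape concrete. Everything named here (buildHealth, DbProbe, Counter, the "degraded" status value, and returning 503 when the probe fails) is a hypothetical illustration, not existing keyfount code; the real handler would pass in a probe that runs SELECT 1 against the SQLite connection.

```typescript
// Hypothetical sketch: none of these names exist in keyfount today.

// A probe runs a trivial query (e.g. SELECT 1) and throws if the DB is unusable.
type DbProbe = () => void;

interface HealthBody {
  status: "ok" | "degraded"; // "degraded" is an assumed value
  db_ok: boolean;
  uptime_s: number;
}

// Builds the /health response. Returning 503 on probe failure is an assumption;
// the proposal only specifies the body fields.
function buildHealth(
  probe: DbProbe,
  startedAtMs: number,
  nowMs: number,
): { code: number; body: HealthBody } {
  let dbOk = true;
  try {
    probe();
  } catch {
    dbOk = false;
  }
  return {
    code: dbOk ? 200 : 503,
    body: {
      status: dbOk ? "ok" : "degraded",
      db_ok: dbOk,
      uptime_s: Math.floor((nowMs - startedAtMs) / 1000),
    },
  };
}

// Escapes a label value as the Prometheus text format requires:
// backslash, double quote, and newline.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Minimal in-memory counter that renders itself in Prometheus text format.
class Counter {
  private readonly name: string;
  private readonly help: string;
  private readonly values = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  inc(labels: Record<string, string> = {}, by = 1): void {
    // Sort label names so the same label set always maps to the same series.
    const key = Object.entries(labels)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${k}="${escapeLabel(v)}"`)
      .join(",");
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  render(): string {
    const lines = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    for (const [key, value] of this.values) {
      lines.push(key ? `${this.name}{${key}} ${value}` : `${this.name} ${value}`);
    }
    return lines.join("\n") + "\n";
  }
}

// Usage: a locked database makes the probe throw, and /health reports it.
const lockedHealth = buildHealth(() => { throw new Error("database is locked"); }, 0, 5000);
console.log(lockedHealth.code, JSON.stringify(lockedHealth.body));

const loginTotal = new Counter("login_total", "Login attempts by result.");
loginTotal.inc({ result: "failure" });
loginTotal.inc({ result: "success" });
console.log(loginTotal.render());
```

Taking the probe as a parameter keeps the handler testable without a real database. A production version would more likely use an established Prometheus client library than a hand-rolled Counter, and would also need the request_duration_seconds histogram, which is not sketched here.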
Acceptance criteria