
ms: storage health + query latency + maintenance ops in Settings #451

Merged
Ehco1996 merged 2 commits into master from feat/db-health-panel
May 4, 2026

Ehco1996 commented May 4, 2026

Summary

Adds a self-contained, dependency-free observability layer over the local SQLite metrics store, embedded directly in ehco's own admin SPA. No external Prometheus / OTel collector — ehco does it all itself.

Three new cards in Settings:

  • Storage — file size, pages, freelist (with VACUUM hint when fragmentation > 30%), row counts, last rule write age. Calls out an empty rule_metrics table inline so a broken sync pipeline is visible.
  • Query Latency — count / avg / max / last per op (add_node, add_rule, query_node, query_rule, cleanup, vacuum, truncate), since process start. Reset button.
  • Maintenance — Clean older-than-N-days · VACUUM · Truncate (literal-match confirm).
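The VACUUM hint on the Storage card can be pictured as a ratio check. The sketch below is a guess at the shape, not the PR's code: it assumes fragmentation means freelist pages divided by total pages, with both numbers read from SQLite's `PRAGMA freelist_count` and `PRAGMA page_count`. The function and constant names are hypothetical.

```go
package main

// vacuumThreshold mirrors the 30% figure quoted for the Storage card.
const vacuumThreshold = 0.30

// fragmentation returns the share of database pages sitting on the freelist.
// Assumption: pageCount and freelistCount come from `PRAGMA page_count`
// and `PRAGMA freelist_count` respectively.
func fragmentation(pageCount, freelistCount int64) float64 {
	if pageCount <= 0 {
		return 0 // empty or unreadable db: nothing to reclaim
	}
	return float64(freelistCount) / float64(pageCount)
}

// shouldHintVacuum reports whether the UI would suggest running VACUUM.
func shouldHintVacuum(pageCount, freelistCount int64) bool {
	return fragmentation(pageCount, freelistCount) > vacuumThreshold
}
```

Under these assumptions a database with 100 pages and 40 free pages would trigger the hint, while one with 10 free pages would not.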

Design notes

  • No third-party libs. ~80 LoC for stats, ~10 LoC of PRAGMA queries per metric. The track(*opStats) func() helper is the only instrumentation shape: every public ms method carries one defer track(&ms.stats.X)() line.
  • Row count is cached in atomic.Int64 counters. This avoids SELECT COUNT(*) on every Settings refresh; the cache is reconciled by recountRows() after Vacuum/Truncate and at startup. INSERT OR REPLACE can briefly overcount on duplicate PKs; the drift is bounded, documented, and self-healing.
  • Truncate requires confirm == "yes I am sure" exactly (string, not bool — a defaulted JSON field can never wipe data).
  • Vacuum is synchronous. Current ~2.5MB db: <100ms. Past ~1GB: lock window stretches into seconds; SPA confirm copy mentions this.
  • All maintenance ops auth-gated by the existing /api/v1 echo group middleware — confirm strings are a second line of defence, not the first.
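The `track(*opStats) func()` shape from the first design note could look roughly like the following. This is a minimal sketch under assumptions: the field names follow the commit message's `opStats{count,total,max,last}`, but the mutex, the `record` and `avg` helpers, and the stand-in `store` type are all hypothetical.

```go
package main

import (
	"sync"
	"time"
)

// opStats accumulates latency figures for one operation since process start.
type opStats struct {
	mu    sync.Mutex
	count int64
	total time.Duration
	max   time.Duration
	last  time.Duration
}

// record folds one observed duration into the counters.
func (s *opStats) record(d time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.count++
	s.total += d
	s.last = d
	if d > s.max {
		s.max = d
	}
}

// avg returns the mean latency, or zero before the first call.
func (s *opStats) avg() time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.count == 0 {
		return 0
	}
	return s.total / time.Duration(s.count)
}

// track starts a timer now and returns the function that stops it.
func track(s *opStats) func() {
	start := time.Now()
	return func() { s.record(time.Since(start)) }
}

// store is a hypothetical stand-in for the real ms type.
type store struct {
	stats struct{ addNode opStats }
}

// AddNode shows the one-line instrumentation: track(...) runs immediately,
// and the closure it returns runs when the method exits.
func (st *store) AddNode() {
	defer track(&st.stats.addNode)()
	// the real method would write to SQLite here
}
```

The double call in `defer track(...)()` is what makes the one-liner work: the outer call captures the start time at method entry, and only the returned closure is deferred.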

API surface

GET  /api/v1/db/health
POST /api/v1/db/cleanup       {older_than_days: int}
POST /api/v1/db/vacuum
POST /api/v1/db/truncate      {confirm: "yes I am sure"}
POST /api/v1/db/reset_stats
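To illustrate why the truncate body uses a string rather than a bool, here is a hedged sketch of the check. The request struct, the `checkTruncate` function, and the `errBadConfirm` sentinel are hypothetical names; only the literal "yes I am sure" comes from the PR.

```go
package main

import (
	"encoding/json"
	"errors"
)

// truncateConfirm is the literal the request body must carry.
const truncateConfirm = "yes I am sure"

// errBadConfirm is a hypothetical sentinel for a missing or wrong confirmation.
var errBadConfirm = errors.New("truncate: confirmation string mismatch")

// truncateReq models the JSON body of POST /api/v1/db/truncate.
type truncateReq struct {
	Confirm string `json:"confirm"`
}

// checkTruncate returns nil only when the body carries the exact literal.
func checkTruncate(body []byte) error {
	var req truncateReq
	if err := json.Unmarshal(body, &req); err != nil {
		// A body such as {"confirm": true} fails here: a JSON bool
		// cannot be decoded into a Go string field.
		return err
	}
	// A missing field decodes to "", which never equals the literal.
	if req.Confirm != truncateConfirm {
		return errBadConfirm
	}
	return nil
}
```

With a bool field, an omitted or zero-valued key would be indistinguishable from an explicit `false`, and a client default of `true` could wipe data; the exact-match string closes both gaps.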

When the underlying store is disabled (no upstream sync URL), every endpoint returns 503 via cmgr.ErrMetricsDisabled.
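The mapping from domain errors to HTTP status might be sketched as below. The PR names a `dbMaintenanceErr` helper and `cmgr.ErrMetricsDisabled`, but their real signatures are not shown, so everything here is an assumption: the local `ErrMetricsDisabled` stands in for the cmgr sentinel, `ErrBadConfirm` is hypothetical, and the 400 and 500 cases are guesses.

```go
package main

import (
	"errors"
	"net/http"
)

// ErrMetricsDisabled stands in for cmgr.ErrMetricsDisabled.
var ErrMetricsDisabled = errors.New("metrics store disabled")

// ErrBadConfirm is a hypothetical sentinel for a failed truncate confirmation.
var ErrBadConfirm = errors.New("confirmation string mismatch")

// statusFor picks an HTTP status for a maintenance-op error.
// errors.Is is used so wrapped errors still match their sentinel.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrMetricsDisabled):
		return http.StatusServiceUnavailable // 503, as stated in the PR
	case errors.Is(err, ErrBadConfirm):
		return http.StatusBadRequest // assumed mapping
	default:
		return http.StatusInternalServerError // assumed fallback
	}
}
```

Keeping the mapping in one function means every `/api/v1/db/*` handler reports a disabled store the same way.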

Test plan

  • go test -race ./internal/cmgr/ms/... — round-trip, cleanup, truncate strictness, reset (all PASS)
  • make lint clean
  • make test full suite green
  • make ui builds (Vite OK, +~3KB gzipped)
  • Smoke on a live node: hit each endpoint, verify Settings cards render, confirm rule_metrics=0 warning shows on the boxes that surfaced in the original investigation
  • Watch the latency card under load: sanity-check that add_node max latency stays sub-millisecond on real traffic

Why

Direct outcome of investigating PR #443's claim that "SQLite can't keep up". Probing a real node showed a 2.5MB db with all queries <1ms, but an empty rule_metrics table (a separate sync-pipeline bug). Decision: stay on SQLite long-term; build observability so future "can't keep up" judgements are data-driven, and surface the empty-table case loudly so the underlying bug doesn't hide.

🤖 Generated with Claude Code

Adds a self-contained, dependency-free observability layer over the
local SQLite metrics store, surfaced in the SPA's Settings page so
operators can see at a glance whether the store is healthy and act on
it without shelling into the box.

Backend (internal/cmgr/ms):
- stats.go: opStats{count,total,max,last} per op + Stats.Snapshot. The
  shared `track` helper instruments every public method via one-line
  defer; new ops register in Stats.all and the SPA picks them up.
- health.go: DBHealth aggregates file size, page/freelist counts, row
  counts, last rule write, and the stats snapshot. Maintenance ops
  (CleanupOlderThan, Vacuum, Truncate, ResetStats) return a unified
  MaintenanceResult with duration + before/after byte counts.
- ms.go: nodeRows/ruleRows atomic.Int64 caches keep Health() off the
  COUNT(*) hot path; reconciled via recountRows on startup, Vacuum,
  and Truncate. INSERT OR REPLACE may overcount briefly — bounded.
- Truncate requires confirm == "yes I am sure" exactly; a defaulted
  JSON field cannot wipe data.
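The cached row counts described for ms.go can be sketched as follows. This is an illustration under assumptions, not the PR's code: the `rowCache` type, its method names, and the `node_metrics` table name are hypothetical (only `rule_metrics`, `nodeRows`, `ruleRows`, and `recountRows` appear in the source), and the `count` callback stands in for running `SELECT COUNT(*)` against SQLite.

```go
package main

import "sync/atomic"

// rowCache keeps approximate row counts so health checks avoid COUNT(*).
type rowCache struct {
	nodeRows atomic.Int64
	ruleRows atomic.Int64
}

// onNodeInsert bumps the cached count by the number of rows written.
// With INSERT OR REPLACE, a row that replaces an existing primary key is
// still counted as new, so the cache can drift above the true value.
func (c *rowCache) onNodeInsert(n int64) { c.nodeRows.Add(n) }

// recount overwrites both counters with authoritative values.
// count is a stand-in for a real SELECT COUNT(*) per table.
func (c *rowCache) recount(count func(table string) int64) {
	c.nodeRows.Store(count("node_metrics"))
	c.ruleRows.Store(count("rule_metrics"))
}
```

Any overcount lasts only until the next `recount`, which the commit message says happens at startup and after Vacuum and Truncate.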

cmgr: Cmgr interface gains DBHealth/DBCleanup/DBVacuum/DBTruncate/
DBResetStats, returning ErrMetricsDisabled when the store is not
configured.

Web (internal/web):
- 5 new routes under /api/v1/db/* (auth-gated via the existing api
  group middleware). dbMaintenanceErr maps cmgr/ms domain errors onto
  HTTP status uniformly.
- Settings.tsx grows Storage / Query Latency / Maintenance cards;
  Truncate uses prompt() with literal-match confirm; every op funnels
  through one runMaint helper for consistent loading/toast state.

Tests: round-trip, cleanup row-affected, truncate confirm strictness,
reset-stats; all green under -race.

Density / IA pass on /settings, plus a fix for the copy-button bug seen on
LAN (plain-HTTP) deployments where navigator.clipboard is undefined.

Layout
- Drop the standalone "theme" card — broken and already covered by the
  sidebar toggle, no need for a duplicate.
- Drop the "api surface" card — a hardcoded endpoint enumeration with
  no operator value; OpenAPI is the right home if we ever want it.
- Fold the "reload configuration" card into the runtime-configuration
  card's right slot. One button + one paragraph no longer steals a
  whole grid cell; the reload status pill renders inline.
- Group storage / query-latency / maintenance under a "database"
  section title; group the updates panel under "updates". Adds the
  vertical hierarchy that 11 sibling cards in a 2-col grid lacked.
- Maintenance card switches from md:grid-cols-3 to flex-wrap so the
  three actions hug the start instead of floating in unequal cells;
  status pill moves to the card-header right slot.
- Latency table drops the redundant "last" column and pins column
  widths via table-fixed + colgroup so the right edge no longer
  overflows the card on lg viewports.

UpdatesPanel
- Collapse "current build" + "check for updates" into a single card.
  Build DescList sits in the body; channel selector + Check button
  + nightly/stable pill move into the card-header right slot.
- Drop the in-panel h2 "updates" header — the new SectionTitle in
  Settings owns that label, and rendering it twice was redundant.
- "update progress" stays as a separate conditional card.

Copy bug
- New util/clipboard.ts wraps navigator.clipboard with the legacy
  document.execCommand fallback. Plain-HTTP origins (e.g. ehco on a
  LAN IP) are not secure contexts, so navigator.clipboard is
  undefined and the previous catch-and-ignore meant the button
  appeared dead. The fallback works in non-secure contexts and is
  the canonical pattern for this scenario.
- Settings calls copyText() instead of doing its own try/catch; the
  helper is generic enough for any future copy affordance.
Ehco1996 merged commit 31fd219 into master on May 4, 2026
1 check passed