Add fleet-wide operator dashboard #91
Merged
chrisbliss18 merged 7 commits into v2 on Apr 30, 2026
added 7 commits
April 29, 2026 22:57
Introduce a MySQL-backed fleet dashboard source that aggregates jetmon_process_health, jetmon_hosts bucket ownership, delivery queue summaries, projection drift, and dependency health into a single fleet snapshot. Expose the snapshot through /api/fleet and add a /fleet operator page with red/amber/green summary, stale heartbeat detection, delivery-owner posture, bucket coverage, process rows, dependency rollups, and suggested next actions. Document the new fleet dashboard surface and update the dashboard roadmap items completed by this first implementation slice.
Tighten the fleet rollup after the review pass so stale process snapshots are treated as red rollout blockers and sorted ahead of healthy rows. Delivery posture now ignores stale or stopped processes, warns when queued delivery rows have no fresh worker, and treats enabled workers without DELIVERY_OWNER_HOST as an amber guardrail instead of a green state. Clarify bucket ownership mode by reporting pinned and mixed rollout states separately from dynamic jetmon_hosts coverage. This avoids false red dynamic-coverage failures during pinned rollout while still surfacing mixed ownership as operator attention. Improve the operator interface by linking host and fleet dashboards, adding per-table delivery queue and bucket-owner details, caching fleet snapshots briefly, refreshing fleet dashboard config on SIGHUP, and returning generic fleet API failures without leaking backend error text. Document the shared dashboard exposure model and mark the fleet dashboard exposure follow-up complete in the roadmap.
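The delivery-posture rules in this commit can be read as a small decision function. The following is a minimal sketch of that reading, not the code from this branch: the `processRow` type, its field names, and the choice of amber for the "no fresh worker" warning are assumptions made for illustration.

```go
package main

// processRow holds the fields this sketch needs from one process snapshot.
// The type and field names are hypothetical, loosely following the
// process rows returned by /api/fleet.
type processRow struct {
	Stale                  bool
	State                  string // e.g. "running" or "stopped"
	DeliveryWorkersEnabled bool
	DeliveryOwnerHost      string
}

// deliveryPosture returns "green" or "amber" plus a short reason.
func deliveryPosture(rows []processRow, pendingRows int) (status, reason string) {
	enabled, unowned := 0, 0
	for _, p := range rows {
		// Stale or stopped processes do not count toward posture.
		if p.Stale || p.State == "stopped" {
			continue
		}
		if !p.DeliveryWorkersEnabled {
			continue
		}
		enabled++
		if p.DeliveryOwnerHost == "" {
			unowned++
		}
	}
	switch {
	case pendingRows > 0 && enabled == 0:
		// Assumption: the "warns" case is surfaced as amber.
		return "amber", "queued delivery rows have no fresh delivery worker"
	case unowned > 0:
		return "amber", "delivery workers enabled without DELIVERY_OWNER_HOST"
	default:
		return "green", "delivery posture ok"
	}
}
```

With this shape, a stale deliverer that still has workers enabled contributes nothing, so a non-empty queue with only stale workers comes back amber rather than green.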
Move feature/fleet-dashboard out of the future candidate list now that this branch implements the global fleet dashboard surface. Update the project overview so jetmon_process_health is described as the current data source behind /fleet, alongside jetmon_hosts, delivery queues, projection drift, and dependency rollups, instead of only future dashboard plumbing.
Add operator-facing fleet dashboard guidance before opening the PR. The operations guide now covers dashboard configuration, local and tunneled access, the read-only /api/fleet endpoint, top-level red/amber/green interpretation, pinned versus dynamic bucket modes, delivery-owner posture, and direct SQL checks for process health and delivery queues. Update rollout and migration docs so operators know when to use the host dashboard and /fleet during pinned cutover and final dynamic ownership validation. Clarify that DASHBOARD_BIND_ADDR protects both unauthenticated host and fleet dashboard views, and clean up the changelog wording now that the fleet dashboard exists on this branch.
Address the second review pass across operator usefulness, security, and code quality. The dashboard now sends no-store and nosniff headers on read-only HTML and JSON responses, rejects write methods across the read-only dashboard routes, and keeps SSE constrained to GET. Improve the fleet UI by showing when additional summary issues are hidden, removing empty bucket-detail separators, and avoiding misleading oldest-due ages when no delivery rows are due. Clarify in the operations guide that /fleet does not scrape other host dashboards. Any monitor dashboard can serve the same fleet view from shared MySQL state, while standalone deliverers publish process-health rows but do not serve the dashboard. Validation run before this commit: go test ./..., go vet ./..., and make rollout-docs-verify.
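One way to picture the read-only guard described in this commit is a small `net/http` wrapper. This is a sketch under assumptions rather than the handler from this branch: the helper names are hypothetical, and allowing HEAD alongside GET is a choice of the sketch that the commit message does not confirm.

```go
package main

import "net/http"

// readOnlyHeaders returns the two response headers named in the commit
// message: no-store caching and nosniff content-type handling.
func readOnlyHeaders() map[string]string {
	return map[string]string{
		"Cache-Control":          "no-store",
		"X-Content-Type-Options": "nosniff",
	}
}

// methodAllowed reports whether a method is acceptable on a read-only route.
// Allowing HEAD in addition to GET is an assumption of this sketch.
func methodAllowed(method string) bool {
	return method == http.MethodGet || method == http.MethodHead
}

// readOnly wraps a handler so every response carries the read-only headers
// and any other method is rejected with 405 before the handler runs.
func readOnly(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for name, value := range readOnlyHeaders() {
			w.Header().Set(name, value)
		}
		if !methodAllowed(r.Method) {
			w.Header().Set("Allow", "GET, HEAD")
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

A route would be registered as, for example, `mux.Handle("/fleet", readOnly(fleetHandler))`, where `fleetHandler` stands in for whatever renders the page. An SSE route that must stay GET-only would need a stricter check than `methodAllowed`.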
Return 404 for unknown dashboard paths instead of falling through to the host dashboard. This keeps URL mistakes and stale links from rendering a misleading page. Add stronger browser hardening headers to read-only dashboard responses, including CSP, frame denial, no-referrer, no-store, and nosniff. The dashboard remains intentionally simple, but it should still behave like an internal operational surface. Ignore stale or stopped monitor snapshots when deriving the fleet bucket ownership mode so old pinned rollout data cannot make an otherwise healthy dynamic fleet look mixed. Also keep healthy monitors ahead of deliverers in the process table when urgency is otherwise equal. Validated with go test ./..., go vet ./..., and make rollout-docs-verify.
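The process-table ordering mentioned here (urgent rows first, then monitors ahead of deliverers when urgency is equal) could be expressed as a stable sort. The sketch below is illustrative only: the `fleetProcess` type and the exact urgency ranking are assumptions, not the helper used on this branch.

```go
package main

import "sort"

// fleetProcess is a hypothetical, trimmed-down process row.
type fleetProcess struct {
	ID          string
	ProcessType string // "monitor" or "deliverer"
	Health      string // "red", "amber", or "green"
	Stale       bool
}

// urgency ranks a row; lower values sort first. Placing stale rows ahead of
// red ones is an assumption of this sketch.
func urgency(p fleetProcess) int {
	switch {
	case p.Stale:
		return 0
	case p.Health == "red":
		return 1
	case p.Health == "amber":
		return 2
	default:
		return 3
	}
}

// sortProcesses orders rows by urgency, then puts monitors ahead of
// deliverers when urgency is equal. The sort is stable, so rows that tie
// on both keys keep their incoming order.
func sortProcesses(rows []fleetProcess) {
	sort.SliceStable(rows, func(i, j int) bool {
		ui, uj := urgency(rows[i]), urgency(rows[j])
		if ui != uj {
			return ui < uj
		}
		return rows[i].ProcessType == "monitor" && rows[j].ProcessType != "monitor"
	})
}
```

Given a green deliverer, a green monitor, and a stale monitor in that order, the sorted table lists the stale monitor first, then the green monitor, then the deliverer.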
Collapse each delivery table summary from six separate aggregate queries into one UNION ALL aggregate query. This keeps the same indexed predicates and dashboard fields while cutting delivery-summary database round trips from twelve to two per uncached fleet snapshot. Remove the duplicated active-process helper logic by routing bucket-ownership and delivery-posture checks through one shared fleet-rollup helper. Add SQL coverage for the consolidated delivery summary path so the query shape and aggregate field mapping stay explicit. Validated with go test ./..., go vet ./..., and make rollout-docs-verify.
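To illustrate the query shape, the sketch below builds one UNION ALL statement with six aggregate branches for a single delivery table. Everything schema-related is hypothetical: the column names (`status`, `next_attempt_at`, `updated_at`), the status values, and the choice of which six metrics form the branches are guesses made for illustration and are not taken from this branch.

```go
package main

import (
	"fmt"
	"strings"
)

// summaryBranch pairs an output metric name with a WHERE predicate.
type summaryBranch struct{ metric, where string }

// branches assumes a hypothetical schema; the real columns may differ.
var branches = []summaryBranch{
	{"pending", "status = 'pending'"},
	{"due_now", "status = 'pending' AND next_attempt_at <= NOW()"},
	{"future_retry", "status = 'pending' AND next_attempt_at > NOW()"},
	{"delivered_since", "status = 'delivered' AND updated_at >= ?"},
	{"abandoned_since", "status = 'abandoned' AND updated_at >= ?"},
	{"failed_since", "status = 'failed' AND updated_at >= ?"},
}

// deliverySummaryQuery builds one statement that returns a (metric, value)
// row per branch. The table name is interpolated into the SQL, so callers
// must pass only fixed, trusted names. It also returns the number of "?"
// placeholders the caller has to bind.
func deliverySummaryQuery(table string) (string, int) {
	parts := make([]string, 0, len(branches))
	placeholders := 0
	for _, b := range branches {
		parts = append(parts, fmt.Sprintf(
			"SELECT '%s' AS metric, COUNT(*) AS value FROM %s WHERE %s",
			b.metric, table, b.where))
		placeholders += strings.Count(b.where, "?")
	}
	return strings.Join(parts, "\nUNION ALL\n"), placeholders
}
```

Running this builder once for the webhook table and once for the alert table gives two statements, which matches the twelve-to-two round-trip reduction the commit describes.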
Why
This adds the fleet-wide operator view that sits on top of the host dashboard plumbing merged in #90. The host dashboard is useful when an operator is already looking at one server, but rollout decisions need a single view of monitor hosts, standalone deliverers, bucket coverage, stale heartbeats, delivery backlog, projection drift, dependency health, and the next safest action.
What changed
- Adds /fleet and /api/fleet to the existing operator dashboard listener.
- Aggregates jetmon_process_health, jetmon_hosts, webhook and alert delivery queues, projection drift, and dependency rollups.
- Treats enabled delivery workers without DELIVERY_OWNER_HOST as an amber guardrail.
- Returns generic /api/fleet backend errors while logging the real error server-side.

Example fleet dashboard output
An operator opening http://127.0.0.1:8080/fleet during a pinned rollout would see the most important action at the top.

/api/fleet returns the same complete snapshot that the HTML dashboard renders. It is not a reduced API surface; operator scripts get the summary, process rows, bucket-owner rows, delivery queue tables, delivery posture, projection drift, and dependency rollups. This representative JSON example keeps the arrays short while showing the full response shape:

```json
{
  "generated_at": "2026-04-30T14:00:00Z",
  "summary": {
    "status": "amber",
    "message": "fleet needs operator attention",
    "suggested_next_action": "Confirm whether the fleet is still in pinned rollout before expecting dynamic bucket coverage.",
    "issues": [
      "bucket coverage: monitor process snapshots report pinned bucket ranges; dynamic jetmon_hosts coverage is not active"
    ],
    "red_processes": 0,
    "amber_processes": 0,
    "green_processes": 4,
    "stale_processes": 0,
    "monitor_processes": 3,
    "deliverer_processes": 1,
    "dependency_red_count": 0,
    "dependency_amber_count": 0
  },
  "processes": [
    {
      "process_id": "jetmon-a:monitor",
      "host_id": "jetmon-a",
      "process_type": "monitor",
      "state": "running",
      "health_status": "green",
      "last_heartbeat_age_sec": 4,
      "stale": false,
      "bucket_min": 0,
      "bucket_max": 333,
      "bucket_ownership": "pinned range=0-333",
      "api_port": 8090,
      "dashboard_port": 8080,
      "delivery_workers_enabled": false,
      "delivery_owner_host": "jetmon-deliverer-1",
      "worker_count": 256,
      "active_checks": 128,
      "queue_depth": 40,
      "retry_queue_size": 2,
      "wpcom_circuit_open": false,
      "wpcom_queue_depth": 0,
      "go_sys_mem_mb": 96,
      "dependency_health": [
        { "name": "mysql", "status": "green", "latency_ms": 3, "checked_at": "2026-04-30T13:59:56Z" }
      ]
    }
  ],
  "process_counts": { "monitor": 3, "deliverer": 1 },
  "bucket_coverage": {
    "status": "amber",
    "mode": "pinned",
    "bucket_total": 1000,
    "host_count": 0,
    "error": "monitor process snapshots report pinned bucket ranges; dynamic jetmon_hosts coverage is not active",
    "hosts": []
  },
  "delivery": {
    "status": "green",
    "pending": 0,
    "due_now": 0,
    "future_retry": 0,
    "delivered_since": 60,
    "abandoned_since": 0,
    "failed_since": 0,
    "oldest_pending_age_sec": 0,
    "oldest_due_age_sec": 0,
    "tables": [
      {
        "kind": "webhook",
        "pending": 0,
        "due_now": 0,
        "future_retry": 0,
        "delivered_since": 42,
        "abandoned_since": 0,
        "failed_since": 0,
        "oldest_pending_age_sec": 0,
        "oldest_due_age_sec": 0
      },
      {
        "kind": "alert",
        "pending": 0,
        "due_now": 0,
        "future_retry": 0,
        "delivered_since": 18,
        "abandoned_since": 0,
        "failed_since": 0,
        "oldest_pending_age_sec": 0,
        "oldest_due_age_sec": 0
      }
    ],
    "posture": {
      "status": "green",
      "enabled_process_count": 1,
      "enabled_hosts": [ "jetmon-deliverer-1" ],
      "owner_hosts": [ "jetmon-deliverer-1" ],
      "message": "delivery owner is constrained to jetmon-deliverer-1"
    }
  },
  "projection_drift": { "status": "green", "count": 0 },
  "dependencies": [
    { "name": "mysql", "status": "green", "red_count": 0, "amber_count": 0, "green_count": 4, "stale_count": 0 }
  ]
}
```

Operational notes
- The fleet dashboard is served on the existing dashboard listener configured by DASHBOARD_PORT and DASHBOARD_BIND_ADDR.
- Any jetmon2 monitor dashboard connected to the same MySQL database can serve the fleet view from shared process-health, bucket, delivery, and event state.
- During pinned rollout, amber mode=pinned is expected. After dynamic ownership cutover, operators should expect green mode=dynamic, fresh jetmon_hosts coverage, no stale processes, zero projection drift, and no failed or abandoned delivery rows.

Validation
- go test ./...
- go vet ./...
- make rollout-docs-verify
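The post-cutover expectations in the operational notes could also be checked from an operator script against an /api/fleet response body. The following is a rough sketch, not tooling shipped with this PR: the `fleetSnapshot` type and `cutoverIssues` helper are hypothetical, only the JSON field names come from the example response, and "fresh jetmon_hosts coverage" is approximated here as a green bucket_coverage status.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fleetSnapshot declares only the /api/fleet fields this check reads.
type fleetSnapshot struct {
	Summary struct {
		StaleProcesses int `json:"stale_processes"`
	} `json:"summary"`
	BucketCoverage struct {
		Status string `json:"status"`
		Mode   string `json:"mode"`
	} `json:"bucket_coverage"`
	Delivery struct {
		AbandonedSince int `json:"abandoned_since"`
		FailedSince    int `json:"failed_since"`
	} `json:"delivery"`
	ProjectionDrift struct {
		Count int `json:"count"`
	} `json:"projection_drift"`
}

// cutoverIssues decodes a response body and lists every post-cutover
// expectation that is not met. An empty result means all checks passed.
func cutoverIssues(body []byte) ([]string, error) {
	var s fleetSnapshot
	if err := json.Unmarshal(body, &s); err != nil {
		return nil, fmt.Errorf("decode fleet snapshot: %w", err)
	}
	var issues []string
	if s.BucketCoverage.Mode != "dynamic" || s.BucketCoverage.Status != "green" {
		issues = append(issues, fmt.Sprintf("bucket coverage is %s mode=%s",
			s.BucketCoverage.Status, s.BucketCoverage.Mode))
	}
	if s.Summary.StaleProcesses > 0 {
		issues = append(issues, fmt.Sprintf("%d stale processes", s.Summary.StaleProcesses))
	}
	if s.ProjectionDrift.Count > 0 {
		issues = append(issues, fmt.Sprintf("projection drift count is %d", s.ProjectionDrift.Count))
	}
	if s.Delivery.FailedSince > 0 || s.Delivery.AbandonedSince > 0 {
		issues = append(issues, fmt.Sprintf("%d failed and %d abandoned delivery rows",
			s.Delivery.FailedSince, s.Delivery.AbandonedSince))
	}
	return issues, nil
}
```

Fed the pinned-rollout example response, this returns a single bucket-coverage issue, which is the expected state before cutover rather than a failure. The body would typically come from a GET against /api/fleet on the dashboard listener.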