Improve host dashboard and add fleet health plumbing#90
Merged
chrisbliss18 merged 5 commits into v2 on Apr 30, 2026
added 5 commits
April 29, 2026 21:47
Improve the existing per-host operator dashboard while adding the durable process-health plumbing needed for a future fleet dashboard. The dashboard now exposes a combined /api/host snapshot, renders a clearer red/amber/green host summary, and surfaces the rollout commands operators need during cutover. Add the jetmon_process_health migration plus a small fleethealth package that upserts process heartbeat snapshots. The monitor publishes bucket, queue, WPCOM, dependency, RSS, version, and delivery-owner state; the standalone deliverer publishes active/idle owner state, DB/StatsD health, RSS, version, and lifecycle state. Update roadmap and operator/data-model docs to describe the two-PR dashboard plan, the process health table, and how operators can inspect the host snapshot and future fleet dashboard data source.
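For local tooling that consumes the combined /api/host snapshot, a minimal Go sketch of decoding the response might look like the following. It assumes only the three top-level keys named in this PR (state, health, summary) and keeps their contents as raw JSON, since the inner field layout is not shown here. The sample payload, including the "status" field, is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HostSnapshot holds the three top-level sections of the /api/host response.
// The inner shape of each section is not specified in this PR, so each one is
// kept as raw JSON rather than guessed at.
type HostSnapshot struct {
	State   json.RawMessage `json:"state"`
	Health  json.RawMessage `json:"health"`
	Summary json.RawMessage `json:"summary"`
}

// parseHostSnapshot decodes a response body and rejects payloads that omit
// any of the three expected sections.
func parseHostSnapshot(body []byte) (HostSnapshot, error) {
	var snap HostSnapshot
	if err := json.Unmarshal(body, &snap); err != nil {
		return HostSnapshot{}, fmt.Errorf("decode /api/host response: %w", err)
	}
	if snap.State == nil || snap.Health == nil || snap.Summary == nil {
		return HostSnapshot{}, fmt.Errorf("response is missing state, health, or summary")
	}
	return snap, nil
}

func main() {
	// Hypothetical payload for illustration only.
	body := []byte(`{"state":{},"health":{},"summary":{"status":"green"}}`)
	snap, err := parseHostSnapshot(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("summary section:", string(snap.Summary))
}
```

Keeping the sections as json.RawMessage lets a tool pass them through or decode them later once the real field names are known.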
Expand the dashboard roadmap checklist with the specific review decisions for the current host-dashboard branch: localhost-first binding, red/amber issue summaries, separate process lifecycle and health rollup state, real per-host throughput metrics, and clearer Go runtime memory labeling. Also record the fleet-dashboard follow-ups that should stay out of this PR but not be forgotten: explicit delivery ownership posture and optional future true RSS collection if operators need OS-level memory accounting.
Make the host dashboard safer and more actionable before this fleet-health plumbing lands. The dashboard now binds to DASHBOARD_BIND_ADDR with a localhost default, serves through an http.Server with defensive read-header and idle timeouts, and keeps the existing remote-access path explicit for trusted operator networks. Improve operator signal by adding named red/amber host-summary issues, surfacing delivery runtime versus current config eligibility, and wiring real sites-per-second plus last-round duration from the orchestrator instead of placeholder zero values. Split durable process lifecycle from health rollup in jetmon_process_health, add bounded write contexts around fleet-health updates, and rename the stored/dashboard memory value to Go runtime system memory so it is not confused with operating-system RSS. Update config samples, operations docs, schema docs, roadmap status, and tests to match.
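As a rough illustration of the listener change described in this commit, the sketch below builds an http.Server with a loopback default and defensive timeouts. The function name, the timeout values, the port, and reading DASHBOARD_BIND_ADDR from the environment are all assumptions made for the example. The actual code may source the setting from its config file and use different durations.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"os"
	"strconv"
	"time"
)

// defaultBindAddr mirrors the localhost default described in the commit.
const defaultBindAddr = "127.0.0.1"

// newDashboardServer (hypothetical name) returns a server bound to bindAddr,
// falling back to loopback when bindAddr is empty. Timeout values are
// illustrative, not the ones used by the project.
func newDashboardServer(bindAddr string, port int, h http.Handler) *http.Server {
	if bindAddr == "" {
		bindAddr = defaultBindAddr
	}
	return &http.Server{
		Addr:              net.JoinHostPort(bindAddr, strconv.Itoa(port)),
		Handler:           h,
		ReadHeaderTimeout: 5 * time.Second,  // bounds slow header delivery
		IdleTimeout:       60 * time.Second, // reclaims idle keep-alive connections
	}
}

func main() {
	// Assumption: the bind address is taken from the environment here purely
	// for demonstration.
	srv := newDashboardServer(os.Getenv("DASHBOARD_BIND_ADDR"), 8080, http.NotFoundHandler())
	fmt.Println("dashboard would listen on", srv.Addr)
}
```

Constructing the server explicitly, rather than calling http.ListenAndServe directly, is what makes it possible to set ReadHeaderTimeout and IdleTimeout at all.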
Warn operators when the unauthenticated host dashboard binds to a non-loopback address, and document that validate-config/startup will surface the risk. Keep dependency health from reporting green before the first health sample has been published, cap dashboard SSE clients, and validate fleet-health state/status values before writing them to the durable process-health table. Serialize process-health publishing during shutdown so a late running heartbeat cannot overwrite stopping or stopped state while the monitor is draining.
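The non-loopback warning in this commit can be pictured with a small check like the one below. This is a sketch under assumptions: the helper names and the warning text are hypothetical, and the real validate-config logic may treat hostnames or wildcard addresses differently.

```go
package main

import (
	"fmt"
	"net"
)

// isLoopbackBind reports whether a bind host only accepts local connections.
// An empty host is treated as non-loopback because Go listens on all
// interfaces when no host is given.
func isLoopbackBind(host string) bool {
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// bindWarning returns a warning message for non-loopback binds and an empty
// string when the bind host is loopback-only.
func bindWarning(host string) string {
	if isLoopbackBind(host) {
		return ""
	}
	return fmt.Sprintf("dashboard is unauthenticated and bound to non-loopback address %q", host)
}

func main() {
	for _, host := range []string{"127.0.0.1", "::1", "0.0.0.0", "10.0.0.5"} {
		if w := bindWarning(host); w != "" {
			fmt.Println("WARN:", w)
		}
	}
}
```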
Move the host-dashboard fleet-health branch out of the candidate follow-up list and into the recently completed roadmap section. This keeps the roadmap aligned with the PR state before merging and leaves feature/fleet-dashboard as the next dashboard candidate branch.
Summary
This is PR 1 of the dashboard work split. It improves the existing per-host operator dashboard and adds the durable process-health plumbing needed for a later fleet dashboard PR.
This PR does not build the global fleet dashboard yet. It gives each monitor/deliverer process a durable heartbeat row and makes the host dashboard more useful during rollout, so the next PR can aggregate those rows across monitor hosts, deliverers, and supporting processes.
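To make the heartbeat-row idea concrete, here is a minimal sketch of a per-process publisher. It uses an in-memory map as a stand-in for the jetmon_process_health table and is not the internal/fleethealth API: the type names, the allowed state and health values, and the error are assumptions. It shows three properties described in this PR: rows keyed by <host>:<process_type>, validation before writing, and serialization so a late running heartbeat cannot overwrite a stopping or stopped row.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Snapshot is a hypothetical heartbeat row with lifecycle state kept separate
// from the health rollup.
type Snapshot struct {
	Host, ProcessType   string
	State, HealthStatus string
}

// Assumed value sets; the real table may allow more values.
var (
	validStates = map[string]bool{"running": true, "stopping": true, "stopped": true}
	validHealth = map[string]bool{"green": true, "amber": true, "red": true}
)

var ErrShuttingDown = errors.New("process is draining; running heartbeat dropped")

// Publisher belongs to a single process and serializes its writes.
type Publisher struct {
	mu       sync.Mutex
	draining bool
	rows     map[string]Snapshot // stand-in for the durable table
}

func NewPublisher() *Publisher { return &Publisher{rows: make(map[string]Snapshot)} }

// Publish validates the snapshot and upserts it under "<host>:<process_type>".
func (p *Publisher) Publish(s Snapshot) error {
	if !validStates[s.State] {
		return fmt.Errorf("invalid state %q", s.State)
	}
	if !validHealth[s.HealthStatus] {
		return fmt.Errorf("invalid health_status %q", s.HealthStatus)
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.draining && s.State == "running" {
		return ErrShuttingDown // a late heartbeat must not undo shutdown state
	}
	if s.State != "running" {
		p.draining = true
	}
	p.rows[s.Host+":"+s.ProcessType] = s
	return nil
}

// Get returns the stored row for a host and process type, if any.
func (p *Publisher) Get(host, processType string) (Snapshot, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	s, ok := p.rows[host+":"+processType]
	return s, ok
}

func main() {
	p := NewPublisher()
	base := Snapshot{Host: "host-a", ProcessType: "jetmon2", HealthStatus: "amber"}
	for _, state := range []string{"running", "stopping", "running"} {
		s := base
		s.State = state
		fmt.Printf("publish %s: err=%v\n", state, p.Publish(s))
	}
	row, _ := p.Get("host-a", "jetmon2")
	fmt.Println("stored state:", row.State, "health:", row.HealthStatus)
}
```

Because state and health are independent fields, the example row can be running and amber at the same time, which is the distinction a fleet view would rely on.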
What changed
Host dashboard
- Adds `GET /api/host`, which returns `state`, `health`, and `summary` in one local JSON response for tooling.
- Labels the Go runtime memory value as `go_sys_mem_mb`, avoiding confusion with OS RSS.
Fleet-health plumbing
- Adds the `jetmon_process_health` migration, including separate `state` and `health_status` fields.
- Adds `internal/fleethealth` for durable process snapshots keyed by `<host>:<process_type>`.
- Publishes from `jetmon2`: lifecycle state, health status, bucket ownership, worker/queue state, WPCOM circuit state, delivery-owner state, dependency health, version/build data, API/dashboard ports, and Go runtime memory.
- Publishes from `jetmon-deliverer`: active/idle ownership state, health status, MySQL/StatsD health, version/build data, and lifecycle state.
- Validates `state` and `health_status` before writing process-health rows.
- Serializes publishing during shutdown so a late `running` heartbeat cannot overwrite `stopping` or `stopped` state.
Rollout and security guardrails
- Adds `DASHBOARD_BIND_ADDR`, defaulting the dashboard listener to `127.0.0.1` instead of all interfaces.
- Adds startup and `validate-config` warnings when the dashboard is bound to a non-loopback address, because the dashboard is intentionally unauthenticated and exposes internal rollout context.
- Documents the `jetmon_process_health` data source for the future fleet dashboard.
Dashboard example
These are representative text snapshots of the rendered dashboard using example host data, so the before/after shape is visible in the PR without running the branch.
Before
The old page was a compact raw metric grid. It exposed useful fields, but the operator had to infer whether the host was safe to roll forward.
After
The new page opens with a host-level status summary, keeps rollout state separate from commands, and makes delivery-owner/config warnings visible before cutover.
The same summary is available to local tooling through:
curl http://127.0.0.1:${DASHBOARD_PORT}/api/host
Why
Rollout needs a dashboard that answers the sysadmin question first: "is this host safe to advance?" Raw metrics are still useful, but they are slower to interpret during a cutover window. This PR keeps the dashboard local and low-risk while adding the MySQL-backed process heartbeat model that the fleet dashboard can aggregate in the next PR.
The separate `health_status` field is intentional: a process can be `running` while degraded or red. That distinction matters for fleet views, stale heartbeat detection, rollout blockers, and eventual alerting around the monitoring system itself.
Validation
go test ./...
go vet ./...
make rollout-docs-verify