Improve host dashboard and add fleet health plumbing #90

Merged
chrisbliss18 merged 5 commits into v2 from feature/host-dashboard-fleet-plumbing
Apr 30, 2026

chrisbliss18 commented Apr 30, 2026

Summary

This is PR 1 of the dashboard work split. It improves the existing per-host operator dashboard and adds the durable process-health plumbing needed for a later fleet dashboard PR.

This PR does not build the global fleet dashboard yet. It gives each monitor/deliverer process a durable heartbeat row and makes the host dashboard more useful during rollout, so the next PR can aggregate those rows across monitor hosts, deliverers, and supporting processes.

What changed

Host dashboard

  • Adds a host-level red/amber/green summary that puts rollout blockers and operator warnings at the top of the page.
  • Adds GET /api/host, which returns state, health, and summary in one local JSON response for tooling.
  • Splits rollout state from operator commands so the dashboard is easier to scan during cutover.
  • Shows real monitor throughput and last-round duration from the orchestrator instead of placeholder values.
  • Shows Go runtime system memory as go_sys_mem_mb, avoiding confusion with OS RSS.
  • Treats missing dependency health as amber so the dashboard does not briefly report green before the first dependency probe completes.
  • Caps SSE clients and keeps HTTP server timeouts in place so the unauthenticated local dashboard is harder to abuse accidentally.
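
The "missing dependency health is amber" rule above can be sketched as a small rollup function. This is a minimal illustration, not the code in this branch: the `Status` type, its constants, and `Rollup` are hypothetical names, and the real summary also carries named issues that are omitted here.

```go
package main

import "fmt"

// Status is a hypothetical red/amber/green level used only for this sketch.
type Status int

const (
	Green Status = iota
	Amber
	Red
)

func (s Status) String() string {
	switch s {
	case Green:
		return "green"
	case Amber:
		return "amber"
	default:
		return "red"
	}
}

// Rollup returns the worst status across all dependency probes.
// An empty slice means no probe has completed yet, so the result is
// amber rather than green.
func Rollup(deps []Status) Status {
	if len(deps) == 0 {
		return Amber
	}
	worst := Green
	for _, d := range deps {
		if d > worst {
			worst = d
		}
	}
	return worst
}

func main() {
	fmt.Println("before first probe:", Rollup(nil))
	fmt.Println("all probes green:  ", Rollup([]Status{Green, Green}))
	fmt.Println("one probe amber:   ", Rollup([]Status{Green, Amber, Green}))
}
```

Treating "no data" as its own non-green case keeps the host summary from briefly showing a status that no probe has actually confirmed.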

Fleet-health plumbing

  • Adds migrations 24 and 25 for jetmon_process_health, including separate state and health_status fields.
  • Adds internal/fleethealth for durable process snapshots keyed by <host>:<process_type>.
  • Publishes monitor snapshots from jetmon2: lifecycle state, health status, bucket ownership, worker/queue state, WPCOM circuit state, delivery-owner state, dependency health, version/build data, API/dashboard ports, and Go runtime memory.
  • Publishes standalone deliverer snapshots from jetmon-deliverer: active/idle ownership state, health status, MySQL/StatsD health, version/build data, and lifecycle state.
  • Validates process state and health_status before writing process-health rows.
  • Bounds process-health DB writes and serializes shutdown publishing so a late running heartbeat cannot overwrite stopping or stopped state.

Rollout and security guardrails

  • Adds DASHBOARD_BIND_ADDR, defaulting the dashboard listener to 127.0.0.1 instead of all interfaces.
  • Emits startup and validate-config warnings when the dashboard is bound to a non-loopback address, because the dashboard is intentionally unauthenticated and exposes internal rollout context.
  • Shows delivery runtime state separately from current config eligibility, so operators can see when a restart is needed for delivery-owner config changes to take effect.
  • Updates the roadmap and operational/data-model docs to describe the host-dashboard PR, the future fleet-dashboard PR, and the jetmon_process_health data source.
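
The bind-address guardrail can be approximated with the standard library's loopback check. This is a sketch under assumptions: it treats DASHBOARD_BIND_ADDR as a bare host or IP with no port, and `resolveBindAddr` and `isLoopback` are hypothetical helpers, not functions from this branch.

```go
package main

import (
	"fmt"
	"net"
)

// defaultBindAddr mirrors the localhost default described above.
const defaultBindAddr = "127.0.0.1"

// isLoopback reports whether addr is "localhost" or a loopback IP literal.
func isLoopback(addr string) bool {
	if addr == "localhost" {
		return true
	}
	ip := net.ParseIP(addr)
	return ip != nil && ip.IsLoopback()
}

// resolveBindAddr applies the default and returns a warning string when the
// configured address would expose the unauthenticated dashboard off-host.
func resolveBindAddr(configured string) (addr, warning string) {
	if configured == "" {
		return defaultBindAddr, ""
	}
	if !isLoopback(configured) {
		return configured, fmt.Sprintf(
			"dashboard is unauthenticated and bound to non-loopback address %q", configured)
	}
	return configured, ""
}

func main() {
	for _, configured := range []string{"", "127.0.0.1", "::1", "0.0.0.0", "10.0.0.5"} {
		addr, warning := resolveBindAddr(configured)
		fmt.Printf("configured=%q addr=%q warning=%q\n", configured, addr, warning)
	}
}
```

A hostname other than `localhost` is reported as non-loopback here because it is not resolved; that errs on the side of warning.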

Dashboard example

These are representative text snapshots of the rendered dashboard using example host data, so the before/after shape is visible in the PR without running the branch.

Before

The old page was a compact raw metric grid. It exposed useful fields, but the operator had to infer whether the host was safe to roll forward.

Jetmon 2
jetmon-v2-a

CHECK POOL
GOROUTINES     24
ACTIVE CHECKS  3
QUEUE DEPTH    12
RETRY QUEUE    1

THROUGHPUT
SITES/SEC   0
ROUND TIME  8.5s
BUCKETS     0-99
RSS         91MB

ROLLOUT
OWNERSHIP           pinned range=0-99
LEGACY PROJECTION   enabled
DELIVERY WORKERS    enabled
DELIVERY OWNER      unset
PREFLIGHT           ./jetmon2 rollout host-preflight --file=<ranges.csv> ...
ACTIVITY            ./jetmon2 rollout activity-check --since=15m
ROLLBACK            ./jetmon2 rollout rollback-check
DRIFT REPORT        ./jetmon2 rollout projection-drift

EXTERNAL DEPENDENCIES
mysql: green 2ms
wpcom: green
statsd: amber - statsd client is not initialized
disk:logs: green
disk:stats: green
veriflier:iad: green 8ms

After

The new page opens with a host-level status summary, keeps rollout state separate from commands, and makes delivery-owner/config warnings visible before cutover.

Jetmon 2
Host dashboard                                           [AMBER]

operator attention needed before rollout
dependencies green=5 amber=1 red=0
- delivery amber: runtime workers are enabled without DELIVERY_OWNER_HOST

host: jetmon-v2-a                         updated: 10:42:15 AM

CHECK POOL
Goroutines       24
Active Checks    3
Queue Depth      12
Retry Queue      1

THROUGHPUT
Sites/Sec        173
Round Time       8.5s
Buckets          0-99
Go Sys Memory    91MB

ROLLOUT STATE
Ownership          pinned range=0-99
Legacy Projection  enabled
Delivery Runtime   enabled
Config Eligibility eligible
Delivery Owner     unset
WPCOM Circuit      closed
WPCOM Queue        0

OPERATOR COMMANDS
State Report  ./jetmon2 rollout state-report --since=15m
Preflight     ./jetmon2 rollout host-preflight --file=<ranges.csv> --host=<v1-hostname> --runtime-host=<v2-hostname> --bucket-min=0 --bucket-max=99 --bucket-total=1000
Cutover       ./jetmon2 rollout cutover-check --since=15m
Activity      ./jetmon2 rollout activity-check --since=15m
Rollback      ./jetmon2 rollout rollback-check
Drift Report  ./jetmon2 rollout projection-drift

EXTERNAL DEPENDENCIES
mysql: green 2ms
wpcom: green
statsd: amber - statsd client is not initialized
disk:logs: green
disk:stats: green
veriflier:iad: green 8ms

The same summary is available to local tooling through:

curl http://127.0.0.1:${DASHBOARD_PORT}/api/host

Why

Rollout needs a dashboard that answers the sysadmin question first: "is this host safe to advance?" Raw metrics are still useful, but they are slower to interpret during a cutover window. This PR keeps the dashboard local and low-risk while adding the MySQL-backed process heartbeat model that the fleet dashboard can aggregate in the next PR.

The separate health_status field is intentional: a process can be running while degraded or red. That distinction matters for fleet views, stale heartbeat detection, rollout blockers, and eventual alerting around the monitoring system itself.

Validation

  • go test ./...
  • go vet ./...
  • make rollout-docs-verify

Chris Jean added 5 commits April 29, 2026 21:47
Improve the existing per-host operator dashboard while adding the durable process-health plumbing needed for a future fleet dashboard. The dashboard now exposes a combined /api/host snapshot, renders a clearer red/amber/green host summary, and surfaces the rollout commands operators need during cutover.

Add the jetmon_process_health migration plus a small fleethealth package that upserts process heartbeat snapshots. The monitor publishes bucket, queue, WPCOM, dependency, RSS, version, and delivery-owner state; the standalone deliverer publishes active/idle owner state, DB/StatsD health, RSS, version, and lifecycle state.

Update roadmap and operator/data-model docs to describe the two-PR dashboard plan, the process health table, and how operators can inspect the host snapshot and future fleet dashboard data source.

Expand the dashboard roadmap checklist with the specific review decisions for the current host-dashboard branch: localhost-first binding, red/amber issue summaries, separate process lifecycle and health rollup state, real per-host throughput metrics, and clearer Go runtime memory labeling.

Also record the fleet-dashboard follow-ups that should stay out of this PR but not be forgotten: explicit delivery ownership posture and optional future true RSS collection if operators need OS-level memory accounting.

Make the host dashboard safer and more actionable before this fleet-health plumbing lands. The dashboard now binds to DASHBOARD_BIND_ADDR with a localhost default, serves through an http.Server with defensive read-header and idle timeouts, and keeps the existing remote-access path explicit for trusted operator networks.

Improve operator signal by adding named red/amber host-summary issues, surfacing delivery runtime versus current config eligibility, and wiring real sites-per-second plus last-round duration from the orchestrator instead of placeholder zero values.

Split durable process lifecycle from health rollup in jetmon_process_health, add bounded write contexts around fleet-health updates, and rename the stored/dashboard memory value to Go runtime system memory so it is not confused with operating-system RSS. Update config samples, operations docs, schema docs, roadmap status, and tests to match.

Warn operators when the unauthenticated host dashboard binds to a non-loopback address, and document that validate-config/startup will surface the risk.

Keep dependency health from reporting green before the first health sample has been published, cap dashboard SSE clients, and validate fleet-health state/status values before writing them to the durable process-health table.

Serialize process-health publishing during shutdown so a late running heartbeat cannot overwrite stopping or stopped state while the monitor is draining.

Move the host-dashboard fleet-health branch out of the candidate follow-up list and into the recently completed roadmap section.

This keeps the roadmap aligned with the PR state before merging and leaves feature/fleet-dashboard as the next dashboard candidate branch.

chrisbliss18 merged commit 31bf48a into v2 Apr 30, 2026
chrisbliss18 deleted the feature/host-dashboard-fleet-plumbing branch April 30, 2026 03:47