Add fleet-wide operator dashboard by chrisbliss18 · Pull Request #91 · Automattic/jetmon

chrisbliss18 · 2026-04-30T14:40:24Z

Why

This adds the fleet-wide operator view that sits on top of the host dashboard plumbing merged in #90. The host dashboard is useful when an operator is already looking at one server, but rollout decisions need a single view of monitor hosts, standalone deliverers, bucket coverage, stale heartbeats, delivery backlog, projection drift, dependency health, and the next safest action.

What changed

Adds /fleet and /api/fleet to the existing operator dashboard listener.
Builds fleet snapshots from jetmon_process_health, jetmon_hosts, webhook and alert delivery queues, projection drift, and dependency rollups.
Adds fleet-level red/amber/green summary rules with suggested next actions.
Treats stale process heartbeats as red rollout blockers and sorts unhealthy processes first.
Distinguishes dynamic, pinned, mixed, and unknown bucket ownership modes so pinned rollout does not look like broken dynamic coverage.
Adds explicit delivery-owner posture so operators can spot queued delivery work without a fresh worker, multiple owner values, or enabled delivery workers without DELIVERY_OWNER_HOST.
Consolidates delivery queue aggregation so each uncached fleet snapshot uses one indexed summary query per delivery table instead of six separate round trips per table.
Caches fleet snapshots briefly so multiple open dashboard tabs do not run the full query set on every refresh.
Returns generic /api/fleet backend errors while logging the real error server-side.
Sends no-store, nosniff, no-referrer, frame-denial, and CSP headers on dashboard HTML and JSON responses; rejects write methods on read-only dashboard routes; and returns 404 for unknown dashboard paths instead of falling back to the host dashboard.
Links host and fleet dashboards together.
Updates operations, rollout, migration, config, changelog, project, and roadmap docs to explain configuration and use.
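The brief snapshot caching mentioned in the list above can be pictured with a small time-to-live cache. This is a minimal sketch, not the PR's code: the `snapshotCache` type, its field names, the injectable clock, and the 5-second TTL are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Snapshot stands in for the full fleet snapshot; only the timestamp matters here.
type Snapshot struct {
	GeneratedAt time.Time
}

// snapshotCache is a hypothetical helper. It serves one cached snapshot until
// ttl has elapsed, then rebuilds it by calling load.
type snapshotCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time         // injectable clock so the behavior is testable
	load    func() (Snapshot, error) // stands in for the full fleet query set
	cached  Snapshot
	expires time.Time
}

// Get returns the cached snapshot while it is fresh. Holding the mutex during
// load means concurrent dashboard tabs wait for one rebuild instead of each
// running the queries.
func (c *snapshotCache) Get() (Snapshot, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.now().Before(c.expires) {
		return c.cached, nil
	}
	snap, err := c.load()
	if err != nil {
		return Snapshot{}, err // errors are not cached; the next call retries
	}
	c.cached, c.expires = snap, c.now().Add(c.ttl)
	return snap, nil
}

// demo calls Get twice inside the TTL and once after it, and reports how many
// times the loader actually ran.
func demo() int {
	clock := time.Date(2026, 4, 30, 14, 0, 0, 0, time.UTC)
	loads := 0
	c := &snapshotCache{
		ttl: 5 * time.Second, // assumed value; the PR only says "briefly"
		now: func() time.Time { return clock },
		load: func() (Snapshot, error) {
			loads++
			return Snapshot{GeneratedAt: clock}, nil
		},
	}
	c.Get() // miss: runs the loader
	c.Get() // hit: served from cache
	clock = clock.Add(6 * time.Second)
	c.Get() // expired: runs the loader again
	return loads
}

func main() {
	fmt.Println("loader calls:", demo())
}
```

In this sketch a failed load is never cached, so a transient database error does not pin the dashboard to an error page for the whole TTL.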

Example fleet dashboard output

An operator opening http://127.0.0.1:8080/fleet during a pinned rollout would see the most important action at the top:

Jetmon 2
Fleet dashboard

[amber] fleet needs operator attention
green=4 amber=0 red=0
next: Confirm whether the fleet is still in pinned rollout before expecting dynamic bucket coverage.

Issues
- bucket coverage: monitor process snapshots report pinned bucket ranges; dynamic jetmon_hosts coverage is not active

Fleet Rollup
Monitors:          3
Deliverers:        1
Stale Processes:   0
Bucket Coverage:   amber
  pinned - 0 dynamic hosts - dynamic ownership is not active
Delivery Due:      0
Projection Drift:  0

Delivery Ownership
Posture:       green
Enabled Hosts: jetmon-deliverer-1
Owner Hosts:   jetmon-deliverer-1

Delivery Queues
Kind      Pending  Due  Future Retry  Delivered  Failed  Abandoned  Oldest Due
webhook   0        0    0             42         0       0          0s ago
alert     0        0    0             18         0       0          0s ago

Bucket Owners
Host      Range  Status  Heartbeat
No dynamic bucket-owner rows found

Processes
Process                       Health  State    Heartbeat  Buckets  Queues
jetmon-a:monitor              green   running  4s ago     0-333    active=128 queue=40 retry=2
jetmon-b:monitor              green   running  5s ago     334-666  active=120 queue=36 retry=1
jetmon-c:monitor              green   running  4s ago     667-999  active=121 queue=38 retry=0
jetmon-deliverer-1:deliverer  green   running  3s ago     -        active=0 queue=0 retry=0

/api/fleet returns the same complete snapshot that the HTML dashboard renders. It is not a reduced API surface; operator scripts get the summary, process rows, bucket-owner rows, delivery queue tables, delivery posture, projection drift, and dependency rollups. This representative JSON example keeps the arrays short while showing the full response shape:

{
  "generated_at": "2026-04-30T14:00:00Z",
  "summary": {
    "status": "amber",
    "message": "fleet needs operator attention",
    "suggested_next_action": "Confirm whether the fleet is still in pinned rollout before expecting dynamic bucket coverage.",
    "issues": [
      "bucket coverage: monitor process snapshots report pinned bucket ranges; dynamic jetmon_hosts coverage is not active"
    ],
    "red_processes": 0,
    "amber_processes": 0,
    "green_processes": 4,
    "stale_processes": 0,
    "monitor_processes": 3,
    "deliverer_processes": 1,
    "dependency_red_count": 0,
    "dependency_amber_count": 0
  },
  "processes": [
    {
      "process_id": "jetmon-a:monitor",
      "host_id": "jetmon-a",
      "process_type": "monitor",
      "state": "running",
      "health_status": "green",
      "last_heartbeat_age_sec": 4,
      "stale": false,
      "bucket_min": 0,
      "bucket_max": 333,
      "bucket_ownership": "pinned range=0-333",
      "api_port": 8090,
      "dashboard_port": 8080,
      "delivery_workers_enabled": false,
      "delivery_owner_host": "jetmon-deliverer-1",
      "worker_count": 256,
      "active_checks": 128,
      "queue_depth": 40,
      "retry_queue_size": 2,
      "wpcom_circuit_open": false,
      "wpcom_queue_depth": 0,
      "go_sys_mem_mb": 96,
      "dependency_health": [
        {
          "name": "mysql",
          "status": "green",
          "latency_ms": 3,
          "checked_at": "2026-04-30T13:59:56Z"
        }
      ]
    }
  ],
  "process_counts": {
    "monitor": 3,
    "deliverer": 1
  },
  "bucket_coverage": {
    "status": "amber",
    "mode": "pinned",
    "bucket_total": 1000,
    "host_count": 0,
    "error": "monitor process snapshots report pinned bucket ranges; dynamic jetmon_hosts coverage is not active",
    "hosts": []
  },
  "delivery": {
    "status": "green",
    "pending": 0,
    "due_now": 0,
    "future_retry": 0,
    "delivered_since": 60,
    "abandoned_since": 0,
    "failed_since": 0,
    "oldest_pending_age_sec": 0,
    "oldest_due_age_sec": 0,
    "tables": [
      {
        "kind": "webhook",
        "pending": 0,
        "due_now": 0,
        "future_retry": 0,
        "delivered_since": 42,
        "abandoned_since": 0,
        "failed_since": 0,
        "oldest_pending_age_sec": 0,
        "oldest_due_age_sec": 0
      },
      {
        "kind": "alert",
        "pending": 0,
        "due_now": 0,
        "future_retry": 0,
        "delivered_since": 18,
        "abandoned_since": 0,
        "failed_since": 0,
        "oldest_pending_age_sec": 0,
        "oldest_due_age_sec": 0
      }
    ],
    "posture": {
      "status": "green",
      "enabled_process_count": 1,
      "enabled_hosts": [
        "jetmon-deliverer-1"
      ],
      "owner_hosts": [
        "jetmon-deliverer-1"
      ],
      "message": "delivery owner is constrained to jetmon-deliverer-1"
    }
  },
  "projection_drift": {
    "status": "green",
    "count": 0
  },
  "dependencies": [
    {
      "name": "mysql",
      "status": "green",
      "red_count": 0,
      "amber_count": 0,
      "green_count": 4,
      "stale_count": 0
    }
  ]
}
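An operator script could decode a response of the shape shown above with a few Go structs. This is a sketch under assumptions: the JSON field names are taken from the example, but the `fleetSnapshot` type and `parseFleet` function are hypothetical names, and only a subset of fields is decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fleetSnapshot decodes only the fields a simple readiness script needs;
// encoding/json ignores the JSON fields that are not listed here.
type fleetSnapshot struct {
	Summary struct {
		Status              string   `json:"status"`
		SuggestedNextAction string   `json:"suggested_next_action"`
		Issues              []string `json:"issues"`
		StaleProcesses      int      `json:"stale_processes"`
	} `json:"summary"`
	BucketCoverage struct {
		Status string `json:"status"`
		Mode   string `json:"mode"`
	} `json:"bucket_coverage"`
	Delivery struct {
		DueNow int `json:"due_now"`
	} `json:"delivery"`
	ProjectionDrift struct {
		Count int `json:"count"`
	} `json:"projection_drift"`
}

// parseFleet is a hypothetical helper that turns an /api/fleet body into the
// struct above.
func parseFleet(body []byte) (fleetSnapshot, error) {
	var s fleetSnapshot
	if err := json.Unmarshal(body, &s); err != nil {
		return fleetSnapshot{}, fmt.Errorf("decode /api/fleet response: %w", err)
	}
	return s, nil
}

// sample is a trimmed copy of the representative response in this description.
const sample = `{
  "summary": {
    "status": "amber",
    "suggested_next_action": "Confirm whether the fleet is still in pinned rollout before expecting dynamic bucket coverage.",
    "issues": ["bucket coverage: monitor process snapshots report pinned bucket ranges; dynamic jetmon_hosts coverage is not active"],
    "stale_processes": 0
  },
  "bucket_coverage": {"status": "amber", "mode": "pinned"},
  "delivery": {"due_now": 0},
  "projection_drift": {"count": 0}
}`

func main() {
	s, err := parseFleet([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("[%s] mode=%s next: %s\n", s.Summary.Status, s.BucketCoverage.Mode, s.Summary.SuggestedNextAction)
}
```

A real script would read the body from an HTTP GET against the dashboard listener instead of the embedded sample.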

Operational notes

The host and fleet dashboards share DASHBOARD_PORT and DASHBOARD_BIND_ADDR.
The fleet dashboard does not scrape other host dashboards; any jetmon2 monitor dashboard connected to the same MySQL database can serve the fleet view from shared process-health, bucket, delivery, and event state.
The listener remains unauthenticated, so it defaults to loopback and should only be exposed through SSH tunnels or trusted operator-network controls.
During pinned rollout, amber mode=pinned is expected. After dynamic ownership cutover, operators should expect green mode=dynamic, fresh jetmon_hosts coverage, no stale processes, zero projection drift, and no failed or abandoned delivery rows.
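The post-cutover expectations in the last note can be written down as a small check. This is an illustrative sketch only: `cutoverView` and `dynamicCutoverProblems` are hypothetical names, and the struct simply carries values an operator script would copy out of the /api/fleet response.

```go
package main

import "fmt"

// cutoverView holds the /api/fleet values relevant to the dynamic-ownership
// expectations. The field names are illustrative, not the PR's types.
type cutoverView struct {
	CoverageStatus  string // bucket_coverage.status
	CoverageMode    string // bucket_coverage.mode
	StaleProcesses  int    // summary.stale_processes
	ProjectionDrift int    // projection_drift.count
	FailedSince     int    // delivery.failed_since
	AbandonedSince  int    // delivery.abandoned_since
}

// dynamicCutoverProblems lists every expectation from the operational notes
// that the snapshot does not meet yet. An empty result means the snapshot
// matches the post-cutover picture.
func dynamicCutoverProblems(v cutoverView) []string {
	var problems []string
	if v.CoverageStatus != "green" || v.CoverageMode != "dynamic" {
		problems = append(problems, fmt.Sprintf(
			"bucket coverage is %s mode=%s, want green mode=dynamic",
			v.CoverageStatus, v.CoverageMode))
	}
	if v.StaleProcesses > 0 {
		problems = append(problems, fmt.Sprintf("%d stale processes", v.StaleProcesses))
	}
	if v.ProjectionDrift != 0 {
		problems = append(problems, fmt.Sprintf("projection drift is %d, want 0", v.ProjectionDrift))
	}
	if v.FailedSince > 0 || v.AbandonedSince > 0 {
		problems = append(problems, fmt.Sprintf(
			"delivery rows failed=%d abandoned=%d, want 0", v.FailedSince, v.AbandonedSince))
	}
	return problems
}

func main() {
	// Values from the pinned-rollout example in this description.
	pinned := cutoverView{CoverageStatus: "amber", CoverageMode: "pinned"}
	for _, p := range dynamicCutoverProblems(pinned) {
		fmt.Println("not ready:", p)
	}
}
```

During pinned rollout the first check is expected to fire, which matches the note that amber mode=pinned is normal at that stage.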

Validation

go test ./...
go vet ./...
make rollout-docs-verify

Introduce a MySQL-backed fleet dashboard source that aggregates jetmon_process_health, jetmon_hosts bucket ownership, delivery queue summaries, projection drift, and dependency health into a single fleet snapshot. Expose the snapshot through /api/fleet and add a /fleet operator page with red/amber/green summary, stale heartbeat detection, delivery-owner posture, bucket coverage, process rows, dependency rollups, and suggested next actions. Document the new fleet dashboard surface and update the dashboard roadmap items completed by this first implementation slice.

Tighten the fleet rollup after the review pass so stale process snapshots are treated as red rollout blockers and sorted ahead of healthy rows. Delivery posture now ignores stale or stopped processes, warns when queued delivery rows have no fresh worker, and treats enabled workers without DELIVERY_OWNER_HOST as an amber guardrail instead of a green state. Clarify bucket ownership mode by reporting pinned and mixed rollout states separately from dynamic jetmon_hosts coverage. This avoids false red dynamic-coverage failures during pinned rollout while still surfacing mixed ownership as operator attention. Improve the operator interface by linking host and fleet dashboards, adding per-table delivery queue and bucket-owner details, caching fleet snapshots briefly, refreshing fleet dashboard config on SIGHUP, and returning generic fleet API failures without leaking backend error text. Document the shared dashboard exposure model and mark the fleet dashboard exposure follow-up complete in the roadmap.
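The stale-as-red rule and the unhealthy-first ordering described in this commit can be sketched as a sort comparator. The `process` type and helper names below are hypothetical, and the monitor-before-deliverer tie-break reflects a later commit in this PR; the real implementation may rank rows differently.

```go
package main

import (
	"fmt"
	"sort"
)

// process is a reduced, illustrative view of one fleet process row.
type process struct {
	ID     string
	Type   string // "monitor" or "deliverer"
	Health string // self-reported: "green", "amber", or "red"
	Stale  bool   // heartbeat is older than the freshness window
}

// effectiveHealth treats a stale heartbeat as red no matter what the process
// last reported about itself.
func effectiveHealth(p process) string {
	if p.Stale {
		return "red"
	}
	return p.Health
}

// urgency maps health to a sort rank: red first, then amber, then everything else.
func urgency(p process) int {
	switch effectiveHealth(p) {
	case "red":
		return 0
	case "amber":
		return 1
	default:
		return 2
	}
}

// typeRank places monitors ahead of other process types at equal urgency.
func typeRank(p process) int {
	if p.Type == "monitor" {
		return 0
	}
	return 1
}

// sortProcesses orders rows by urgency, then process type, then ID.
func sortProcesses(ps []process) {
	sort.SliceStable(ps, func(i, j int) bool {
		if ui, uj := urgency(ps[i]), urgency(ps[j]); ui != uj {
			return ui < uj
		}
		if ti, tj := typeRank(ps[i]), typeRank(ps[j]); ti != tj {
			return ti < tj
		}
		return ps[i].ID < ps[j].ID
	})
}

func main() {
	ps := []process{
		{ID: "jetmon-deliverer-1:deliverer", Type: "deliverer", Health: "green"},
		{ID: "jetmon-b:monitor", Type: "monitor", Health: "green"},
		{ID: "jetmon-a:monitor", Type: "monitor", Health: "green", Stale: true},
		{ID: "jetmon-c:monitor", Type: "monitor", Health: "amber"},
	}
	sortProcesses(ps)
	for _, p := range ps {
		fmt.Println(effectiveHealth(p), p.ID)
	}
}
```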

Move feature/fleet-dashboard out of the future candidate list now that this branch implements the global fleet dashboard surface. Update the project overview so jetmon_process_health is described as the current data source behind /fleet, alongside jetmon_hosts, delivery queues, projection drift, and dependency rollups, instead of only future dashboard plumbing.

Add operator-facing fleet dashboard guidance before opening the PR. The operations guide now covers dashboard configuration, local and tunneled access, the read-only /api/fleet endpoint, top-level red/amber/green interpretation, pinned versus dynamic bucket modes, delivery-owner posture, and direct SQL checks for process health and delivery queues. Update rollout and migration docs so operators know when to use the host dashboard and /fleet during pinned cutover and final dynamic ownership validation. Clarify that DASHBOARD_BIND_ADDR protects both unauthenticated host and fleet dashboard views, and clean up the changelog wording now that the fleet dashboard exists on this branch.

Address the second review pass across operator usefulness, security, and code quality. The dashboard now sends no-store and nosniff headers on read-only HTML and JSON responses, rejects write methods across the read-only dashboard routes, and keeps SSE constrained to GET. Improve the fleet UI by showing when additional summary issues are hidden, removing empty bucket-detail separators, and avoiding misleading oldest-due ages when no delivery rows are due. Clarify in the operations guide that /fleet does not scrape other host dashboards. Any monitor dashboard can serve the same fleet view from shared MySQL state, while standalone deliverers publish process-health rows but do not serve the dashboard. Validation run before this commit: go test ./..., go vet ./..., and make rollout-docs-verify.

Return 404 for unknown dashboard paths instead of falling through to the host dashboard. This keeps URL mistakes and stale links from rendering a misleading page. Add stronger browser hardening headers to read-only dashboard responses, including CSP, frame denial, no-referrer, no-store, and nosniff. The dashboard remains intentionally simple, but it should still behave like an internal operational surface. Ignore stale or stopped monitor snapshots when deriving the fleet bucket ownership mode so old pinned rollout data cannot make an otherwise healthy dynamic fleet look mixed. Also keep healthy monitors ahead of deliverers in the process table when urgency is otherwise equal. Validated with go test ./..., go vet ./..., and make rollout-docs-verify.
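The routing and header behavior described across the last two commits could be approximated with a single wrapper around the read-only routes. This is a sketch under assumptions: the header names are standard HTTP headers, but the exact CSP value, the 405 status for write methods, allowing HEAD alongside GET, and serving the host dashboard at `/` are guesses, not details confirmed by the PR.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// setReadOnlyHeaders applies the hardening headers named in the commit
// message. The CSP string is an assumed example policy.
func setReadOnlyHeaders(h http.Header) {
	h.Set("Cache-Control", "no-store")
	h.Set("X-Content-Type-Options", "nosniff")
	h.Set("Referrer-Policy", "no-referrer")
	h.Set("X-Frame-Options", "DENY")
	h.Set("Content-Security-Policy", "default-src 'self'; frame-ancestors 'none'")
}

// dashboardHandler serves only exact known paths, answers 404 for anything
// else, and rejects methods other than GET and HEAD.
func dashboardHandler(known map[string]http.HandlerFunc) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		setReadOnlyHeaders(w.Header())
		page, ok := known[r.URL.Path]
		if !ok {
			http.NotFound(w, r) // no fallback to the host dashboard
			return
		}
		if r.Method != http.MethodGet && r.Method != http.MethodHead {
			w.Header().Set("Allow", "GET, HEAD")
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		page(w, r)
	})
}

// newDemoHandler wires placeholder pages onto the three assumed routes.
func newDemoHandler() http.Handler {
	ok := func(w http.ResponseWriter, r *http.Request) { fmt.Fprintln(w, "ok") }
	return dashboardHandler(map[string]http.HandlerFunc{
		"/":          ok,
		"/fleet":     ok,
		"/api/fleet": ok,
	})
}

// probe sends one in-memory request and returns the recorded response.
func probe(h http.Handler, method, path string) *httptest.ResponseRecorder {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec
}

func main() {
	h := newDemoHandler()
	fmt.Println(probe(h, "GET", "/fleet").Code)
	fmt.Println(probe(h, "GET", "/flet").Code)
	fmt.Println(probe(h, "POST", "/api/fleet").Code)
}
```

Matching exact paths in a map, rather than registering `/` as a catch-all pattern, is what keeps a mistyped URL from rendering the host dashboard in this sketch.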

Collapse each delivery table summary from six separate aggregate queries into one UNION ALL aggregate query. This keeps the same indexed predicates and dashboard fields while cutting delivery-summary database round trips from twelve to two per uncached fleet snapshot. Remove the duplicated active-process helper logic by routing bucket-ownership and delivery-posture checks through one shared fleet-rollup helper. Add SQL coverage for the consolidated delivery summary path so the query shape and aggregate field mapping stay explicit. Validated with go test ./..., go vet ./..., and make rollout-docs-verify.
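The shape of a consolidated summary query can be illustrated by building one UNION ALL statement per delivery table. Everything schema-related below is hypothetical: the table names, the `status`, `next_attempt_at`, and `updated_at` columns, and the predicates are placeholders chosen for the sketch, and the oldest-age fields are left out. Only the idea of six aggregate branches in one round trip comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// deliveryTables maps a delivery kind to a table name. The names are
// placeholders; an allowlist keeps arbitrary input out of the SQL text.
var deliveryTables = map[string]string{
	"webhook": "example_webhook_deliveries",
	"alert":   "example_alert_deliveries",
}

// deliverySummaryQuery builds one statement that returns a (metric, n) row per
// aggregate. The three "?" placeholders take the same "since" timestamp.
func deliverySummaryQuery(kind string) (string, error) {
	table, ok := deliveryTables[kind]
	if !ok {
		return "", fmt.Errorf("unknown delivery kind %q", kind)
	}
	// Hypothetical predicates, one per dashboard field.
	branches := []struct{ metric, where string }{
		{"pending", "status = 'pending'"},
		{"due_now", "status = 'pending' AND next_attempt_at <= NOW()"},
		{"future_retry", "status = 'pending' AND next_attempt_at > NOW()"},
		{"delivered_since", "status = 'delivered' AND updated_at >= ?"},
		{"failed_since", "status = 'failed' AND updated_at >= ?"},
		{"abandoned_since", "status = 'abandoned' AND updated_at >= ?"},
	}
	parts := make([]string, 0, len(branches))
	for _, b := range branches {
		parts = append(parts, fmt.Sprintf(
			"SELECT '%s' AS metric, COUNT(*) AS n FROM %s WHERE %s",
			b.metric, table, b.where))
	}
	return strings.Join(parts, "\nUNION ALL\n"), nil
}

// countBranches reports how many SELECT branches a built query contains.
func countBranches(q string) int {
	return strings.Count(q, "UNION ALL") + 1
}

func main() {
	q, err := deliverySummaryQuery("webhook")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(q)
}
```

Because each branch keeps its own WHERE clause, every aggregate can still use the same index it used as a standalone query; the saving is in round trips, not in the work done per predicate.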

Chris Jean added 7 commits April 29, 2026 22:57

chrisbliss18 merged commit 2d06e4a into v2 Apr 30, 2026

chrisbliss18 deleted the feature/fleet-dashboard branch April 30, 2026 17:24
