Skip to content

Claude/monitoring dashboard worker tt4 x7#87

Merged
maximusunc merged 28 commits into
mainfrom
claude/monitoring-dashboard-worker-Tt4X7
May 18, 2026
Merged

Claude/monitoring dashboard worker tt4 x7#87
maximusunc merged 28 commits into
mainfrom
claude/monitoring-dashboard-worker-Tt4X7

Conversation

@maximusunc
Copy link
Copy Markdown
Collaborator

No description provided.

claude and others added 28 commits May 14, 2026 11:04
Introduces a new ``shepherd_monitor`` service that aggregates real-time
operational state for the rest of the platform and serves it via a
single-page dashboard with websocket updates.

- Worker heartbeats: every consumer that goes through
  ``shepherd_utils.shared.get_tasks`` now self-registers in Redis under
  ``worker:heartbeat:{stream}:{consumer}`` with a 15s TTL. The monitor
  reads these to count alive/stale workers per type, so worker counts
  work under both docker compose and kubernetes autoscaling without
  introspecting deployment config.
- Poller: snapshots stream depths (XLEN, XPENDING, XINFO CONSUMERS),
  Postgres query/callback state, and Redis health every few seconds.
- Rolling history: per-metric Redis sorted sets, retention configurable
  via ``MONITOR_HISTORY_DAYS`` (default 3).
- Alerts: YAML rule config with threshold / heartbeat_lost /
  oldest_callback_age types, per-rule cooldowns in Redis, Slack webhook
  + SMTP dispatch (both off by default until creds are set).
- Dashboard: vanilla HTML + Chart.js served at port 5440, websocket
  push at ``/ws`` with polling fallback.
The monitor surfaced three problems with how shepherd cleans up after
itself. Fix each at the source and add a periodic janitor in the monitor
to sweep up existing leakage.

1. Redis streams: ``mark_task_as_complete`` only acked messages, so
   XLEN grew without bound. Delete each message right after the XACK.

2. Postgres ``callbacks`` table: only the timeout path in lookup
   workers and the happy path in ``merge_message`` ever removed rows,
   so any other completion path left them behind. ``finish_query`` now
   reaps callback rows for every query it terminates. Adds a
   ``reap_completed_callbacks`` helper for the janitor.

3. Per-ARA dashboard panel: the ``domain`` column was never written.
   ``add_query`` now accepts ``target`` and stores it; ``run_query``
   passes it through.

4. Monitor janitor: new ``workers/monitor/janitor.py`` runs every 5
   minutes and (a) trims each stream via XTRIM MINID using either the
   smallest pending id or last-delivered-id+1 so in-flight messages are
   never deleted, and (b) reaps any callback rows whose parent query is
   already COMPLETED. ``POST /api/admin/cleanup`` triggers it on demand.
Three follow-on cleanups for issues the dashboard surfaced:

* Stale consumer entries: every worker restart picks a new UUID for
  CONSUMER, leaving dozens of phantom entries inside each Redis Streams
  consumer group. The janitor now cross-references heartbeat keys (the
  authoritative "alive" set) against XINFO CONSUMERS and runs XGROUP
  DELCONSUMER on entries that have no heartbeat, no pending messages,
  and have been idle for more than an hour. This also fixes the
  inflated "max idle" values in the dashboard, which were dominated by
  these phantoms.

* History memory: writing a snapshot every poll tick (default 3s) was
  the dominant Redis consumer. New ``monitor_history_interval_sec``
  setting (default 30s) decouples history persistence from the live UI
  tick rate. Each series is also now capped by sample count
  (``HISTORY_MAX_SAMPLES = 10000``) in addition to the existing time
  cutoff -- so the next write after deploying will shrink any
  oversized series in place.
Workers that get killed mid-task leave pending messages stuck in the
consumer group's pending-entries list, referencing a consumer that will
never come back. The earlier ``cleanup_stale_consumers`` pass
deliberately leaves these alone (it only deletes consumers with zero
pending) because dropping in-flight work is destructive.

New ``POST /api/admin/reclaim_dead_consumers`` endpoint enumerates
those messages, ACKs and XDELs them, then removes the dead consumer.
Defaults to ``min_idle_seconds=3600`` and exposes a ``dry_run`` flag so
the caller can preview the impact first. Intentionally not wired into
the periodic janitor -- this is for manual triage.

Note: in production a real fix is XAUTOCLAIM at worker startup so
in-flight tasks get retried instead of dropped on restart. That's a
separate, larger change.
Every worker now periodically scans its stream's pending-entries list
and uses XCLAIM to take over messages whose owner is no longer alive.
This closes the gap where a crashed or restarted worker would strand
in-flight tasks forever -- they now get retried by another worker on
the same stream.

Multi-consumer safety is enforced by two independent checks:

1. **Heartbeat filter**: messages whose owner has a live heartbeat key
   are never claimed, even if their idle time is high.
2. **Idle floor** (``reclaim_min_idle_sec``, default 300s): the message
   must have been idle longer than the longest plausible legitimate
   task processing time. Even a momentary heartbeat refresh failure on
   an actively-working consumer cannot trigger a claim within this
   buffer.

Reclaim runs once per ``reclaim_interval_sec`` (default 30s) inside the
existing ``get_tasks`` poll loop. Reclaimed messages are fed through
the normal task pipeline so the existing wrap_up_task /
handle_task_failure error paths apply. Concurrent reclaim by multiple
consumers is safe because XCLAIM is atomic per message; at most one
caller wins each message.
A single global ``reclaim_min_idle_sec`` of 300s was wrong on both ends:
too long for fast filter workers (orphans wouldn't be retried until well
past the upstream 5-minute query budget), and barely enough for lookup
workers that legitimately run up to ~210s.

Replace it with a per-stream table in ``reclaim.PER_STREAM_MIN_IDLE_SEC``
keyed by worker stream name. Each entry sits just above that worker's
worst-case legitimate processing time:

  - aragorn.lookup / bte.lookup / pathfinder / omnicorp / score: 240s
  - merge_message / arax.rank / score_paths: 60s
  - example.score: 30s
  - everything else (fast filter, entry, finish workers): 30s default

Streams not in the table fall back to ``settings.reclaim_min_idle_sec``,
which now defaults to 30s. ``reclaim_interval_sec`` drops to 10s so a
crashed fast-worker task can be retried with time to spare.

``get_tasks`` also accepts an explicit ``reclaim_min_idle_sec`` override
for workers that prefer to declare their threshold inline.
Three related improvements to how the dashboard handles workers coming
and going:

1. Persistent worker visibility. The dashboard used to derive its
   worker list purely from live heartbeats, so a worker that died
   silently disappeared from the UI. We now persist every worker type
   we've ever seen in ``monitor:known_workers`` plus a
   ``monitor:worker_state:{stream}`` hash. The snapshot always includes
   the union of current heartbeats and historical workers, and each
   card carries a ``state`` (alive / scaled_down / crashed / unknown)
   that drives its color and the new state pill in the corner.

2. Clean shutdown signalling. Workers now trap SIGTERM and SIGINT in
   ``Heartbeat`` and synchronously write a
   ``worker:shutdown:{stream}:{consumer}`` marker (TTL 120s, well over
   the heartbeat TTL) before re-raising the signal. The poller uses
   that marker at the moment a worker type transitions from alive to
   zero: marker present means a clean scale-down, marker absent means a
   crash. The decision is locked in at the transition so a marker
   expiring later doesn't flip the state back.

3. Crash-only critical alerts. The Slack/email alert path that fires
   when a worker hits zero alive now checks ``ev["kind"] == "crashed"``
   and stays silent for clean scale-downs. The dashboard's new "Recent
   Scaling Events" panel still surfaces both kinds, color-coded.

Cards also show backlog and a utilization bar (backlog / capacity) so
that when autoscaling kicks in you can see whether worker count is
keeping up with load. New admin endpoint
``POST /api/admin/forget_worker?name=X`` clears a retired worker type
from the known set.
A clean scale-down to zero is still a problem -- the user wants at
least one instance of every worker type running at all times. Fire a
critical alert in both cases; the message text differentiates a crash
from a clean shutdown so the operator knows which one happened.
New durable historical view that complements the live dashboard. Live
reads stay in Redis (recent, fast); historical reads go to Postgres
(30-day retention, survives Redis flushes).

Schema (three new tables, added to init_db.sql and self-healed on
monitor startup via ``storage.ensure_schema``):

* ``monitor_metrics(ts, metric, value)`` -- generic time-series.
  Poller now dual-writes every tick: Redis (live) + Postgres (history).
  Also captures Redis used memory, Postgres DB size, and per-worker
  capacity/backlog, so the History tab can show infra trends.
* ``monitor_events(ts, type, worker, severity, detail, payload)`` --
  scale_up/scale_down events from the poller and fired alerts from the
  alert engine are mirrored here.
* ``monitor_task_latency(ts, stream, count, mean/p50/p90/p95/p99/min/max
  _ms)`` -- per-worker task durations aggregated into 30s buckets.

Latency capture is automatic: ``shared.py:get_tasks`` stamps each
delivered task with ``_started_at``, and ``wrap_up_task`` /
``handle_task_failure`` push the elapsed ms onto a bounded Redis list.
A new ``latency.aggregator_loop`` in the monitor drains those lists
every 30s, sorts and percentiles in-process, then inserts one row per
stream into Postgres. No worker files need to change.

New endpoints under ``/api/historical/``:
  - ``metrics`` (comma-separated names)
  - ``metrics_by_prefix``
  - ``latency``
  - ``events``
  - ``summary``

All read helpers auto-downsample to ~200 points per series via an
``epoch / N`` floor bucket, so a 30-day query never returns 86k points
per metric.

History tab UI at ``/history`` is a separate page with tab navigation
in the header. Range bar (1h/6h/24h/7d/30d + custom datetime) drives a
manual refresh of:

  - Summary cards (queries / crashes / scale events / alerts)
  - Throughput line chart
  - Worker fleet stacked-style chart
  - Queue backlog per stream
  - Per-worker latency small-multiples (p50 / p95 / p99)
  - Utilization heatmap (workers x time)
  - Infra (Redis memory + Postgres DB size, PG connections + Redis ops)
  - Events timeline (color-coded dots)
  - Incidents table (severity >= warning)

Retention: ``janitor.sweep_history_retention`` runs daily (rate-limited
via a Redis flag) to DELETE rows older than 30 days from all three
tables.
Two issues that showed up bringing the stack up cold.

* Startup alert spam: when ``docker compose up`` runs, persistent
  worker state from the previous run says ``alive`` for every worker
  type, but current heartbeats haven't arrived yet. The poller's first
  tick classified every worker as transitioning to scaled_down (or
  crashed if no marker existed), which immediately fired worker-down
  alerts to Slack. The AlertEngine now captures its boot time and
  suppresses worker-down dispatches for ``MONITOR_STARTUP_GRACE_SEC``
  (default 90s). Real worker losses that happen after grace still
  alert; the events themselves still flow through the poller and the
  History event log.

* Redis healthcheck silently passing during ``LOADING``. The original
  ``redis-cli ping`` returned exit 0 in two cases that aren't healthy:
  unauthenticated (NOAUTH error) and dataset still loading (LOADING
  error). Dependent services then started up and got connection
  errors. The healthcheck now authenticates and requires an exact
  ``PONG`` response; ``start_period: 60s`` gives a large RDB time to
  load before failure retries start counting against it.
``add_query`` now accepts an optional ``target`` arg and writes it into
the ``shepherd_brain.domain`` column so the History tab can report
per-ARA query volume. The existing test pinned the exact INSERT params
to the old 5-tuple. Update it to expect the trailing ``domain`` value
(None by default) and add a second test that exercises the populated
target path.
@maximusunc maximusunc merged commit a062cc5 into main May 18, 2026
2 checks passed
@maximusunc maximusunc deleted the claude/monitoring-dashboard-worker-Tt4X7 branch May 18, 2026 13:53
@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 4.91132% with 1394 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.29%. Comparing base (51c0931) to head (efb831b).
⚠️ Report is 58 commits behind head on main.

Files with missing lines Patch % Lines
workers/monitor/poller.py 0.00% 272 Missing ⚠️
workers/monitor/janitor.py 0.00% 237 Missing ⚠️
workers/monitor/alerts.py 0.00% 232 Missing ⚠️
workers/monitor/storage.py 0.00% 189 Missing ⚠️
workers/monitor/worker.py 0.00% 170 Missing ⚠️
workers/monitor/latency.py 0.00% 73 Missing ⚠️
shepherd_utils/heartbeat.py 30.10% 65 Missing ⚠️
workers/monitor/history.py 0.00% 48 Missing ⚠️
shepherd_utils/reclaim.py 16.98% 44 Missing ⚠️
shepherd_utils/shared.py 20.00% 43 Missing and 1 partial ⚠️
... and 3 more
Files with missing lines Coverage Δ
shepherd_utils/broker.py 72.72% <100.00%> (+35.74%) ⬆️
shepherd_utils/config.py 100.00% <100.00%> (ø)
shepherd_server/base_routes.py 0.00% <0.00%> (ø)
workers/finish_query/worker.py 81.94% <50.00%> (+29.00%) ⬆️
shepherd_utils/db.py 75.96% <5.88%> (+46.33%) ⬆️
shepherd_utils/reclaim.py 16.98% <16.98%> (ø)
shepherd_utils/shared.py 67.31% <20.00%> (+21.73%) ⬆️
workers/monitor/history.py 0.00% <0.00%> (ø)
shepherd_utils/heartbeat.py 30.10% <30.10%> (ø)
workers/monitor/latency.py 0.00% <0.00%> (ø)
... and 5 more

... and 20 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 848df11...efb831b. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants