Good-Native · simonsmallchua · May 2, 2026 · May 2, 2026 · May 2, 2026 · May 2, 2026
diff --git a/.claude/commands/monitor.md b/.claude/commands/monitor.md
@@ -1,41 +1,63 @@
 # Monitor Fly Logs
 
-Sample and analyse Fly.io logs at regular intervals.
+Capture, search, and analyse Fly.io logs via `scripts/logs.sh`.
 
 ## Default usage
 
 ```bash
-./scripts/monitor_logs.sh
+./scripts/logs.sh
 ```
 
-Runs for 4 hours with 10-second intervals.
+Captures from all five Fly apps (`hover`, `hover-worker`, `hover-analysis`,
+`hover-autoscaler-worker`, `hover-autoscaler-analysis`) every 3s, runs an
+analyse snapshot every 5 minutes, and writes a final report when the run
+finishes (~72 minutes by default). Press Ctrl+C to stop early — the final report
+still writes.
 
-## Options
+## Subcommands
 
 ```bash
-./scripts/monitor_logs.sh --run-id "descriptive-name"    # Custom name
-./scripts/monitor_logs.sh --interval 30 --iterations 120 # 30s for 1 hour
+./scripts/logs.sh monitor [...]   # explicit form of the default
+./scripts/logs.sh search  [...]   # grep captured raw logs
+./scripts/logs.sh analyse [...]   # run probes, write analysis.md/json
 ```
 
-## Output structure
+## Common options
+
+```bash
+./scripts/logs.sh --interval 5 --iterations 720       # 5s × 1h
+./scripts/logs.sh --run-id "incident-pr349"           # custom slug
+./scripts/logs.sh --analyse-every 30s                 # tighter snapshots
+./scripts/logs.sh --analyse-every 0                   # disable snapshots
+./scripts/logs.sh --app hover,hover-worker            # subset of apps
+
+./scripts/logs.sh search --keyword panic --keyword pgx
+./scripts/logs.sh search --regex 'status[":]+5\d\d' --app hover
 
+./scripts/logs.sh analyse --keyword "deadline exceeded"
+./scripts/logs.sh analyse --run 20260502/1430_mellow-rose_3s_1h
 ```
-logs/YYYYMMDD/HHMM_<name>_<interval>s_<duration>h/
-├── raw/
-│   ├── <timestamp>_iter1.log
-│   ├── <timestamp>_iter2.log
-│   └── ...
-├── <timestamp>_iter1.json
-├── <timestamp>_iter2.json
-├── time_series.csv
-└── summary.md
+
+## Output structure
+
+```text
+logs/YYYYMMDD/HHMM_<slug>_<settings>/
+├── <app>/raw/*.log          # cursor-filtered captures (one per iteration)
+├── <app>/.cursor            # last-seen ISO timestamp per app
+├── snapshots/
+│   ├── analysis_<HHMMSS>Z.md
+│   └── analysis_<HHMMSS>Z.json
+├── analysis.md              # final probe report
+├── analysis.json
+└── monitor.log              # verbose run history
 ```
 
-## What it captures
+## Probes (analyse)
 
-- Raw log samples from Fly
-- JSON summaries per iteration
-- Aggregated time series data
-- Summary markdown report
+Severity, panics & fatals, HTTP status, latency (p50/p95/p99 + slowest),
+heartbeat, process health, autoscaler, database/external errors, Sentry, plus
+any ad-hoc `--keyword`/`--regex`. Every finding records `count`, `first_seen`,
+`last_seen`, and `peak` (timestamp of the highest-count minute).
 
-Automatic aggregation runs via `scripts/aggregate_logs.py` after each iteration.
+The legacy `scripts/monitor_logs.sh` still works — it forwards to
+`./scripts/logs.sh monitor`.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -28,7 +28,31 @@ On merge, CI will:
 
 ## [Unreleased]
 
-_Add unreleased changes here._
+### Added
+
+- `scripts/logs.sh` unified Fly log tool with `monitor`, `search`, and `analyse`
+  subcommands. Bare `logs.sh` runs `monitor` with all defaults (3s capture
+  across all five Fly apps, 5-minute analyse snapshots, final report at run
+  end). `search` greps captured raw logs by keyword/regex across one or more
+  runs (case-insensitive by default, also reads `raw.zip` after cleanup).
+  `analyse` runs a fixed probe set — severity, panics, HTTP status, latency,
+  heartbeat, process health, autoscaler, database/external errors, Sentry, plus
+  ad-hoc keywords — and writes `analysis.{md,json}` with `first_seen`,
+  `last_seen`, and `peak` timestamps for every finding.
+- `scripts/filter_since.py` per-app cursor filter wired into `capture_app` so
+  each iteration only persists log lines newer than the previous capture,
+  eliminating the 4× overlap inflation that came from `flyctl logs --no-tail`.
+
+### Changed
+
+- `scripts/monitor_logs.sh` reduced to a back-compat shim that forwards all
+  flags to `logs.sh monitor`. Existing call sites in `opencode.json`,
+  `.claude/commands/monitor.md`, `.claude/settings.local.json`, and
+  `docs/development/DEVELOPMENT.md` continue to work unchanged.
+- Default monitor app list now includes the two autoscaler sidecars
+  (`hover-autoscaler-worker`, `hover-autoscaler-analysis`).
+- Default `--run-id` is auto-generated as a `<adjective>-<colour>` slug (e.g.
+  `mellow-rose`) so concurrent runs are easy to distinguish.
 
 ## Full changelog history
 

diff --git a/docs/development/DEVELOPMENT.md b/docs/development/DEVELOPMENT.md
@@ -294,29 +294,52 @@ logic ├── crawler/ # Web crawling functionality ├── db/ # Database o
 
 ## Monitoring Fly Logs
 
-For production investigations use `scripts/monitor_logs.sh`:
+For production investigations use `scripts/logs.sh`:
 
 ```bash
-# Default: 10-second intervals for 4 hours
-./scripts/monitor_logs.sh
+# Default: 3-second capture across all five Fly apps, 5-minute analyse
+# snapshots, ~72 minutes total. Press Ctrl+C to stop early — the final
+# report still writes.
+./scripts/logs.sh
 
-# Custom run with descriptive name
-./scripts/monitor_logs.sh --run-id "heavy-load-test"
+# Custom slug, longer interval and duration
+./scripts/logs.sh --interval 30 --iterations 120 --run-id "30min-check"
 
-# Custom intervals and duration
-./scripts/monitor_logs.sh --interval 30 --iterations 120 --run-id "30min-check"
+# Tighter snapshot cadence, or disable snapshots
+./scripts/logs.sh --analyse-every 30s
+./scripts/logs.sh --analyse-every 0
 ```
 
+`logs.sh` has three subcommands sharing the same run layout:
+
+```bash
+./scripts/logs.sh search --keyword panic --keyword pgx
+./scripts/logs.sh analyse --keyword "deadline exceeded"
+./scripts/logs.sh analyse --run 20260502/1430_mellow-rose_3s_1h
+```
+
+The legacy `scripts/monitor_logs.sh` still works — it forwards to
+`./scripts/logs.sh monitor`.
+
 **Output structure:**
 
-- Folder: `logs/YYYYMMDD/HHMM_<name>_<interval>s_<duration>h/`
-  - Example: `logs/20251105/0833_heavy-load-test_10s_4h/`
-- Raw logs: `raw/<timestamp>_iter<N>.log`
-- JSON summaries: `<timestamp>_iter<N>.json`
-- Aggregated outputs:
-  - `time_series.csv` - per-minute log level counts
-  - `summary.md` - human-readable report with critical patterns
-  - Automatically regenerated after each iteration
+- Run dir: `logs/YYYYMMDD/HHMM_<slug>_<interval>s_<duration>/`
+  - Example: `logs/20260502/1430_mellow-rose_3s_1h/`
+- Per-app captures: `<app>/raw/<timestamp>_iter<N>.log` (cursor-filtered against
+  `<app>/.cursor` so each iteration only persists lines newer than the previous)
+- Per-iteration JSON: `<app>/<timestamp>_iter<N>.json`
+- Aggregated outputs (per app):
+  - `time_series.csv` — per-minute log level counts
+  - `summary.md` — human-readable per-app report
+- Cross-app analysis (whole run):
+  - `analysis.md` / `analysis.json` — final probe report (severity, panics,
+    HTTP, latency, heartbeat, process health, autoscaler, DB/external, Sentry,
+    ad-hoc keywords) with `first_seen` / `last_seen` / `peak` timestamps for
+    every finding
+  - `snapshots/analysis_<HHMMSS>Z.{md,json}` — point-in-time snapshots written
+    every `--analyse-every` while the run is in progress
+- `monitor.log` — verbose run history (cleanup, per-iteration capture, errors).
+  The TTY shows only a startup banner and a self-overwriting ticker.
 
 **Defaults:**
 

diff --git a/opencode.json b/opencode.json
@@ -8,7 +8,7 @@
     },
     "monitor-fly": {
       "description": "Collect and summarise Fly logs",
-      "template": "Run ./scripts/monitor_logs.sh with suitable arguments from $ARGUMENTS if provided. Summarise critical patterns, error spikes, and likely causes."
+      "template": "Run ./scripts/logs.sh with suitable arguments from $ARGUMENTS if provided (use the `search` or `analyse` subcommand for grep / probe-driven analysis of an existing run). Summarise critical patterns, error spikes, and likely causes."
     },
     "load-test": {
       "description": "Run scripted load test safely",

diff --git a/scripts/aggregate_logs.py b/scripts/aggregate_logs.py
@@ -404,7 +404,10 @@ def watch_mode(log_dir, interval=10):
     parser.add_argument("--full", action="store_true", help="Full reprocess (ignore state)")
     args = parser.parse_args()
 
-    if args.watch:
-        watch_mode(args.log_dir, args.interval)
-    else:
-        aggregate_logs(args.log_dir, incremental=not args.full)
+    try:
+        if args.watch:
+            watch_mode(args.log_dir, args.interval)
+        else:
+            aggregate_logs(args.log_dir, incremental=not args.full)
+    except KeyboardInterrupt:
+        sys.exit(130)