66 changes: 44 additions & 22 deletions .claude/commands/monitor.md
@@ -1,41 +1,63 @@
# Monitor Fly Logs

Sample and analyse Fly.io logs at regular intervals.
Capture, search, and analyse Fly.io logs via `scripts/logs.sh`.

## Default usage

```bash
./scripts/monitor_logs.sh
./scripts/logs.sh
```

Runs for 4 hours with 10-second intervals.
Captures from all five Fly apps (`hover`, `hover-worker`, `hover-analysis`,
`hover-autoscaler-worker`, `hover-autoscaler-analysis`) every 3s, runs an
analyse snapshot every 5 minutes, and writes a final report when the run
finishes (~72 minutes by default). Press Ctrl+C to stop early — the final report
still writes.

## Options
## Subcommands

```bash
./scripts/monitor_logs.sh --run-id "descriptive-name" # Custom name
./scripts/monitor_logs.sh --interval 30 --iterations 120 # 30s for 1 hour
./scripts/logs.sh monitor [...] # explicit form of the default
./scripts/logs.sh search [...] # grep captured raw logs
./scripts/logs.sh analyse [...] # run probes, write analysis.md/json
```

## Output structure
## Common options

```bash
./scripts/logs.sh --interval 5 --iterations 720 # 5s × 1h
./scripts/logs.sh --run-id "incident-pr349" # custom slug
./scripts/logs.sh --analyse-every 30s # tighter snapshots
./scripts/logs.sh --analyse-every 0 # disable snapshots
./scripts/logs.sh --app hover,hover-worker # subset of apps

./scripts/logs.sh search --keyword panic --keyword pgx
./scripts/logs.sh search --regex 'status[":]+5\d\d' --app hover

./scripts/logs.sh analyse --keyword "deadline exceeded"
./scripts/logs.sh analyse --run 20260502/1430_mellow-rose_3s_1h
```
logs/YYYYMMDD/HHMM_<name>_<interval>s_<duration>h/
├── raw/
│ ├── <timestamp>_iter1.log
│ ├── <timestamp>_iter2.log
│ └── ...
├── <timestamp>_iter1.json
├── <timestamp>_iter2.json
├── time_series.csv
└── summary.md

## Output structure

```text
logs/YYYYMMDD/HHMM_<slug>_<settings>/
├── <app>/raw/*.log # cursor-filtered captures (one per iteration)
├── <app>/.cursor # last-seen ISO timestamp per app
├── snapshots/
│ ├── analysis_<HHMMSS>Z.md
│ └── analysis_<HHMMSS>Z.json
├── analysis.md # final probe report
├── analysis.json
└── monitor.log # verbose run history
```

## What it captures
## Probes (analyse)

- Raw log samples from Fly
- JSON summaries per iteration
- Aggregated time series data
- Summary markdown report
Severity, panics & fatals, HTTP status, latency (p50/p95/p99 + slowest),
heartbeat, process health, autoscaler, database/external errors, Sentry, plus
any ad-hoc `--keyword`/`--regex`. Every finding records `count`, `first_seen`,
`last_seen`, and `peak` (timestamp of the highest-count minute).
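
As a rough illustration of the four fields each finding records, here is a minimal sketch that derives them from the timestamps of the matching log lines. The function name `summarise_finding` and the tie-breaking rule (earliest minute wins) are assumptions made for this example, not taken from `scripts/logs.sh`.

```python
from collections import Counter
from datetime import datetime, timezone


def summarise_finding(timestamps):
    """Summarise one finding from the timestamps of its matching log lines.

    Hypothetical helper: the real probe code may compute these differently.
    """
    if not timestamps:
        return None
    # Bucket every hit into its minute to find the busiest one.
    per_minute = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    # Highest-count minute; ties go to the earliest minute (an assumption).
    peak_minute = min(per_minute, key=lambda minute: (-per_minute[minute], minute))
    return {
        "count": len(timestamps),
        "first_seen": min(timestamps).isoformat(),
        "last_seen": max(timestamps).isoformat(),
        "peak": peak_minute.isoformat(),
    }


hits = [
    datetime(2026, 5, 2, 14, 30, 5, tzinfo=timezone.utc),
    datetime(2026, 5, 2, 14, 31, 10, tzinfo=timezone.utc),
    datetime(2026, 5, 2, 14, 31, 40, tzinfo=timezone.utc),
]
summary = summarise_finding(hits)
print(summary)
```

Here `peak` points at the 14:31 minute because two of the three hits fall inside it.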

Automatic aggregation runs via `scripts/aggregate_logs.py` after each iteration.
The legacy `scripts/monitor_logs.sh` still works — it forwards to
`./scripts/logs.sh monitor`.
26 changes: 25 additions & 1 deletion CHANGELOG.md
@@ -28,7 +28,31 @@ On merge, CI will:

## [Unreleased]

_Add unreleased changes here._
### Added

- `scripts/logs.sh` unified Fly log tool with `monitor`, `search`, and `analyse`
subcommands. Bare `logs.sh` runs `monitor` with all defaults (3s capture
across all five Fly apps, 5-minute analyse snapshots, final report at run
end). `search` greps captured raw logs by keyword/regex across one or more
runs (case-insensitive by default, also reads `raw.zip` after cleanup).
`analyse` runs a fixed probe set — severity, panics, HTTP status, latency,
heartbeat, process health, autoscaler, database/external errors, Sentry, plus
ad-hoc keywords — and writes `analysis.{md,json}` with `first_seen`,
`last_seen`, and `peak` timestamps for every finding.
- `scripts/filter_since.py` per-app cursor filter wired into `capture_app` so
each iteration only persists log lines newer than the previous capture,
eliminating the 4× overlap inflation that came from `flyctl logs --no-tail`.
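
The cursor idea behind `scripts/filter_since.py` can be sketched as below. This is not the script itself: it assumes every log line starts with an ISO-8601 UTC timestamp, and the names `parse_ts` and `filter_since` are hypothetical. Each capture keeps only lines strictly newer than the stored cursor, then advances the cursor to the newest timestamp seen.

```python
from datetime import datetime


def parse_ts(line):
    """Parse the leading ISO-8601 timestamp of a log line, or return None."""
    parts = line.split(maxsplit=1)
    token = parts[0] if parts else ""
    try:
        # Normalise a trailing "Z" so older Python versions can parse it.
        return datetime.fromisoformat(token.replace("Z", "+00:00"))
    except ValueError:
        return None


def filter_since(lines, cursor):
    """Keep lines strictly newer than cursor; return (kept, new_cursor)."""
    kept = []
    newest = cursor
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # assumption: drop lines without a parseable timestamp
        if cursor is None or ts > cursor:
            kept.append(line)
            if newest is None or ts > newest:
                newest = ts
    return kept, newest


# Two overlapping captures, as repeated non-tailing fetches would return.
first = ["2026-05-02T14:30:00Z a", "2026-05-02T14:30:03Z b"]
second = ["2026-05-02T14:30:03Z b", "2026-05-02T14:30:06Z c"]

kept_first, cursor = filter_since(first, None)
kept_second, cursor = filter_since(second, cursor)
print(kept_second)
```

A strict `>` comparison also drops any new line that shares the cursor's exact timestamp. Whether the real filter accepts that trade-off is not stated in the changelog.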

### Changed

- `scripts/monitor_logs.sh` reduced to a back-compat shim that forwards all
flags to `logs.sh monitor`. Existing call sites in `opencode.json`,
`.claude/commands/monitor.md`, `.claude/settings.local.json`, and
`docs/development/DEVELOPMENT.md` continue to work unchanged.
- Default monitor app list now includes the two autoscaler sidecars
(`hover-autoscaler-worker`, `hover-autoscaler-analysis`).
- Default `--run-id` is auto-generated as a `<adjective>-<colour>` slug (e.g.
`mellow-rose`) so concurrent runs are easy to distinguish.
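
A slug of the `<adjective>-<colour>` shape could be generated as in this sketch. The word lists and the `make_slug` name are placeholders for illustration only. The real lists live in the script and are not shown in this diff.

```python
import random

# Placeholder word lists (assumptions); the script's own lists are not shown.
ADJECTIVES = ["mellow", "brisk", "quiet"]
COLOURS = ["rose", "teal", "amber"]


def make_slug(rng=random):
    """Return a random '<adjective>-<colour>' slug such as 'mellow-rose'."""
    return f"{rng.choice(ADJECTIVES)}-{rng.choice(COLOURS)}"


slug = make_slug(random.Random(0))
print(slug)
```

Passing a seeded `random.Random` makes the output repeatable in tests, while the default uses the module-level generator.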

## Full changelog history

53 changes: 38 additions & 15 deletions docs/development/DEVELOPMENT.md
@@ -294,29 +294,52 @@ logic ├── crawler/ # Web crawling functionality ├── db/ # Database o

## Monitoring Fly Logs

For production investigations use `scripts/monitor_logs.sh`:
For production investigations use `scripts/logs.sh`:

```bash
# Default: 10-second intervals for 4 hours
./scripts/monitor_logs.sh
# Default: 3-second capture across all five Fly apps, 5-minute analyse
# snapshots, ~72 minutes total. Press Ctrl+C to stop early — the final
# report still writes.
./scripts/logs.sh

# Custom run with descriptive name
./scripts/monitor_logs.sh --run-id "heavy-load-test"
# Custom slug with a 30-second interval for 120 iterations (1 hour)
./scripts/logs.sh --interval 30 --iterations 120 --run-id "30min-check"

# Custom intervals and duration
./scripts/monitor_logs.sh --interval 30 --iterations 120 --run-id "30min-check"
# Tighter snapshot cadence, or disable snapshots
./scripts/logs.sh --analyse-every 30s
./scripts/logs.sh --analyse-every 0
```

`logs.sh` has three subcommands sharing the same run layout:

```bash
./scripts/logs.sh search --keyword panic --keyword pgx
./scripts/logs.sh analyse --keyword "deadline exceeded"
./scripts/logs.sh analyse --run 20260502/1430_mellow-rose_3s_1h
```

The legacy `scripts/monitor_logs.sh` still works — it forwards to
`./scripts/logs.sh monitor`.
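
To make the `search` examples above concrete, here is a minimal sketch of case-insensitive keyword and regex matching over captured lines. It assumes a line is reported when any keyword or the regex matches; the diff does not say how `logs.sh search` combines them. The `search_lines` name is hypothetical.

```python
import re


def search_lines(lines, keywords=(), regex=None, ignore_case=True):
    """Return lines matching any keyword or the regex (OR is an assumption)."""
    flags = re.IGNORECASE if ignore_case else 0
    # Keywords are treated as literal text, so they are escaped first.
    patterns = [re.compile(re.escape(keyword), flags) for keyword in keywords]
    if regex is not None:
        patterns.append(re.compile(regex, flags))
    return [line for line in lines if any(p.search(line) for p in patterns)]


sample = [
    '{"level":"error","msg":"PANIC in worker"}',
    '{"status":502,"path":"/api"}',
    '{"status":200,"path":"/health"}',
]
matches = search_lines(sample, keywords=["panic"], regex=r'status[":]+5\d\d')
print(matches)
```

The regex is the one from the documented example: it matches `status` followed by quote or colon characters and a 5xx code, so the 200 response is skipped.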

**Output structure:**

- Folder: `logs/YYYYMMDD/HHMM_<name>_<interval>s_<duration>h/`
- Example: `logs/20251105/0833_heavy-load-test_10s_4h/`
- Raw logs: `raw/<timestamp>_iter<N>.log`
- JSON summaries: `<timestamp>_iter<N>.json`
- Aggregated outputs:
- `time_series.csv` - per-minute log level counts
- `summary.md` - human-readable report with critical patterns
- Automatically regenerated after each iteration
- Run dir: `logs/YYYYMMDD/HHMM_<slug>_<interval>s_<duration>/`
- Example: `logs/20260502/1430_mellow-rose_3s_1h/`
- Per-app captures: `<app>/raw/<timestamp>_iter<N>.log` (cursor-filtered against
`<app>/.cursor` so each iteration only persists lines newer than the previous)
- Per-iteration JSON: `<app>/<timestamp>_iter<N>.json`
- Aggregated outputs (per app):
- `time_series.csv` — per-minute log level counts
- `summary.md` — human-readable per-app report
- Cross-app analysis (whole run):
- `analysis.md` / `analysis.json` — final probe report (severity, panics,
HTTP, latency, heartbeat, process health, autoscaler, DB/external, Sentry,
ad-hoc keywords) with `first_seen` / `last_seen` / `peak` timestamps for
every finding
- `snapshots/analysis_<HHMMSS>Z.{md,json}` — point-in-time snapshots written
every `--analyse-every` while the run is in progress
- `monitor.log` — verbose run history (cleanup, per-iteration capture, errors).
The TTY shows only a startup banner and a self-overwriting ticker.
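
The run directory naming can be illustrated by splitting a path back into its parts. This sketch is based only on the documented pattern and the `1430_mellow-rose_3s_1h` example. The regex and the `parse_run_dir` name are assumptions, and the duration is left as an opaque string because its exact format is not specified here.

```python
import re

# Pattern inferred from the documented layout; treat it as an assumption.
RUN_DIR = re.compile(
    r"^(?:logs/)?(?P<date>\d{8})/(?P<time>\d{4})_(?P<slug>.+)"
    r"_(?P<interval>\d+)s_(?P<duration>[^_/]+)/?$"
)


def parse_run_dir(path):
    """Split a run directory path into date, time, slug, interval, duration."""
    match = RUN_DIR.match(path)
    if match is None:
        raise ValueError(f"not a run directory: {path!r}")
    parts = match.groupdict()
    parts["interval"] = int(parts["interval"])  # capture interval in seconds
    return parts


run = parse_run_dir("logs/20260502/1430_mellow-rose_3s_1h/")
print(run)
```

The same function accepts the shorter form passed to `analyse --run`, for example `20260502/1430_mellow-rose_3s_1h`.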

**Defaults:**

2 changes: 1 addition & 1 deletion opencode.json
@@ -8,7 +8,7 @@
},
"monitor-fly": {
"description": "Collect and summarise Fly logs",
"template": "Run ./scripts/monitor_logs.sh with suitable arguments from $ARGUMENTS if provided. Summarise critical patterns, error spikes, and likely causes."
"template": "Run ./scripts/logs.sh with suitable arguments from $ARGUMENTS if provided (use the `search` or `analyse` subcommand for grep / probe-driven analysis of an existing run). Summarise critical patterns, error spikes, and likely causes."
},
"load-test": {
"description": "Run scripted load test safely",
11 changes: 7 additions & 4 deletions scripts/aggregate_logs.py
@@ -404,7 +404,10 @@ def watch_mode(log_dir, interval=10):
parser.add_argument("--full", action="store_true", help="Full reprocess (ignore state)")
args = parser.parse_args()

if args.watch:
    watch_mode(args.log_dir, args.interval)
else:
    aggregate_logs(args.log_dir, incremental=not args.full)
try:
    if args.watch:
        watch_mode(args.log_dir, args.interval)
    else:
        aggregate_logs(args.log_dir, incremental=not args.full)
except KeyboardInterrupt:
    sys.exit(130)
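
The exit status 130 used above follows the common shell convention of 128 plus the signal number, with SIGINT being signal 2. The sketch below shows the same pattern in isolation. The `run_cli` wrapper is a hypothetical name, not part of `aggregate_logs.py`.

```python
import signal
import sys


def run_cli(main):
    """Run main() and turn Ctrl+C into exit status 130 without a traceback."""
    try:
        main()
    except KeyboardInterrupt:
        # 128 + SIGINT (2) = 130, the status shells report for an interrupt.
        sys.exit(128 + signal.SIGINT)


def interrupted():
    """Stand-in for a long-running job that the user interrupts."""
    raise KeyboardInterrupt


try:
    run_cli(interrupted)
except SystemExit as exc:
    exit_code = exc.code
    print(exit_code)
```

Exiting this way lets a calling shell script tell a user interrupt apart from an ordinary failure.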