Harden the skill against friction surfaced by a real large-fleet, chproxy-fronted investigation: fleet-scale resource caps, silent metric NAs, transient connectivity, shell-state loss across agent Bash calls, and config contamination from the wrapper's own caps. Scripts gain self-diagnosing behavior; the references gain the two recipes that incident most wanted.
Added
chq.shcap-trip hints — names the exact knob to raise (CH_MAX_ROWS/BYTES/TIME/EST_TIME/MEM) when a safety cap aborts a query, with theclusterAllReplicasfan-out reminder.- Transient-failure retry in both scripts (curl exit 6/7 no longer reads as an outage).
promq.shempty-result + error detection — distinguishes0 series(wrong/absent metric) from a real 0, surfaces Prometheus in-band errors, and no longer crashesjqon empty/non-JSON bodies.- Fleet-aware cap recipe in
SKILL.md(discover node count → narrow window → scale caps). - Per-node iowait + disk-straggler recipe in
cluster-state.md(avg by/max by (instance), latency-profile proxy). - Per-node latency-profile query (p50/p99/p999 by
hostName()) inquery-state.md. - Shell-persistence guidance + gitignored
.chenvpattern.
Changed
- Cap-contamination warning across
SKILL.md,query-state.md, and thechq.shheader (readsystem.settings, notquery_log.Settingsof probes). - README helper descriptions updated; dropped the stale
readonly=1mention.
Full changelog: https://github.com/Kryst4lskyxx/clickhouse-debug-skills/blob/main/CHANGELOG.md