Skip to content

v0.2.0 — large-fleet debugging hardening

Latest

Choose a tag to compare

@Kryst4lskyxx Kryst4lskyxx released this 18 Jun 07:53
b01e88a

Harden the skill against friction surfaced by a real large-fleet, chproxy-fronted investigation: fleet-scale resource caps, silent metric NAs, transient connectivity, shell-state loss across agent Bash calls, and config contamination from the wrapper's own caps. Scripts gain self-diagnosing behavior; the references gain the two recipes that incident most wanted.

Added

  • chq.sh cap-trip hints — names the exact knob to raise (CH_MAX_ROWS/BYTES/TIME/EST_TIME/MEM) when a safety cap aborts a query, with the clusterAllReplicas fan-out reminder.
  • Transient-failure retry in both scripts (curl exit 6/7 no longer reads as an outage).
  • promq.sh empty-result + error detection — distinguishes 0 series (wrong/absent metric) from a real 0, surfaces Prometheus in-band errors, and no longer crashes jq on empty/non-JSON bodies.
  • Fleet-aware cap recipe in SKILL.md (discover node count → narrow window → scale caps).
  • Per-node iowait + disk-straggler recipe in cluster-state.md (avg by/max by (instance), latency-profile proxy).
  • Per-node latency-profile query (p50/p99/p999 by hostName()) in query-state.md.
  • Shell-persistence guidance + gitignored .chenv pattern.

Changed

  • Cap-contamination warning across SKILL.md, query-state.md, and the chq.sh header (read system.settings, not query_log.Settings of probes).
  • README helper descriptions updated; dropped the stale readonly=1 mention.

Full changelog: https://github.com/Kryst4lskyxx/clickhouse-debug-skills/blob/main/CHANGELOG.md