Kiloclaw grafana improvements #2827

Merged
St0rmz1 merged 3 commits into main from kiloclaw-grafana-improvements
Apr 27, 2026

St0rmz1 commented Apr 27, 2026

Summary

Expands both kiloclaw Grafana dashboards under dev/grafana/dashboards/ and cleans up the local dev datasource provisioning YAML.

Events dashboard

  • New filter variables wired into every panel query: OpenClaw Version, Image Tag, Org ID, Instance ID.
  • Replaced the Avg Duration stat with a sample-weighted P95.
  • New Top Error Messages aggregation table.
  • New P50 / P95 / P99 duration timeseries.
  • Removed redundant $event filter from panels that already constrain via LIKE (Reconcile Actions, Restore Activity, Capacity Evictions, Recent Lifecycle Events). Previously, picking an event that the panel's LIKE clause excludes would silently empty those panels.
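
To illustrate why a sample-weighted P95 can differ sharply from an unweighted one, here is a minimal Python sketch. It assumes each stored row carries a weight saying how many original events it stands for. The `weighted_quantile` helper and the row values are hypothetical, and the cumulative-weight definition used here is a simple one, not necessarily the exact algorithm behind quantileWeighted.

```python
def weighted_quantile(samples, level):
    """Return the smallest value whose cumulative weight reaches level * total.

    samples: iterable of (value, weight) pairs. Hypothetical helper for
    illustration only; not the CF AE implementation.
    """
    ordered = sorted(samples)
    total = sum(weight for _, weight in ordered)
    threshold = level * total
    running = 0.0
    for value, weight in ordered:
        running += weight
        if running >= threshold:
            return value
    raise ValueError("samples must be non-empty")


# Hypothetical durations in ms. Fast requests were sampled heavily, so each
# stored row stands for many events; the slow row stands for only a few.
rows = [(100, 50), (200, 46), (5000, 4)]

weighted_p95 = weighted_quantile(rows, 0.95)
unweighted_p95 = weighted_quantile([(value, 1) for value, _ in rows], 0.95)
```

With these made-up rows the weighted P95 lands on 200 ms, while treating every stored row equally would report 5000 ms.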

Controller telemetry dashboard

  • New Disk Usage row: disk fill percentage timeseries by region and a top hosts table.
  • New Reliability row: top exit reasons table and top hosts by restart delta.
  • Bytes unit overrides on bandwidth columns so values render as KiB / MiB / GiB rather than raw integers.

Cloudflare Analytics Engine SQL compatibility

This was the most surprising part of the work. CF AE rejects several SQL constructs that the original dashboards relied on, and one of them (NULLIF) was silently breaking the Error Rate panel on main as well.

  • SELECT DISTINCT col rewritten as SELECT col ... GROUP BY col in every template variable definition. CF AE returns 422 for the DISTINCT query modifier. This is why every dropdown was stuck at "All" with no other options.
  • argMax(value, timestamp) replaced with MAX(value) in the disk fill table. Columns renamed to peak_* to reflect that the values are the peak within the window rather than the most recent reading.
  • NULLIF(x, 0) replaced with WHERE clause guards or IF() wrappers. CF AE does not support NULLIF. Side benefit: the Error Rate stat in kiloclaw-events.json, which had been returning "No data" on main, now renders.
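
The three rewrites above can be sketched in Python. This is an illustration only: it uses an in-memory SQLite database as a stand-in (not CF AE), and the `events` table, its columns, and the sample values are hypothetical rather than the dashboards' real schema or queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the real dashboards query CF AE datasets instead.
conn.execute("CREATE TABLE events (region TEXT, total INTEGER, errors INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("iad", 10, 1), ("iad", 0, 0), ("fra", 20, 5), ("fra", 20, 5)],
)

# 1. Template variables: the DISTINCT modifier and GROUP BY list the same values.
with_distinct = conn.execute(
    "SELECT DISTINCT region FROM events ORDER BY region"
).fetchall()
with_group_by = conn.execute(
    "SELECT region FROM events GROUP BY region ORDER BY region"
).fetchall()

# 2. Ratios: a WHERE guard skips zero denominators instead of NULLIF.
with_nullif = conn.execute(
    "SELECT AVG(errors * 1.0 / NULLIF(total, 0)) FROM events"
).fetchone()[0]
with_guard = conn.execute(
    "SELECT AVG(errors * 1.0 / total) FROM events WHERE total > 0"
).fetchone()[0]

# 3. argMax(value, timestamp) versus MAX(value): these answer different
# questions, which is why the columns were renamed to peak_*.
samples = [(1, 40.0), (2, 90.0), (3, 55.0)]  # (timestamp, disk fill %)
latest = max(samples, key=lambda s: s[0])[1]  # value at the newest timestamp
peak = max(value for _, value in samples)     # highest value in the window
```

The first two pairs return the same results here, while `latest` and `peak` differ whenever the highest reading is not the newest one.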

Other

  • extrapolate: false on stat / table panels and on AVG / quantile timeseries. The vertamedia plugin's extrapolation flag corrects partial buckets on rate metrics, but distorts ratios and quantiles.
  • Panel descriptions added throughout so reviewers can see the intent of each query without reading SQL.
  • Removed the unused secureJsonData.xHeaderKey token from the datasource yaml.
  • Added a comment that tlsSkipVerify must remain false in production, since the URL points at api.cloudflare.com.
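
A small worked example of the extrapolation point, as a Python sketch. It assumes that extrapolation scales a partial bucket's value by the ratio of the full bucket width to the observed width; that is an assumption about the plugin's behaviour, not something confirmed here, and the `extrapolate` helper and all numbers are hypothetical.

```python
def extrapolate(value, bucket_seconds, observed_seconds):
    """Scale a partial-bucket value up to a full bucket (assumed behaviour)."""
    return value * bucket_seconds / observed_seconds


bucket_seconds = 60    # full bucket width
observed_seconds = 15  # only the first quarter of the last bucket has data
requests, errors = 30, 3

# A count is a rate-like metric: scaling it estimates the full-bucket total.
estimated_requests = extrapolate(requests, bucket_seconds, observed_seconds)

# A ratio is already normalised: scaling it inflates the value fourfold.
true_error_ratio = errors / requests
distorted_error_ratio = extrapolate(true_error_ratio, bucket_seconds, observed_seconds)
```

Under that assumption the request count is projected sensibly from 30 to 120, but the 10% error ratio would be reported as roughly 40%.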

Verification

  • Loaded both dashboards in local dev Grafana (docker compose --profile grafana up grafana) and confirmed every panel renders with real data.
  • Confirmed every query-driven dropdown populates with actual values (Fly Region, Supervisor State, Controller Version, Delivery, Event, OpenClaw Version, Image Tag).
  • Verified textbox variables (Sandbox ID, User ID, Org ID, Instance ID) filter every panel correctly when populated.
  • Confirmed no panels return 422 from CF AE in browser DevTools after the SQL compatibility fixes.
  • pnpm run format:check passes against the changed files.

Visual Changes

N/A. Grafana dashboards only, rendered locally. No app UI changes.

Reviewer Notes

The CF AE supported function list is short and worth bookmarking for any future Grafana panel work: https://developers.cloudflare.com/analytics/analytics-engine/sql-api/

Notably absent:

  • JOINs, subqueries, CTEs, window functions
  • SELECT DISTINCT (the query modifier)
  • argMax, argMin, topK, groupArray, uniq, uniqExact
  • NULLIF
  • Most string functions

Notably present and used here:

  • COUNT(DISTINCT col) as an aggregate (distinct from the rejected query modifier)
  • quantileWeighted(level)(value, weight)
  • IF(cond, then, else)
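
As a rough aid for future panel work, the lists above could be turned into a pre-flight check. The following Python sketch is hypothetical: the `find_unsupported` helper is not part of this PR, it only covers constructs named in these notes, and its regex matching is a heuristic rather than a SQL parser.

```python
import re

# Function names copied from the "Notably absent" list; not exhaustive.
UNSUPPORTED_FUNCTIONS = (
    "argMax", "argMin", "topK", "groupArray", "uniq", "uniqExact", "NULLIF",
)

_DISTINCT_MODIFIER = re.compile(r"\bSELECT\s+DISTINCT\b", re.IGNORECASE)
_JOIN = re.compile(r"\bJOIN\b", re.IGNORECASE)
_FUNCTION_CALL = re.compile(
    r"\b(" + "|".join(UNSUPPORTED_FUNCTIONS) + r")\s*\(", re.IGNORECASE
)


def find_unsupported(sql: str) -> list[str]:
    """Return the listed constructs found in a query, in the order checked."""
    findings = []
    if _DISTINCT_MODIFIER.search(sql):
        findings.append("SELECT DISTINCT")
    if _JOIN.search(sql):
        findings.append("JOIN")
    for match in _FUNCTION_CALL.finditer(sql):
        findings.append(match.group(1))
    return findings


# COUNT(DISTINCT col) passes because only the query modifier is matched.
ok = find_unsupported("SELECT COUNT(DISTINCT region) FROM t")
bad = find_unsupported("SELECT DISTINCT region, NULLIF(total, 0) FROM t")
```

Here `ok` is empty and `bad` reports both the DISTINCT modifier and the NULLIF call.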

The two dashboards live under dev/grafana/dashboards/ and are local dev only. The new services/kilo-ops/ Grafana, shipped via a Worker container at ops.kiloapps.io (added in #2769), does not yet host these kiloclaw dashboards. Promoting them to that production Grafana is a sensible next step once they prove out.

kilo-code-bot Bot commented Apr 27, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (3 files)
  • dev/grafana/dashboards/kiloclaw-controller-telemetry.json
  • dev/grafana/dashboards/kiloclaw-events.json
  • dev/grafana/provisioning/datasources/kiloclaw-clickhouse.yml

Reviewed by gpt-5.5-2026-04-23 · 1,593,757 tokens

St0rmz1 merged commit f81c552 into main Apr 27, 2026
16 checks passed
St0rmz1 deleted the kiloclaw-grafana-improvements branch April 27, 2026 16:54