
W11: uptime prober + retention sweep job #22

Merged: mastermanas805 merged 1 commit into master from feat/w11-status-uptime-prober-fresh on May 14, 2026


Summary

Companion to api's /api/v1/status (W11). Writes one uptime_samples row per component (api, provisioner, worker, deploys, marketing) per minute. The status endpoint READS this table; this job WRITES it.

Why in-cluster instead of Pingdom: external probers run from one region and miss the failure mode where instanode's edge dies (the one persona-3 caught). This worker probes from inside the cluster, so the worker → API probe rides the same path an agent uses.

  • uptime_prober: runs every minute, probes in parallel, 5s budget per probe. Per-component failure modes are documented in uptime_prober.go.
  • uptime_retention: daily prune of rows older than 90 days. RunOnStart=false, so the first prune waits for the next 24h slot.
  • Both jobs are routed to the reconcile queue so a default-queue backlog can't starve the status page.

Override targets via UPTIME_PROBE_API_URL, UPTIME_PROBE_MARKETING_URL, UPTIME_PROBE_PROVISIONER_ADDR, UPTIME_PROBE_DEPLOYS_URL for dev/staging clusters.

Test plan

  • go test ./internal/jobs -run TestUptime -v — 3 tests (happy path, failure mix, retention DELETE)
  • Full make test — all green
  • go build ./... — clean
  • After deploy: verify SELECT count(*) FROM uptime_samples grows by 5 rows/min

🤖 Generated with Claude Code

Companion to api's /api/v1/status (W11). Writes one uptime_samples row
per component (api, provisioner, worker, deploys, marketing) per
minute. The status endpoint READS this table; this job WRITES it.

Why an in-cluster prober instead of Pingdom: external probers run from
one region and miss the failure mode where instanode's edge dies (the
one persona-3 caught). This worker probes from inside the cluster, so
the worker → API probe rides the same path an agent uses.

Per-component probes:
  * api          — HTTPS GET /healthz against api.instanode.dev
  * provisioner  — TCP dial against the gRPC service (no full gRPC stack)
  * worker       — SELECT 1 (we ARE the worker, but the DB read proves
                   we can talk to platform state)
  * deploys      — HEAD against the wildcard deployment ingress
  * marketing    — HTTPS GET instanode.dev

All targets overridable via UPTIME_PROBE_* env vars for dev/staging
clusters. 5s per-probe timeout. Probes run in parallel goroutines —
one slow probe doesn't delay the others; one DB write failure doesn't
poison the rest of the tick.

Retention: the daily UptimeRetentionWorker prunes uptime_samples rows older
than 90 days. It runs with RunOnStart=false (waits for the next 24h slot).

Both jobs route to the reconcile queue so a default-queue backlog
(weekly_digest fan-out) can't starve the status page during exactly
the moment we want it to be honest.

Tests: 3 new in uptime_prober_test.go covering happy path (all
healthy), failure mix (provisioner dial fails + api 500s), and
retention DELETE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>