W11: uptime prober + retention sweep job by mastermanas805 · Pull Request #22 · InstaNode-dev/worker

mastermanas805 · 2026-05-14T06:55:58Z

Summary

Companion to api's /api/v1/status (W11). Writes one uptime_samples row per component (api, provisioner, worker, deploys, marketing) per minute. The status endpoint READS this table; this job WRITES it.

Why in-cluster instead of pingdom: external probers run from one region and miss the failure mode where instanode's edge dies (the one persona-3 caught). This worker probes from inside the cluster, so the worker → API probe rides the same path an agent uses.

uptime_prober: runs every minute; probes run in parallel, each with a 5s timeout. Per-component failure modes are documented in uptime_prober.go.
uptime_retention: daily prune of rows older than 90 days. RunOnStart=false (wait for the next 24h slot).
Both routed to the reconcile queue so a default-queue backlog can't starve the status page.

Override targets via UPTIME_PROBE_API_URL, UPTIME_PROBE_MARKETING_URL, UPTIME_PROBE_PROVISIONER_ADDR, UPTIME_PROBE_DEPLOYS_URL for dev/staging clusters.

Test plan

go test ./internal/jobs -run TestUptime -v — 3 tests (happy path, failure mix, retention DELETE)
Full make test — all green
go build ./... — clean
After deploy: verify SELECT count(*) FROM uptime_samples grows by 5 rows/min

🤖 Generated with Claude Code

Companion to api's /api/v1/status (W11). Writes one uptime_samples row per component (api, provisioner, worker, deploys, marketing) per minute. The status endpoint READS this table; this job WRITES it.

Why an in-cluster prober instead of pingdom: external probers run from one region and miss the failure mode where instanode's edge dies (the one persona-3 caught). This worker probes from inside the cluster, so the worker → API probe rides the same path an agent uses.

Per-component probes:

* api: HTTPS GET /healthz against api.instanode.dev
* provisioner: TCP dial against the gRPC service (no full gRPC stack)
* worker: SELECT 1 (we ARE the worker, but the DB read proves we can talk to platform state)
* deploys: HEAD against the wildcard deployment ingress
* marketing: HTTPS GET instanode.dev

All targets overridable via UPTIME_PROBE_* env vars for dev/staging clusters. 5s per-probe timeout. Probes run in parallel goroutines: one slow probe doesn't delay the others; one DB write failure doesn't poison the rest of the tick.

Retention: daily UptimeRetentionWorker prunes uptime_samples older than 90 days. Runs RunOnStart=false (wait for the next 24h slot).

Both jobs route to the reconcile queue so a default-queue backlog (weekly_digest fan-out) can't starve the status page during exactly the moment we want it to be honest.

Tests: 3 new in uptime_prober_test.go covering happy path (all healthy), failure mix (provisioner dial fails + api 500s), and retention DELETE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mastermanas805 merged commit 9780a04 into master May 14, 2026

mastermanas805 mentioned this pull request May 14, 2026

W11: rewrite /status to consume real backend GET /api/v1/status InstaNode-dev/instanode-web#61

Merged

3 tasks

mastermanas805 merged 1 commit into master from feat/w11-status-uptime-prober-fresh
