W11: uptime prober + retention sweep job #22
Merged
Companion to api's /api/v1/status (W11). Writes one uptime_samples row
per component (api, provisioner, worker, deploys, marketing) per
minute. The status endpoint READS this table; this job WRITES it.
Why an in-cluster prober instead of Pingdom: external probers run from
one region and miss the failure mode where instanode's edge dies (the
one persona-3 caught). This worker probes from inside the cluster, so
the worker → API probe rides the same path an agent uses.
Per-component probes:
* api — HTTPS GET /healthz against api.instanode.dev
* provisioner — TCP dial against the gRPC service (no full gRPC stack)
* worker — SELECT 1 (we ARE the worker, but the DB read proves
we can talk to platform state)
* deploys — HEAD against the wildcard deployment ingress
* marketing — HTTPS GET instanode.dev
All targets overridable via UPTIME_PROBE_* env vars for dev/staging
clusters. 5s per-probe timeout. Probes run in parallel goroutines —
one slow probe doesn't delay the others; one DB write failure doesn't
poison the rest of the tick.
Retention: daily UptimeRetentionWorker prunes uptime_samples older
than 90 days. It runs with RunOnStart=false, so the first sweep waits for the next 24h slot.
Both jobs route to the reconcile queue so a default-queue backlog
(weekly_digest fan-out) can't starve the status page during exactly
the moment we want it to be honest.
Tests: 3 new in uptime_prober_test.go covering happy path (all
healthy), failure mix (provisioner dial fails + api 500s), and
retention DELETE.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Companion to api's /api/v1/status (W11). Writes one uptime_samples row per component (api, provisioner, worker, deploys, marketing) per minute. The status endpoint READS this table; this job WRITES it.
Why in-cluster instead of Pingdom: external probers run from one region and miss the failure mode where instanode's edge dies (the one persona-3 caught). This worker probes from inside the cluster, so the worker → API probe rides the same path an agent uses.
uptime_prober.go.
Override targets via UPTIME_PROBE_API_URL, UPTIME_PROBE_MARKETING_URL, UPTIME_PROBE_PROVISIONER_ADDR, UPTIME_PROBE_DEPLOYS_URL for dev/staging clusters.
Test plan
* go test ./internal/jobs -run TestUptime -v: 3 tests (happy path, failure mix, retention DELETE)
* make test: all green
* go build ./...: clean
* SELECT count(*) FROM uptime_samples grows by 5 rows/min
🤖 Generated with Claude Code