fix(selfhost): treat a lone orb relay registration timeout as degraded telemetry, not an error #3313
…not one hiccup

registerOrbRelayWithMonitor escalated a pull-mode registration failure to warn only, and unconditionally at that -- there was no way to tell a lone transient broker timeout apart from a sustained outage. Add a consecutive-failure streak to OrbRelayRegistrationState alongside the existing lifetime attempts counter, and only alert (error, not warn) once ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK consecutive failures have occurred, or the pull-mode drain loop hasn't made progress in ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS -- whichever the caller can confirm. A single hiccup while orb_relay_drained keeps firing now stays a warning.

Also record a recovered counter alongside the existing recovery log, add Grafana panels + a Prometheus alert for the new streak/no-progress gauges, and regenerate the stale self-host env-var reference the reordered server.ts wiring shifted.
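The escalation rule described in this commit message can be sketched as a small pure function. This is an illustration, not the code in the PR: the two constant names and their values (3 failures, 30 minutes) come from the PR summary, while `FailureEvidence`, `classifyRegistrationFailure`, the optional drain timestamp, and the `>=` comparisons are assumptions made for the example.

```typescript
// Threshold names and values are quoted from the PR summary.
const ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK = 3;
const ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS = 30 * 60 * 1000;

type Severity = "warn" | "error";

// Hypothetical input shape, invented for this sketch.
interface FailureEvidence {
  // Failures in a row, counting the one being classified.
  consecutiveFailures: number;
  // Last successful drain round-trip; left undefined when the caller
  // has no drain evidence, in which case only the streak is checked.
  lastDrainProgressAtMs?: number;
  nowMs: number;
}

// Hypothetical helper: decide how loudly to log a registration failure.
function classifyRegistrationFailure(e: FailureEvidence): Severity {
  const streakUnhealthy =
    e.consecutiveFailures >= ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK;
  const drainStale =
    e.lastDrainProgressAtMs !== undefined &&
    e.nowMs - e.lastDrainProgressAtMs >= ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS;
  // Either signal alone is enough to escalate; otherwise stay at warn.
  return streakUnhealthy || drainStale ? "error" : "warn";
}
```

Keeping the decision free of logging and clock access makes each branch easy to unit-test; whether the real boundaries are inclusive is not stated in the PR.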
Superagent didn't find any vulnerabilities or security issues in this PR.
Deploying with
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! | gittensory-ui | f2e0d2a | Commit Preview URL, Branch Preview URL | Jul 05 2026, 01:43 AM |
Warning 🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨
⏸️ Gittensory review result - manual review recommended
Review updated: 2026-07-05 03:05:36 UTC
⏸️ Suggested Action - Manual Review
Review summary
Nits: 6 non-blocking
Signal definitions: 🟩 Safe / merged · 🟦 Advisory · 🟨 Held for review · 🟥 Blocked / closed
Checked by Gittensory, a quiet PR intelligence layer for OSS maintainers.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #3313 +/- ##
=======================================
Coverage 93.73% 93.74%
=======================================
Files 276 276
Lines 30381 30394 +13
Branches 11073 11080 +7
=======================================
+ Hits 28479 28492 +13
Misses 1257 1257
Partials 645 645
Summary
Pull-mode orb relay registration failures were escalated to a `warn` log on every single failure, with no way to distinguish a lone transient broker timeout from a genuinely stuck relay link. This adds a consecutive-failure streak (`OrbRelayRegistrationState.consecutiveFailures`, mirroring the `aiConsecutiveFailures`/`AI_UNHEALTHY_FAILURE_STREAK` pattern in `src/selfhost/ai.ts`) plus a no-progress-window check against the pull-mode drain loop's last successful round-trip. A registration failure now only escalates to `error` (dashboard-visible, alertable) when either `ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK` (3) consecutive registration failures have occurred, or the drain loop hasn't completed a round-trip in `ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS` (30 minutes), whichever the caller has evidence for. A single hiccup while `orb_relay_drained` keeps firing now stays a `warn`, exactly matching what "one hiccup, still draining fine" should look like operationally.

Also records a `recovered` counter series alongside the existing `selfhost_orb_relay_register_recovered` log (there was a log but no matching metric), resets the streak on any successful registration, adds two new Grafana gauges + a panel + a Prometheus alert rule so operators can see the streak and drain-staleness at a glance instead of just a bare failure counter, and regenerates the self-host env-var reference doc whose `firstReference` line numbers shifted because of the reordered `server.ts` wiring (no new env vars were added).

No issue filed: this is a small, self-contained observability fix (`linkedIssuePolicy` is `preferred`, not required for this repo).

Scope
- … `type(scope): short summary` Conventional Commit format.
- … `CONTRIBUTING.md` and does not reintroduce GitHub Pages, VitePress, `site/`, or `CNAME`.

Validation
- `git diff --check`
- `npm run actionlint` (via `npm run test:ci`)
- `npm run typecheck`
- `npm run test:coverage` locally (unsharded): 100% line/branch coverage on every changed line in `src/orb/broker-client.ts`, `src/selfhost/monitored-work.ts`, and `src/selfhost/metrics.ts`; `src/server.ts` is Codecov-ignored (self-host entrypoint, exercised by the Docker boot smoke test)
- `npm run test:workers` (via `npm run test:ci`)
- `npm run build:mcp` (via `npm run test:ci`)
- `npm run test:mcp-pack` (via `npm run test:ci`)
- `npm run ui:openapi:check`
- `npm run ui:lint` (via `npm run test:ci`)
- `npm run ui:typecheck` (via `npm run test:ci`)
- `npm run ui:build` (via `npm run test:ci`)
- `npm audit --audit-level=moderate`
- `npm run test:ci` (full local gate, includes `db:migrations:check`, `selfhost:env-reference:check`, `selfhost:validate-observability`, `cf-typegen:check`, `docs:drift-check`, `command-reference:check`, etc.), green end to end
- `npm run cf-typegen` / `npm run db:migrations:check` confirmed as no-ops for this change (no wrangler binding/var or DB schema changes)

Safety
- … `ui:openapi:check`.)
- … `apps/gittensory-ui` changes.)
- … UI Evidence section. (N/A: this PR only touches backend/observability code, Grafana JSON, and a Prometheus rules file; no `apps/gittensory-ui` UI surface.)

Notes
- … `orb_relay_drained` keeps firing stays a `warn`, not an `error`; (b) `ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK` consecutive failures escalates to `error` regardless of drain freshness; (c) the drain loop going stale past `ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS` escalates to `error` even below the streak threshold; (d) a successful registration after prior failures resets the streak to 0 and emits the new `recovered` metric alongside the existing recovery log.
- `test/unit/selfhost-grafana-dashboard.test.ts` updated to assert the new panel/gauges/alert rule exist.
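Case (d), the streak reset plus the `recovered` metric, is the one piece of bookkeeping in these notes, so a rough sketch may help reviewers picture it. The names `RegistrationState`, `RegistrationMetrics`, and `recordRegistrationOutcome` are hypothetical; the PR only confirms that `OrbRelayRegistrationState` carries a lifetime attempts counter and a `consecutiveFailures` streak. Counting a recovery only when a success follows at least one failure is also an assumption.

```typescript
// Hypothetical stand-in for OrbRelayRegistrationState.
interface RegistrationState {
  attempts: number; // lifetime attempts, never reset
  consecutiveFailures: number; // current failure streak
}

// Hypothetical stand-in for the new `recovered` counter series.
interface RegistrationMetrics {
  recovered: number;
}

// Returns the next state; increments metrics.recovered as a side effect
// when a success ends a non-empty failure streak.
function recordRegistrationOutcome(
  state: RegistrationState,
  metrics: RegistrationMetrics,
  succeeded: boolean,
): RegistrationState {
  const attempts = state.attempts + 1;
  if (!succeeded) {
    return { attempts, consecutiveFailures: state.consecutiveFailures + 1 };
  }
  if (state.consecutiveFailures > 0) {
    metrics.recovered += 1; // success after prior failures counts as a recovery
  }
  return { attempts, consecutiveFailures: 0 };
}

// Two failures, then a recovery, then an ordinary success.
const metrics: RegistrationMetrics = { recovered: 0 };
let state: RegistrationState = { attempts: 0, consecutiveFailures: 0 };
for (const ok of [false, false, true, true]) {
  state = recordRegistrationOutcome(state, metrics, ok);
}
// state is { attempts: 4, consecutiveFailures: 0 }; metrics.recovered is 1
```

The second success does not bump the counter because the streak is already 0 by then, which keeps the metric tied to actual recoveries instead of every successful registration.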