fix(selfhost): treat a lone orb relay registration timeout as degraded telemetry, not an error #3313

Merged
JSONbored merged 1 commit into main from fix-orb-relay-registration-streak
Jul 5, 2026

@JSONbored

Owner

Summary

Pull-mode orb relay registration failures were escalated to a warn log on every single failure, with no way to distinguish a lone transient broker timeout from a genuinely stuck relay link. This adds a consecutive-failure streak (OrbRelayRegistrationState.consecutiveFailures, mirroring the aiConsecutiveFailures / AI_UNHEALTHY_FAILURE_STREAK pattern in src/selfhost/ai.ts) plus a no-progress-window check against the pull-mode drain loop's last successful round-trip. A registration failure now only escalates to error (dashboard-visible, alertable) when either ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK (3) consecutive registration failures have occurred, or the drain loop hasn't completed a round-trip in ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS (30 minutes) — whichever the caller has evidence for. A single hiccup while orb_relay_drained keeps firing now stays a warn, exactly matching what "one hiccup, still draining fine" should look like operationally.
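
The gate described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the two constants use the values quoted in the summary, while `RelayHealthState`, `lastDrainAtMs`, and `registrationFailureSeverity` are hypothetical names, and treating a missing drain timestamp as "no staleness evidence" is an assumption.

```typescript
// Thresholds as quoted in the PR summary.
const ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK = 3;
const ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS = 30 * 60 * 1000;

// Hypothetical state shape: only `consecutiveFailures` is named in the PR.
// `lastDrainAtMs` is an assumed field for the last successful drain round-trip.
interface RelayHealthState {
  consecutiveFailures: number;
  lastDrainAtMs: number | null; // null = no drain evidence yet
}

type Severity = "warn" | "error";

// Decide the log severity for a registration failure that has already been
// counted into `state.consecutiveFailures`.
function registrationFailureSeverity(
  state: RelayHealthState,
  nowMs: number,
): Severity {
  const streakUnhealthy =
    state.consecutiveFailures >= ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK;
  // Assumption: without a drain timestamp there is no staleness evidence,
  // so only the streak arm can escalate.
  const drainStale =
    state.lastDrainAtMs !== null &&
    nowMs - state.lastDrainAtMs >= ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS;
  return streakUnhealthy || drainStale ? "error" : "warn";
}
```

Under these assumptions, one failure with a drain a second ago yields `"warn"`, while either a third consecutive failure or a drain timestamp 30 minutes old yields `"error"`.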

This PR also:

  • records a recovered counter series alongside the existing selfhost_orb_relay_register_recovered log (there was a log but no matching metric);
  • resets the streak on any successful registration;
  • adds two new Grafana gauges, a panel, and a Prometheus alert rule, so operators can see the streak and drain staleness at a glance instead of just a bare failure counter;
  • regenerates the self-host env-var reference doc, whose firstReference line numbers shifted because of the reordered server.ts wiring (no new env vars were added).
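
The streak reset and recovered counter could look something like the sketch below. The names `RegistrationStreakState`, `RecoveryMetrics`, and `recordRegistrationOutcome` are hypothetical, and the plain `recovered` number is a stand-in for whatever metrics registry the repo actually uses. Counting a recovery only when a success follows at least one failure is also an assumption.

```typescript
// Hypothetical state: `attempts` is the lifetime counter and
// `consecutiveFailures` is the streak, as described in the PR.
interface RegistrationStreakState {
  attempts: number;
  consecutiveFailures: number;
}

// Stand-in for a metrics registry; the real counter name is not shown in the PR.
interface RecoveryMetrics {
  recovered: number;
}

function recordRegistrationOutcome(
  state: RegistrationStreakState,
  ok: boolean,
  metrics: RecoveryMetrics,
): void {
  state.attempts += 1;
  if (!ok) {
    state.consecutiveFailures += 1;
    return;
  }
  // Assumption: a success that follows at least one failure is a recovery.
  if (state.consecutiveFailures > 0) {
    metrics.recovered += 1;
  }
  // Any successful registration resets the streak.
  state.consecutiveFailures = 0;
}
```

Keeping the lifetime `attempts` counter separate from the streak means a reset never erases the total number of registration attempts.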

No issue filed — this is a small, self-contained observability fix (linkedIssuePolicy is preferred, not required for this repo).

Scope

  • The PR title follows type(scope): short summary Conventional Commit format.
  • This PR is focused and does not mix unrelated backend, UI, MCP, docs, dependency, and deploy changes.
  • This follows CONTRIBUTING.md and does not reintroduce GitHub Pages, VitePress, site/, or CNAME.
  • I linked an issue, or this is small enough that the summary explains why an issue is not needed.

Validation

  • git diff --check
  • npm run actionlint (via npm run test:ci)
  • npm run typecheck
  • npm run test:coverage locally (unsharded) — 100% line/branch coverage on every changed line in src/orb/broker-client.ts, src/selfhost/monitored-work.ts, and src/selfhost/metrics.ts; src/server.ts is Codecov-ignored (self-host entrypoint, exercised by the Docker boot smoke test)
  • npm run test:workers (via npm run test:ci)
  • npm run build:mcp (via npm run test:ci)
  • npm run test:mcp-pack (via npm run test:ci)
  • npm run ui:openapi:check
  • npm run ui:lint (via npm run test:ci)
  • npm run ui:typecheck (via npm run test:ci)
  • npm run ui:build (via npm run test:ci)
  • npm audit --audit-level=moderate
  • New or changed behavior has unit/integration tests for new branches, fallback paths, and sanitizer boundaries
  • npm run test:ci (full local gate, includes db:migrations:check, selfhost:env-reference:check, selfhost:validate-observability, cf-typegen:check, docs:drift-check, command-reference:check, etc.) — green end to end
  • npm run cf-typegen / npm run db:migrations:check confirmed as no-ops for this change (no wrangler binding/var or DB schema changes)

Safety

  • No secrets, wallet details, hotkeys, coldkeys, user PATs, private keys, raw trust scores, private rankings, or private maintainer evidence are exposed.
  • Public GitHub text stays sanitized, low-noise, and does not imply compensation guarantees or optimization tactics.
  • Auth, cookie, CORS, GitHub App, Cloudflare, or session changes include negative-path tests. (N/A — no auth/session/CORS surface touched.)
  • API/OpenAPI/MCP behavior is updated and tested where needed. (N/A — no API/OpenAPI/MCP surface touched; confirmed via ui:openapi:check.)
  • UI changes use live API data or real empty/error/loading states, not production mock/demo fallbacks. (N/A — no apps/gittensory-ui changes.)
  • Visible UI changes include a UI Evidence section. (N/A — this PR only touches backend/observability code, Grafana JSON, and a Prometheus rules file; no apps/gittensory-ui UI surface.)
  • Public docs/changelogs are updated where needed; changelogs are only edited for release-prep PRs. (No changelog touched.)

Notes

  • New test coverage: (a) a single registration failure while orb_relay_drained keeps firing stays a warn, not an error; (b) ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK consecutive failures escalates to error regardless of drain freshness; (c) the drain loop going stale past ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS escalates to error even below the streak threshold; (d) a successful registration after prior failures resets the streak to 0 and emits the new recovered metric alongside the existing recovery log.
  • test/unit/selfhost-grafana-dashboard.test.ts updated to assert the new panel/gauges/alert rule exist.

…not one hiccup

registerOrbRelayWithMonitor escalated a pull-mode registration failure to warn
only, and unconditionally at that -- there was no way to tell a lone transient
broker timeout apart from a sustained outage. Add a consecutive-failure streak
to OrbRelayRegistrationState alongside the existing lifetime attempts counter,
and only alert (error, not warn) once ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK
consecutive failures have occurred, or the pull-mode drain loop hasn't made
progress in ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS -- whichever the caller can
confirm. A single hiccup while orb_relay_drained keeps firing now stays a warning.

Also record a recovered counter alongside the existing recovery log, add
Grafana panels + a Prometheus alert for the new streak/no-progress gauges, and
regenerate the self-host env-var reference, which went stale when the
reordered server.ts wiring shifted its line numbers.

@superagent-security

Superagent didn't find any vulnerabilities or security issues in this PR.

@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! (View logs) | gittensory-ui | f2e0d2a | Commit Preview URL, Branch Preview URL | Jul 05 2026, 01:43 AM |

gittensory-orb Bot added the gittensor:bug label (Gittensor-scored bug fix; scores a 0.5x multiplier) on Jul 5, 2026
@gittensory-orb

gittensory-orb Bot commented Jul 5, 2026

Warning

🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨

⏸️ Gittensory review result - manual review recommended

Review updated: 2026-07-05 03:05:36 UTC

10 files · 1 AI reviewer · no blockers · readiness 93/100 · CI green · clean

⏸️ Suggested Action - Manual Review

  • Touches a guarded path — held for manual review

Review summary
The change correctly moves pull-mode registration failures behind a streak/drain-progress gate, and the production path updates the streak in `registerOrbRelayTargetWithRetry` before `registerOrbRelayWithMonitor` renders the log severity. The new drain timestamp is stamped only after the broker drain returns, so it does not fabricate progress on broker failure, and the server wiring shares the same state object between registration and drain. I do not see a reachable correctness blocker in the provided diff.
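
The ordering the review calls out (stamp the drain timestamp only after the broker drain returns) can be illustrated with a small sketch. `DrainProgressState`, `lastDrainAtMs`, `drainAndStamp`, and the `drainBroker` callback are hypothetical names. The real drain loop in the PR is not shown here.

```typescript
// Hypothetical shared state; `lastDrainAtMs` is an assumed field name.
interface DrainProgressState {
  lastDrainAtMs: number | null;
}

// `drainBroker` stands in for the real broker call and rejects on failure.
async function drainAndStamp(
  state: DrainProgressState,
  drainBroker: () => Promise<number>,
  now: () => number = Date.now,
): Promise<number> {
  // If this rejects, the function exits before the stamp below runs,
  // so a failed drain never looks like progress.
  const drainedCount = await drainBroker();
  state.lastDrainAtMs = now();
  return drainedCount;
}
```

Because the registration path and the drain loop would share the same state object, a stamp written here is what a staleness check on the registration side would read.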

Nits — 6 non-blocking
  • nit: `prometheus/rules/alerts.yml:396` names this `GittensoryOrbRelayRegistrationStuck`, but the second arm can fire on drain-loop staleness alone, so the alert name/runbook should either say relay drain/registration health or require a registration-failure signal in that arm.
  • nit: `src/server.ts:988` registers `gittensory_orb_relay_drain_seconds_since_last` in push mode too and reports `-1`; that is documented in metric metadata, but the Grafana panel uses `or vector(0)`, so a missing series and an intentional never-drained value read differently across views.
  • In `prometheus/rules/alerts.yml:396`, either rename the alert to cover both registration streak and drain staleness, or change the drain-stale arm to include recent registration failures if the intended invariant is exactly the app-side registration alert gate.
  • In `grafana/dashboards/gittensory.json`, consider preserving `-1` rather than using `or vector(0)` for the drain-staleness panel so operators can distinguish never/app-not-scraped from freshly drained.
  • In `test/unit/selfhost-grafana-dashboard.test.ts`, add a small assertion for the alert name/summary once you settle the registration-vs-drain wording so this observability contract does not drift again.
  • Touches a guarded path — held for manual review — A maintainer must review and merge this change.
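
To make the `-1` versus `0` distinction in the nits above concrete, here is a rough sketch of how a seconds-since-last-drain value could keep a "never drained" sentinel separate from "just drained". `NEVER_DRAINED`, `secondsSinceLastDrain`, and `describeDrainAge` are hypothetical helpers, not code from the PR. The clamp at 0 is an added assumption so that clock skew cannot collide with the sentinel.

```typescript
// Sentinel described in the review: -1 means "never drained" (e.g. push mode).
const NEVER_DRAINED = -1;

// Hypothetical helper computing the value for a seconds-since-last-drain gauge.
function secondsSinceLastDrain(
  lastDrainAtMs: number | null,
  nowMs: number,
): number {
  if (lastDrainAtMs === null) return NEVER_DRAINED;
  // Clamp at 0 so clock skew cannot produce a negative value.
  return Math.max(0, Math.floor((nowMs - lastDrainAtMs) / 1000));
}

// Render the value while keeping the sentinel distinct from a fresh drain.
function describeDrainAge(seconds: number): string {
  if (seconds === NEVER_DRAINED) return "never drained";
  return `${seconds}s since last drain`;
}
```

A dashboard query that substitutes `0` for a missing series would present "not scraped" the same way as "drained just now", which is the ambiguity the nit points at.
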

| Signal | Result | Evidence |
| --- | --- | --- |
| Code review | ✅ No blockers | 1 reviewer |
| Linked issue | ⚠️ Missing | No linked issue or no-issue rationale found. |
| Related work | ✅ No active overlap found | No same-issue or scoped active PR overlap found. |
| Change scope | ✅ 20/20 | Low review scope from cached public metadata (no linked issue context). |
| Validation posture | ✅ 25/25 | PR body includes validation/test evidence. |
| Contributor workload | ✅ 10/10 | Author activity: 56 registered-repo PR(s), 46 merged, 423 issue(s). |
| Contributor context | ✅ Confirmed | Gittensor contributor JSONbored; Gittensor profile; 56 PR(s), 423 issue(s). |
| Gate result | ⚠️ Not blocking | Advisory; not blocking this PR. |

Review context
  • Author: JSONbored
  • Role context: owner (maintainer lane)
  • Public audience mode: oss maintainer
  • Lane context: Repository registration is not available in the local Gittensory cache.
  • Public profile languages: Python, TypeScript, JavaScript, Ruby, Go, Kotlin, MDX, Shell
  • Official Gittensor activity: 56 PR(s), 423 issue(s).
  • PR-specific overlap: none found.
Contributor next steps
  • Treat this as maintainer-lane context rather than normal contributor-lane activity.
  • Explain no-issue PR.
  • No action.
  • Link the issue being solved, or explicitly explain why this is a no-issue PR.
Signal definitions
  • Related work = same linked issue, overlapping active PRs, or title/path similarity.
  • Change scope = cached public metadata such as size labels, draft state, and review-burden hints.
  • Validation posture = whether the PR provides enough public validation/test evidence for maintainer review.
  • Contributor workload = public contributor activity and cleanup pressure, not a repo-wide quality failure.
  • Contributor context = public GitHub/Gittensor identity context; non-Gittensor status is not a blocker.

🟩 Safe / merged · 🟦 Advisory · 🟨 Held for review · 🟥 Blocked / closed


Checked by Gittensory, a quiet PR intelligence layer for OSS maintainers.

@codecov

codecov Bot commented Jul 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.74%. Comparing base (1e9284b) to head (f2e0d2a).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3313   +/-   ##
=======================================
  Coverage   93.73%   93.74%           
=======================================
  Files         276      276           
  Lines       30381    30394   +13     
  Branches    11073    11080    +7     
=======================================
+ Hits        28479    28492   +13     
  Misses       1257     1257           
  Partials      645      645           

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/orb/broker-client.ts | 99.10% <100.00%> (+0.02%) ⬆️ |
| src/selfhost/metrics.ts | 100.00% <ø> (ø) |
| src/selfhost/monitored-work.ts | 100.00% <100.00%> (ø) |

gittensory-orb Bot added the manual-review label (Gittensor contributor context) on Jul 5, 2026
JSONbored merged commit 57fe46d into main on Jul 5, 2026
13 checks passed
JSONbored deleted the fix-orb-relay-registration-streak branch on July 5, 2026 04:12