fix(selfhost): treat a lone orb relay registration timeout as degraded telemetry, not an error by JSONbored · Pull Request #3313 · JSONbored/gittensory

JSONbored · 2026-07-05T01:41:30Z

Summary

Pull-mode orb relay registration failures were escalated to a warn log on every single failure, with no way to distinguish a lone transient broker timeout from a genuinely stuck relay link. This adds a consecutive-failure streak (OrbRelayRegistrationState.consecutiveFailures, mirroring the aiConsecutiveFailures / AI_UNHEALTHY_FAILURE_STREAK pattern in src/selfhost/ai.ts) plus a no-progress-window check against the pull-mode drain loop's last successful round-trip. A registration failure now only escalates to error (dashboard-visible, alertable) when either ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK (3) consecutive registration failures have occurred, or the drain loop hasn't completed a round-trip in ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS (30 minutes) — whichever the caller has evidence for. A single hiccup while orb_relay_drained keeps firing now stays a warn, exactly matching what "one hiccup, still draining fine" should look like operationally.
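The escalation gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the two constants and `consecutiveFailures` are named in the summary, while `lastDrainAtMs`, `Severity`, and `recordRegistrationFailure` are hypothetical names chosen for the sketch. It also assumes that a missing drain timestamp counts as "no evidence", so only the streak arm can fire in that case.

```typescript
// Constants named in the PR summary, with the values stated there.
const ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK = 3;
const ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS = 30 * 60 * 1000;

// Hypothetical shape: only `consecutiveFailures` is named in the PR.
// `lastDrainAtMs` is an assumed field for the drain loop's last successful round-trip.
interface OrbRelayRegistrationState {
  consecutiveFailures: number;
  lastDrainAtMs: number | null; // null = no successful drain observed yet
}

type Severity = "warn" | "error";

// Hypothetical helper: record one failed registration and choose the log level.
function recordRegistrationFailure(
  state: OrbRelayRegistrationState,
  nowMs: number,
): Severity {
  state.consecutiveFailures += 1;

  const streakUnhealthy =
    state.consecutiveFailures >= ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK;

  // Assumption: without a drain timestamp there is no staleness evidence,
  // so this arm stays false and only the streak can escalate.
  const drainStale =
    state.lastDrainAtMs !== null &&
    nowMs - state.lastDrainAtMs > ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS;

  return streakUnhealthy || drainStale ? "error" : "warn";
}
```

Under these assumptions, the first two failures with a fresh drain timestamp return `"warn"`, the third returns `"error"`, and a single failure with a drain timestamp older than the window returns `"error"` immediately.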

Also records a recovered counter series alongside the existing selfhost_orb_relay_register_recovered log (there was a log but no matching metric) and resets the streak on any successful registration. Adds two new Grafana gauges, a panel, and a Prometheus alert rule so operators can see the streak and drain staleness at a glance instead of just a bare failure counter. Regenerates the self-host env-var reference doc, whose firstReference line numbers shifted because of the reordered server.ts wiring (no new env vars were added).
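The success path (streak reset plus the new recovered counter) might look something like this. It is a sketch under assumptions: `Counter` is a stand-in for whatever metrics primitive the project uses, `recordRegistrationSuccess` is a hypothetical name, and the counter is incremented only when there were prior failures, which is how the test notes in this PR describe it.

```typescript
// Hypothetical state shape; `attempts` mirrors the lifetime counter mentioned in the commit message.
interface OrbRelayRegistrationState {
  attempts: number;
  consecutiveFailures: number;
}

// Minimal stand-in for a metrics counter (not the project's real metrics API).
class Counter {
  value = 0;
  inc(): void {
    this.value += 1;
  }
}

// Hypothetical helper: handle one successful registration.
// Returns true when this success ended a failure streak.
function recordRegistrationSuccess(
  state: OrbRelayRegistrationState,
  recovered: Counter,
): boolean {
  state.attempts += 1;
  const wasFailing = state.consecutiveFailures > 0;
  state.consecutiveFailures = 0; // any success resets the streak
  if (wasFailing) {
    recovered.inc(); // metric to accompany the existing recovery log
  }
  return wasFailing;
}
```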

No issue filed — this is a small, self-contained observability fix (linkedIssuePolicy is preferred, not required for this repo).

Scope

The PR title follows type(scope): short summary Conventional Commit format.
This PR is focused and does not mix unrelated backend, UI, MCP, docs, dependency, and deploy changes.
This follows CONTRIBUTING.md and does not reintroduce GitHub Pages, VitePress, site/, or CNAME.
I linked an issue, or this is small enough that the summary explains why an issue is not needed.

Validation

Safety

No secrets, wallet details, hotkeys, coldkeys, user PATs, private keys, raw trust scores, private rankings, or private maintainer evidence are exposed.
Public GitHub text stays sanitized, low-noise, and does not imply compensation guarantees or optimization tactics.
Auth, cookie, CORS, GitHub App, Cloudflare, or session changes include negative-path tests. (N/A — no auth/session/CORS surface touched.)
API/OpenAPI/MCP behavior is updated and tested where needed. (N/A — no API/OpenAPI/MCP surface touched; confirmed via ui:openapi:check.)
UI changes use live API data or real empty/error/loading states, not production mock/demo fallbacks. (N/A — no apps/gittensory-ui changes.)
Visible UI changes include a UI Evidence section. (N/A — this PR only touches backend/observability code, Grafana JSON, and a Prometheus rules file; no apps/gittensory-ui UI surface.)
Public docs/changelogs are updated where needed; changelogs are only edited for release-prep PRs. (No changelog touched.)

Notes

New test coverage: (a) a single registration failure while orb_relay_drained keeps firing stays a warn, not an error; (b) ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK consecutive failures escalates to error regardless of drain freshness; (c) the drain loop going stale past ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS escalates to error even below the streak threshold; (d) a successful registration after prior failures resets the streak to 0 and emits the new recovered metric alongside the existing recovery log.
test/unit/selfhost-grafana-dashboard.test.ts updated to assert the new panel/gauges/alert rule exist.
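For readers who want to see the four covered scenarios side by side, here is a small replay sketch. It is not the project's test code: `RelayEvent` and `replay` are hypothetical, the thresholds are copied from the PR summary, and the output labels (`ok`, `recovered`, `warn`, `error`) are chosen for illustration.

```typescript
const STREAK = 3; // ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK per the PR
const WINDOW_MS = 30 * 60 * 1000; // ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS per the PR

type RelayEvent =
  | { kind: "drain"; atMs: number }
  | { kind: "register_ok"; atMs: number }
  | { kind: "register_fail"; atMs: number };

// Hypothetical replay helper: emits one label per registration event.
function replay(events: RelayEvent[]): string[] {
  let streak = 0;
  let lastDrainAtMs: number | null = null;
  const out: string[] = [];
  for (const e of events) {
    if (e.kind === "drain") {
      lastDrainAtMs = e.atMs; // drain progress, no log label
    } else if (e.kind === "register_ok") {
      out.push(streak > 0 ? "recovered" : "ok");
      streak = 0;
    } else {
      streak += 1;
      const stale =
        lastDrainAtMs !== null && e.atMs - lastDrainAtMs > WINDOW_MS;
      out.push(streak >= STREAK || stale ? "error" : "warn");
    }
  }
  return out;
}

const MIN = 60 * 1000;

// (a) lone failure while draining stays a warn; (d) success resets the streak.
const loneHiccup = replay([
  { kind: "drain", atMs: 0 },
  { kind: "register_fail", atMs: 1 * MIN },
  { kind: "register_ok", atMs: 2 * MIN },
  { kind: "register_fail", atMs: 3 * MIN },
]);

// (b) three consecutive failures escalate even though the drain is fresh.
const sustained = replay([
  { kind: "drain", atMs: 0 },
  { kind: "register_fail", atMs: 1 * MIN },
  { kind: "register_fail", atMs: 2 * MIN },
  { kind: "register_fail", atMs: 3 * MIN },
]);

// (c) one failure with a drain older than the window escalates immediately.
const staleDrain = replay([
  { kind: "drain", atMs: 0 },
  { kind: "register_fail", atMs: 31 * MIN },
]);
```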

…not one hiccup

registerOrbRelayWithMonitor escalated a pull-mode registration failure to warn only, and unconditionally at that -- there was no way to tell a lone transient broker timeout apart from a sustained outage. Add a consecutive-failure streak to OrbRelayRegistrationState alongside the existing lifetime attempts counter, and only alert (error, not warn) once ORB_RELAY_REGISTER_UNHEALTHY_FAILURE_STREAK consecutive failures have occurred, or the pull-mode drain loop hasn't made progress in ORB_RELAY_DRAIN_NO_PROGRESS_WINDOW_MS -- whichever the caller can confirm. A single hiccup while orb_relay_drained keeps firing now stays a warning.

Also record a recovered counter alongside the existing recovery log, add Grafana panels + a Prometheus alert for the new streak/no-progress gauges, and regenerate the stale self-host env-var reference that the reordered server.ts wiring shifted.

superagent-security · 2026-07-05T01:41:55Z

Superagent didn't find any vulnerabilities or security issues in this PR.

cloudflare-workers-and-pages · 2026-07-05T01:43:05Z

Deploying with Cloudflare Workers


Status	Name	Latest Commit	Preview URL	Updated (UTC)
✅ Deployment successful! View logs	gittensory-ui	`f2e0d2a`	Commit Preview URL Branch Preview URL	Jul 05 2026, 01:43 AM

gittensory-orb · 2026-07-05T01:44:52Z

Warning

⏸️ Gittensory review result - manual review recommended

Review updated: 2026-07-05 03:05:36 UTC

10 files · 1 AI reviewer · no blockers · readiness 93/100 · CI green · clean

⏸️ Suggested Action - Manual Review

Touches a guarded path — held for manual review

Review summary
The change correctly moves pull-mode registration failures behind a streak/drain-progress gate, and the production path updates the streak in `registerOrbRelayTargetWithRetry` before `registerOrbRelayWithMonitor` renders the log severity. The new drain timestamp is stamped only after the broker drain returns, so it does not fabricate progress on broker failure, and the server wiring shares the same state object between registration and drain. I do not see a reachable correctness blocker in the provided diff.
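The "stamp only after the broker drain returns" property noted in the review summary can be illustrated with a short sketch. The names here (`DrainState`, `drainAndStamp`, the injected `drain` and `now` callbacks) are hypothetical and do not come from the diff. The point is only the ordering: the timestamp assignment sits after the awaited call, so a rejected drain never records progress.

```typescript
// Hypothetical state holder for the drain loop's last successful round-trip.
interface DrainState {
  lastDrainAtMs: number | null;
}

// Hypothetical wrapper: `drain` stands in for the broker round-trip and
// resolves with the number of items drained; `now` is injectable for tests.
async function drainAndStamp(
  state: DrainState,
  drain: () => Promise<number>,
  now: () => number = Date.now,
): Promise<number> {
  // If the broker call rejects, the exception propagates from this line
  // and the stamp below is never reached.
  const drained = await drain();
  state.lastDrainAtMs = now();
  return drained;
}
```

Sharing one state object between the registration path and this wrapper is what would let the registration gate read real drain freshness, as the review describes for the server wiring.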

Nits — 6 non-blocking

nit: `prometheus/rules/alerts.yml:396` names this `GittensoryOrbRelayRegistrationStuck`, but the second arm can fire on drain-loop staleness alone, so the alert name/runbook should either say relay drain/registration health or require a registration-failure signal in that arm.
nit: `src/server.ts:988` registers `gittensory_orb_relay_drain_seconds_since_last` in push mode too and reports `-1`; that is documented in metric metadata, but the Grafana panel uses `or vector(0)`, so a missing series and an intentional never-drained value read differently across views.
In `prometheus/rules/alerts.yml:396`, either rename the alert to cover both registration streak and drain staleness, or change the drain-stale arm to include recent registration failures if the intended invariant is exactly the app-side registration alert gate.
In `grafana/dashboards/gittensory.json`, consider preserving `-1` rather than using `or vector(0)` for the drain-staleness panel so operators can distinguish never/app-not-scraped from freshly drained.
In `test/unit/selfhost-grafana-dashboard.test.ts`, add a small assertion for the alert name/summary once you settle the registration-vs-drain wording so this observability contract does not drift again.
Touches a guarded path — held for manual review — A maintainer must review and merge this change.
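To make the `-1` sentinel discussed in the nits concrete, here is a sketch of how a "seconds since last drain" gauge value could be computed. `drainSecondsSinceLast` is a hypothetical helper, not the code at `src/server.ts:988`. It assumes `-1` means "never drained" (for example in push mode), as the review describes.

```typescript
// Hypothetical helper for the value reported by a gauge such as
// gittensory_orb_relay_drain_seconds_since_last.
function drainSecondsSinceLast(
  lastDrainAtMs: number | null,
  nowMs: number,
): number {
  if (lastDrainAtMs === null) {
    return -1; // sentinel: no drain has ever completed
  }
  // Clamp at 0 so clock adjustments never produce a negative age
  // that could be mistaken for the sentinel.
  return Math.max(0, Math.floor((nowMs - lastDrainAtMs) / 1000));
}
```

With this convention, a panel that falls back to `or vector(0)` shows a missing series as 0, which reads as "just drained", while the application itself reports `-1` for never drained. That mismatch is what the reviewer flags.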

Signal	Result	Evidence
Code review	✅ No blockers	1 reviewer
Linked issue	⚠️ Missing	No linked issue or no-issue rationale found.
Related work	✅ No active overlap found	No same-issue or scoped active PR overlap found.
Change scope	✅ 20/20	Low review scope from cached public metadata (no linked issue context).
Validation posture	✅ 25/25	PR body includes validation/test evidence.
Contributor workload	✅ 10/10	Author activity: 56 registered-repo PR(s), 46 merged, 423 issue(s).
Contributor context	✅ Confirmed Gittensor contributor	JSONbored; Gittensor profile; 56 PR(s), 423 issue(s).
Gate result	⚠️ Not blocking	Advisory; not blocking this PR.

Review context

Author: JSONbored
Role context: owner (maintainer lane)
Public audience mode: oss maintainer
Lane context: Repository registration is not available in the local Gittensory cache.
Public profile languages: Python, TypeScript, JavaScript, Ruby, Go, Kotlin, MDX, Shell
Official Gittensor activity: 56 PR(s), 423 issue(s).
PR-specific overlap: none found.

Contributor next steps

Treat this as maintainer-lane context rather than normal contributor-lane activity.
Explain no-issue PR.
No action.
Link the issue being solved, or explicitly explain why this is a no-issue PR.

Signal definitions

Related work = same linked issue, overlapping active PRs, or title/path similarity.
Change scope = cached public metadata such as size labels, draft state, and review-burden hints.
Validation posture = whether the PR provides enough public validation/test evidence for maintainer review.
Contributor workload = public contributor activity and cleanup pressure, not a repo-wide quality failure.
Contributor context = public GitHub/Gittensor identity context; non-Gittensor status is not a blocker.

🟩 Safe / merged · 🟦 Advisory · 🟨 Held for review · 🟥 Blocked / closed

codecov · 2026-07-05T01:47:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.74%. Comparing base (1e9284b) to head (f2e0d2a).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3313   +/-   ##
=======================================
  Coverage   93.73%   93.74%           
=======================================
  Files         276      276           
  Lines       30381    30394   +13     
  Branches    11073    11080    +7     
=======================================
+ Hits        28479    28492   +13     
  Misses       1257     1257           
  Partials      645      645

Files with missing lines	Coverage Δ
src/orb/broker-client.ts	`99.10% <100.00%> (+0.02%)`	⬆️
src/selfhost/metrics.ts	`100.00% <ø> (ø)`
src/selfhost/monitored-work.ts	`100.00% <100.00%> (ø)`

github-actions Bot deployed to preview/pr-3313 July 5, 2026 01:42 View deployment

gittensory-orb Bot added the gittensor:bug label (Gittensor-scored bug fix; scores a 0.5x multiplier) Jul 5, 2026

gittensory-orb Bot assigned JSONbored Jul 5, 2026

gittensory-orb Bot added the manual-review label (Gittensor contributor context) Jul 5, 2026

jimcody1995 mentioned this pull request Jul 5, 2026

fix(review): rank long-form doc extensions in diffFilePriority #3335

Merged

24 tasks

JSONbored added this to Self-Hosted Review Stack (Gittensory Orb) Jul 5, 2026

github-project-automation Bot moved this to Todo in Self-Hosted Review Stack (Gittensory Orb) Jul 5, 2026

JSONbored added this to the Maintainer auto-maintain & convergence (finalize) milestone Jul 5, 2026

JSONbored merged commit 57fe46d into main Jul 5, 2026
13 checks passed

JSONbored deleted the fix-orb-relay-registration-streak branch July 5, 2026 04:12

github-project-automation Bot moved this from Todo to Done in Self-Hosted Review Stack (Gittensory Orb) Jul 5, 2026

jimcody1995 mentioned this pull request Jul 5, 2026

fix(rag): index TypeScript .mts and .cts module sources #3341

Merged

11 tasks

JSONbored merged 1 commit into main from fix-orb-relay-registration-streak
