Skip to content

fix(selfhost): stop logging routine maintenance-admission backpressure as a warning#3310

Merged
JSONbored merged 1 commit into
mainfrom
fix-maintenance-admission-deferred-log-level
Jul 5, 2026
Merged

fix(selfhost): stop logging routine maintenance-admission backpressure as a warning#3310
JSONbored merged 1 commit into
mainfrom
fix-maintenance-admission-deferred-log-level

Conversation

@JSONbored

Copy link
Copy Markdown
Owner

Summary

  • selfhost_queue_maintenance_admission_deferred was logged at console.warn / level: "warn" in both the SQLite (src/selfhost/sqlite-queue.ts) and Postgres (src/selfhost/pg-queue.ts) queue drivers, even though it fires on ordinary, expected backpressure -- a maintenance-lane job getting pushed back because the live/maintenance/backlog lanes or host load are busy. That's routine steady-state behavior, not something worth flagging at warn severity next to real failures, so operators grepping warn-level logs see noise from a well-functioning system. This PR downgrades both call sites to console.log / level: "info", matching how other routine/expected events in this codebase are leveled (e.g. check_run_cross_app_repost in src/github/app.ts, and the un-leveled console.log backfill/recovery events in sqlite-queue.ts). The metric-recording lines directly above each log call (gittensory_jobs_maintenance_admission_deferred_total and gittensory_jobs_maintenance_admission_deferred_by_reason_total) are untouched -- no new metric was needed, both already exist and are already asserted in the unit suites.
  • Confirmed clarification, not a dashboard bug fix: the "Errors & failures" Loki panel (grafana/dashboards/gittensory.json) filters on a populated top-level JSON field literally named error (| json eventf="event", errf="error" | errf != ""). This log line never sets an error field -- only event/jobType/reason/retry_after_ms -- so it never actually matched that panel, even while it sat at warn. That panel was not polluted and is not being changed for that reason; the log-level change is about the log's own severity being wrong for raw log inspection (journalctl/docker logs/any warn-level grep), independent of that panel.
  • Added a "Maintenance Admission Deferrals (total)" timeseries panel in the existing "Runtime Pressure & Maintenance" dashboard section, right next to "Maintenance Admission Deferrals by Reason", so the aggregate gittensory_jobs_maintenance_admission_deferred_total rate is visible alongside the by-reason breakdown that already existed. No new metric was added -- both panels chart counters the queue drivers already record.

Scope

  • The PR title follows type(scope): short summary Conventional Commit format, for example fix(api): restore profile access checks.
  • This PR is focused and does not mix unrelated backend, UI, MCP, docs, dependency, and deploy changes.
  • This follows CONTRIBUTING.md and does not reintroduce GitHub Pages, VitePress, site/, or CNAME.
  • I linked an issue, or this is small enough that the summary explains why an issue is not needed. (No issue linked -- this is a small, self-evident log-level fix plus a matching dashboard panel, in line with this repo's preferred linked-issue policy.)

Validation

  • git diff --check
  • npm run actionlint
  • npm run typecheck
  • npm run test:coverage locally; codecov/patch requires ≥99% coverage of the lines AND branches you changed (aim for 100% on your diff so CI variance does not fail near the threshold). Global coverage is a non-blocking trend with a loose 90% backstop, not the gate.
  • npm run test:workers
  • npm run build:mcp
  • npm run test:mcp-pack
  • npm run ui:openapi:check
  • npm run ui:lint
  • npm run ui:typecheck
  • npm run ui:build
  • npm audit --audit-level=moderate
  • New or changed behavior has unit/integration tests for new branches, fallback paths, and sanitizer boundaries

Ran the full local gate via npm run test:ci (aggregates all of the above plus db:migrations:check, test:coverage unsharded, ui:test, ui:openapi, and more) -- fully green, plus npm audit --audit-level=moderate clean. Also confirmed via npm run ui:openapi:check and npm run cf-typegen:check that neither generated artifact needed regeneration (no API/schema or wrangler binding changes here), and npm run db:migrations:check confirms no DB migration was needed.

Added a targeted regression test in each of test/unit/selfhost-sqlite-queue.test.ts and test/unit/selfhost-pg-queue.test.ts that spies on console.log/console.warn and asserts the deferred-admission event is emitted at level: "info" via console.log and never via console.warn. Added a test in test/unit/selfhost-grafana-dashboard.test.ts asserting both the by-reason panel and the new total-deferrals panel are present with their expected PromQL expressions.

Safety

  • No secrets, wallet details, hotkeys, coldkeys, user PATs, private keys, raw trust scores, private rankings, or private maintainer evidence are exposed.
  • Public GitHub text stays sanitized, low-noise, and does not imply compensation guarantees or optimization tactics.
  • Auth, cookie, CORS, GitHub App, Cloudflare, or session changes include negative-path tests. (N/A -- no auth/cookie/CORS/session surface touched.)
  • API/OpenAPI/MCP behavior is updated and tested where needed. (N/A -- no API/OpenAPI/MCP surface touched.)
  • UI changes use live API data or real empty/error/loading states, not production mock/demo fallbacks. (N/A -- no apps/gittensory-ui change; this is a Grafana dashboard JSON + backend log level.)
  • Visible UI changes include a UI Evidence section below with screenshots. (N/A -- this changes a Grafana dashboard definition file and a backend log statement, not the product UI; no screenshot evidence applies.)
  • Public docs/changelogs are updated where needed; changelogs are only edited for release-prep PRs. (No changelog edit in this PR.)

Notes

  • Both driver implementations (sqlite-queue.ts and pg-queue.ts) are parallel queue backends implementing the same admission logic, so the log-level change was made identically in both, mirroring the existing pattern where both files carry duplicated logging/metric blocks for the same events.

…e as a warning

The maintenance-admission-deferred event fires whenever the queue defers a
background job under normal pressure -- expected steady-state behavior, not
an operational problem, so it should not sit at warn level next to real
failures. Downgrade both the SQLite and Postgres queue drivers to log it at
info, matching how other routine/expected events are leveled elsewhere.

Also adds a "Maintenance Admission Deferrals (total)" panel next to the
existing by-reason breakdown in the Runtime Pressure & Maintenance section,
using the two counters already recorded alongside this log line.
@superagent-security

Copy link
Copy Markdown

Superagent didn't find any vulnerabilities or security issues in this PR.

@gittensory-orb gittensory-orb Bot added the gittensor:bug Gittensor-scored bug fix — scores a 0.5x multiplier. label Jul 5, 2026
@gittensory-orb

gittensory-orb Bot commented Jul 5, 2026

Copy link
Copy Markdown

Warning

🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨

⏸️ Gittensory review result - manual review recommended

Review updated: 2026-07-05 02:49:19 UTC

6 files · 1 AI reviewer · no blockers · readiness 100/100 · CI green · clean

⏸️ Suggested Action - Manual Review

  • Touches a guarded path — held for manual review

Review summary
This change correctly downgrades routine maintenance-admission deferral logs from warn to info in both self-host queue drivers while leaving the existing metrics intact. The new Grafana panel uses the already-emitted aggregate counter and sits next to the by-reason breakdown, so the observability change matches the logging severity change. The added tests exercise the real deferred-admission path for both SQLite and Postgres rather than fabricating the log payload.

Nits — 3 non-blocking
  • nit: test/unit/selfhost-pg-queue.test.ts:3225 and test/unit/selfhost-sqlite-queue.test.ts:3342 parse `logged.mock.calls.at(-1)` even though the assertion above only proves some call contained the event, so a future unrelated trailing `console.log` would make these tests fail for the wrong reason.
  • test/unit/selfhost-pg-queue.test.ts:3225 and test/unit/selfhost-sqlite-queue.test.ts:3342 should find the matching log call by event before parsing, e.g. `const payload = JSON.parse(logged.mock.calls.find(([line]) => String(line).includes('"event":"selfhost_queue_maintenance_admission_deferred"'))?.[0] as string);`.
  • Touches a guarded path — held for manual review — A maintainer must review and merge this change.
Signal Result Evidence
Code review ✅ No blockers 1 reviewer
Linked issue ✅ No-issue rationale PR body explains why no issue is linked.
Related work ✅ No active overlap found No same-issue or scoped active PR overlap found.
Change scope ✅ 20/20 Low review scope from cached public metadata (no linked issue context).
Validation posture ✅ 25/25 PR body includes validation/test evidence.
Contributor workload ✅ 10/10 Author activity: 56 registered-repo PR(s), 46 merged, 423 issue(s).
Contributor context ✅ Confirmed Gittensor contributor JSONbored; Gittensor profile; 56 PR(s), 423 issue(s).
Gate result ⚠️ Not blocking Advisory; not blocking this PR.
Review context
  • Author: JSONbored
  • Role context: owner (maintainer lane)
  • Public audience mode: oss maintainer
  • Lane context: Repository registration is not available in the local Gittensory cache.
  • Public profile languages: not available
  • Official Gittensor activity: 56 PR(s), 423 issue(s).
  • PR-specific overlap: none found.
Contributor next steps
  • Treat this as maintainer-lane context rather than normal contributor-lane activity.
  • No action.
  • Link the issue being solved, or explicitly explain why this is a no-issue PR.
Signal definitions
  • Related work = same linked issue, overlapping active PRs, or title/path similarity.
  • Change scope = cached public metadata such as size labels, draft state, and review-burden hints.
  • Validation posture = whether the PR provides enough public validation/test evidence for maintainer review.
  • Contributor workload = public contributor activity and cleanup pressure, not a repo-wide quality failure.
  • Contributor context = public GitHub/Gittensor identity context; non-Gittensor status is not a blocker.

🟩 Safe / merged · 🟦 Advisory · 🟨 Held for review · 🟥 Blocked / closed


💰 Earn for open-source contributions like this. Gittensor lets GitHub contributors earn for the work they already do — register to start earning →.

Checked by Gittensory, a quiet PR intelligence layer for OSS maintainers.

  • Re-run Gittensory review

@codecov

codecov Bot commented Jul 5, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.14%. Comparing base (114921a) to head (a1e7cc8).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3310   +/-   ##
=======================================
  Coverage   94.14%   94.14%           
=======================================
  Files         276      276           
  Lines       30246    30246           
  Branches    11021    11021           
=======================================
  Hits        28474    28474           
  Misses       1127     1127           
  Partials      645      645           
Files with missing lines Coverage Δ
src/selfhost/sqlite-queue.ts 99.60% <100.00%> (ø)
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@gittensory-orb gittensory-orb Bot added the manual-review Gittensor contributor context label Jul 5, 2026
@JSONbored JSONbored merged commit 2cc645e into main Jul 5, 2026
11 checks passed
@JSONbored JSONbored deleted the fix-maintenance-admission-deferred-log-level branch July 5, 2026 04:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gittensor:bug Gittensor-scored bug fix — scores a 0.5x multiplier. manual-review Gittensor contributor context

Development

Successfully merging this pull request may close these issues.

1 participant