Skip to content

fix(queue): retry PR public-surface publish on a transient GitHub failure#3440

Merged
gittensory-orb[bot] merged 1 commit into
mainfrom
fix/pr-publish-retry-on-transient-failure
Jul 5, 2026
Merged

fix(queue): retry PR public-surface publish on a transient GitHub failure#3440
gittensory-orb[bot] merged 1 commit into
mainfrom
fix/pr-publish-retry-on-transient-failure

Conversation

@JSONbored

Copy link
Copy Markdown
Owner

Summary

  • Found via Sentry (issue GITTENSORY-5, 12 occurrences, regressed): Error: PR public-surface publish failed — review produced output but nothing was posted to the PR, culprit finishPublicSurfacePublication, most recently on JSONbored/awesome-claude PR #4251 with failedOutputs: ["comment"].
  • In src/queue/processors.ts, the three PR public-surface publish attempts (check_run, comment, label, all inside maybePublishPrPublicSurface) swallow every failure into failedOutputs and only rethrow when isGitHubRateLimitedError(error) is true. A GitHub 5xx or momentary token issue isn't rate-limit-shaped, so it fell through to swallow-and-audit: finishPublicSurfacePublication recorded the audit event and escalated to Sentry, but the job still completed "successfully" from the queue's point of view — a review that computed real output silently never reached the PR, with no retry.
  • Fix: capture whether each failure is transient (isGitHubRateLimitedError or an HTTP 5xx via githubErrorStatus) at catch timeerrorMessage() already reduces the error to a plain string before finishPublicSurfacePublication runs, discarding the status/response shape needed to reclassify later. When publishedOutputs is empty (nothing reached the PR) and at least one failure was transient, throw a new RetryablePublicSurfacePublishFailedError (same RetryableJobError shape/placement as the existing RetryablePullRequestFreshnessUnavailableError / PrActuationLockContendedError) so the queue retries the whole job. A permanent 4xx-only failure keeps today's exact swallow-and-audit behavior — retrying forever would never converge. All three publish call sites already had if (isGitHubRateLimitedError(error) || isRetryableJobError(error)) throw error; guards, so the new error propagates with zero call-site changes needed beyond capturing transient.
  • No issue filed — found via direct Sentry investigation, not a report.

Scope

  • The PR title follows type(scope): short summary Conventional Commit format, for example fix(api): restore profile access checks.
  • This PR is focused and does not mix unrelated backend, UI, MCP, docs, dependency, and deploy changes.
  • This follows CONTRIBUTING.md and does not reintroduce GitHub Pages, VitePress, site/, or CNAME.
  • I linked an issue, or this is small enough that the summary explains why an issue is not needed.

Validation

  • git diff --check
  • npm run typecheck (clean)
  • npx vitest run test/unit/queue.test.ts — 488/488 passing
  • npm run test:workers / npm run build:mcp / npm run test:mcp-pack / npm run ui:openapi:check / npm run ui:build — not run individually this PR; no worker/MCP/OpenAPI/UI surface touched.
  • New or changed behavior has unit/integration tests for new branches, fallback paths, and sanitizer boundaries — added: a transient GitHub 5xx dropping every output now retries the whole job (asserted via processJob rejecting with retryKind: "public_surface_publish_transient", the audit still recorded, the webhook row left in "error" status so the queue can reprocess it); a fully successful publish is completely unaffected. Two pre-existing tests used a 503 fixture that (before this fix) was silently swallowed — since 503 is now correctly transient, I updated those fixtures to a genuinely-permanent 403 ("Resource not accessible by integration") so they keep testing what they always intended (permanent-failure aggregate audit, and label-only duplicate-comment suppression) rather than accidentally coupling to the new retry path; one other pre-existing test's assertion was updated (not weakened) to reflect the now-correct retry-on-5xx behavior it was actually exercising.

Safety

  • No secrets, wallet details, hotkeys, coldkeys, user PATs, private keys, raw trust scores, private rankings, or private maintainer evidence are exposed.
  • Public GitHub text stays sanitized, low-noise, and does not imply compensation guarantees or optimization tactics.
  • Auth, cookie, CORS, GitHub App, Cloudflare, or session changes include negative-path tests. — N/A, no auth/session/CORS surface changed (installation-token failures are already covered by the transient-vs-permanent test pair above).
  • API/OpenAPI/MCP behavior is updated and tested where needed. — N/A, no API/OpenAPI/MCP surface changed.
  • UI changes use live API data or real empty/error/loading states, not production mock/demo fallbacks. — N/A, no UI change.
  • Visible UI changes include a UI Evidence section below. — N/A, no visible UI change.
  • Public docs/changelogs are updated where needed. — N/A, internal reliability fix, no documented/user-facing behavior change (a contributor now reliably gets their review instead of silence during a transient GitHub outage).

Notes

  • One of five fixes from a live stack-health pass (Sentry + Loki audit on the self-hosted deployment); see the sibling PRs for the RAG chunk-cap indexing priority, REES safeCodeSpan TypeError, codex hang-detection, and Sentry release-validation strict-mode fixes.

A rate-limit blip, GitHub 5xx, or momentary token issue during the
comment/check-run/label publish attempts was swallowed and only
audited, so the job still completed "successfully" even though the
review never reached the PR, with nothing left to retry it.

Classify each publish failure as transient or permanent at catch time
(errorMessage() already discards the status code needed to reclassify
later), and when nothing published at all and at least one failure was
transient, throw a RetryableJobError so the queue retries the whole
job. A permanent 4xx keeps today's swallow-and-audit behavior.
@superagent-security

Copy link
Copy Markdown

Superagent didn't find any vulnerabilities or security issues in this PR.

@gittensory-orb gittensory-orb Bot added the gittensor:bug Gittensor-scored bug fix — scores a 0.5x multiplier. label Jul 5, 2026
@gittensory-orb

gittensory-orb Bot commented Jul 5, 2026

Copy link
Copy Markdown

Tip

🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩

✅ Gittensory review result - approve/merge recommended

Review updated: 2026-07-05 07:14:45 UTC

2 files · 1 AI reviewer · no blockers · readiness 93/100 · CI green · clean

✅ Suggested Action - Approve/Merge

  • safe to merge

Review summary
This change correctly moves transient classification to the catch sites where the original GitHub error object is still available, records that classification in the aggregate audit metadata, and only converts an all-outputs-failed publish into a retryable job when at least one failure is transient. The test coverage exercises the important reachable paths: permanent 403 remains swallowed/audited, 5xx with no published output retries, and successful publication is unaffected. I do not see a correctness blocker in the visible diff.

Nits — 4 non-blocking
  • nit: src/queue/processors.ts:5732 hard-codes the transient HTTP threshold as `500`; a named constant such as `GITHUB_TRANSIENT_STATUS_MIN = 500` would make the retry boundary easier to audit later.
  • nit: src/queue/processors.ts:6949 hard-codes the retry delay as `60_000`; consider naming it alongside the other retryable job timing constants if this file already has that convention.
  • src/queue/processors.ts:5732: extract the HTTP 5xx cutoff into a named constant so future reviewers do not have to infer whether 408/429-style cases are intentionally excluded.
  • src/queue/processors.ts:6952: extract `60_000` to a local retry-delay constant or reuse an existing retry delay constant if one exists near `RetryablePullRequestFreshnessUnavailableError` / `PrActuationLockContendedError`.
Signal Result Evidence
Code review ✅ No blockers 1 reviewer
Linked issue ⚠️ Missing No linked issue or no-issue rationale found.
Related work ✅ No active overlap found No same-issue or scoped active PR overlap found.
Change scope ✅ 20/20 Low review scope from cached public metadata (no linked issue context).
Validation posture ✅ 25/25 PR body includes validation/test evidence.
Contributor workload ✅ 10/10 Author activity: 56 registered-repo PR(s), 46 merged, 416 issue(s).
Contributor context ✅ Confirmed Gittensor contributor JSONbored; Gittensor profile; 56 PR(s), 416 issue(s).
Gate result ✅ Passing No configured blocker found.
Review context
  • Author: JSONbored
  • Role context: owner (maintainer lane)
  • Public audience mode: oss maintainer
  • Lane context: Repository registration is not available in the local Gittensory cache.
  • Public profile languages: Python, TypeScript, JavaScript, Ruby, Go, Kotlin, MDX, Shell
  • Official Gittensor activity: 56 PR(s), 416 issue(s).
  • PR-specific overlap: none found.
Contributor next steps
  • Treat this as maintainer-lane context rather than normal contributor-lane activity.
  • Explain no-issue PR.
  • No action.
  • Link the issue being solved, or explicitly explain why this is a no-issue PR.
Signal definitions
  • Related work = same linked issue, overlapping active PRs, or title/path similarity.
  • Change scope = cached public metadata such as size labels, draft state, and review-burden hints.
  • Validation posture = whether the PR provides enough public validation/test evidence for maintainer review.
  • Contributor workload = public contributor activity and cleanup pressure, not a repo-wide quality failure.
  • Contributor context = public GitHub/Gittensor identity context; non-Gittensor status is not a blocker.

🟩 Safe / merged · 🟦 Advisory · 🟨 Held for review · 🟥 Blocked / closed


💰 Earn for open-source contributions like this. Gittensor lets GitHub contributors earn for the work they already do — register to start earning →.

Checked by Gittensory, a quiet PR intelligence layer for OSS maintainers.

  • Re-run Gittensory review

@codecov

codecov Bot commented Jul 5, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.47%. Comparing base (5bf0a77) to head (fc29451).
⚠️ Report is 7 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3440   +/-   ##
=======================================
  Coverage   93.47%   93.47%           
=======================================
  Files         292      292           
  Lines       30797    30804    +7     
  Branches    11225    11227    +2     
=======================================
+ Hits        28786    28793    +7     
  Misses       1355     1355           
  Partials      656      656           
Files with missing lines Coverage Δ
src/queue/processors.ts 93.04% <100.00%> (+0.01%) ⬆️
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@gittensory-orb gittensory-orb Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gittensory approves — the gate is satisfied and CI is green.

@gittensory-orb gittensory-orb Bot merged commit ab0fdd9 into main Jul 5, 2026
10 checks passed
@gittensory-orb gittensory-orb Bot deleted the fix/pr-publish-retry-on-transient-failure branch July 5, 2026 07:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gittensor:bug Gittensor-scored bug fix — scores a 0.5x multiplier.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant