fix(queue): retry PR public-surface publish on a transient GitHub failure #3440
A rate-limit blip, GitHub 5xx, or momentary token issue during the comment/check-run/label publish attempts was swallowed and only audited, so the job still completed "successfully" even though the review never reached the PR, with nothing left to retry it. Classify each publish failure as transient or permanent at catch time (errorMessage() already discards the status code needed to reclassify later), and when nothing published at all and at least one failure was transient, throw a RetryableJobError so the queue retries the whole job. A permanent 4xx keeps today's swallow-and-audit behavior.
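The classify-at-catch-time rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `recordFailure`, `finishPublication`, `isRateLimited`, and the `PublishFailure` shape are hypothetical stand-ins, the 429 check is a simplification of whatever the real rate-limit helper inspects, and the `status` property is assumed to be present on GitHub client errors.

```typescript
// Hypothetical sketch: names and shapes below are assumptions for illustration,
// not the actual helpers in src/queue/processors.ts.
type PublishOutput = "check_run" | "comment" | "label";

interface PublishFailure {
  output: PublishOutput;
  message: string;
  // Captured at catch time, before the error is reduced to a plain string.
  transient: boolean;
}

class RetryableJobError extends Error {
  retryKind: string;
  constructor(message: string, retryKind: string) {
    super(message);
    this.name = "RetryableJobError";
    this.retryKind = retryKind;
  }
}

// Assumption: GitHub client errors expose a numeric `status` property.
function githubErrorStatus(error: unknown): number | undefined {
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// Simplified stand-in for a rate-limit check: only looks for HTTP 429.
function isRateLimited(error: unknown): boolean {
  return githubErrorStatus(error) === 429;
}

// Classify while the original error object is still available.
function recordFailure(output: PublishOutput, error: unknown): PublishFailure {
  const status = githubErrorStatus(error);
  const is5xx = status !== undefined && status >= 500 && status <= 599;
  return {
    output,
    message: error instanceof Error ? error.message : String(error),
    transient: isRateLimited(error) || is5xx,
  };
}

// Throw only when nothing reached the PR and at least one failure was transient.
// Otherwise the failure stays swallowed (a real implementation would audit it here).
function finishPublication(
  published: PublishOutput[],
  failures: PublishFailure[],
): void {
  if (failures.length === 0) return;
  if (published.length === 0 && failures.some((f) => f.transient)) {
    throw new RetryableJobError(
      "PR public-surface publish failed transiently",
      "public_surface_publish_transient",
    );
  }
}
```

Under these assumptions, a 503 on the only publish attempt would make the job throw and be retried, while a 403, or a 503 alongside at least one successful output, would not.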
Superagent didn't find any vulnerabilities or security issues in this PR.
✅ Gittensory review result: approve/merge recommended
Review updated: 2026-07-05 07:14:45 UTC
✅ Suggested Action - Approve/Merge
Review summary: Nits, 4 non-blocking
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3440   +/-   ##
=======================================
  Coverage   93.47%   93.47%
=======================================
  Files         292      292
  Lines       30797    30804    +7
  Branches    11225    11227    +2
=======================================
+ Hits        28786    28793    +7
  Misses       1355     1355
  Partials      656      656
Summary
- Error: PR public-surface publish failed — review produced output but nothing was posted to the PR, culprit finishPublicSurfacePublication, most recently on JSONbored/awesome-claude PR #4251 with failedOutputs: ["comment"].
- In src/queue/processors.ts, the three PR public-surface publish attempts (check_run, comment, label, all inside maybePublishPrPublicSurface) swallow every failure into failedOutputs and only rethrow when isGitHubRateLimitedError(error) is true. A GitHub 5xx or momentary token issue isn't rate-limit-shaped, so it fell through to swallow-and-audit: finishPublicSurfacePublication recorded the audit event and escalated to Sentry, but the job still completed "successfully" from the queue's point of view. A review that computed real output silently never reached the PR, with no retry.
- Classify each publish failure as transient (isGitHubRateLimitedError or an HTTP 5xx via githubErrorStatus) at catch time, because errorMessage() already reduces the error to a plain string before finishPublicSurfacePublication runs, discarding the status/response shape needed to reclassify later. When publishedOutputs is empty (nothing reached the PR) and at least one failure was transient, throw a new RetryablePublicSurfacePublishFailedError (same RetryableJobError shape/placement as the existing RetryablePullRequestFreshnessUnavailableError / PrActuationLockContendedError) so the queue retries the whole job. A permanent 4xx-only failure keeps today's exact swallow-and-audit behavior, because retrying forever would never converge. All three publish call sites already had if (isGitHubRateLimitedError(error) || isRetryableJobError(error)) throw error; guards, so the new error propagates with zero call-site changes needed beyond capturing transient.
Scope
- Title follows the type(scope): short summary Conventional Commit format, for example fix(api): restore profile access checks.
- Follows CONTRIBUTING.md and does not reintroduce GitHub Pages, VitePress, site/, or CNAME.
Validation
- git diff --check
- npm run typecheck (clean)
- npx vitest run test/unit/queue.test.ts: 488/488 passing
- npm run test:workers / npm run build:mcp / npm run test:mcp-pack / npm run ui:openapi:check / npm run ui:build: not run individually for this PR; no worker/MCP/OpenAPI/UI surface touched.
- Tests cover the transient-failure path (processJob rejecting with retryKind: "public_surface_publish_transient", the audit still recorded, the webhook row left in "error" status so the queue can reprocess it); a fully successful publish is completely unaffected.
- Two pre-existing tests used a 503 fixture that (before this fix) was silently swallowed. Since 503 is now correctly transient, I updated those fixtures to a genuinely permanent 403 ("Resource not accessible by integration") so they keep testing what they always intended (permanent-failure aggregate audit, and label-only duplicate-comment suppression) rather than accidentally coupling to the new retry path.
- One other pre-existing test's assertion was updated (not weakened) to reflect the now-correct retry-on-5xx behavior it was actually exercising.
Safety
- UI Evidence section below: N/A, no visible UI change.
Notes
- safeCodeSpan TypeError, codex hang-detection, and Sentry release-validation strict-mode fixes.