release: gastown-staging -> main #3151
…ayor tools on prewarm
Three independent fixes for the startAgentInContainer timeout regression introduced by #2974, plus a tighter container-instance cap.
1. Hydration gate (control-server.ts, process-manager.ts)
The control server starts accepting requests immediately at boot, while bootHydration runs concurrently and serialises every registry agent + the mayor prewarm through the global sdkServerLock. Fresh /agents/start, /refresh-token, and PATCH /agents/:id/model requests queued behind that work and the DO-side AbortSignal.timeout(60s) fired before they ever got the lock, surfacing as "TimeoutError: aborted due to timeout" and "timeout after 6000ms: ensureSDKServer for <agentId>". A new awaitHydration() promise is awaited at the top of those handlers (before any process.env mutation in the model PATCH path) so they don't compound the queue.
2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts, process-manager.ts)
buildPrewarmEnv was constructing KILO_CONFIG_CONTENT from hardcoded defaults (anthropic/claude-sonnet-4.6 / claude-haiku-4.5), so the real /agents/start with the user's actual model triggered ensureSDKServer's "config mismatch, evicting prewarmed server" path on every warm restart, doubling lock-holding time on the critical path the prewarm was supposed to speed up. The /api/towns/:id/mayor-id endpoint now returns the full prewarm context (model, smallModel, kilocodeToken, organizationId) resolved the same way _ensureMayor resolves it, and the container builds the prewarm KILO_CONFIG_CONTENT to match. Falls back gracefully to a skip when the worker hasn't deployed the richer endpoint yet.
3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)
prewarmMayorSDK called mayorWorkdirForTown (which only returns a string) and went straight to ensureSDKServer's process.chdir, throwing ENOENT on cold containers because createMayorWorkspace only ran from runAgent. Exported ensureMayorWorkspaceForTown so prewarm materialises the workspace first. More critically, buildPrewarmEnv was missing GASTOWN_AGENT_ROLE, GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID, the env vars the kilo serve plugin (plugin/index.ts) reads at spawn to decide whether to register mayor tools. Without them the prewarmed server booted with NO mayor tools, and the cache hit on the next /agents/start handed that defective instance back to the user. Now mirrors the mayor-shaped subset of buildAgentEnv. Added an end-to-end test that intercepts createKilo and asserts the env at spawn time.
4. wrangler.jsonc: lower TownContainerDO max_instances from 800 to 500.
Verified with pnpm --filter gastown-container test (67/67 pass), pnpm --filter cloudflare-gastown typecheck, oxlint, and pnpm format.
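As a rough illustration of fixes 2 and 3, the sketch below builds a prewarm env that carries the three GASTOWN_* keys the plugin reads at spawn, and skips prewarm when the model or token is missing. The `PrewarmContext` interface, the function name `buildPrewarmEnvSketch`, and the JSON layout of KILO_CONFIG_CONTENT are assumptions made for this example, not the repository's actual code.

```typescript
// Hypothetical shape of the context the container receives before prewarm.
interface PrewarmContext {
  agentId: string;
  townId: string;
  model?: string;
  smallModel?: string;
  kilocodeToken?: string;
}

// Returns the env for the prewarmed server, or null to signal "skip prewarm".
function buildPrewarmEnvSketch(ctx: PrewarmContext): Record<string, string> | null {
  // Without the real model and token, any config built here could not match
  // what the next /agents/start sends, so skipping is the safer choice.
  if (!ctx.model || !ctx.kilocodeToken) return null;
  return {
    // Keys the commit message says the plugin reads at spawn time.
    GASTOWN_AGENT_ROLE: "mayor",
    GASTOWN_AGENT_ID: ctx.agentId,
    GASTOWN_TOWN_ID: ctx.townId,
    // Assumed JSON layout; the real config content is not shown in the source.
    KILO_CONFIG_CONTENT: JSON.stringify({
      model: ctx.model,
      small_model: ctx.smallModel ?? ctx.model,
    }),
  };
}
```

Returning null rather than a default-filled env mirrors the "falls back gracefully to a skip" behaviour described above.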
Code Review Summary
Status: No Issues Found | Recommendation: Merge ✅
All Previously Flagged Issues Resolved
Files Reviewed (all commits)
Reviewed by claude-4.6-sonnet-20260217 · 464,708 tokens
* chore(gastown): remove manual request logging middleware
* fix(gastown): unblock /agents/start during boot hydration; preserve mayor tools on prewarm
* feat(gastown): per-route logger tagging via Hono params (review on #3158)
---------
Co-authored-by: John Fawcett <john@kilcoode.ai>
…st procedure
Adds three dev-only debug endpoints for autonomous convoy testing without going through the mayor LLM:
- GET /debug/towns/:townId/rigs: list rigs in a town
- POST /debug/towns/:townId/sling-convoy: call Town.slingConvoy() directly
- GET /debug/towns/:townId/convoys: list active convoys with progress
Documents the new endpoints and adds a Test C section to e2e-pr-feedback-testing.md with a deterministic procedure for verifying review-then-land convoys end-to-end (sub-bead PRs into the convoy feature branch, then a landing PR into main). Also captures known issues observed during verification: container MTU/TLS handshake failures with github.com, 'failed' blockers not gating dependents, and intermittent polecat skipping of sub-PR creation.
… stale stored value
resolveGitHubToken previously preferred git_auth.github_token over the platform integration. Since GitHub App installation tokens have a 1h TTL but git_auth.github_token is only written at rig creation (or rare manual refresh), every long-lived town with an integration was handing out an expired token to:
- Polecat/refinery 'gh' CLI (via GH_TOKEN derived from GIT_TOKEN in the container), surfacing as 'Failed to log in to github.com using token (GH_TOKEN). The token in GH_TOKEN is invalid.'
- The worker-side PR poller (checkPRStatus, checkPRFeedback, mergePR, areThreadsBlocking): 401 from api.github.com.
- The /refresh-git-token endpoint the container falls back to on auth failure: it returned the same expired token, so the retry just re-failed.
Verified by hitting api.github.com with a local town's stored token: 401 even though the integration service mints fresh ones fine.
Fix:
- Flip resolveGitHubToken's priority to github_cli_pat -> live integration -> stored github_token (last-resort fallback for towns with no integration). Empty-string responses from the integration service now warn and fall back instead of silently failing.
- Resolve a fresh token at agent dispatch (startAgentInContainer), merge dispatch (startMergeInContainer), and rig setup (setupRigRepoInContainer) before stuffing GIT_TOKEN into envVars.
- buildContainerConfig now resolves a fresh token before serializing git_auth.github_token into the X-Town-Config header. The container's syncTownConfigToProcessEnv path reads this on every request to update process.env.GIT_TOKEN, which buildLiveHotSwapEnv then derives GH_TOKEN from on token-refresh hot-swaps. townId is required (not optional) so a forgotten arg can't silently regress to the stale-token shape.
- syncConfigToContainer resolves a fresh token before persisting GIT_TOKEN to DO storage for next boot.
Adds 6 unit tests covering the priority chain (cli_pat preferred, fresh integration over stale stored, fallback on lookup failure, rig-level integration ID, no-config returns null).
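The priority chain described above can be sketched as a pure selection function. This is only an illustration: `TokenSources`, `PickedToken`, and `pickGitHubToken` are hypothetical names, and the real resolveGitHubToken is presumably asynchronous because it calls the integration service. Here the integration result is assumed to have been fetched already.

```typescript
// Hypothetical inputs: each source may be absent, null, or an empty string.
interface TokenSources {
  githubCliPat?: string | null;
  integrationToken?: string | null; // freshly minted by the integration service
  storedGithubToken?: string | null; // git_auth.github_token, possibly stale
}

type PickedToken =
  | { token: string; source: "github_cli_pat" | "integration" | "stored" }
  | null;

function pickGitHubToken(s: TokenSources): PickedToken {
  // 1. An explicit CLI PAT wins.
  if (s.githubCliPat) return { token: s.githubCliPat, source: "github_cli_pat" };
  // 2. A live integration token beats the stored value.
  if (s.integrationToken) return { token: s.integrationToken, source: "integration" };
  // An empty string from the integration is treated as a failed lookup:
  // warn and fall through instead of handing out an unusable token.
  if (s.integrationToken === "") {
    console.warn("integration returned an empty token; falling back to stored token");
  }
  // 3. Stored token is the last resort.
  if (s.storedGithubToken) return { token: s.storedGithubToken, source: "stored" };
  return null;
}
```

Keeping the stored token last means a town with a working integration never serves the value written at rig creation.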
…3160)
* fix(gastown): distinguish null causes in PR status polling (#3149)
Replace PRStatusResult | null return type with discriminated PRStatusOutcome union in checkPRStatus. Each null cause (no token, HTTP error, invalid response, unrecognized URL, host mismatch) now surfaces a structured PRStatusError with actionable failure messages.
- resolveGitHubToken returns GitHubTokenResolution with resolution chain
- no_token and non-transient HTTP errors (401/403/404) fail immediately
- invalid_response/unrecognized_url/host_mismatch fail after 3 strikes
- Transient HTTP errors (5xx/429) keep existing 10-strike behavior
- poll_null_count resets to 0 on successful poll at both call sites
- failureKind persisted to bead metadata for analytics
- AE event pr.poll_failed emitted on terminal failure
- Unit tests for checkPRStatus, resolveGitHubToken, failureMessageFor, and threshold logic
- Integration test for no_token immediate-fail path
* style: apply oxfmt formatting
* fix(gastown): track integration source when GIT_TOKEN_SERVICE unbound (review on town-scm.ts:66)
When integrationId is set but GIT_TOKEN_SERVICE binding is missing, the configured integration source was silently omitted from the tried array. Add an else branch that pushes the source label with a '(GIT_TOKEN_SERVICE not bound)' annotation so the no_token error message lists all attempted sources.
* fix(gastown): fail immediately for unrecognized_url and host_mismatch (review on actions.ts:374)
Both are deterministic configuration errors that cannot self-resolve on retry. Move them from the 3-strike bucket to the fail-immediately bucket alongside no_token and non-transient http_error. Only invalid_response remains in the 3-strike category.
* fix(gastown): use separate counters for transient vs non-transient poll errors (review on actions.ts:1350)
Replace the shared poll_null_count with poll_transient_count and poll_non_transient_count. Each error category increments only its own counter and resets the other, preventing cross-contamination where 9 transient errors followed by 1 non-transient error would incorrectly fail the bead.
Legacy poll_null_count is migrated on first read: the transient branch falls back to poll_null_count when poll_transient_count is absent. This ensures in-flight beads at deploy time retain their existing counter value. The non-transient branch does not read the legacy field since these counters reset on every success anyway; at worst an in-flight bead gets one extra retry for invalid_response.
* fix(gastown): resolve merge conflict in resolveGitHubToken - merge staging priority with PR #3160 structured return type
- resolveGitHubToken now uses staging's priority: cli_pat → integration → stored token
- Returns GitHubTokenResolution discriminated union (from PR #3160)
- Includes unbound-service else branch (GIT_TOKEN_SERVICE not bound)
- Adds resolveGitHubTokenString helper for non-error-aware callers
- Updates Town.do.ts, container-dispatch.ts, config.ts to use helper
- Updates town-scm.test.ts for GitHubTokenResolution return shape
- Updates pr-poll-errors.test.ts for new priority order
---------
Co-authored-by: John Fawcett <john@kilcoode.ai>
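The counter rules in this squashed commit can be sketched as a small pure function. Everything here is a hedged reading of the commit text rather than the real actions.ts code: the type and function names are hypothetical, "10-strike" and "3 strikes" are assumed to mean the bead fails on the 10th and 3rd consecutive error, and any HTTP status other than 5xx/429 is treated as non-transient.

```typescript
type PollErrorKind =
  | "no_token" | "http_error" | "invalid_response" | "unrecognized_url" | "host_mismatch";

interface PollError { kind: PollErrorKind; status?: number }

interface PollCounters {
  poll_transient_count?: number;
  poll_non_transient_count?: number;
  poll_null_count?: number; // legacy shared counter, read only as a fallback
}

const TRANSIENT_LIMIT = 10; // assumed meaning of "10-strike"
const NON_TRANSIENT_LIMIT = 3; // assumed meaning of "3 strikes"

function isTransient(e: PollError): boolean {
  return e.kind === "http_error" && e.status !== undefined &&
    (e.status >= 500 || e.status === 429);
}

function recordPollError(c: PollCounters, e: PollError): { counters: PollCounters; fail: boolean } {
  if (isTransient(e)) {
    // Migrate the legacy counter on first read, then reset the other category.
    const n = (c.poll_transient_count ?? c.poll_null_count ?? 0) + 1;
    return {
      counters: { poll_transient_count: n, poll_non_transient_count: 0 },
      fail: n >= TRANSIENT_LIMIT,
    };
  }
  if (e.kind === "invalid_response") {
    // The legacy field is deliberately not read on this branch.
    const n = (c.poll_non_transient_count ?? 0) + 1;
    return {
      counters: { poll_transient_count: 0, poll_non_transient_count: n },
      fail: n >= NON_TRANSIENT_LIMIT,
    };
  }
  // no_token, non-transient HTTP errors, unrecognized_url, host_mismatch:
  // deterministic failures that cannot self-resolve, so fail immediately.
  return { counters: c, fail: true };
}
```

With separate counters, nine transient errors followed by one invalid_response leave the bead alive, which is the cross-contamination case the commit calls out.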
… awaitHydration in /refresh-token The /refresh-token handler assigned process.env.GASTOWN_CONTAINER_TOKEN before awaiting hydration, inconsistent with PATCH /agents/:id/model which gates first. Mid-hydration token refresh could cause buildPrewarmEnv to pick up a different token than the one hydration captured locally.
…stead of module global The _resolveHydration module-global stale-capture pattern would orphan the first promise's resolver if bootHydration() were ever called concurrently. Capturing resolve as a local inside bootHydration() itself eliminates the risk and removes the module-global.
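A minimal sketch of the hydration gate after this refactor, assuming the shape the commit messages describe: a module-level promise that is resolved by default, replaced when hydration starts, and resolved in a finally through a resolver captured locally. The `work` parameter is a stand-in for the real hydration steps and is an assumption of this example.

```typescript
// Default-resolved so contexts that never hydrate (tests, dev) are not blocked.
let hydration: Promise<void> = Promise.resolve();

// Handlers await this before touching process.env or the SDK lock.
function awaitHydration(): Promise<void> {
  return hydration;
}

async function bootHydration(work: () => Promise<void>): Promise<void> {
  // The resolver lives in this call's scope, so a second call installs its
  // own promise and resolver without overwriting the one captured here.
  let resolve: () => void = () => {};
  hydration = new Promise<void>((r) => {
    resolve = r;
  });
  try {
    await work();
  } finally {
    // Release waiters even if hydration throws.
    resolve();
  }
}
```

A handler would start with `await awaitHydration()` and only then read the request body into process.env, which matches the ordering fix described for /refresh-token.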
… getMayorPrewarmContext
getMayorPrewarmContext now returns { agentId } even when the kilocode
token is unavailable (instead of null), so the worker route no longer
needs to fall through to getMayorAgentId. This eliminates the redundant
agents.listAgents SQL query over a second RPC hop.
…igId routes The per-route tagging middleware registered prefixes under /api/orgs/:orgId/... but missed the parallel /api/users/:userId/rigs/:rigId family. Without this, requests to those routes lack rigId in structured log tags.
Review observation dispositions
Observation A: "Request/response logging removed without replacement"
Intentional: PR #3158 deletes those manual log lines because …
Observation C: "Double …
Three independent gastown reliability fixes that came out of debugging a
production town stuck in a dispatch loop with the mayor reporting
"GH_TOKEN is NOT set" while the bead it was working on cycled
in_progress -> open every few minutes:
1. dispatchAgent's catch path no longer transitions the agent to
'idle'. The exception path can fire after the container has already
accepted /agents/start (e.g. /refresh-token raced a token rotation
and threw late), in which case the SDK session is alive and
heartbeating. Marking the agent idle was tripping reconcileAgents'
"idle agent hooked to live bead" rule, which tore the hook out
from under the running session and made every subsequent
gt_request_changes / gt_triage_resolve call fail with
"is not hooked to a bead" until the session exited. Now we leave
the agent as 'working' and let the heartbeat-staleness check do
the cleanup if the agent really did die. Also surface
getLastStartError() in the dispatch_failed analytics event so
future debugging doesn't have to guess at the root cause.
2. reconcileAgents' "idle agent hooked to live bead" rule now skips
agents whose last_activity_at is fresh (<90s). That 90s window
matches reconcileAgents' own stale-heartbeat threshold, so a truly
dead agent still gets reaped on a later tick — but a live SDK
session caught mid-action by a phantom dispatch_failed keeps its
hook.
3. review-queue's unhooked-refinery gt_done fallback no longer
silently marks the MR as 'merged' when there's no pr_url. The
same hook-leak race that fires while the refinery is mid-review
would silently land a PR the refinery was actively trying to
reject (after gt_request_changes hit "not hooked"). The fallback
now fails the review (returning the source bead to 'open') and
raises an escalation so a human can decide whether the refinery
meant to approve or request changes.
Also fixes the "mayor has no GH_TOKEN" symptom by plumbing GitHub
auth into the prewarm path:
- getMayorPrewarmContext now resolves the GitHub token via the
standard chain (github_cli_pat / platform integration / stored
fallback) and returns it alongside the kilocode token.
- buildPrewarmEnv (container) sets GH_TOKEN, GIT_TOKEN, GITHUB_TOKEN,
and GITHUB_CLI_PAT on the prewarmed SDK's process.env so the
mayor's bash tool sees credentials from boot. Previously these
were only set by buildAgentEnv on /agents/start, which never runs
while ensureMayor short-circuits on a warm session.
- PERSIST_ENV_KEYS now includes GH_TOKEN/GIT_TOKEN/GITHUB_TOKEN/
GITHUB_CLI_PAT so a token rotation arriving via /agents/start
cache-hit reaches process.env.
Includes prior staged work: hasActiveWork refactor (short-circuit
SQL reads via thunks) and healthCheck alarm-cadence selection so
idle towns don't all wake up on the 5s cadence after a deploy.
Verification: pnpm --filter cloudflare-gastown {typecheck,lint,test}
all pass (251 tests). Behavior verified by code inspection against
the production AE event timeline for town
0532b9ef-04c8-488a-ac34-6eb1a3d47228 — every dispatch_failed in
that town hit the catch path, every cycling unhook hit the idle+
hooked+live-bead rule, and every "GH_TOKEN is NOT set" mayor
session was a prewarm cache-hit.
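The freshness guard from item 2 can be sketched as a predicate over an agent row. The `AgentRow` fields and the function name are hypothetical; only the 90s threshold and the "idle agent hooked to live bead" condition come from the commit message.

```typescript
const STALE_HEARTBEAT_MS = 90_000; // 90s, matching the stale-heartbeat threshold

// Hypothetical projection of the columns the rule needs.
interface AgentRow {
  status: "idle" | "working";
  hookedToLiveBead: boolean;
  lastActivityAtMs: number | null; // epoch millis of last heartbeat, if any
}

// True when the "idle agent hooked to live bead" rule should tear the hook out.
function shouldUnhookIdleAgent(agent: AgentRow, nowMs: number): boolean {
  if (agent.status !== "idle" || !agent.hookedToLiveBead) return false;
  // A fresh heartbeat means an SDK session may still be running; keep the hook
  // and let a later tick reap the agent if it really died.
  const fresh =
    agent.lastActivityAtMs !== null &&
    nowMs - agent.lastActivityAtMs < STALE_HEARTBEAT_MS;
  return !fresh;
}
```

Because the window equals the stale-heartbeat threshold, a dead agent is delayed by at most one window before it is reaped.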
jeanduplessis
left a comment
Found 4 issues in the final combined diff:
- 1 critical auth/env leakage risk in container SDK env persistence.
- 2 warnings around hydration gating and GitLab-specific PR polling errors.
- 1 test-quality suggestion where counter behavior is documented but not executed.
I verified the current unresolved review-thread state (none unresolved) and checked the previously resolved/outdated threads against the final diff.
Summary
Promotes 13 commits from gastown-staging to main. Six independent fix/feature groups plus follow-on hardening:
- Unblocks /agents/start during container boot hydration and preserves mayor tools on prewarm.
- … GH_TOKEN: fixes a cascade where phantom dispatch failures tore live SDK hooks out from under running agents (causing gt_request_changes 500s, silent merges, and triage role-mismatch escalations) and the mayor's prewarmed SDK shipped without GH_TOKEN.
- … /refresh-token, bootHydration resolve scoping, mayor-id RPC consolidation, request-tag middleware coverage for nested :rigId routes, sample-rate tuning, and formatting.
- TownContainerDO.max_instances was lowered from 800 → 500 as part of commit 1.
Constituent commits
1. Boot hydration + mayor prewarm fix (2ffcef28f, direct push)
Three independent fixes for the startAgentInContainer timeout regression observed after #2974, plus a tighter container-instance cap.
Symptoms. Production logs were filling with two error patterns since the last gastown-staging → main promotion:
- "TimeoutError: aborted due to timeout"
- "timeout after 6000ms: ensureSDKServer for <agentId>"
Root cause. The control server starts accepting requests immediately at boot (main.ts:83), while bootHydration() runs concurrently and serialises every registry agent + the new mayor prewarm through the global sdkServerLock (createKilo reads process.cwd()/process.env). Fresh /agents/start, /refresh-token, and PATCH /agents/:id/model requests queued behind that work and the DO-side AbortSignal.timeout(60s) (resp. REFRESH_AGENT_TIMEOUT_MS=6_000) fired before they ever got the lock.
The mayor prewarm added in #3122 made things worse on two axes:
- buildPrewarmEnv built KILO_CONFIG_CONTENT from hardcoded model defaults, so the real /agents/start with the user's actual model triggered ensureSDKServer's "config mismatch, evicting prewarmed server" path on every warm restart, doubling lock-holding time on the critical path the prewarm was supposed to speed up.
- buildPrewarmEnv omitted GASTOWN_AGENT_ROLE, GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID from the prewarm env. kilo serve snapshots process.env at spawn, and plugin/index.ts:66 keys mayor-tool registration off GASTOWN_AGENT_ROLE === 'mayor'. Without those, the prewarmed server booted with no mayor tools, and the cache hit on the next /agents/start handed that defective instance back to the user, manifesting as "mayor tools became unavailable."
Changes
1. Hydration gate (control-server.ts, process-manager.ts)
New awaitHydration() exported from process-manager.ts: a promise that bootHydration replaces on entry and resolves in a finally. Awaited at the top of /agents/start, /refresh-token, and PATCH /agents/:id/model (before any process.env mutation in the model PATCH path so concurrent requests can't race on env writes before holding the SDK lock). Default-resolved at module init so test/dev contexts that never run hydration aren't blocked.
2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts, process-manager.ts)
New getMayorPrewarmContext() on TownDO returns { agentId, model, smallModel, kilocodeToken, organizationId } resolved the same way _ensureMayor resolves them (config.resolveModel(townConfig, null, 'mayor')). The /api/towns/:townId/mayor-id endpoint now returns that whole context so the container builds a KILO_CONFIG_CONTENT byte-identical to what the next /agents/start will send. Falls back to the bare { agentId } shape for back-compat; the container skips prewarm when model/token aren't available rather than building a config that's guaranteed to mismatch.
3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)
- Exported ensureMayorWorkspaceForTown(townId) so prewarmMayorSDK materialises the workspace before ensureSDKServer's process.chdir (was throwing ENOENT on cold containers).
- buildPrewarmEnv now mirrors the mayor-shaped subset of buildAgentEnv: GASTOWN_AGENT_ID, GASTOWN_AGENT_ROLE='mayor', GASTOWN_TOWN_ID, KILOCODE_FEATURE='gastown', KILO_TEST_HOME, XDG_DATA_HOME. New end-to-end test intercepts createKilo and asserts those keys are visible to the spawn.
4. wrangler.jsonc
Lowered TownContainerDO.max_instances from 800 → 500 (manual change).
2. Remove manual request logging middleware (#3158, a6cf1029b)
Removes the redundant request-logging middleware in gastown.worker.ts that logged every request twice (--> / <-- via logger.info), which is already covered by the per-route instrumented(c, route, handler) AE event wrapper. Replaces the regex-based logger.setTags block with proper per-route tagging using Hono c.req.param() matching for :orgId / :townId / :rigId / :agentId prefixes. Net diff: ~30 deletions + ~25 additions.
Link: #3158
3. Convoy debug endpoints + E2E test procedure (7f9121ffa, direct push)
Adds three dev-only debug endpoints for autonomous convoy testing without going through the mayor LLM:
- GET /debug/towns/:townId/rigs: list rigs in a town
- POST /debug/towns/:townId/sling-convoy: call Town.slingConvoy() directly
- GET /debug/towns/:townId/convoys: list active convoys with progress
Documents the new endpoints and adds a Test C section to services/gastown/docs/e2e-pr-feedback-testing.md with a deterministic procedure for verifying review-then-land convoys end-to-end (sub-bead PRs into the convoy feature branch, then a landing PR into main). Captures known issues observed during verification: container MTU/TLS handshake failures with github.com, 'failed' blockers not gating dependents, and intermittent polecat skipping of sub-PR creation.
4. Fresh integration tokens for GitHub auth (ce15a6fe7, direct push)
resolveGitHubToken previously preferred git_auth.github_token over the platform integration. Since GitHub App installation tokens have a 1h TTL but git_auth.github_token is only written at rig creation (or rare manual refresh), every long-lived town with an integration was handing out an expired token to:
- Polecat/refinery gh CLI (via GH_TOKEN derived from GIT_TOKEN in the container), surfacing as "Failed to log in to github.com using token (GH_TOKEN). The token in GH_TOKEN is invalid."
- The worker-side PR poller (checkPRStatus, checkPRFeedback, mergePR, areThreadsBlocking): 401 from api.github.com.
- The /refresh-git-token endpoint the container falls back to on auth failure: it returned the same expired token, so the retry just re-failed.
Fix flips priority to github_cli_pat → live integration → stored github_token (last-resort fallback for towns with no integration). Empty-string responses from the integration service now warn and fall back instead of silently failing. Resolves a fresh token at agent dispatch (startAgentInContainer), merge dispatch (startMergeInContainer), and rig setup (setupRigRepoInContainer) before stuffing GIT_TOKEN into envVars. buildContainerConfig now resolves a fresh token before serializing git_auth.github_token into the X-Town-Config header. Adds 6 unit tests covering the priority chain.
5. Distinguish null causes in PR status polling (#3160, 63873e425)
Fixes #3149.
Replace PRStatusResult | null return type with discriminated PRStatusOutcome union in checkPRStatus. Each null cause (no token, HTTP error, invalid response, unrecognized URL, host mismatch) now surfaces a structured PRStatusError with actionable failure messages.
Key changes:
- resolveGitHubToken returns GitHubTokenResolution with resolution chain tracking which sources were tried (back-compat helper resolveGitHubTokenString exists for non-error-aware callers).
- no_token and non-transient HTTP errors (401/403/404) fail the bead immediately (1 strike).
- invalid_response / unrecognized_url / host_mismatch fail after 3 strikes.
- Separate poll_transient_count and poll_non_transient_count counters (replacing the cross-contaminated single poll_null_count); both reset on successful poll.
- failureKind persisted to bead metadata for analytics.
- AE event pr.poll_failed emitted on terminal failure.
- resolveGitHubToken tracks the configured integration source even when GIT_TOKEN_SERVICE binding is missing.
Link: #3160
6. Hook-leak race + mayor GH_TOKEN plumb-through (cbbd120cf, direct push)
Three independent gastown reliability fixes that came out of debugging a production town stuck in a dispatch loop with the mayor reporting "GH_TOKEN is NOT set" while the bead it was working on cycled in_progress → open every few minutes (town_id=0532b9ef-04c8-488a-ac34-6eb1a3d47228).
6a. dispatchAgent catch path no longer transitions agent to idle
The catch in scheduling.ts:dispatchAgent was running an UPDATE agent_metadata SET status='idle' after any thrown error from startAgentInContainer. The exception path can fire after the container has already accepted /agents/start (e.g. /refresh-token raced a token rotation and threw late), in which case the SDK session is alive and heartbeating. Marking the agent idle was tripping reconcileAgents' "idle agent hooked to live bead" rule, which tore the hook out from under the running session and made every subsequent gt_request_changes / gt_triage_resolve call fail with "is not hooked to a bead" until the session exited. Now we leave the agent as working and let the heartbeat-staleness check (90s) do the cleanup if the agent really did die. Also surfaces getLastStartError() in the dispatch_failed analytics event so future debugging doesn't have to guess at the root cause.
6b. reconcileAgents skips fresh-heartbeat agents when reaping idle+hooked
reconciler.ts:reconcileAgents "idle agent hooked to live bead" rule now skips agents whose last_activity_at is fresh (<90s). The 90s window matches the same module's stale-heartbeat threshold, so a truly dead agent still gets reaped on a later tick, but a live SDK session caught mid-action by a phantom dispatch_failed keeps its hook.
6c. review-queue unhooked-refinery gt_done no longer silently merges
review-queue.ts:agentDone previously fell back to marking the MR as merged when a refinery called gt_done with no pr_url after losing its hook. That same hook-leak race fires while the refinery is mid-review and trying to call gt_request_changes, so the silent-merge path was landing PRs the refinery was actively trying to reject (after gt_request_changes repeatedly hit "not hooked"). The fallback now fails the review (returning the source bead to open) and raises an escalation so a human can decide whether the refinery meant to approve or request changes.
6d. Mayor prewarm propagates GitHub auth
getMayorPrewarmContext now resolves the GitHub token via the standard chain (github_cli_pat / platform integration / stored fallback) and returns githubToken + githubCliPat alongside the kilocode token. The container's buildPrewarmEnv (process-manager.ts) sets GH_TOKEN, GIT_TOKEN, GITHUB_TOKEN, and GITHUB_CLI_PAT on the prewarmed SDK's process.env so the mayor's bash tool sees credentials from boot. Previously these were only set by buildAgentEnv on /agents/start, which never runs while ensureMayor short-circuits on a warm session; the mayor's gh auth status reported "not logged in" until the SDK was torn down and rebuilt. PERSIST_ENV_KEYS now includes GH_TOKEN / GIT_TOKEN / GITHUB_TOKEN / GITHUB_CLI_PAT so a token rotation arriving via /agents/start cache-hit reaches process.env (without these, only KILO_CONFIG_CONTENT / OPENCODE_CONFIG_CONTENT / GASTOWN_ORGANIZATION_ID were refreshed on cache-hit, leaving git auth pinned to whatever was set at prewarm time).
Also includes the prior staged-but-unrelated work that was already in the working tree: hasActiveWork thunk-based short-circuit refactor (avoids 4 SQL reads per check on hot towns) and healthCheck re-arms the alarm with the right cadence (idle vs active) so post-deploy health pings don't pin every idle town to the 5s fast loop.
7. Follow-on hardening (multiple commits)
A run of small fixes/refactors that landed on top of the changes above:
- ff992a383 fix(gastown/container): move GASTOWN_CONTAINER_TOKEN assignment after awaitHydration. Order-of-ops fix in /refresh-token: the local newToken is captured before the await awaitHydration(), then process.env.GASTOWN_CONTAINER_TOKEN is assigned only after the hydration gate clears. Without this, a mid-hydration token refresh could hand the prewarm a different token than what hydration captured, racing the SDK lock.
- acdec1028 refactor(gastown/container): capture bootHydration resolve locally. Moves the resolveHydration capture from a module global to a local in bootHydration so a future second call (periodic re-hydration, etc.) can't interleave with an in-flight one.
- c0c17ee67 refactor(gastown): eliminate double RPC in /mayor-id. Folds getMayorAgentId into getMayorPrewarmContext so the worker's /api/towns/:townId/mayor-id endpoint makes a single DO RPC instead of two.
- eb0525ffc fix(gastown): add rigId tag middleware for /api/users/:userId/rigs/:rigId routes. The per-route tagging block added in #3158 missed this prefix; rig-scoped logs from the user-API surface had no rigId tag.
- c65fbc226 fix(gastown): correct misleading migration comment in non-transient poll counter. Comment-only, drift from #3160 review.
- 60224676d chore(gastown): apply oxfmt formatting. Formatter pass.
- 3f19681b9 chore(gastown): Lower trace sample rate. Sentry trace sample rate tuning to reduce production sampling cost.
Verification
- … bootHydration is in flight, release when it returns).
- … bootHydration with a /mayor-id fetch mock, intercepts createKilo, asserts GASTOWN_AGENT_ID, GASTOWN_AGENT_ROLE='mayor', GASTOWN_TOWN_ID, GASTOWN_CONTAINER_TOKEN, and a non-empty KILO_CONFIG_CONTENT are all visible at spawn time).
- … _ensureMayor model-resolution path to confirm resolveModel(townConfig, null, 'mayor') is byte-identical to what /agents/start will send (mayor role ignores rigOverride entirely in config.resolveModel).
- Behavior verified by code inspection against the production AE event timeline for town 0532b9ef-04c8-488a-ac34-6eb1a3d47228: every dispatch_failed in that town hit the catch path, every cycling unhook hit the idle+hooked+live-bead rule, and every "GH_TOKEN is NOT set" mayor session was a prewarm cache-hit. Manual production verification of the actual fix is deferred to post-deploy AE/Sentry monitoring.
- … mayor.ensure_decision: short_circuit_warm and agent.startup_phase after merge.
Visual Changes
N/A
Reviewer Notes
- /api/towns/:townId/mayor-id response shape is back-compat: the container's Zod schema (MayorPrewarmResponse) accepts both the new full-context shape (now including githubToken / githubCliPat) and the legacy { agentId } shape with .passthrough(), and rolls back to "skip prewarm" on missing fields.
- organizationId fallback chain in buildPrewarmEnv distinguishes undefined (older worker, fall back to process.env) from null (worker authoritatively says "no org") so a stale env-var value can't override an authoritative null.
- dispatchAgent catch-path change deliberately leaves the agent as working even when startAgentInContainer throws. The comment at the catch site explains the rationale and points at the symmetric stale-heartbeat reaper. If you read the diff and worry about leaked working rows, the heartbeat path in reconcileAgents (line ~723, 90s threshold) is the safety net.
- review-queue.agentDone is a behavior change: previously, an unhooked refinery calling gt_done without a pr_url would mark the MR merged; now it fails the review and raises an escalation. This is intentional, since the previous behavior could land PRs the refinery wanted rejected, but it means towns currently relying on the silent-merge path will see a new escalation surface until the underlying hook-leak race is gone (which the same commit fixes upstream).
- Adds GH_TOKEN / GIT_TOKEN / GITHUB_TOKEN / GITHUB_CLI_PAT to PERSIST_ENV_KEYS. This is what enables hot token rotation to reach process.env even when /agents/start short-circuits to a cached SDK server. Previously the cache-hit path only refreshed KILO_CONFIG_CONTENT / OPENCODE_CONFIG_CONTENT / GASTOWN_ORGANIZATION_ID.
- … (a) prewarmMayorSDK warns but doesn't bail on workdir-mismatch (cheap to harden later), (b) one negative-case timing assertion in the new test relies on a 10ms setTimeout (test still validates the positive case deterministically).
- refresh-git-token.handler.ts change is a caller update for the new GitHubTokenResolution return type (was string | null).
- wrangler.jsonc max_instances change (800→500) is from the boot hydration commit (2ffcef28f).
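The undefined-versus-null distinction for organizationId in the reviewer notes can be illustrated with a short sketch. The function name and parameters are hypothetical; the only behaviour taken from the notes is that undefined means "older worker, fall back to the environment" while null is an authoritative "no org".

```typescript
// fromWorker: value of organizationId in the /mayor-id response, if present.
// fromEnv: whatever the container already has in its environment.
function resolveOrganizationId(
  fromWorker: string | null | undefined,
  fromEnv: string | undefined,
): string | null {
  // Field absent: the worker predates the richer response, so the env value
  // is the best information available.
  if (fromWorker === undefined) return fromEnv ?? null;
  // Field present: trust it, including an explicit null, so a stale env
  // value cannot override "this town has no organization".
  return fromWorker;
}
```

Using `??` on the worker value would collapse null into the env fallback, which is exactly the stale-override case the note warns about.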