release: gastown-staging -> main #3151

Open
jrf0110 wants to merge 14 commits into main from gastown-staging
@jrf0110
Contributor

@jrf0110 jrf0110 commented May 9, 2026

Summary

Promotes 13 commits from gastown-staging to main. Six independent fix/feature groups plus follow-on hardening:

  1. Boot-hydration timeout fix — unblocks /agents/start during container boot hydration and preserves mayor tools on prewarm.
  2. GitHub auth correctness — fresh integration tokens instead of stale stored value, plus distinguished failure messages when no token resolves.
  3. Logging hygiene — redundant request-logging middleware replaced with per-route Hono-param tagging.
  4. Developer tooling — dev-only convoy debug endpoints and a deterministic review-then-land E2E test procedure.
  5. Hook-leak race + mayor GH_TOKEN — fixes a cascade where phantom dispatch failures tore live SDK hooks out from under running agents (causing gt_request_changes 500s, silent merges, and triage role-mismatch escalations) and the mayor's prewarmed SDK shipped without GH_TOKEN.
  6. Follow-on hardening — token-rotation ordering in /refresh-token, bootHydration resolve scoping, mayor-id RPC consolidation, request-tag middleware coverage for nested :rigId routes, sample-rate tuning, and formatting.

TownContainerDO.max_instances was lowered from 800 → 500 as part of commit 1.

Constituent commits

1. Boot hydration + mayor prewarm fix (2ffcef28f, direct push)

Three independent fixes for the startAgentInContainer timeout regression observed after #2974, plus a tighter container-instance cap.

Symptoms. Production logs have been filling with two error patterns since the last gastown-staging → main promotion:

[<DOMAIN>] startAgentInContainer: EXCEPTION for agent <UUID>: TimeoutError: The operation was aborted due to timeout
timeout after 6000ms: ensureSDKServer for <agentId>

Root cause. The control server starts accepting requests immediately at boot (main.ts:83), while bootHydration() runs concurrently and serialises every registry agent + the new mayor prewarm through the global sdkServerLock (createKilo reads process.cwd()/process.env). Fresh /agents/start, /refresh-token, and PATCH /agents/:id/model requests queued behind that work and the DO-side AbortSignal.timeout(60s) (resp. REFRESH_AGENT_TIMEOUT_MS=6_000) fired before they ever got the lock.

The mayor prewarm added in #3122 made things worse on two axes:

  1. It built KILO_CONFIG_CONTENT from hardcoded model defaults, so the real /agents/start with the user's actual model triggered ensureSDKServer's "config mismatch — evicting prewarmed server" path on every warm restart, doubling lock-holding time on the critical path the prewarm was supposed to speed up.
  2. It was missing GASTOWN_AGENT_ROLE, GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID from the prewarm env. kilo serve snapshots process.env at spawn, and plugin/index.ts:66 keys mayor-tool registration off GASTOWN_AGENT_ROLE === 'mayor'. Without those, the prewarmed server booted with no mayor tools, and the cache hit on the next /agents/start handed that defective instance back to the user — manifesting as "mayor tools became unavailable."

Changes

1. Hydration gate (control-server.ts, process-manager.ts)

New awaitHydration() exported from process-manager.ts: a promise that bootHydration replaces on entry and resolves in a finally. Awaited at the top of /agents/start, /refresh-token, and PATCH /agents/:id/model (before any process.env mutation in the model PATCH path so concurrent requests can't race on env writes before holding the SDK lock). Default-resolved at module init so test/dev contexts that never run hydration aren't blocked.
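
The gate described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in process-manager.ts: only the names awaitHydration and bootHydration come from the description, while the `work` callback parameter and both function bodies are hypothetical.

```typescript
// Hypothetical sketch of a hydration gate. Only the two function names come
// from the PR description; everything else is assumed for illustration.

// Default-resolved so contexts that never run hydration are not blocked.
let hydrationPromise: Promise<void> = Promise.resolve();

// Handlers await this before mutating process.env or taking the SDK lock.
function awaitHydration(): Promise<void> {
  return hydrationPromise;
}

// `work` stands in for the real hydration body (an assumption).
async function bootHydration(work: () => Promise<void>): Promise<void> {
  // The resolver is captured in a local so overlapping calls cannot share it.
  let resolveHydration: () => void = () => {};
  hydrationPromise = new Promise<void>((resolve) => {
    resolveHydration = resolve;
  });
  try {
    await work();
  } finally {
    // Release waiters even when hydration throws.
    resolveHydration();
  }
}
```

A handler would call `await awaitHydration()` as its first statement, so requests arriving mid-boot wait for hydration instead of queueing on the SDK lock behind it.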

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts, process-manager.ts)

New getMayorPrewarmContext() on TownDO returns { agentId, model, smallModel, kilocodeToken, organizationId } resolved the same way _ensureMayor resolves them (config.resolveModel(townConfig, null, 'mayor')). The /api/towns/:townId/mayor-id endpoint now returns that whole context so the container builds a KILO_CONFIG_CONTENT byte-identical to what the next /agents/start will send. Falls back to the bare { agentId } shape for back-compat; the container skips prewarm when model/token aren't available rather than building a config that's guaranteed to mismatch.

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)
  • Exported ensureMayorWorkspaceForTown(townId) so prewarmMayorSDK materialises the workspace before ensureSDKServer's process.chdir (was throwing ENOENT on cold containers).
  • buildPrewarmEnv now mirrors the mayor-shaped subset of buildAgentEnv: GASTOWN_AGENT_ID, GASTOWN_AGENT_ROLE='mayor', GASTOWN_TOWN_ID, KILOCODE_FEATURE='gastown', KILO_TEST_HOME, XDG_DATA_HOME. New end-to-end test intercepts createKilo and asserts those keys are visible to the spawn.

4. wrangler.jsonc

Lowered TownContainerDO.max_instances from 800 → 500 (manual change).

2. Remove manual request logging middleware (#3158, a6cf1029b)

Removes the redundant request-logging middleware in gastown.worker.ts that logged every request twice (-->/<-- via logger.info) — already covered by the per-route instrumented(c, route, handler) AE event wrapper. Replaces the regex-based logger.setTags block with proper per-route tagging using Hono c.req.param() matching for :orgId / :townId / :rigId / :agentId prefixes. Net diff: ~30 deletions + ~25 additions.

Link: #3158

3. Convoy debug endpoints + E2E test procedure (7f9121ffa, direct push)

Adds three dev-only debug endpoints for autonomous convoy testing without going through the mayor LLM:

  • GET /debug/towns/:townId/rigs — list rigs in a town
  • POST /debug/towns/:townId/sling-convoy — call Town.slingConvoy() directly
  • GET /debug/towns/:townId/convoys — list active convoys with progress

Documents the new endpoints and adds a Test C section to services/gastown/docs/e2e-pr-feedback-testing.md with a deterministic procedure for verifying review-then-land convoys end-to-end (sub-bead PRs into the convoy feature branch, then a landing PR into main). Captures known issues observed during verification: container MTU/TLS handshake failures with github.com, 'failed' blockers not gating dependents, and intermittent polecat skipping of sub-PR creation.

4. Fresh integration tokens for GitHub auth (ce15a6fe7, direct push)

resolveGitHubToken previously preferred git_auth.github_token over the platform integration. Since GitHub App installation tokens have a 1h TTL but git_auth.github_token is only written at rig creation (or rare manual refresh), every long-lived town with an integration was handing out an expired token to:

  • Polecat/refinery gh CLI (via GH_TOKEN derived from GIT_TOKEN in the container), surfacing as "Failed to log in to github.com using token (GH_TOKEN). The token in GH_TOKEN is invalid."
  • The worker-side PR poller (checkPRStatus, checkPRFeedback, mergePR, areThreadsBlocking) — 401 from api.github.com.
  • The /refresh-git-token endpoint the container falls back to on auth failure — it returned the same expired token, so the retry just re-failed.

Fix flips priority to github_cli_pat → live integration → stored github_token (last-resort fallback for towns with no integration). Empty-string responses from the integration service now warn and fall back instead of silently failing. Resolves a fresh token at agent dispatch (startAgentInContainer), merge dispatch (startMergeInContainer), and rig setup (setupRigRepoInContainer) before stuffing GIT_TOKEN into envVars. buildContainerConfig now resolves a fresh token before serializing git_auth.github_token into the X-Town-Config header. Adds 6 unit tests covering the priority chain.
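
As a rough illustration of the new priority order, the chain might look like the sketch below. The type, the function name resolveTokenSketch, and the injected fetchIntegrationToken callback are all hypothetical; the real resolveGitHubToken signature is not shown in this description.

```typescript
// Hypothetical sketch of the priority chain:
// github_cli_pat -> live integration -> stored github_token.
type TokenSourcesSketch = {
  githubCliPat?: string | null;
  // Assumed stand-in for the call into the integration service.
  fetchIntegrationToken?: () => Promise<string | null>;
  storedGithubToken?: string | null;
};

type ResolvedTokenSketch = { token: string; source: string };

async function resolveTokenSketch(
  src: TokenSourcesSketch,
): Promise<ResolvedTokenSketch | null> {
  // 1. An explicit CLI PAT always wins.
  if (src.githubCliPat) {
    return { token: src.githubCliPat, source: "github_cli_pat" };
  }
  // 2. Ask the integration for a fresh token.
  if (src.fetchIntegrationToken) {
    try {
      const fresh = await src.fetchIntegrationToken();
      if (fresh) return { token: fresh, source: "integration" };
      // Empty string or null: warn and fall through instead of failing silently.
      console.warn("integration returned an empty token; falling back");
    } catch (err) {
      console.warn("integration lookup failed; falling back", err);
    }
  }
  // 3. Last resort: the stored token, which may be expired.
  if (src.storedGithubToken) {
    return { token: src.storedGithubToken, source: "stored_github_token" };
  }
  return null;
}
```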

5. Distinguish null causes in PR status polling (#3160, 63873e425)

Fixes #3149.

Replace PRStatusResult | null return type with discriminated PRStatusOutcome union in checkPRStatus. Each null cause (no token, HTTP error, invalid response, unrecognized URL, host mismatch) now surfaces a structured PRStatusError with actionable failure messages.

Key changes:

  • resolveGitHubToken returns GitHubTokenResolution with resolution chain tracking which sources were tried (back-compat helper resolveGitHubTokenString exists for non-error-aware callers).
  • no_token, unrecognized_url, host_mismatch, and non-transient HTTP errors (401/403/404) fail the bead immediately (1 strike).
  • invalid_response fails after 3 strikes.
  • Transient HTTP errors (5xx/429) keep existing 10-strike behavior.
  • Separate poll_transient_count and poll_non_transient_count counters replace the cross-contaminated single poll_null_count; both reset on a successful poll.
  • failureKind persisted to bead metadata for analytics.
  • AE event pr.poll_failed emitted on terminal failure.
  • resolveGitHubToken tracks the configured integration source even when GIT_TOKEN_SERVICE binding is missing.
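
The separate-counter behavior can be sketched as a pure state transition. This is an assumed shape, not the actual implementation: the type names, the nextCountersSketch function, and passing the strike limit in with each failure are all hypothetical. Per the commit message for the counter split, each category increments only its own counter and resets the other.

```typescript
// Hypothetical sketch of the two-counter poll state.
interface PollCountersSketch {
  transient: number;
  nonTransient: number;
}

type PollOutcomeSketch =
  | { kind: "success" }
  // strikeLimit is supplied by the caller, e.g. 10 for transient HTTP errors.
  | { kind: "failure"; transient: boolean; strikeLimit: number };

function nextCountersSketch(
  prev: PollCountersSketch,
  outcome: PollOutcomeSketch,
): { counters: PollCountersSketch; failBead: boolean } {
  if (outcome.kind === "success") {
    // A successful poll resets both counters.
    return { counters: { transient: 0, nonTransient: 0 }, failBead: false };
  }
  // Increment only the matching counter and reset the other one.
  const counters: PollCountersSketch = outcome.transient
    ? { transient: prev.transient + 1, nonTransient: 0 }
    : { transient: 0, nonTransient: prev.nonTransient + 1 };
  const strikes = outcome.transient ? counters.transient : counters.nonTransient;
  return { counters, failBead: strikes >= outcome.strikeLimit };
}
```

With this shape, nine transient errors followed by one invalid_response leave the non-transient counter at 1, so the bead is not failed early.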

Link: #3160

6. Hook-leak race + mayor GH_TOKEN plumb-through (cbbd120cf, direct push)

Three independent gastown reliability fixes that came out of debugging a production town stuck in a dispatch loop with the mayor reporting "GH_TOKEN is NOT set" while the bead it was working on cycled in_progress → open every few minutes (town_id=0532b9ef-04c8-488a-ac34-6eb1a3d47228).

6a. dispatchAgent catch path no longer transitions agent to idle

The catch in scheduling.ts:dispatchAgent was running an UPDATE agent_metadata SET status='idle' after any thrown error from startAgentInContainer. The exception path can fire after the container has already accepted /agents/start (e.g. /refresh-token raced a token rotation and threw late) — in which case the SDK session is alive and heartbeating. Marking the agent idle was tripping reconcileAgents' "idle agent hooked to live bead" rule, which tore the hook out from under the running session and made every subsequent gt_request_changes / gt_triage_resolve call fail with "is not hooked to a bead" until the session exited. Now we leave the agent as working and let the heartbeat-staleness check (90s) do the cleanup if the agent really did die. Also surfaces getLastStartError() in the dispatch_failed analytics event so future debugging doesn't have to guess at the root cause.

6b. reconcileAgents skips fresh-heartbeat agents when reaping idle+hooked

reconciler.ts:reconcileAgents "idle agent hooked to live bead" rule now skips agents whose last_activity_at is fresh (<90s). The 90s window matches the same module's stale-heartbeat threshold, so a truly dead agent still gets reaped on a later tick — but a live SDK session caught mid-action by a phantom dispatch_failed keeps its hook.
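
A minimal sketch of the freshness check, assuming a simplified agent row. The interface, field names, and the shouldUnhookIdleAgent helper are hypothetical; only the 90s threshold and the rule being guarded come from the description above.

```typescript
// Hypothetical sketch of the "idle agent hooked to live bead" guard.
const STALE_HEARTBEAT_MS_SKETCH = 90_000; // 90s, matching the stale-heartbeat threshold

interface AgentRowSketch {
  agentStatus: "idle" | "working";
  hookedBeadId: string | null;
  lastActivityAtMs: number | null; // epoch millis of the last heartbeat, if any
}

function shouldUnhookIdleAgent(agent: AgentRowSketch, nowMs: number): boolean {
  // The rule only applies to idle agents that still hold a hook.
  if (agent.agentStatus !== "idle" || agent.hookedBeadId === null) return false;
  // A fresh heartbeat means a live SDK session: keep its hook this tick.
  const isFresh =
    agent.lastActivityAtMs !== null &&
    nowMs - agent.lastActivityAtMs < STALE_HEARTBEAT_MS_SKETCH;
  return !isFresh;
}
```

An agent that really died stops heartbeating, so it crosses the 90s line and is reaped on a later tick.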

6c. review-queue unhooked-refinery gt_done no longer silently merges

review-queue.ts:agentDone previously fell back to marking the MR as merged when a refinery called gt_done with no pr_url after losing its hook. That same hook-leak race fires while the refinery is mid-review and trying to call gt_request_changes — so the silent-merge path was landing PRs the refinery was actively trying to reject (after gt_request_changes repeatedly hit "not hooked"). The fallback now fails the review (returning the source bead to open) and raises an escalation so a human can decide whether the refinery meant to approve or request changes.

6d. Mayor prewarm propagates GitHub auth

getMayorPrewarmContext now resolves the GitHub token via the standard chain (github_cli_pat / platform integration / stored fallback) and returns githubToken + githubCliPat alongside the kilocode token. The container's buildPrewarmEnv (process-manager.ts) sets GH_TOKEN, GIT_TOKEN, GITHUB_TOKEN, and GITHUB_CLI_PAT on the prewarmed SDK's process.env so the mayor's bash tool sees credentials from boot. Previously these were only set by buildAgentEnv on /agents/start, which never runs while ensureMayor short-circuits on a warm session — the mayor's gh auth status reported "not logged in" until the SDK was torn down and rebuilt.

PERSIST_ENV_KEYS now includes GH_TOKEN/GIT_TOKEN/GITHUB_TOKEN/GITHUB_CLI_PAT so a token rotation arriving via /agents/start cache-hit reaches process.env (without these, only KILO_CONFIG_CONTENT/OPENCODE_CONFIG_CONTENT/GASTOWN_ORGANIZATION_ID were refreshed on cache-hit, leaving git auth pinned to whatever was set at prewarm time).
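
The cache-hit refresh can be pictured as copying an allow-list of keys onto the live environment. The sketch below is an assumption about the general shape: the constant and function names are hypothetical, and leaving absent keys untouched is a simplification chosen here, not confirmed behavior.

```typescript
// Hypothetical sketch of refreshing persisted env keys on a cache hit.
const PERSIST_KEYS_SKETCH = [
  "KILO_CONFIG_CONTENT",
  "OPENCODE_CONFIG_CONTENT",
  "GASTOWN_ORGANIZATION_ID",
  // The git auth keys this commit adds to the list:
  "GH_TOKEN",
  "GIT_TOKEN",
  "GITHUB_TOKEN",
  "GITHUB_CLI_PAT",
] as const;

type EnvSketch = Record<string, string | undefined>;

function refreshEnvOnCacheHit(target: EnvSketch, incoming: EnvSketch): void {
  for (const key of PERSIST_KEYS_SKETCH) {
    const value = incoming[key];
    // Assumption: keys absent from the request are left as they were.
    if (value !== undefined) target[key] = value;
  }
}
```

Keys outside the allow-list are never copied, which is why git auth stayed pinned to its prewarm-time value before these four keys were added.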

Also includes the prior staged-but-unrelated work that was already in the working tree: hasActiveWork thunk-based short-circuit refactor (avoids 4 SQL reads per check on hot towns) and healthCheck re-arms the alarm with the right cadence (idle vs active) so post-deploy health pings don't pin every idle town to the 5s fast loop.

7. Follow-on hardening (multiple commits)

A run of small fixes/refactors that landed on top of the changes above:

  • ff992a383 fix(gastown/container): move GASTOWN_CONTAINER_TOKEN assignment after awaitHydration — order-of-ops fix in /refresh-token: the local newToken is captured before the await awaitHydration(), then process.env.GASTOWN_CONTAINER_TOKEN is assigned only after the hydration gate clears. Without this, a mid-hydration token refresh could hand the prewarm a different token than what hydration captured, racing the SDK lock.
  • acdec1028 refactor(gastown/container): capture bootHydration resolve locally — moves the resolveHydration capture from a module global to a local in bootHydration so a future second call (periodic re-hydration, etc.) can't interleave with an in-flight one.
  • c0c17ee67 refactor(gastown): eliminate double RPC in /mayor-id — folds getMayorAgentId into getMayorPrewarmContext so the worker's /api/towns/:townId/mayor-id endpoint makes a single DO RPC instead of two.
  • eb0525ffc fix(gastown): add rigId tag middleware for /api/users/:userId/rigs/:rigId routes — the per-route tagging block added in #3158 missed this prefix; rig-scoped logs from the user-API surface had no rigId tag.
  • c65fbc226 fix(gastown): correct misleading migration comment in non-transient poll counter — comment-only, drift from the #3160 review.
  • 60224676d chore(gastown): apply oxfmt formatting — formatter pass.
  • 3f19681b9 chore(gastown): Lower trace sample rate — Sentry trace sample rate tuning to reduce production sampling cost.

Verification

  • Unit-tested the hydration gate end-to-end with a fetch barrier (asserts awaiters block while bootHydration is in flight, release when it returns).
  • Unit-tested the prewarm env shape end-to-end (drives bootHydration with a /mayor-id fetch mock, intercepts createKilo, asserts GASTOWN_AGENT_ID, GASTOWN_AGENT_ROLE='mayor', GASTOWN_TOWN_ID, GASTOWN_CONTAINER_TOKEN, and a non-empty KILO_CONFIG_CONTENT are all visible at spawn time).
  • Reviewed the _ensureMayor model-resolution path to confirm resolveModel(townConfig, null, 'mayor') is byte-identical to what /agents/start will send (mayor role ignores rigOverride entirely in config.resolveModel).
  • For commit 6 (hook-leak / GH_TOKEN): behavior verified by code inspection against the production AE event timeline for town 0532b9ef-04c8-488a-ac34-6eb1a3d47228 — every dispatch_failed in that town hit the catch path, every cycling unhook hit the idle+hooked+live-bead rule, and every "GH_TOKEN is NOT set" mayor session was a prewarm cache-hit. Manual production verification of the actual fix is deferred to post-deploy AE/Sentry monitoring.
  • Manual production verification deferred for commit 1 — those changes target a hot path that's hard to reproduce locally; will monitor Sentry / AE mayor.ensure_decision: short_circuit_warm and agent.startup_phase after merge.

Visual Changes

N/A

Reviewer Notes

  • The /api/towns/:townId/mayor-id response shape is backward-compatible: the container's Zod schema (MayorPrewarmResponse) accepts both the new full-context shape (now including githubToken / githubCliPat) and the legacy { agentId } shape with .passthrough(), and falls back to "skip prewarm" on missing fields.
  • The organizationId fallback chain in buildPrewarmEnv distinguishes undefined (older worker, fall back to process.env) from null (worker authoritatively says "no org") so a stale env-var value can't override an authoritative null.
  • Commit 6's dispatchAgent catch-path change deliberately leaves the agent as working even when startAgentInContainer throws — the comment at the catch site explains the rationale and points at the symmetric stale-heartbeat reaper. If you read the diff and worry about leaked working rows, the heartbeat path in reconcileAgents (line ~723, 90s threshold) is the safety net.
  • Commit 6's silent-merge fix in review-queue.agentDone is a behavior change: previously, an unhooked refinery calling gt_done without a pr_url would mark the MR merged; now it fails the review and raises an escalation. This is intentional — the previous behavior could land PRs the refinery wanted rejected — but it means towns currently relying on the silent-merge path will see a new escalation surface until the underlying hook-leak race is gone (which the same commit fixes upstream).
  • Commit 6 also adds GH_TOKEN/GIT_TOKEN/GITHUB_TOKEN/GITHUB_CLI_PAT to PERSIST_ENV_KEYS. This is what enables hot token rotation to reach process.env even when /agents/start short-circuits to a cached SDK server. Previously the cache-hit path only refreshed KILO_CONFIG_CONTENT/OPENCODE_CONFIG_CONTENT/GASTOWN_ORGANIZATION_ID.
  • Two SUGGESTION-level findings deferred from earlier code review still apply: (a) prewarmMayorSDK warns but doesn't bail on workdir-mismatch (cheap to harden later), (b) one negative-case timing assertion in the new test relies on a 10ms setTimeout (test still validates the positive case deterministically).
  • The refresh-git-token.handler.ts change is a caller update for the new GitHubTokenResolution return type (was string | null).
  • The wrangler.jsonc max_instances change (800→500) is from the boot hydration commit (2ffcef28f).
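
The undefined-versus-null distinction in the organizationId note above can be illustrated with a small helper. This is a sketch only: the function name and parameters are hypothetical and do not reflect the actual buildPrewarmEnv signature.

```typescript
// Hypothetical sketch of the organizationId fallback.
// undefined = older worker that did not send the field, so fall back to env.
// null      = worker authoritatively says "no org", so env must not override.
function resolvePrewarmOrgIdSketch(
  fromWorker: string | null | undefined,
  envValue: string | undefined,
): string | null {
  if (fromWorker === undefined) {
    return envValue ?? null;
  }
  return fromWorker; // either a concrete org id or an authoritative null
}
```

The key property is that a stale GASTOWN_ORGANIZATION_ID left in the environment is ignored whenever the worker sends an explicit null.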

…ayor tools on prewarm

Three independent fixes for the startAgentInContainer timeout
regression introduced by #2974, plus a tighter container-instance cap.

1. Hydration gate (control-server.ts, process-manager.ts)
   The control server starts accepting requests immediately at boot,
   while bootHydration runs concurrently and serialises every registry
   agent + the mayor prewarm through the global sdkServerLock. Fresh
   /agents/start, /refresh-token, and PATCH /agents/:id/model requests
   queued behind that work and the DO-side AbortSignal.timeout(60s)
   fired before they ever got the lock — surfacing as
   "TimeoutError: aborted due to timeout" and "timeout after 6000ms:
   ensureSDKServer for <agentId>". A new awaitHydration() promise is
   awaited at the top of those handlers (before any process.env
   mutation in the model PATCH path) so they don't compound the queue.

2. Prewarm config matches /agents/start (Town.do.ts, gastown.worker.ts,
   process-manager.ts)
   buildPrewarmEnv was constructing KILO_CONFIG_CONTENT from hardcoded
   defaults (anthropic/claude-sonnet-4.6 / claude-haiku-4.5), so the
   real /agents/start with the user's actual model triggered
   ensureSDKServer's "config mismatch, evicting prewarmed server" path
   on every warm restart — doubling lock-holding time on the critical
   path the prewarm was supposed to speed up. The /api/towns/:id/mayor-id
   endpoint now returns the full prewarm context (model, smallModel,
   kilocodeToken, organizationId) resolved the same way _ensureMayor
   resolves it, and the container builds the prewarm KILO_CONFIG_CONTENT
   to match. Falls back gracefully to a skip when the worker hasn't
   deployed the richer endpoint yet.

3. Mayor workdir + plugin env (agent-runner.ts, process-manager.ts)
   prewarmMayorSDK called mayorWorkdirForTown (which only returns a
   string) and went straight to ensureSDKServer's process.chdir,
   throwing ENOENT on cold containers because createMayorWorkspace
   only ran from runAgent. Exported ensureMayorWorkspaceForTown so
   prewarm materialises the workspace first.

   More critically, buildPrewarmEnv was missing GASTOWN_AGENT_ROLE,
   GASTOWN_AGENT_ID, and GASTOWN_TOWN_ID — env vars the kilo serve
   plugin (plugin/index.ts) reads at spawn to decide whether to
   register mayor tools. Without them the prewarmed server booted with
   NO mayor tools, and the cache hit on the next /agents/start handed
   that defective instance back to the user. Now mirrors the mayor-
   shaped subset of buildAgentEnv. Added an end-to-end test that
   intercepts createKilo and asserts the env at spawn time.

4. wrangler.jsonc: lower TownContainerDO max_instances from 800 to 500.

Verified with pnpm --filter gastown-container test (67/67 pass),
pnpm --filter cloudflare-gastown typecheck, oxlint, and pnpm format.
@kilo-code-bot
Contributor

kilo-code-bot Bot commented May 9, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

✅ All Previously Flagged Issues Resolved
| File | Issue | Status |
| --- | --- | --- |
| services/gastown/container/src/process-manager.ts | CRITICAL: Git auth keys in PERSIST_ENV_KEYS could leak stale tokens into fresh SDK starts | ✅ Fixed in e16b8b643 — split into CACHE_HIT_ENV_KEYS + applyCacheHitEnv which sets or revokes git auth on cache-hit |
| services/gastown/container/src/control-server.ts | WARNING: Boot hydration registry fetch was unbounded, could block /agents/start indefinitely | ✅ Fixed in e16b8b643 — added AbortSignal.timeout(10_000) |
| services/gastown/src/dos/town/actions.ts | WARNING: GitLab polling failures surfaced with GitHub-specific messages | ✅ Fixed in e16b8b643: failureMessageFor now uses providerLabel() and provider-specific 403 hints |
| services/gastown/test/unit/pr-poll-thresholds.test.ts | SUGGESTION: Counter cross-contamination tests did not exercise actual counter logic | ✅ Fixed in e16b8b643 — extracted nextPollCounterState pure function and updated tests to assert on state transitions |
| services/gastown/container/src/control-server.ts | process.env.GASTOWN_CONTAINER_TOKEN mutated before awaitHydration() | ✅ Fixed |
| services/gastown/container/src/process-manager.ts | _resolveHydration module-global stale-capture could orphan resolver | ✅ Fixed |
| services/gastown/src/gastown.worker.ts | Double RPC call to TownDO in mayor-id endpoint | ✅ Fixed |
| services/gastown/src/gastown.worker.ts | rigId not tagged for /api/users/:userId/rigs/:rigId routes | ✅ Fixed |
| services/gastown/src/dos/town/actions.ts | Misleading migration comment in poll_non_transient_count branch | ✅ Fixed |

Files Reviewed (all commits)
  • services/gastown/container/src/agent-runner.ts
  • services/gastown/container/src/control-server.ts
  • services/gastown/container/src/process-manager.ts: PERSIST_ENV_KEYS / CACHE_HIT_ENV_KEYS split; applyCacheHitEnv revokes stale git auth on cache-hit; 10s timeout on boot registry fetch; no issues
  • services/gastown/container/src/process-manager.test.ts
  • services/gastown/docs/e2e-pr-feedback-testing.md
  • services/gastown/src/dos/Town.do.ts
  • services/gastown/src/dos/town/actions.ts: providerLabel() helper; provider-specific failure messages; nextPollCounterState pure function extracted and exported; no issues
  • services/gastown/src/dos/town/config.ts
  • services/gastown/src/dos/town/container-dispatch.ts
  • services/gastown/src/dos/town/reconciler.ts
  • services/gastown/src/dos/town/review-queue.ts
  • services/gastown/src/dos/town/scheduling.ts
  • services/gastown/src/dos/town/town-scm.ts
  • services/gastown/src/gastown.worker.ts
  • services/gastown/src/handlers/refresh-git-token.handler.ts
  • services/gastown/test/integration/pr-poll-errors.test.ts
  • services/gastown/test/unit/pr-poll-errors.test.ts
  • services/gastown/test/unit/pr-poll-thresholds.test.ts — counter tests now exercise nextPollCounterState directly with state assertions; no issues
  • services/gastown/test/unit/town-scm.test.ts
  • services/gastown/wrangler.jsonc

Reviewed by claude-4.6-sonnet-20260217 · 464,708 tokens

jrf0110 and others added 4 commits May 10, 2026 18:00
* chore(gastown): remove manual request logging middleware

* fix(gastown): unblock /agents/start during boot hydration; preserve mayor tools on prewarm

* feat(gastown): per-route logger tagging via Hono params (review on #3158)

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
…st procedure

Adds three dev-only debug endpoints for autonomous convoy testing without
going through the mayor LLM:

- GET  /debug/towns/:townId/rigs         — list rigs in a town
- POST /debug/towns/:townId/sling-convoy — call Town.slingConvoy() directly
- GET  /debug/towns/:townId/convoys      — list active convoys with progress

Documents the new endpoints and adds a Test C section to
e2e-pr-feedback-testing.md with a deterministic procedure for verifying
review-then-land convoys end-to-end (sub-bead PRs into the convoy feature
branch, then a landing PR into main). Also captures known issues observed
during verification: container MTU/TLS handshake failures with github.com,
'failed' blockers not gating dependents, and intermittent polecat skipping
of sub-PR creation.
… stale stored value

resolveGitHubToken previously preferred git_auth.github_token over the
platform integration. Since GitHub App installation tokens have a 1h
TTL but git_auth.github_token is only written at rig creation (or rare
manual refresh), every long-lived town with an integration was handing
out an expired token to:

- Polecat/refinery 'gh' CLI (via GH_TOKEN derived from GIT_TOKEN in
  the container), surfacing as 'Failed to log in to github.com using
  token (GH_TOKEN). The token in GH_TOKEN is invalid.'
- The worker-side PR poller (checkPRStatus, checkPRFeedback, mergePR,
  areThreadsBlocking) — 401 from api.github.com.
- The /refresh-git-token endpoint the container falls back to on auth
  failure — it returned the same expired token, so the retry just
  re-failed.

Verified by hitting api.github.com with a local town's stored token:
401 even though the integration service mints fresh ones fine.

Fix:
- Flip resolveGitHubToken's priority to github_cli_pat -> live
  integration -> stored github_token (last-resort fallback for towns
  with no integration). Empty-string responses from the integration
  service now warn and fall back instead of silently failing.
- Resolve a fresh token at agent dispatch (startAgentInContainer),
  merge dispatch (startMergeInContainer), and rig setup
  (setupRigRepoInContainer) before stuffing GIT_TOKEN into envVars.
- buildContainerConfig now resolves a fresh token before serializing
  git_auth.github_token into the X-Town-Config header — the container's
  syncTownConfigToProcessEnv path reads this on every request to update
  process.env.GIT_TOKEN, which buildLiveHotSwapEnv then derives GH_TOKEN
  from on token-refresh hot-swaps. townId is required (not optional) so
  a forgotten arg can't silently regress to the stale-token shape.
- syncConfigToContainer resolves a fresh token before persisting
  GIT_TOKEN to DO storage for next boot.

Adds 6 unit tests covering the priority chain (cli_pat preferred,
fresh integration over stale stored, fallback on lookup failure,
rig-level integration ID, no-config returns null).
…3160)

* fix(gastown): distinguish null causes in PR status polling (#3149)

Replace PRStatusResult | null return type with discriminated PRStatusOutcome
union in checkPRStatus. Each null cause (no token, HTTP error, invalid
response, unrecognized URL, host mismatch) now surfaces a structured
PRStatusError with actionable failure messages.

- resolveGitHubToken returns GitHubTokenResolution with resolution chain
- no_token and non-transient HTTP errors (401/403/404) fail immediately
- invalid_response/unrecognized_url/host_mismatch fail after 3 strikes
- Transient HTTP errors (5xx/429) keep existing 10-strike behavior
- poll_null_count resets to 0 on successful poll at both call sites
- failureKind persisted to bead metadata for analytics
- AE event pr.poll_failed emitted on terminal failure
- Unit tests for checkPRStatus, resolveGitHubToken, failureMessageFor,
  and threshold logic
- Integration test for no_token immediate-fail path

* style: apply oxfmt formatting

* fix(gastown): track integration source when GIT_TOKEN_SERVICE unbound (review on town-scm.ts:66)

When integrationId is set but GIT_TOKEN_SERVICE binding is missing,
the configured integration source was silently omitted from the tried
array. Add an else branch that pushes the source label with a
'(GIT_TOKEN_SERVICE not bound)' annotation so the no_token error
message lists all attempted sources.

* fix(gastown): fail immediately for unrecognized_url and host_mismatch (review on actions.ts:374)

Both are deterministic configuration errors that cannot self-resolve
on retry. Move them from the 3-strike bucket to the fail-immediately
bucket alongside no_token and non-transient http_error. Only
invalid_response remains in the 3-strike category.

* fix(gastown): use separate counters for transient vs non-transient poll errors (review on actions.ts:1350)

Replace the shared poll_null_count with poll_transient_count and
poll_non_transient_count. Each error category increments only its own
counter and resets the other, preventing cross-contamination where 9
transient errors followed by 1 non-transient error would incorrectly
fail the bead.

Legacy poll_null_count is migrated on first read: the transient branch
falls back to poll_null_count when poll_transient_count is absent.
This ensures in-flight beads at deploy time retain their existing
counter value. The non-transient branch does not read the legacy field
since these counters reset on every success anyway — at worst an
in-flight bead gets one extra retry for invalid_response.
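A minimal sketch of the two-counter bookkeeping and the legacy fallback described above. The metadata field names come from the commit message; the `recordPollFailure` and `recordPollSuccess` helpers, and the choice to drop the legacy key from the returned object, are assumptions made for illustration.

```typescript
interface PollCounters {
  poll_null_count?: number; // legacy shared counter, read only as a fallback
  poll_transient_count?: number;
  poll_non_transient_count?: number;
}

// Increment the counter for this error category and reset the other one.
function recordPollFailure(meta: PollCounters, transient: boolean): PollCounters {
  if (transient) {
    // Migration on first read: fall back to the legacy counter when the new one is absent.
    const prev = meta.poll_transient_count ?? meta.poll_null_count ?? 0;
    return { poll_transient_count: prev + 1, poll_non_transient_count: 0 };
  }
  // The non-transient branch deliberately ignores the legacy field.
  const prev = meta.poll_non_transient_count ?? 0;
  return { poll_transient_count: 0, poll_non_transient_count: prev + 1 };
}

// A successful poll clears both counters.
function recordPollSuccess(): PollCounters {
  return { poll_transient_count: 0, poll_non_transient_count: 0 };
}
```

With this shape, nine transient failures followed by one non-transient failure leave the non-transient counter at 1 rather than pushing a shared counter to 10.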

* fix(gastown): resolve merge conflict in resolveGitHubToken - merge staging priority with PR #3160 structured return type

- resolveGitHubToken now uses staging's priority: cli_pat → integration → stored token
- Returns GitHubTokenResolution discriminated union (from PR #3160)
- Includes unbound-service else branch (GIT_TOKEN_SERVICE not bound)
- Adds resolveGitHubTokenString helper for non-error-aware callers
- Updates Town.do.ts, container-dispatch.ts, config.ts to use helper
- Updates town-scm.test.ts for GitHubTokenResolution return shape
- Updates pr-poll-errors.test.ts for new priority order
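The merged behavior (priority order, structured result, unbound-service annotation, string helper) might look roughly like the sketch below. The `TokenSources` input shape, the `source` and `tried` field names, and the source labels are assumptions. The integration lookup is modelled as a synchronous callback to keep the example short, whereas the real GIT_TOKEN_SERVICE call is a service binding and is likely asynchronous.

```typescript
type GitHubTokenResolution =
  | { ok: true; token: string; source: string }
  | { ok: false; tried: string[] };

// Hypothetical input shape for illustration only.
interface TokenSources {
  cliPat?: string;
  integrationId?: string;
  storedToken?: string;
  // undefined models a missing GIT_TOKEN_SERVICE binding
  lookupIntegrationToken?: (integrationId: string) => string | null;
}

// Priority: cli_pat, then a fresh integration token, then the stored token.
function resolveGitHubToken(src: TokenSources): GitHubTokenResolution {
  const tried: string[] = [];

  tried.push("cli_pat");
  if (src.cliPat) return { ok: true, token: src.cliPat, source: "cli_pat" };

  if (src.integrationId) {
    if (src.lookupIntegrationToken) {
      tried.push("integration");
      let fresh: string | null = null;
      try {
        fresh = src.lookupIntegrationToken(src.integrationId);
      } catch {
        fresh = null; // lookup failure falls through to the stored token
      }
      if (fresh) return { ok: true, token: fresh, source: "integration" };
    } else {
      // Record the configured source even though it could not be queried.
      tried.push("integration (GIT_TOKEN_SERVICE not bound)");
    }
  }

  tried.push("stored_token");
  if (src.storedToken) return { ok: true, token: src.storedToken, source: "stored_token" };

  return { ok: false, tried };
}

// Helper for callers that only need the token string or null.
function resolveGitHubTokenString(src: TokenSources): string | null {
  const resolution = resolveGitHubToken(src);
  return resolution.ok ? resolution.token : null;
}
```

Callers that build error messages can use the `tried` list from the failure branch; callers that only need a credential can stay on the string helper.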

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
@jrf0110 jrf0110 changed the title fix(gastown): unblock /agents/start during boot hydration; preserve mayor tools on prewarm release: gastown-staging -> main May 11, 2026
John Fawcett added 4 commits May 11, 2026 16:21
… awaitHydration in /refresh-token

The /refresh-token handler assigned process.env.GASTOWN_CONTAINER_TOKEN
before awaiting hydration, inconsistent with PATCH /agents/:id/model which
gates first. Mid-hydration token refresh could cause buildPrewarmEnv to
pick up a different token than the one hydration captured locally.
…stead of module global

The _resolveHydration module-global stale-capture pattern would orphan
the first promise's resolver if bootHydration() were ever called
concurrently. Capturing resolve as a local inside bootHydration() itself
eliminates the risk and removes the module-global.
… getMayorPrewarmContext

getMayorPrewarmContext now returns { agentId } even when the kilocode
token is unavailable (instead of null), so the worker route no longer
needs to fall through to getMayorAgentId. This eliminates the redundant
agents.listAgents SQL query over a second RPC hop.
…igId routes

The per-route tagging middleware registered prefixes under
/api/orgs/:orgId/... but missed the parallel /api/users/:userId/rigs/:rigId
family. Without this, requests to those routes lack rigId in structured
log tags.
Comment thread services/gastown/src/dos/town/actions.ts Outdated
@jrf0110
Contributor Author

jrf0110 commented May 11, 2026

Review observation dispositions

Observation A — "Request/response logging removed without replacement"

Intentional — PR #3158 deletes those manual log lines because instrumented() already emits structured AE events with route, userId, townId, rigId, agentId, beadId, durationMs, and error per route. The new per-route Hono-param tagging middleware preserves the tagging side of the old block. No tracing observability is lost; structured tracing replaces it.

Observation C — "Double GIT_TOKEN_SERVICE.getToken call per agent start"

Acknowledged. The second GIT_TOKEN_SERVICE.getToken is a KV cache hit, so the perf impact is negligible, but the duplication is real — will track as a separate cleanup since deduping requires changing buildContainerConfig's signature beyond what's in scope for this release.

Additional thread resolved

A 4th inline thread about a misleading migration comment in actions.ts was also addressed: the comment on the non-transient poll counter branch claimed poll_null_count migration, but the SQL doesn't include it (correctly, since invalid_response is a new error kind). Fixed in c65fbc2.

John Fawcett and others added 3 commits May 11, 2026 18:37
Three independent gastown reliability fixes that came out of debugging a
production town stuck in a dispatch loop with the mayor reporting
"GH_TOKEN is NOT set" while the bead it was working on cycled
in_progress -> open every few minutes:

1. dispatchAgent's catch path no longer transitions the agent to
   'idle'. The exception path can fire after the container has already
   accepted /agents/start (e.g. /refresh-token raced a token rotation
   and threw late), in which case the SDK session is alive and
   heartbeating. Marking the agent idle was tripping reconcileAgents'
   "idle agent hooked to live bead" rule, which tore the hook out
   from under the running session and made every subsequent
   gt_request_changes / gt_triage_resolve call fail with
   "is not hooked to a bead" until the session exited. Now we leave
   the agent as 'working' and let the heartbeat-staleness check do
   the cleanup if the agent really did die. Also surface
   getLastStartError() in the dispatch_failed analytics event so
   future debugging doesn't have to guess at the root cause.

2. reconcileAgents' "idle agent hooked to live bead" rule now skips
   agents whose last_activity_at is fresh (<90s). That 90s window
   matches reconcileAgents' own stale-heartbeat threshold, so a truly
   dead agent still gets reaped on a later tick — but a live SDK
   session caught mid-action by a phantom dispatch_failed keeps its
   hook.

3. review-queue's unhooked-refinery gt_done fallback no longer
   silently marks the MR as 'merged' when there's no pr_url. The
   same hook-leak race that fires while the refinery is mid-review
   would silently land a PR the refinery was actively trying to
   reject (after gt_request_changes hit "not hooked"). The fallback
   now fails the review (returning the source bead to 'open') and
   raises an escalation so a human can decide whether the refinery
   meant to approve or request changes.
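The freshness guard from item 2 can be sketched as a pure predicate. The `AgentRow` shape, the `shouldUnhookIdleAgent` name, and the constant name are assumptions; the 90s window and the idle + hooked + live-bead condition come from the description above.

```typescript
// 90s, matching the stale-heartbeat threshold mentioned in the commit message.
const FRESH_ACTIVITY_WINDOW_MS = 90_000;

// Hypothetical row shape for illustration only.
interface AgentRow {
  status: "idle" | "working";
  hookedBeadId: string | null;
  lastActivityAt: number | null; // epoch milliseconds
}

// True when the "idle agent hooked to live bead" rule should remove the hook.
function shouldUnhookIdleAgent(agent: AgentRow, beadIsLive: boolean, now: number): boolean {
  if (agent.status !== "idle" || agent.hookedBeadId === null || !beadIsLive) {
    return false; // rule does not apply
  }
  const isFresh =
    agent.lastActivityAt !== null && now - agent.lastActivityAt < FRESH_ACTIVITY_WINDOW_MS;
  // A recently active agent is probably a live session; leave its hook for a later tick.
  return !isFresh;
}
```

An agent with no recorded activity is treated as stale here, which is a design choice of this sketch rather than something the commit message specifies.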

Also fixes the "mayor has no GH_TOKEN" symptom by plumbing GitHub
auth into the prewarm path:

- getMayorPrewarmContext now resolves the GitHub token via the
  standard chain (github_cli_pat / platform integration / stored
  fallback) and returns it alongside the kilocode token.
- buildPrewarmEnv (container) sets GH_TOKEN, GIT_TOKEN, GITHUB_TOKEN,
  and GITHUB_CLI_PAT on the prewarmed SDK's process.env so the
  mayor's bash tool sees credentials from boot. Previously these
  were only set by buildAgentEnv on /agents/start, which never runs
  while ensureMayor short-circuits on a warm session.
- PERSIST_ENV_KEYS now includes GH_TOKEN/GIT_TOKEN/GITHUB_TOKEN/
  GITHUB_CLI_PAT so a token rotation arriving via /agents/start
  cache-hit reaches process.env.

Includes prior staged work: hasActiveWork refactor (short-circuit
SQL reads via thunks) and healthCheck alarm-cadence selection so
idle towns don't all wake up on the 5s cadence after a deploy.
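One way to read "short-circuit SQL reads via thunks" is that each check is wrapped in a function so later queries run only when earlier ones find nothing. The sketch below illustrates that idea; the `Check` type and the array-based signature are assumptions and not the actual hasActiveWork implementation.

```typescript
// Each thunk stands in for one SQL read that reports whether work exists.
type Check = () => boolean;

// Evaluate checks in order and stop at the first one that reports active work.
function hasActiveWork(checks: Check[]): boolean {
  return checks.some((check) => check());
}

// Usage: the second check is never evaluated once the first returns true.
let secondCheckRan = false;
const active = hasActiveWork([
  () => true,
  () => {
    secondCheckRan = true;
    return false;
  },
]);
```

`Array.prototype.some` stops iterating at the first truthy result, so the deferred reads after that point are skipped entirely.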

Verification: pnpm --filter cloudflare-gastown {typecheck,lint,test}
all pass (251 tests). Behavior verified by code inspection against
the production AE event timeline for town
0532b9ef-04c8-488a-ac34-6eb1a3d47228 — every dispatch_failed in
that town hit the catch path, every cycling unhook hit the idle+
hooked+live-bead rule, and every "GH_TOKEN is NOT set" mayor
session was a prewarm cache-hit.
Contributor

@jeanduplessis jeanduplessis left a comment

Found 4 issues in the final combined diff:

  • 1 critical auth/env leakage risk in container SDK env persistence.
  • 2 warnings around hydration gating and GitLab-specific PR polling errors.
  • 1 test-quality suggestion where counter behavior is documented but not executed.

I verified the current unresolved review-thread state (none unresolved) and checked the previously resolved/outdated threads against the final diff.

Comment thread services/gastown/container/src/process-manager.ts
Comment thread services/gastown/container/src/control-server.ts
Comment thread services/gastown/src/dos/town/actions.ts Outdated
Comment thread services/gastown/test/unit/pr-poll-thresholds.test.ts

Development

Successfully merging this pull request may close these issues.

[Gastown] Misleading "GitHub API returned null" error when town has no GitHub token

2 participants