
chore: batch update main from gastown-staging#2130

Merged
jrf0110 merged 25 commits into main from
gastown-staging
Apr 8, 2026

Conversation


@jrf0110 jrf0110 commented Apr 7, 2026

Summary

This batch PR merges recent feature and bugfix work from gastown-staging into main.

Original constituent PRs

  • KV-backed agent session persistence (#2105)
    • Isolates agent SQLite DB via KILO_TEST_HOME and XDG_DATA_HOME.
    • Adds AGENT_DB_SNAPSHOTS_KV binding for agent DB snapshots with WAL checkpoint before save.
    • Hydrates DB on startAgent and resumes mayor sessions on boot, allowing state to survive container evictions.
    • Adds process registry RPC to TownContainerDO and passes GASTOWN_TOWN_ID to container on provision.
    • syncRegistry() persists running agents to the container registry on start/stop/exit/failure.
  • Filter closed/failed beads from re-escalation query (#2128)
  • Extract SCM operations and instrument AI calls
    • Extracts PR status checking and feedback analysis from TownDO into a new town-scm submodule.
    • Instruments the Workers AI call for checking review threads with api.external_request analytics.
  • Auto-merge pipeline and Workers AI classification fixes
    • Adds Workers AI (Gemma 4 26B) classification of unresolved PR threads (differentiates blocking from non-blocking like LGTMs/bot statuses).
    • Fixes mergePR to fall back through squash, merge, and rebase in order instead of hardcoding a single merge method.
    • Fixes resetAgent to zero both agent and bead dispatch_attempts for immediate recovery after evictions.
    • Fixes cross-tick race condition in pr_feedback_detected to prevent duplicate PRs.
  • Small Model configuration field (#2195)
    • Adds Small Model setting to town settings page.

Session persistence fixes

  • WAL checkpoint before snapshot: saveDbSnapshot now runs PRAGMA wal_checkpoint(TRUNCATE) via bun:sqlite before reading kilo.db, ensuring recent writes aren't lost.
  • Session resume for mayor only: startAgent and updateAgentModel call session.list() to resume existing sessions for the mayor. Non-mayor agents always get fresh sessions.
  • Skip initial prompt on resume: Resumed sessions don't re-send the startup prompt, avoiding duplicate turns.
  • Remove legacy conversation history injection: Removed conversationHistory from all mayor dispatch paths — kilo.db persistence handles session continuity.
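The resume decision above can be sketched as a small pure function. The session shape and role names here are hypothetical simplifications; the real data comes from the SDK's session.list():

```typescript
// Hypothetical session record; the real shape comes from @kilocode/sdk's session.list().
interface SessionInfo {
  id: string;
  updatedAt: number; // epoch ms
}

// Decide whether to resume or create: the mayor resumes its most recently
// updated session from the hydrated kilo.db; all other agents start fresh.
function pickSession(
  role: "mayor" | "polecat" | "refinery" | "triage",
  existing: SessionInfo[],
): { sessionId: string | null; resumed: boolean } {
  if (role !== "mayor" || existing.length === 0) {
    return { sessionId: null, resumed: false }; // caller runs session.create()
  }
  const newest = existing.reduce((a, b) => (b.updatedAt > a.updatedAt ? b : a));
  return { sessionId: newest.id, resumed: true };
}
```

When `resumed` is true, the caller also skips the initial session.prompt() call, matching the "skip initial prompt on resume" fix.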

Container lifecycle fixes

  • Container registry sync: Added syncRegistry() that POSTs running agents to TownContainerDO on agent start/exit/stop/failure. Registry cleared at end of drainAll().
  • Drain idle timer shortening: idleTimers now store { timer, onExit }. drainAll() Phase 1b replaces long idle timers (120s/600s) with 10s timers so already-idle agents exit promptly.
  • db-snapshot auth bypass: Added /db-snapshot to the kiloAuthMiddleware skip condition alongside /container-registry.
  • SDK bump: @kilocode/sdk and @kilocode/plugin bumped from 7.0.37 to 7.1.23.
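A minimal sketch of what syncRegistry() serializes, assuming hypothetical shapes for ManagedAgent and the registry payload (the real definitions live in process-manager.ts and TownContainerDO):

```typescript
// Hypothetical shapes; the real ManagedAgent and registry schema are in the codebase.
interface ManagedAgent {
  agentId: string;
  rigId: string;
  startupRequest: unknown; // original StartAgentRequest, preserved for resume
}

// Serialize the running-agents Map into the POST /container-registry body
// so bootHydration can resume these agents after a container eviction.
function buildRegistryPayload(running: Map<string, ManagedAgent>) {
  return {
    agents: [...running.values()].map((a) => ({
      agentId: a.agentId,
      rigId: a.rigId,
      startupRequest: a.startupRequest,
    })),
  };
}
```

drainAll() would POST an empty payload at the end, so the next container's bootHydration doesn't resurrect force-saved agents.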

Reconciler fixes

  • code_review=false fast-track: Only fast-tracks MR beads that have a pr_url. MR beads without a PR stay open for the refinery (or get failed if code review is disabled and no PR was created).
  • Refinery dispatch gating: Rules 5-6 use refineryNeededFilter — when code_review=false, only dispatches for convoy review-and-merge beads. Prevents refinery dispatch for ordinary PR beads that should be handled by poll_pr.
  • Convoy review-and-merge support: Fast-track and Rules 5-6 correctly exclude/include convoy review-and-merge MR beads using bead_dependencies → convoy_metadata.merge_mode lookups.
  • Orphaned MR cleanup: Open MR beads without pr_url when code_review=false are failed, and the source bead is reopened for retry.
  • Idle agent recovery: Reconciler detects idle agents hooked to live beads (dispatch failed) and unhooks + reopens for re-dispatch.
  • GUPP stale timestamp fix: dispatchAgent NULLs last_event_at/last_event_type so GUPP falls back to last_activity_at until new SDK events arrive.
  • Active alarm cadence: escalateToActiveCadence() on work-creation paths (slingBead, slingConvoy, startConvoy, submitToReviewQueue, requestChanges). Only shortens alarms, never pushes back.
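The "only shortens alarms" rule reduces to a comparison against the proposed wake time. A minimal sketch, assuming the 5s interval mentioned in the commits (the constant name is a guess):

```typescript
const ACTIVE_ALARM_INTERVAL_MS = 5_000; // assumed value; the PR says "fire in 5s"

// Only-shorten semantics: reschedule the DO alarm to now + interval unless an
// alarm is already scheduled to fire at least that soon. Returns the alarm
// time to set, or null when the existing alarm should be left untouched.
function escalateToActiveCadence(
  currentAlarm: number | null, // epoch ms of the scheduled alarm, if any
  now: number,
): number | null {
  const target = now + ACTIVE_ALARM_INTERVAL_MS;
  if (currentAlarm !== null && currentAlarm <= target) {
    return null; // existing alarm is already at least as aggressive
  }
  return target;
}
```

Never pushing an alarm back is what prevents reconciler starvation during bursts of work creation: repeated calls can only converge on the nearest wake time.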

Review queue fixes

  • Skip review queue for pr-feedback beads: gt:pr-feedback beads now close directly in agentDone (matching gt:rework and gt:pr-fixup patterns), preventing redundant MR bead creation.

Documentation

  • Added E2E testing docs for KV persistence (Section 8) and re-escalation filtering (Section 9) in local-debug-testing.md.

Verification

  • 182/182 unit tests pass
  • Typecheck clean
  • Lint clean
  • Local E2E tested: container registry CRUD, DB snapshot round-trip, full auto-merge pipeline, graceful container eviction, session resume
  • Production debugging of convoy and bead lifecycle issues

Visual Changes

Small model added to settings


Reviewer Notes

  • The AGENT_DB_SNAPSHOTS_KV binding requires a real KV namespace ID (set in both top-level and dev env in wrangler.jsonc).
  • Workers AI integration for PR feedback analysis uses the AI binding.
  • The code_review=false reconciler logic has been significantly reworked — the fast-track, Rules 5-6, and orphan cleanup all interact carefully. See reconcileReviewQueue in reconciler.ts.
  • Container drain Phase 1b shortens idle timers — verify agents exit promptly during graceful stops.
  • Mayor session resume relies on kilo.db persistence + session.list() — if the SDK changes how sessions are stored, this may need updating.

jrf0110 and others added 5 commits April 6, 2026 16:22
…tion, and bug fixes

- Add Workers AI (Gemma 4 26B) to classify unresolved PR review threads as
  blocking vs non-blocking for auto-merge decisions. Informational comments
  (LGTM, bot status reports) no longer block auto-merge.
- Fix mergePR to try squash/merge/rebase in order instead of hardcoding merge
  method (repos with squash-only policy were failing with 405).
- Fix resetAgent to also zero dispatch_attempts so agents recover immediately
  after container evictions instead of being stuck in exponential backoff.
- Fix code_review=false bypass: fast-track ALL open MR beads (not just those
  with pr_url) to prevent the refinery from being dispatched for code review
  when code_review is disabled.
- Fix cross-tick race in pr_feedback_detected: re-verify PR is still open
  before creating feedback beads to prevent duplicate PRs on merged branches.
- Add AI binding to wrangler.jsonc for both production and dev environments.
- Add diagnostic logging for poll_pr auto-merge flow (allGreen, readySince,
  elapsed/delay, convoy dispatch target branch).
- Update local-debug-testing.md with Workers AI documentation.
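The mergePR fallback described above can be sketched as trying each merge method in order and keeping the last error. The signature is hypothetical; the real attempt would call the SCM API and surface a 405 for disallowed methods:

```typescript
type MergeMethod = "squash" | "merge" | "rebase";

// Hypothetical signature: attemptMerge would call the SCM API and throw an
// error (e.g. 405 Method Not Allowed) when the repo's policy forbids the method.
async function mergeWithFallback(
  attemptMerge: (method: MergeMethod) => Promise<void>,
): Promise<MergeMethod> {
  const methods: MergeMethod[] = ["squash", "merge", "rebase"];
  let lastErr: unknown;
  for (const method of methods) {
    try {
      await attemptMerge(method);
      return method; // first method the repo's policy allows
    } catch (err) {
      lastErr = err; // try the next method
    }
  }
  throw lastErr; // every method was rejected
}
```

This is why squash-only repos stopped failing: "merge" is rejected with 405, and the loop falls through to "squash" having already tried it first.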
Moves PR status checking and PR feedback analysis from TownDO into a new town-scm
submodule. Instruments the Workers AI call for checking review threads with a new
api.external_request analytics event.
* fix(gastown): filter closed beads from re-escalation query

Add status != 'closed' filter to reEscalateStaleEscalations query to
prevent phantom Re-Escalation messages for already-acknowledged (and
thus closed) escalation beads.

Fixes #2123

* fix(gastown): exclude failed beads from re-escalation query

Exclude beads with status='failed' in addition to 'closed' from the
re-escalation query, preventing phantom Re-Escalation messages for
already-failed escalation beads.

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
* feat(gastown): isolate agent SQLite DB via KILO_TEST_HOME (#2094)

Adds KILO_TEST_HOME env var to buildAgentEnv() to ensure @kilocode/sdk
isolates the kilo.db file per agent instead of sharing the container's
default path.

Co-authored-by: Shadow-polecat-d1e0e21b@5f5fda7f

Co-authored-by: John Fawcett <john@kilcoode.ai>

* feat(gastown): add AGENT_DB_SNAPSHOTS_KV binding for agent DB snapshots (#2096)

* feat(gastown): add AGENT_DB_SNAPSHOTS_KV binding for agent DB snapshots

- Add KV namespace binding to wrangler.jsonc
- Add AGENT_DB_SNAPSHOTS_KV: KVNamespace to Env interface in worker-configuration.d.ts (both DevEnv and Env)

* fix(gastown): set AGENT_DB_SNAPSHOTS_KV id to empty string

The placeholder id '<your-kv-namespace-id>' would break deployments.
Set to empty string to allow wrangler to provision the namespace.

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>

* feat(gastown): add process registry RPC to TownContainerDO (#2098)

Co-authored-by: John Fawcett <john@kilcoode.ai>

* feat(gastown): add container-registry and db-snapshot worker endpoints

Add GET/POST /api/towns/:townId/container-registry proxied to
TownContainerDO.getRegistry() and .updateRegistry().

Add GET/POST /api/towns/:townId/rigs/:rigId/agents/:agentId/db-snapshot
backed by AGENT_DB_SNAPSHOTS_KV.

BEAD=fca2ee4e

* feat(gastown): pass GASTOWN_TOWN_ID to container on provision (#2099)

Ensures the container knows its town identity on cold boot by reading
process.env.GASTOWN_TOWN_ID.

Co-authored-by: John Fawcett <john@kilcoode.ai>

* feat(gastown): hydrate DB on startAgent and resume agents on boot (#2103)

* feat(gastown): hydrate DB on startAgent and resume agents on boot

- Add hydrateDbFromSnapshot() that fetches the agent's DB from KV and writes
  it to /tmp/agent-home-<agentId>/.local/share/kilo/kilo.db
- Call hydrateDbFromSnapshot() in startAgent() before ensureSDKServer()
- Add saveDbSnapshot() that reads kilo.db and POSTs it to the worker KV
- Call saveDbSnapshot() on stopAgent(), exitAgent(), drainAll() stragglers,
  and stopAll()
- Add bootHydration() that fetches the container registry and resumes all
  registered agents
- Call bootHydration() from main.ts after control server startup

* fix(gastown): move container-registry and db-snapshot routes before kiloAuthMiddleware

Container-token requests to /container-registry and /db-snapshot were returning
401 Unauthorized because these routes were registered after the kiloAuthMiddleware
wildcard. Moved them before that middleware so they are protected by authMiddleware
instead, which accepts container JWTs.

Fixes PR #2103 review comments from kilo-code-bot.

* fix(gastown): use authMiddleware for container-registry and db-snapshot routes

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>

* fix: skip kiloAuthMiddleware for container-registry routes; add placeholder KV namespace id

- Container-registry routes use authMiddleware which accepts container JWTs,
  but the global /api/towns/:townId/* middleware was also applying kiloAuthMiddleware
  which rejected container tokens. Now the global middleware skips container-registry.
- Changed AGENT_DB_SNAPSHOTS_KV namespace id from empty string to 'placeholder'
  to indicate it requires configuration at deploy time.

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>

kilo-code-bot bot commented Apr 7, 2026

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Overview

Severity    Count
CRITICAL    0
WARNING     2
SUGGESTION  0


Issue Details

No new issues found on the changed lines in this incremental review.

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

WARNING

  • services/gastown/src/dos/town/reconciler.ts:1267 - The in-progress MR guard still counts fast-tracked PR beads with pr_url IS NULL, so an ordinary PR can block review-and-merge convoy refinery work when code_review=false.
  • services/gastown/container/src/process-manager.ts:195 - syncRegistry() still serializes startupRequest and startupEnv, so boot hydration can restore the mayor with stale model and runtime config after a hot-swap.
Files Reviewed (1 file)
  • services/gastown/src/dos/town/review-queue.ts - no new issues

Reviewed by gpt-5.4-2026-03-05 · 210,273 tokens

…ing docs

- Replace placeholder KV namespace ID with real ID (5ffb8f362e7b4d869fe2f48293a9f0c2)
  in both top-level and dev env wrangler config
- Add kv_namespaces to dev env (wrangler doesn't inherit from top-level
  when env-specific bindings are declared)
- Add Section 8: KV-backed agent session persistence testing guide
  (container registry, db snapshots, boot hydration, drain snapshots)
- Add Section 9: Re-escalation filtering verification guide
jrf0110 added 2 commits April 7, 2026 13:07
…umping SDK

- Set XDG_DATA_HOME in buildAgentEnv() so the kilo CLI writes kilo.db
  to the same path that saveDbSnapshot/hydrateDbFromSnapshot use
  (/tmp/agent-home-{agentId}/.local/share/kilo/kilo.db). Previously
  KILO_TEST_HOME was set but only affects Global.Path.home, not the
  XDG data directory where kilo.db lives.
- Bump @kilocode/sdk and @kilocode/plugin from 7.0.37 to 7.1.23
- Bump container plugin SDK deps from ^1.0.23 to 7.1.23
…orten drain idle timers

Three bugs fixed:

1. Container registry never written — agents weren't persisted to the
   registry on start/stop, so bootHydration always found an empty
   registry after container eviction. Added syncRegistry() that
   serializes the running agents Map to the TownContainerDO via POST
   /container-registry, called on agent start, exit, stop, and failure.
   Added startupRequest field to ManagedAgent to preserve the original
   StartAgentRequest for registry serialization.

2. Refinery dispatched despite code_review=false — Rules 5-6 in the
   reconciler were wrapped in a bare block { } instead of
   if (refineryCodeReview) { }. The fast-track code appended
   transition_bead actions but hadn't mutated the DB yet, so Rules 5-6
   re-queried and still saw MR beads as open, dispatching the refinery.
   Changed the bare block to a proper if guard.

3. Drain waits 120s-600s for already-idle agents — agents that received
   session.idle before drain started had long idle timers pending.
   drainAll() now replaces those with 10s timers in Phase 1b by storing
   the onExit callback alongside each idle timer.
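The Phase 1b fix hinges on keeping the onExit callback next to each timer so a pending timer can be rebuilt at a shorter delay. A minimal sketch, with hypothetical names (the real bookkeeping lives in process-manager.ts):

```typescript
// Hypothetical idle-timer bookkeeping; storing onExit alongside the timer is
// what lets drain rebuild the timer at a shorter delay.
interface IdleEntry {
  timer: ReturnType<typeof setTimeout>;
  onExit: () => void;
  delayMs: number;
}

const idleTimers = new Map<string, IdleEntry>();
const DRAIN_IDLE_MS = 10_000;

function armIdleTimer(agentId: string, delayMs: number, onExit: () => void) {
  idleTimers.set(agentId, { timer: setTimeout(onExit, delayMs), onExit, delayMs });
}

// Drain Phase 1b: replace long idle timers (120s/600s) with 10s timers so
// already-idle agents exit promptly instead of stalling the drain.
function shortenIdleTimersForDrain() {
  for (const [agentId, entry] of idleTimers) {
    if (entry.delayMs <= DRAIN_IDLE_MS) continue;
    clearTimeout(entry.timer);
    armIdleTimer(agentId, DRAIN_IDLE_MS, entry.onExit);
  }
}
```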
…eset, conversation history

- Fix db-snapshot routes hitting kiloAuthMiddleware: add '/db-snapshot'
  to the skip condition alongside '/container-registry' so container JWT
  auth works correctly for snapshot endpoints.
- Clear container registry at end of drainAll() so bootHydration on the
  next container doesn't resurrect force-saved agents.
- Reset bead dispatch_attempts and last_dispatch_attempt_at in resetAgent
  so the reconciler doesn't skip the bead due to accumulated cooldown.
- Remove legacy conversation history injection from mayor dispatch paths
  (sendMayorMessage, _ensureMayor, updateMayorModel) — kilo.db
  persistence now handles session continuity across evictions.
…essions

Two fixes for agent session persistence across container evictions:

1. WAL checkpoint before snapshot: SQLite in WAL mode stores recent
   writes in -wal/-shm files. saveDbSnapshot now runs PRAGMA
   wal_checkpoint(TRUNCATE) via bun:sqlite subprocess to merge the
   WAL into the main .db file before reading it. Without this, the
   snapshot was missing recent session data (messages, parts, etc).

2. Resume existing sessions: startAgent now calls session.list() after
   hydrating the DB and starting the SDK server. If sessions exist
   from the hydrated kilo.db, it resumes the most recently updated
   one instead of always creating a new session via session.create().
jrf0110 added 3 commits April 7, 2026 17:10
Resolved 4 conflicts:
- docs/local-debug-testing.md: kept our new sections 8-9
- Town.do.ts: removed duplicate SCM methods that were already extracted
  to town-scm.ts (resolveGitHubToken, checkPRStatus, checkPRFeedback,
  areThreadsBlocking, mergePR)
- worker-configuration.d.ts: kept AGENT_DB_SNAPSHOTS_KV binding
- wrangler.jsonc: kept kv_namespaces in both top-level and dev env
Non-mayor agents (polecats, refineries, triage) always get fresh
sessions since they work on a new bead each dispatch. The session
resume logic was applying to all agents, causing polecats to inherit
stale sessions from the hydrated kilo.db instead of starting clean.
When a town is in idle alarm cadence (5 min interval), creating beads
or convoys via the mayor didn't wake it up. armAlarmIfNeeded() only
sets an alarm if none is scheduled, so it's a no-op when the idle
alarm is already set minutes in the future.

Added escalateToActiveCadence() which unconditionally reschedules the
alarm to fire in 5s, and call it from work-creation paths:
- slingBead (mayor creates a bead)
- slingConvoy (mayor creates a convoy, non-staged)
- startConvoy (staged convoy transitions to active)
- submitToReviewQueue (polecat submits work, creates MR bead)
- requestChanges (creates rework bead)

Lifecycle paths (initialize, configureRig, heartbeat, agentDone,
agentCompleted, sendMayorMessage) keep using armAlarmIfNeeded()
since they shouldn't clobber an active alarm schedule.
…n code_review=false

When code_review=false, the fast-track blindly moved ALL open MR beads
to in_progress, including convoy review-and-merge beads that need the
refinery. Rules 5-6 were then gated behind refineryCodeReview, so the
refinery was never dispatched for these beads — they got stuck in
in_progress with no assignee.

Two changes:
1. Fast-track now excludes MR beads belonging to a review-and-merge
   convoy (checked via parent_bead_id → convoy_metadata.merge_mode).
2. Rules 5-6 block is no longer gated behind refineryCodeReview — it
   runs unconditionally but naturally only finds open MR beads, which
   after the fast-track are only convoy review-and-merge beads when
   code_review=false.
jrf0110 added 2 commits April 7, 2026 20:29
…arvation, refinery filter

1. updateAgentModel now resumes existing mayor sessions (via
   session.list) instead of always creating new ones, matching the
   startAgent fix. Model swaps no longer lose conversation history.

2. escalateToActiveCadence now only shortens alarms — it checks
   whether the current alarm is already nearer than ACTIVE_ALARM_INTERVAL
   before overwriting. Prevents reconciler starvation during bursts
   of work creation.

3. Rules 5-6 refinery dispatch now applies a convoy-only filter when
   code_review=false. Because reconciliation emits actions without
   mutating SQL, the fast-track's transition_bead actions haven't been
   applied yet, so ordinary PR beads are still 'open' in the DB.
   Without the filter, Rules 5-6 would re-dispatch the refinery for
   beads that should be skipped.
jrf0110 and others added 4 commits April 7, 2026 20:54
Both startAgent and updateAgentModel were sending the startup prompt
into resumed sessions, creating duplicate turns. Now the initial
session.prompt() call is skipped when a session was resumed from
the hydrated kilo.db — the conversation history is already there.
gt:pr-feedback beads (address review comments, fix CI) were falling
through to the default submitToReviewQueue path in agentDone, creating
a redundant MR bead. The polecat pushes to the existing PR branch —
no new PR or MR bead is needed.

Added a gt:pr-feedback handler that mirrors the existing gt:rework and
gt:pr-fixup patterns: close the feedback bead directly, which unblocks
the parent MR bead so poll_pr can re-check CI or the reconciler can
re-dispatch the refinery for re-review.
…n settings page (#2195)

* feat(gastown): Add Small Model configuration field to user-facing town settings page

* fix: Add clear button to Small Model field

---------

Co-authored-by: John Fawcett <john@kilcoode.ai>
…GUPP triggers

When an idle agent is re-dispatched, dispatchAgent updates
last_activity_at but leaves last_event_at with the timestamp from the
previous session's last SDK event (potentially hours/days old). The
GUPP patrol uses last_event_at as the primary activity signal, so it
would see the agent as unresponsive for >2h and immediately force-stop
it — even though the agent was just dispatched 5 seconds ago.

Fix: NULL out last_event_at and last_event_type in the dispatch UPDATE
so the GUPP falls back to last_activity_at (freshly set) until new SDK
events arrive from the current session.
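The fallback the GUPP patrol relies on reduces to a null-coalesce over the two timestamp columns. The row shape here is a hypothetical simplification of the agent table:

```typescript
// Hypothetical row shape for an agent as seen by the GUPP patrol.
interface AgentRow {
  last_event_at: number | null; // NULLed by dispatchAgent on re-dispatch
  last_activity_at: number;     // freshly set on dispatch
}

// Activity signal: prefer the last SDK event timestamp, but fall back to
// last_activity_at when dispatch has just cleared the event columns.
function effectiveActivityAt(agent: AgentRow): number {
  return agent.last_event_at ?? agent.last_activity_at;
}
```

Before the fix, a freshly dispatched agent carried an hours-old last_event_at and looked unresponsive; after NULLing it, the signal comes from last_activity_at until new SDK events arrive.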
jrf0110 added 5 commits April 8, 2026 11:31
When a dispatch fails (container didn't start, OOM, etc.),
agentCompleted sets the agent to idle but doesn't always unhook it.
This leaves the agent idle+hooked to a live bead — a dead-end state
where GUPP doesn't target it (not working), scheduling doesn't pick
up the bead (already in_progress with an assignee), and the agent
can't take new work (has a hook).

Added a reconciler rule in reconcileAgents: idle agents hooked to
open/in_progress beads get unhooked, and the bead is reset to open
so scheduling can re-dispatch.
The code_review=false fast-track was transitioning ALL open MR beads
to in_progress, including those without a pr_url. MR beads without a
PR have no poll_pr target, so Rule 2 would detect them as stuck and
reset to open — causing an open→in_progress→open oscillation every
reconciler tick.

Now the fast-track JOIN's review_metadata and only matches beads where
pr_url IS NOT NULL. MR beads without a PR stay open so Rules 5-6 can
dispatch the refinery to create the PR.
… code_review=false

The convoyOnlyFilter was too restrictive — when code_review=false, it
only allowed Rules 5-6 to dispatch the refinery for convoy
review-and-merge beads. MR beads without a pr_url (where the polecat
didn't create a PR) were excluded, leaving them stuck in 'open' with
no one to create the PR.

Renamed to refineryNeededFilter with an OR condition: dispatch the
refinery when the MR bead has no pr_url (needs PR creation) OR when
it belongs to a review-and-merge convoy. MR beads WITH a pr_url are
handled by the fast-track → poll_pr pipeline as before.
…false

When code_review=false and merge_strategy=pr, the polecat is
responsible for creating the PR. If it doesn't provide a pr_url in
gt_done, the MR bead is orphaned: no PR to poll, and the refinery
shouldn't be dispatched (code review disabled).

Previously this left the MR bead stuck in 'open' forever. Now the
reconciler fails orphaned MR beads (no pr_url, not convoy
review-and-merge) and reopens the source bead so a polecat can retry.

Also reverted the refineryNeededFilter to only match convoy
review-and-merge beads — MR beads without pr_url don't need the
refinery when code review is disabled.
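Taken together, the fast-track, refineryNeededFilter, and orphan cleanup commits above settle on a three-way disposition for open MR beads when code_review=false. A sketch with hypothetical field names (in the reconciler these come from joins on review_metadata and bead_dependencies/convoy_metadata):

```typescript
// Hypothetical bead shape; real fields come from review_metadata and
// convoy_metadata joins in reconciler.ts.
interface MrBead {
  pr_url: string | null;
  convoyMergeMode: "review-and-merge" | null;
}

type Disposition = "dispatch_refinery" | "fast_track" | "fail_orphan";

// Net behavior when code_review=false, per the commits above.
function classifyMrBead(bead: MrBead): Disposition {
  if (bead.convoyMergeMode === "review-and-merge") {
    return "dispatch_refinery"; // convoy review-and-merge still needs the refinery
  }
  if (bead.pr_url !== null) {
    return "fast_track"; // poll_pr takes it from here
  }
  return "fail_orphan"; // no PR and no reviewer: fail, reopen the source bead
}
```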
@jrf0110 jrf0110 merged commit 5b67fa6 into main Apr 8, 2026
34 checks passed
@jrf0110 jrf0110 deleted the gastown-staging branch April 8, 2026 17:36


Development

Successfully merging this pull request may close these issues.

fix(gastown): closed escalation beads re-broadcast as phantom Re-Escalation messages

3 participants