[Gastown] Design exploration: per-task-type model selection #3207

@jrf0110

Summary

Explore adding per-task-type model selection to gastown — letting users configure different models for different kinds of agent work (e.g. small/cheap model for gt:pr-fixup beads, premium model for fresh feature beads, Opus for mayor planning, etc.).

The architecture is already shaped favourably for this: there's exactly one function (resolveModel at services/gastown/src/dos/town/config.ts:178) that all four model-resolution call sites route through, and the kilo SDK's KILO_CONFIG_CONTENT already supports per-task routing via agent.<slug>.model slots that gastown currently flattens to a single primary model.

This issue is exploratory. Not committing to building it, not committing to a timeline. Documenting the design space so we can decide whether/when to invest.


Why this might be worth doing

Today every polecat dispatch resolves to the same default_model regardless of what work it's doing. A polecat fixing a typo, rebasing a conflict, or addressing 3 review comments runs on the same model as a polecat designing and implementing OAuth from scratch. The cost asymmetry is real:

  • gt:pr-fixup — polecat addresses specific review comments on an existing PR branch. Mostly mechanical edits with bounded scope. Sonnet is overkill.
  • gt:pr-feedback — same shape; polecat addresses CI failures + reviewer comments.
  • gt:pr-conflict — rebase work. Mostly syntactic; only escalates when there's semantic conflict.
  • gt:rework — polecat continues on the same branch addressing refinery feedback. Smaller scope than fresh issue beads.
  • gt:triage — single-pass judgment from a triage prompt. Currently runs as role: polecat with the same model resolution as full coding sessions.

PR fixups are a non-trivial fraction of polecat dispatches in any active rig and they're cleanly observable at dispatch time. That's the strongest single argument for at least wiring role × label into resolveModel.


The differentiable axes (from the codebase, not speculation)

These are the signals structurally observable to the worker today. The worker is the only place model selection can happen given the current architecture:

| Axis | Values | Where set | Currently affects model? |
| --- | --- | --- | --- |
| Agent role | mayor, polecat, refinery, triage | services/gastown/container/src/types.ts:5 | Yes: townConfig.role_models.{mayor,polecat,refinery} already exists. Triage has no slot (runs as polecat). |
| Bead label (gt:*) | gt:rework, gt:pr-fixup, gt:pr-feedback, gt:pr-conflict, gt:triage, gt:held, gt:escalation, gt:merge-request, gt:convoy, gt:molecule, gt:message | Sling-time + lifecycle events | No: labels change prompts, branch checkout, review queue routing, but never model selection |
| Bead type | issue, merge_request, escalation, message, convoy, molecule, agent | Sling | Implicitly via role (refinery picks up merge_request); no direct model coupling |
| Rig | per-rig rigOverride | services/gastown/src/dos/town/rigs.ts | Yes for polecat and refinery only; mayor explicitly ignores rig overrides |
| Convoy merge_mode | review-then-land, review-and-merge | Convoy creation | No model effect (only branch-target effect) |
| Bead metadata model | arbitrary string | trpc/router.ts:849 | Stored but never read: the tRPC sling mutation accepts a model field that nothing consumes. Latent feature. |
| Task phase within a session | exploration / planning / coding / writing PR description | implicit inside the LLM session | Not observable to the worker. Phase boundaries don't exist; one bead = one continuous kilo serve session. |

The most leveraged differentiator that's currently unwired is role × label. Clean cost-asymmetry, fully observable at dispatch time.


The hook point

Exactly one function to modify: resolveModel at services/gastown/src/dos/town/config.ts:178.

export function resolveModel(
  townConfig: TownConfig,
  rigOverride: RigOverrideConfig | null | undefined,
  role: string,
  // new:
  taskKind?: { labels?: string[]; type?: BeadType }
): string

Every read site (4 callers) routes through it:

  1. container-dispatch.ts:491 - dispatch payload model field on POST /agents/start
  2. Town.do.ts:2727 - getMayorPrewarmContext (must agree byte-identically with dispatch)
  3. config.ts:334 - buildContainerConfig X-Town-Config header default
  4. router.ts:1244-1245 - mayor model change detection in updateTownConfig

Threading taskKind from dispatch site:

  • dispatchAgent in services/gastown/src/dos/town/scheduling.ts:64-153 — already has the bead in hand; labels are right there
  • startAgentInContainer params struct in container-dispatch.ts:346-382 — needs a taskKind field
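
The threading step could look roughly like the sketch below. It assumes a bead exposes `type` and `labels` fields; `BeadLike`, `TaskKind`, `MODEL_RELEVANT_LABELS`, and `taskKindFromBead` are hypothetical names for illustration, not identifiers from the codebase. Only the label and type values are taken from the axes listed earlier.

```typescript
// Hypothetical, simplified shapes for illustration only.
type BeadType =
  | "issue" | "merge_request" | "escalation" | "message"
  | "convoy" | "molecule" | "agent";

interface BeadLike {
  type: BeadType;
  labels: string[];
}

interface TaskKind {
  labels?: string[];
  type?: BeadType;
}

// Assumption: only these labels are candidates for model routing.
const MODEL_RELEVANT_LABELS: ReadonlySet<string> = new Set([
  "gt:pr-fixup",
  "gt:pr-feedback",
  "gt:pr-conflict",
  "gt:rework",
  "gt:triage",
]);

// Reduce a bead to the signals a label-aware resolveModel would need.
// Labels are sorted so the same bead always yields the same taskKind.
function taskKindFromBead(bead: BeadLike): TaskKind {
  const labels = bead.labels
    .filter((label) => MODEL_RELEVANT_LABELS.has(label))
    .sort();
  return { type: bead.type, labels };
}
```

Filtering at the dispatch site keeps unrelated labels such as gt:held or gt:convoy out of model resolution, so adding a new lifecycle label cannot change which model a bead gets.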

KILO_CONFIG_CONTENT already supports per-task routing. services/gastown/container/src/agent-runner.ts:39-86 builds an SDK config with agent.code.model, agent.plan.model, agent.title.model, agent.explore.model slots. Today gastown flattens them all to one primary model. If we wanted SDK-side per-task routing later, the wire format can carry it. For dispatch-time-only routing (the realistic v1), the worker resolves once and ships a single string.
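
To make the flattening concrete, here is a minimal sketch. Only the four slot names (code, plan, title, explore) and the `agent.<slug>.model` path come from the description above; the rest of the object shape, the `buildAgentModelConfig` name, and the `perSlot` parameter are assumptions, and the real builder in agent-runner.ts may differ.

```typescript
// The four slugs named above; the surrounding object shape is an assumption.
const AGENT_SLUGS = ["code", "plan", "title", "explore"] as const;
type AgentSlug = (typeof AGENT_SLUGS)[number];

type AgentModelConfig = { agent: Record<AgentSlug, { model: string }> };

// With no perSlot argument this mirrors today's described behaviour:
// every slot receives the single primary model.
// `perSlot` is a hypothetical extension for SDK-side routing later.
function buildAgentModelConfig(
  primary: string,
  perSlot: Partial<Record<AgentSlug, string>> = {},
): AgentModelConfig {
  const agent = {} as Record<AgentSlug, { model: string }>;
  for (const slug of AGENT_SLUGS) {
    agent[slug] = { model: perSlot[slug] ?? primary };
  }
  return { agent };
}
```

Calling `buildAgentModelConfig("claude-sonnet-4.6")` fills all four slots with the same string, while passing `{ title: "claude-haiku-4.5" }` as the second argument changes only the title slot.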


The realistic gap: phase-level routing isn't free

Phase-level selection ("small model for the planning sub-step within a polecat session, big model for implementation") is architecturally precluded without redesign.

Per services/gastown/container/src/process-manager.ts:2010-2090, model changes require an SDK server restart. If taskKind resolves to different models within one bead's session, that's a redesign, not a hook addition. Per-bead-at-dispatch-time selection is straightforward; per-tool-call selection is not.

The honest answer: phase-level routing only works if the kilo SDK does it internally — i.e. we ship agent.title.model = small, the SDK invokes the title agent for that sub-task, and we accept whatever heuristic the SDK uses to decide what a "title" task is. We can't drive it from gastown's config language without a restart-per-phase.

Recommendation: don't expose phase-level config as a user knob in v1. Thread labels[] to resolveModel and let users configure per-(role × label). The phase axis is an SDK-side concern and should stay there.


Configuration UX — the interesting tradeoff

Four config shapes considered:

| Criterion | Flat overrides | Hierarchical | Rule-based selectors | Profile + overrides |
| --- | --- | --- | --- | --- |
| Learnability | High | Medium | Low | Highest |
| Power | Low | Medium | High | Medium-High |
| Failure mode UX | OK | OK | Bad (silent rule errors) | Best (centrally maintained profiles update over time) |
| Migration from current | Easy | Awkward | Easy | Cleanest |
| Surfaceability | Easy | Medium | Strong | Strong |

Recommended: profile + overrides

─── Models ─────────────────────────────────────────────
Profile:         [ Balanced  ▾ ]   ⓘ what's in this profile?

  Balanced expands to:
    Mayor:                claude-opus-4.7
    Polecat (default):    claude-sonnet-4.6
    Polecat (pr-fixup):   claude-haiku-4.5
    Refinery:             claude-sonnet-4.6
    Triage:               claude-haiku-4.5

▸ Advanced overrides (3 set)
  Mayor                            [ claude-opus-4.7    ▾ ]
  Polecat (default)                [ Use profile        ▾ ]
  Polecat — label: gt:pr-fixup     [ claude-haiku-4.5   ▾ ]
  + Add override…

Resolution preview:  [pick a bead ▾]  →  claude-haiku-4.5
                     because: override "Polecat — label: gt:pr-fixup"

Wins:

  • Most users are expected to pick a profile and never touch it. Gastown ships sensible defaults that update centrally as models evolve. No "we shipped Sonnet 4.6 in our config and now Sonnet 5 is out and our rule is silently outdated" failure mode.
  • The "Add override" dialog is a constrained rule-builder: pick a role (required), optionally add label or rig from known values. No free-text rule language. Discoverable.
  • The resolution preview ("pick a bead → see which model and why") is the single most important UI element — turns config from declarative-and-opaque into testable.

What it absorbs cleanly:

  • Existing default_model / small_model users → custom profile capturing current settings.
  • Mayor's "rig override is ignored" quirk stays intact — profile + overrides explicitly scopes which axes apply per role.
  • The latent bead.metadata.model tRPC field can be wired as the highest-precedence override (per-bead override), or left as-is.
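
The profile + overrides resolution, including the "because" line of the resolution preview, can be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: `Rule`, `resolveWithReason`, and `describeRule` are hypothetical names, rig scoping is omitted for brevity, Triage is modelled as a polecat carrying gt:triage, and the precedence (label-specific rules beat role defaults, and within each tier overrides beat the profile) is one plausible choice that the design above does not pin down. Model names are copied from the Balanced mock.

```typescript
type Role = "mayor" | "polecat" | "refinery";

interface Rule {
  role: Role;
  label?: string; // e.g. "gt:pr-fixup"; absent means "role default"
  model: string;
}

interface Resolution {
  model: string;
  because: string;
}

// Contents of the "Balanced" profile from the mock above.
const BALANCED: Rule[] = [
  { role: "mayor", model: "claude-opus-4.7" },
  { role: "polecat", model: "claude-sonnet-4.6" },
  { role: "polecat", label: "gt:pr-fixup", model: "claude-haiku-4.5" },
  { role: "polecat", label: "gt:triage", model: "claude-haiku-4.5" },
  { role: "refinery", model: "claude-sonnet-4.6" },
];

function describeRule(rule: Rule): string {
  return rule.label
    ? `${rule.role}, label: ${rule.label}`
    : `${rule.role} (default)`;
}

function resolveWithReason(
  profileName: string,
  profile: Rule[],
  overrides: Rule[],
  role: Role,
  labels: string[],
): Resolution {
  const layers = [
    { rules: overrides, source: "override" },
    { rules: profile, source: `profile ${profileName}` },
  ];
  // Pass 1 looks for label-specific rules, pass 2 for role defaults.
  for (const labelPass of [true, false]) {
    for (const layer of layers) {
      const hit = layer.rules.find(
        (rule) =>
          rule.role === role &&
          (labelPass
            ? rule.label !== undefined && labels.includes(rule.label)
            : rule.label === undefined),
      );
      if (hit) {
        return {
          model: hit.model,
          because: `${layer.source} "${describeRule(hit)}"`,
        };
      }
    }
  }
  throw new Error(`no rule for role ${role} in profile ${profileName}`);
}
```

Returning the matched rule alongside the model is what makes the preview cheap to build: the UI renders `because` verbatim instead of re-deriving why a model was chosen.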

Gotchas that constrain the redesign

  1. Prewarm and dispatch must agree on the resolved model byte-identically. getMayorPrewarmContext (Town.do.ts:2705-2732) and _ensureMayor (Town.do.ts:2743-2871) both call resolveModel and the prewarmed SDK gets evicted if they disagree. If taskKind enters mayor resolution, prewarm has to know what task is coming. Today there's no "what task will the next mayor message handle" signal. Recommendation: don't add taskKind to mayor resolution. Mayor stays role-only.

  2. Mayor explicitly ignores rig overrides (config.ts:184-185). The schema even forbids role_models.mayor at the rig level. Any new design must preserve this — mayor is town-level only.

  3. Mayor model hot-reload (router.ts:1244-1259) compares resolveModel(old, null, 'mayor') against resolveModel(new, null, 'mayor') to decide whether to restart. Per gotcha 1, mayor stays simple, so this stays simple.

  4. buildContainerConfig calls resolveModel(config, null, '') with empty role (config.ts:334) for the X-Town-Config header default. Needs explicit handling — probably "ignore taskKind, return town default" since this is a dispatch-agnostic fallback.

  5. No fallback / degraded-mode logic exists at the gastown layer. Model fallback is the AI gateway's concern. Per-task selection composes cleanly with gateway-level fallback: gastown picks "what to ask for", gateway picks "what to actually serve". Don't try to put fallback into gastown config.

  6. Triage is structurally underdetermined. The AgentRole enum includes 'triage' but no codepath calls registerAgent({ role: 'triage' }). Triage work runs as role: 'polecat' with the triage system prompt overlaid. If we want triage to use a different model, the cleanest hook is via the gt:triage label, not via making triage a real role.
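
Taken together, the gotchas suggest a shape for a label-aware resolver. The sketch below is a simplified illustration, not the real function: `TownConfigLike` and `RigOverrideLike` are cut-down stand-ins for TownConfig and RigOverrideConfig, `polecat_labels` is a hypothetical config slot, and the assumed precedence (rig before town, label rules before role defaults) is not confirmed by the source.

```typescript
// Simplified stand-ins; the real config types have more fields.
interface RoleModels {
  mayor?: string;
  polecat?: string;
  refinery?: string;
  // Hypothetical addition: label -> model, consulted for polecats only.
  polecat_labels?: Record<string, string>;
}
interface TownConfigLike {
  default_model: string;
  role_models?: RoleModels;
}
interface RigOverrideLike {
  // Gotcha 2: no mayor slot at the rig level.
  role_models?: Omit<RoleModels, "mayor">;
}
interface TaskKind {
  labels?: string[];
}

function resolveModelSketch(
  town: TownConfigLike,
  rig: RigOverrideLike | null | undefined,
  role: string,
  taskKind?: TaskKind,
): string {
  // Gotcha 4: empty role is the dispatch-agnostic fallback; ignore taskKind.
  if (role === "") return town.default_model;

  // Gotchas 1-3: mayor is town-level and role-only, so prewarm, dispatch,
  // and hot-reload detection all see the same string.
  if (role === "mayor") return town.role_models?.mayor ?? town.default_model;

  if (role === "polecat") {
    // Gotcha 6: triage is reached through the gt:triage label on a polecat.
    for (const label of taskKind?.labels ?? []) {
      const hit =
        rig?.role_models?.polecat_labels?.[label] ??
        town.role_models?.polecat_labels?.[label];
      if (hit) return hit;
    }
  }

  const key: "polecat" | "refinery" | null =
    role === "polecat" ? "polecat" : role === "refinery" ? "refinery" : null;
  if (key === null) return town.default_model;
  return rig?.role_models?.[key] ?? town.role_models?.[key] ?? town.default_model;
}
```

Because the mayor branch returns before taskKind or the rig is consulted, the hot-reload comparison from gotcha 3 stays a plain string inequality between `resolveModelSketch(oldConfig, null, "mayor")` and `resolveModelSketch(newConfig, null, "mayor")`.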


The honest argument against doing this

Most users won't touch it, and the ones who do will get it wrong. Anthropic-class model selection is a moving target: Sonnet-of-today is Haiku-of-next-year in capability terms. Users who lock in polecat:gt:pr-fixup → haiku-4.5 today will silently keep getting Haiku-4.5 quality for fixups in 2027, after the right answer has changed.

The profile mechanism is the answer to this — but only if profiles are good and users actually use them. If profiles are mediocre and everyone immediately drops to overrides, the feature becomes a knob factory.

Counter-argument: the cost wins from polecat:gt:pr-fixup → small_model alone are likely meaningful. Look at any active rig — PR fixup beads are a non-trivial fraction of polecat dispatches, they're mechanically simpler than fresh feature work, and they're cleanly observable at dispatch time. That's the strongest single argument for shipping at least the role × label hook.


Recommended scoping (if pursued)

Phase 1 — wire what's free. Thread labels[] to resolveModel, add a label-keyed slot to townConfig.role_models (e.g. role_models.polecat_pr_fixup or a nested role_models.polecat.labels[label] → model map). Settings UI gets a few new dropdowns. No profile system, no rule builder. Ships in a week.

Phase 2 — profile system. Define Frugal, Balanced, Premium profiles centrally. Town config stores profile: <name> plus existing override map. Profile defaults update server-side as models evolve.

Phase 3 — resolution preview UI. Pick-a-bead dropdown that shows resolved model + matched rule. Necessary for trust.

Phase 4 (maybe): Wire the latent bead.metadata.model field as the highest-precedence override. Per-bead model selection from the sling form.

Phases 5+ deferred: Per-tool-call routing (SDK-side concern), cost budgeting, A/B testing, user-level preferences.

Phase 1 alone delivers most of the cost wins. Phases 2-3 are about making the system trustworthy enough that users will actually use it. Phase 4 is "nice to have." Don't skip ahead.


Out of scope for this issue

  • Specific implementation tickets — this is design-space exploration, not a build plan.
  • Cost budgeting / spend caps — separate problem, separate UI.
  • Provider routing (claude-sonnet-4.6 via Anthropic vs via Bedrock) — gateway concern, not user config.
  • Per-end-user overrides — we configure per-town, not per-human.
  • Model A/B testing or shadow runs — premature.
  • Phase-level model selection as a user knob — SDK-side concern.

Decision needed

Whether/when to invest, and in what shape:

  1. Build Phase 1 only (role × label hook, minimal UI) — ships fastest, captures most cost wins.
  2. Build Phases 1–3 (profiles + preview) — ships a real config UX, takes weeks.
  3. Build Phases 1–4 (everything except SDK-side routing) — ships everything that's cleanly worker-resolvable.
  4. Defer entirely — architecture stays favourable; revisit when cycles permit.

No urgency. The hook point is small enough that this is bounded engineering work whenever it gets prioritized.

References

  • resolveModel: services/gastown/src/dos/town/config.ts:178
  • KILO_CONFIG_CONTENT builder (per-task slots already exist): services/gastown/container/src/agent-runner.ts:39-86
  • Prewarm/dispatch agreement: services/gastown/src/dos/Town.do.ts:2705-2732 and Town.do.ts:2743-2871
  • Mayor model hot-reload: services/gastown/src/trpc/router.ts:1244-1259
  • Latent per-bead model field: services/gastown/src/trpc/router.ts:849 (sling mutation accepts model, never read)
  • Agent roles enum: services/gastown/container/src/types.ts:5
  • Bead label semantics: services/gastown/src/dos/town/agents.ts:478-567 (label-driven prime context building)
  • Town settings UI: apps/web/src/app/(app)/gastown/[townId]/settings/TownSettingsPageClient.tsx:306-365
  • Rig settings UI: apps/web/src/app/(app)/gastown/[townId]/rigs/[rigId]/settings/RigSettingsPageClient.tsx:136-183


    Labels

    gt:core (Reconciler, state machine, bead lifecycle, convoy flow)
