[Gastown] Design exploration: per-task-type model selection #3207

## Summary

Explore adding per-task-type model selection to gastown — letting users configure different models for different kinds of agent work (e.g. small/cheap model for `gt:pr-fixup` beads, premium model for fresh feature beads, Opus for mayor planning, etc.).

The architecture is already shaped favourably for this: there's exactly one function (`resolveModel` at `services/gastown/src/dos/town/config.ts:178`) that all four model-resolution call sites route through, and the kilo SDK's `KILO_CONFIG_CONTENT` already supports per-task routing via `agent.<slug>.model` slots that gastown currently flattens to a single primary model.

**This issue is exploratory.** Not committing to building it, not committing to a timeline. Documenting the design space so we can decide whether/when to invest.

---

## Why this might be worth doing

Today every polecat dispatch in a rig resolves to the same model (whatever `role_models.polecat` or `default_model` yields for that rig), regardless of what work it's doing. A polecat fixing a typo, rebasing a conflict, or addressing three review comments runs on the same model as a polecat designing and implementing OAuth from scratch. The cost asymmetry is real:

- **`gt:pr-fixup`** — polecat addresses specific review comments on an existing PR branch. Mostly mechanical edits with bounded scope. Sonnet is overkill.
- **`gt:pr-feedback`** — same shape; polecat addresses CI failures + reviewer comments.
- **`gt:pr-conflict`** — rebase work. Mostly syntactic; only escalates when there's semantic conflict.
- **`gt:rework`** — polecat continues on the same branch addressing refinery feedback. Smaller scope than fresh `issue` beads.
- **`gt:triage`** — single-pass judgment from a triage prompt. Currently runs as `role: polecat` with the same model resolution as full coding sessions.

PR fixups are a non-trivial fraction of polecat dispatches in any active rig and they're cleanly observable at dispatch time. That's the strongest single argument for at least wiring `role × label` into `resolveModel`.

---

## The differentiable axes (from the codebase, not speculation)

The signals that are **structurally observable to the worker today** (the worker being the only place model selection can happen in the current architecture):

| Axis | Values | Where set | Currently affects model? |
|---|---|---|---|
| **Agent role** | `mayor`, `polecat`, `refinery`, `triage` | `services/gastown/container/src/types.ts:5` | Yes — `townConfig.role_models.{mayor,polecat,refinery}` already exists. Triage has no slot (runs as polecat). |
| **Bead label** (`gt:*`) | `gt:rework`, `gt:pr-fixup`, `gt:pr-feedback`, `gt:pr-conflict`, `gt:triage`, `gt:held`, `gt:escalation`, `gt:merge-request`, `gt:convoy`, `gt:molecule`, `gt:message` | Sling-time + lifecycle events | **No** — labels change prompts, branch checkout, review queue routing, but never model selection |
| **Bead type** | `issue`, `merge_request`, `escalation`, `message`, `convoy`, `molecule`, `agent` | Sling | Implicitly via role (refinery picks up `merge_request`); no direct model coupling |
| **Rig** | per-rig `rigOverride` | `services/gastown/src/dos/town/rigs.ts` | Yes for `polecat` and `refinery` only; mayor explicitly ignores rig overrides |
| **Convoy `merge_mode`** | `review-then-land`, `review-and-merge` | Convoy creation | No model effect (only branch-target effect) |
| **Bead metadata `model`** | arbitrary string | `trpc/router.ts:849` | **Stored but never read** — the tRPC `sling` mutation accepts a model field that nothing consumes. Latent feature. |
| **Task phase within a session** | exploration / planning / coding / writing PR description | implicit inside the LLM session | **Not observable** to the worker. Phase boundaries don't exist; one bead = one continuous `kilo serve` session. |

The most leveraged differentiator that's currently unwired is **`role × label`**. Clean cost-asymmetry, fully observable at dispatch time.

---

## The hook point

Exactly **one function** to modify: `resolveModel` at `services/gastown/src/dos/town/config.ts:178`.

```ts
export function resolveModel(
  townConfig: TownConfig,
  rigOverride: RigOverrideConfig | null | undefined,
  role: string,
  // new:
  taskKind?: { labels?: string[]; type?: BeadType }
): string
```
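A minimal sketch of what the extended `resolveModel` could look like. The config shapes are assumptions, not the real `TownConfig` / `RigOverrideConfig`: in particular `label_models` (role → `gt:*` label → model) is a hypothetical Phase 1 field, and the precedence order (label-specific beats role-generic; within each level rig beats town; the first matching bead label wins) is one reasonable choice rather than a decided one. The mayor branch returns before reading `rigOverride` or `taskKind`, which preserves the mayor constraints listed under the gotchas.

```typescript
type BeadType =
  | 'issue' | 'merge_request' | 'escalation' | 'message' | 'convoy' | 'molecule' | 'agent';

type ModelRole = 'mayor' | 'polecat' | 'refinery';

// Assumed shape; the real TownConfig lives in config.ts.
interface TownConfig {
  default_model: string;
  role_models?: Partial<Record<ModelRole, string>>;
  // Hypothetical Phase 1 field: role -> gt:* label -> model.
  label_models?: Record<string, Record<string, string>>;
}

// Assumed shape. Mayor is deliberately absent: rig-level mayor models are forbidden.
interface RigOverrideConfig {
  role_models?: Partial<Record<'polecat' | 'refinery', string>>;
  label_models?: Record<string, Record<string, string>>;
}

interface TaskKind {
  labels?: string[];
  type?: BeadType;
}

// Returns the model for the first bead label that has an entry in `map`.
function firstLabelMatch(
  map: Record<string, string> | undefined,
  labels: string[]
): string | undefined {
  for (const label of labels) {
    const model = map?.[label];
    if (model) return model;
  }
  return undefined;
}

export function resolveModel(
  townConfig: TownConfig,
  rigOverride: RigOverrideConfig | null | undefined,
  role: string,
  taskKind?: TaskKind
): string {
  // Mayor: town-level and role-only, so prewarm and dispatch always agree.
  if (role === 'mayor') {
    return townConfig.role_models?.mayor ?? townConfig.default_model;
  }
  const labels = taskKind?.labels ?? [];
  // Empty role ('' from buildContainerConfig) or unknown roles fall through to the default.
  const roleKey = role === 'polecat' || role === 'refinery' ? role : undefined;
  const rigRole = roleKey ? rigOverride?.role_models?.[roleKey] : undefined;
  const townRole = roleKey ? townConfig.role_models?.[roleKey] : undefined;
  // Label-specific beats role-generic; within each level, rig beats town.
  return (
    firstLabelMatch(rigOverride?.label_models?.[role], labels) ??
    firstLabelMatch(townConfig.label_models?.[role], labels) ??
    rigRole ??
    townRole ??
    townConfig.default_model
  );
}
```

`taskKind.type` is threaded through but unused here; carrying it now means bead type could become a selector later without another signature change.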

Every read site (4 callers) routes through it:
1. `container-dispatch.ts:491` — dispatch payload `model` field on `POST /agents/start`
2. `Town.do.ts:2727` — `getMayorPrewarmContext` (must agree byte-identically with dispatch)
3. `config.ts:334` — `buildContainerConfig` X-Town-Config header default
4. `router.ts:1244-1245` — mayor model change detection in `updateTownConfig`

Threading `taskKind` from the dispatch site:
- `dispatchAgent` in `services/gastown/src/dos/town/scheduling.ts:64-153` — already has the bead in hand; labels are right there
- `startAgentInContainer` params struct in `container-dispatch.ts:346-382` — needs a `taskKind` field
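The threading could look roughly like this. `Bead`, `StartAgentParams`, `paramsForBead`, and `startPayload` are hypothetical stand-ins for the real structures in `scheduling.ts` and `container-dispatch.ts`, and keeping only `gt:*` labels is an assumption, not current behaviour. The point is the split: `dispatchAgent` derives `taskKind` from the bead it already holds, and `startAgentInContainer` resolves once and sends a single `model` string.

```typescript
type BeadType =
  | 'issue' | 'merge_request' | 'escalation' | 'message' | 'convoy' | 'molecule' | 'agent';

// Hypothetical minimal bead shape.
interface Bead {
  id: string;
  type: BeadType;
  labels: string[];
}

interface TaskKind {
  labels?: string[];
  type?: BeadType;
}

// Hypothetical slice of the startAgentInContainer params struct.
interface StartAgentParams {
  beadId: string;
  role: string;
  taskKind?: TaskKind; // the new field
}

// Stand-in for resolveModel with town/rig config already bound.
type ResolveFn = (role: string, taskKind?: TaskKind) => string;

// dispatchAgent side: the bead is in hand, so build taskKind from it.
// Filtering to gt:* labels is an assumed policy so arbitrary user labels never match.
function paramsForBead(bead: Bead, role: string): StartAgentParams {
  return {
    beadId: bead.id,
    role,
    taskKind: { labels: bead.labels.filter((l) => l.startsWith('gt:')), type: bead.type },
  };
}

// startAgentInContainer side: resolve once and ship a single model string
// in the POST /agents/start payload.
function startPayload(params: StartAgentParams, resolve: ResolveFn): { role: string; model: string } {
  return { role: params.role, model: resolve(params.role, params.taskKind) };
}
```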

**`KILO_CONFIG_CONTENT` already supports per-task routing.** `services/gastown/container/src/agent-runner.ts:39-86` builds an SDK config with `agent.code.model`, `agent.plan.model`, `agent.title.model`, `agent.explore.model` slots. Today gastown flattens them all to one primary model. If we wanted SDK-side per-task routing later, the wire format can carry it. For dispatch-time-only routing (the realistic v1), the worker resolves once and ships a single string.
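To make the wire format concrete, here is a sketch of a builder that fills the four `agent.<slug>.model` slots. Only the slot names come from `agent-runner.ts`; the `KiloConfig` type, the `buildKiloConfig` helper, and the assumption that `KILO_CONFIG_CONTENT` carries this object as JSON are illustrative. Called without `perSlot`, it reproduces today's flattening.

```typescript
// Slot names from agent-runner.ts; everything else here is an assumed shape.
type AgentSlot = 'code' | 'plan' | 'title' | 'explore';
const SLOTS: readonly AgentSlot[] = ['code', 'plan', 'title', 'explore'];

interface KiloConfig {
  agent: Record<AgentSlot, { model: string }>;
}

function buildKiloConfig(
  primary: string,
  perSlot: Partial<Record<AgentSlot, string>> = {}
): KiloConfig {
  const agent = {} as Record<AgentSlot, { model: string }>;
  for (const slot of SLOTS) {
    // Today every slot gets the primary model; SDK-side routing would fill perSlot.
    agent[slot] = { model: perSlot[slot] ?? primary };
  }
  return { agent };
}

// Assumed: serialized as JSON into the KILO_CONFIG_CONTENT environment variable.
const kiloConfigContent = JSON.stringify(buildKiloConfig('claude-sonnet-4.6'));
```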

---

## The realistic gap: phase-level routing isn't free

Phase-level selection ("small model for the planning sub-step within a polecat session, big model for implementation") is **architecturally precluded** without redesign.

Per `services/gastown/container/src/process-manager.ts:2010-2090`, model changes require an SDK server restart. If `taskKind` resolves to different models within one bead's session, that's a redesign, not a hook addition. Per-bead-at-dispatch-time selection is straightforward; per-tool-call selection is not.

The honest answer: phase-level routing only works **if the kilo SDK does it internally** — i.e. we ship `agent.title.model = small`, the SDK invokes the title agent for that sub-task, and we accept whatever heuristic the SDK uses to decide what a "title" task is. We can't drive it from gastown's config language without a restart-per-phase.

**Recommendation: don't expose phase-level config as a user knob in v1.** Thread `labels[]` to `resolveModel` and let users configure per-(role × label). The phase axis is an SDK-side concern and should stay there.

---

## Configuration UX — the interesting tradeoff

Four config shapes considered:

| | Flat overrides | Hierarchical | Rule-based selectors | **Profile + overrides** |
|---|---|---|---|---|
| Learnability | High | Medium | Low | **Highest** |
| Power | Low | Medium | **High** | Medium-High |
| Failure mode UX | OK | OK | Bad (silent rule errors) | **Best** (centrally maintained profiles update over time) |
| Migration from current | Easy | Awkward | Easy | **Cleanest** |
| Surfaceability | Easy | Medium | **Strong** | Strong |

### Recommended: profile + overrides

```
─── Models ─────────────────────────────────────────────
Profile:         [ Balanced  ▾ ]   ⓘ what's in this profile?

  Balanced expands to:
    Mayor:                claude-opus-4.7
    Polecat (default):    claude-sonnet-4.6
    Polecat (pr-fixup):   claude-haiku-4.5
    Refinery:             claude-sonnet-4.6
    Triage:               claude-haiku-4.5

▸ Advanced overrides (3 set)
  Mayor                            [ claude-opus-4.7    ▾ ]
  Polecat (default)                [ Use profile        ▾ ]
  Polecat — label: gt:pr-fixup     [ claude-haiku-4.5   ▾ ]
  + Add override…

Resolution preview:  [pick a bead ▾]  →  claude-haiku-4.5
                     because: override "Polecat — label: gt:pr-fixup"
```

Wins:
- Most users would pick a profile and never touch it. Gastown ships sensible defaults that update centrally as models evolve, which avoids the "we shipped Sonnet 4.6 in our config, Sonnet 5 is out, and our rule is silently outdated" failure mode.
- The "Add override" dialog is a constrained rule-builder: pick a role (required), optionally add label or rig from known values. No free-text rule language. Discoverable.
- The **resolution preview** ("pick a bead → see which model and why") is the single most important UI element — turns config from declarative-and-opaque into testable.
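A sketch of the resolution preview's core: given a profile, the user's overrides, and a bead's role and labels, return both the model and the rule that produced it. The `Profile` and `Override` shapes and the `explain` function are hypothetical; the Balanced values are copied from the mockup above. Two choices are assumptions: any label-specific match (override, then profile) beats any role-only match, so a "Polecat (default)" override does not clobber the profile's pr-fixup entry; and triage is encoded as a polecat label, since triage work runs as `role: 'polecat'`.

```typescript
interface Override {
  role: string;
  label?: string; // omitted = applies to the role's default
  model: string;
}

interface Profile {
  name: string;
  roleDefaults: Record<string, string>;
  labelDefaults: Record<string, Record<string, string>>; // role -> label -> model
}

interface Resolution {
  model: string;
  because: string;
}

const BALANCED: Profile = {
  name: 'Balanced',
  roleDefaults: {
    mayor: 'claude-opus-4.7',
    polecat: 'claude-sonnet-4.6',
    refinery: 'claude-sonnet-4.6',
  },
  labelDefaults: {
    polecat: {
      'gt:pr-fixup': 'claude-haiku-4.5',
      'gt:triage': 'claude-haiku-4.5', // triage runs as a polecat
    },
  },
};

function explain(profile: Profile, overrides: Override[], role: string, labels: string[]): Resolution {
  // 1. User override for this role plus one of the bead's labels.
  for (const label of labels) {
    const o = overrides.find((x) => x.role === role && x.label === label);
    if (o) return { model: o.model, because: `override "${role}, label: ${label}"` };
  }
  // 2. Profile entry for this role plus label.
  for (const label of labels) {
    const labelModel = profile.labelDefaults[role]?.[label];
    if (labelModel) return { model: labelModel, because: `profile "${profile.name}": ${role}, label: ${label}` };
  }
  // 3. User override for the role's default.
  const roleOverride = overrides.find((x) => x.role === role && x.label === undefined);
  if (roleOverride) return { model: roleOverride.model, because: `override "${role} (default)"` };
  // 4. Profile default for the role.
  const roleModel = profile.roleDefaults[role];
  if (roleModel) return { model: roleModel, because: `profile "${profile.name}": ${role} (default)` };
  throw new Error(`no model configured for role "${role}"`);
}
```

With the mockup's overrides, a `gt:pr-fixup` polecat bead resolves to `claude-haiku-4.5` with the reason `override "polecat, label: gt:pr-fixup"`, which is exactly the string the preview would render.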

What it absorbs cleanly:
- Existing `default_model` / `small_model` users → custom profile capturing current settings.
- Mayor's "rig override is ignored" quirk stays intact — profile + overrides explicitly scopes which axes apply per role.
- The latent `bead.metadata.model` tRPC field can be wired as the highest-precedence override (per-bead override), or left as-is.

---

## Gotchas that constrain the redesign

1. **Prewarm and dispatch must agree on the resolved model byte-identically.** `getMayorPrewarmContext` (`Town.do.ts:2705-2732`) and `_ensureMayor` (`Town.do.ts:2743-2871`) both call `resolveModel` and the prewarmed SDK gets evicted if they disagree. If `taskKind` enters mayor resolution, prewarm has to know what task is coming. Today there's no "what task will the next mayor message handle" signal. **Recommendation: don't add `taskKind` to mayor resolution.** Mayor stays role-only.

2. **Mayor explicitly ignores rig overrides** (`config.ts:184-185`). The schema even forbids `role_models.mayor` at the rig level. Any new design must preserve this — mayor is town-level only.

3. **Mayor model hot-reload** (`router.ts:1244-1259`) compares `resolveModel(old, null, 'mayor')` against `resolveModel(new, null, 'mayor')` to decide whether to restart. Per gotcha 1, mayor stays simple, so this stays simple.

4. **`buildContainerConfig` calls `resolveModel(config, null, '')`** with empty role (`config.ts:334`) for the X-Town-Config header default. Needs explicit handling — probably "ignore taskKind, return town default" since this is a dispatch-agnostic fallback.

5. **No fallback / degraded-mode logic exists at the gastown layer.** Model fallback is the AI gateway's concern. Per-task selection composes cleanly with gateway-level fallback: gastown picks "what to ask for", gateway picks "what to actually serve". Don't try to put fallback into gastown config.

6. **Triage is structurally underdetermined.** The `AgentRole` enum includes `'triage'` but no codepath calls `registerAgent({ role: 'triage' })`. Triage work runs as `role: 'polecat'` with the triage system prompt overlaid. If we want triage to use a different model, the cleanest hook is via the `gt:triage` label, not via making triage a real role.
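A sketch of why gotchas 1, 3 and 4 stay simple if the mayor path never reads `taskKind`. `resolveMayorModel`, `mayorNeedsRestart`, and `containerHeaderDefault` are hypothetical names for what would really be branches of `resolveModel` and the check in `updateTownConfig`; the mayor's fallback to `default_model` and the `label_models` field are assumed. The useful property: editing per-label polecat overrides can never trigger a mayor restart.

```typescript
interface TownConfig {
  default_model: string;
  role_models?: { mayor?: string; polecat?: string; refinery?: string };
  // Hypothetical per-label map from Phase 1; never read for the mayor.
  label_models?: Record<string, Record<string, string>>;
}

// Gotcha 1: town-level, role-only. No rig, no taskKind, so
// getMayorPrewarmContext and _ensureMayor compute the same string.
function resolveMayorModel(town: TownConfig): string {
  return town.role_models?.mayor ?? town.default_model;
}

// Gotcha 3: same shape as the hot-reload check; restart the mayor only
// if its resolved model string actually changed.
function mayorNeedsRestart(oldCfg: TownConfig, newCfg: TownConfig): boolean {
  return resolveMayorModel(oldCfg) !== resolveMayorModel(newCfg);
}

// Gotcha 4: the X-Town-Config header default ignores role and taskKind entirely.
function containerHeaderDefault(town: TownConfig): string {
  return town.default_model;
}
```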

---

## The honest argument against doing this

Most users won't touch it, and many of the ones who do will get it wrong. Model-tier selection (Opus / Sonnet / Haiku) is a moving target: today's Sonnet is next year's Haiku in capability terms. Users who lock in `polecat:gt:pr-fixup → haiku-4.5` today will silently keep getting Haiku 4.5 quality for fixups in 2027, long after the right answer has changed.

The profile mechanism is the answer to this — but **only if profiles are good and users actually use them.** If profiles are mediocre and everyone immediately drops to overrides, the feature becomes a knob factory.

Counter-argument: the cost wins from `polecat:gt:pr-fixup → small_model` alone are likely meaningful. Look at any active rig — PR fixup beads are a non-trivial fraction of polecat dispatches, they're mechanically simpler than fresh feature work, and they're cleanly observable at dispatch time. That's the strongest single argument for shipping at least the role × label hook.

---

## Recommended scoping (if pursued)

**Phase 1 — wire what's free.** Thread `labels[]` to `resolveModel`, add a label-keyed slot to `townConfig.role_models` (e.g. `role_models.polecat_pr_fixup` or a nested `role_models.polecat.labels[label] → model` map). Settings UI gets a few new dropdowns. No profile system, no rule builder. Ships in a week.
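The two Phase 1 shapes differ mainly in lookup and migration. A sketch of both, assuming `role_models.polecat` is currently a plain model string; `flatSlotKey`, `lookupFlat`, and `lookupNested` are hypothetical helpers.

```typescript
// Flat option: every value stays a string; keys like `polecat_pr_fixup` are derived.
type FlatRoleModels = Record<string, string | undefined>;

// Nested option: each role becomes an object with a default and a label map.
type NestedRoleModels = Record<
  string,
  { default?: string; labels?: Record<string, string> } | undefined
>;

// 'polecat' + 'gt:pr-fixup' -> 'polecat_pr_fixup'
function flatSlotKey(role: string, label: string): string {
  return `${role}_${label.replace(/^gt:/, '').replace(/-/g, '_')}`;
}

function lookupFlat(rm: FlatRoleModels, role: string, labels: string[]): string | undefined {
  for (const label of labels) {
    const model = rm[flatSlotKey(role, label)];
    if (model) return model;
  }
  return rm[role]; // fall back to the existing role slot
}

function lookupNested(rm: NestedRoleModels, role: string, labels: string[]): string | undefined {
  const entry = rm[role];
  for (const label of labels) {
    const model = entry?.labels?.[label];
    if (model) return model;
  }
  return entry?.default;
}
```

The flat shape keeps every `role_models` value a string, so existing configs load unchanged, but keys come from string mangling (`gt:pr-fixup` and a hypothetical `gt:pr_fixup` would collide) and the key space grows with each label. The nested shape keeps labels verbatim but turns `role_models.polecat` from a string into an object, which needs a migration.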

**Phase 2 — profile system.** Define `Frugal`, `Balanced`, `Premium` profiles centrally. Town config stores `profile: <name>` plus existing override map. Profile defaults update server-side as models evolve.

**Phase 3 — resolution preview UI.** Pick-a-bead dropdown that shows resolved model + matched rule. Necessary for trust.

**Phase 4 (maybe):** Wire the latent `bead.metadata.model` field as the highest-precedence override. Per-bead model selection from the sling form.

**Phases 5+ deferred:** Per-tool-call routing (SDK-side concern), cost budgeting, A/B testing, user-level preferences.

Phase 1 alone delivers most of the cost wins. Phases 2-3 are about making the system trustworthy enough that users will actually use it. Phase 4 is "nice to have." Don't skip ahead.

---

## Out of scope for this issue

- Specific implementation tickets — this is design-space exploration, not a build plan.
- Cost budgeting / spend caps — separate problem, separate UI.
- Provider routing (`claude-sonnet-4.6 via Anthropic` vs `via Bedrock`) — gateway concern, not user config.
- Per-end-user overrides — we configure per-town, not per-human.
- Model A/B testing or shadow runs — premature.
- Phase-level model selection as a user knob — SDK-side concern.

## Decision needed

Whether/when to invest, and in what shape:

1. Build **Phase 1 only** (role × label hook, minimal UI) — ships fastest, captures most cost wins.
2. Build **Phases 1–3** (profiles + preview) — ships a real config UX, takes weeks.
3. Build **Phases 1–4** (everything except SDK-side routing) — ships everything that's cleanly worker-resolvable.
4. **Defer entirely** — architecture stays favourable; revisit when cycles permit.

No urgency. The hook point is small enough that this is bounded engineering work whenever it gets prioritized.

## References

- `resolveModel`: `services/gastown/src/dos/town/config.ts:178`
- `KILO_CONFIG_CONTENT` builder (per-task slots already exist): `services/gastown/container/src/agent-runner.ts:39-86`
- Prewarm/dispatch agreement: `services/gastown/src/dos/Town.do.ts:2705-2732` and `Town.do.ts:2743-2871`
- Mayor model hot-reload: `services/gastown/src/trpc/router.ts:1244-1259`
- Latent per-bead model field: `services/gastown/src/trpc/router.ts:849` (sling mutation accepts `model`, never read)
- Agent roles enum: `services/gastown/container/src/types.ts:5`
- Bead label semantics: `services/gastown/src/dos/town/agents.ts:478-567` (label-driven `prime` context building)
- Town settings UI: `apps/web/src/app/(app)/gastown/[townId]/settings/TownSettingsPageClient.tsx:306-365`
- Rig settings UI: `apps/web/src/app/(app)/gastown/[townId]/rigs/[rigId]/settings/RigSettingsPageClient.tsx:136-183`

