Goal
The "what-if" tree. Replay a session up to message K, swap the model (or system prompt, or tools), regenerate from there, persist the fork as a child session linked to the parent. Lets you ask "would Sonnet have done this in fewer turns?" empirically.
Why now
Comparative benchmark (Spec 26) needs this primitive. And separately, this is the killer "blackbox" feature — replay + fork is what makes a flight recorder useful.
Schema
v021 — session_forks:
CREATE TABLE session_forks (
fork_session_id TEXT PRIMARY KEY, -- the new session id
parent_session_id TEXT NOT NULL,
forked_at_message_seq INTEGER NOT NULL,
forked_at_ts TEXT NOT NULL,
fork_reason TEXT NOT NULL, -- 'manual' | 'benchmark' | 'what-if'
swap_model TEXT, -- new model id (NULL = unchanged)
swap_system_prompt TEXT, -- new system prompt (NULL = unchanged)
swap_tools_json TEXT, -- new tool list (NULL = unchanged)
status TEXT NOT NULL, -- 'running' | 'complete' | 'failed' | 'cancelled'
created_ts TEXT NOT NULL,
completed_ts TEXT,
raw_outcome_json TEXT
);
CREATE INDEX idx_sf_parent ON session_forks(parent_session_id);
Plus add is_fork BOOLEAN and parent_session_id TEXT to sessions (additive).
User-visible surface
- CLI:
stackunderflow fork session <id> --at-message <seq> --swap-model <new> [--swap-system-prompt FILE] [--reason what-if].
- CLI:
stackunderflow fork list <parent_session_id>.
- API:
POST /api/playback/{id}/fork with body {at_message_seq, swap_model?, swap_system_prompt?, reason} returns the new fork session id; GET /api/playback/{id}/forks lists children.
- Meta-agent tool:
fork_session(id, at_message_seq, swap_model).
- UI: extend Playback tab with a "Fork from here" button on each event.
Implementation plan
- v021 migration.
- New service
stackunderflow/services/session_fork.py:
fork(conn, parent_id, at_seq, *, swap_model, swap_system_prompt, swap_tools, reason) — uses Spec 24 (reconstruct_context_at) to rebuild the model context, calls Ollama (or queues for cloud-API depending on swap_model), persists assistant turns into a new session.
- Background-task runner (similar to backfill_jobs.py pattern) — fork can take minutes; don't block the API.
- CLI + API + meta-agent.
- UI button.
Tests
- Simple fork: 3-message parent, fork at msg 2, swap model → assert child session exists with parent link, msg 1 + msg 2 copied verbatim, msg 3+ are new.
- Mocked LLM: assert the LLM was called with the reconstructed context (per Spec 24).
- Cancellation: fork running, user cancels → status='cancelled', no zombie data.
- Idempotency: forking the same parent at the same seq with the same swap_model returns existing fork (don't duplicate).
Hard parts
- LLM call orchestration. Forks against Ollama are simple (use the meta-agent route's machinery). Forks against cloud APIs need credentials + cost-cap (default refuse cloud; require explicit
--allow-cloud + budget cap).
- Tool execution during fork. If the original session called
Bash, what does the fork do? v1 answer: don't actually execute tools — record the model's tool-call intent only. v2 could execute in a sandboxed dir. Document this constraint loudly.
- Context cost. Reconstructing 200K tokens of context per fork costs real money. Surface estimated cost before forking; refuse if over a threshold.
Out of scope
- Sandboxed tool execution during fork (defer).
- Cloud-API forks by default (opt-in only, with budget cap).
- "Fork tree" visualization (defer to v2).
Dependencies
- Blocked by: Spec 24 (context-window replay).
- Consumed by Spec 26 (comparative benchmark).
Estimated effort
Size XL — single agent, ~3-4 hr. Background-job orchestration + LLM-loop machinery is the bulk.
Hard rules
- DO NOT touch versions / CHANGELOG headings.
- Pre-assigned schema slot: v021.
- Branch:
feat/session-fork-mode off main.
- Default to LOCAL Ollama only; cloud requires explicit opt-in flag.
Goal
The "what-if" tree. Replay a session up to message K, swap the model (or system prompt, or tools), regenerate from there, persist the fork as a child session linked to the parent. Lets you ask "would Sonnet have done this in fewer turns?" empirically.
Why now
Comparative benchmark (Spec 26) needs this primitive. And separately, this is the killer "blackbox" feature — replay + fork is what makes a flight recorder useful.
Schema
v021 —
session_forks:Plus add
is_fork BOOLEANandparent_session_id TEXTtosessions(additive).User-visible surface
stackunderflow fork session <id> --at-message <seq> --swap-model <new> [--swap-system-prompt FILE] [--reason what-if].stackunderflow fork list <parent_session_id>.POST /api/playback/{id}/forkwith body{at_message_seq, swap_model?, swap_system_prompt?, reason}returns the new fork session id;GET /api/playback/{id}/forkslists children.fork_session(id, at_message_seq, swap_model).Implementation plan
stackunderflow/services/session_fork.py:fork(conn, parent_id, at_seq, *, swap_model, swap_system_prompt, swap_tools, reason)— uses Spec 24 (reconstruct_context_at) to rebuild the model context, calls Ollama (or queues for cloud-API depending on swap_model), persists assistant turns into a new session.Tests
Hard parts
--allow-cloud+ budget cap).Bash, what does the fork do? v1 answer: don't actually execute tools — record the model's tool-call intent only. v2 could execute in a sandboxed dir. Document this constraint loudly.Out of scope
Dependencies
Estimated effort
Size XL — single agent, ~3-4 hr. Background-job orchestration + LLM-loop machinery is the bulk.
Hard rules
feat/session-fork-modeoff main.