Goal
Recommend the cheapest model that fits the task, based on the user's own past sessions. v1 is a heuristic: pattern-match the prompt and tool-call shape against historical sessions, then suggest the model that solved similar tasks at the lowest cost.
Why now
Teams routinely run expensive models on tasks a cheaper model could handle, and today they have no data to show it. A heuristic v1 is shippable now without the full benchmark engine (Spec 26, which becomes v2).
Schema
v016 — mode_recommendations table:
id INTEGER PRIMARY KEY
task_pattern_hash TEXT (md5 of normalized prompt features)
recommended_model TEXT
confidence REAL ([0, 1])
evidence_session_ids TEXT (JSON array)
created_ts TEXT
last_used_ts TEXT
INDEX (task_pattern_hash)
Additive, IF NOT EXISTS-guarded.
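A minimal sketch of what the v016 migration could look like, assuming the store is SQLite (the column types suggest it, but the spec does not say so). The helper name apply_v016 and the index name are hypothetical; only the table and column names come from the list above.

```python
import sqlite3

# Assumption: SQLite store. Table and column names follow the spec;
# the index name and the helper name are illustrative only.
MIGRATION_V016 = """
CREATE TABLE IF NOT EXISTS mode_recommendations (
    id                   INTEGER PRIMARY KEY,
    task_pattern_hash    TEXT,   -- md5 of normalized prompt features
    recommended_model    TEXT,
    confidence           REAL,   -- expected range [0, 1], enforced by the caller
    evidence_session_ids TEXT,   -- JSON array
    created_ts           TEXT,
    last_used_ts         TEXT
);
CREATE INDEX IF NOT EXISTS idx_mode_recommendations_hash
    ON mode_recommendations (task_pattern_hash);
"""


def apply_v016(conn: sqlite3.Connection) -> None:
    """Apply the additive migration; safe to run more than once."""
    conn.executescript(MIGRATION_V016)
    conn.commit()


conn = sqlite3.connect(":memory:")
apply_v016(conn)
apply_v016(conn)  # second run is a no-op thanks to IF NOT EXISTS
```

Because both statements are guarded, re-running the migration against an already-upgraded store leaves it unchanged.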
User-visible surface
- CLI: stackunderflow recommend mode --prompt "<text>" [--current-model X] returns recommendation + cost-delta.
- MCP tool: recommend_mode(prompt, current_model?) returns same.
- Meta-agent tool: same.
- UI: optional — defer to v2.
Implementation plan
- New service stackunderflow/services/mode_recommender.py:
  - _extract_features(prompt): pull intent (build/fix/refactor/test/explore), language hints, file-mention count, code-block count.
  - _hash_features(features) -> str: stable hash for caching.
  - _find_similar_past_sessions(conn, features, limit=20): substring + intent match.
  - recommend(conn, prompt, current_model=None) -> dict.
- Cache results in mode_recommendations (24h TTL).
- CLI + MCP + meta-agent wiring.
Tests
- Feature-extraction snapshots (prompt → feature dict).
- Recommendation shape on a seeded store with 5+ historical sessions.
- Cache hit + miss + TTL eviction.
- Empty store returns confidence=0.0 with a clean message ("no historical data").
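The cache and empty-store cases could be exercised against small helpers like the ones below. This is a sketch under assumptions: created_ts is taken to be an ISO-8601 string with a UTC offset, and the helper names and result keys other than confidence are hypothetical.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(hours=24)  # 24h TTL from the implementation plan


def is_cache_fresh(created_ts: str, now: datetime,
                   ttl: timedelta = CACHE_TTL) -> bool:
    """Return True while a cached row is younger than the TTL.

    Assumes created_ts is timezone-aware ISO-8601, e.g.
    "2025-01-01T00:00:00+00:00", and that now is timezone-aware too.
    """
    created = datetime.fromisoformat(created_ts)
    return now - created < ttl


def empty_store_result() -> dict:
    # Hypothetical result shape for a store with no historical sessions.
    return {
        "recommended_model": None,
        "confidence": 0.0,
        "message": "no historical data",
    }


now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = is_cache_fresh("2025-01-02T00:00:00+00:00", now)    # 12h old
expired = is_cache_fresh("2025-01-01T00:00:00+00:00", now)  # 36h old
```

Passing now explicitly, rather than reading the clock inside the helper, lets the TTL-eviction test run deterministically.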
Hard parts
- Defining "similar task" is squishy. Start simple: same intent + token-count band + language overlap. Document the heuristic; mark it v1 explicitly so users don't expect ML.
- Cost-delta math needs to use post-v0.8.0 usage_events.cost_usd (not the legacy compute_cost path).
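One way the v1 similarity rule and the cost delta could be sketched. The band boundaries, the token_count feature key, and the definition of cost delta as a difference of per-session mean costs are all assumptions for illustration; the spec only fixes the three similarity criteria and the cost source (usage_events.cost_usd).

```python
from statistics import mean


def token_band(token_count: int) -> str:
    # Assumption: three coarse bands; the thresholds are placeholders.
    if token_count < 200:
        return "small"
    if token_count < 2000:
        return "medium"
    return "large"


def is_similar(a: dict, b: dict) -> bool:
    """v1 heuristic: same intent, same token-count band, language overlap."""
    return (
        a["intent"] == b["intent"]
        and token_band(a["token_count"]) == token_band(b["token_count"])
        and bool(set(a["languages"]) & set(b["languages"]))
    )


def cost_delta(current_costs: list[float],
               recommended_costs: list[float]) -> float:
    """Mean per-session cost of the current model minus the recommended one.

    Inputs are assumed to be cost_usd values already read from usage_events;
    a positive result means the recommendation is cheaper.
    """
    return mean(current_costs) - mean(recommended_costs)


prompt_features = {"intent": "fix", "token_count": 150, "languages": ["python"]}
past_session = {"intent": "fix", "token_count": 90, "languages": ["python", "rust"]}
similar = is_similar(prompt_features, past_session)
delta = cost_delta([0.40, 0.60], [0.10, 0.20])
```

In this sketch two feature sets with no detected languages never count as similar, because their overlap is empty; whether that is the desired behavior is a choice the documented heuristic should state.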
Out of scope
- LLM-based feature extraction (defer to v2).
- Per-team / per-project recommenders (defer).
- The full comparative benchmark (Spec 26).
Dependencies
- None blocking. Builds on v0.8.0 cost-fix.
Estimated effort
Size M — single agent, ~1 hr.
Hard rules
- DO NOT touch versions / CHANGELOG headings.
- Pre-assigned schema slot: v016.
- Branch: feat/mode-recommender-v1 off main.