Skip to content

Spec 18: mode recommender (heuristic v1) — 'this task fits a Sonnet, you used Opus' #88

@0bserver07

Description

@0bserver07

Goal

Recommend the cheapest model that fits the task, based on the user's own past sessions. v1 is a heuristic — pattern-match the prompt + tool-call shape against historical sessions, suggest the model that solved similar tasks at lowest cost.

Why now

Every team is overpaying for the wrong model on the wrong task today and nobody can prove it. A heuristic v1 is shippable now without the full benchmark engine (Spec 26 — that's the v2).

Schema

v016 — mode_recommendations table:

  • id INTEGER PRIMARY KEY
  • task_pattern_hash TEXT (md5 of normalized prompt features)
  • recommended_model TEXT
  • confidence REAL ([0, 1])
  • evidence_session_ids TEXT (JSON array)
  • created_ts TEXT
  • last_used_ts TEXT
  • INDEX (task_pattern_hash)

Additive, IF NOT EXISTS-guarded.

User-visible surface

  • CLI: stackunderflow recommend mode --prompt "<text>" [--current-model X] returns recommendation + cost-delta.
  • MCP tool: recommend_mode(prompt, current_model?) returns same.
  • Meta-agent tool: same.
  • UI: optional — defer to v2.

Implementation plan

  1. New service stackunderflow/services/mode_recommender.py:
    • _extract_features(prompt) — pull intent (build/fix/refactor/test/explore), language hints, file-mention count, code-block count.
    • _hash_features(features) -> str — stable hash for caching.
    • _find_similar_past_sessions(conn, features, limit=20) — substring + intent match.
    • recommend(conn, prompt, current_model=None) -> dict.
  2. Cache results in mode_recommendations (24h TTL).
  3. CLI + MCP + meta-agent wiring.

Tests

  • Feature-extraction snapshots (prompt → feature dict).
  • Recommendation shape on a seeded store with 5+ historical sessions.
  • Cache hit + miss + TTL eviction.
  • Empty-store returns confidence=0.0 with a clean message ("no historical data").

Hard parts

  • Defining "similar task" is squishy. Start simple: same intent + token-count band + language overlap. Document the heuristic; mark it v1 explicitly so users don't expect ML.
  • Cost-delta math needs to use post-v0.8.0 usage_events.cost_usd (not the legacy compute_cost path).

Out of scope

  • LLM-based feature extraction (defer to v2).
  • Per-team / per-project recommenders (defer).
  • The full comparative benchmark (Spec 26).

Dependencies

  • None blocking. Builds on v0.8.0 cost-fix.

Estimated effort

Size M — single agent, ~1 hr.

Hard rules

  • DO NOT touch versions / CHANGELOG headings.
  • Pre-assigned schema slot: v016.
  • Branch: feat/mode-recommender-v1 off main.

Metadata

Metadata

Assignees

No one assigned

    Labels

    size-m~1 hr agent runspecSpec/feature for an agent to implementwave-1Wave 1: independent foundations

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions