feat(sizing): complexity-scale-based derivation for all Claude sessions#101
Conversation
… CLI
Introduces lib/sizing.py as the single source of truth for converting a
task complexity scale (0.0–1.0) into (model, effort, max_turns,
max_budget_usd). Used by both the scheduler (at tick time) and the CLI
(for doctor sizing, `clauck size`, `clauck config doctor`, and
`clauck inspect` resolved-param display).
Why
Prior doctor sizing lived inside the classifier prompt and asked the
classifier to emit dollar values. Classifiers under load anchor on
literal dollars in the prompt rather than computing from rates — a
structural failure that produced repeat `Reached maximum budget ($0.25)`
truncations across two prior attempts at prompt-only fixes. Moving the
arithmetic out of the prompt eliminates that class of failure.
What this commit does
- lib/sizing.py: banded SCALE_PARAMS lookup (haiku/sonnet/opus tiers) +
context-growth-aware compute_sizing() + resolve_params() with
per-field override semantics + load/save of the `doctor` config block
(min/max/headroom/scale_skew/auto-skew knobs) + atomic config writes.
Rates calibrated against Anthropic 4.x API list prices with a
midpoint-average context-growth multiplier.
- lib/clauck: new doctor interpreter prompt emits
{interpretation, task_complexity_scale} only — no model, turns, or
dollars. Python derives those via sizing.compute_sizing. Doctor
prints the full sizing breakdown before stage-2 for transparency.
Adds auto-skew: on "Reached maximum budget" bumps config.scale_skew
by auto_skew_increment (capped at auto_skew_cap); on clean runs
decays by increment/2 (floored at 0) — self-balancing tuner.
Interpreter runs with --effort low for cheapest classification.
New subcommands: `clauck config doctor [key [val]]` (view/edit
doctor config) and `clauck size <scale> [--ctx N]` (inspect what
the formula derives). Updates semantic interpreter and job-creation
prompts to prefer `complexity:` in new job frontmatter.
cmd_validate recognizes `complexity`, validates [0.0, 1.0] range,
and warns when no sizing is declared. cmd_inspect shows resolved
params with per-field provenance (derived / override / default).
YAML subset parser now strips inline `#` comments respecting quotes.
- lib/scheduler.py: discover_jobs routes the three job-dict builds
(flat, module anchor, module stage) through sizing.resolve_params
so complexity-based frontmatter derives the four sizing fields at
tick time, and legacy jobs continue to work unchanged via the
per-field override path. YAML parser gets the same inline-comment
fix to stay consistent with the CLI parser.
- install.sh / uninstall.sh: place and remove lib/sizing.py alongside
the other runtime Python files in ~/.clauck/.
Backward compatibility
- Jobs without `complexity:` continue to work. resolve_params falls
back to explicit `max_turns` / `max_budget_usd` / `effort` / `model`
fields, then to LEGACY_DEFAULTS (50 turns, $2.00, high effort).
- Jobs with both `complexity:` and an explicit field use the explicit
value for THAT field only. Override granularity is per-field so a
job can pin max_turns while leaving everything else derived.
- The manifest schema is unchanged — only the values flowing through
it shift from "raw from frontmatter" to "resolved by sizing.py".
Replaces explicit max_turns/max_budget_usd/effort/model with a single
`complexity:` field in each marketplace job. The scheduler derives
all four sizing fields at fire time via lib/sizing.py.
Scale assignments:
- 0.05: dynamic-context-demo, module-demo (trivial/pedagogical)
- 0.10: event-monitor, git-commit-nudge (single-trigger check)
- 0.15: daily-verify, downloads-triage,
workspace-cleanup (focused single-source)
- 0.20: github-issue-watcher, github-pr-digest (multi-query digest)
- 0.25: inbox-zero-assist (scan + synthesis)
- 0.35: morning-brief (3-source synthesis)
module-demo retains explicit max_turns: 1 / max_budget_usd: 0.02 as
overrides — pedagogical minimums below min_budget_usd floor. This
demonstrates the override pattern as a teaching example.
All jobs pass through `clauck validate` with the new complexity
recognition, and `sizing.resolve_params` derives sensible defaults
for each scale. No behavioral change to firing logic; budgets shift
to the values the formula picks (generally comparable to or slightly
more generous than the hand-tuned values they replace).
- CLAUDE.md: new "Cost policy" section covering the scale formula, per-field override semantics, legacy compat, auto-skew, and the single-source-of-truth rule (lib/sizing.py). Frontmatter schema extended to document `complexity:`. - INTENT.md: §4 promoted from three architectural properties to four — "Cost transparency" added. Cost as first-class (non-negotiable #4) explained HOW via this property: single formula, single config surface, inspectable from CLI by humans and agents, self-correcting on truncation. §7 Cost policy row updated to describe the declare-and-derive pattern instead of raw "declare + log." - README.md: Cost section leads with the transparency pitch and points at `clauck size <scale>` + `clauck config doctor` for introspection. - skill/clauck/SKILL.md: new "Cost & sizing" reference section before the frontmatter schema. Table of scale anchors, config knobs, auto-skew semantics, legacy compat. Frontmatter schema entry extended to document `complexity:` with override semantics.
B1 — auto-skew false-positive detection
Replaced substring search on stdout+stderr ("Reached maximum budget") with
structured-field detection on the claude CLI envelope (subtype ==
"error_max_budget_usd" OR matching entries in envelope.errors). The old
detection would false-positive on any diagnostic report that mentioned
budget truncation in its output (very common once users start debugging
this feature). Also short-circuits the skew bump when the current run's
budget was already clamped by max_budget_usd — bumping skew cannot help
if the config ceiling is the binding constraint; the code now prints a
direct pointer to `clauck config doctor max_budget_usd <value>` instead.
B2 — max_budget_usd default disagreement + silent opus clamping
- DEFAULT_DOCTOR_CONFIG["max_budget_usd"] raised from $10 to $25 so the
high-end opus bands (scale 0.85–0.95) are not silently clamped by
default. Scale 1.00 still clamps at $25 as a genuine circuit breaker.
- compute_sizing now reads all fallbacks from DEFAULT_DOCTOR_CONFIG keys
instead of literal magic numbers at the call site (previously
max_budget_usd fallback was 5.00 in code but 10.00 in defaults — the
two paths now agree by construction).
- cmd_size prints a visible "⚠ budget clamped" warning when the raw
derivation exceeds max_budget_usd, pointing at the config command to
raise it. Satisfies INTENT.md §4 "set a budget the user cannot audit
violates this property" — clamps are now surfaced, not silent.
B3 — missing `effort: low` band removed from expressible space
Added a bottom band: (ceiling=0.07, haiku, low, 3 turns, $0.012/turn).
Re-scaled the marketplace jobs whose original `effort: low` was getting
upgraded to medium/high under the previous table: event-monitor,
git-commit-nudge → complexity 0.05 (low tier, preserves original
effort: low exactly); daily-verify, downloads-triage, workspace-cleanup
→ complexity 0.10 (medium tier, closer match than the prior 0.15).
Band ceilings throughout SCALE_PARAMS shifted slightly to accommodate
the new entry; scale-anchor tables updated in doctor interpreter prompt,
SKILL.md, and CLAUDE.md.
B4 — INTENT.md stale "three" + version not bumped
- §8 "When checking for drift" line updated: "three architectural
properties" → "four".
- Version header bumped v1 → v2 per §8 ("Each amendment bumps it").
B5 — semantic interpreter contract-drift acknowledgment
The natural-language semantic interpreter (clauck <anything>) still
emits exec_model/exec_effort/exec_max_turns/exec_max_budget_usd in its
routing JSON — a known gap vs the new §4 Cost transparency property.
Refactoring that path touches more consumers than doctor did; tracked
for follow-up rather than extended into this PR. INTENT.md §4 now
explicitly names the scope of v2 compliance (doctor + scheduler + CLI
surfaces + marketplace jobs, all fully compliant) and flags the
semantic path as a known gap that must be closed before v3.
_build_interpreter_prompt gets a docstring pointing at the same gap
so any future maintainer sees it locally, not only in INTENT.md.
N1 — scale_skew cross-contamination into scheduled jobs (high-value non-blocker)
scale_skew is a doctor-scoped tuner: it auto-bumps when doctor hits
budget, reflecting doctor task shapes. Previously every scheduled job's
sizing read the same skew, so a user whose doctor had been auto-bumped
to +0.20 would see every cron-fired job also run hotter despite never
having truncated. Scheduler now loads its sizing config via a new
_load_scheduler_sizing_config() that forces scale_skew=0 while keeping
the other knobs (min/max/headroom/growth). Also lifted the config load
out of _resolve_sizing into discover_jobs so the full tick shares one
disk read (~30k reads/day saved at ~20 jobs × 60s ticks).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f1b9296e23
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| fm_for_sizing = _parse_fm_block(fm_match.group(1)) | ||
| except OSError: | ||
| pass | ||
| resolved = sizing.resolve_params(fm_for_sizing, 0, sizing.load_doctor_config()) |
There was a problem hiding this comment.
Align
inspect sizing with scheduler skew rules
cmd_inspect resolves complexity with sizing.load_doctor_config(), which includes the current doctor.scale_skew, but scheduled jobs are resolved with scale_skew forcibly set to 0.0 in scheduler._load_scheduler_sizing_config(). When doctor auto-skew has been bumped, clauck inspect <job> will show inflated model/turn/budget values that do not match what the scheduler will actually run, which undermines the new “resolved params” output and can lead users to make incorrect tuning decisions.
Useful? React with 👍 / 👎.
| findings.append(("error", f"complexity must be in [0.0, 1.0] (got {c})")) | ||
| else: | ||
| has_complexity = True | ||
| sized = sizing.compute_sizing(c, 0, sizing.load_doctor_config()) |
There was a problem hiding this comment.
Use scheduler-equivalent config in complexity validation
The complexity preview in _validate_job_file calls sizing.compute_sizing(..., sizing.load_doctor_config()), so validation output is skewed by doctor.scale_skew, while scheduler execution intentionally ignores that skew (scale_skew = 0.0). This means clauck validate can report a derived tier/budget that is different from runtime behavior for the same job, producing misleading validation results right where users expect authoritative sizing feedback.
Useful? React with 👍 / 👎.
…102) ## Summary Empirical hard rule: **haiku is not a viable choice when the user's full MCP surface is loaded.** Tool descriptions across a typical MCP surface total ~150k tokens — combined with system prompts and tool results, this regularly exceeds haiku's effective working context and sends sessions into a compaction loop that burns budget without progress. Follows directly from the conversation around PR #101: we added context-*cost* tracking (dollars per turn) but not context-*fit* guardrails (does everything load before compaction). This PR adds one minimal, explicit fit rule. A more sophisticated window-fit layer (MODEL_CONTEXT_WINDOW table + per-model thresholds + measure-once MCP estimate) is tracked as a follow-up if this rule proves insufficient in practice. ## What changed - \`lib/sizing.py\` — new \`_promote_for_mcp\` helper + \`strict_mcp\` kwarg on \`compute_sizing\`. When \`strict_mcp=False\` (default), any haiku derivation bumps to the nearest sonnet band matching current effort (low→medium, medium→medium, high→high). Turns stay unchanged — we need a bigger **model**, not more steps. The promotion surfaces in the returned dict (\`mcp_promoted: bool\`) and in the \`explanation\` string. - \`resolve_params\` — reads \`strict_mcp_config\` from frontmatter, passes through to \`compute_sizing\`. - \`cmd_doctor\` — passes \`strict_mcp=False\` explicitly. Doctor's stage-2 does not pass \`--strict-mcp-config\` because the diagnostic agent needs MCP for \`--fix\` actions (post Slack escalations, query GitHub, etc.). So any haiku-tier scale auto-promotes. Eliminates the compaction-loop budget-exceeded failure mode that sparked PR #101 follow-up discussion. - \`cmd_size\` — new \`--strict-mcp\` flag for preview, plus a \`strict_mcp:\` line and \`↑ haiku auto-promoted\` note in output. - CLAUDE.md + SKILL.md — docs for the rule and how to opt out. ## Sanity check | Frontmatter | Scale | Before this PR | After this PR | |---|---|---|---| | \`complexity: 0.20\` (MCP loaded, default) | 0.20 | haiku/high/8t/$0.22 | sonnet/high/8t/$0.77 | | \`complexity: 0.20, strict_mcp_config: true\` | 0.20 | haiku/high/8t/$0.22 | haiku/high/8t/$0.22 (unchanged) | | Doctor \"is plist loaded\" | 0.05 | haiku/low/3t/$0.05 | sonnet/medium/3t/$0.22 | | \`complexity: 0.55\` (MCP loaded) | 0.55 | sonnet/medium/18t/$1.46 | sonnet/medium/18t/$1.46 (no promotion needed) | ## Test plan - [x] \`ast.parse\` clean for all modules - [x] ruff check passes (E9,F selection) - [x] Smoke: resolve_params on fm with / without \`strict_mcp_config\` derives expected model - [x] Smoke: compute_sizing with strict_mcp=False promotes haiku bands; with True keeps them - [ ] After merge + reinstall: \`clauck size 0.20\` defaults to sonnet derivation; \`--strict-mcp\` shows haiku - [ ] Doctor invocation that previously truncated on haiku now runs on sonnet and completes
Summary
Makes cost a first-class transparent policy per `INTENT.md §3` non-negotiable #4 and §4 architectural property "Cost transparency". Introduces a single shared formula (`lib/sizing.py`) that maps a declared complexity scale (0.0–1.0) to `(model, effort, max_turns, max_budget_usd)`. Used by:
What changed
New module: `lib/sizing.py`
Doctor changes
```
→ Check delegate-scout job scheduling and Slack DM history [--dry --safe]
→ scale=0.35 → haiku/high, 14 turns, base $0.39 × 1.30 headroom = $0.50
```
New CLI surfaces
Scheduler
YAML parser (both scheduler and CLI)
Documentation
Backward compatibility
Known-failure regression checks
Test plan