Skip to content

feat(sizing): complexity-scale-based derivation for all Claude sessions#101

Merged
CoreyRDean merged 6 commits into
mainfrom
feat/complexity-scale-sizing
Apr 18, 2026
Merged

feat(sizing): complexity-scale-based derivation for all Claude sessions#101
CoreyRDean merged 6 commits into
mainfrom
feat/complexity-scale-sizing

Conversation

@CoreyRDean
Copy link
Copy Markdown
Owner

Summary

Makes cost a first-class transparent policy per `INTENT.md §3` non-negotiable #4 and §4 architectural property "Cost transparency". Introduces a single shared formula (`lib/sizing.py`) that maps a declared complexity scale (0.0–1.0) to `(model, effort, max_turns, max_budget_usd)`. Used by:

  • `clauck doctor` — classifier emits only `task_complexity_scale`; Python does the dollar arithmetic. Kills the ``"Reached maximum budget ($0.25)"`` failure mode structurally (two prior prompt-only fixes could not converge — classifiers anchor on literal dollars in prompts under load).
  • Scheduled jobs — `scheduler.py` resolves via the same helper at tick time. Jobs declare `complexity: ` in frontmatter; the scheduler derives the four sizing fields. Legacy jobs with explicit params continue to work unchanged.
  • Marketplace jobs — all 11 converted to `complexity:` declarations. `module-demo` retains explicit overrides for pedagogical minimums (teaches the override pattern).
  • Semantic interpreter (`clauck `) — job-creation guidance updated to emit `complexity:` in new jobs.

What changed

New module: `lib/sizing.py`

  • `SCALE_PARAMS` — banded lookup (0.10: haiku/medium → 1.0: opus/high), rates calibrated against Anthropic 4.x API list prices with a midpoint-average context-growth multiplier.
  • `compute_sizing(scale, ctx_tokens, config)` — derives all four fields, clamps budget to config min/max, explains the arithmetic in a human-readable string.
  • `resolve_params(frontmatter, ctx_tokens, config)` — handles the frontmatter contract: complexity → derived, explicit field → override (per-field granularity), neither → legacy defaults.
  • `load_doctor_config` / `save_doctor_config` with atomic writes.
  • `apply_auto_skew_on_budget_hit` / `apply_auto_skew_decay` — self-balancing scale offset.

Doctor changes

  • New classifier prompt (`_build_doctor_interpreter_prompt`) — emits `{interpretation, task_complexity_scale}` only. Scale anchors at 0.05, 0.20, 0.35, 0.55, 0.75, 0.95 with concrete task shapes. No literal dollar values in the prompt anywhere.
  • Stage-2 now uses derived params + prints the sizing breakdown before running:
    ```
    → Check delegate-scout job scheduling and Slack DM history [--dry --safe]
    → scale=0.35 → haiku/high, 14 turns, base $0.39 × 1.30 headroom = $0.50
    ```
  • Auto-skew on budget hit (`scale_skew += auto_skew_increment`, capped), decay on success (`scale_skew -= increment/2`, floored). Runs silently; notifies stderr on adjustment.
  • Interpreter runs with `--effort low` (new).

New CLI surfaces

  • `clauck config doctor [key [value]]` — view or set any doctor sizing knob.
  • `clauck size [--ctx N] [--model M]` — inspect what the formula derives without firing a session. Shows breakdown, config context, and the frontmatter snippet to pin.
  • `clauck inspect ` now shows resolved params with per-field provenance (`derived` / `override` / `default`).
  • `clauck validate` accepts `complexity` (float in [0.0, 1.0]), warns on total absence of sizing declaration.

Scheduler

  • Flat, module-anchor, and module-stage job builds all route through `sizing.resolve_params` — zero drift between the three construction sites.
  • Same YAML parser fix as the CLI (inline `#` comment stripping respecting quotes).

YAML parser (both scheduler and CLI)

  • New `_strip_inline_comment` helper that respects single/double quotes. Needed to parse `complexity: 0.15 # chain-test`-style lines, a natural convention for authors who want to annotate their scale choice inline.

Documentation

  • CLAUDE.md: new Cost policy section; frontmatter schema extended.
  • INTENT.md: §4 now lists four architectural properties (adds Cost transparency); §7 Cost row updated.
  • README.md: Cost section reframed around transparency.
  • skill/clauck/SKILL.md: new Cost & sizing reference section with scale anchors, knob table, auto-skew semantics.

Backward compatibility

  • Jobs without `complexity:` continue to work exactly as before (explicit fields → override path; no fields → legacy defaults).
  • Manifest schema unchanged.
  • No existing job requires edits.
  • `clauck config` still accepts the dotted-key form (`clauck config set doctor.scale_skew 0.1`) — the new `config doctor` form is additive.

Known-failure regression checks

Invocation Pre-fix Post-fix derivation
`"is my delegate-scout job operating as expected?"` haiku $0.25, budget-exceeded scale≈0.20 → haiku/high, 8 turns, $0.22 — within budget
`"intent-miner silent 4h … delegate-scout … 5-min cron"` haiku $0.25, truncated; then sonnet $0.25, truncated scale≈0.55 → sonnet/high, 25 turns, $2.71
`"is the plist loaded"` (trivial) haiku small, fine scale≈0.05 → haiku/medium, 4 turns, $0.08 (no regression)

Test plan

  • `bash -n install.sh && bash -n uninstall.sh` clean
  • `ast.parse` clean for `lib/{scheduler.py,dag-runner.py,clauck,sizing.py}`
  • `json.load(marketplace/index.json)` clean
  • `bash install.sh --dry-run --yes` completes
  • Smoke: `scheduler.discover_jobs()` with current `~/.clauck/` returns sensible sizing for every installed job (21 jobs); none regressed
  • Marketplace sizing sanity check: each of 11 jobs derives expected tier
  • After merge + reinstall: rerun the originally failing invocations and confirm no `Reached maximum budget` truncation
  • Sanity check: `clauck size 0.35` prints breakdown; `clauck config doctor` shows defaults; `clauck inspect morning-brief` shows `[derived]` provenance
  • Sanity check: trivial `clauck doctor "is the plist loaded"` still routes cheap

… CLI

Introduces lib/sizing.py as the single source of truth for converting a
task complexity scale (0.0–1.0) into (model, effort, max_turns,
max_budget_usd). Used by both the scheduler (at tick time) and the CLI
(for doctor sizing, `clauck size`, `clauck config doctor`, and
`clauck inspect` resolved-param display).

Why

Prior doctor sizing lived inside the classifier prompt and asked the
classifier to emit dollar values. Classifiers under load anchor on
literal dollars in the prompt rather than computing from rates — a
structural failure that produced repeat `Reached maximum budget ($0.25)`
truncations across two prior attempts at prompt-only fixes. Moving the
arithmetic out of the prompt eliminates that class of failure.

What this commit does

- lib/sizing.py: banded SCALE_PARAMS lookup (haiku/sonnet/opus tiers) +
  context-growth-aware compute_sizing() + resolve_params() with
  per-field override semantics + load/save of the `doctor` config block
  (min/max/headroom/scale_skew/auto-skew knobs) + atomic config writes.
  Rates calibrated against Anthropic 4.x API list prices with a
  midpoint-average context-growth multiplier.

- lib/clauck: new doctor interpreter prompt emits
  {interpretation, task_complexity_scale} only — no model, turns, or
  dollars. Python derives those via sizing.compute_sizing. Doctor
  prints the full sizing breakdown before stage-2 for transparency.
  Adds auto-skew: on "Reached maximum budget" bumps config.scale_skew
  by auto_skew_increment (capped at auto_skew_cap); on clean runs
  decays by increment/2 (floored at 0) — self-balancing tuner.
  Interpreter runs with --effort low for cheapest classification.
  New subcommands: `clauck config doctor [key [val]]` (view/edit
  doctor config) and `clauck size <scale> [--ctx N]` (inspect what
  the formula derives). Updates semantic interpreter and job-creation
  prompts to prefer `complexity:` in new job frontmatter.
  cmd_validate recognizes `complexity`, validates [0.0, 1.0] range,
  and warns when no sizing is declared. cmd_inspect shows resolved
  params with per-field provenance (derived / override / default).
  YAML subset parser now strips inline `#` comments respecting quotes.

- lib/scheduler.py: discover_jobs routes the three job-dict builds
  (flat, module anchor, module stage) through sizing.resolve_params
  so complexity-based frontmatter derives the four sizing fields at
  tick time, and legacy jobs continue to work unchanged via the
  per-field override path. YAML parser gets the same inline-comment
  fix to stay consistent with the CLI parser.

- install.sh / uninstall.sh: place and remove lib/sizing.py alongside
  the other runtime Python files in ~/.clauck/.

Backward compatibility

- Jobs without `complexity:` continue to work. resolve_params falls
  back to explicit `max_turns` / `max_budget_usd` / `effort` / `model`
  fields, then to LEGACY_DEFAULTS (50 turns, $2.00, high effort).
- Jobs with both `complexity:` and an explicit field use the explicit
  value for THAT field only. Override granularity is per-field so a
  job can pin max_turns while leaving everything else derived.
- The manifest schema is unchanged — only the values flowing through
  it shift from "raw from frontmatter" to "resolved by sizing.py".
Replaces explicit max_turns/max_budget_usd/effort/model with a single
`complexity:` field in each marketplace job. The scheduler derives
all four sizing fields at fire time via lib/sizing.py.

Scale assignments:
- 0.05: dynamic-context-demo, module-demo       (trivial/pedagogical)
- 0.10: event-monitor, git-commit-nudge         (single-trigger check)
- 0.15: daily-verify, downloads-triage,
        workspace-cleanup                       (focused single-source)
- 0.20: github-issue-watcher, github-pr-digest  (multi-query digest)
- 0.25: inbox-zero-assist                       (scan + synthesis)
- 0.35: morning-brief                           (3-source synthesis)

module-demo retains explicit max_turns: 1 / max_budget_usd: 0.02 as
overrides — pedagogical minimums below min_budget_usd floor. This
demonstrates the override pattern as a teaching example.

All jobs pass through `clauck validate` with the new complexity
recognition, and `sizing.resolve_params` derives sensible defaults
for each scale. No behavioral change to firing logic; budgets shift
to the values the formula picks (generally comparable to or slightly
more generous than the hand-tuned values they replace).
- CLAUDE.md: new "Cost policy" section covering the scale formula,
  per-field override semantics, legacy compat, auto-skew, and the
  single-source-of-truth rule (lib/sizing.py). Frontmatter schema
  extended to document `complexity:`.

- INTENT.md: §4 promoted from three architectural properties to four
  — "Cost transparency" added. Cost as first-class (non-negotiable
  #4) explained HOW via this property: single formula, single config
  surface, inspectable from CLI by humans and agents, self-correcting
  on truncation. §7 Cost policy row updated to describe the
  declare-and-derive pattern instead of raw "declare + log."

- README.md: Cost section leads with the transparency pitch and
  points at `clauck size <scale>` + `clauck config doctor` for
  introspection.

- skill/clauck/SKILL.md: new "Cost & sizing" reference section before
  the frontmatter schema. Table of scale anchors, config knobs,
  auto-skew semantics, legacy compat. Frontmatter schema entry
  extended to document `complexity:` with override semantics.
@CoreyRDean CoreyRDean added the enhancement New feature or request label Apr 18, 2026
B1 — auto-skew false-positive detection
  Replaced substring search on stdout+stderr ("Reached maximum budget") with
  structured-field detection on the claude CLI envelope (subtype ==
  "error_max_budget_usd" OR matching entries in envelope.errors). The old
  detection would false-positive on any diagnostic report that mentioned
  budget truncation in its output (very common once users start debugging
  this feature). Also short-circuits the skew bump when the current run's
  budget was already clamped by max_budget_usd — bumping skew cannot help
  if the config ceiling is the binding constraint; the code now prints a
  direct pointer to `clauck config doctor max_budget_usd <value>` instead.

B2 — max_budget_usd default disagreement + silent opus clamping
  - DEFAULT_DOCTOR_CONFIG["max_budget_usd"] raised from $10 to $25 so the
    high-end opus bands (scale 0.85–0.95) are not silently clamped by
    default. Scale 1.00 still clamps at $25 as a genuine circuit breaker.
  - compute_sizing now reads all fallbacks from DEFAULT_DOCTOR_CONFIG keys
    instead of literal magic numbers at the call site (previously
    max_budget_usd fallback was 5.00 in code but 10.00 in defaults — the
    two paths now agree by construction).
  - cmd_size prints a visible "⚠ budget clamped" warning when the raw
    derivation exceeds max_budget_usd, pointing at the config command to
    raise it. Satisfies INTENT.md §4 "set a budget the user cannot audit
    violates this property" — clamps are now surfaced, not silent.

B3 — missing `effort: low` band removed from expressible space
  Added a bottom band: (ceiling=0.07, haiku, low, 3 turns, $0.012/turn).
  Re-scaled the marketplace jobs whose original `effort: low` was getting
  upgraded to medium/high under the previous table: event-monitor,
  git-commit-nudge → complexity 0.05 (low tier, preserves original
  effort: low exactly); daily-verify, downloads-triage, workspace-cleanup
  → complexity 0.10 (medium tier, closer match than the prior 0.15).
  Band ceilings throughout SCALE_PARAMS shifted slightly to accommodate
  the new entry; scale-anchor tables updated in doctor interpreter prompt,
  SKILL.md, and CLAUDE.md.

B4 — INTENT.md stale "three" + version not bumped
  - §8 "When checking for drift" line updated: "three architectural
    properties" → "four".
  - Version header bumped v1 → v2 per §8 ("Each amendment bumps it").

B5 — semantic interpreter contract-drift acknowledgment
  The natural-language semantic interpreter (clauck <anything>) still
  emits exec_model/exec_effort/exec_max_turns/exec_max_budget_usd in its
  routing JSON — a known gap vs the new §4 Cost transparency property.
  Refactoring that path touches more consumers than doctor did; tracked
  for follow-up rather than extended into this PR. INTENT.md §4 now
  explicitly names the scope of v2 compliance (doctor + scheduler + CLI
  surfaces + marketplace jobs, all fully compliant) and flags the
  semantic path as a known gap that must be closed before v3.
  _build_interpreter_prompt gets a docstring pointing at the same gap
  so any future maintainer sees it locally, not only in INTENT.md.

N1 — scale_skew cross-contamination into scheduled jobs (high-value non-blocker)
  scale_skew is a doctor-scoped tuner: it auto-bumps when doctor hits
  budget, reflecting doctor task shapes. Previously every scheduled job's
  sizing read the same skew, so a user whose doctor had been auto-bumped
  to +0.20 would see every cron-fired job also run hotter despite never
  having truncated. Scheduler now loads its sizing config via a new
  _load_scheduler_sizing_config() that forces scale_skew=0 while keeping
  the other knobs (min/max/headroom/growth). Also lifted the config load
  out of _resolve_sizing into discover_jobs so the full tick shares one
  disk read (~30k reads/day saved at ~20 jobs × 60s ticks).
@CoreyRDean CoreyRDean marked this pull request as ready for review April 18, 2026 13:31
@CoreyRDean CoreyRDean merged commit c5d9ac0 into main Apr 18, 2026
2 checks passed
@CoreyRDean CoreyRDean deleted the feat/complexity-scale-sizing branch April 18, 2026 13:34
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f1b9296e23

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/clauck
fm_for_sizing = _parse_fm_block(fm_match.group(1))
except OSError:
pass
resolved = sizing.resolve_params(fm_for_sizing, 0, sizing.load_doctor_config())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Align inspect sizing with scheduler skew rules

cmd_inspect resolves complexity with sizing.load_doctor_config(), which includes the current doctor.scale_skew, but scheduled jobs are resolved with scale_skew forcibly set to 0.0 in scheduler._load_scheduler_sizing_config(). When doctor auto-skew has been bumped, clauck inspect <job> will show inflated model/turn/budget values that do not match what the scheduler will actually run, which undermines the new “resolved params” output and can lead users to make incorrect tuning decisions.

Useful? React with 👍 / 👎.

Comment thread lib/clauck
findings.append(("error", f"complexity must be in [0.0, 1.0] (got {c})"))
else:
has_complexity = True
sized = sizing.compute_sizing(c, 0, sizing.load_doctor_config())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use scheduler-equivalent config in complexity validation

The complexity preview in _validate_job_file calls sizing.compute_sizing(..., sizing.load_doctor_config()), so validation output is skewed by doctor.scale_skew, while scheduler execution intentionally ignores that skew (scale_skew = 0.0). This means clauck validate can report a derived tier/budget that is different from runtime behavior for the same job, producing misleading validation results right where users expect authoritative sizing feedback.

Useful? React with 👍 / 👎.

CoreyRDean added a commit that referenced this pull request Apr 18, 2026
…102)

## Summary

Empirical hard rule: **haiku is not a viable choice when the user's full
MCP surface is loaded.** Tool descriptions across a typical MCP surface
total ~150k tokens — combined with system prompts and tool results, this
regularly exceeds haiku's effective working context and sends sessions
into a compaction loop that burns budget without progress.

Follows directly from the conversation around PR #101: we added
context-*cost* tracking (dollars per turn) but not context-*fit*
guardrails (does everything load before compaction). This PR adds one
minimal, explicit fit rule. A more sophisticated window-fit layer
(MODEL_CONTEXT_WINDOW table + per-model thresholds + measure-once MCP
estimate) is tracked as a follow-up if this rule proves insufficient in
practice.

## What changed

- \`lib/sizing.py\` — new \`_promote_for_mcp\` helper + \`strict_mcp\`
kwarg on \`compute_sizing\`. When \`strict_mcp=False\` (default), any
haiku derivation bumps to the nearest sonnet band matching current
effort (low→medium, medium→medium, high→high). Turns stay unchanged — we
need a bigger **model**, not more steps. The promotion surfaces in the
returned dict (\`mcp_promoted: bool\`) and in the \`explanation\`
string.
- \`resolve_params\` — reads \`strict_mcp_config\` from frontmatter,
passes through to \`compute_sizing\`.
- \`cmd_doctor\` — passes \`strict_mcp=False\` explicitly. Doctor's
stage-2 does not pass \`--strict-mcp-config\` because the diagnostic
agent needs MCP for \`--fix\` actions (post Slack escalations, query
GitHub, etc.). So any haiku-tier scale auto-promotes. Eliminates the
compaction-loop budget-exceeded failure mode that sparked PR #101
follow-up discussion.
- \`cmd_size\` — new \`--strict-mcp\` flag for preview, plus a
\`strict_mcp:\` line and \`↑ haiku auto-promoted\` note in output.
- CLAUDE.md + SKILL.md — docs for the rule and how to opt out.

## Sanity check

| Frontmatter | Scale | Before this PR | After this PR |
|---|---|---|---|
| \`complexity: 0.20\` (MCP loaded, default) | 0.20 |
haiku/high/8t/$0.22 | sonnet/high/8t/$0.77 |
| \`complexity: 0.20, strict_mcp_config: true\` | 0.20 |
haiku/high/8t/$0.22 | haiku/high/8t/$0.22 (unchanged) |
| Doctor \"is plist loaded\" | 0.05 | haiku/low/3t/$0.05 |
sonnet/medium/3t/$0.22 |
| \`complexity: 0.55\` (MCP loaded) | 0.55 | sonnet/medium/18t/$1.46 |
sonnet/medium/18t/$1.46 (no promotion needed) |

## Test plan

- [x] \`ast.parse\` clean for all modules
- [x] ruff check passes (E9,F selection)
- [x] Smoke: resolve_params on fm with / without \`strict_mcp_config\`
derives expected model
- [x] Smoke: compute_sizing with strict_mcp=False promotes haiku bands;
with True keeps them
- [ ] After merge + reinstall: \`clauck size 0.20\` defaults to sonnet
derivation; \`--strict-mcp\` shows haiku
- [ ] Doctor invocation that previously truncated on haiku now runs on
sonnet and completes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant