Rectify: review-design L1 Severity Calibration #602
Conversation
Replace misleading phrase 'regardless of triage output' with accurate wording: 'severity thresholds are calibrated per experiment_type via the rubric below'. H-weight is constant; thresholds are not.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 6 new tests (5 currently failing, 1 immediately passing) that encode the structural contracts required by the review-design L1 severity calibration repair:

- test_l1_subagents_receive_experiment_type: Step 2 must explicitly list experiment_type as a subagent input (was absent, causing false STOP verdicts)
- test_l1_severity_calibration_rubric_present: Step 2 must contain a calibration rubric table mapping experiment_type to severity per dimension
- test_design_rejected_step_is_action_stop: regression guard for design_rejected action=stop with non-empty message referencing STOP verdict
- test_kitchen_rule_6_acknowledges_stop_verdict_exception: rule #6's blanket 'do not stop' must carve out the STOP verdict exception
- test_skill_contracts_pattern_examples_use_valid_experiment_types: pattern_examples must not use invalid experiment_type values (e.g. 'controlled')
- test_review_design_experiment_type_output_has_allowed_values: experiment_type output must have an allowed_values constraint in skill_contracts.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three interconnected fixes for the false-STOP verdict regression on
benchmark/exploratory plans:
1. SKILL.md Step 2 — add explicit subagent inputs and calibration rubric
- L1 subagents now explicitly receive experiment_type from Step 1 triage
- Severity calibration rubric table maps experiment_type to severity per
L1 dimension (causal_inference → critical; benchmark/config/robust →
warning; exploratory → info)
- Both subagent descriptions updated with type-specific severity guidance
2. research.yaml kitchen rule #6 — scope 'do not stop' to exhaustion/failure
- Adds explicit STOP verdict exception clause so the blanket 'do not stop'
language no longer contradicts the design_rejected hard halt
3. skill_contracts.yaml review-design outputs
- experiment_type now has allowed_values constraint (benchmark,
configuration_study, causal_inference, robustness_audit, exploratory)
- pattern_examples: replaced invalid 'controlled' with 'benchmark'
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
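The rubric and `allowed_values` constraint described in this commit can be sketched as plain data. This is a hypothetical illustration, not the shipped implementation: the function name `l1_severity` and the dict layout are invented; the type names, severity floors, and the default-to-exploratory behavior come from the commit message and diagrams in this PR.

```python
# Hypothetical sketch of the Step 2 severity calibration rubric.
# Maps experiment_type -> severity floor for the L1 dimensions.
ALLOWED_EXPERIMENT_TYPES = {
    "benchmark", "configuration_study", "causal_inference",
    "robustness_audit", "exploratory",
}

SEVERITY_RUBRIC = {
    "causal_inference": "critical",
    "benchmark": "warning",
    "configuration_study": "warning",
    "robustness_audit": "warning",
    "exploratory": "info",
}

def l1_severity(experiment_type: str) -> str:
    """Return the L1 severity floor for a finding on this plan type.

    Unknown types fall back to 'exploratory', mirroring the triage
    default shown in the process-flow diagram.
    """
    if experiment_type not in ALLOWED_EXPERIMENT_TYPES:
        experiment_type = "exploratory"
    return SEVERITY_RUBRIC[experiment_type]
```

Under this rubric the regression described in the Summary cannot occur: a `benchmark` plan without a formal estimand caps at `warning`, which never fires the L1 fail-fast gate.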
The rule fired for `experiment_type` after `allowed_values` was added to its contract output, demanding explicit `on_result` routing for all 5 experiment types. But `experiment_type` is captured for context forwarding, not routing. Added a guard that skips the check for any output not referenced in any `on_result` `when` condition.
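The guard described above can be sketched as follows. The function names and the string-matching approach are hypothetical simplifications; the real rule lives in rules_verdict.py and presumably inspects structured conditions rather than raw strings.

```python
def is_context_forward_only(output_name: str, on_result_conditions: list[str]) -> bool:
    """True if the output never appears in any on_result `when` condition,
    i.e. it is captured for context forwarding, not routing."""
    return all(output_name not in cond for cond in on_result_conditions)

def unrouted_values(output_name: str, allowed_values: list[str],
                    on_result_conditions: list[str]) -> list[str]:
    """Allowed values that no on_result condition routes explicitly.

    Context-forward-only outputs are skipped entirely (the guard),
    so adding allowed_values to them no longer trips the rule.
    """
    if is_context_forward_only(output_name, on_result_conditions):
        return []
    return [v for v in allowed_values
            if not any(v in cond and output_name in cond
                       for cond in on_result_conditions)]
```

With the three verdict conditions from research.yaml, `verdict` still requires every value to be routed, while `experiment_type` passes unconditionally.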
…t coverage Part B immunity tests:

- B-1 (parameterized): all subagent-spawning steps (2–5) must declare explicit inputs using 'Receives:', 'Inputs:', or 'Each subagent receives:' pattern
- B-2 (parameterized): Steps 4 and 5 must explicitly receive experiment_type
- B-3: unrouted-verdict-value rule must not fire for review_design in research.yaml

All 5 B-1/B-2 cases fail before SKILL.md implementation (Steps 3–5 lack canonical input declarations). B-3 passes immediately (all three verdict values are already explicitly routed in research.yaml).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends the explicit context pattern from Step 2 (Part A) to all remaining subagent-spawning steps in review-design SKILL.md:

- Step 3 (red-team): canonical 'Receives:' form with colon
- Step 4 (L3): 'Each L3 subagent receives:' block with experiment_type and L1/L2 findings context; notes that exploratory plans do not require pre-registered correction procedures while causal_inference demands formal power analysis
- Step 5 (L4): 'Each L4 subagent receives:' block with experiment_type and dimension_weights; notes that ecological_validity and reproducibility_spec rigor expectations scale with experiment_type

Together with Part A, every subagent step now has a tested explicit input specification. Any future regression (removing experiment_type from any subagent context) immediately fails a parameterized contract test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
**Trecek** left a comment
AutoSkillit PR Review — Verdict: changes_requested
```python
def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:
    """Extract SKILL.md text between two headings (start inclusive, end exclusive)."""
    pattern = re.escape(start_heading) + r".*?(?=" + re.escape(end_heading) + r")"
```
[warning] tests: skill_text_between uses a greedy dotall regex (.*?) with a lookahead for end_heading. If the end_heading appears multiple times or section order changes, the match silently covers the wrong range. Use a non-greedy match anchored to the next same-level heading boundary, or assert the matched range is non-empty.
Investigated — this is intentional. The regex uses .*? (non-greedy), not greedy — the reviewer's 'greedy dotall' framing is factually incorrect. SKILL.md has exactly one occurrence of each '### Step N' heading, so the lookahead terminates unambiguously. The helper already asserts a non-None match (line 194) which fails loudly if the section is absent. No change warranted. (commit 2c9eb89)
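The behavior this response relies on can be checked directly: `.*?` is non-greedy, and `(?=...)` is a zero-width assertion, so `group(0)` never includes the end heading. A minimal self-contained reconstruction of the helper (the `assert m` guard and return value are taken from the snippets quoted in this review; the sample document is invented):

```python
import re

def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:
    """Extract text between two headings (start inclusive, end exclusive)."""
    pattern = re.escape(start_heading) + r".*?(?=" + re.escape(end_heading) + r")"
    m = re.search(pattern, text, re.DOTALL)
    assert m, f"section {start_heading!r} not found before {end_heading!r}"
    return m.group(0)

doc = "### Step 2\nL1 analysis body\n### Step 3\nred-team body\n"
section = skill_text_between("### Step 2", "### Step 3", doc)
# The lookahead is not consumed: `section` stops before "### Step 3".
```

Because SKILL.md contains each "### Step N" heading exactly once, the non-greedy match and the lookahead terminate at the same unambiguous boundary.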
```python
assert "critical" in step2_lower and "warning" in step2_lower, (
    "Step 2 must define what constitutes critical vs warning per experiment type."
)
```
[warning] tests: The phrase list ["receives:", "inputs:", "each subagent receives", "each ... receives"] is fragile: any legitimate rephrasing not in this list causes a false failure. This assertion checks prose style rather than a structural contract. Consider asserting the presence of specific variable names (e.g. 'experiment_type') instead.
Investigated — this is intentional. The phrase list is a canonical structural-form guard, not an arbitrary prose check. The test docstring explicitly calls this the 'established pattern (from Step 3 red-team)' — it enforces input-declaration form across all subagent steps. Switching to variable-name assertions (e.g. 'experiment_type') would conflate this with the separate test_l1_subagents_receive_experiment_type test. All four step ranges match the pattern and tests pass. No change warranted. (commit 8764aa4)
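The structural-form guard under discussion reduces to a small predicate. This sketch is hypothetical (the function name and sample texts are invented); the phrase list is taken from the review comment above, minus the wildcard "each ... receives" variant:

```python
# Canonical input-declaration forms the contract tests enforce
# for subagent-spawning steps in SKILL.md.
CANONICAL_INPUT_FORMS = (
    "receives:",
    "inputs:",
    "each subagent receives",
)

def has_input_declaration(step_text: str) -> bool:
    """True if a subagent step declares its inputs in one of the
    canonical forms, regardless of surrounding prose."""
    lowered = step_text.lower()
    return any(form in lowered for form in CANONICAL_INPUT_FORMS)
```

The tradeoff the author defends: this checks declaration *form* (the cross-step pattern), while the separate test_l1_subagents_receive_experiment_type checks declaration *content* (the variable names), so neither test can be weakened to cover for the other.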
```python
The current blanket 'do not stop' language contradicts the design_rejected
hard halt. The rule must explicitly note that verdict=STOP is the sole exception.
"""
rule6 = recipe.kitchen_rules[5]  # 0-indexed; rule #6 is index 5
```
[warning] tests: Hard-coded kitchen_rules[5] index will silently audit the wrong rule if any rule is inserted before index 5. Locate the relevant rule by content substring match rather than positional index.
Valid observation — flagged for design decision. The test deliberately uses a documented index (# 0-indexed; rule #6 is index 5) and the kitchen_rules list currently has 8 entries. The positional approach is a tradeoff: a rule inserted before index 5 could silently audit the wrong rule if it coincidentally contains 'stop' AND 'verdict'/'exception'. Switching to a content substring match (e.g. next(r for r in recipe.kitchen_rules if 'DESIGN REVIEW LOOP' in r)) would be more robust. Flagged for human decision on whether to harden. (commit 2c9eb89)
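The hardened lookup flagged above is a small change. A sketch of the content-match approach (the function name and sample rules are invented; the uniqueness assertion guards against the coincidental-match risk the author raises):

```python
def find_rule(kitchen_rules: list[str], marker: str) -> str:
    """Locate a kitchen rule by a content marker instead of position,
    so inserting a rule earlier in the list cannot shift the target.

    Asserts exactly one match, failing loudly if the marker is
    ambiguous or absent rather than silently auditing the wrong rule."""
    matches = [r for r in kitchen_rules if marker in r]
    assert len(matches) == 1, (
        f"expected exactly one kitchen rule containing {marker!r}, "
        f"found {len(matches)}"
    )
    return matches[0]
```

The marker should be a phrase unique to the rule (the response suggests 'DESIGN REVIEW LOOP'), not generic words like 'stop' that other rules might acquire.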
```python
# ── Section-scoped helpers ────────────────────────────────────────────────────

def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:
```
[warning] defense: skill_text_between returns m.group(0) which is the full lookahead match. If the section is empty the caller receives an empty string with no guard — callers like test_l1_subagents_receive_experiment_type then assert keyword presence in an empty string and fail with an uninformative message. Add a non-empty assertion in the helper or document the empty-string contract.
Investigated — this is intentional. The concern is based on a misunderstanding of how lookahead assertions work: re.search with (?=end_heading) returns only the consumed portion (up to but not including the lookahead), never the lookahead text itself. The assert m on line 194 guards against a missing section. Against the actual SKILL.md, all four step ranges return non-empty strings. The 'uninformative message' failure mode does not exist in practice. No change warranted. (commit 2c9eb89)
```
2–4 subagents. Only spawn subagents for dimensions with weight ≥ L in the matrix.
SILENT (S) dimensions are NOT spawned and NOT mentioned in output.

Each L4 subagent receives:
```
[warning] cohesion: L4 input spec includes dimension_weights but L1 and L3 input specs do not declare whether they receive or ignore this field. If L4 uses dimension_weights to calibrate thresholds, L3 (also dimension-scoped) should declare its position on dimension_weights for a consistent cross-level input contract.
Valid observation — flagged for design decision. L4 input spec explicitly passes dimension_weights (added in commit a5913f3) so H-weight dimensions apply stricter thresholds. L3 dimensions (error_budget, statistical_corrections, variance_protocol) are unconditionally spawned with no SILENT/weight-gating logic, so dimension_weights may carry less interpretive value there — but this asymmetry is undocumented. Flagged for deliberate decision: either add dimension_weights to L3 spec (with rationale) or document the intentional omission. (commit a5913f3)
```
- `estimand_clarity` agent: "Can the claim be written as a formal contrast (A vs B on Y in Z)?"
  Reference the exp-lens-estimand-clarity philosophical mode as guidance (do NOT invoke
  the skill — reference its lens question only in the subagent prompt).
Use the calibration rubric above to assign severity. For `causal_inference`: absent formal
```
[warning] slop: Lines L164-L172 repeat the rubric table content (L156-L159) in prose inline with agent instructions. The table already encodes the same severity mappings; the prose expansions are redundant restatements that duplicate the table rather than adding information.
Investigated — this is intentional. The rubric table (lines 156-159) encodes severity values only. Lines 164-172 embed agent-specific interpretive rationale that the table cannot encode: 'informal contrast sufficient' (estimand_clarity / benchmark), 'intentionally absent' (exploratory), 'pre-registration not required' (hypothesis_falsifiability / exploratory). This prose explains why the severity is what it is and provides operational grounding that prevents subagents from misapplying the table to borderline cases. It was added in commit 8cec605 specifically to fix a false-STOP verdict regression. Removing it would reintroduce the same ambiguity. No change warranted. (commit 8cec605)
Summary
The `review-design` skill has a data flow break at Step 2 (Level 1 Analysis): `experiment_type`, produced by Step 1 (Triage Dispatcher), is never forwarded to the `estimand_clarity` and `hypothesis_falsifiability` subagents. These L1 subagents receive only a one-line lens question, with no experiment type context and no severity calibration guidance. They self-derive severity thresholds and default to the most restrictive standard — causal-inference pre-registration requirements — regardless of experiment type.

The result: a `benchmark` plan with well-formed hypotheses but no formal estimand declaration triggers a `critical` finding, fires the L1 fail-fast gate, and receives `verdict = STOP`. The pipeline routes to `design_rejected` (hard halt), bypassing both the `revise_design` loop and the `on_exhausted: create_worktree` fallback.

The architectural fix is a data lineage repair plus contract enforcement: Step 2 must explicitly specify `experiment_type` as a subagent input (mirroring the established red-team agent pattern at Step 3), and include a severity calibration rubric table mapping experiment type to severity floor per L1 dimension. Contract tests assert this specification exists, making any future regression immediately visible as a test failure. Three secondary issues are addressed alongside: kitchen rule #6's overbroad "do not stop" language, an invalid `experiment_type = controlled` example in `skill_contracts.yaml`, and the missing `allowed_values` constraint on the `experiment_type` output declaration.

Architecture Impact
Data Lineage Diagram
```mermaid
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 65, 'curve': 'basis'}}}%%
flowchart LR

    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    PLAN(["experiment_plan<br/>━━━━━━━━━━<br/>.md / .yaml"])

    subgraph Step1 ["● Step 1: Triage (SKILL.md)"]
        TRIAGE["Triage Subagent<br/>━━━━━━━━━━<br/>classifies plan"]
        ET["experiment_type<br/>━━━━━━━━━━<br/>benchmark | causal_inference<br/>config_study | robustness_audit<br/>exploratory"]
        DW["dimension_weights<br/>━━━━━━━━━━<br/>H/M/L/SILENT per dim"]
    end

    subgraph Step2 ["● Step 2: L1 Analysis (SKILL.md — repaired)"]
        RUBRIC["● Severity Rubric<br/>━━━━━━━━━━<br/>causal_inference → critical<br/>benchmark → warning<br/>exploratory → info"]
        EC["estimand_clarity<br/>━━━━━━━━━━<br/>receives: plan + ET + rubric"]
        HF["hypothesis_falsifiability<br/>━━━━━━━━━━<br/>receives: plan + ET + rubric"]
        GATE{{"Fail-fast Gate<br/>any critical?"}}
    end

    subgraph Step3to6 ["Steps 3–6: Red-Team / L2 / L3 / L4 (SKILL.md)"]
        RT["red_team + L2/L3/L4<br/>━━━━━━━━━━<br/>receive: plan + experiment_type<br/>type-specific calibration"]
    end

    subgraph Step7 ["● Step 7: Synthesis (SKILL.md)"]
        SYNTH["stop_trigger check<br/>━━━━━━━━━━<br/>critical in {estimand_clarity,<br/>hypothesis_falsifiability, red_team}"]
        VERDICT["verdict<br/>━━━━━━━━━━<br/>GO | REVISE | STOP"]
    end

    subgraph Contracts ["● Contracts Layer"]
        SC["● skill_contracts.yaml<br/>━━━━━━━━━━<br/>verdict: allowed_values [GO,REVISE,STOP]<br/>experiment_type: allowed_values [5 types]"]
        RV["● rules_verdict.py<br/>━━━━━━━━━━<br/>unrouted-verdict-value rule<br/>skips context-forward captures"]
    end

    subgraph Routing ["● research.yaml on_result routing"]
        GO_R["GO<br/>━━━━━━━━━━<br/>→ create_worktree"]
        REV_R["REVISE<br/>━━━━━━━━━━<br/>→ revise_design loop"]
        STOP_R["STOP<br/>━━━━━━━━━━<br/>→ design_rejected (halt)"]
    end

    subgraph Artifacts ["Write-Only Artifacts"]
        DASH["evaluation_dashboard<br/>━━━━━━━━━━<br/>always written"]
        REVGUID["revision_guidance<br/>━━━━━━━━━━<br/>written on REVISE only"]
    end

    PLAN -->|"full plan text"| TRIAGE
    TRIAGE -->|"classifies"| ET
    TRIAGE -->|"weight matrix"| DW
    ET -->|"● now forwarded to L1"| RUBRIC
    ET -->|"● now forwarded to L1"| EC
    ET -->|"● now forwarded to L1"| HF
    RUBRIC -->|"calibrates severity"| EC
    RUBRIC -->|"calibrates severity"| HF
    EC -->|"calibrated findings"| GATE
    HF -->|"calibrated findings"| GATE
    DW -->|"weight controls spawning"| RT
    ET -->|"type-specific focus"| RT
    GATE -->|"no critical → proceed"| RT
    GATE -->|"critical → skip 3-6"| SYNTH
    RT -->|"findings"| SYNTH
    SYNTH -->|"adjudicates"| VERDICT
    VERDICT -->|"emitted token"| GO_R
    VERDICT -->|"emitted token"| REV_R
    VERDICT -->|"emitted token"| STOP_R
    VERDICT -.->|"write-only"| DASH
    REV_R -.->|"write-only on REVISE"| REVGUID
    SC -->|"validates"| VERDICT
    RV -->|"semantic rule"| SC

    class PLAN cli;
    class TRIAGE,EC,HF,RT,SYNTH handler;
    class ET,DW stateNode;
    class RUBRIC,SC,RV newComponent;
    class GATE phase;
    class VERDICT output;
    class GO_R,REV_R,STOP_R phase;
    class DASH,REVGUID output;
```

Color Legend:
Process Flow Diagram
```mermaid
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 55, 'curve': 'basis'}}}%%
flowchart TB

    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    START([experiment_plan_path])

    subgraph Triage ["Step 1: Triage Dispatcher"]
        direction TB
        TRIAGENODE["Triage Subagent<br/>━━━━━━━━━━<br/>classifies experiment type<br/>builds dimension weight matrix"]
        ET_VAL{"experiment_type valid?<br/>━━━━━━━━━━<br/>5-value enum check"}
    end

    subgraph L1 ["● Step 2: L1 Analysis — repaired data flow (SKILL.md)"]
        direction TB
        EC["estimand_clarity<br/>━━━━━━━━━━<br/>● receives: plan + experiment_type<br/>● + severity calibration rubric"]
        HF["hypothesis_falsifiability<br/>━━━━━━━━━━<br/>● receives: plan + experiment_type<br/>● + severity calibration rubric"]
        L1GATE{{"Fail-fast Gate<br/>━━━━━━━━━━<br/>any critical?"}}
    end

    subgraph Deep ["● Steps 3–6: Deep Analysis (SKILL.md)"]
        direction TB
        REDTEAM["● Red-team (L2)<br/>━━━━━━━━━━<br/>● explicit inputs: plan + experiment_type<br/>type-specific adversarial focus"]
        L3L4["● L3 / L4 Subagents<br/>━━━━━━━━━━<br/>● explicit inputs: plan + experiment_type<br/>dimension_weight-gated spawning"]
    end

    subgraph Synth ["Step 7: Synthesis"]
        direction TB
        STOPCHK{"stop_triggers?<br/>━━━━━━━━━━<br/>critical in {estimand_clarity,<br/>hypothesis_falsifiability, red_team}"}
        CRITCHK{"critical or warnings≥3?"}
        VERDICT["verdict<br/>━━━━━━━━━━<br/>GO | REVISE | STOP"]
    end

    subgraph Recipe ["● research.yaml on_result routing"]
        direction TB
        GO_BRANCH["GO<br/>━━━━━━━━━━<br/>→ create_worktree"]
        REVISE_BRANCH["REVISE<br/>━━━━━━━━━━<br/>→ revise_design → plan_experiment"]
        STOP_BRANCH["STOP<br/>━━━━━━━━━━<br/>→ design_rejected (action: stop)"]
    end

    HALT([HARD HALT])
    PROCEED([PROCEED TO EXECUTION])
    LOOP(["LOOP back (retries: 2 max = 3 cycles)"])

    subgraph Validation ["● Contract Validation Layer"]
        direction LR
        SCON["● skill_contracts.yaml<br/>━━━━━━━━━━<br/>allowed_values enforcement<br/>verdict + experiment_type"]
        VRULE["● rules_verdict.py<br/>━━━━━━━━━━<br/>unrouted-verdict-value rule<br/>skips context-forward captures"]
    end

    %% MAIN FLOW %%
    START --> TRIAGENODE
    TRIAGENODE --> ET_VAL
    ET_VAL -->|"valid / default to exploratory"| EC
    ET_VAL -->|"valid / default to exploratory"| HF
    EC --> L1GATE
    HF --> L1GATE
    L1GATE -->|"no critical → proceed"| REDTEAM
    L1GATE -->|"critical → skip 3-6"| STOPCHK
    REDTEAM --> L3L4
    L3L4 --> STOPCHK
    STOPCHK -->|"stop_triggers present"| VERDICT
    STOPCHK -->|"no stop_triggers"| CRITCHK
    CRITCHK -->|"critical or warnings≥3"| VERDICT
    CRITCHK -->|"clean"| VERDICT
    VERDICT -->|"GO"| GO_BRANCH
    VERDICT -->|"REVISE"| REVISE_BRANCH
    VERDICT -->|"STOP"| STOP_BRANCH
    GO_BRANCH --> PROCEED
    REVISE_BRANCH --> LOOP
    STOP_BRANCH --> HALT
    SCON -.->|"validates"| VERDICT
    VRULE -.->|"audits routing"| SCON

    %% CLASS ASSIGNMENTS %%
    class START terminal;
    class HALT,PROCEED,LOOP terminal;
    class TRIAGENODE,EC,HF,REDTEAM,L3L4 handler;
    class ET_VAL,L1GATE,STOPCHK,CRITCHK stateNode;
    class VERDICT output;
    class GO_BRANCH,REVISE_BRANCH,STOP_BRANCH phase;
    class SCON,VRULE newComponent;
```

Color Legend:
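The Step 7 adjudication shown in the diagram can be sketched as a small function. This is one plausible reading of the diagram, not the shipped synthesis logic: the stop-trigger dimension set comes from the STOPCHK node, the warnings≥3 threshold from the CRITCHK node, and the finding shape (dimension, severity) pairs is invented.

```python
# Stop-trigger dimensions from the STOPCHK node in the process-flow diagram.
STOP_TRIGGER_DIMENSIONS = {"estimand_clarity", "hypothesis_falsifiability", "red_team"}

def adjudicate(findings: list[tuple[str, str]]) -> str:
    """findings: (dimension, severity) pairs.

    Critical in a stop-trigger dimension -> STOP (design_rejected halt);
    any other critical, or 3+ warnings -> REVISE (revise_design loop);
    otherwise -> GO (create_worktree)."""
    criticals = [d for d, s in findings if s == "critical"]
    warnings = [d for d, s in findings if s == "warning"]
    if any(d in STOP_TRIGGER_DIMENSIONS for d in criticals):
        return "STOP"
    if criticals or len(warnings) >= 3:
        return "REVISE"
    return "GO"
```

Under this reading, the calibration repair matters precisely because only the L1 dimensions and red_team can escalate to STOP: miscalibrating their severity floor converts a revisable plan into a hard halt.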
Closes #599
Implementation Plan
Plan file:
`/home/talon/projects/autoskillit-runs/remediation-20260404-092701-456719/.autoskillit/temp/rectify/rectify_review-design-l1-severity-calibration_2026-04-04_094500_part_a.md`

🤖 Generated with Claude Code via AutoSkillit