
Rectify: review-design L1 Severity Calibration#602

Merged
Trecek merged 6 commits into integration from
review-design-l1-subagents-lack-experiment-type-severity-cal/599
Apr 4, 2026

Conversation

@Trecek Trecek commented Apr 4, 2026

Summary

The review-design skill has a data flow break at Step 2 (Level 1 Analysis): experiment_type produced by Step 1 (Triage Dispatcher) is never forwarded to the estimand_clarity and hypothesis_falsifiability subagents. These L1 subagents receive only a one-line lens question, with no experiment type context and no severity calibration guidance. They self-derive severity thresholds and default to the most restrictive standard — causal-inference pre-registration requirements — regardless of experiment type.

The result: a benchmark plan with well-formed hypotheses but no formal estimand declaration triggers a critical finding, fires the L1 fail-fast gate, and receives verdict = STOP. The pipeline routes to design_rejected (hard halt), bypassing both the revise_design loop and the on_exhausted: create_worktree fallback.

The architectural fix is a data lineage repair plus contract enforcement: Step 2 must explicitly specify experiment_type as a subagent input (mirroring the established red-team agent pattern at Step 3), and include a severity calibration rubric table mapping experiment type to severity floor per L1 dimension. Contract tests assert this specification exists, making any future regression immediately visible as a test failure. Three secondary issues are addressed alongside: kitchen rule #6's overbroad "do not stop" language, an invalid experiment_type = controlled example in skill_contracts.yaml, and the missing allowed_values constraint on the experiment_type output declaration.

Architecture Impact

Data Lineage Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 65, 'curve': 'basis'}}}%%
flowchart LR
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;

    PLAN(["experiment_plan<br/>━━━━━━━━━━<br/>.md / .yaml"])

    subgraph Step1 ["● Step 1: Triage (SKILL.md)"]
        TRIAGE["Triage Subagent<br/>━━━━━━━━━━<br/>classifies plan"]
        ET["experiment_type<br/>━━━━━━━━━━<br/>benchmark | causal_inference<br/>config_study | robustness_audit<br/>exploratory"]
        DW["dimension_weights<br/>━━━━━━━━━━<br/>H/M/L/SILENT per dim"]
    end

    subgraph Step2 ["● Step 2: L1 Analysis (SKILL.md — repaired)"]
        RUBRIC["● Severity Rubric<br/>━━━━━━━━━━<br/>causal_inference → critical<br/>benchmark → warning<br/>exploratory → info"]
        EC["estimand_clarity<br/>━━━━━━━━━━<br/>receives: plan + ET + rubric"]
        HF["hypothesis_falsifiability<br/>━━━━━━━━━━<br/>receives: plan + ET + rubric"]
        GATE{{"Fail-fast Gate<br/>any critical?"}}
    end

    subgraph Step3to6 ["Steps 3–6: Red-Team / L2 / L3 / L4 (SKILL.md)"]
        RT["red_team + L2/L3/L4<br/>━━━━━━━━━━<br/>receive: plan + experiment_type<br/>type-specific calibration"]
    end

    subgraph Step7 ["● Step 7: Synthesis (SKILL.md)"]
        SYNTH["stop_trigger check<br/>━━━━━━━━━━<br/>critical in {estimand_clarity,<br/>hypothesis_falsifiability, red_team}"]
        VERDICT["verdict<br/>━━━━━━━━━━<br/>GO | REVISE | STOP"]
    end

    subgraph Contracts ["● Contracts Layer"]
        SC["● skill_contracts.yaml<br/>━━━━━━━━━━<br/>verdict: allowed_values [GO,REVISE,STOP]<br/>experiment_type: allowed_values [5 types]"]
        RV["● rules_verdict.py<br/>━━━━━━━━━━<br/>unrouted-verdict-value rule<br/>skips context-forward captures"]
    end

    subgraph Routing ["● research.yaml on_result routing"]
        GO_R["GO<br/>━━━━━━━━━━<br/>→ create_worktree"]
        REV_R["REVISE<br/>━━━━━━━━━━<br/>→ revise_design loop"]
        STOP_R["STOP<br/>━━━━━━━━━━<br/>→ design_rejected (halt)"]
    end

    subgraph Artifacts ["Write-Only Artifacts"]
        DASH["evaluation_dashboard<br/>━━━━━━━━━━<br/>always written"]
        REVGUID["revision_guidance<br/>━━━━━━━━━━<br/>written on REVISE only"]
    end

    PLAN -->|"full plan text"| TRIAGE
    TRIAGE -->|"classifies"| ET
    TRIAGE -->|"weight matrix"| DW
    ET -->|"● now forwarded to L1"| RUBRIC
    ET -->|"● now forwarded to L1"| EC
    ET -->|"● now forwarded to L1"| HF
    RUBRIC -->|"calibrates severity"| EC
    RUBRIC -->|"calibrates severity"| HF
    EC -->|"calibrated findings"| GATE
    HF -->|"calibrated findings"| GATE
    DW -->|"weight controls spawning"| RT
    ET -->|"type-specific focus"| RT
    GATE -->|"no critical → proceed"| RT
    GATE -->|"critical → skip 3-6"| SYNTH
    RT -->|"findings"| SYNTH
    SYNTH -->|"adjudicates"| VERDICT
    VERDICT -->|"emitted token"| GO_R
    VERDICT -->|"emitted token"| REV_R
    VERDICT -->|"emitted token"| STOP_R
    VERDICT -.->|"write-only"| DASH
    REV_R -.->|"write-only on REVISE"| REVGUID
    SC -->|"validates"| VERDICT
    RV -->|"semantic rule"| SC

    class PLAN cli;
    class TRIAGE,EC,HF,RT,SYNTH handler;
    class ET,DW stateNode;
    class RUBRIC,SC,RV newComponent;
    class GATE phase;
    class VERDICT output;
    class GO_R,REV_R,STOP_R phase;
    class DASH,REVGUID output;

Color Legend:

| Color | Category | Description |
|---|---|---|
| Dark Blue | Input | experiment_plan entry point |
| Orange | Handler | Subagents and synthesis processing |
| Teal | State/Data | experiment_type and dimension_weights |
| Purple | Gate/Phase | Fail-fast gate and routing decisions |
| Green | Contract/Repair | ● Repaired rubric, contracts, rules |
| Dark Teal | Output/Artifact | verdict token, dashboards, guidance |

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 55, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    START([experiment_plan_path])

    subgraph Triage ["Step 1: Triage Dispatcher"]
        direction TB
        TRIAGENODE["Triage Subagent<br/>━━━━━━━━━━<br/>classifies experiment type<br/>builds dimension weight matrix"]
        ET_VAL{"experiment_type valid?<br/>━━━━━━━━━━<br/>5-value enum check"}
    end

    subgraph L1 ["● Step 2: L1 Analysis — repaired data flow (SKILL.md)"]
        direction TB
        EC["estimand_clarity<br/>━━━━━━━━━━<br/>● receives: plan + experiment_type<br/>● + severity calibration rubric"]
        HF["hypothesis_falsifiability<br/>━━━━━━━━━━<br/>● receives: plan + experiment_type<br/>● + severity calibration rubric"]
        L1GATE{{"Fail-fast Gate<br/>━━━━━━━━━━<br/>any critical?"}}
    end

    subgraph Deep ["● Steps 3–6: Deep Analysis (SKILL.md)"]
        direction TB
        REDTEAM["● Red-team (L2)<br/>━━━━━━━━━━<br/>● explicit inputs: plan + experiment_type<br/>type-specific adversarial focus"]
        L3L4["● L3 / L4 Subagents<br/>━━━━━━━━━━<br/>● explicit inputs: plan + experiment_type<br/>dimension_weight-gated spawning"]
    end

    subgraph Synth ["Step 7: Synthesis"]
        direction TB
        STOPCHK{"stop_triggers?<br/>━━━━━━━━━━<br/>critical in {estimand_clarity,<br/>hypothesis_falsifiability, red_team}"}
        CRITCHK{"critical or warnings≥3?"}
        VERDICT["verdict<br/>━━━━━━━━━━<br/>GO | REVISE | STOP"]
    end

    subgraph Recipe ["● research.yaml on_result routing"]
        direction TB
        GO_BRANCH["GO<br/>━━━━━━━━━━<br/>→ create_worktree"]
        REVISE_BRANCH["REVISE<br/>━━━━━━━━━━<br/>→ revise_design → plan_experiment"]
        STOP_BRANCH["STOP<br/>━━━━━━━━━━<br/>→ design_rejected (action: stop)"]
    end

    HALT([HARD HALT])
    PROCEED([PROCEED TO EXECUTION])
    LOOP(["LOOP back (retries: 2 max = 3 cycles)"])

    subgraph Validation ["● Contract Validation Layer"]
        direction LR
        SCON["● skill_contracts.yaml<br/>━━━━━━━━━━<br/>allowed_values enforcement<br/>verdict + experiment_type"]
        VRULE["● rules_verdict.py<br/>━━━━━━━━━━<br/>unrouted-verdict-value rule<br/>skips context-forward captures"]
    end

    %% MAIN FLOW %%
    START --> TRIAGENODE
    TRIAGENODE --> ET_VAL
    ET_VAL -->|"valid / default to exploratory"| EC
    ET_VAL -->|"valid / default to exploratory"| HF
    EC --> L1GATE
    HF --> L1GATE
    L1GATE -->|"no critical → proceed"| REDTEAM
    L1GATE -->|"critical → skip 3-6"| STOPCHK
    REDTEAM --> L3L4
    L3L4 --> STOPCHK
    STOPCHK -->|"stop_triggers present"| VERDICT
    STOPCHK -->|"no stop_triggers"| CRITCHK
    CRITCHK -->|"critical or warnings≥3"| VERDICT
    CRITCHK -->|"clean"| VERDICT
    VERDICT -->|"GO"| GO_BRANCH
    VERDICT -->|"REVISE"| REVISE_BRANCH
    VERDICT -->|"STOP"| STOP_BRANCH
    GO_BRANCH --> PROCEED
    REVISE_BRANCH --> LOOP
    STOP_BRANCH --> HALT

    SCON -.->|"validates"| VERDICT
    VRULE -.->|"audits routing"| SCON

    %% CLASS ASSIGNMENTS %%
    class START terminal;
    class HALT,PROCEED,LOOP terminal;
    class TRIAGENODE,EC,HF,REDTEAM,L3L4 handler;
    class ET_VAL,L1GATE,STOPCHK,CRITCHK stateNode;
    class VERDICT output;
    class GO_BRANCH,REVISE_BRANCH,STOP_BRANCH phase;
    class SCON,VRULE newComponent;

Color Legend:

| Color | Category | Description |
|---|---|---|
| Dark Blue | Terminal | Entry, halt, proceed, loop states |
| Orange | Handler | Subagent execution nodes |
| Teal | Decision | Gate and routing decisions |
| Purple | Phase | Recipe routing branches |
| Green | Validation | ● Contract enforcement layer |
| Dark Teal | Output | verdict token |

Closes #599

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/remediation-20260404-092701-456719/.autoskillit/temp/rectify/rectify_review-design-l1-severity-calibration_2026-04-04_094500_part_a.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

| Step | input | output | cached | count | time |
|---|---|---|---|---|---|
| investigate | 23 | 7.9k | 389.1k | 1 | 5m 55s |
| rectify | 5.2k | 21.5k | 826.0k | 1 | 10m 58s |
| dry_walkthrough | 55 | 33.7k | 2.1M | 3 | 13m 39s |
| implement | 125 | 51.1k | 6.1M | 3 | 22m 8s |
| assess | 45 | 10.4k | 1.1M | 2 | 8m 47s |
| audit_impl | 58 | 32.3k | 1.3M | 3 | 13m 16s |
| make_plan | 25 | 5.0k | 560.4k | 1 | 2m 34s |
| open_pr | 30 | 20.9k | 1.1M | 1 | 7m 38s |
| **Total** | 5.6k | 182.8k | 13.4M | | 1h 24m |

Trecek and others added 6 commits April 4, 2026 10:20
Replace misleading phrase 'regardless of triage output' with accurate
wording: 'severity thresholds are calibrated per experiment_type via
the rubric below'. H-weight is constant; thresholds are not.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds 6 new tests (5 currently failing, 1 immediately passing) that encode
the structural contracts required by the review-design L1 severity calibration
repair:

- test_l1_subagents_receive_experiment_type: Step 2 must explicitly list
  experiment_type as a subagent input (was absent, causing false STOP verdicts)
- test_l1_severity_calibration_rubric_present: Step 2 must contain a
  calibration rubric table mapping experiment_type to severity per dimension
- test_design_rejected_step_is_action_stop: regression guard for design_rejected
  action=stop with non-empty message referencing STOP verdict
- test_kitchen_rule_6_acknowledges_stop_verdict_exception: rule #6's blanket
  'do not stop' must carve out the STOP verdict exception
- test_skill_contracts_pattern_examples_use_valid_experiment_types: pattern_examples
  must not use invalid experiment_type values (e.g. 'controlled')
- test_review_design_experiment_type_output_has_allowed_values: experiment_type
  output must have an allowed_values constraint in skill_contracts.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three interconnected fixes for the false-STOP verdict regression on
benchmark/exploratory plans:

1. SKILL.md Step 2 — add explicit subagent inputs and calibration rubric
   - L1 subagents now explicitly receive experiment_type from Step 1 triage
   - Severity calibration rubric table maps experiment_type to severity per
     L1 dimension (causal_inference → critical; benchmark/config/robust →
     warning; exploratory → info)
   - Both subagent descriptions updated with type-specific severity guidance

2. research.yaml kitchen rule #6 — scope 'do not stop' to exhaustion/failure
   - Adds explicit STOP verdict exception clause so the blanket 'do not stop'
     language no longer contradicts the design_rejected hard halt

3. skill_contracts.yaml review-design outputs
   - experiment_type now has allowed_values constraint (benchmark,
     configuration_study, causal_inference, robustness_audit, exploratory)
   - pattern_examples: replaced invalid 'controlled' with 'benchmark'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
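The rubric described in this commit could be encoded as a simple lookup. This is an illustrative sketch, not the repository's implementation: the dict name, the fallback policy, and the per-dimension shape are assumptions layered on the severities the commit message states.

```python
# Severity floors per experiment_type for the two L1 dimensions, mirroring the
# rubric described above. Names and structure are hypothetical.
SEVERITY_FLOOR = {
    "causal_inference":    {"estimand_clarity": "critical", "hypothesis_falsifiability": "critical"},
    "benchmark":           {"estimand_clarity": "warning",  "hypothesis_falsifiability": "warning"},
    "configuration_study": {"estimand_clarity": "warning",  "hypothesis_falsifiability": "warning"},
    "robustness_audit":    {"estimand_clarity": "warning",  "hypothesis_falsifiability": "warning"},
    "exploratory":         {"estimand_clarity": "info",     "hypothesis_falsifiability": "info"},
}

def calibrated_severity(experiment_type: str, dimension: str) -> str:
    # Unknown types fall back to the most permissive floor rather than the
    # most restrictive one, which is the opposite of the bug this PR fixes.
    floor = SEVERITY_FLOOR.get(experiment_type, SEVERITY_FLOOR["exploratory"])
    return floor[dimension]
```

The fallback choice matters: defaulting an unrecognized type to `critical` is exactly what produced the false STOP verdicts on benchmark plans.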
The rule fired for experiment_type after allowed_values was added to its
contract output, demanding explicit on_result routing for all 5 experiment
types. experiment_type is captured for context forwarding, not routing.
Added a guard to skip the check for any output not referenced in any
on_result when condition.
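The guard could look something like the following. This is a hedged sketch under an assumed recipe schema (`steps`, `on_result`, `when` keys); the real rules_verdict.py may structure the recipe differently.

```python
# Sketch of the guard: the unrouted-verdict-value rule should only fire for
# outputs that some on_result 'when' condition actually references.
def routed_outputs(recipe: dict) -> set[str]:
    """Collect every token referenced by any on_result when-condition."""
    referenced = set()
    for step in recipe.get("steps", []):
        for branch in step.get("on_result", []):
            when = branch.get("when", "")
            for token in when.replace("==", " ").split():
                referenced.add(token.strip("'\""))
    return referenced

def should_check_unrouted_values(output_name: str, recipe: dict) -> bool:
    # Outputs captured only for context forwarding (e.g. experiment_type)
    # are never routed on, so the rule must not demand routing for them.
    return output_name in routed_outputs(recipe)
```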
…t coverage

Part B immunity tests:
- B-1 (parameterized): all subagent-spawning steps (2–5) must declare explicit
  inputs using 'Receives:', 'Inputs:', or 'Each subagent receives:' pattern
- B-2 (parameterized): Steps 4 and 5 must explicitly receive experiment_type
- B-3: unrouted-verdict-value rule must not fire for review_design in research.yaml

All 5 B-1/B-2 cases fail before SKILL.md implementation (Steps 3–5 lack
canonical input declarations). B-3 passes immediately (all three verdict
values are already explicitly routed in research.yaml).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
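The B-1 phrase check described above could be reduced to a single pattern. This is a hypothetical sketch of the acceptance logic only; the regex and helper name are assumptions, and the real parameterized tests wrap this in pytest.

```python
import re

# Canonical input-declaration forms named in the B-1 description; the helper
# itself is illustrative, not the repository's code.
INPUT_DECLARATION = re.compile(
    r"(?:^|\n)\s*(?:receives:|inputs:|each [\w ]*subagent receives:)",
    re.IGNORECASE,
)

def declares_explicit_inputs(section: str) -> bool:
    """True if a step section uses one of the canonical input-declaration forms."""
    return INPUT_DECLARATION.search(section) is not None
```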
Extends the explicit context pattern from Step 2 (Part A) to all remaining
subagent-spawning steps in review-design SKILL.md:

- Step 3 (red-team): canonical 'Receives:' form with colon
- Step 4 (L3): 'Each L3 subagent receives:' block with experiment_type and
  L1/L2 findings context; notes that exploratory plans do not require
  pre-registered correction procedures while causal_inference demands formal
  power analysis
- Step 5 (L4): 'Each L4 subagent receives:' block with experiment_type and
  dimension_weights; notes that ecological_validity and reproducibility_spec
  rigor expectations scale with experiment_type

Together with Part A, every subagent step now has a tested explicit input
specification. Any future regression (removing experiment_type from any
subagent context) immediately fails a parameterized contract test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@Trecek Trecek left a comment


AutoSkillit PR Review — Verdict: changes_requested


```python
def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:
    """Extract SKILL.md text between two headings (start inclusive, end exclusive)."""
    pattern = re.escape(start_heading) + r".*?(?=" + re.escape(end_heading) + r")"
```

[warning] tests: skill_text_between uses a greedy dotall regex (.*?) with a lookahead for end_heading. If the end_heading appears multiple times or section order changes, the match silently covers the wrong range. Use a non-greedy match anchored to the next same-level heading boundary, or assert the matched range is non-empty.


Investigated — this is intentional. The regex uses .*? (non-greedy), not greedy — the reviewer's 'greedy dotall' framing is factually incorrect. SKILL.md has exactly one occurrence of each '### Step N' heading, so the lookahead terminates unambiguously. The helper already asserts a non-None match (line 194) which fails loudly if the section is absent. No change warranted. (commit 2c9eb89)
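The zero-width behaviour the response relies on can be demonstrated in a few lines; this is a standalone illustration, not code from the PR.

```python
import re

# A lookahead (?=...) is zero-width: group(0) contains only the consumed text,
# never the end heading the lookahead matched against.
text = "### Step 2\nbody of step 2\n### Step 3\nbody of step 3"
m = re.search(re.escape("### Step 2") + r".*?(?=### Step 3)", text, re.DOTALL)
assert m is not None
assert m.group(0) == "### Step 2\nbody of step 2\n"
assert "### Step 3" not in m.group(0)
```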

```python
    assert "critical" in step2_lower and "warning" in step2_lower, (
        "Step 2 must define what constitutes critical vs warning per experiment type."
    )
```


[warning] tests: The phrase list ["receives:", "inputs:", "each subagent receives", "each ... receives"] is fragile: any legitimate rephrasing not in this list causes a false failure. This assertion checks prose style rather than a structural contract. Consider asserting the presence of specific variable names (e.g. 'experiment_type') instead.


Investigated — this is intentional. The phrase list is a canonical structural-form guard, not an arbitrary prose check. The test docstring explicitly calls this the 'established pattern (from Step 3 red-team)' — it enforces input-declaration form across all subagent steps. Switching to variable-name assertions (e.g. 'experiment_type') would conflate this with the separate test_l1_subagents_receive_experiment_type test. All four step ranges match the pattern and tests pass. No change warranted. (commit 8764aa4)

```python
    The current blanket 'do not stop' language contradicts the design_rejected
    hard halt. The rule must explicitly note that verdict=STOP is the sole exception.
    """
    rule6 = recipe.kitchen_rules[5]  # 0-indexed; rule #6 is index 5
```

[warning] tests: Hard-coded kitchen_rules[5] index will silently audit the wrong rule if any rule is inserted before index 5. Locate the relevant rule by content substring match rather than positional index.


Valid observation — flagged for design decision. The test deliberately uses a documented index (# 0-indexed; rule #6 is index 5) and the kitchen_rules list currently has 8 entries. The positional approach is a tradeoff: a rule inserted before index 5 could silently audit the wrong rule if it coincidentally contains 'stop' AND 'verdict'/'exception'. Switching to a content substring match (e.g. next(r for r in recipe.kitchen_rules if 'DESIGN REVIEW LOOP' in r)) would be more robust. Flagged for human decision on whether to harden. (commit 2c9eb89)
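The content-substring alternative mentioned above could be hardened slightly by also asserting uniqueness. A hypothetical sketch, with an invented rule list; 'DESIGN REVIEW LOOP' is the marker the response suggests, not a confirmed string from research.yaml.

```python
def find_rule(kitchen_rules: list[str], marker: str) -> str:
    """Locate a kitchen rule by content marker; fail loudly unless exactly one matches."""
    matches = [r for r in kitchen_rules if marker in r]
    assert len(matches) == 1, f"expected exactly one rule containing {marker!r}"
    return matches[0]

# Invented example data: positions shift, but the content match still lands.
rules = [
    "rule one", "a new rule inserted later", "rule two", "rule three", "rule four",
    "DESIGN REVIEW LOOP: do not stop, except on verdict=STOP",
]
rule6 = find_rule(rules, "DESIGN REVIEW LOOP")
```

Unlike `kitchen_rules[5]`, this survives insertions before the rule and fails the test (rather than auditing the wrong rule) if the marker ever becomes ambiguous.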

```python
# ── Section-scoped helpers ────────────────────────────────────────────────────


def skill_text_between(start_heading: str, end_heading: str, text: str) -> str:
```

[warning] defense: skill_text_between returns m.group(0) which is the full lookahead match. If the section is empty the caller receives an empty string with no guard — callers like test_l1_subagents_receive_experiment_type then assert keyword presence in an empty string and fail with an uninformative message. Add a non-empty assertion in the helper or document the empty-string contract.


Investigated — this is intentional. The concern is based on a misunderstanding of how lookahead assertions work: re.search with (?=end_heading) returns only the consumed portion (up to but not including the lookahead), never the lookahead text itself. The assert m on line 194 guards against a missing section. Against the actual SKILL.md, all four step ranges return non-empty strings. The 'uninformative message' failure mode does not exist in practice. No change warranted. (commit 2c9eb89)

```
2–4 subagents. Only spawn subagents for dimensions with weight ≥ L in the matrix.
SILENT (S) dimensions are NOT spawned and NOT mentioned in output.

Each L4 subagent receives:
```

[warning] cohesion: L4 input spec includes dimension_weights but L1 and L3 input specs do not declare whether they receive or ignore this field. If L4 uses dimension_weights to calibrate thresholds, L3 (also dimension-scoped) should declare its position on dimension_weights for a consistent cross-level input contract.


Valid observation — flagged for design decision. L4 input spec explicitly passes dimension_weights (added in commit a5913f3) so H-weight dimensions apply stricter thresholds. L3 dimensions (error_budget, statistical_corrections, variance_protocol) are unconditionally spawned with no SILENT/weight-gating logic, so dimension_weights may carry less interpretive value there — but this asymmetry is undocumented. Flagged for deliberate decision: either add dimension_weights to L3 spec (with rationale) or document the intentional omission. (commit a5913f3)

```
- `estimand_clarity` agent: "Can the claim be written as a formal contrast (A vs B on Y in Z)?"
  Reference the exp-lens-estimand-clarity philosophical mode as guidance (do NOT invoke
  the skill — reference its lens question only in the subagent prompt).
  Use the calibration rubric above to assign severity. For `causal_inference`: absent formal
```

[warning] slop: Lines L164-L172 repeat the rubric table content (L156-L159) in prose inline with agent instructions. The table already encodes the same severity mappings; the prose expansions are redundant restatements that duplicate the table rather than adding information.


Investigated — this is intentional. The rubric table (lines 156-159) encodes severity values only. Lines 164-172 embed agent-specific interpretive rationale that the table cannot encode: 'informal contrast sufficient' (estimand_clarity / benchmark), 'intentionally absent' (exploratory), 'pre-registration not required' (hypothesis_falsifiability / exploratory). This prose explains why the severity is what it is and provides operational grounding that prevents subagents from misapplying the table to borderline cases. It was added in commit 8cec605 specifically to fix a false-STOP verdict regression. Removing it would reintroduce the same ambiguity. No change warranted. (commit 8cec605)

@Trecek Trecek added this pull request to the merge queue Apr 4, 2026
Merged via the queue into integration with commit 6f0ab32 Apr 4, 2026
2 checks passed
@Trecek Trecek deleted the review-design-l1-subagents-lack-experiment-type-severity-cal/599 branch April 4, 2026 18:53