Unit 6 (prompt-design): split criterion 2 (MEDIUM, preserve-by-default)#164

Merged
LuminLynx merged 1 commit into main from claude/unit-06-rubric-split on May 21, 2026

@LuminLynx (Owner)

Summary

Unit 6 (prompt-design), fifth of the MEDIUM batch, under preserve-by-default.

Rubric (3 → 4): c1 unchanged · c2 = name the failure mode · c3 (NEW) = explain the mechanism · c4 = regime distinction (was c3).

Preserve-by-default decomposition

Faithful decomposition of the locked Opus values:

  • Every old-c2=T pair → c2=T and c3=T.
  • p007 is the lone c2=T / c3=F differential — its own authored label says it names "instructions miss edge cases" without the ambiguous-criteria mechanism. (So unlike Unit 5, this set does test the c2-vs-c3 distinction once.)
  • p009, p011 name no failure mode → c2=F, c3=F.
  • c4 (regime) carries old-c3 unchanged.

No realignments, no judgment-flips. p007/p011 labels updated for the 4-criterion shape.
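
The decomposition rule above can be sketched in a few lines. This is an illustration only, not the repository's code: `OldLabel`, `NewLabel`, `split_label`, and the `names_failure_mode` flag are hypothetical names, and the sample label values are made up. It also assumes that p007's old c2 was F, which the "every old-c2=T pair" rule implies but the description does not state outright.

```python
from typing import NamedTuple


class OldLabel(NamedTuple):
    """Hypothetical 3-criterion label (pre-split)."""
    c1: bool
    c2: bool  # bundled: names a failure mode AND explains the mechanism
    c3: bool  # regime distinction


class NewLabel(NamedTuple):
    """Hypothetical 4-criterion label (post-split)."""
    c1: bool
    c2: bool  # names the failure mode
    c3: bool  # explains the mechanism (new)
    c4: bool  # regime distinction (was c3)


def split_label(old: OldLabel, names_failure_mode: bool = False) -> NewLabel:
    """Decompose old c2 without changing any existing judgment.

    names_failure_mode is read from the pair's authored label and only
    matters when old c2 is False (assumption: this is the p007 case).
    """
    if old.c2:
        # Old c2 passed the bundled test, so both halves pass.
        c2, c3 = True, True
    else:
        # The bundle failed: the mechanism is missing, but the failure
        # mode may still have been named.
        c2, c3 = names_failure_mode, False
    return NewLabel(c1=old.c1, c2=c2, c3=c3, c4=old.c3)


# Illustrative inputs only; c1 and regime values are invented.
print(split_label(OldLabel(True, True, True)))
print(split_label(OldLabel(True, False, True), names_failure_mode=True))
print(split_label(OldLabel(True, False, False)))
```

Because c1 and the regime criterion pass through untouched, the only new information the split needs is whether a pair that failed the bundled c2 still named a failure mode.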

Post-split distribution (21 pairs)

8 × 4-of-4 · 3 × 3-of-4 (p006, p007-differential, p008) · 2 × 2-of-4 (p010, p011) · 1 × 1-of-4 (p009) · 5 on-topic-all-missed · 2 off-topic.
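
As a quick sanity check, the bucket counts above should sum to the 21 pairs in the set. A minimal sketch, with the counts copied from the distribution line and hypothetical variable names:

```python
EXPECTED_PAIRS = 21

# Bucket sizes as listed in the post-split distribution.
distribution = {
    "4-of-4": 8,
    "3-of-4": 3,   # p006, p007 (differential), p008
    "2-of-4": 2,   # p010, p011
    "1-of-4": 1,   # p009
    "on-topic-all-missed": 5,
    "off-topic": 2,
}

total = sum(distribution.values())
assert total == EXPECTED_PAIRS, f"expected {EXPECTED_PAIRS} pairs, got {total}"
print(total)
```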

Note

The known-bad p018 (emoji + structured markdown + slashed percentages — reproducible grader-payload ERROR per UNIT_6_GATE.md) is unaffected by the split; it remains a documented known-bad marker.

Local validation

  • lint_unit_markdown / ingest_units --check — clean
  • run_regression_set --check — 21 pairs valid
  • pytest — 20/20

Test plan

  • Backend + Android CI green
  • Live grader gate optional (preserve-by-default — disagreements are documented, not chased). Expect the p018 ERROR and possibly grader-lenient c1 reads.

Opened as draft.


Generated by Claude Code

…-default)

Per docs/RUBRIC_AUDIT.md (MEDIUM): old c2 bundled 'names a concrete
failure mode' with 'explains the mechanism.' Splits into
name-the-failure-mode c2 and a new c3 (explain the mechanism);
renumbers regime distinction to position 4. Rubric grows 3 -> 4.

Preserve-by-default (docs/REGRESSION_GATE.md): faithful decomposition
of the locked Opus values — old-c2=T → c2=T,c3=T. p007 is the lone
c2=T/c3=F differential (its authored label says it names
'instructions miss edge cases' without the ambiguous-criteria
mechanism); p009/p011 name no failure mode → c2=F,c3=F. No
realignments, no judgment-flips. c4 (regime) carries old-c3 unchanged.
Updated p007/p011 labels for the 4-criterion shape.

Sonnet gate disagreements are documented calibration gaps, not edits
to the gold standard. Known-bad p018 ERROR pair unaffected. Local
lint, schema check, ingest-check, pytest all pass.
@LuminLynx LuminLynx marked this pull request as ready for review May 21, 2026 20:18
@LuminLynx LuminLynx merged commit 462138b into main May 21, 2026
2 checks passed