eval: Evaluation #0016 -- score 86/100, eval gate clear #198

Merged
fazxes merged 1 commit into main from feat/eval-rerun-0016 on Apr 8, 2026


fazxes commented Apr 8, 2026

Summary

Score Delta

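| Dimension | #15 | #16 | Change |
|---|---|---|---|
| Guard rails | 4 | 9 | +5 |
| Shift log | 0 | 9 | +9 |
| Fix quality | 3 | 8 | +5 |
| Discovery | 5 | 8 | +3 |
| Usefulness | 4 | 8 | +4 |
| Clean state | 6 | 9 | +3 |
| State file | 7 | 9 | +2 |
| Breadth | 6 | 7 | +1 |
| Startup | 8 | 9 | +1 |
| Verification | 10 | 10 | 0 |
| Total | 53 | 86 | +33 |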
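
As a quick sanity check on the table, the totals and per-dimension deltas can be recomputed from the raw scores. This is a minimal sketch, not the project's actual scoring code: the names `SCORES`, `GATE_THRESHOLD`, `totals`, and `gate_clear` are hypothetical, and the threshold of 80 is taken from the commit message ("BUILD EVAL GATE threshold of >= 80").

```python
# Scores copied from the Score Delta table: dimension -> (eval #15, eval #16).
SCORES = {
    "Guard rails": (4, 9),
    "Shift log": (0, 9),
    "Fix quality": (3, 8),
    "Discovery": (5, 8),
    "Usefulness": (4, 8),
    "Clean state": (6, 9),
    "State file": (7, 9),
    "Breadth": (6, 7),
    "Startup": (8, 9),
    "Verification": (10, 10),
}

# Assumption: gate value as quoted in the commit message (>= 80 passes).
GATE_THRESHOLD = 80


def totals(scores):
    """Return (total for eval #15, total for eval #16)."""
    before = sum(old for old, _ in scores.values())
    after = sum(new for _, new in scores.values())
    return before, after


def gate_clear(total, threshold=GATE_THRESHOLD):
    """True when a total score meets or exceeds the gate threshold."""
    return total >= threshold


before, after = totals(SCORES)
deltas = {name: new - old for name, (old, new) in SCORES.items()}

print(f"eval #15: {before}, eval #16: {after}, change: {after - before:+d}")
print(f"gate clear for #16: {gate_clear(after)}")
```

Keeping the per-dimension scores as the single source of truth means the Total row and the gate decision are derived rather than typed by hand, so a transcription error in one dimension would show up as a mismatch.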

Root Cause

The +33 improvement traces directly to the three PRs that landed after eval #15:

Test Plan

  • make check passes (882 tests, ruff, mypy, dry-runs, ASCII check all green)
  • Evaluation ran with no manual overrides (prescribed default command)
  • Eval report at .recursive/evaluations/0016.md scores all 10 dimensions with raw evidence
  • Task #177 (docs: add follow-up tasks #0190 #0191 from PR #176 review) marked done with completion date

…clear

Two-cycle test run against Phractal confirms the three merged PRs fixed
the scored failures from eval #15. Both cycles accepted (no false
rejections), shift log persisted, counters positive. Score 86/100
exceeds the BUILD EVAL GATE threshold of >= 80.

- Guard rails: 4 -> 9 (PR #167 count-only payload fix working)
- Shift log: 0 -> 9 (durable artifact written and co-committed)
- Fix quality: 3 -> 8 (full metadata in accepted cycles)
- Discovery: 5 -> 8 (structured output preserved faithfully)
fazxes merged commit b6254e8 into main Apr 8, 2026
7 checks passed
fazxes added a commit that referenced this pull request Apr 8, 2026
Eval gate cleared: #177 done (PR #198), score 86/100.
ROLE-SCORING.md rewritten for v2: #203 done (PR #197).
Follow-up task #207 created (eval Breadth improvement).
fazxes deleted the feat/eval-rerun-0016 branch April 8, 2026 23:29