fix: pentest guard gaps + eval scorer rejected-cycle fallback #158
Merged
Security (pentest finding):
- Added scripts/watchdog.sh to PROMPT_GUARD_FILES in lib-agent.sh. watchdog.sh controls daemon restart rate-limiting and invokes daemon.sh directly, but it was absent from both the working-tree guard and the origin-integrity check.
- Added scripts/daemon-strategist.sh, scripts/daemon-review.sh, and scripts/daemon-overseer.sh to PROMPT_GUARD_FILES. These legacy scripts source lib-agent.sh and invoke agents but were unguarded.

Fix (task #102):
- score_discovery(), score_fix_quality(), and score_usefulness() in evaluation.py now fall back to nested cycle_result data for rejected cycles when the aggregate counters stayed at zero. Previously, all-rejected runs scored 0 for discovery/usefulness even when real fixes existed.
- Added _extract_cycle_fixes() and _extract_cycle_issues() helpers that transparently handle accepted (top-level fixes) and rejected (cycle_result nesting) cycle data.
- Added 4 regression tests: rejected-cycle fix counting, title-quality bonus, fix-quality scoring, and usefulness fallback.

Tests: 1016 passing (+4 new).
…llow-up tasks
- Fixed handoff 0082: added tracker delta, learnings applied, and generated tasks sections
- Fixed handoff 0082: corrected test count claim (4 tests, not 5)
- Added learning: PROMPT_GUARD_FILES must cover all agent-invoking scripts
- Created task #162: score_discovery/score_fix_quality asymmetry in mixed runs
- Created task #163: missing test for accepted cycle with empty fixes list
- Fixed changelog: corrected '5 regression tests' to '4', added [test] internal entry
fazxes added a commit that referenced this pull request on Apr 9, 2026
…one)

Queue before: 72 pending + 9 wontfix-in-active-dir
Queue after: 65 pending + 0 wontfix (all converted to done for archiving)

Merged into primary tasks (5 closures):
- #175 -> #174: both add tests to TestAuthFailureDetection, same PR
- #163 -> #162: both are scoring module tests from PR #158 review, same PR
- #124 -> #122: both validate doc snapshot consistency, same PR scope
- #196 -> #173: both add entries to PROMPT_GUARD_FILES in lib-agent.sh
- #180 -> #179: both touch _is_valid_eval_file() in pick-role.py, same PR

Closed as obsolete (1):
- #78: references non-existent "evolve.md Step 8" and the multi-agent review panel replaced by unified review in PR #107

Closed as low-value (1):
- #230: _DELEGATION_ROLE_MAP covers all 8 current agent types; new agent types require major framework work, making the map update obvious

Converted wontfix -> done for archiving (9):
- #77, #80, #107, #111, #115, #119, #127, #129, #134
  All had wontfix status with rationale already documented; changed to done so the daemon's archive_done_tasks() housekeeping removes them
Summary
- Added scripts/watchdog.sh, scripts/daemon-strategist.sh, scripts/daemon-review.sh, and scripts/daemon-overseer.sh to PROMPT_GUARD_FILES in lib-agent.sh. These scripts invoke agents or control restart rate-limiting but were absent from both the working-tree guard and the origin-integrity check, per the pre-build pentest scan.
- The eval scorers (score_discovery, score_fix_quality, score_usefulness) now fall back to nested cycle_result data for rejected cycles when aggregate counters stayed at zero. All-rejected-cycle runs no longer score 0 for discovery/usefulness when real fixes exist in the state JSON.
- Added _extract_cycle_fixes() / _extract_cycle_issues() helpers + 4 regression tests.

Test plan
- make check passes: 1016 tests, all green
- New tests: test_rejected_cycle_fixes_counted, test_rejected_cycle_with_real_title_gets_quality_points, test_rejected_cycle_fix_quality_scored, test_rejected_cycle_usefulness_counted
- The four scripts are now in PROMPT_GUARD_FILES; the common guard list will no longer show them as absent
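The guard-list item could also be checked mechanically. The sketch below is a hypothetical helper, not part of this PR: it flags any script that sources lib-agent.sh but is missing from a guard list passed in by the caller. The function name, the one-level `scripts/` layout, and the regex heuristic are assumptions. A script such as watchdog.sh, which the description says invokes daemon.sh directly, would not be caught by this heuristic and would still need to be listed by hand.

```python
import re
from pathlib import Path

# Heuristic: a line that starts with `source` or `.` and mentions lib-agent.sh.
_SOURCE_RE = re.compile(r"^\s*(?:source|\.)\s+.*lib-agent\.sh", re.MULTILINE)


def unguarded_scripts(scripts_dir: Path, guard_files: set[str]) -> list[str]:
    """List scripts that source lib-agent.sh but are absent from guard_files."""
    missing = []
    for path in sorted(scripts_dir.glob("*.sh")):
        # Guard entries are assumed to look like "scripts/<name>.sh".
        rel = f"{scripts_dir.name}/{path.name}"
        text = path.read_text(encoding="utf-8", errors="replace")
        if _SOURCE_RE.search(text) and rel not in guard_files:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Example guard list; in practice it would mirror PROMPT_GUARD_FILES.
    guard = {"scripts/daemon.sh", "scripts/watchdog.sh"}
    for name in unguarded_scripts(Path("scripts"), guard):
        print(f"unguarded: {name}")
```

Running a check like this in CI would turn the learning "PROMPT_GUARD_FILES must cover all agent-invoking scripts" into a failing test rather than a review-time reminder.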