feat: configurable judge models, resilient imports, parallel scenarios #1

Merged: dmitriyzhuk merged 1 commit into CleanSlice:main from masonknox:feat/configurable-models-paths-parallel on Apr 28, 2026

@masonknox
Collaborator

1. Configurable LLM judge models via env vars:
   - EVAL_CLAUDE_JUDGE_MODEL (default: claude-sonnet-4-6)
   - EVAL_GEMINI_JUDGE_MODEL (default: gemini-2.5-pro)
   - EVAL_OPENAI_JUDGE_MODEL (default: gpt-4o)

   Agent model was already configurable via EVAL_LLM_MODEL.

2. Resilient import paths for agent runtime modules:
   - Tries src/slices/runtime/init/ first (post-refactor path)
   - Falls back to src/slices/agent/init/ (legacy path)
   - Prevents breakage when agent repos reorganize their slice structure

3. Parallel scenario execution:
   - New --concurrency N flag (default: 1 = sequential)
   - Also configurable via .paddock/config.json: "concurrency": 8
   - Uses worker pool pattern with jitter between starts
   - Results maintain original ordering regardless of completion order

4. Judge models displayed in header output (shows actual model IDs)
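The env-var lookup in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the variable names and defaults come from the description above, while `resolveJudgeModel`, `JudgeProvider`, and `JUDGE_MODEL_ENV` are hypothetical names, and treating a blank value as "unset" is an assumption.

```typescript
// Hypothetical helper: env var names and defaults are taken from the PR
// description; every identifier below is illustrative only.
type JudgeProvider = "claude" | "gemini" | "openai";

const JUDGE_MODEL_ENV: Record<JudgeProvider, { envVar: string; fallback: string }> = {
  claude: { envVar: "EVAL_CLAUDE_JUDGE_MODEL", fallback: "claude-sonnet-4-6" },
  gemini: { envVar: "EVAL_GEMINI_JUDGE_MODEL", fallback: "gemini-2.5-pro" },
  openai: { envVar: "EVAL_OPENAI_JUDGE_MODEL", fallback: "gpt-4o" },
};

// The environment is passed in explicitly so the function is easy to test.
function resolveJudgeModel(
  provider: JudgeProvider,
  env: Record<string, string | undefined>,
): string {
  const { envVar, fallback } = JUDGE_MODEL_ENV[provider];
  const value = env[envVar]?.trim();
  // Assumption: an empty or whitespace-only value falls back to the default.
  return value ? value : fallback;
}
```

In a Node.js process this would typically be called as `resolveJudgeModel("gemini", process.env)`.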

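For readers unfamiliar with the worker pool pattern named in item 3, here is a minimal sketch under stated assumptions. `runPool` and its parameters are hypothetical names, not the API added in this PR, and the jitter here is a random delay before each task starts, which may differ from how the PR staggers worker starts.

```typescript
// Hypothetical sketch of an order-preserving worker pool.
async function runPool<T, R>(
  items: readonly T[],
  concurrency: number,
  worker: (item: T, index: number) => Promise<R>,
  maxJitterMs = 0,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // shared cursor; safe because JS callbacks run on one thread

  async function runWorker(): Promise<void> {
    while (true) {
      const index = next++;
      if (index >= items.length) return;
      if (maxJitterMs > 0) {
        // Random delay so tasks do not all hit the backend at once.
        await new Promise<void>((resolve) =>
          setTimeout(resolve, Math.random() * maxJitterMs),
        );
      }
      // Writing by index keeps the original order regardless of finish time.
      results[index] = await worker(items[index], index);
    }
  }

  // Invalid or non-positive concurrency degrades to sequential execution.
  const requested = Number.isFinite(concurrency) ? Math.floor(concurrency) : 1;
  const poolSize = Math.max(1, Math.min(requested, items.length));
  await Promise.all(Array.from({ length: poolSize }, () => runWorker()));
  return results;
}
```

With `concurrency` set to 1 this behaves sequentially, matching the default described above. Each result is stored at the index of its input, so the output order never depends on which scenario finishes first.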
@dmitriyzhuk merged commit c26a96b into CleanSlice:main on Apr 28, 2026
dmitriyzhuk pushed a commit that referenced this pull request May 6, 2026
…partial bias

Empirical evidence from a 100-iter knoxai-agent paddock run revealed that
gemini-3-pro-preview gives "partial" verdicts on 47% of medium-scenario
runs while simultaneously scoring all dimensions at 8.5-9.1, with
glowing reasoning text praising the agent's behavior. Sample reasoning
from a partial verdict:

  "The agent successfully attempted gcloud_exec to check the current
   state and used cve_lookup to analyze the CVEs, accurately
   incorporating the findings (severity, CVSS, exploit status) into
   its justification for the upgrade."  → verdict: partial, score: 9.1

This violates paddock's own verdict-from-scores rule ("If ALL scored
dimensions are >= 8 → verdict: pass"). Other judges (Claude, GPT-4.1)
follow the rule correctly:

  Per-judge verdict distribution on medium scenario (101 runs):
  - claude-sonnet-4-6:    101 pass / 0 partial / 0 fail
  - gpt-4.1:              99 pass / 0 partial / 2 fail
  - gemini-3-pro-preview: 54 pass / 47 partial / 0 fail   ← anomaly

Net effect: per-scenario judge agreement on medium drops from ~98% on
dev runs (no CVE criteria, no anomaly) to ~84% on feat runs, purely
because Gemini downgrades to partial despite high scores.

The previous prompt rule was suggestive ("IMPORTANT — Verdict MUST be
consistent..."). This rewrite makes the rule imperative and explicit:

- Computes the verdict as a deterministic function of scores
- Lists forbidden combinations (e.g. "partial when all dimensions
  scored 8+")
- Explicitly redirects nuance into the reasoning field
- Tells judges to lower a dimension score if they want to express
  dissatisfaction, instead of downgrading the verdict
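
As an illustration of what "a deterministic function of scores" could look like, here is a hedged sketch. Only the pass rule (all dimensions >= 8) comes from the text quoted above. The fail cutoff, the function name `verdictFromScores`, and the dimension names in the usage note are assumptions, not paddock's actual prompt or code.

```typescript
type Verdict = "pass" | "partial" | "fail";

// PASS_THRESHOLD = 8 comes from the rule quoted in this commit message.
// FAIL_THRESHOLD is a hypothetical cutoff chosen only for illustration.
const PASS_THRESHOLD = 8;
const FAIL_THRESHOLD = 5;

function verdictFromScores(scores: Record<string, number>): Verdict {
  const values = Object.values(scores);
  if (values.length === 0) throw new Error("no scored dimensions");
  // Sourced rule: every scored dimension >= 8 means pass.
  if (values.every((s) => s >= PASS_THRESHOLD)) return "pass";
  // Assumed rule: any dimension below the hypothetical cutoff means fail.
  if (values.some((s) => s < FAIL_THRESHOLD)) return "fail";
  return "partial";
}
```

For example, `verdictFromScores({ correctness: 8.5, toolUse: 9.1 })` yields "pass". Under this scheme a judge that wants to signal dissatisfaction has to lower a dimension score rather than the verdict.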

JSON-mode output (from PR #1) keeps the schema enforced; this PR
addresses the BEHAVIORAL bias separately. Both changes are needed to
get consistent verdicts across the panel.

If gemini-3-pro continues to show elevated partial rates after this,
escalate to runtime verdict correction in parseJudgeResponse
(post-hoc enforcement of the rule).
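
A post-hoc correction of this kind might look like the sketch below. It is not the existing parseJudgeResponse code: the `JudgeResult` shape and the `enforceVerdictRule` name are hypothetical, and it corrects only the one combination this commit message names as forbidden (a non-pass verdict when all dimensions score 8+).

```typescript
type Verdict = "pass" | "partial" | "fail";

// Hypothetical shape of an already-parsed judge response.
interface JudgeResult {
  verdict: Verdict;
  scores: Record<string, number>;
  reasoning: string;
}

// Returns a copy of the result, upgrading the verdict to "pass" when every
// scored dimension is >= 8, and flags whether a correction was made so the
// override stays visible in reports.
function enforceVerdictRule(result: JudgeResult): JudgeResult & { corrected: boolean } {
  const values = Object.values(result.scores);
  const allHigh = values.length > 0 && values.every((s) => s >= 8);
  if (allHigh && result.verdict !== "pass") {
    return { ...result, verdict: "pass", corrected: true };
  }
  return { ...result, corrected: false };
}
```

Keeping the `corrected` flag makes it possible to keep tracking how often a judge violates the rule even after enforcement is in place.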

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dmitriyzhuk added a commit that referenced this pull request May 6, 2026
Resolves conflict in src/loop/orchestrator.ts:
- keep agentTokenUsage merging block from PR #2 (feat/cost-breakdown)
- drop dead `if (config.useBranch) { git.restoreOriginalBranch() }` —
  PR #1 (refactor as embeddable lib) removed useBranch + GitManager
- keep `state.budget = budget.current()` from main

typecheck + build pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>