feat: configurable judge models, resilient imports, parallel scenarios by masonknox · Pull Request #1 · CleanSlice/paddock

masonknox · 2026-04-27T20:06:09Z

Configurable LLM judge models via env vars:
- EVAL_CLAUDE_JUDGE_MODEL (default: claude-sonnet-4-6)
- EVAL_GEMINI_JUDGE_MODEL (default: gemini-2.5-pro)
- EVAL_OPENAI_JUDGE_MODEL (default: gpt-4o)
Agent model was already configurable via EVAL_LLM_MODEL.
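
The env-var lookup above could be sketched roughly as follows. The variable names and defaults are taken from this description; the `resolveJudgeModels` function, the `JudgeModels` type, and the choice to treat blank values as unset are assumptions for illustration, not paddock's actual code.

```typescript
// Hypothetical helper: the names here are illustrative, not paddock's real API.
type Env = Record<string, string | undefined>;

interface JudgeModels {
  claude: string;
  gemini: string;
  openai: string;
}

// Defaults as listed in the PR description.
const DEFAULT_JUDGE_MODELS: JudgeModels = {
  claude: "claude-sonnet-4-6",
  gemini: "gemini-2.5-pro",
  openai: "gpt-4o",
};

function resolveJudgeModels(env: Env): JudgeModels {
  // Assumption: an unset or whitespace-only value falls back to the default.
  const pick = (name: string, fallback: string): string => {
    const value = env[name]?.trim();
    return value ? value : fallback;
  };
  return {
    claude: pick("EVAL_CLAUDE_JUDGE_MODEL", DEFAULT_JUDGE_MODELS.claude),
    gemini: pick("EVAL_GEMINI_JUDGE_MODEL", DEFAULT_JUDGE_MODELS.gemini),
    openai: pick("EVAL_OPENAI_JUDGE_MODEL", DEFAULT_JUDGE_MODELS.openai),
  };
}
```

In a Node process, calling `resolveJudgeModels(process.env)` would yield the model IDs to show in the header output.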
Resilient import paths for agent runtime modules:
- Tries src/slices/runtime/init/ first (post-refactor path)
- Falls back to src/slices/agent/init/ (legacy path)
- Prevents breakage when agent repos reorganize their slice structure
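
A minimal sketch of the try-then-fall-back import described above. The two paths come from this description; `importFirstAvailable`, the `Importer` type, and the decision to fall back on any import error (rather than only "module not found") are assumptions.

```typescript
// Hypothetical loader: the importer is injected so the lookup order is easy to test.
type Importer = (path: string) => Promise<unknown>;

const RUNTIME_INIT_CANDIDATES: string[] = [
  "src/slices/runtime/init/", // post-refactor path, tried first
  "src/slices/agent/init/", // legacy path, used as fallback
];

async function importFirstAvailable(
  candidates: string[],
  importer: Importer,
): Promise<{ path: string; module: unknown }> {
  const failures: string[] = [];
  for (const path of candidates) {
    try {
      // The await sits inside the try so a rejected import is caught here.
      return { path, module: await importer(path) };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(path + ": " + reason);
    }
  }
  // Every candidate failed: report all attempts, not only the last one.
  throw new Error("No candidate could be imported:\n" + failures.join("\n"));
}
```

In a real loader the importer would typically be `(p) => import(p)`, with each candidate resolved against the agent repo root.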
Parallel scenario execution:
- New --concurrency N flag (default: 1 = sequential)
- Also configurable via .paddock/config.json: "concurrency": 8
- Uses worker pool pattern with jitter between starts
- Results maintain original ordering regardless of completion order
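
The worker-pool behavior listed above might look roughly like this. It is a sketch under assumptions: `runPool`, `sleep`, and the 250 ms jitter ceiling are hypothetical, and the jitter is applied as a random delay before each task starts.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runPool<T, R>(
  items: readonly T[],
  concurrency: number,
  task: (item: T, index: number) => Promise<R>,
  maxJitterMs = 250, // assumed value; the PR does not state the jitter range
): Promise<R[]> {
  // Results are written by input index, so output order matches input order
  // no matter which task finishes first.
  const results = new Array<R>(items.length);
  let next = 0; // shared cursor; safe because JS runs one callback at a time

  const worker = async (): Promise<void> => {
    while (true) {
      const index = next++;
      if (index >= items.length) return;
      if (maxJitterMs > 0) await sleep(Math.random() * maxJitterMs);
      results[index] = await task(items[index], index);
    }
  };

  // concurrency = 1 degenerates to sequential execution.
  const workerCount = Math.max(1, Math.min(concurrency, items.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}
```

In this sketch a rejected task rejects the whole run; a real runner would more likely record the failure as that scenario's result and continue.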
Judge models displayed in header output (shows actual model IDs)

…partial bias

Empirical evidence from a 100-iter knoxai-agent paddock run revealed gemini-3-pro-preview gives "partial" verdicts on 47% of medium-scenario runs WHILE simultaneously scoring all dimensions at 8.5-9.1 with glowing reasoning text praising the agent's behavior.

Sample reasoning on partial verdicts:
"The agent successfully attempted gcloud_exec to check the current state and used cve_lookup to analyze the CVEs, accurately incorporating the findings (severity, CVSS, exploit status) into its justification for the upgrade." → verdict: partial, score: 9.1

This violates paddock's own verdict-from-scores rule ("If ALL scored dimensions are >= 8 → verdict: pass"). Other judges (Claude, GPT-4.1) follow the rule correctly:

Per-judge verdict distribution on medium scenario (101 runs):
- claude-sonnet-4-6: 101 pass / 0 partial / 0 fail
- gpt-4.1: 99 pass / 0 partial / 2 fail
- gemini-3-pro-preview: 54 pass / 47 partial / 0 fail ← anomaly

Net effect: per-scenario judge agreement on medium drops from ~98% on dev runs (no CVE criteria, no anomaly) to ~84% on feat runs, purely because Gemini downgrades to partial despite high scores.

The previous prompt rule was suggestive ("IMPORTANT — Verdict MUST be consistent..."). This rewrite makes the rule imperative and explicit:
- Computes the verdict as a deterministic function of scores
- Lists forbidden combinations (e.g. "partial when all dimensions scored 8+")
- Explicitly redirects nuance into the reasoning field
- Tells judges to lower a dimension score if they want to express dissatisfaction, instead of downgrading the verdict

JSON-mode output (from PR #1) keeps the schema enforced; this PR addresses the BEHAVIORAL bias separately. Both changes are needed to get consistent verdicts across the panel.

If gemini-3-pro continues to show elevated partial rates after this, escalate to runtime verdict correction in parseJudgeResponse (post-hoc enforcement of the rule).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
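
The post-hoc enforcement mentioned above as an escalation path could be sketched as follows. Only the one rule quoted in this message ("all scored dimensions >= 8 → pass") is encoded; the `JudgeResult` shape and the `enforcePassRule` name are assumptions, and the thresholds for partial and fail are left untouched because they are not stated here.

```typescript
type Verdict = "pass" | "partial" | "fail";

// Assumed shape of a parsed judge response; the real schema may differ.
interface JudgeResult {
  verdict: Verdict;
  scores: Record<string, number>;
  reasoning: string;
}

const PASS_THRESHOLD = 8; // from the quoted rule

function enforcePassRule(result: JudgeResult): JudgeResult & { corrected: boolean } {
  const dimensions = Object.keys(result.scores);
  // Require at least one scored dimension so an empty score map is never auto-passed.
  const allHigh =
    dimensions.length > 0 &&
    dimensions.every((name) => result.scores[name] >= PASS_THRESHOLD);

  if (allHigh && result.verdict !== "pass") {
    // The judge contradicted its own scores: override and flag the correction.
    return { ...result, verdict: "pass", corrected: true };
  }
  return { ...result, corrected: false };
}
```

Keeping a `corrected` flag on the result would let the per-judge override rate be tracked across runs, which is the same signal as the partial-rate anomaly reported above.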

Resolves conflict in src/loop/orchestrator.ts:
- keep agentTokenUsage merging block from PR #2 (feat/cost-breakdown)
- drop dead `if (config.useBranch) { git.restoreOriginalBranch() }`, because PR #1 (refactor as embeddable lib) removed useBranch + GitManager
- keep `state.budget = budget.current()` from main

typecheck + build pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dmitriyzhuk merged commit c26a96b into CleanSlice:main Apr 28, 2026

dmitriyzhuk merged 1 commit into CleanSlice:main from masonknox:feat/configurable-models-paths-parallel
