feat: configurable judge models, resilient imports, parallel scenarios #1
Merged
dmitriyzhuk merged 1 commit on Apr 28, 2026
1. Configurable LLM judge models via env vars:
   - EVAL_CLAUDE_JUDGE_MODEL (default: claude-sonnet-4-6)
   - EVAL_GEMINI_JUDGE_MODEL (default: gemini-2.5-pro)
   - EVAL_OPENAI_JUDGE_MODEL (default: gpt-4o)
   Agent model was already configurable via EVAL_LLM_MODEL.

2. Resilient import paths for agent runtime modules:
   - Tries src/slices/runtime/init/ first (post-refactor path)
   - Falls back to src/slices/agent/init/ (legacy path)
   - Prevents breakage when agent repos reorganize their slice structure

3. Parallel scenario execution:
   - New --concurrency N flag (default: 1 = sequential)
   - Also configurable via .paddock/config.json: "concurrency": 8
   - Uses worker pool pattern with jitter between starts
   - Results maintain original ordering regardless of completion order

4. Judge models displayed in header output (shows actual model IDs)
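Items 1 and 3 above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the env var names and their defaults come from the description, while `resolveJudgeModels`, `runWithConcurrency`, the `maxJitterMs` parameter, and the jitter strategy are hypothetical names and choices made for the example.

```typescript
// ASSUMPTION: helper names and shapes are illustrative, not paddock's API.
type Env = Record<string, string | undefined>;

interface JudgeModels {
  claude: string;
  gemini: string;
  openai: string;
}

// Read each judge model from the environment, falling back to the
// defaults listed in the PR description when a variable is unset or empty.
function resolveJudgeModels(env: Env): JudgeModels {
  return {
    claude: env["EVAL_CLAUDE_JUDGE_MODEL"] || "claude-sonnet-4-6",
    gemini: env["EVAL_GEMINI_JUDGE_MODEL"] || "gemini-2.5-pro",
    openai: env["EVAL_OPENAI_JUDGE_MODEL"] || "gpt-4o",
  };
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Worker pool: at most `concurrency` tasks run at once. Each result is
// written to the slot of its input index, so the output order matches the
// input order no matter which task finishes first.
async function runWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  task: (item: T, index: number) => Promise<R>,
  maxJitterMs = 0, // ASSUMPTION: random delay before each start
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // shared cursor; safe because JS callbacks run on one thread
  const workerCount = Math.max(1, Math.min(concurrency, items.length));

  async function worker(): Promise<void> {
    while (true) {
      const index = next++;
      if (index >= items.length) return;
      if (maxJitterMs > 0) await sleep(Math.random() * maxJitterMs);
      results[index] = await task(items[index], index);
    }
  }

  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

In a Node process the first helper would be called as `resolveJudgeModels(process.env)`. With `concurrency` set to 1 the pool degenerates to sequential execution, which matches the documented default.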
dmitriyzhuk pushed a commit that referenced this pull request May 6, 2026
…partial bias
Empirical evidence from a 100-iter knoxai-agent paddock run revealed that
gemini-3-pro-preview gives "partial" verdicts on 47% of medium-scenario
runs while scoring all dimensions at 8.5-9.1, with glowing reasoning
text praising the agent's behavior. Sample reasoning on a partial
verdict:

  "The agent successfully attempted gcloud_exec to check the current
  state and used cve_lookup to analyze the CVEs, accurately
  incorporating the findings (severity, CVSS, exploit status) into
  its justification for the upgrade."
  → verdict: partial, score: 9.1

This violates paddock's own verdict-from-scores rule ("If ALL scored
dimensions are >= 8 → verdict: pass"). Other judges (Claude, GPT-4.1)
follow the rule correctly:

Per-judge verdict distribution on medium scenario (101 runs):
  - claude-sonnet-4-6:    101 pass /  0 partial / 0 fail
  - gpt-4.1:               99 pass /  0 partial / 2 fail
  - gemini-3-pro-preview:  54 pass / 47 partial / 0 fail  ← anomaly

Net effect: per-scenario judge agreement on medium drops from ~98% on
dev runs (no CVE criteria, no anomaly) to ~84% on feat runs, purely
because Gemini downgrades to partial despite high scores.
The previous prompt rule was suggestive ("IMPORTANT — Verdict MUST be
consistent..."). This rewrite makes the rule imperative and explicit:
- Computes the verdict as a deterministic function of scores
- Lists forbidden combinations (e.g. "partial when all dimensions
scored 8+")
- Explicitly redirects nuance into the reasoning field
- Tells judges to lower a dimension score if they want to express
dissatisfaction, instead of downgrading the verdict
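
The "verdict as a deterministic function of scores" idea can be sketched as below. Only the pass rule (all scored dimensions >= 8) is quoted from paddock's prompt in this message; the fail threshold, the `verdictFromScores` name, and the dimension names used in the usage note are assumptions made purely for illustration.

```typescript
type Verdict = "pass" | "partial" | "fail";

// From the rule quoted above: all scored dimensions >= 8 means pass.
const PASS_THRESHOLD = 8;
// ASSUMPTION: the source does not state where "fail" starts; 5 is a placeholder.
const FAIL_THRESHOLD = 5;

// Map dimension scores to a verdict with no judge discretion involved.
function verdictFromScores(scores: Record<string, number>): Verdict {
  const values = Object.values(scores);
  if (values.length === 0) {
    throw new Error("cannot derive a verdict without scored dimensions");
  }
  if (values.every((s) => s >= PASS_THRESHOLD)) return "pass";
  if (values.some((s) => s < FAIL_THRESHOLD)) return "fail";
  return "partial";
}
```

Under these assumptions, `verdictFromScores({ accuracy: 8.5, safety: 9.1 })` yields "pass", so a judge that wants a "partial" outcome has to lower at least one dimension below 8 rather than downgrade the verdict directly.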
JSON-mode output (from PR #1) keeps the schema enforced; this PR
addresses the BEHAVIORAL bias separately. Both changes are needed to
get consistent verdicts across the panel.
If gemini-3-pro continues to show elevated partial rates after this,
escalate to runtime verdict correction in parseJudgeResponse
(post-hoc enforcement of the rule).
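
A possible shape for that escalation, as a hedged sketch only: the response schema (`verdict`, `scores`, `reasoning`), the `corrected` flag, and the function name `parseAndCorrectVerdict` are assumptions, and this is not the existing parseJudgeResponse. It enforces just the one rule quoted in this message and leaves every other verdict untouched.

```typescript
type JudgeVerdict = "pass" | "partial" | "fail";

// ASSUMPTION: hypothetical judge response schema for illustration.
interface JudgeResponse {
  verdict: JudgeVerdict;
  scores: Record<string, number>;
  reasoning: string;
}

interface CorrectedJudgeResponse extends JudgeResponse {
  corrected: boolean; // true when the judge's verdict was overridden
}

// Parse the judge's JSON output and override the verdict when it
// contradicts the rule "all scored dimensions >= 8 → pass".
function parseAndCorrectVerdict(raw: string): CorrectedJudgeResponse {
  const parsed = JSON.parse(raw) as JudgeResponse;
  const values = Object.values(parsed.scores);
  const allHigh = values.length > 0 && values.every((s) => s >= 8);

  if (allHigh && parsed.verdict !== "pass") {
    // Keep scores and reasoning intact so the override stays auditable.
    return { ...parsed, verdict: "pass", corrected: true };
  }
  return { ...parsed, corrected: false };
}
```

Recording `corrected` alongside the result would make it possible to keep tracking how often a given judge violates the rule even after enforcement is in place.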
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dmitriyzhuk added a commit that referenced this pull request May 6, 2026
Resolves conflict in src/loop/orchestrator.ts:
- keep agentTokenUsage merging block from PR #2 (feat/cost-breakdown)
- drop dead `if (config.useBranch) { git.restoreOriginalBranch() }`; PR #1 (refactor as embeddable lib) removed useBranch + GitManager
- keep `state.budget = budget.current()` from main

typecheck + build pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>