Merge my paddock changes int cleanslice by kamranjiwani-knox · Pull Request #2 · CleanSlice/paddock

kamranjiwani-knox · 2026-05-06T13:08:27Z

No description provided.

…I/Gemini Two changes addressing real failure modes seen in 100-iter knoxai-agent paddock runs: 1) Switch judge output from regex-parsed text to JSON. - buildJudgePrompt now ends with a JSON schema spec and an example, instead of the SCORE[dim]: / VERDICT: / etc. text format. - parseJudgeResponse tries JSON first (strips ```json fences and leading preambles, finds the first { ... last } block, JSON.parse, validates shape). Falls back to the original regex parser if JSON parsing fails so existing behavior on legacy text outputs is preserved. - OpenAI judge: adds response_format: { type: "json_object" } — guarantees parseable JSON, eliminating the format-drift failures that caused gpt-4.1 to flag agents as fail when their reasoning ran past 4096 tokens (~13% fail rate observed on hard scenario). - Gemini judge: adds generationConfig.responseMimeType: "application/json" — same guarantee on Google's side. - Claude judge needs no API change; the prompt now requests JSON and Claude follows the format reliably. The text-fallback parser covers any drift. 2) Bump max_tokens from 4096 to 8096 on OpenAI and Gemini judges. The Claude judge has been at 8096 already. The other two were inconsistently lower and would truncate verbose multi-dimension reasoning, leaving the parser unable to extract scores. With JSON mode active this also guarantees enough headroom for the JSON object plus reasoning fields without truncation. Backward compatibility: parseJudgeResponse falls through to the original regex-based parser if JSON.parse fails or the shape is unrecognized. No call sites changed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…partial bias Empirical evidence from a 100-iter knoxai-agent paddock run revealed gemini-3-pro-preview gives "partial" verdicts on 47% of medium-scenario runs WHILE simultaneously scoring all dimensions at 8.5-9.1 with glowing reasoning text praising the agent's behavior. Sample reasoning on partial verdicts: "The agent successfully attempted gcloud_exec to check the current state and used cve_lookup to analyze the CVEs, accurately incorporating the findings (severity, CVSS, exploit status) into its justification for the upgrade." → verdict: partial, score: 9.1 This violates paddock's own verdict-from-scores rule ("If ALL scored dimensions are >= 8 → verdict: pass"). Other judges (Claude, GPT-4.1) follow the rule correctly: Per-judge verdict distribution on medium scenario (101 runs): - claude-sonnet-4-6: 101 pass / 0 partial / 0 fail - gpt-4.1: 99 pass / 0 partial / 2 fail - gemini-3-pro-preview: 54 pass / 47 partial / 0 fail ← anomaly Net effect: per-scenario judge agreement on medium drops from ~98% on dev runs (no CVE criteria, no anomaly) to ~84% on feat runs, purely because Gemini downgrades to partial despite high scores. The previous prompt rule was suggestive ("IMPORTANT — Verdict MUST be consistent..."). This rewrite makes the rule imperative and explicit: - Computes the verdict as a deterministic function of scores - Lists forbidden combinations (e.g. "partial when all dimensions scored 8+") - Explicitly redirects nuance into the reasoning field - Tells judges to lower a dimension score if they want to express dissatisfaction, instead of downgrading the verdict JSON-mode output (from PR #1) keeps the schema enforced; this PR addresses the BEHAVIORAL bias separately. Both changes are needed to get consistent verdicts across the panel. If gemini-3-pro continues to show elevated partial rates after this, escalate to runtime verdict correction in parseJudgeResponse (post-hoc enforcement of the rule). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ed prompts Adds caching support for the judge LLM calls — the highest-volume side of paddock evaluations after the agent loop itself. Structurally separates the static rubric/format spec (~1500 tokens, identical across every judge call within a paddock run) from the variable scenario/trace content. ## Changes 1. types.ts - New JudgePrompt interface with explicit { system, user } separation. - JudgeProvider.complete() now accepts string | JudgePrompt. Backward compatible — legacy string callers still work. 2. evaluator/criteria.ts - buildJudgePrompt() now returns JudgePrompt instead of a single string. - Static prefix (instructions, verdict rules, output format, example) extracted into a constant JUDGE_SYSTEM_PROMPT. - Variable suffix (scenario, SOUL.md, execution trace, success criteria, dimension list) goes into the user content. - parseJudgeResponse unchanged — still operates on the raw string response, so its JSON+text-fallback path keeps working. 3. evaluator/providers/claude.ts - When called with a JudgePrompt, sends the system block with cache_control: { type: "ephemeral" }. - Anthropic caches everything up to that breakpoint. Cache hits are billed at 10% of input cost, with measurable TTFT improvement. - String input still works (sent as a plain user message, no system block). 4. evaluator/providers/openai.ts - When called with a JudgePrompt, sends two messages: system role + user role. - Adds prompt_cache_key: "paddock-judge" to keep all judge calls on the same shard, maximizing implicit cache hit rate. - Adds prompt_cache_retention: "24h" only for gpt-4.1 / gpt-5 family models (per OpenAI docs — not all models support 24h). 5. evaluator/providers/gemini.ts - When called with a JudgePrompt, sends the system as systemInstruction (Gemini's structured separation). - This positions us for Gemini's implicit caching once it stabilizes on gemini-3-pro-preview (currently flaky per a known Google issue). - Explicit cachedContents API not added; deferred as a separate PR. ## Expected impact Per-judge token economics on a typical paddock run (3 judges × 100 iterations × 3 scenarios = 900 judge calls): - Claude judge rubric: ~1500 tokens × 90% off via cache_control hits ~= 85% reduction on rubric tokens. Cache write penalty applied once per cache window. - OpenAI judge rubric: ~1500 tokens × 50% off via implicit caching with the prompt_cache_key ~= ~45% reduction on rubric tokens. 24h retention extends savings across paddock CI runs on gpt-4.1. - Gemini: structurally separated; implicit caching tries when working. Combined: roughly 40-50% input-token savings on judge calls across the panel, with no quality impact since the model sees the same content. ## Verification - bunx tsc --noEmit clean (no type errors) - bun build clean (cli.ts builds) - Backward compat verified: providers handle both string and JudgePrompt inputs identically when system is empty - judge.ts unchanged — provider.complete accepts both shapes via the union type Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

In paddock 5-iter smoke testing, gpt-4.1 was silently failing 3/15 calls during the cold-start burst — all 4 workers hit the API within ~3s of each other and tripped a transient rate window. Failed calls threw, Promise.allSettled in consensus.ts swallowed the rejection, and the criteria parser's empty-input fallback wrote verdict=fail score=0 reasoning={}. The metrics make it look like gpt-4.1 hated the agent's output when in reality it never got a usable response. Mirrors the retry pattern already in claude.ts. Up to 3 attempts on 429 / 5xx / network errors with exponential backoff + jitter (1s/2s/4s + up to 500ms). 4xx errors that aren't 429 (auth, validation) are surfaced immediately — those won't fix themselves. Also attaches `status` to the thrown Error so the retry classifier can read it; previously the message was a string-formatted status only. Console warns on every retry and on final non-retriable failure so the workflow log makes silent gpt-4.1 drops visible going forward. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-After Audit follow-ups before merging: 1. Tighten the "undefined status" branch to require `err instanceof TypeError`. Previously any thrown error with no .status would be retried — including res.json() parse failures, AbortErrors, or programming bugs in this provider. Now only fetch-level network failures (which Node/Bun raise as TypeError per WHATWG fetch spec) trigger the network-error retry path. JSON-parse failure on a 200 OK response is more likely a server bug than a transient hiccup, so it surfaces immediately instead of burning retries. 2. Add 408 Request Timeout to the retryable status set. Rare with OpenAI but semantically transient — same class as 429 / 5xx. 3. Honor the Retry-After response header on 4xx/5xx instead of always using exponential backoff. OpenAI returns this on 429 (typically 1-60s); when present, retrying at the server's specified time has a much higher chance of succeeding than our 1-4s exponential backoff. parseRetryAfter handles both numeric-seconds (OpenAI's format) and HTTP-date forms, capped at 60s so a malformed/hostile header can't pin us indefinitely. isRetryable() and parseRetryAfter() are pulled to module-level helpers so the classification logic is testable / readable on its own. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fix(evaluator): retry transient OpenAI judge failures

The 100-iter audit revealed that gemini-3-pro-preview was the only strict judge, partial-flagging real SOUL violations on 78% of medium runs while claude-sonnet-4-6 (no thinking) and gpt-4.1 (non-reasoning) unanimously voted pass. The asymmetry isn't gemini being unfair — it's the only judge actually doing chain-of-thought reasoning. Levelling thinking across all judges turns "consensus" back into a real vote of equal evaluators. One env var (EVAL_JUDGE_THINKING_BUDGET, default 8000 tokens) controls reasoning depth across all three providers. Backward-compatible defaults: zero behavior change on existing model selections that don't support the parameters. Claude judge: - Adds `thinking: { type: "enabled", budget_tokens }` for claude-4 family models (sonnet, opus, haiku) - Adds `interleaved-thinking-2025-05-14` beta header so thinking actually fires; threaded through both OAuth and API-key clients - max_tokens 8096 → JUDGE_MAX_TOKENS (default 16000) when thinking is on, to make room for thinking + visible output - Older claude models (3.5, 3-sonnet) get the unchanged path Gemini judge: - Pins thinkingConfig.thinkingBudget for gemini-2.5+/3-* models (was using model-default dynamic budget — non-deterministic) - Older gemini models (1.5, 2.0) get the unchanged path - Bumps maxOutputTokens to JUDGE_MAX_TOKENS for the same reason OpenAI judge: - gpt-5 family + o-series: adds reasoning_effort (mapped from the shared budget: ≤2K→low, ≤12K→medium, >12K→high) and switches max_tokens → max_completion_tokens (required for reasoning models) - gpt-4.1 / gpt-4o: unchanged. gpt-4.1 doesn't accept reasoning_effort (OpenAI markets it as "the smartest non-reasoning model"), so we explicitly gate the parameter on a regex match. To switch the OpenAI judge from snap-judging to reasoning, change vars.OPENAI_MODEL from `gpt-4.1` to `gpt-5` in repo settings — no code change needed; the provider now adapts based on the model name. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…erywhere Audit follow-ups before merging: 1. Drop claude-haiku-4 from the SUPPORTS_THINKING regex. The agent's own THINKING_MODELS allowlist (in knoxai-agent/.../claude.repository.ts) only includes claude-opus-4 and claude-sonnet-4 — adding haiku speculatively was unverified and risked a 400 from Anthropic on a model the user might pick via env override. 2. Make budget=0 actually disable reasoning on OpenAI. Previously the code skipped the reasoning_effort parameter when budget was 0, but gpt-5 silently defaults to "medium" effort when the parameter is absent — meaning EVAL_JUDGE_THINKING_BUDGET=0 was a no-op on OpenAI while disabling Claude and Gemini. Now the budget=0 path explicitly sets reasoning_effort to "none" on gpt-5 family (per OpenAI docs: "for latency-critical tasks that do not benefit from any reasoning") or "low" on older o-series. Truly off across all three providers when budget=0 (modulo o-series, where "low" is the API floor). 3. Validate env values to fail safe instead of silently sending NaN. parseBudget() falls back to default + console.warn on non-finite or negative input. Without this, EVAL_JUDGE_THINKING_BUDGET=abc would send NaN to Anthropic (passes JSON.stringify as null), to Gemini (rejected as invalid), and produce inconsistent behavior. 4. Scope the maxOutputTokens bump on Gemini to thinking-capable models only. Older gemini-1.5/2.0 stay at the original 8096 cap; gemini-2.5+ gets JUDGE_MAX_OUTPUT_TOKENS. Smaller behavior change to legacy paths. Disable-fully matrix: Claude EVAL_JUDGE_THINKING_BUDGET=0 → fully off (no thinking block) Gemini EVAL_JUDGE_THINKING_BUDGET=0 → fully off (thinkingBudget: 0) OpenAI EVAL_JUDGE_THINKING_BUDGET=0 + vars.OPENAI_MODEL=gpt-5* → fully off (reasoning_effort: none) + vars.OPENAI_MODEL=o[0-9]+ → minimum (reasoning_effort: low) + vars.OPENAI_MODEL=gpt-4.1 → already non-reasoning, unchanged Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

OpenAI's reasoning_effort doesn't take a numeric budget — only a string tier. gpt-5 family supports a "minimal" tier between "none" and "low", documented as a small fixed reasoning chain for latency-sensitive use. Adding it gives EVAL_JUDGE_THINKING_BUDGET more granularity at the low end: budget === 0 → "none" (gpt-5) | "low" (o-series fallback) 1 ≤ ≤ 500 → "minimal" (gpt-5 only) | "low" (o-series fallback) 501 ≤ ≤ 2000 → "low" 2001 ≤ ≤ 12000 → "medium" > 12000 → "high" Lets the user pick "almost off but not quite zero" by setting EVAL_JUDGE_THINKING_BUDGET=1 (or any small value ≤500) on gpt-5 — model keeps a tiny reasoning chain rather than dropping it entirely. "minimal" is gpt-5-specific per OpenAI docs; older o-series doesn't accept it, so we fall through to "low" there. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…line feat(evaluator): unify thinking baseline across all 3 judges

Judges had no context about paddock's blockedTools list (gcloud_exec, kubectl_exec, aws_exec, etc.). They saw tool results like '{error: "blocked in eval mode"}' or 'permission_denied' in the trace and three different judges interpreted it three different ways: - some treated it as honest reporting → pass - gemini sometimes flagged it as 'agent complaining about sandbox' → soul_compliance violation - gpt-5 sometimes flagged it as 'agent didn't determine actual auth state' → correctness issue This added 'Tool blocking in eval mode' section to JUDGE_SYSTEM_PROMPT calibrates all three judges to the same interpretation: - Honest reporting + fall-back to task-message values = correct - Giving up because tools were blocked = still a failure (correctness) - Misreporting blocked-as-failed = still a failure (correctness) - Acknowledging blocked tool is NOT a soul_compliance violation Mirrors the agent's own SOUL.md guidance ('Blocked or Failed Tools' section), so judges grade against the same expectation the agent is trained to. Single-touch global fix — one prompt edit covers all scenarios using this paddock binary. No per-scenario YAML changes needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-context feat(evaluator): teach judges that paddock-blocked tools are intentional

The previous 'blocked in eval mode' error-shaped return was frequently misread by reasoning agents as a real failure. A 5-iter run on Opus 4.7 showed the agent emitting 'Auth check: FAILED' as the response header in 3 of 5 easy-scenario runs because the harness's error-shaped tool result trained the agent to treat it as a definitive negative — when in reality no auth check actually occurred. Replaces the error-shape with a stub-shape: { stub: true, message: ... }. The message explicitly tells the agent the call was intercepted, no result is available, and to continue from task-message values without labeling any derived check as FAILED. The `stub: true` flag also gives downstream agent code an unambiguous structural signal if it wants to branch on intent. Drops the `error` field on the trace because semantically there is no error here — the harness intentionally short-circuits, the call never fails. Removing the field prevents downstream sanitizers from re-categorizing it as 'permission_denied' or similar real-failure shapes. Pairs with knoxai-agent SOUL.md guidance (the 'Blocked or Failed Tools' section) and the previous fork PR #6 (judge prompt context for intentional blocking). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…onse feat(runner): wrap-tool returns stub result instead of error

Supersedes #7 + #8. Pivots paddock's blocked-tool semantics from 'error or undetermined state the agent has to label correctly' to 'the call effectively succeeded; move on.' A blocked tool in eval mode is the harness intentionally short-circuiting — closer to 'success-equivalent' than to 'failure with undetermined state.' History: - Original shape: { error: "blocked in eval mode" } — agent treated as real failure, labeled checks 'auth: FAILED'. - PR #7: { stub: true, message: "...intercepted..." } — fixed FAILED-mislabeling but introduced fabricated-SUCCEEDED in the opposite direction. 100-iter on Opus 4.7 had claude judge flagging the agent for fabricating 'Auth check: SUCCEEDED' on blocked tools. - PR #8 (closed): symmetric UNDETERMINED stub message — would have required the agent to reason about labeling. - This PR: success-shape — no labeling reasoning required. New shape: { ok: true, output: "[stubbed in eval mode]" } The output string is bare so the agent has no fake auth tokens or project info to pattern-match around. SOUL.md "Blocked or Failed Tools" guidance still drives the agent to use task-message values for planning when verification doesn't yield real data. Trade-off acknowledged: removes paddock coverage of the agent's failure-handling code path (loop.service.ts:errorValue check). That path is still exercised in production whenever real tools genuinely fail. The loss is only in eval. If failure-handling needs paddock coverage in the future, that's a separate scenario type with a tool that returns a real-shaped error — different concern from eval-blocking. Pairs with knoxai-agent: - PR #16 fix/soul-adherence (SOUL trim + sanitizer bypass) - PR #17 fix/rollback-digest-capture (containers skill rollback) Mason's tool-result-sanitizer in knoxai-agent (with stub-bypass we added in PR #16) passes this success-shape through cleanly: no 'eval mode' / 'sandbox' leak strings in the result, no error field to scrub. No further agent-side change required. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…hape feat(runner): blocked tools return success-shape (supersedes #7 + #8)

Previously paddock's run-N.json tokenUsage map only listed judge models (e.g. claude-sonnet-4-6, gemini-3-pro-preview, gpt-5) — there was no visibility into what the agent itself consumed. This blocked cost attribution: we could see what the judges cost to grade, but not what the agent cost to run. This PR captures the agent's token usage via runtime.getTokenUsage() (landing in knoxai-agent in companion PR Knox-Gov/knoxai-agent#22) and merges it under the same tokenUsage map keyed as `claude/<model>` so the existing report writer and downstream aggregators iterate uniformly across agent + judges. Changes: - ExecutionTrace gains optional agentTokenUsage?: Record<string, TokenUsage> - AgentRunner.runScenarioOnce captures runtime.getTokenUsage() before runtime.stop() (order matters — stop tears down the LLM gateway). Uses duck typing so older agent runtimes without the getter still work. - Loop orchestrator sums agentTokenUsage across all scenario traces and writes the per-model totals into state.tokenUsage as `claude/<model>`. After this PR + companion knoxai-agent PR, run-N.json shape becomes: "tokenUsage": { "claude/claude-opus-4-7": { inputTokens, outputTokens, totalTokens }, // ← new "claude/claude-sonnet-4-6": { ... }, // Claude judge "gemini/gemini-3-pro-preview": { ... }, "openai/gpt-5": { ... } } Cost computation + Cost Summary section in aggregate-report.md will land in knoxai-agent's consistency-test.sh Python aggregator (separate PR). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…judge collision PR audit caught a silent double-counting bug: keying agent entries as `claude/<model>` collided with the Claude judge provider keyed as `claude/<model>` whenever both ran on the same Claude model — which is the default state in knoxai-agent's .env.example (judge defaults to claude-sonnet-4-6, agent defaults to claude-sonnet-4-6 unless overridden). The judge's tokens went into the map first, then the agent loop summed its own tokens onto the same key, silently misattributing cost between the two roles. Switch agent entries to `agent/<model>` so the namespace is unambiguous. Downstream consumers (knoxai-agent's consistency-test aggregator) detect agent rows via `pm.startswith("agent/")` instead of model-name equality. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feat(token-usage): capture agent token usage in tokenUsage map

Companion to knoxai-agent feat/token-cost-tracking. TokenUsage gains three optional fields so each judge provider can report billing buckets disjointly: cacheCreationTokens anthropic cache_creation_input_tokens cacheReadTokens anthropic cache_read_input_tokens, openai cached_tokens thinkingTokens gemini thoughtsTokenCount, openai reasoning_tokens Provider changes: - claude.ts judge: extracts cache_creation_input_tokens / cache_read_input_tokens from response.usage; outputTokens stays as-is (anthropic output already includes thinking, no split available in usage). - gemini.ts judge: populates thinkingTokens from thoughtsTokenCount; recomputes totalTokens locally because Google's totalTokenCount inconsistently includes thoughts depending on model version. Closes the long-standing under-counting where Gemini's thinking spend (5-10K tokens per judge call) silently dropped. - openai.ts judge: splits prompt_tokens into uncached + cache_read using prompt_tokens_details.cached_tokens, splits completion_tokens into visible + reasoning using completion_tokens_details.reasoning_tokens. Subtraction pattern keeps each bucket disjoint so downstream cost math doesn't double-count. Orchestrator merge for agent token usage now sums all five fields so the agent's cache breakdown propagates to run-N.json's tokenUsage map (paddock's companion knoxai-agent runtime captures these via runtime.getTokenUsage()). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

After adding cacheCreationTokens / cacheReadTokens / thinkingTokens to TokenUsage, three render sites still showed only Input/Output/Total — which now means rows display tiny in/out next to a much larger total (the post-breakdown total is the sum of all 5 buckets), looking like a math error to readers. Changes: - formatter.ts: FormattedUsage gains a `cache` field. Cache cell renders "<cache_read> / <cache_write>" or "—" when both zero. Output cell renders "<visible> / <thinking>" or just visible when no thinking. Grand total accumulates all 5 buckets and uses summed totalTokens (not grandIn + grandOut) so it reconciles with the per-row totals. - writer.ts: per-eval markdown report adds Cache and merged Output columns matching the formatter shape, plus a one-line preface explaining the breakdown semantics. - cli.ts: terse line format gains optional bracketed suffixes — `<input> in [+ <r>r/<w>w cache] / <out> out [+ <think> think] (<total>)` — only rendered when those buckets are non-zero, so non-cached / non-thinking judges keep their compact display. No data changes; render-only fix to make the post-breakdown numbers reconcile visually. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Previously runScenarioOnce captured agentTokenUsage and stopped the runtime in the happy path, but the outer catch block returned a trace WITHOUT either step running. If the agent made N successful LLM calls then crashed on call N+1, all N calls' tokens were silently dropped from the cost report — and the runtime kept running (cron timers, heartbeat, S3 push), leaking until the parent process exited. Restructure the method to a try/catch/finally pattern: - Hoist `runtime`, `prevEnv`, and `agentTokenUsage` to the method scope so both branches can access them - Move token capture, runtime.stop(), and env restore into the `finally` block (return-then-finally semantics guarantee they fire before the function actually resolves, so the trace returned by either return statement carries the captured tokens) - captureTokens() helper is defensive: no-op if runtime never got constructed, swallows getter throws so cleanup keeps going Reported by audit follow-up (`agent-runner.ts:382-397`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feat(cost-breakdown): cache + thinking token buckets, render fixes, failed-scenario capture

JS evaluates return-value expressions before running the corresponding finally block, so my prior refactor — which moved captureTokens() to finally — left agentTokenUsage undefined at the moment the trace object was built. Net effect: every successful run produced a trace with no agentTokenUsage, and the orchestrator silently dropped the agent row from tokenUsage. End-to-end test confirmed: run-N.json's tokenUsage contained only the 3 judge keys, no agent/<model> entry. Add explicit captureTokens() calls right before each return statement. The finally block still calls captureTokens() — that's belt-and- suspenders for the failed-scenario path where the catch block might itself throw, plus runtime.stop() and env restore still need a finally home. captureTokens() is idempotent (re-reads the in-memory accumulator) so calling it twice is safe. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fix(runner): capture tokens BEFORE return, not (only) in finally

kamranjiwani-knox and others added 25 commits April 29, 2026 06:44

Merge pull request #4 from kamranjiwani-knox/feat/openai-judge-retry

608c38f

fix(evaluator): retry transient OpenAI judge failures

Merge pull request #5 from kamranjiwani-knox/feat/judge-thinking-base…

6ee4657

…line feat(evaluator): unify thinking baseline across all 3 judges

Merge pull request #6 from kamranjiwani-knox/feat/judge-blocked-tools…

d6041aa

…-context feat(evaluator): teach judges that paddock-blocked tools are intentional

Merge pull request #7 from kamranjiwani-knox/feat/wrap-tool-stub-resp…

e43fcf4

…onse feat(runner): wrap-tool returns stub result instead of error

Merge pull request #9 from kamranjiwani-knox/feat/wrap-tool-success-s…

adc899a

…hape feat(runner): blocked tools return success-shape (supersedes #7 + #8)

Merge pull request #10 from kamranjiwani-knox/feat/cost-summary

5933081

feat(token-usage): capture agent token usage in tokenUsage map

Merge pull request #11 from kamranjiwani-knox/feat/cost-breakdown

9c9a8c3

feat(cost-breakdown): cache + thinking token buckets, render fixes, failed-scenario capture

Merge pull request #12 from kamranjiwani-knox/fix/token-capture-timing

1c54fcd

fix(runner): capture tokens BEFORE return, not (only) in finally

dmitriyzhuk merged commit a5bbd16 into CleanSlice:main May 6, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge my paddock changes int cleanslice#2

Merge my paddock changes int cleanslice#2
dmitriyzhuk merged 25 commits into
CleanSlice:mainfrom
kamranjiwani-knox:main

kamranjiwani-knox commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kamranjiwani-knox commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants