Skip to content

Merge my paddock changes int cleanslice#2

Merged
dmitriyzhuk merged 25 commits into
CleanSlice:mainfrom
kamranjiwani-knox:main
May 6, 2026
Merged

Merge my paddock changes int cleanslice#2
dmitriyzhuk merged 25 commits into
CleanSlice:mainfrom
kamranjiwani-knox:main

Conversation

@kamranjiwani-knox
Copy link
Copy Markdown
Contributor

No description provided.

kamranjiwani-knox and others added 25 commits April 29, 2026 06:44
…I/Gemini

Two changes addressing real failure modes seen in 100-iter knoxai-agent
paddock runs:

1) Switch judge output from regex-parsed text to JSON.

   - buildJudgePrompt now ends with a JSON schema spec and an example,
     instead of the SCORE[dim]: / VERDICT: / etc. text format.
   - parseJudgeResponse tries JSON first (strips ```json fences and
     leading preambles, finds the first { ... last } block, JSON.parse,
     validates shape). Falls back to the original regex parser if JSON
     parsing fails so existing behavior on legacy text outputs is
     preserved.
   - OpenAI judge: adds response_format: { type: "json_object" } —
     guarantees parseable JSON, eliminating the format-drift failures
     that caused gpt-4.1 to flag agents as fail when their reasoning
     ran past 4096 tokens (~13% fail rate observed on hard scenario).
   - Gemini judge: adds generationConfig.responseMimeType:
     "application/json" — same guarantee on Google's side.
   - Claude judge needs no API change; the prompt now requests JSON
     and Claude follows the format reliably. The text-fallback parser
     covers any drift.

2) Bump max_tokens from 4096 to 8096 on OpenAI and Gemini judges.

   The Claude judge has been at 8096 already. The other two were
   inconsistently lower and would truncate verbose multi-dimension
   reasoning, leaving the parser unable to extract scores. With JSON
   mode active this also guarantees enough headroom for the JSON
   object plus reasoning fields without truncation.

Backward compatibility: parseJudgeResponse falls through to the
original regex-based parser if JSON.parse fails or the shape is
unrecognized. No call sites changed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…partial bias

Empirical evidence from a 100-iter knoxai-agent paddock run revealed
gemini-3-pro-preview gives "partial" verdicts on 47% of medium-scenario
runs WHILE simultaneously scoring all dimensions at 8.5-9.1 with
glowing reasoning text praising the agent's behavior. Sample reasoning
on partial verdicts:

  "The agent successfully attempted gcloud_exec to check the current
   state and used cve_lookup to analyze the CVEs, accurately
   incorporating the findings (severity, CVSS, exploit status) into
   its justification for the upgrade."  → verdict: partial, score: 9.1

This violates paddock's own verdict-from-scores rule ("If ALL scored
dimensions are >= 8 → verdict: pass"). Other judges (Claude, GPT-4.1)
follow the rule correctly:

  Per-judge verdict distribution on medium scenario (101 runs):
  - claude-sonnet-4-6:    101 pass / 0 partial / 0 fail
  - gpt-4.1:              99 pass / 0 partial / 2 fail
  - gemini-3-pro-preview: 54 pass / 47 partial / 0 fail   ← anomaly

Net effect: per-scenario judge agreement on medium drops from ~98% on
dev runs (no CVE criteria, no anomaly) to ~84% on feat runs, purely
because Gemini downgrades to partial despite high scores.

The previous prompt rule was suggestive ("IMPORTANT — Verdict MUST be
consistent..."). This rewrite makes the rule imperative and explicit:

- Computes the verdict as a deterministic function of scores
- Lists forbidden combinations (e.g. "partial when all dimensions
  scored 8+")
- Explicitly redirects nuance into the reasoning field
- Tells judges to lower a dimension score if they want to express
  dissatisfaction, instead of downgrading the verdict

JSON-mode output (from PR #1) keeps the schema enforced; this PR
addresses the BEHAVIORAL bias separately. Both changes are needed to
get consistent verdicts across the panel.

If gemini-3-pro continues to show elevated partial rates after this,
escalate to runtime verdict correction in parseJudgeResponse
(post-hoc enforcement of the rule).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ed prompts

Adds caching support for the judge LLM calls — the highest-volume side
of paddock evaluations after the agent loop itself. Structurally
separates the static rubric/format spec (~1500 tokens, identical
across every judge call within a paddock run) from the variable
scenario/trace content.

## Changes

1. types.ts
   - New JudgePrompt interface with explicit { system, user } separation.
   - JudgeProvider.complete() now accepts string | JudgePrompt.
     Backward compatible — legacy string callers still work.

2. evaluator/criteria.ts
   - buildJudgePrompt() now returns JudgePrompt instead of a single string.
   - Static prefix (instructions, verdict rules, output format, example)
     extracted into a constant JUDGE_SYSTEM_PROMPT.
   - Variable suffix (scenario, SOUL.md, execution trace, success
     criteria, dimension list) goes into the user content.
   - parseJudgeResponse unchanged — still operates on the raw string
     response, so its JSON+text-fallback path keeps working.

3. evaluator/providers/claude.ts
   - When called with a JudgePrompt, sends the system block with
     cache_control: { type: "ephemeral" }.
   - Anthropic caches everything up to that breakpoint. Cache hits
     are billed at 10% of input cost, with measurable TTFT improvement.
   - String input still works (sent as a plain user message, no system
     block).

4. evaluator/providers/openai.ts
   - When called with a JudgePrompt, sends two messages: system role
     + user role.
   - Adds prompt_cache_key: "paddock-judge" to keep all judge calls on
     the same shard, maximizing implicit cache hit rate.
   - Adds prompt_cache_retention: "24h" only for gpt-4.1 / gpt-5
     family models (per OpenAI docs — not all models support 24h).

5. evaluator/providers/gemini.ts
   - When called with a JudgePrompt, sends the system as
     systemInstruction (Gemini's structured separation).
   - This positions us for Gemini's implicit caching once it stabilizes
     on gemini-3-pro-preview (currently flaky per a known Google issue).
   - Explicit cachedContents API not added; deferred as a separate PR.

## Expected impact

Per-judge token economics on a typical paddock run (3 judges × 100
iterations × 3 scenarios = 900 judge calls):

- Claude judge rubric: ~1500 tokens × 90% off via cache_control hits
  ~= 85% reduction on rubric tokens. Cache write penalty applied
  once per cache window.
- OpenAI judge rubric: ~1500 tokens × 50% off via implicit caching
  with the prompt_cache_key ~= ~45% reduction on rubric tokens. 24h
  retention extends savings across paddock CI runs on gpt-4.1.
- Gemini: structurally separated; implicit caching tries when working.

Combined: roughly 40-50% input-token savings on judge calls across the
panel, with no quality impact since the model sees the same content.

## Verification

- bunx tsc --noEmit clean (no type errors)
- bun build clean (cli.ts builds)
- Backward compat verified: providers handle both string and
  JudgePrompt inputs identically when system is empty
- judge.ts unchanged — provider.complete accepts both shapes via the
  union type

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
In paddock 5-iter smoke testing, gpt-4.1 was silently failing 3/15
calls during the cold-start burst — all 4 workers hit the API within
~3s of each other and tripped a transient rate window. Failed calls
threw, Promise.allSettled in consensus.ts swallowed the rejection,
and the criteria parser's empty-input fallback wrote verdict=fail
score=0 reasoning={}. The metrics make it look like gpt-4.1 hated the
agent's output when in reality it never got a usable response.

Mirrors the retry pattern already in claude.ts. Up to 3 attempts on
429 / 5xx / network errors with exponential backoff + jitter
(1s/2s/4s + up to 500ms). 4xx errors that aren't 429 (auth, validation)
are surfaced immediately — those won't fix themselves.

Also attaches `status` to the thrown Error so the retry classifier
can read it; previously the message was a string-formatted status only.

Console warns on every retry and on final non-retriable failure so
the workflow log makes silent gpt-4.1 drops visible going forward.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-After

Audit follow-ups before merging:

1. Tighten the "undefined status" branch to require `err instanceof TypeError`.
   Previously any thrown error with no .status would be retried — including
   res.json() parse failures, AbortErrors, or programming bugs in this
   provider. Now only fetch-level network failures (which Node/Bun raise as
   TypeError per WHATWG fetch spec) trigger the network-error retry path.
   JSON-parse failure on a 200 OK response is more likely a server bug than
   a transient hiccup, so it surfaces immediately instead of burning retries.

2. Add 408 Request Timeout to the retryable status set. Rare with OpenAI but
   semantically transient — same class as 429 / 5xx.

3. Honor the Retry-After response header on 4xx/5xx instead of always using
   exponential backoff. OpenAI returns this on 429 (typically 1-60s); when
   present, retrying at the server's specified time has a much higher
   chance of succeeding than our 1-4s exponential backoff. parseRetryAfter
   handles both numeric-seconds (OpenAI's format) and HTTP-date forms,
   capped at 60s so a malformed/hostile header can't pin us indefinitely.

isRetryable() and parseRetryAfter() are pulled to module-level helpers so
the classification logic is testable / readable on its own.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(evaluator): retry transient OpenAI judge failures
The 100-iter audit revealed that gemini-3-pro-preview was the only
strict judge, partial-flagging real SOUL violations on 78% of medium
runs while claude-sonnet-4-6 (no thinking) and gpt-4.1 (non-reasoning)
unanimously voted pass. The asymmetry isn't gemini being unfair — it's
the only judge actually doing chain-of-thought reasoning. Levelling
thinking across all judges turns "consensus" back into a real vote of
equal evaluators.

One env var (EVAL_JUDGE_THINKING_BUDGET, default 8000 tokens) controls
reasoning depth across all three providers. Backward-compatible defaults:
zero behavior change on existing model selections that don't support
the parameters.

Claude judge:
  - Adds `thinking: { type: "enabled", budget_tokens }` for claude-4
    family models (sonnet, opus, haiku)
  - Adds `interleaved-thinking-2025-05-14` beta header so thinking
    actually fires; threaded through both OAuth and API-key clients
  - max_tokens 8096 → JUDGE_MAX_TOKENS (default 16000) when thinking
    is on, to make room for thinking + visible output
  - Older claude models (3.5, 3-sonnet) get the unchanged path

Gemini judge:
  - Pins thinkingConfig.thinkingBudget for gemini-2.5+/3-* models
    (was using model-default dynamic budget — non-deterministic)
  - Older gemini models (1.5, 2.0) get the unchanged path
  - Bumps maxOutputTokens to JUDGE_MAX_TOKENS for the same reason

OpenAI judge:
  - gpt-5 family + o-series: adds reasoning_effort (mapped from the
    shared budget: ≤2K→low, ≤12K→medium, >12K→high) and switches
    max_tokens → max_completion_tokens (required for reasoning models)
  - gpt-4.1 / gpt-4o: unchanged. gpt-4.1 doesn't accept reasoning_effort
    (OpenAI markets it as "the smartest non-reasoning model"), so we
    explicitly gate the parameter on a regex match.

To switch the OpenAI judge from snap-judging to reasoning, change
vars.OPENAI_MODEL from `gpt-4.1` to `gpt-5` in repo settings — no code
change needed; the provider now adapts based on the model name.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…erywhere

Audit follow-ups before merging:

1. Drop claude-haiku-4 from the SUPPORTS_THINKING regex. The agent's own
   THINKING_MODELS allowlist (in knoxai-agent/.../claude.repository.ts)
   only includes claude-opus-4 and claude-sonnet-4 — adding haiku
   speculatively was unverified and risked a 400 from Anthropic on a
   model the user might pick via env override.

2. Make budget=0 actually disable reasoning on OpenAI. Previously the
   code skipped the reasoning_effort parameter when budget was 0, but
   gpt-5 silently defaults to "medium" effort when the parameter is
   absent — meaning EVAL_JUDGE_THINKING_BUDGET=0 was a no-op on OpenAI
   while disabling Claude and Gemini. Now the budget=0 path explicitly
   sets reasoning_effort to "none" on gpt-5 family (per OpenAI docs:
   "for latency-critical tasks that do not benefit from any reasoning")
   or "low" on older o-series. Truly off across all three providers
   when budget=0 (modulo o-series, where "low" is the API floor).

3. Validate env values to fail safe instead of silently sending NaN.
   parseBudget() falls back to default + console.warn on non-finite or
   negative input. Without this, EVAL_JUDGE_THINKING_BUDGET=abc would
   send NaN to Anthropic (passes JSON.stringify as null), to Gemini
   (rejected as invalid), and produce inconsistent behavior.

4. Scope the maxOutputTokens bump on Gemini to thinking-capable models
   only. Older gemini-1.5/2.0 stay at the original 8096 cap; gemini-2.5+
   gets JUDGE_MAX_OUTPUT_TOKENS. Smaller behavior change to legacy paths.

Disable-fully matrix:

  Claude    EVAL_JUDGE_THINKING_BUDGET=0      → fully off (no thinking block)
  Gemini    EVAL_JUDGE_THINKING_BUDGET=0      → fully off (thinkingBudget: 0)
  OpenAI    EVAL_JUDGE_THINKING_BUDGET=0
            + vars.OPENAI_MODEL=gpt-5*        → fully off (reasoning_effort: none)
            + vars.OPENAI_MODEL=o[0-9]+       → minimum (reasoning_effort: low)
            + vars.OPENAI_MODEL=gpt-4.1       → already non-reasoning, unchanged

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OpenAI's reasoning_effort doesn't take a numeric budget — only a string
tier. gpt-5 family supports a "minimal" tier between "none" and "low",
documented as a small fixed reasoning chain for latency-sensitive use.
Adding it gives EVAL_JUDGE_THINKING_BUDGET more granularity at the
low end:

  budget === 0     → "none" (gpt-5) | "low" (o-series fallback)
  1   ≤ ≤ 500      → "minimal" (gpt-5 only) | "low" (o-series fallback)
  501 ≤ ≤ 2000     → "low"
  2001 ≤ ≤ 12000   → "medium"
  > 12000          → "high"

Lets the user pick "almost off but not quite zero" by setting
EVAL_JUDGE_THINKING_BUDGET=1 (or any small value ≤500) on gpt-5 — model
keeps a tiny reasoning chain rather than dropping it entirely.
"minimal" is gpt-5-specific per OpenAI docs; older o-series doesn't
accept it, so we fall through to "low" there.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…line

feat(evaluator): unify thinking baseline across all 3 judges
Judges had no context about paddock's blockedTools list (gcloud_exec,
kubectl_exec, aws_exec, etc.). They saw tool results like
'{error: "blocked in eval mode"}' or 'permission_denied' in the trace
and three different judges interpreted it three different ways:
- some treated it as honest reporting → pass
- gemini sometimes flagged it as 'agent complaining about sandbox' →
  soul_compliance violation
- gpt-5 sometimes flagged it as 'agent didn't determine actual auth
  state' → correctness issue

This added 'Tool blocking in eval mode' section to JUDGE_SYSTEM_PROMPT
calibrates all three judges to the same interpretation:

- Honest reporting + fall-back to task-message values = correct
- Giving up because tools were blocked = still a failure (correctness)
- Misreporting blocked-as-failed = still a failure (correctness)
- Acknowledging blocked tool is NOT a soul_compliance violation

Mirrors the agent's own SOUL.md guidance ('Blocked or Failed Tools'
section), so judges grade against the same expectation the agent is
trained to.

Single-touch global fix — one prompt edit covers all scenarios using
this paddock binary. No per-scenario YAML changes needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-context

feat(evaluator): teach judges that paddock-blocked tools are intentional
The previous 'blocked in eval mode' error-shaped return was frequently
misread by reasoning agents as a real failure. A 5-iter run on Opus 4.7
showed the agent emitting 'Auth check: FAILED' as the response header
in 3 of 5 easy-scenario runs because the harness's error-shaped tool
result trained the agent to treat it as a definitive negative — when
in reality no auth check actually occurred.

Replaces the error-shape with a stub-shape: { stub: true, message: ... }.
The message explicitly tells the agent the call was intercepted, no
result is available, and to continue from task-message values without
labeling any derived check as FAILED. The `stub: true` flag also
gives downstream agent code an unambiguous structural signal if it
wants to branch on intent.

Drops the `error` field on the trace because semantically there is no
error here — the harness intentionally short-circuits, the call never
fails. Removing the field prevents downstream sanitizers from
re-categorizing it as 'permission_denied' or similar real-failure
shapes.

Pairs with knoxai-agent SOUL.md guidance (the 'Blocked or Failed Tools'
section) and the previous fork PR #6 (judge prompt context for
intentional blocking).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…onse

feat(runner): wrap-tool returns stub result instead of error
Supersedes #7 + #8. Pivots paddock's blocked-tool semantics from
'error or undetermined state the agent has to label correctly' to
'the call effectively succeeded; move on.' A blocked tool in eval
mode is the harness intentionally short-circuiting — closer to
'success-equivalent' than to 'failure with undetermined state.'

History:
  - Original shape: { error: "blocked in eval mode" } — agent
    treated as real failure, labeled checks 'auth: FAILED'.
  - PR #7: { stub: true, message: "...intercepted..." } — fixed
    FAILED-mislabeling but introduced fabricated-SUCCEEDED in the
    opposite direction. 100-iter on Opus 4.7 had claude judge
    flagging the agent for fabricating 'Auth check: SUCCEEDED' on
    blocked tools.
  - PR #8 (closed): symmetric UNDETERMINED stub message — would
    have required the agent to reason about labeling.
  - This PR: success-shape — no labeling reasoning required.

New shape:
  { ok: true, output: "[stubbed in eval mode]" }

The output string is bare so the agent has no fake auth tokens or
project info to pattern-match around. SOUL.md "Blocked or Failed
Tools" guidance still drives the agent to use task-message values
for planning when verification doesn't yield real data.

Trade-off acknowledged: removes paddock coverage of the agent's
failure-handling code path (loop.service.ts:errorValue check). That
path is still exercised in production whenever real tools genuinely
fail. The loss is only in eval. If failure-handling needs paddock
coverage in the future, that's a separate scenario type with a tool
that returns a real-shaped error — different concern from
eval-blocking.

Pairs with knoxai-agent:
  - PR #16 fix/soul-adherence (SOUL trim + sanitizer bypass)
  - PR #17 fix/rollback-digest-capture (containers skill rollback)

Mason's tool-result-sanitizer in knoxai-agent (with stub-bypass we
added in PR #16) passes this success-shape through cleanly:
no 'eval mode' / 'sandbox' leak strings in the result, no error
field to scrub. No further agent-side change required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hape

feat(runner): blocked tools return success-shape (supersedes #7 + #8)
Previously paddock's run-N.json tokenUsage map only listed judge models
(e.g. claude-sonnet-4-6, gemini-3-pro-preview, gpt-5) — there was no
visibility into what the agent itself consumed. This blocked cost
attribution: we could see what the judges cost to grade, but not what
the agent cost to run.

This PR captures the agent's token usage via runtime.getTokenUsage()
(landing in knoxai-agent in companion PR Knox-Gov/knoxai-agent#22) and
merges it under the same tokenUsage map keyed as `claude/<model>` so
the existing report writer and downstream aggregators iterate uniformly
across agent + judges.

Changes:
- ExecutionTrace gains optional agentTokenUsage?: Record<string, TokenUsage>
- AgentRunner.runScenarioOnce captures runtime.getTokenUsage() before
  runtime.stop() (order matters — stop tears down the LLM gateway). Uses
  duck typing so older agent runtimes without the getter still work.
- Loop orchestrator sums agentTokenUsage across all scenario traces and
  writes the per-model totals into state.tokenUsage as `claude/<model>`.

After this PR + companion knoxai-agent PR, run-N.json shape becomes:

  "tokenUsage": {
    "claude/claude-opus-4-7":      { inputTokens, outputTokens, totalTokens },  // ← new
    "claude/claude-sonnet-4-6":    { ... },  // Claude judge
    "gemini/gemini-3-pro-preview": { ... },
    "openai/gpt-5":                { ... }
  }

Cost computation + Cost Summary section in aggregate-report.md will land
in knoxai-agent's consistency-test.sh Python aggregator (separate PR).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…judge collision

PR audit caught a silent double-counting bug: keying agent entries as
`claude/<model>` collided with the Claude judge provider keyed as
`claude/<model>` whenever both ran on the same Claude model — which is
the default state in knoxai-agent's .env.example (judge defaults to
claude-sonnet-4-6, agent defaults to claude-sonnet-4-6 unless overridden).
The judge's tokens went into the map first, then the agent loop summed
its own tokens onto the same key, silently misattributing cost between
the two roles.

Switch agent entries to `agent/<model>` so the namespace is unambiguous.
Downstream consumers (knoxai-agent's consistency-test aggregator) detect
agent rows via `pm.startswith("agent/")` instead of model-name equality.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
feat(token-usage): capture agent token usage in tokenUsage map
Companion to knoxai-agent feat/token-cost-tracking.

TokenUsage gains three optional fields so each judge provider can report
billing buckets disjointly:
  cacheCreationTokens   anthropic cache_creation_input_tokens
  cacheReadTokens       anthropic cache_read_input_tokens, openai cached_tokens
  thinkingTokens        gemini thoughtsTokenCount, openai reasoning_tokens

Provider changes:
- claude.ts judge: extracts cache_creation_input_tokens / cache_read_input_tokens
  from response.usage; outputTokens stays as-is (anthropic output already
  includes thinking, no split available in usage).
- gemini.ts judge: populates thinkingTokens from thoughtsTokenCount; recomputes
  totalTokens locally because Google's totalTokenCount inconsistently includes
  thoughts depending on model version. Closes the long-standing under-counting
  where Gemini's thinking spend (5-10K tokens per judge call) silently dropped.
- openai.ts judge: splits prompt_tokens into uncached + cache_read using
  prompt_tokens_details.cached_tokens, splits completion_tokens into visible
  + reasoning using completion_tokens_details.reasoning_tokens. Subtraction
  pattern keeps each bucket disjoint so downstream cost math doesn't
  double-count.

Orchestrator merge for agent token usage now sums all five fields so the
agent's cache breakdown propagates to run-N.json's tokenUsage map (paddock's
companion knoxai-agent runtime captures these via runtime.getTokenUsage()).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After adding cacheCreationTokens / cacheReadTokens / thinkingTokens to
TokenUsage, three render sites still showed only Input/Output/Total —
which now means rows display tiny in/out next to a much larger total
(the post-breakdown total is the sum of all 5 buckets), looking like a
math error to readers.

Changes:
- formatter.ts: FormattedUsage gains a `cache` field. Cache cell renders
  "<cache_read> / <cache_write>" or "—" when both zero. Output cell
  renders "<visible> / <thinking>" or just visible when no thinking.
  Grand total accumulates all 5 buckets and uses summed totalTokens
  (not grandIn + grandOut) so it reconciles with the per-row totals.
- writer.ts: per-eval markdown report adds Cache and merged Output
  columns matching the formatter shape, plus a one-line preface
  explaining the breakdown semantics.
- cli.ts: terse line format gains optional bracketed suffixes —
  `<input> in [+ <r>r/<w>w cache] / <out> out [+ <think> think] (<total>)`
  — only rendered when those buckets are non-zero, so non-cached /
  non-thinking judges keep their compact display.

No data changes; render-only fix to make the post-breakdown numbers
reconcile visually.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously runScenarioOnce captured agentTokenUsage and stopped the
runtime in the happy path, but the outer catch block returned a trace
WITHOUT either step running. If the agent made N successful LLM calls
then crashed on call N+1, all N calls' tokens were silently dropped
from the cost report — and the runtime kept running (cron timers,
heartbeat, S3 push), leaking until the parent process exited.

Restructure the method to a try/catch/finally pattern:
- Hoist `runtime`, `prevEnv`, and `agentTokenUsage` to the method scope
  so both branches can access them
- Move token capture, runtime.stop(), and env restore into the `finally`
  block (return-then-finally semantics guarantee they fire before the
  function actually resolves, so the trace returned by either return
  statement carries the captured tokens)
- captureTokens() helper is defensive: no-op if runtime never got
  constructed, swallows getter throws so cleanup keeps going

Reported by audit follow-up (`agent-runner.ts:382-397`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
feat(cost-breakdown): cache + thinking token buckets, render fixes, failed-scenario capture
JS evaluates return-value expressions before running the corresponding
finally block, so my prior refactor — which moved captureTokens() to
finally — left agentTokenUsage undefined at the moment the trace object
was built. Net effect: every successful run produced a trace with no
agentTokenUsage, and the orchestrator silently dropped the agent row
from tokenUsage. End-to-end test confirmed: run-N.json's tokenUsage
contained only the 3 judge keys, no agent/<model> entry.

Add explicit captureTokens() calls right before each return statement.
The finally block still calls captureTokens() — that's belt-and-
suspenders for the failed-scenario path where the catch block might
itself throw, plus runtime.stop() and env restore still need a finally
home. captureTokens() is idempotent (re-reads the in-memory accumulator)
so calling it twice is safe.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(runner): capture tokens BEFORE return, not (only) in finally
@dmitriyzhuk dmitriyzhuk merged commit a5bbd16 into CleanSlice:main May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants