Merge my paddock changes int cleanslice#2
Merged
Conversation
…I/Gemini
Two changes addressing real failure modes seen in 100-iter knoxai-agent
paddock runs:
1) Switch judge output from regex-parsed text to JSON.
- buildJudgePrompt now ends with a JSON schema spec and an example,
instead of the SCORE[dim]: / VERDICT: / etc. text format.
- parseJudgeResponse tries JSON first (strips ```json fences and
leading preambles, finds the first { ... last } block, JSON.parse,
validates shape). Falls back to the original regex parser if JSON
parsing fails so existing behavior on legacy text outputs is
preserved.
- OpenAI judge: adds response_format: { type: "json_object" } —
guarantees parseable JSON, eliminating the format-drift failures
that caused gpt-4.1 to flag agents as fail when their reasoning
ran past 4096 tokens (~13% fail rate observed on hard scenario).
- Gemini judge: adds generationConfig.responseMimeType:
"application/json" — same guarantee on Google's side.
- Claude judge needs no API change; the prompt now requests JSON
and Claude follows the format reliably. The text-fallback parser
covers any drift.
2) Bump max_tokens from 4096 to 8096 on OpenAI and Gemini judges.
The Claude judge has been at 8096 already. The other two were
inconsistently lower and would truncate verbose multi-dimension
reasoning, leaving the parser unable to extract scores. With JSON
mode active this also guarantees enough headroom for the JSON
object plus reasoning fields without truncation.
Backward compatibility: parseJudgeResponse falls through to the
original regex-based parser if JSON.parse fails or the shape is
unrecognized. No call sites changed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…partial bias
Empirical evidence from a 100-iter knoxai-agent paddock run revealed
gemini-3-pro-preview gives "partial" verdicts on 47% of medium-scenario
runs WHILE simultaneously scoring all dimensions at 8.5-9.1 with
glowing reasoning text praising the agent's behavior. Sample reasoning
on partial verdicts:
"The agent successfully attempted gcloud_exec to check the current
state and used cve_lookup to analyze the CVEs, accurately
incorporating the findings (severity, CVSS, exploit status) into
its justification for the upgrade." → verdict: partial, score: 9.1
This violates paddock's own verdict-from-scores rule ("If ALL scored
dimensions are >= 8 → verdict: pass"). Other judges (Claude, GPT-4.1)
follow the rule correctly:
Per-judge verdict distribution on medium scenario (101 runs):
- claude-sonnet-4-6: 101 pass / 0 partial / 0 fail
- gpt-4.1: 99 pass / 0 partial / 2 fail
- gemini-3-pro-preview: 54 pass / 47 partial / 0 fail ← anomaly
Net effect: per-scenario judge agreement on medium drops from ~98% on
dev runs (no CVE criteria, no anomaly) to ~84% on feat runs, purely
because Gemini downgrades to partial despite high scores.
The previous prompt rule was suggestive ("IMPORTANT — Verdict MUST be
consistent..."). This rewrite makes the rule imperative and explicit:
- Computes the verdict as a deterministic function of scores
- Lists forbidden combinations (e.g. "partial when all dimensions
scored 8+")
- Explicitly redirects nuance into the reasoning field
- Tells judges to lower a dimension score if they want to express
dissatisfaction, instead of downgrading the verdict
JSON-mode output (from PR #1) keeps the schema enforced; this PR
addresses the BEHAVIORAL bias separately. Both changes are needed to
get consistent verdicts across the panel.
If gemini-3-pro continues to show elevated partial rates after this,
escalate to runtime verdict correction in parseJudgeResponse
(post-hoc enforcement of the rule).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ed prompts
Adds caching support for the judge LLM calls — the highest-volume side
of paddock evaluations after the agent loop itself. Structurally
separates the static rubric/format spec (~1500 tokens, identical
across every judge call within a paddock run) from the variable
scenario/trace content.
## Changes
1. types.ts
- New JudgePrompt interface with explicit { system, user } separation.
- JudgeProvider.complete() now accepts string | JudgePrompt.
Backward compatible — legacy string callers still work.
2. evaluator/criteria.ts
- buildJudgePrompt() now returns JudgePrompt instead of a single string.
- Static prefix (instructions, verdict rules, output format, example)
extracted into a constant JUDGE_SYSTEM_PROMPT.
- Variable suffix (scenario, SOUL.md, execution trace, success
criteria, dimension list) goes into the user content.
- parseJudgeResponse unchanged — still operates on the raw string
response, so its JSON+text-fallback path keeps working.
3. evaluator/providers/claude.ts
- When called with a JudgePrompt, sends the system block with
cache_control: { type: "ephemeral" }.
- Anthropic caches everything up to that breakpoint. Cache hits
are billed at 10% of input cost, with measurable TTFT improvement.
- String input still works (sent as a plain user message, no system
block).
4. evaluator/providers/openai.ts
- When called with a JudgePrompt, sends two messages: system role
+ user role.
- Adds prompt_cache_key: "paddock-judge" to keep all judge calls on
the same shard, maximizing implicit cache hit rate.
- Adds prompt_cache_retention: "24h" only for gpt-4.1 / gpt-5
family models (per OpenAI docs — not all models support 24h).
5. evaluator/providers/gemini.ts
- When called with a JudgePrompt, sends the system as
systemInstruction (Gemini's structured separation).
- This positions us for Gemini's implicit caching once it stabilizes
on gemini-3-pro-preview (currently flaky per a known Google issue).
- Explicit cachedContents API not added; deferred as a separate PR.
## Expected impact
Per-judge token economics on a typical paddock run (3 judges × 100
iterations × 3 scenarios = 900 judge calls):
- Claude judge rubric: ~1500 tokens × 90% off via cache_control hits
~= 85% reduction on rubric tokens. Cache write penalty applied
once per cache window.
- OpenAI judge rubric: ~1500 tokens × 50% off via implicit caching
with the prompt_cache_key ~= ~45% reduction on rubric tokens. 24h
retention extends savings across paddock CI runs on gpt-4.1.
- Gemini: structurally separated; implicit caching tries when working.
Combined: roughly 40-50% input-token savings on judge calls across the
panel, with no quality impact since the model sees the same content.
## Verification
- bunx tsc --noEmit clean (no type errors)
- bun build clean (cli.ts builds)
- Backward compat verified: providers handle both string and
JudgePrompt inputs identically when system is empty
- judge.ts unchanged — provider.complete accepts both shapes via the
union type
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
In paddock 5-iter smoke testing, gpt-4.1 was silently failing 3/15
calls during the cold-start burst — all 4 workers hit the API within
~3s of each other and tripped a transient rate window. Failed calls
threw, Promise.allSettled in consensus.ts swallowed the rejection,
and the criteria parser's empty-input fallback wrote verdict=fail
score=0 reasoning={}. The metrics make it look like gpt-4.1 hated the
agent's output when in reality it never got a usable response.
Mirrors the retry pattern already in claude.ts. Up to 3 attempts on
429 / 5xx / network errors with exponential backoff + jitter
(1s/2s/4s + up to 500ms). 4xx errors that aren't 429 (auth, validation)
are surfaced immediately — those won't fix themselves.
Also attaches `status` to the thrown Error so the retry classifier
can read it; previously the message was a string-formatted status only.
Console warns on every retry and on final non-retriable failure so
the workflow log makes silent gpt-4.1 drops visible going forward.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-After Audit follow-ups before merging: 1. Tighten the "undefined status" branch to require `err instanceof TypeError`. Previously any thrown error with no .status would be retried — including res.json() parse failures, AbortErrors, or programming bugs in this provider. Now only fetch-level network failures (which Node/Bun raise as TypeError per WHATWG fetch spec) trigger the network-error retry path. JSON-parse failure on a 200 OK response is more likely a server bug than a transient hiccup, so it surfaces immediately instead of burning retries. 2. Add 408 Request Timeout to the retryable status set. Rare with OpenAI but semantically transient — same class as 429 / 5xx. 3. Honor the Retry-After response header on 4xx/5xx instead of always using exponential backoff. OpenAI returns this on 429 (typically 1-60s); when present, retrying at the server's specified time has a much higher chance of succeeding than our 1-4s exponential backoff. parseRetryAfter handles both numeric-seconds (OpenAI's format) and HTTP-date forms, capped at 60s so a malformed/hostile header can't pin us indefinitely. isRetryable() and parseRetryAfter() are pulled to module-level helpers so the classification logic is testable / readable on its own. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(evaluator): retry transient OpenAI judge failures
The 100-iter audit revealed that gemini-3-pro-preview was the only
strict judge, partial-flagging real SOUL violations on 78% of medium
runs while claude-sonnet-4-6 (no thinking) and gpt-4.1 (non-reasoning)
unanimously voted pass. The asymmetry isn't gemini being unfair — it's
the only judge actually doing chain-of-thought reasoning. Levelling
thinking across all judges turns "consensus" back into a real vote of
equal evaluators.
One env var (EVAL_JUDGE_THINKING_BUDGET, default 8000 tokens) controls
reasoning depth across all three providers. Backward-compatible defaults:
zero behavior change on existing model selections that don't support
the parameters.
Claude judge:
- Adds `thinking: { type: "enabled", budget_tokens }` for claude-4
family models (sonnet, opus, haiku)
- Adds `interleaved-thinking-2025-05-14` beta header so thinking
actually fires; threaded through both OAuth and API-key clients
- max_tokens 8096 → JUDGE_MAX_TOKENS (default 16000) when thinking
is on, to make room for thinking + visible output
- Older claude models (3.5, 3-sonnet) get the unchanged path
Gemini judge:
- Pins thinkingConfig.thinkingBudget for gemini-2.5+/3-* models
(was using model-default dynamic budget — non-deterministic)
- Older gemini models (1.5, 2.0) get the unchanged path
- Bumps maxOutputTokens to JUDGE_MAX_TOKENS for the same reason
OpenAI judge:
- gpt-5 family + o-series: adds reasoning_effort (mapped from the
shared budget: ≤2K→low, ≤12K→medium, >12K→high) and switches
max_tokens → max_completion_tokens (required for reasoning models)
- gpt-4.1 / gpt-4o: unchanged. gpt-4.1 doesn't accept reasoning_effort
(OpenAI markets it as "the smartest non-reasoning model"), so we
explicitly gate the parameter on a regex match.
To switch the OpenAI judge from snap-judging to reasoning, change
vars.OPENAI_MODEL from `gpt-4.1` to `gpt-5` in repo settings — no code
change needed; the provider now adapts based on the model name.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…erywhere
Audit follow-ups before merging:
1. Drop claude-haiku-4 from the SUPPORTS_THINKING regex. The agent's own
THINKING_MODELS allowlist (in knoxai-agent/.../claude.repository.ts)
only includes claude-opus-4 and claude-sonnet-4 — adding haiku
speculatively was unverified and risked a 400 from Anthropic on a
model the user might pick via env override.
2. Make budget=0 actually disable reasoning on OpenAI. Previously the
code skipped the reasoning_effort parameter when budget was 0, but
gpt-5 silently defaults to "medium" effort when the parameter is
absent — meaning EVAL_JUDGE_THINKING_BUDGET=0 was a no-op on OpenAI
while disabling Claude and Gemini. Now the budget=0 path explicitly
sets reasoning_effort to "none" on gpt-5 family (per OpenAI docs:
"for latency-critical tasks that do not benefit from any reasoning")
or "low" on older o-series. Truly off across all three providers
when budget=0 (modulo o-series, where "low" is the API floor).
3. Validate env values to fail safe instead of silently sending NaN.
parseBudget() falls back to default + console.warn on non-finite or
negative input. Without this, EVAL_JUDGE_THINKING_BUDGET=abc would
send NaN to Anthropic (passes JSON.stringify as null), to Gemini
(rejected as invalid), and produce inconsistent behavior.
4. Scope the maxOutputTokens bump on Gemini to thinking-capable models
only. Older gemini-1.5/2.0 stay at the original 8096 cap; gemini-2.5+
gets JUDGE_MAX_OUTPUT_TOKENS. Smaller behavior change to legacy paths.
Disable-fully matrix:
Claude EVAL_JUDGE_THINKING_BUDGET=0 → fully off (no thinking block)
Gemini EVAL_JUDGE_THINKING_BUDGET=0 → fully off (thinkingBudget: 0)
OpenAI EVAL_JUDGE_THINKING_BUDGET=0
+ vars.OPENAI_MODEL=gpt-5* → fully off (reasoning_effort: none)
+ vars.OPENAI_MODEL=o[0-9]+ → minimum (reasoning_effort: low)
+ vars.OPENAI_MODEL=gpt-4.1 → already non-reasoning, unchanged
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OpenAI's reasoning_effort doesn't take a numeric budget — only a string tier. gpt-5 family supports a "minimal" tier between "none" and "low", documented as a small fixed reasoning chain for latency-sensitive use. Adding it gives EVAL_JUDGE_THINKING_BUDGET more granularity at the low end: budget === 0 → "none" (gpt-5) | "low" (o-series fallback) 1 ≤ ≤ 500 → "minimal" (gpt-5 only) | "low" (o-series fallback) 501 ≤ ≤ 2000 → "low" 2001 ≤ ≤ 12000 → "medium" > 12000 → "high" Lets the user pick "almost off but not quite zero" by setting EVAL_JUDGE_THINKING_BUDGET=1 (or any small value ≤500) on gpt-5 — model keeps a tiny reasoning chain rather than dropping it entirely. "minimal" is gpt-5-specific per OpenAI docs; older o-series doesn't accept it, so we fall through to "low" there. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…line feat(evaluator): unify thinking baseline across all 3 judges
Judges had no context about paddock's blockedTools list (gcloud_exec,
kubectl_exec, aws_exec, etc.). They saw tool results like
'{error: "blocked in eval mode"}' or 'permission_denied' in the trace
and three different judges interpreted it three different ways:
- some treated it as honest reporting → pass
- gemini sometimes flagged it as 'agent complaining about sandbox' →
soul_compliance violation
- gpt-5 sometimes flagged it as 'agent didn't determine actual auth
state' → correctness issue
This added 'Tool blocking in eval mode' section to JUDGE_SYSTEM_PROMPT
calibrates all three judges to the same interpretation:
- Honest reporting + fall-back to task-message values = correct
- Giving up because tools were blocked = still a failure (correctness)
- Misreporting blocked-as-failed = still a failure (correctness)
- Acknowledging blocked tool is NOT a soul_compliance violation
Mirrors the agent's own SOUL.md guidance ('Blocked or Failed Tools'
section), so judges grade against the same expectation the agent is
trained to.
Single-touch global fix — one prompt edit covers all scenarios using
this paddock binary. No per-scenario YAML changes needed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-context feat(evaluator): teach judges that paddock-blocked tools are intentional
The previous 'blocked in eval mode' error-shaped return was frequently
misread by reasoning agents as a real failure. A 5-iter run on Opus 4.7
showed the agent emitting 'Auth check: FAILED' as the response header
in 3 of 5 easy-scenario runs because the harness's error-shaped tool
result trained the agent to treat it as a definitive negative — when
in reality no auth check actually occurred.
Replaces the error-shape with a stub-shape: { stub: true, message: ... }.
The message explicitly tells the agent the call was intercepted, no
result is available, and to continue from task-message values without
labeling any derived check as FAILED. The `stub: true` flag also
gives downstream agent code an unambiguous structural signal if it
wants to branch on intent.
Drops the `error` field on the trace because semantically there is no
error here — the harness intentionally short-circuits, the call never
fails. Removing the field prevents downstream sanitizers from
re-categorizing it as 'permission_denied' or similar real-failure
shapes.
Pairs with knoxai-agent SOUL.md guidance (the 'Blocked or Failed Tools'
section) and the previous fork PR #6 (judge prompt context for
intentional blocking).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…onse feat(runner): wrap-tool returns stub result instead of error
Supersedes #7 + #8. Pivots paddock's blocked-tool semantics from 'error or undetermined state the agent has to label correctly' to 'the call effectively succeeded; move on.' A blocked tool in eval mode is the harness intentionally short-circuiting — closer to 'success-equivalent' than to 'failure with undetermined state.' History: - Original shape: { error: "blocked in eval mode" } — agent treated as real failure, labeled checks 'auth: FAILED'. - PR #7: { stub: true, message: "...intercepted..." } — fixed FAILED-mislabeling but introduced fabricated-SUCCEEDED in the opposite direction. 100-iter on Opus 4.7 had claude judge flagging the agent for fabricating 'Auth check: SUCCEEDED' on blocked tools. - PR #8 (closed): symmetric UNDETERMINED stub message — would have required the agent to reason about labeling. - This PR: success-shape — no labeling reasoning required. New shape: { ok: true, output: "[stubbed in eval mode]" } The output string is bare so the agent has no fake auth tokens or project info to pattern-match around. SOUL.md "Blocked or Failed Tools" guidance still drives the agent to use task-message values for planning when verification doesn't yield real data. Trade-off acknowledged: removes paddock coverage of the agent's failure-handling code path (loop.service.ts:errorValue check). That path is still exercised in production whenever real tools genuinely fail. The loss is only in eval. If failure-handling needs paddock coverage in the future, that's a separate scenario type with a tool that returns a real-shaped error — different concern from eval-blocking. Pairs with knoxai-agent: - PR #16 fix/soul-adherence (SOUL trim + sanitizer bypass) - PR #17 fix/rollback-digest-capture (containers skill rollback) Mason's tool-result-sanitizer in knoxai-agent (with stub-bypass we added in PR #16) passes this success-shape through cleanly: no 'eval mode' / 'sandbox' leak strings in the result, no error field to scrub. No further agent-side change required. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously paddock's run-N.json tokenUsage map only listed judge models (e.g. claude-sonnet-4-6, gemini-3-pro-preview, gpt-5) — there was no visibility into what the agent itself consumed. This blocked cost attribution: we could see what the judges cost to grade, but not what the agent cost to run. This PR captures the agent's token usage via runtime.getTokenUsage() (landing in knoxai-agent in companion PR Knox-Gov/knoxai-agent#22) and merges it under the same tokenUsage map keyed as `claude/<model>` so the existing report writer and downstream aggregators iterate uniformly across agent + judges. Changes: - ExecutionTrace gains optional agentTokenUsage?: Record<string, TokenUsage> - AgentRunner.runScenarioOnce captures runtime.getTokenUsage() before runtime.stop() (order matters — stop tears down the LLM gateway). Uses duck typing so older agent runtimes without the getter still work. - Loop orchestrator sums agentTokenUsage across all scenario traces and writes the per-model totals into state.tokenUsage as `claude/<model>`. After this PR + companion knoxai-agent PR, run-N.json shape becomes: "tokenUsage": { "claude/claude-opus-4-7": { inputTokens, outputTokens, totalTokens }, // ← new "claude/claude-sonnet-4-6": { ... }, // Claude judge "gemini/gemini-3-pro-preview": { ... }, "openai/gpt-5": { ... } } Cost computation + Cost Summary section in aggregate-report.md will land in knoxai-agent's consistency-test.sh Python aggregator (separate PR). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…judge collision
PR audit caught a silent double-counting bug: keying agent entries as
`claude/<model>` collided with the Claude judge provider keyed as
`claude/<model>` whenever both ran on the same Claude model — which is
the default state in knoxai-agent's .env.example (judge defaults to
claude-sonnet-4-6, agent defaults to claude-sonnet-4-6 unless overridden).
The judge's tokens went into the map first, then the agent loop summed
its own tokens onto the same key, silently misattributing cost between
the two roles.
Switch agent entries to `agent/<model>` so the namespace is unambiguous.
Downstream consumers (knoxai-agent's consistency-test aggregator) detect
agent rows via `pm.startswith("agent/")` instead of model-name equality.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
feat(token-usage): capture agent token usage in tokenUsage map
Companion to knoxai-agent feat/token-cost-tracking. TokenUsage gains three optional fields so each judge provider can report billing buckets disjointly: cacheCreationTokens anthropic cache_creation_input_tokens cacheReadTokens anthropic cache_read_input_tokens, openai cached_tokens thinkingTokens gemini thoughtsTokenCount, openai reasoning_tokens Provider changes: - claude.ts judge: extracts cache_creation_input_tokens / cache_read_input_tokens from response.usage; outputTokens stays as-is (anthropic output already includes thinking, no split available in usage). - gemini.ts judge: populates thinkingTokens from thoughtsTokenCount; recomputes totalTokens locally because Google's totalTokenCount inconsistently includes thoughts depending on model version. Closes the long-standing under-counting where Gemini's thinking spend (5-10K tokens per judge call) silently dropped. - openai.ts judge: splits prompt_tokens into uncached + cache_read using prompt_tokens_details.cached_tokens, splits completion_tokens into visible + reasoning using completion_tokens_details.reasoning_tokens. Subtraction pattern keeps each bucket disjoint so downstream cost math doesn't double-count. Orchestrator merge for agent token usage now sums all five fields so the agent's cache breakdown propagates to run-N.json's tokenUsage map (paddock's companion knoxai-agent runtime captures these via runtime.getTokenUsage()). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After adding cacheCreationTokens / cacheReadTokens / thinkingTokens to TokenUsage, three render sites still showed only Input/Output/Total — which now means rows display tiny in/out next to a much larger total (the post-breakdown total is the sum of all 5 buckets), looking like a math error to readers. Changes: - formatter.ts: FormattedUsage gains a `cache` field. Cache cell renders "<cache_read> / <cache_write>" or "—" when both zero. Output cell renders "<visible> / <thinking>" or just visible when no thinking. Grand total accumulates all 5 buckets and uses summed totalTokens (not grandIn + grandOut) so it reconciles with the per-row totals. - writer.ts: per-eval markdown report adds Cache and merged Output columns matching the formatter shape, plus a one-line preface explaining the breakdown semantics. - cli.ts: terse line format gains optional bracketed suffixes — `<input> in [+ <r>r/<w>w cache] / <out> out [+ <think> think] (<total>)` — only rendered when those buckets are non-zero, so non-cached / non-thinking judges keep their compact display. No data changes; render-only fix to make the post-breakdown numbers reconcile visually. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously runScenarioOnce captured agentTokenUsage and stopped the runtime in the happy path, but the outer catch block returned a trace WITHOUT either step running. If the agent made N successful LLM calls then crashed on call N+1, all N calls' tokens were silently dropped from the cost report — and the runtime kept running (cron timers, heartbeat, S3 push), leaking until the parent process exited. Restructure the method to a try/catch/finally pattern: - Hoist `runtime`, `prevEnv`, and `agentTokenUsage` to the method scope so both branches can access them - Move token capture, runtime.stop(), and env restore into the `finally` block (return-then-finally semantics guarantee they fire before the function actually resolves, so the trace returned by either return statement carries the captured tokens) - captureTokens() helper is defensive: no-op if runtime never got constructed, swallows getter throws so cleanup keeps going Reported by audit follow-up (`agent-runner.ts:382-397`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
feat(cost-breakdown): cache + thinking token buckets, render fixes, failed-scenario capture
JS evaluates return-value expressions before running the corresponding finally block, so my prior refactor — which moved captureTokens() to finally — left agentTokenUsage undefined at the moment the trace object was built. Net effect: every successful run produced a trace with no agentTokenUsage, and the orchestrator silently dropped the agent row from tokenUsage. End-to-end test confirmed: run-N.json's tokenUsage contained only the 3 judge keys, no agent/<model> entry. Add explicit captureTokens() calls right before each return statement. The finally block still calls captureTokens() — that's belt-and- suspenders for the failed-scenario path where the catch block might itself throw, plus runtime.stop() and env restore still need a finally home. captureTokens() is idempotent (re-reads the in-memory accumulator) so calling it twice is safe. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(runner): capture tokens BEFORE return, not (only) in finally
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.