
Fix AI docs benchmark: fail loudly, cache prompts, add judge model #1202

Open

dkijania wants to merge 3 commits into main from dkijania/fix-ai-benchmark


Conversation


@dkijania dkijania commented May 9, 2026

Summary

The scheduled `Benchmark-LLMs-Docs` workflow has been silently producing 0% on every run since at least early April. The `ANTHROPIC_API_KEY` secret is out of credits, every API call returns 400, and the script swallows each per-question error, records score=0, and exits 0. CI shows the job green with a clean 0/30 results file. Nobody noticed because the workflow appeared healthy.

Run logs from May 4 confirm:
```
ERROR: Anthropic API error 400: "Your credit balance is too low..."
OVERALL: 0.00/30 (0.0%)
```
…and exit code 0. The only outward sign was the 28-41-second run times across the last five scheduled runs (a real benchmark takes 5-10 minutes).

What this PR fixes

| | Before | After |
| --- | --- | --- |
| Failure visibility | All errors → silent 0% → green CI | >50% errors → exit 2 → red CI |
| Token cost | ~50k system tokens × 30 questions × 3 sources, no cache | Same prompt cached across questions; cache reads at ~10% cost |
| Grading bias | Same model answers and judges | Stronger separate judge model (`--judge-model`, default Opus 4.7) |
| Context truncation | Silently truncated at 180k chars | Warns and records `truncated_context: true` in results |
| Question bank | f1 asked about a "minimum recommended fee" not in the docs | Reworded to "current average fee" matching the FAQ |
| Run usage data | Not reported | input / cache_read / cache_write / output tokens in results |

The exhausted API key is a separate operational problem: the secret owner has to top it up. This PR ensures that the next time the key runs dry, the workflow turns red instead of pretending to succeed.

Files changed

  • `scripts/benchmark-llms-docs.mjs` — error tracking, prompt caching, separate judge model, truncation warning, usage accounting, f1 reword
  • `.github/workflows/benchmark-llms-docs.yml` — `judge_model` workflow input, `matrix.fail-fast: false`

Test plan

  • Top up `ANTHROPIC_API_KEY` org secret
  • Trigger the workflow manually with default inputs and confirm it actually exercises the API (~5-10 min runtime, non-zero scores)
  • Check that usage output shows cache_read > 0 on all but the first question per source
  • Re-run with a clearly broken key (e.g., dummy value) and confirm the job fails red with the new exit-2 message

🤖 Generated with Claude Code

The scheduled Benchmark-LLMs-Docs workflow has been silently scoring
0% on every run since at least early April: the ANTHROPIC_API_KEY
secret is out of credits, every API call returns 400, and the script
catches each per-question error, records score=0, and exits 0. CI then
reports the job "successful" with a 0/30 results file.

Changes:

1. Track per-question API errors. If more than 50% of questions
   error, exit 2 with a clear FAILED message that names the likely
   causes (credits, model availability). The credit-exhaustion case
   now turns the workflow red instead of green. (Changes 1-4 are
   sketched in code after this list.)

2. Wrap the docs system prompt in a cache-controlled content block.
   The same ~50k-token system prompt is re-sent for all 30 questions
   per source × 3 sources, so prompt caching cuts repeated input
   costs roughly 10x. Token usage (input / cache_read / cache_write
   / output) is now reported and saved in the results JSON.

3. Add --judge-model (default claude-opus-4-7), separate from the
   answering --model (default claude-sonnet-4-6-20250514). Same model
   judging itself biases scores upward; a stronger separate judge
   gives more honest grades on open-ended categories.

4. Surface truncation explicitly. When the docs corpus exceeds the
   180k-char system budget, print a warning and record
   truncated_context: true in the results metadata, so "full" mode
   regressions stop being silent if the docs grow past the limit.

5. Reword f1 from "minimum recommended fee" to "current average fee"
   — the 0.001 MINA value the question expects is described in the
   FAQ as the average, not a minimum (no minimum is documented).

6. The workflow gains a judge_model input, and the job matrix sets
   fail-fast: false, so one source's failure no longer cancels
   sibling jobs (workflow sketch below).
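
A sketch of the workflow-side additions, assuming a standard `workflow_dispatch` layout; apart from `judge_model`, `fail-fast: false`, and the three source names used elsewhere in this PR, the keys below are illustrative:

```yaml
# Illustrative fragment of .github/workflows/benchmark-llms-docs.yml.
# Only judge_model, fail-fast, and the source names come from this PR;
# the surrounding structure is assumed.
on:
  workflow_dispatch:
    inputs:
      judge_model:
        description: "Model used to grade answers"
        required: false
        default: "claude-opus-4-7"

jobs:
  benchmark:
    strategy:
      fail-fast: false        # one failing source no longer cancels siblings
      matrix:
        source: [llms, full, none]
```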
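And a minimal sketch of the script-side changes (1-4), using direct `fetch` calls against the Anthropic Messages API; the function and variable names here (`buildSystem`, `callModel`, `runBenchmark`, the grading prompt) are invented for illustration and may not match `scripts/benchmark-llms-docs.mjs`:

```js
// Sketch of script changes 1-4. Node 18+ (global fetch) is assumed; all
// identifiers below are invented for illustration.
const API_URL = 'https://api.anthropic.com/v1/messages';
const MAX_SYSTEM_CHARS = 180_000; // later lowered to 100_000 (see follow-up commit)

// Change 4: truncate loudly, and record the fact in the results metadata.
function buildSystem(docs, meta) {
  let text = docs;
  if (text.length > MAX_SYSTEM_CHARS) {
    console.warn(`WARNING: docs corpus is ${text.length} chars; truncating to ${MAX_SYSTEM_CHARS}`);
    meta.truncated_context = true;
    text = text.slice(0, MAX_SYSTEM_CHARS);
  }
  // Change 2: cache-controlled content block. The first call writes the big
  // docs prompt to the prompt cache; later questions read it at reduced cost.
  return [{ type: 'text', text, cache_control: { type: 'ephemeral' } }];
}

async function callModel(model, system, userText, usage) {
  const res = await fetch(API_URL, {
    method: 'POST',
    headers: {
      'x-api-key': process.env.ANTHROPIC_API_KEY,
      'anthropic-version': '2023-06-01',
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      model,
      max_tokens: 1024,
      system,
      messages: [{ role: 'user', content: userText }],
    }),
  });
  if (!res.ok) throw new Error(`Anthropic API error ${res.status}: ${await res.text()}`);
  const data = await res.json();
  // Usage accounting: the Messages API reports cache reads/writes per call.
  usage.input += data.usage.input_tokens;
  usage.cache_read += data.usage.cache_read_input_tokens ?? 0;
  usage.cache_write += data.usage.cache_creation_input_tokens ?? 0;
  usage.output += data.usage.output_tokens;
  return data.content[0].text;
}

const JUDGE_SYSTEM = [{ type: 'text', text: 'You are a strict grader. Reply with a score from 0 to 1.' }];
const gradingPrompt = (q, answer) =>
  `Question: ${q.prompt}\nExpected: ${q.expected}\nAnswer: ${answer}`;

// Change 1: count per-question errors; fail loudly past the 50% threshold.
async function runBenchmark(questions, system, opts) {
  const usage = { input: 0, cache_read: 0, cache_write: 0, output: 0 };
  let errored = 0;
  for (const q of questions) {
    try {
      const answer = await callModel(opts.model, system, q.prompt, usage);
      // Change 3: a separate (stronger) judge model grades the answer.
      q.verdict = await callModel(opts.judgeModel, JUDGE_SYSTEM, gradingPrompt(q, answer), usage);
    } catch (err) {
      console.error(`ERROR on ${q.id}: ${err.message}`);
      errored += 1;
      q.score = 0;
    }
  }
  if (errored / questions.length > 0.5) {
    console.error('FAILED: more than 50% of questions errored ' +
      '(likely causes: exhausted credits, unavailable model ID). Exiting 2.');
    process.exit(2);
  }
  return usage;
}
```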

The exhausted ANTHROPIC_API_KEY itself is a separate operational
problem; the secret owner needs to top it up. With these changes,
though, the next run will at least surface the failure loudly
instead of pretending to succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented May 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs2 | Ready | Preview, Comment | May 9, 2026 4:37pm |


The first run of the patched workflow surfaced two issues that the
new failure detection cleanly exposed:

1. The default model ID claude-sonnet-4-6-20250514 is invalid (404
   not_found_error from the API). The actual published ID is just
   claude-sonnet-4-6 — the dated suffix isn't a real model.

2. The 180_000-char system-prompt budget was 9% of llms-full.txt
   (which is 2_046_837 chars). "full" mode comparisons were running
   against a tiny prefix of the corpus, not the full text. Sonnet 4.6
   has a 200k-token context window — bumping to 750_000 chars
   (~190k tokens, leaves headroom for question + response) lets
   "full" mode actually represent ~37% of the corpus and matches the
   model's real capability.
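
For reference, the arithmetic behind those figures, using the rough heuristic of ~4 characters per token (an assumption; the real tokenizer ratio varies with content):

```js
// Budget math behind the numbers above (~4 chars/token heuristic).
const corpusChars = 2_046_837; // llms-full.txt
console.log((180_000 / corpusChars * 100).toFixed(1)); // "8.8"  -> the "9%" above
console.log((750_000 / corpusChars * 100).toFixed(1)); // "36.6" -> "~37% of the corpus"
console.log(Math.round(750_000 / 4 / 1000) + 'k');     // "188k" -> "~190k tokens"
```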

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first full-matrix run revealed two interacting issues with Tier 1
rate limits (30k input tokens/min for Sonnet 4.6):

1. The 750k-char system budget produced ~190k-token cache writes on
   the first call of the "full" source — instant 429.
2. The matrix ran all three sources in parallel, so even small "llms"
   and "none" jobs got caught in the rate-limit window.

Changes:

- maxSystemChars: 750_000 → 100_000 (~25k tokens), leaving headroom
  under the 30k/min cap. Tier 2+ accounts can bump this back up.
- Workflow matrix gains max-parallel: 1 — sources run sequentially
  instead of competing for the same rate-limit bucket.
- callAnthropic now retries on 429 / 529, honoring Retry-After when
  present, with exponential backoff otherwise (5s, 10s, 20s). Caps at
  4 attempts before failing.
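
A sketch of that retry loop; `callAnthropic`'s real signature in the script likely differs, and `doRequest` stands in for the actual API call:

```js
// Retry on 429 (rate limit) / 529 (overloaded), honoring Retry-After when the
// header carries a seconds value, otherwise backing off 5s -> 10s -> 20s.
// `doRequest` is a stand-in for the real fetch against the Messages API.
const BACKOFF_MS = [5_000, 10_000, 20_000];
const MAX_ATTEMPTS = 4;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callAnthropic(doRequest) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const res = await doRequest();
    if (res.status !== 429 && res.status !== 529) return res;
    if (attempt === MAX_ATTEMPTS) {
      throw new Error(`Anthropic API still returning ${res.status} after ${MAX_ATTEMPTS} attempts`);
    }
    const retryAfter = Number(res.headers.get('retry-after')); // NaN for date-form headers
    const waitMs = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : BACKOFF_MS[attempt - 1];
    console.warn(`HTTP ${res.status}; retrying in ${waitMs / 1000}s (attempt ${attempt}/${MAX_ATTEMPTS})`);
    await sleep(waitMs);
  }
}
```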

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>