
Fix AI docs benchmark: fail loudly, cache prompts, add judge model #1202

Open

dkijania wants to merge 3 commits into main from dkijania/fix-ai-benchmark


Conversation


@dkijania dkijania commented May 9, 2026

Summary

The scheduled `Benchmark-LLMs-Docs` workflow has been silently producing 0% on every run since at least early April. The `ANTHROPIC_API_KEY` secret is out of credits, every API call returns 400, and the script swallows each per-question error, records score=0, and exits 0. CI shows the job green with a clean 0/30 results file. Nobody noticed because the workflow appeared healthy.

Run logs from May 4 confirm:
```
ERROR: Anthropic API error 400: "Your credit balance is too low..."
OVERALL: 0.00/30 (0.0%)
```
…and exit code 0. The only outward sign was the 28-41-second run times across the last five scheduled runs (a real benchmark takes 5-10 minutes).

What this PR fixes

| | Before | After |
| --- | --- | --- |
| Failure visibility | All errors → silent 0% → green CI | >50% errors → exit 2 → red CI |
| Token cost | ~50k system tokens × 30 questions × 3 sources, no cache | Same prompt cached across questions; cache reads at ~10% cost |
| Grading bias | Same model answers and judges | Stronger separate judge model (`--judge-model`, default Opus 4.7) |
| Context truncation | Silently truncated at 180k chars | Warns and records `truncated_context: true` in results |
| Question bank | f1 asked about a "minimum recommended fee" not in the docs | Reworded to "current average fee" matching the FAQ |
| Run usage data | Not reported | input / cache_read / cache_write / output tokens in results |

The exhausted API key is a separate operational problem: the secret owner has to top it up. This PR ensures that the next time the key runs dry, the workflow turns red instead of pretending to succeed.

Files changed

  • `scripts/benchmark-llms-docs.mjs` — error tracking, prompt caching, separate judge model, truncation warning, usage accounting, f1 reword
  • `.github/workflows/benchmark-llms-docs.yml` — `judge_model` workflow input, `matrix.fail-fast: false`

Test plan

  • Top up `ANTHROPIC_API_KEY` org secret
  • Trigger the workflow manually with default inputs and confirm it actually exercises the API (~5-10 min runtime, non-zero scores)
  • Check that usage output shows cache_read > 0 on all but the first question per source
  • Re-run with a clearly broken key (e.g., dummy value) and confirm the job fails red with the new exit-2 message

🤖 Generated with Claude Code

The scheduled Benchmark-LLMs-Docs workflow has been silently scoring
0% on every run since at least early April: the ANTHROPIC_API_KEY
secret is out of credits, every API call returns 400, and the script
catches each per-question error, records score=0, and exits 0. CI then
reports the job "successful" with a 0/30 results file.

Changes:

1. Track per-question API errors. If more than 50% of questions
   error, exit 2 with a clear FAILED message that names the likely
   causes (credits, model availability). The credit-exhaustion case
   now turns the workflow red instead of green. (Changes 1-4 are
   sketched in code after this list.)

2. Wrap the docs system prompt in a cache-controlled content block.
   The same ~50k-token system prompt is re-sent for all 30 questions
   per source × 3 sources, so prompt caching cuts repeated input
   costs roughly 10x. Token usage (input / cache_read / cache_write
   / output) is now reported and saved in the results JSON.

3. Add --judge-model (default claude-opus-4-7), separate from the
   answering --model (default claude-sonnet-4-6-20250514). Same model
   judging itself biases scores upward; a stronger separate judge
   gives more honest grades on open-ended categories.

4. Surface truncation explicitly. When the docs corpus exceeds the
   180k-char system budget, print a warning and record
   truncated_context: true in the results metadata, so "full" mode
   regressions stop being silent if the docs grow past the limit.

5. Reword f1 from "minimum recommended fee" to "current average fee"
   — the 0.001 MINA value the question expects is described in the
   FAQ as the average, not a minimum (no minimum is documented).

6. The workflow gains a judge_model input, and the job matrix sets
   fail-fast: false, so one source's failure no longer cancels
   sibling jobs (workflow sketch below).
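
A sketch of the workflow-side additions, assuming a standard `workflow_dispatch` layout; apart from `judge_model`, `fail-fast: false`, and the three source names used elsewhere in this PR, the keys below are illustrative:

```yaml
# Illustrative fragment of .github/workflows/benchmark-llms-docs.yml.
# Only judge_model, fail-fast, and the source names come from this PR;
# the surrounding structure is assumed.
on:
  workflow_dispatch:
    inputs:
      judge_model:
        description: "Model used to grade answers"
        required: false
        default: "claude-opus-4-7"

jobs:
  benchmark:
    strategy:
      fail-fast: false        # one failing source no longer cancels siblings
      matrix:
        source: [llms, full, none]
```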
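And a minimal sketch of the script-side changes (1-4), using direct `fetch` calls against the Anthropic Messages API; the function and variable names here (`buildSystem`, `callModel`, `runBenchmark`, the grading prompt) are invented for illustration and may not match `scripts/benchmark-llms-docs.mjs`:

```js
// Sketch of script changes 1-4. Node 18+ (global fetch) is assumed; all
// identifiers below are invented for illustration.
const API_URL = 'https://api.anthropic.com/v1/messages';
const MAX_SYSTEM_CHARS = 180_000; // later lowered to 100_000 (see follow-up commit)

// Change 4: truncate loudly, and record the fact in the results metadata.
function buildSystem(docs, meta) {
  let text = docs;
  if (text.length > MAX_SYSTEM_CHARS) {
    console.warn(`WARNING: docs corpus is ${text.length} chars; truncating to ${MAX_SYSTEM_CHARS}`);
    meta.truncated_context = true;
    text = text.slice(0, MAX_SYSTEM_CHARS);
  }
  // Change 2: cache-controlled content block. The first call writes the big
  // docs prompt to the prompt cache; later questions read it at reduced cost.
  return [{ type: 'text', text, cache_control: { type: 'ephemeral' } }];
}

async function callModel(model, system, userText, usage) {
  const res = await fetch(API_URL, {
    method: 'POST',
    headers: {
      'x-api-key': process.env.ANTHROPIC_API_KEY,
      'anthropic-version': '2023-06-01',
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      model,
      max_tokens: 1024,
      system,
      messages: [{ role: 'user', content: userText }],
    }),
  });
  if (!res.ok) throw new Error(`Anthropic API error ${res.status}: ${await res.text()}`);
  const data = await res.json();
  // Usage accounting: the Messages API reports cache reads/writes per call.
  usage.input += data.usage.input_tokens;
  usage.cache_read += data.usage.cache_read_input_tokens ?? 0;
  usage.cache_write += data.usage.cache_creation_input_tokens ?? 0;
  usage.output += data.usage.output_tokens;
  return data.content[0].text;
}

const JUDGE_SYSTEM = [{ type: 'text', text: 'You are a strict grader. Reply with a score from 0 to 1.' }];
const gradingPrompt = (q, answer) =>
  `Question: ${q.prompt}\nExpected: ${q.expected}\nAnswer: ${answer}`;

// Change 1: count per-question errors; fail loudly past the 50% threshold.
async function runBenchmark(questions, system, opts) {
  const usage = { input: 0, cache_read: 0, cache_write: 0, output: 0 };
  let errored = 0;
  for (const q of questions) {
    try {
      const answer = await callModel(opts.model, system, q.prompt, usage);
      // Change 3: a separate (stronger) judge model grades the answer.
      q.verdict = await callModel(opts.judgeModel, JUDGE_SYSTEM, gradingPrompt(q, answer), usage);
    } catch (err) {
      console.error(`ERROR on ${q.id}: ${err.message}`);
      errored += 1;
      q.score = 0;
    }
  }
  if (errored / questions.length > 0.5) {
    console.error('FAILED: more than 50% of questions errored ' +
      '(likely causes: exhausted credits, unavailable model ID). Exiting 2.');
    process.exit(2);
  }
  return usage;
}
```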

The exhausted ANTHROPIC_API_KEY itself is a separate operational
problem; the secret owner needs to top it up. With these changes,
though, the next run will at least surface the failure loudly
instead of pretending to succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented May 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs2 | Ready | Preview, Comment | May 9, 2026 4:37pm |


The first run of the patched workflow surfaced two issues that the
new failure detection cleanly exposed:

1. The default model ID claude-sonnet-4-6-20250514 is invalid (404
   not_found_error from the API). The actual published ID is just
   claude-sonnet-4-6 — the dated suffix isn't a real model.

2. The 180_000-char system-prompt budget was 9% of llms-full.txt
   (which is 2_046_837 chars). "full" mode comparisons were running
   against a tiny prefix of the corpus, not the full text. Sonnet 4.6
   has a 200k-token context window — bumping to 750_000 chars
   (~190k tokens, leaves headroom for question + response) lets
   "full" mode actually represent ~37% of the corpus and matches the
   model's real capability.
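
For reference, the arithmetic behind those figures, using the rough heuristic of ~4 characters per token (an assumption; the real tokenizer ratio varies with content):

```js
// Budget math behind the numbers above (~4 chars/token heuristic).
const corpusChars = 2_046_837; // llms-full.txt
console.log((180_000 / corpusChars * 100).toFixed(1)); // "8.8"  -> the "9%" above
console.log((750_000 / corpusChars * 100).toFixed(1)); // "36.6" -> "~37% of the corpus"
console.log(Math.round(750_000 / 4 / 1000) + 'k');     // "188k" -> "~190k tokens"
```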

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first full-matrix run revealed two interacting issues with Tier 1
rate limits (30k input tokens/min for Sonnet 4.6):

1. The 750k-char system budget produced ~190k-token cache writes on
   the first call of the "full" source — instant 429.
2. The matrix ran all three sources in parallel, so even small "llms"
   and "none" jobs got caught in the rate-limit window.

Changes:

- maxSystemChars: 750_000 → 100_000 (~25k tokens), leaving headroom
  under the 30k/min cap. Tier 2+ accounts can bump this back up.
- Workflow matrix gains max-parallel: 1 — sources run sequentially
  instead of competing for the same rate-limit bucket.
- callAnthropic now retries on 429 / 529, honoring Retry-After when
  present, with exponential backoff otherwise (5s, 10s, 20s). Caps at
  4 attempts before failing.
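
A sketch of that retry loop; `callAnthropic`'s real signature in the script likely differs, and `doRequest` stands in for the actual API call:

```js
// Retry on 429 (rate limit) / 529 (overloaded), honoring Retry-After when the
// header carries a seconds value, otherwise backing off 5s -> 10s -> 20s.
// `doRequest` is a stand-in for the real fetch against the Messages API.
const BACKOFF_MS = [5_000, 10_000, 20_000];
const MAX_ATTEMPTS = 4;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callAnthropic(doRequest) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const res = await doRequest();
    if (res.status !== 429 && res.status !== 529) return res;
    if (attempt === MAX_ATTEMPTS) {
      throw new Error(`Anthropic API still returning ${res.status} after ${MAX_ATTEMPTS} attempts`);
    }
    const retryAfter = Number(res.headers.get('retry-after')); // NaN for date-form headers
    const waitMs = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : BACKOFF_MS[attempt - 1];
    console.warn(`HTTP ${res.status}; retrying in ${waitMs / 1000}s (attempt ${attempt}/${MAX_ATTEMPTS})`);
    await sleep(waitMs);
  }
}
```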

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>