Weekly benchmark review (2026-05-20)
Automated check from scripts/weekly-benchmarks-check.mjs. Triage and either:
- Update data/benchmarks.json if a new flagship model dropped this week, then close this issue, OR
- Comment noop and close if nothing actionable surfaced.
Current state of data/benchmarks.json
Model-release-flavored news, last 7 days
Matched 15 articles (keyword scan; not all will be real releases).
| Date | Source | Title |
|---|---|---|
| 2026-05-19 | Hacker News AI | My Arduino spins faster when Claude burns more tokens |
| 2026-05-19 | ZDNet AI | Google I/O 2026 live: Our takes on Gemini 3.5, Spark, Android XR, and more |
| 2026-05-19 | The Verge AI | Google’s AI future demands trust — and your personal data |
| 2026-05-19 | WIRED AI | Everything Announced at Google I/O 2026: Gemini, Search, Smart Glasses |
| 2026-05-19 | The Verge AI | Gemini will use Volvo’s external cameras to interpret parking signs |
| 2026-05-19 | The Verge AI | The 13 biggest announcements at Google I/O 2026 |
| 2026-05-19 | The Verge AI | Google wants to compete with Anthropic’s Mythos |
| 2026-05-19 | Google AI Blog | I/O 2026: Welcome to the agentic Gemini era |
| 2026-05-19 | Google AI Blog | Gemini 3.5: frontier intelligence with action |
| 2026-05-19 | Google AI Blog | Everything new in our Google AI subscriptions, fresh from I/O 2026 |
| 2026-05-19 | The Verge AI | Would you let robots spend your money? Google is betting on it |
| 2026-05-19 | The Verge AI | Gmail is going to start talking to you |
| 2026-05-19 | WIRED AI | Gemini Spark Is Google’s Response to OpenClaw’s 24/7 AI Agent |
| 2026-05-19 | ZDNet AI | Google's new Omni AI tool will let you video clone yourself - I'm intrigued (and concerned) |
| 2026-05-19 | Hugging Face Blog | Introducing the Ettin Reranker Family |
Sources: The Verge AI (6), Google AI Blog (3), ZDNet AI (2), WIRED AI (2), Hacker News AI (1), Hugging Face Blog (1)
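For triagers wondering why non-release headlines show up in the table, here is a minimal sketch of how a keyword scan and the per-source tally might work. The keyword list, the article shape ({ date, source, title }), and the function names are assumptions for illustration; the actual logic in scripts/weekly-benchmarks-check.mjs may differ.

```javascript
// Hypothetical keyword list; the real scan may use different terms.
const RELEASE_KEYWORDS = ["gemini", "claude", "gpt", "llama", "mistral", "deepseek", "grok"];

// Keep articles whose title mentions at least one keyword (case-insensitive).
function matchReleaseNews(articles, keywords = RELEASE_KEYWORDS) {
  return articles.filter(({ title }) => {
    const lower = title.toLowerCase();
    return keywords.some((kw) => lower.includes(kw));
  });
}

// Count matched articles per source, highest count first.
function tallySources(articles) {
  const counts = new Map();
  for (const { source } of articles) {
    counts.set(source, (counts.get(source) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const sample = [
  { date: "2026-05-19", source: "Google AI Blog", title: "Gemini 3.5: frontier intelligence with action" },
  { date: "2026-05-19", source: "Hacker News AI", title: "My Arduino spins faster when Claude burns more tokens" },
  // Hypothetical headline, not from the table above, to show a non-match.
  { date: "2026-05-19", source: "Example Feed", title: "A headline about something unrelated" },
];

const matched = matchReleaseNews(sample);
console.log(matched.map((a) => a.title));
console.log(tallySources(matched));
```

Under these assumptions the Arduino story matches on "Claude" even though it is not a release, which is the kind of false positive the "not all will be real releases" note warns about.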
HF Open LLM Leaderboard top 10
Captured: 2026-05-19
| Rank | Model |
|---|---|
| 1 | MaziyarPanahi_calme-3.2-instruct-78b_bfloat16 (avg 52.08 · 78B) |
| 2 | MaziyarPanahi_calme-3.1-instruct-78b_bfloat16 (avg 51.29 · 78B) |
| 3 | dfurman_CalmeRys-78B-Orpo-v0.1_bfloat16 (avg 51.23 · 78B) |
| 4 | MaziyarPanahi_calme-2.4-rys-78b_bfloat16 (avg 50.77 · 78B) |
| 5 | huihui-ai_Qwen2.5-72B-Instruct-abliterated_bfloat16 (avg 48.11 · 73B) |
| 6 | Qwen_Qwen2.5-72B-Instruct_bfloat16 (avg 47.98 · 73B) |
| 7 | MaziyarPanahi_calme-2.1-qwen2.5-72b_bfloat16 (avg 47.86 · 73B) |
| 8 | newsbang_Homer-v1.0-Qwen2.5-72B_bfloat16 (avg 47.46 · 73B) |
| 9 | ehristoforu_qwen2.5-test-32b-it_bfloat16 (avg 47.37 · 33B) |
| 10 | Saxo_Linkbricks-Horizon-AI-Avengers-V1-32B_bfloat16 (avg 47.34 · 33B) |
What "needs update" usually means
- A flagship from Anthropic / OpenAI / Google / Meta / Mistral / DeepSeek / xAI launched this week → add a row to data/benchmarks.json.
- A tracked model's benchmark scores have shifted materially (re-run, methodology change) → update the row.
- A new benchmark (e.g. a successor to MMLU-Pro) is becoming canonical → add it.
What it usually does NOT mean
- Research papers about benchmarks (those land on /research, not /benchmarks).
- HN opinion threads about a model.
- Pricing-only changes (those go in data/pricing.json).
Bump lastUpdated in data/benchmarks.json whenever you change anything else in the file.
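One way to make the lastUpdated bump hard to forget is to route every edit through a helper that applies the change and stamps the date in one step. This is a sketch only: the helper name withChange, the models array, and the row shape are hypothetical, since this issue does not show the schema of data/benchmarks.json. Only the lastUpdated field name comes from the text above.

```javascript
// Apply `mutate` to a copy of the benchmarks data and bump lastUpdated.
// Assumes the data is plain JSON, so a JSON round-trip is a safe deep copy.
function withChange(data, mutate, today = new Date()) {
  const next = JSON.parse(JSON.stringify(data));
  mutate(next);
  // toISOString() is always UTC; keep only the YYYY-MM-DD part.
  next.lastUpdated = today.toISOString().slice(0, 10);
  return next;
}

// Hypothetical file contents and row, for illustration only.
const before = { lastUpdated: "2026-05-10", models: [] };
const after = withChange(
  before,
  (d) => d.models.push({ name: "example-model" }),
  new Date("2026-05-20T00:00:00Z"),
);

console.log(after.lastUpdated); // 2026-05-20
console.log(before.lastUpdated); // 2026-05-10 (input is not modified)
```

Reading the file, calling the helper, and writing the result back is left out so the sketch stays independent of how the repo loads its data.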
Current state of data/benchmarks.json
- lastUpdated: 2026-05-10 (10 days ago)
- Models tracked: 18
- Benchmarks tracked: 5
- Models released within last 60 days: 4
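The "10 days ago" figure reported for lastUpdated can be reproduced with a small date difference. The sketch below assumes both dates are YYYY-MM-DD strings compared at UTC midnight; the function names and the 14-day threshold are hypothetical and not taken from the check script.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between two YYYY-MM-DD dates, both taken at UTC midnight.
function daysSince(lastUpdated, today) {
  const from = Date.parse(`${lastUpdated}T00:00:00Z`);
  const to = Date.parse(`${today}T00:00:00Z`);
  return Math.round((to - from) / MS_PER_DAY);
}

// Hypothetical staleness rule: flag the file if it is older than maxDays.
function isStale(lastUpdated, today, maxDays = 14) {
  return daysSince(lastUpdated, today) > maxDays;
}

console.log(daysSince("2026-05-10", "2026-05-20")); // 10
console.log(isStale("2026-05-10", "2026-05-20")); // false
```

Anchoring both dates to UTC midnight avoids off-by-one results when the script runs in a timezone with daylight saving changes.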