Weekly benchmark review: 2026-05-20 #3

# Weekly benchmark review (2026-05-20)

Automated check from `scripts/weekly-benchmarks-check.mjs`. Triage and either:
- Update `data/benchmarks.json` if a new flagship model dropped this week, then close this issue, OR
- Comment `noop` and close if nothing actionable surfaced.

## Current state of `data/benchmarks.json`
- **lastUpdated:** `2026-05-10` (10 days ago)
- **Models tracked:** 18
- **Benchmarks tracked:** 5
- **Models released within last 60 days:** 4

  - 2026-04 | OpenAI | GPT-5.5
  - 2026-04 | DeepSeek | DeepSeek V4 Pro
  - 2026-04 | DeepSeek | DeepSeek V4 Flash
  - 2026-04 | Anthropic | Claude Opus 4.7
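
The summary figures above (days since `lastUpdated`, models inside the 60-day window) could be derived roughly as follows. This is a minimal sketch, not the actual logic in `scripts/weekly-benchmarks-check.mjs`: only the `lastUpdated` field is confirmed by this report, while the `models` array, its `released` / `vendor` / `name` fields, and the `summarize` helper are assumptions made for illustration. A month-only date such as `2026-04` is treated as the first day of that month.

```javascript
// Hypothetical file shape; only `lastUpdated` is confirmed by the report.
const sample = {
  lastUpdated: "2026-05-10",
  models: [
    { released: "2026-04", vendor: "OpenAI", name: "GPT-5.5" },
    // Made-up row, included only to show a model falling outside the window.
    { released: "2025-11", vendor: "ExampleLab", name: "Example Model" },
  ],
};

const DAY_MS = 24 * 60 * 60 * 1000;

// Parse "YYYY-MM" or "YYYY-MM-DD" as UTC; a bare month counts as its first day.
function parseDate(s) {
  const [y, m, d = 1] = s.split("-").map(Number);
  return Date.UTC(y, m - 1, d);
}

// Whole days elapsed from `from` to `to` (both date strings).
function daysBetween(from, to) {
  return Math.floor((parseDate(to) - parseDate(from)) / DAY_MS);
}

// Build the "current state" figures for a given report date.
function summarize(data, today, windowDays = 60) {
  const recent = data.models.filter((m) => {
    const age = daysBetween(m.released, today);
    return age >= 0 && age <= windowDays;
  });
  return {
    daysSinceUpdate: daysBetween(data.lastUpdated, today),
    modelsTracked: data.models.length,
    recent,
  };
}

const summary = summarize(sample, "2026-05-20");
console.log(summary.daysSinceUpdate, summary.recent.map((m) => m.name));
```

Working in UTC keeps the day count independent of the runner's time zone, which matters for a scheduled job.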

## Model-release-flavored news, last 7 days
Matched **15** articles (keyword scan; not all will be real releases).

| Date | Source | Title |
|---|---|---|
| 2026-05-19 | Hacker News AI | My Arduino spins faster when Claude burns more tokens |
| 2026-05-19 | ZDNet AI | Google I/O 2026 live: Our takes on Gemini 3.5, Spark, Android XR, and more |
| 2026-05-19 | The Verge AI | Google’s AI future demands trust — and your personal data |
| 2026-05-19 | WIRED AI | Everything Announced at Google I/O 2026: Gemini, Search, Smart Glasses |
| 2026-05-19 | The Verge AI | Gemini will use Volvo’s external cameras to interpret parking signs |
| 2026-05-19 | The Verge AI | The 13 biggest announcements at Google I/O 2026 |
| 2026-05-19 | The Verge AI | Google wants to compete with Anthropic’s Mythos |
| 2026-05-19 | Google AI Blog | I/O 2026: Welcome to the agentic Gemini era |
| 2026-05-19 | Google AI Blog | Gemini 3.5: frontier intelligence with action |
| 2026-05-19 | Google AI Blog | Everything new in our Google AI subscriptions, fresh from I/O 2026 |
| 2026-05-19 | The Verge AI | Would you let robots spend your money? Google is betting on it |
| 2026-05-19 | The Verge AI | Gmail is going to start talking to you |
| 2026-05-19 | WIRED AI | Gemini Spark Is Google’s Response to OpenClaw’s 24/7 AI Agent |
| 2026-05-19 | ZDNet AI | Google's new Omni AI tool will let you video clone yourself - I'm intrigued (and concerned) |
| 2026-05-19 | Hugging Face Blog | Introducing the Ettin Reranker Family |

**Sources:** The Verge AI (6), Google AI Blog (3), ZDNet AI (2), WIRED AI (2), Hacker News AI (1), Hugging Face Blog (1)
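
A keyword scan with a per-source tally, as summarized in the **Sources** line, might look like the sketch below. The pattern is a hypothetical stand-in, since the real keyword list is not shown in this report, and the `articles` array holds three titles from the table plus one made-up non-matching entry.

```javascript
// Hypothetical keyword pattern; the real scan's keywords are not shown here.
// No `g` flag, so repeated .test() calls stay stateless.
const RELEASE_PATTERN = /\b(gemini|claude|gpt|introducing)\b/i;

const articles = [
  { source: "Google AI Blog", title: "Gemini 3.5: frontier intelligence with action" },
  { source: "Google AI Blog", title: "I/O 2026: Welcome to the agentic Gemini era" },
  { source: "Hugging Face Blog", title: "Introducing the Ettin Reranker Family" },
  // Made-up entry that should not match.
  { source: "Example Feed", title: "Quarterly earnings roundup" },
];

// Count articles per source, most frequent first.
function tallySources(items) {
  const counts = new Map();
  for (const { source } of items) {
    counts.set(source, (counts.get(source) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const matched = articles.filter((a) => RELEASE_PATTERN.test(a.title));
const sourcesLine = tallySources(matched)
  .map(([source, n]) => `${source} (${n})`)
  .join(", ");

console.log(`Matched ${matched.length} articles`);
console.log(`Sources: ${sourcesLine}`);
```

A scan like this matches on wording alone, which is why the table needs manual triage: many hits are coverage of the same event rather than separate releases.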

## HF Open LLM Leaderboard top 10
Captured: `2026-05-19`

| Rank | Model |
|---|---|
| 1 | MaziyarPanahi_calme-3.2-instruct-78b_bfloat16 (avg 52.08 · 78B) |
| 2 | MaziyarPanahi_calme-3.1-instruct-78b_bfloat16 (avg 51.29 · 78B) |
| 3 | dfurman_CalmeRys-78B-Orpo-v0.1_bfloat16 (avg 51.23 · 78B) |
| 4 | MaziyarPanahi_calme-2.4-rys-78b_bfloat16 (avg 50.77 · 78B) |
| 5 | huihui-ai_Qwen2.5-72B-Instruct-abliterated_bfloat16 (avg 48.11 · 73B) |
| 6 | Qwen_Qwen2.5-72B-Instruct_bfloat16 (avg 47.98 · 73B) |
| 7 | MaziyarPanahi_calme-2.1-qwen2.5-72b_bfloat16 (avg 47.86 · 73B) |
| 8 | newsbang_Homer-v1.0-Qwen2.5-72B_bfloat16 (avg 47.46 · 73B) |
| 9 | ehristoforu_qwen2.5-test-32b-it_bfloat16 (avg 47.37 · 33B) |
| 10 | Saxo_Linkbricks-Horizon-AI-Avengers-V1-32B_bfloat16 (avg 47.34 · 33B) |

---

### What "needs update" usually means
1. A flagship from Anthropic / OpenAI / Google / Meta / Mistral / DeepSeek / xAI launched this week → add a row to `data/benchmarks.json`.
2. A tracked model's benchmark scores have shifted materially (re-run, methodology change) → update the row.
3. A new benchmark itself (e.g. a successor to MMLU-Pro) is becoming canonical → add it.

### What it usually does NOT mean
- Research papers about benchmarks (those land on /research, not /benchmarks).
- HN opinion threads about a model.
- Pricing-only changes (those go in `data/pricing.json`).
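
The two lists above can be read as a small decision table. The sketch below encodes them that way, purely as a triage aid. The `kind` labels and the `tracked` / `canonical` flags are hypothetical values a reviewer would assign by hand; nothing here is taken from the actual check script.

```javascript
// Vendors named in the "needs update" list above.
const FLAGSHIP_VENDORS = new Set([
  "Anthropic", "OpenAI", "Google", "Meta", "Mistral", "DeepSeek", "xAI",
]);

// Map a hand-labelled news item to a suggested action.
// `kind`, `tracked`, and `canonical` are hypothetical fields.
function triage(item) {
  switch (item.kind) {
    case "model-release":
      return FLAGSHIP_VENDORS.has(item.vendor)
        ? "add a row to data/benchmarks.json"
        : "noop";
    case "score-change":
      return item.tracked ? "update the row in data/benchmarks.json" : "noop";
    case "new-benchmark":
      return item.canonical ? "add the benchmark to data/benchmarks.json" : "noop";
    case "research-paper":
      return "noop (belongs on /research)";
    case "pricing-change":
      return "noop (goes in data/pricing.json)";
    default:
      // Opinion threads and anything unrecognized.
      return "noop";
  }
}

console.log(triage({ kind: "model-release", vendor: "Google" }));
console.log(triage({ kind: "pricing-change" }));
```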

Bump `lastUpdated` in `data/benchmarks.json` whenever you change anything else in the file.
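
One way to make the `lastUpdated` bump hard to forget is to route every change through a helper that sets it. This is a minimal sketch under assumptions: the `models` array and the row fields are hypothetical, and `addModelRow` is not an existing function in this repository.

```javascript
// Today's date as YYYY-MM-DD (UTC), matching the format of `lastUpdated`.
function isoToday() {
  return new Date().toISOString().slice(0, 10);
}

// Return a copy of the data with `row` appended and `lastUpdated` bumped.
// `models` and the row shape are assumptions about data/benchmarks.json.
function addModelRow(data, row, today = isoToday()) {
  return {
    ...data,
    lastUpdated: today,
    models: [...data.models, row],
  };
}

const before = { lastUpdated: "2026-05-10", models: [] };
// Placeholder row for illustration only.
const after = addModelRow(
  before,
  { released: "2026-05", vendor: "ExampleLab", name: "Example Model" },
  "2026-05-20"
);

console.log(after.lastUpdated, after.models.length);
```

Returning a new object instead of mutating the input keeps the original available for a diff before the file is written back.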

