Release 3.7.0: Claude Fable 5 + Arena Reliability · Ammaar-Alam/minebench

Models

Model	PR	Average generation time	Cost	Average JSON size	Max JSON size	Notes
Claude Fable 5	#50	18m 04s (1,084.4s)	$54.93 for 15 builds	30.65 MiB	97.39 MiB (worldtree)	Added as a first-class Anthropic model with max adaptive effort by default, a 128,000-token output cap, direct Anthropic/OpenRouter routing, and default sampling for restricted Fable requests
Claude 4.8 Opus	#41	24m 48s (1,487.9s)	$41.52 for 15 builds	39.95 MiB	244.38 MiB (worldtree)	Shown here for comparison

(Note: Opus 4.8 was added in the previous release and is shown here only for comparison)

Notes

Fable was quite an interesting model in a way that's harder to describe, so I hope the posted statistics help convey my point. The model was surprisingly cheap (for its MineBench outputs anyway). In fact, I'm quite surprised, as in the web harness (Claude.ai), the model seems to think for much longer, but through the API calls you can see that it actually ended up thinking for less time than Opus 4.8 did.

Furthermore, I think the quality of the model's builds was very surprising: they don't seem as big of a leap over GPT 5.5 Pro as the the official benchmark scores might suggest, but the model clearly has very high attention to detail. For example, this is the first model that in the Arcade Machine build, actually created a correctly detailed screen (of PacMan), including the full layout, a score, and even a "1UP" label. Though it seems the model was quite conservative with its interpretation of the system-prompt, and (subjectively) not all of its builds were clearly more impressive than 4.8

It's interesting how the model was able to make these detailed builds while keeping the overall JSON size lower in comparison to Opus 4.8, and while thinking for less time. Pure speculation: I think this might indicate why Claude Fable is supposedly much better at coding-related tasks; it actually completes the task with an intuitive approach and without adding excess.

Still, the results were quite surprising, so I reached out to the VoxelBench team, who also confirmed in their tests the builds were of generally much smaller size. They mentioned adding these two lines to the template produced much better builds in their case:

LEVEL OF DETAIL: MAXIMUM
BOUNDING BOX: UNLIMITED

Though we're not changing the MineBench system-prompt to cater to any specific models, I do think it's worth noting that one might be able to achieve much better results with improved prompting.

Lastly, despite the API costs being double that of Opus 4.8, Fable ended up actually only being ~30% more than 4.8, which is definitely a result of Fable producing much less tokens overall (though there's no comparison of CoT versus output tokens specifically).

What's Changed

Models

Added Claude Fable 5 as a first-class Anthropic model with native Anthropic routing, OpenRouter fallback, max adaptive effort by default, and a 128,000-token output cap. #50
Omitted non-default sampling parameters for Claude Fable 5 and Claude 4.8 Opus routes where the provider rejects them. #50, 27aa09b
Added focused regression coverage for Fable catalog wiring, upload slug, direct Anthropic request shape, OpenRouter request shape, output cap, and max-effort traces. #50

Arena and Leaderboard

Accounted for queued arena votes in coverage and matchup state so pending work is reflected before drain completion. #45
Kept the queued-vote coverage cache warm, stabilized coverage refreshes, indexed vote job refreshes, replayed drained ratings, and guarded arena coverage eligibility. #45
Added raw rating to leaderboard details. e51a120

Viewer, Exports, and Artifacts

Tuned GIF export speed and quality and added a dedicated verification check for export configuration. #46

Maintenance

Organized regression tests into clearer config, integration, repo, UI, and unit groups. #47
Replaced script-level npx calls with package-manager-aware commands. #47
Added named CI quality gates and hardened repository checks. #47
Made export performance budgets opt-in and hardened regression test execution. #47
Refreshed current-line dependencies and kept local testing docs aligned with the new command structure. #47

Changelog

cfdff93 — (fix) account for queued arena votes
ee8a2f4 — (fix) tune gif export speed and quality
f9d07fa — Merge pull request #46 from Ammaar-Alam/codex/tune-gif-export-speed-quality
623a4f0 — (fix) keep queued vote coverage cache warm
271894d — (fix) stabilize arena coverage refresh
22f9335 — (fix) index arena vote job refreshes
8496fd0 — (fix) replay drained arena ratings
ea2e36d — (fix) guard arena coverage eligibility
fb929a6 — Merge pull request #45 from Ammaar-Alam/codex/fix-arena-vote-queue-matchmaking
27aa09b — (fix) omit OpenRouter Opus temperature
e51a120 — (fix) adding raw-rating to leaderboard details
065bc63 — (refactor) organize regression tests
0c59cdb — (chore) remove npx from pnpm scripts
422a54a — (docs) keep test commands in local guide
2e86f17 — (test) make export perf budget opt-in
2daa1b6 — (ci) add named quality gates
52be097 — (chore) harden repository checks
ac7e748 — (chore) refresh current-line dependencies
f0fb685 — (fix) harden regression test execution
7f80887 — Merge pull request #47 from Ammaar-Alam/chore/repo-maintenance-cleanup
44a43e8 — (feat) add claude fable 5 support
5afcde0 — (fix) omit claude fable sampling parameters

Full Changelog: 3.6.0...3.7.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

3.7.0: Claude Fable 5 + Arena Reliability

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Models

Notes

What's Changed

Models

Arena and Leaderboard

Viewer, Exports, and Artifacts

Maintenance

Changelog

Uh oh!