✨ Model Benchmarks — EQ-first LLM comparison page (#13)
Conversation
Interactive comparison of current-gen LLMs emphasizing emotional intelligence. Default-sorts by EQ-Bench to highlight what makes this different: we measure how models feel, not just what they know.

- Sortable table with search and capability filters
- Column header tooltips explaining each metric
- EQ-Bench detail tooltips (empathy, social IQ, insight traits)
- Cost tooltips with input/output pricing breakdown
- Mobile-responsive card layout
- XSS-safe rendering via esc() helper
- Null values always sort to bottom
- Loading, error, and empty states

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Opus 4, Sonnet 4, GPT-5.4, Grok 4.20, Gemini 3.1 Pro, Qwen 3.6 Plus, MiniMax M2.7, MiMo V2 Pro, GLM 5 Turbo, Step 3. Includes composite scores (reasoning/coding/agentic), EQ-Bench Elo + traits, Arena Elo, speed, and pricing from OpenRouter, Artificial Analysis, EQ-Bench, PinchBench, and Arena. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fetches from OpenRouter API, enriches with Artificial Analysis benchmarks, and computes weighted composite scores. Generates llms.txt for LLM consumption. Supports EQ-Bench, PinchBench, and Arena data integration. Also adds env.local to gitignore for API keys. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Plain-text model summary for LLM consumption, generated from model-data.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents the complete workflow for researching and adding new LLM models: OpenRouter, Artificial Analysis, PinchBench, Arena, and EQ-Bench sources. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add eq-cell class with subtle tint + bolder score values so EQ visually carries the page's emotional emphasis into the data surface
- Remove AOS fade-up from table wrapper (data appears via row animation)
- Reduce hero padding so table peeks above fold on desktop
- Increase filter button touch targets to 44px minimum
- Boost score bar track opacity and row hover contrast

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review — Model Benchmarks Page

This is a well-crafted feature. The EQ-first framing is coherent with the site's heart-centered mission, the XSS mitigations are solid, and the null-safe sorting behavior is thoughtfully handled. A few things worth addressing before merge:

🔴 Issues

1. Floating Alpine.js version

       <script defer src="https://cdn.jsdelivr.net/npm/alpinejs@3.x.x/dist/cdn.min.js"></script>

   The @3.x.x range means visitors silently pick up whatever Alpine ships next; pin an exact version.

2. No SRI hashes on CDN dependencies
3.

🟡 Accessibility

4. Missing aria-label on icon-only controls
5. Missing aria-sort on sortable column headers

🟢 Minor / Nice-to-haves

6.
7.
8. Cost sort direction is inverted from UX expectation
9.

What's working well

The SRI issue is the only one I'd call a blocker for a production site. The accessibility items would meaningfully improve the experience for keyboard and screen reader users.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 680722245b
```python
for bench_key in ("eq_bench",):
    existing_bench = m.get("benchmarks", {}).get(bench_key, {})
    if existing_bench and not new_model["benchmarks"].get(bench_key):
        new_model["benchmarks"][bench_key] = existing_bench
```
Preserve existing benchmark fields during model merges
Expand the preservation logic before `models[i] = new_model`: right now only `eq_bench` is carried forward, so refreshes that don't repopulate every source (for example, no AA key available, or manually maintained Arena entries) silently drop the existing `benchmarks.arena` entry, prior AA benchmarks/scores, and other manual fields like `notes`. This causes `model-data.json` to lose previously valid data on routine refresh runs.
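A minimal sketch of what the expanded carry-forward could look like, assuming the dict shapes implied by the diff above; the key names (`arena`, `pinchbench`, `notes`, `scores`) come from elsewhere in this thread and the helper name is illustrative, not the actual `fetch-model.py` code:

```python
# Illustrative sketch, not the real merge code: carry forward any benchmark
# or manual field that the current refresh did not repopulate.
PRESERVE_BENCH_KEYS = ("eq_bench", "arena", "pinchbench")  # assumed key names
PRESERVE_TOP_KEYS = ("notes", "scores")                    # assumed key names

def carry_forward(existing: dict, new_model: dict) -> dict:
    """Run before `models[i] = new_model` so a partial refresh keeps old data."""
    new_model.setdefault("benchmarks", {})
    for key in PRESERVE_BENCH_KEYS:
        old = existing.get("benchmarks", {}).get(key)
        if old and not new_model["benchmarks"].get(key):
            new_model["benchmarks"][key] = old
    for key in PRESERVE_TOP_KEYS:
        if key in existing and key not in new_model:
            new_model[key] = existing[key]
    return new_model
```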
```python
data = load_model_data()
fetched = 0
```
Implement file locking for concurrent refresh safety
The script advertises parallel-agent safety, but updates are still a plain read-modify-write cycle with no lock; two invocations can read the same starting JSON and whichever saves last will overwrite the other process's merged models. In environments where multiple fetches run in parallel, this creates nondeterministic data loss.
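One way to close that race, sketched under the assumption of a POSIX host; the lock-file path and helper name are hypothetical and not part of the current script:

```python
# Hypothetical sketch: hold an exclusive lock across the whole
# read-modify-write cycle so parallel refreshes serialize instead of
# overwriting each other (POSIX-only via fcntl).
import fcntl
import json
from contextlib import contextmanager
from pathlib import Path

DATA_PATH = Path("model-data.json")        # assumed data file
LOCK_PATH = Path("model-data.json.lock")   # hypothetical lock file

@contextmanager
def locked_model_data():
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until the other run finishes
        try:
            data = json.loads(DATA_PATH.read_text()) if DATA_PATH.exists() else {"models": []}
            yield data                     # caller mutates `data` here
            DATA_PATH.write_text(json.dumps(data, indent=2))
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

# Usage sketch:
# with locked_model_data() as data:
#     merge_model(data, new_model)
```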
- Fix eqbench-raw.json double encoding (was JSON string, now proper array)
- Fix EQ trait values for GLM 5 Turbo and Qwen3.6 Plus to match raw source
- Preserve arena Elo and notes fields on --refresh (not just eq_bench)
- Remove unused eloHtml() function (dead code with latent XSS gap)
- Fix sortModels() to be non-destructive ([...list].sort)
- Pin Alpine.js to 3.14.9 instead of floating @3.x.x
- Add rel="noreferrer" to all target="_blank" links
- Remove arcee-ai/trinity from CURATED_MODELS (not in dataset)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the thorough review. Fixed in 66daba4:

Fixed: eqbench-raw.json double encoding, EQ trait data mismatches, arena/notes preservation on --refresh, unused eloHtml() dead code, sortModels mutation, Alpine.js pinned to 3.14.9, added noreferrer to all external links, removed arcee from CURATED_MODELS.

WONTFIX: SRI hashes — low risk for a static GitHub Pages site with no user auth or sensitive data. Aria labels/sort — not a project priority per conventions.

Incorrect: Cost sort UX — 'Cost ↓' (descending = cheapest first) is intentional: it surfaces best value by default, matching the 'Lower is better' tooltip framing.
Code Review — PR #13: Model Benchmarks Page

Great feature! The EQ-first framing is well-aligned with the site's mission, and the implementation is solid overall. Here are observations across the files:

✅ What's working well
🐛 Bugs / Correctness

1. arenaHtml() skips Math.round()

       return `<span class="score-value">${arena.elo}</span>${note}`;

   Every other numeric display uses Math.round().

2. Missing file locking despite docstring promise
3.

♿ Accessibility

4. Hamburger button has no accessible label

       <button @click="mobileMenuOpen = !mobileMenuOpen" ...>
         <svg ...> <!-- hamburger icon, no text -->

   Screen readers will announce this as an unlabeled button. Add an aria-label.

5.

🎨 Minor inconsistencies

6. Mobile cards lose EQ trait detail
7. Duplicate .model-card animation rule in the CSS
8.

⚡ Performance note

9.

🔒 Security (non-blocking)

10. PostHog key committed

Summary

Overall this is high-quality work — the EQ-first framing and transparent methodology section are exactly right for this site. The issues above are mostly polish; the core logic is sound. 🌱
- GPT-5.4: coding 57 (AA), agentic from PinchBench (90.5/81.7)
- Grok 4.20: coding 42 (AA), agentic from PinchBench via grok-4.1-fast (82.4/71.8)
- Gemini 3.1 Pro: coding 56 (AA), agentic from PinchBench (86.7/77.0)
- Recompute all agentic scores consistently: PinchBench Best (4x) + Avg (2x)
- Drop IFBench from agentic formula (not persisted, can't reproduce)
- Fix EQ trait data for GLM 5 and Qwen3.6 from raw source
- Fix eqbench-raw.json double encoding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
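For reference, a quick back-of-envelope check of the weighting named in this commit, assuming a plain weighted average of PinchBench Best (4x) and Avg (2x); the helper is illustrative, not code from the repo:

```python
# Weighted average of PinchBench Best (4x) and PinchBench Avg (2x), as above.
def agentic_score(pb_best: float, pb_avg: float) -> float:
    return round((4 * pb_best + 2 * pb_avg) / 6, 1)

# PinchBench numbers quoted in the commit message:
print(agentic_score(90.5, 81.7))  # GPT-5.4        -> 87.6
print(agentic_score(82.4, 71.8))  # Grok 4.20      -> 78.9
print(agentic_score(86.7, 77.0))  # Gemini 3.1 Pro -> 83.5
```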
Haiku 4.5: reasoning 31, agentic 85.7 (PinchBench 89.5/78.1), Arena 1407, 93 t/s, $2.0 blended. Coding/EQ gaps remain.

GPT-5.4 Mini: reasoning 48, coding 51, agentic 56 (AA indices), Arena 1455, 186 t/s, $1.69 blended. EQ gap remains.

Also notes MiMo EQ-Bench v2 score (80.08, not comparable to v3 Elo).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EQ-Bench v2 scores (0-100, not comparable to v3 Elo):

- GPT-5.4 Mini: 84.10
- MiMo-V2-Pro: 80.08
- Claude Haiku 4.5: 73.74

Table fixes:

- Remove sticky header (unnecessary for 12-row table)
- Right-align all numeric columns via CSS (not Tailwind)
- Remove score bars for cleaner number alignment
- Sentence-case column headers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review — Model Benchmarks page

Overall this is genuinely solid work. The XSS handling is thorough, null-safe sorting is clean, and the data pipeline is well-structured. A few things worth addressing:

Bugs / Issues

1. Methodology docs don't match the code

   The Agentic Score card in the Methodology section says the score is a weighted average of PinchBench Best (4x), IFBench (3x), and PinchBench Avg (2x), but the stored agentic scores are computed without IFBench.

2. No file locking despite the claim

   The module docstring claims parallel-agent safety, but no locking is actually implemented. If two agents run concurrently, the last save silently overwrites the other's merged models.

Accessibility

3. Hamburger button has no accessible name

   No aria-label or visually hidden text is provided.

4. Sortable columns don't expose sort state to assistive tech

   Sort direction is shown visually via CSS arrows, but no aria-sort attribute is set. Something like:

       th.setAttribute('aria-sort', th.dataset.sort === sortKey
         ? (sortDir === 'asc' ? 'ascending' : 'descending')
         : 'none');

5. Column header tooltips are mouse-only

Minor

6. Unused f-string prefixes

   Several strings are written as f-strings but contain no placeholders.

7. EQ/column tooltips can overflow viewport

What's well done

The EQ-first framing is a genuinely differentiated angle for an LLM comparison page.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit c38a19b.
| "score_methodology": { | ||
| "reasoning": "Weighted average of: AA Intelligence Index (3x), GPQA (2.5x), MMLU-Pro (2x), HLE (1.5x), AIME 2025 (1x). Scale 0-100.", | ||
| "coding": "Weighted average of: AA Coding Index (3x), LiveCodeBench (2x), TerminalBench Hard (2x), SciCode (1x). Scale 0-100.", | ||
| "agentic": "Weighted average of: PinchBench Best (4x), IFBench (3x), PinchBench Avg (2x). Scale 0-100.", |
Agentic scores contradict stated methodology across files
Medium Severity
The score_methodology.agentic field in model-data.json states "PinchBench Best (4x), IFBench (3x), PinchBench Avg (2x)" but the actual stored scores.agentic values are computed without IFBench (just PB Best 4x + PB Avg 2x). Meanwhile, llms.txt shows different agentic scores that do include IFBench. For example, Claude Sonnet 4.6 shows agentic 85.7 on the web page but 70.9 in llms.txt. Users and AI consumers of these two files see conflicting data for every model.
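A hedged sketch of the kind of consistency check that would catch this drift, recomputing agentic from the formula the code actually uses; the `benchmarks.pinchbench.best/avg` and `scores.agentic` field names are assumptions about `model-data.json`, not confirmed structure:

```python
# Assumed structure: models[*].benchmarks.pinchbench.{best,avg} and
# models[*].scores.agentic. Flags any model whose stored score disagrees
# with a PinchBench Best (4x) + Avg (2x) recomputation.
import json

with open("model-data.json") as f:
    data = json.load(f)

for m in data.get("models", []):
    pb = m.get("benchmarks", {}).get("pinchbench") or {}
    stored = m.get("scores", {}).get("agentic")
    if "best" not in pb or "avg" not in pb or stored is None:
        continue
    recomputed = round((4 * pb["best"] + 2 * pb["avg"]) / 6, 1)
    if abs(recomputed - stored) > 0.05:
        print(f"{m.get('name', '?')}: stored {stored} vs recomputed {recomputed}")
```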
- Fix OpenRouter model links to use /models/ prefix
- arenaHtml: add Math.round() for consistency with other numeric cells
- scoreHtml: default max=100, drop redundant arg from all call sites
- Mobile cards: add EQ color tier via eqTier() (EQ is the headline feature)
- CSS: merge duplicate .model-card animation into primary rule
- fetch-model.py: remove false file-locking claim from docstring
- merge_model: preserve pinchbench and scores on refresh
- CURATED_MODELS: add comment explaining why some models aren't listed
- index.html: use idiomatic :aria-expanded="mobileMenuOpen" (no .toString())

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Review: Model Benchmarks Page

This is a well-constructed PR — solid vanilla JS architecture, thoughtful data pipeline design, and good attention to accessibility and security. A few things worth addressing before merge:

Methodology Documentation Inconsistency

The Agentic Score methodology card still lists IFBench (3x) in the weighting, but the stored scores are computed from PinchBench Best (4x) and Avg (2x) only.

Data Discrepancy
Most items from this review were addressed in commits 66daba4, a04d33e, and 24cdfc9 before merge.

Fixed:

Outstanding items tracked in #17:

Great catches on the data preservation bugs and 404 links! 🎯


Summary
- /model-benchmarks/ page comparing 10 current-gen LLMs with emotional intelligence (EQ-Bench) as the default sort and visual focal point
- Data pipeline (fetch-model.py) pulling from OpenRouter, Artificial Analysis, EQ-Bench, PinchBench, and Arena

Models included
Claude Opus 4, Claude Sonnet 4, GPT-5.4, Grok 4.20, Gemini 3.1 Pro, Qwen 3.6 Plus, MiniMax M2.7, MiMo V2 Pro, GLM 5 Turbo, Step 3
Known gaps (deferred)
Test plan
- /model-benchmarks/

🤖 Generated with Claude Code