
✨ Model Benchmarks — EQ-first LLM comparison page #13

Merged
TechNickAI merged 11 commits into main from feature/model-benchmarks
Apr 7, 2026

Conversation

@TechNickAI
Owner

Summary

  • New /model-benchmarks/ page comparing 10 current-gen LLMs with emotional intelligence (EQ-Bench) as the default sort and visual focal point
  • Interactive table with sorting, search, capability filters, column tooltips, and EQ trait detail tooltips
  • Mobile-responsive card layout
  • Data pipeline (fetch-model.py) pulling from OpenRouter, Artificial Analysis, EQ-Bench, PinchBench, and Arena
  • Design review polish: EQ column warmth, 44px touch targets, reduced hero padding, boosted contrast
  • XSS-safe rendering, null-safe sorting, loading/error/empty states

Models included

Claude Opus 4, Claude Sonnet 4, GPT-5.4, Grok 4.20, Gemini 3.1 Pro, Qwen 3.6 Plus, MiniMax M2.7, MiMo V2 Pro, GLM 5 Turbo, Step 3

Known gaps (deferred)

  • AA coding indices for GPT-5.4/Grok/Gemini (rate limited, will fill on next run)
  • Agentic scores for Grok/Gemini (no PinchBench data yet)
  • EQ-Bench scores for MiniMax/MiMo/Step (not yet tested by EQ-Bench)

Test plan

  • Verify page loads at /model-benchmarks/
  • Sort by each column — nulls should always sort to bottom
  • Search filters by name and provider
  • Capability filter buttons toggle correctly
  • EQ column has subtle tint and bolder values
  • Column header tooltips appear on hover
  • EQ Elo hover shows trait breakdown
  • Cost hover shows input/output pricing
  • Model names link to OpenRouter detail pages
  • Mobile card layout renders cleanly
  • Filter pills are tappable on mobile (44px targets)

🤖 Generated with Claude Code

Nick Sullivan and others added 6 commits April 6, 2026 22:03
Interactive comparison of current-gen LLMs emphasizing emotional intelligence.
Default-sorts by EQ-Bench to highlight what makes this different: we measure
how models feel, not just what they know.

- Sortable table with search and capability filters
- Column header tooltips explaining each metric
- EQ-Bench detail tooltips (empathy, social IQ, insight traits)
- Cost tooltips with input/output pricing breakdown
- Mobile-responsive card layout
- XSS-safe rendering via esc() helper
- Null values always sort to bottom
- Loading, error, and empty states

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Claude Opus 4, Sonnet 4, GPT-5.4, Grok 4.20, Gemini 3.1 Pro, Qwen 3.6 Plus,
MiniMax M2.7, MiMo V2 Pro, GLM 5 Turbo, Step 3. Includes composite scores
(reasoning/coding/agentic), EQ-Bench Elo + traits, Arena Elo, speed, and
pricing from OpenRouter, Artificial Analysis, EQ-Bench, PinchBench, and Arena.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fetches from OpenRouter API, enriches with Artificial Analysis benchmarks,
and computes weighted composite scores. Generates llms.txt for LLM consumption.
Supports EQ-Bench, PinchBench, and Arena data integration.

Also adds env.local to gitignore for API keys.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Plain-text model summary for LLM consumption, generated from model-data.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents the complete workflow for researching and adding new LLM models:
OpenRouter, Artificial Analysis, PinchBench, Arena, and EQ-Bench sources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add eq-cell class with subtle tint + bolder score values so EQ visually
  carries the page's emotional emphasis into the data surface
- Remove AOS fade-up from table wrapper (data appears via row animation)
- Reduce hero padding so table peeks above fold on desktop
- Increase filter button touch targets to 44px minimum
- Boost score bar track opacity and row hover contrast

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude

claude Bot commented Apr 7, 2026

PR Review — Model Benchmarks Page

This is a well-crafted feature. The EQ-first framing is coherent with the site's heart-centered mission, the XSS mitigations are solid, and the null-safe sorting behavior is thoughtfully handled. A few things worth addressing before merge:


🔴 Issues

1. Floating Alpine.js version (@3.x.x)

<script defer src="https://cdn.jsdelivr.net/npm/alpinejs@3.x.x/dist/cdn.min.js"></script>

The 3.x.x semver range resolves to whatever is current at request time — a new minor release could silently change behavior or break the mobile menu. Pin to a specific version like @3.14.1.

2. No SRI hashes on CDN dependencies
Tailwind (cdn.tailwindcss.com), Alpine.js (jsdelivr), and AOS (unpkg) are all loaded without Subresource Integrity hashes. If any of those CDNs are compromised, arbitrary JS runs on your page. For a static GitHub Pages site this is the primary attack surface. Either add integrity="sha384-..." attributes or self-host these assets.

3. eqbench-raw.json is double-encoded
The file contains a JSON string (the entire array is wrapped in quotes with escaped characters) rather than a JSON array. This makes it unusable as-is from JSON.parse() without an extra unwrap step. Either store it as a proper JSON array or document that callers must call JSON.parse(JSON.parse(raw)).
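For concreteness, a minimal Python sketch of the failure mode and the extra unwrap callers currently need (the payload below is invented, not the real file contents):

```python
import json

# Hypothetical double-encoded payload: the array was serialized twice, so the
# file's top-level JSON value is a string rather than a list.
double_encoded = json.dumps(json.dumps([{"model": "glm-5-turbo", "elo": 1210}]))

decoded_once = json.loads(double_encoded)  # still a str, not a list
assert isinstance(decoded_once, str)

models = json.loads(decoded_once)          # the extra unwrap step
assert models[0]["elo"] == 1210
```

Storing the array directly would make the second decode unnecessary.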


🟡 Accessibility

4. Missing aria-label on mobile menu button
The hamburger button has :aria-expanded but no aria-label, so screen readers announce it as an unlabeled button. Add aria-label="Toggle navigation menu".

5. Missing aria-sort on sortable column headers
Sort state is communicated via CSS classes (sort-asc/sort-desc) but not via aria-sort. Screen reader users have no way to know which column is sorted or in which direction. The updateSortIndicators() function should also set th.setAttribute("aria-sort", ...).


🟢 Minor / Nice-to-haves

6. sortModels mutates the input
list.sort() sorts in place. Since filterModels() returns a new array this works correctly today, but it's a subtle footgun if the call order ever changes. [...list].sort(...) makes the intent explicit.

7. rel="noopener" without rel="noreferrer"
External links (openrouter.ai, claude.ai) use rel="noopener" but not noreferrer. Adding noreferrer prevents the Referer header from leaking to third-party destinations and is the standard pattern for target="_blank" links.

8. Cost sort direction is inverted from UX expectation
When sorting by Cost descending (the default behavior for most columns), the result shows cheapest first because of the cost-flip logic. The column tooltip says "Lower is better" which implies clicking should surface the most expensive models first for comparability. Worth verifying this UX is intentional.

9. CURATED_MODELS in fetch-model.py includes arcee-ai/trinity-large-preview:free
This model isn't in model-data.json or the page. Either add it or remove it from the curated list to avoid confusion on the next data refresh.


What's working well

  • The esc() helper is clean and covers all the necessary HTML injection vectors
  • Null sentinel handling in getSortValue/sortModels (nulls always to bottom) is exactly right
  • Loading/error/empty states are all handled — no blank screen on fetch failure
  • Schema.org Dataset structured data is a nice SEO touch given the content type
  • 44px touch targets on mobile filter pills are correct
  • The data pipeline (fetch-model.py) is nicely designed: preserves manually-entered data on --refresh, handles the OpenRouter → AA slug mismatch cleanly, and the merge_model() pattern is sensible

The SRI issue is the only one I'd call a blocker for a production site. The accessibility items would meaningfully improve the experience for keyboard and screen reader users.


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 680722245b


Comment thread: model-benchmarks/scripts/fetch-model.py (Outdated)
Comment on lines +401 to +404
for bench_key in ("eq_bench",):
    existing_bench = m.get("benchmarks", {}).get(bench_key, {})
    if existing_bench and not new_model["benchmarks"].get(bench_key):
        new_model["benchmarks"][bench_key] = existing_bench


P1: Preserve existing benchmark fields during model merges

Expand the preservation logic before models[i] = new_model: right now only eq_bench is carried forward, so refreshes that don't repopulate every source (for example, no AA key available, or manually maintained Arena entries) silently drop existing benchmarks.arena, prior AA benchmark/scores, and other manual fields like notes. This causes model-data.json to lose previously valid data on routine refresh runs.
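A sketch of what the generalized carry-forward could look like; BENCH_KEYS, preserve_benchmarks, and the sample dicts are illustrative names for this comment, not the actual fetch-model.py code:

```python
# Illustrative: carry forward ALL benchmark sources plus manual fields,
# not just eq_bench. Names here are hypothetical.
BENCH_KEYS = ("eq_bench", "arena", "pinchbench")

def preserve_benchmarks(old_model, new_model):
    for bench_key in BENCH_KEYS:
        existing = old_model.get("benchmarks", {}).get(bench_key, {})
        if existing and not new_model.get("benchmarks", {}).get(bench_key):
            new_model.setdefault("benchmarks", {})[bench_key] = existing
    # Manually maintained fields deserve the same guard.
    if old_model.get("notes") and not new_model.get("notes"):
        new_model["notes"] = old_model["notes"]
    return new_model

old = {"benchmarks": {"arena": {"elo": 1400}}, "notes": "manual entry"}
new = {"benchmarks": {"eq_bench": {"elo": 1250}}}
merged = preserve_benchmarks(old, new)
assert merged["benchmarks"]["arena"] == {"elo": 1400}
assert merged["notes"] == "manual entry"
```

The key property: a refresh that fails to repopulate a source leaves the previous value in place instead of dropping it.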


Comment on lines +564 to +565
data = load_model_data()
fetched = 0


P2: Implement file locking for concurrent refresh safety

The script advertises parallel-agent safety, but updates are still a plain read-modify-write cycle with no lock; two invocations can read the same starting JSON and whichever saves last will overwrite the other process's merged models. In environments where multiple fetches run in parallel, this creates nondeterministic data loss.
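A minimal sketch of a lock-guarded read-modify-write with fcntl.flock; the path and update logic are illustrative, and this assumes a POSIX filesystem:

```python
import fcntl
import json
import os
import tempfile

# Illustrative only: serialize concurrent read-modify-write cycles on the
# JSON file with an advisory exclusive lock.
def locked_update(path, update_fn):
    fd = os.open(path, os.O_RDWR | os.O_CREAT)  # open without truncating
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            raw = f.read()
            data = json.loads(raw) if raw else {}
            data = update_fn(data)
            f.seek(0)
            f.truncate()
            json.dump(data, f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

path = os.path.join(tempfile.mkdtemp(), "model-data.json")
locked_update(path, lambda d: {**d, "models": ["glm-5-turbo"]})
locked_update(path, lambda d: {**d, "refreshed": True})
with open(path) as f:
    result = json.load(f)
assert result == {"models": ["glm-5-turbo"], "refreshed": True}
```

Note that flock is advisory: every writer has to go through the same helper for the lock to mean anything.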


Comment thread: model-benchmarks/scripts/fetch-model.py
Comment thread: model-benchmarks/js/app.js
Comment thread: model-benchmarks/data/model-data.json
- Fix eqbench-raw.json double encoding (was JSON string, now proper array)
- Fix EQ trait values for GLM 5 Turbo and Qwen3.6 Plus to match raw source
- Preserve arena Elo and notes fields on --refresh (not just eq_bench)
- Remove unused eloHtml() function (dead code with latent XSS gap)
- Fix sortModels() to be non-destructive ([...list].sort)
- Pin Alpine.js to 3.14.9 instead of floating @3.x.x
- Add rel="noreferrer" to all target="_blank" links
- Remove arcee-ai/trinity from CURATED_MODELS (not in dataset)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TechNickAI
Owner Author

Thanks for the thorough review. Fixed in 66daba4:

Fixed: eqbench-raw.json double encoding, EQ trait data mismatches, arena/notes preservation on --refresh, unused eloHtml() dead code, sortModels mutation, Alpine.js pinned to 3.14.9, added noreferrer to all external links, removed arcee from CURATED_MODELS.

WONTFIX: SRI hashes — low risk for a static GitHub Pages site with no user auth or sensitive data. Aria labels/sort — not a project priority per conventions.

Incorrect: Cost sort UX — 'Cost ↓' (descending = cheapest first) is intentional: surfaces best value by default, matching the 'Lower is better' tooltip framing.

@claude

claude Bot commented Apr 7, 2026

Code Review — PR #13: Model Benchmarks Page

Great feature! The EQ-first framing is well-aligned with the site's mission, and the implementation is solid overall. Here are observations across the files:


✅ What's working well

  • XSS safety: The esc() function in app.js is comprehensive and consistently applied across all user-controlled rendering paths
  • Null-safe sorting: Sentinel values (-1 / 9999) push missing data to the bottom regardless of sort direction — cleanly handled
  • Loading/error/empty states: All three covered in both table and mobile card views
  • SEO hygiene: Structured data (Dataset + BreadcrumbList), canonical URL, OG/Twitter tags, rel="noopener noreferrer" on external links
  • 44px touch targets: Filter pills and mobile nav hit the a11y minimum
  • Python script: Clean separation of concerns, good docstrings, merge_model() correctly preserves manually-entered benchmark data on refresh
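The null-sentinel pattern from getSortValue translates to Python roughly as follows; the model dicts are made up, and only the descending case is shown here:

```python
# Hypothetical Python rendering of the null-sentinel idea: a missing
# higher-is-better score maps to -1 so it sinks in a descending sort.
# (The ascending direction needs the opposite sentinel, which is the
# part the real code also handles.)
models = [
    {"name": "A", "eq": 1300},
    {"name": "B", "eq": None},   # no EQ-Bench score yet
    {"name": "C", "eq": 1250},
]

def sort_value(model, key):
    value = model.get(key)
    return -1 if value is None else value

ranked = sorted(models, key=lambda m: sort_value(m, "eq"), reverse=True)
assert [m["name"] for m in ranked] == ["A", "C", "B"]
```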

🐛 Bugs / Correctness

1. arenaHtml skips Math.round() (app.js)

return `<span class="score-value">${arena.elo}</span>${note}`;

Every other numeric display uses Math.round(). Arena Elo values in model-data.json are stored as integers anyway, but this is inconsistent with eqHtml which does round.

2. Missing file locking despite docstring promise (fetch-model.py)
The module docstring says "each invocation locks the JSON file briefly during the merge step" but merge_model() / save_model_data() have no actual locking (fcntl, filelock, etc.). Parallel runs will produce a race condition. Either add locking or update the docstring.

3. CURATED_MODELS only covers 7 of the 10 showcased models
GPT-5.4, Grok 4.20, and Gemini 3.1 Pro are in model-data.json but not in CURATED_MODELS. Running --curated on a fresh clone will silently omit them. Consider adding them (with --no-aa notes if AA data isn't available) or add a comment explaining the gap.


♿ Accessibility

4. Hamburger button has no accessible label (index.html)

<button @click="mobileMenuOpen = !mobileMenuOpen" ...>
  <svg ...>  <!-- hamburger icon, no text -->

Screen readers will announce this as an unlabeled button. Add aria-label="Toggle navigation".

5. aria-expanded binding
:aria-expanded="mobileMenuOpen.toString()" — Alpine handles boolean-to-attribute coercion natively. :aria-expanded="mobileMenuOpen" is idiomatic and equivalent.


🎨 Minor inconsistencies

6. Mobile cards lose EQ trait detail
Desktop table has the hover tooltip showing Empathy/Social IQ/Insight/Humanlike/Warm breakdown via eqHtml(). Mobile cards render raw eq_bench.elo with no color tier or traits. Low priority, but the EQ column is the headline feature — even a small visual cue (color class from scoreTier) on mobile would reinforce it.

7. Duplicate .model-card rule (styles.css)
.model-card is defined twice — once with layout/border/padding (line ~3888) and again at the bottom with just the animation (line ~4063). Merge the animation declaration into the first rule.

8. scoreHtml(value, max): max is always 100
All six call sites pass 100. Default the parameter (max = 100) and drop the argument from all call sites for clarity.


⚡ Performance note

9. cdn.tailwindcss.com in production
The Play CDN is ~3 MB and re-parses classes at runtime. Since the site has no build step this is an accepted tradeoff, but it's worth flagging for when traffic grows. The standalone CLI can produce a purged ~10 KB file with zero runtime cost, no Node required.


🔒 Security (non-blocking)

10. PostHog key committed
phc_hyD2NBSE7eJXWw1lsdN4Zj5ojP3ArpfYJ5Ho0iWMZmg is in the HTML. PostHog client keys are designed to be public (they're scoped to your project, not an auth credential), so this is fine — just noting it for awareness.


Summary

  • 🔴 File locking docstring vs implementation mismatch (#2)
  • 🟡 Missing aria-label on hamburger (#4); arenaHtml rounding (#1); CURATED_MODELS gap (#3)
  • 🟢 Duplicate CSS rule (#7); aria-expanded (#5); scoreHtml default (#8)
  • 💬 Mobile EQ visual (#6); Tailwind CDN note (#9)

Overall this is high-quality work — the EQ-first framing and transparent methodology section are exactly right for this site. The issues above are mostly polish; the core logic is sound. 🌱

Comment thread: model-benchmarks/js/app.js (Outdated)
Comment thread: model-benchmarks/scripts/fetch-model.py
Nick Sullivan and others added 3 commits April 6, 2026 22:49
- GPT-5.4: coding 57 (AA), agentic from PinchBench (90.5/81.7)
- Grok 4.20: coding 42 (AA), agentic from PinchBench via grok-4.1-fast (82.4/71.8)
- Gemini 3.1 Pro: coding 56 (AA), agentic from PinchBench (86.7/77.0)
- Recompute all agentic scores consistently: PinchBench Best (4x) + Avg (2x)
- Drop IFBench from agentic formula (not persisted, can't reproduce)
- Fix EQ trait data for GLM 5 and Qwen3.6 from raw source
- Fix eqbench-raw.json double encoding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Haiku 4.5: reasoning 31, agentic 85.7 (PinchBench 89.5/78.1),
Arena 1407, 93 t/s, $2.0 blended. Coding/EQ gaps remain.

GPT-5.4 Mini: reasoning 48, coding 51, agentic 56 (AA indices),
Arena 1455, 186 t/s, $1.69 blended. EQ gap remains.

Also notes MiMo EQ-Bench v2 score (80.08, not comparable to v3 Elo).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EQ-Bench v2 scores (0-100, not comparable to v3 Elo):
- GPT-5.4 Mini: 84.10
- MiMo-V2-Pro: 80.08
- Claude Haiku 4.5: 73.74

Table fixes:
- Remove sticky header (unnecessary for 12-row table)
- Right-align all numeric columns via CSS (not Tailwind)
- Remove score bars for cleaner number alignment
- Sentence-case column headers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude

claude Bot commented Apr 7, 2026

Code Review — Model Benchmarks page

Overall this is genuinely solid work. The XSS handling is thorough, null-safe sorting is clean, and the data pipeline is well-structured. A few things worth addressing:


Bugs / Issues

1. Methodology docs don't match the code (index.html, fetch-model.py)

The Agentic Score card in the Methodology section says:

PinchBench Best (4x) and PinchBench Avg (2x)

But compute_agentic_score() in fetch-model.py also includes IFBench (3x) — weighted higher than PinchBench Avg. That's a meaningful omission from the public methodology explanation.

2. No file locking despite the claim (fetch-model.py)

The module docstring says:

each invocation locks the JSON file briefly during the merge step to avoid conflicts

No locking is actually implemented. If two agents run concurrently with --refresh, the last writer wins and the other's changes are silently dropped. Either add fcntl.flock / a lockfile, or remove the claim from the docstring.


Accessibility

3. Hamburger button has no accessible name (index.html)

No aria-label on the mobile menu <button>. Screen readers announce it as an unlabeled button. Add aria-label="Toggle navigation".

4. Sortable columns don't expose sort state to assistive tech (index.html, app.js)

Sort direction is shown visually via CSS arrows, but no aria-sort attribute is set on <th>. Should be toggled in updateSortIndicators():

th.setAttribute('aria-sort', th.dataset.sort === sortKey
  ? (sortDir === 'asc' ? 'ascending' : 'descending')
  : 'none');

5. Column header tooltips are mouse-only (styles.css)

.col-tooltip is CSS-hover-only — keyboard users can't reach it. Consider also triggering on :focus-within, or use title attributes on the <th> elements as a baseline fallback.


Minor

6. Unused f-string prefixes (fetch-model.py)

Several strings in generate_llms_txt() use f"..." with no {} interpolation variables. They work fine but the f prefix is misleading — drop it.

7. EQ/column tooltips can overflow viewport (styles.css)

.col-tooltip uses left: 50%; transform: translateX(-50%) with no viewport clamping. On narrower screens, tooltips on edge columns (Reasoning, Context) will clip. Consider clamp() or a small JS nudge.


What's well done

  • esc() covers all five HTML metacharacters and is applied consistently before any data-derived string hits innerHTML
  • Null sentinels in getSortValue are symmetric and clearly intentional (−1 for higher-is-better columns, 9999 for cost)
  • merge_model() correctly preserves manually-entered EQ/Arena data on refresh — the guard if existing_bench and not new_model['benchmarks'].get(bench_key) is exactly right
  • 44px touch targets on filter buttons
  • rel="noopener noreferrer" on all external links
  • env.local gitignored before it was ever committed

The EQ-first framing is a genuinely differentiated angle for an LLM comparison page.


@cursor (Bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).



Reviewed by Cursor Bugbot for commit c38a19b.

"score_methodology": {
"reasoning": "Weighted average of: AA Intelligence Index (3x), GPQA (2.5x), MMLU-Pro (2x), HLE (1.5x), AIME 2025 (1x). Scale 0-100.",
"coding": "Weighted average of: AA Coding Index (3x), LiveCodeBench (2x), TerminalBench Hard (2x), SciCode (1x). Scale 0-100.",
"agentic": "Weighted average of: PinchBench Best (4x), IFBench (3x), PinchBench Avg (2x). Scale 0-100.",


Agentic scores contradict stated methodology across files

Medium Severity

The score_methodology.agentic field in model-data.json states "PinchBench Best (4x), IFBench (3x), PinchBench Avg (2x)" but the actual stored scores.agentic values are computed without IFBench (just PB Best 4x + PB Avg 2x). Meanwhile, llms.txt shows different agentic scores that do include IFBench. For example, Claude Sonnet 4.6 shows agentic 85.7 on the web page but 70.9 in llms.txt. Users and AI consumers of these two files see conflicting data for every model.
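The size of the discrepancy is easy to reproduce with the stated weights; the input scores below are invented for illustration, not real benchmark data:

```python
# Illustrative arithmetic only: same PinchBench inputs, with and without
# the IFBench (3x) term the methodology string mentions.
def weighted_avg(pairs):
    total_weight = sum(w for _, w in pairs)
    return sum(value * w for value, w in pairs) / total_weight

pb_best, pb_avg, ifbench = 90.0, 78.0, 50.0  # made-up scores

without_ifbench = weighted_avg([(pb_best, 4), (pb_avg, 2)])
with_ifbench = weighted_avg([(pb_best, 4), (ifbench, 3), (pb_avg, 2)])

assert round(without_ifbench, 1) == 86.0
assert round(with_ifbench, 1) == 74.0  # a low IFBench score drags the composite down
```

A gap of this shape is consistent with one file including the IFBench term and the other not, which is why the two outputs disagree for every model.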



- Fix OpenRouter model links to use /models/ prefix
- arenaHtml: add Math.round() for consistency with other numeric cells
- scoreHtml: default max=100, drop redundant arg from all call sites
- Mobile cards: add EQ color tier via eqTier() (EQ is the headline feature)
- CSS: merge duplicate .model-card animation into primary rule
- fetch-model.py: remove false file-locking claim from docstring
- merge_model: preserve pinchbench and scores on refresh
- CURATED_MODELS: add comment explaining why some models aren't listed
- index.html: use idiomatic :aria-expanded="mobileMenuOpen" (no .toString())

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@claude

claude Bot commented Apr 7, 2026

PR Review: Model Benchmarks Page

This is a well-constructed PR — solid vanilla JS architecture, thoughtful data pipeline design, and good attention to accessibility and security. A few things worth addressing before merge:


Methodology Documentation Inconsistency

The Agentic Score methodology card in index.html reads:

Weighted average of PinchBench Best (4x) and PinchBench Avg (2x).

But fetch-model.py's compute_agentic_score() and SKILL.md both include IFBench (3x weight) as a third component. The HTML should reflect the actual scoring formula to avoid misleading users.


Data Discrepancy: model-data.json vs llms.txt

Claude Opus 4.6 shows "agentic": 89.9 in model-data.json but Agentic: 74.8 in llms.txt. These values don't match, suggesting llms.txt may not have been regenerated after the last data update. Worth a python fetch-model.py --refresh to sync them.


Cost Column Sort UX

In sortModels(), sorting by cost in desc direction actually sorts ascending (cheaper first) — the flip is intentional to show best value at top. This logic is correct but could surprise users: clicking "Cost" once shows cheapest-first, with a down-arrow indicator. Consider defaulting cost to asc (expensive-first on first click) so the sort arrow direction matches the data direction visually, or add a note in the column tooltip.


Minor: eqHtml() Note Handling

If a model has eq_bench.note but no eq_bench.elo (e.g., models with only v2_score), the note is silently dropped because the function returns early:

if (!eq || !eq.elo) return `<span class="score-na">—</span>`;

The v2_score models (MiMo-V2-Pro, Claude Haiku 4.5, GPT-5.4 Mini) correctly render the "—" placeholder, but their notes about the v2/v3 scale difference are invisible. Consider surfacing the asterisk note on the cell for these models.


Security: All Clear ✅

  • esc() function correctly handles &, <, >, ", '
  • All user-generated content goes through esc() before innerHTML
  • External links use rel="noopener noreferrer"
  • Data loaded from same-origin relative path — no CORS risk
  • env.local properly gitignored

Accessibility

  • Mobile filter buttons have 44px touch targets ✅
  • aria-expanded on mobile menu toggle ✅
  • Sortable column headers lack aria-sort attribute — screen readers won't know the current sort state. Consider adding aria-sort="ascending" / aria-sort="descending" / aria-sort="none" in updateSortIndicators().
  • The model count <div id="model-count"> updates dynamically but has no aria-live region — assistive tech won't announce filter result changes.

Code Quality Highlights (Praise)

  • IIFE + "use strict" — clean module pattern for a no-build site
  • getSortValue() sentinel values (-1 / 9999) ensuring nulls sort to bottom — well thought out
  • merge_model() preserving manually-entered benchmark data on refresh — defensive and correct
  • fetch-model.py docstring with usage examples — excellent DX for future maintainers
  • Grok 4.20's low EQ score (856) is clearly noted in both the data and llms.txt — transparent about data anomalies

Summary

The methodology doc mismatch and llms.txt staleness are the two issues most worth fixing before merge. The accessibility gaps are lower priority but worth a follow-up issue. Everything else is polish.

🤖 Reviewed with Claude Code

@TechNickAI TechNickAI merged commit dbc1ec8 into main Apr 7, 2026
2 checks passed
@TechNickAI TechNickAI deleted the feature/model-benchmarks branch April 7, 2026 04:13
@TechNickAI
Owner Author

Most items from this review were addressed in commits 66daba4, a04d33e, and 24cdfc9 before merge:

Fixed:

  • ✓ eqbench-raw.json double encoding
  • ✓ EQ trait data for GLM 5/Qwen3.6
  • ✓ Preserve arena/notes/pinchbench/scores on refresh
  • ✓ Removed unused eloHtml()
  • ✓ Pinned Alpine.js to 3.14.9
  • ✓ Added rel="noreferrer"
  • ✓ OpenRouter links now use /models/ path
  • ✓ Methodology made consistent (dropped IFBench)
  • ✓ File locking docstring corrected
  • ✓ Mobile EQ color tiers added

Outstanding items tracked in #17:

  • SRI hashes on CDN dependencies (security)
  • aria-label on mobile menu button (a11y)
  • aria-sort on sortable columns (a11y)
  • Column tooltip keyboard access (a11y)
  • Tailwind Play CDN performance note

Great catches on the data preservation bugs and 404 links! 🎯
