The open public benchmark for AI agent skills.
Versuz indexes every public Claude skill and CLAUDE.md on GitHub, runs them through 5 held-out tasks, lets three frontier LLMs judge the outputs, and publishes a ranking based on actual performance, not stars.
Free · Open · Updated daily
🌐 versuz.dev · 📦 npx versuz · 🔌 @versuz/mcp · 📚 Methodology
In May 2026, the Claude Code ecosystem already lists 4,200+ skills on claudemarketplaces.com and 1.2M+ on skillsmp.com. Anthropic has shipped its own enterprise marketplace. None of them answer the only question a developer actually asks:
"Of the 47 skills claiming to do task X, which one actually works?"
Existing directories rank by stars and installs — popularity, not quality. Versuz benchmarks the outputs. Three frontier judges grade every skill against the same task suite, and the winners rank #1 in their category.
It's LMArena, but for agent skills.
| | Count (May 2026) | Auto-discovered from |
|---|---|---|
| SKILL.md files | ~2,590 | GitHub Code Search · Sourcegraph · 14 awesome-* lists · 26 GitHub Topics |
| CLAUDE.md files | ~3,474 | Same pipeline |
| Quality-judged | ~714 (and growing) | LLM 5-axis judge (Groq Llama / Gemini / OpenRouter) |
| Bench-judged Elo | Active (cycle #21 live) | Haiku 4.5 + DeepSeek V4 Flash + GPT-5 mini via OpenRouter |
| Categories | 6 skill + 8 CLAUDE.md | document · sql · data · web · shell · code · nextjs · react · python-data · backend-api · mobile · devops · ml-training · generic |
| Official orgs flagged | ~30 | anthropics, openai, google, vercel, stripe, supabase, etc. |
- Free — scraped from public sources, verified progressively (5-level trust ladder: claimed → verified → reviewed → featured)
- Premium — author-listed, fixed price, Versuz takes 30% / author keeps 70% via Stripe Connect Express
- Featured — Versuz first-party curation, 100% Versuz
versuz.dev/marketplace — full registry with filters (kind, category, tier, trust level, quality, tokens, official, topic).
```
npx versuz                        # interactive mode
npx versuz search pdf             # cross-kind full-text search
npx versuz install <slug>         # download to .claude/skills/<slug>/
npx versuz login                  # auth with GitHub PAT
npx versuz submit <github-url>    # share your skill (own repo only)
```

```
claude mcp add versuz npx -y @versuz/mcp
```

Claude Code now has 5 native tools: `versuz_search`, `versuz_list_skills`, `versuz_list_claude_md`, `versuz_get`, `versuz_install`. Ask "find me a PDF skill and install it" — Claude searches, inspects, downloads, all inline.
- `GET /api/v1/skills` · `GET /api/v1/skills/<slug>` · `GET /api/v1/skills/<slug>/content`
- `GET /api/v1/claude-md` · same shape
- `GET /feed/skills` · `GET /feed/claude-md` — RSS 2.0
- `GET /sitemap.xml`
- Per-item embed badges: `/badge/<kind>/<slug>` (SVG)
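Any HTTP client can query the registry endpoints above. A minimal Node sketch — the base path comes from the endpoint list, but the `category` and `kind` query parameter names are assumptions mirroring the marketplace filters, not documented API params:

```javascript
// Build a registry URL with optional filters. The base path is from the
// endpoint list above; the query parameter names are assumptions.
function skillsUrl({ category, kind } = {}) {
  const url = new URL("https://versuz.dev/api/v1/skills");
  if (category) url.searchParams.set("category", category);
  if (kind) url.searchParams.set("kind", kind);
  return url.toString();
}

// Usage (Node 18+ ships a global fetch):
// const res = await fetch(skillsUrl({ category: "sql" }));
// const skills = await res.json();
```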
- Discover — multi-source scrape (GitHub Code Search + Sourcegraph + awesome-* aggregators) with multi-token rotation, content-hash dedup, official-org auto-flag.
- Score — LLM 5-axis quality judge (clarity, specificity, completeness, structure, usefulness) on every item. Mean 67/100, target distribution N(65, 12).
- Benchmark — for each category, every skill runs through 5 held-out tasks (N=5, a statistically validated sweet spot). Output dedup by content hash so identical runs are reused.
- Judge — 3 frontier LLMs grade each output independently. Judge dedup by output hash. Prompt cache hits ~82%, cutting input cost by ~57%.
- Aggregate — Bayesian Elo per category. Refreshed every 15 min via Vercel cron.
Full methodology: versuz.dev/methodology.
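For illustration, here is a plain Elo update; the Aggregate step uses a Bayesian variant whose exact priors and update rule aren't documented in this README:

```javascript
// Standard Elo: expected win probability, then a rating update scaled by K.
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

function updateElo(ratingA, ratingB, scoreA, K = 32) {
  // scoreA: 1 = A wins, 0.5 = draw, 0 = A loses
  const delta = K * (scoreA - expectedScore(ratingA, ratingB));
  return [ratingA + delta, ratingB - delta];
}
```

With equal ratings the expected score is 0.5, so a win moves each side by K/2.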
| Layer | Tech |
|---|---|
| Frontend | Next.js 16 (App Router · Turbopack · React Compiler) · React 19 · JavaScript |
| Styling | Tailwind v4 (@theme inline) · Instrument Serif + Geist + JetBrains Mono |
| Backend | Next.js API routes · Postgres via Supabase · service-role admin client |
| Storage | Supabase Storage (premium-content bucket, private + signed URLs) |
| Hosting | Vercel · 6 daily/weekly crons |
| Auth | Supabase Auth + GitHub OAuth |
| Payments | Stripe Connect Express + destination charges (70/30) |
| Email | Resend SMTP (welcome + branded receipts) |
| Bench | Pure Node in scripts/bench/ · OpenRouter (1 key, 200+ models) · multi-tier modes (dev / v1 / or-v1 / prod / gold) |
| CLI | Pure Node + figlet + chalk + cli-table3 + ora + prompts |
| MCP | @modelcontextprotocol/sdk |
```
git clone https://github.com/TomaTV/versuz
cd versuz
npm install
cp .env.local.example .env.local   # fill Supabase, Stripe, Resend, GitHub PATs
npm run dev                        # http://localhost:3000
```

| Command | What it does |
|---|---|
| `npm run dev` | Dev server (Turbopack) |
| `npm run build` / `start` | Prod build & run |
| `npm run scrape:skills` | Scrape SKILL.md via GitHub Code Search |
| `npm run scrape:claude-md` | Same for CLAUDE.md |
| `npm run scrape:aggregators` | 14 awesome-* + 8 GitHub Topics |
| `npm run scrape:codesearch` | Sourcegraph stream API |
| `npm run scrape:max` | Exhaustive mode (40 sub-queries × 2 kinds) |
| `npm run scrape:all` | All scrapers chained |
| `npm run bench` | Run a queued cycle (agent + judge + refresh) |
| `npm run bench:quality` | LLM quality judge on un-judged items |
| `npm run seed` | Push fixtures to Supabase |
```
NEXT_PUBLIC_SUPABASE_URL=
NEXT_PUBLIC_SUPABASE_ANON_KEY=
SUPABASE_SERVICE_ROLE_KEY=
GITHUB_TOKENS=ghp_xxx,ghp_yyy,ghp_zzz   # multi-token rotation
OPENROUTER_API_KEY=                     # bench mode or-v1
STRIPE_SECRET_KEY=
STRIPE_WEBHOOK_SECRET=
RESEND_API_KEY=
ADMIN_GITHUB_LOGINS=YourGitHubLogin
```
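The `GITHUB_TOKENS` list feeds the scrapers' multi-token rotation. A minimal round-robin sketch — the real rotation presumably also reacts to rate-limit responses, which this omits:

```javascript
// Parse the comma-separated PAT list and hand tokens out round-robin,
// so no single token exhausts its GitHub rate limit alone.
function makeTokenRotator(envValue) {
  const tokens = envValue.split(",").map((t) => t.trim()).filter(Boolean);
  let i = 0;
  return () => tokens[i++ % tokens.length];
}

// const next = makeTokenRotator(process.env.GITHUB_TOKENS);
// fetch(url, { headers: { Authorization: `Bearer ${next()}` } });
```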
versuz/
├── src/ # Next.js app + lib + components
│ ├── app/ # App Router pages + /api routes
│ ├── components/ # Brand, site, marketplace, motion, …
│ └── lib/ # queries, auth, stripe, supabase, …
├── scripts/ # Node scrapers + bench engine
│ ├── scrape/ # GitHub Code Search (SKILL.md)
│ ├── scrape-claude-md/ # GitHub Code Search (CLAUDE.md)
│ ├── scrape-aggregators/ # 14 awesome-* + 8 topics
│ ├── scrape-codesearch/ # Sourcegraph adapter
│ └── bench/ # Agent + judge + cycle orchestrator
├── supabase/migrations/ # 22 SQL migrations
├── cli/ # `npx versuz` CLI package
├── mcp-server/ # `@versuz/mcp` MCP server package
├── docs/ # Go-live checklists (stripe, domain, cli-mcp)
├── .ui/ # Design reference (Claude Design v1)
├── CONTEXT.md # Long-form project context
├── CLAUDE.md # Agent working agreement
└── TODO.md # Task list
Mode `or-v1` (default): Haiku 4.5 + DeepSeek V4 Flash + GPT-5 mini via OpenRouter. One key, one dashboard.
| Scale | Outputs | Judge scores | Cost (with ~82% cache hit) |
|---|---|---|---|
| 100 skills × 5 tasks | 500 | 1,500 | ~$1.60 |
| 1,000 skills × 5 tasks | 5,000 | 15,000 | ~$16 |
| 5,000 skills × 5 tasks | 25,000 | 75,000 | ~$80 |
| Full catalog (~5,200) | 26,000 | 78,000 | ~$83 |
Why some judges cost more: DeepSeek V4 Flash emits internal reasoning tokens (chain-of-thought) before the JSON score. These invisible tokens are billed but not shown. A cap of 900 output tokens keeps reasoning concise without truncating the JSON. Override via `BENCH_JUDGE_MAX_TOKENS`.
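The volume columns in the table above follow directly from outputs = skills × tasks and judge scores = outputs × 3 judges:

```javascript
// 5 held-out tasks per skill, 3 independent judges per output.
const TASKS = 5;
const JUDGES = 3;

function benchVolume(skillCount) {
  const outputs = skillCount * TASKS;
  return { outputs, judgeScores: outputs * JUDGES };
}
```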
Optimisations available:

- `BENCH_JUDGE_COUNT=2` → cut judge calls by 33%, save ~$27 on 5k skills
- `BENCH_JUDGE_MAX_TOKENS=500` → squeeze DeepSeek reasoning further, save ~$15 on 5k skills
- `BENCH_MODE=or-thrift` → single judge (GPT-5 nano), ~$0.86/day
- `BENCH_MODE=dev` → 3 free Groq Llama models, $0 (3,000 RPD cap)
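These knobs are plain environment variables. A sketch of reading them with the defaults this README states (3 judges, 900-token cap, mode `or-v1`); the shape of the config object is an assumption:

```javascript
// Defaults mirror the README: 3 judges, 900 output tokens, mode or-v1.
function benchConfig(env = process.env) {
  return {
    judgeCount: Number(env.BENCH_JUDGE_COUNT ?? 3),
    judgeMaxTokens: Number(env.BENCH_JUDGE_MAX_TOKENS ?? 900),
    mode: env.BENCH_MODE ?? "or-v1",
  };
}
```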
- ✅ V0 — public marketplace, scraping pipeline, quality judge, CLI v0.1, MCP v0.1
- ✅ V0.5 — bench engine live (or-v1 with prompt cache), first cycles completed
- ⏳ V1 — domain launch, Stripe live mode, full catalog bench
- 🔮 V2 — real-time battles (LMArena-style), dark theme, API plans for companies
→ Full plan in TODO.md.
FlukX Studio — Toma, solo build from France. EdTech automation + design + full-stack.
MIT — see LICENSE.
