
Add benchmark results page to website #46

Merged
Wenyueh merged 2 commits into main from package-benchmarkresults
Mar 24, 2026

@Sripadkarne Sripadkarne commented Mar 23, 2026

Issue #50

Summary

  • New "Benchmark Results" page on the AgentOpt website documenting 200-sample brute-force evaluations across 4 benchmarks and 9 Bedrock models
  • Added to site navigation next to Examples

What's on the page

  • Model pricing table — all 9 AWS Bedrock models with per-token costs
  • Cross-benchmark summary — best combo, accuracy, and selector savings for GPQA, BFCL, HotpotQA, MathQA
  • Per-benchmark sections with:
    • Model/combo result tables (top 15 + bottom 15 for 2-tuple benchmarks)
    • Selector comparison tables (7 algorithms, sorted by mean accuracy, 50-seed simulations)
    • Collapsible full 81-combo tables for HotpotQA and MathQA
  • Key findings highlighted as callouts (e.g., HotpotQA "Capability as Liability" — Opus as planner bypasses the solver and scores worst at ~32%)

Changes

  • docs/benchmark-results/index.md — new page
  • mkdocs.yml — nav entry added
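
For reference, a nav entry of this kind might look like the sketch below. Only the `benchmark-results/index.md` path comes from this PR (relative to the default `docs/` directory); the `Examples` path is a placeholder assumption, since the actual `mkdocs.yml` is not shown here.

```yaml
# mkdocs.yml (sketch, not the project's actual file)
nav:
  - Examples: examples/index.md                      # hypothetical path, for illustration only
  - Benchmark Results: benchmark-results/index.md    # page added in this PR
```

MkDocs resolves nav paths relative to `docs_dir`, which defaults to `docs/`, so the `docs/` prefix is omitted in the entry.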

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Sripadkarne Sripadkarne requested a review from Wenyueh March 23, 2026 15:58
Comment thread docs/benchmark-results/index.md Outdated
@@ -0,0 +1,390 @@
# Benchmark Results

We evaluated **9 models on AWS Bedrock** across **4 benchmarks** using LangGraph-based agents, then ran 8 model selection algorithms to measure how efficiently each finds the best model without exhaustive search. All results use 198–200 samples per benchmark with brute force ground truth. Selector comparisons were run with 50 random seeds.
Collaborator

Don't mention AWS Bedrock

Collaborator Author

Removed it

| 80 | Claude 3 Haiku | Qwen3 32B | 63.86% | $0.27 |
| 81 | Claude 3 Haiku | Claude 3 Haiku | 59.88% | $0.30 |

??? note "Full 81 Combo Results"
Collaborator

what's this symbol "???"

Collaborator Author

That's MkDocs syntax for a collapsible section!

It keeps the full 81 combo table hidden by default so the page isn't overwhelming, and users can click to expand it.
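
For context, the `???` marker is provided by the `pymdownx.details` extension from PyMdown Extensions rather than by MkDocs core, so it only renders as a collapsible block when that extension is enabled. A minimal sketch of the relevant `mkdocs.yml` entries, assuming the site does not already list them (the project's real config is not shown in this thread):

```yaml
# mkdocs.yml (sketch; assumes these extensions are not yet enabled)
markdown_extensions:
  - admonition         # renders "note", "warning", etc. blocks
  - pymdownx.details   # enables ??? collapsible blocks
```

The block's content must be indented by four spaces under the `??? note "..."` line, and writing `???+` instead of `???` renders the block expanded by default.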

Collaborator

ah nice!

@Wenyueh Wenyueh merged commit fbde3d3 into main Mar 24, 2026
@Wenyueh Wenyueh deleted the package-benchmarkresults branch March 24, 2026 22:23