Add benchmark results page to website by Sripadkarne · Pull Request #46 · AgentOptimizer/agentopt

Sripadkarne · 2026-03-23T15:58:34Z

Issue #50

Summary

- New "Benchmark Results" page on the AgentOpt website documenting 200-sample brute-force evaluations across 4 benchmarks and 9 Bedrock models
- Added to site navigation next to Examples

What's on the page

- Model pricing table: all 9 AWS Bedrock models with per-token costs
- Cross-benchmark summary: best combo, accuracy, and selector savings for GPQA, BFCL, HotpotQA, MathQA
- Per-benchmark sections with:
  - Model/combo result tables (top 15 + bottom 15 for 2-tuple benchmarks)
  - Selector comparison tables (7 algorithms, sorted by mean accuracy, 50-seed simulations)
  - Collapsible full 81-combo tables for HotpotQA and MathQA
- Key findings highlighted as callouts (e.g., HotpotQA "Capability as Liability": Opus as planner bypasses the solver and scores worst at ~32%)

Changes

- docs/benchmark-results/index.md: new page
- mkdocs.yml: nav entry added
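
As a rough illustration of the nav change, the `mkdocs.yml` entry might look like the sketch below. Only the page path `docs/benchmark-results/index.md` and its position next to Examples come from this PR; the `Home` entry, the `examples/index.md` path, and the exact labels are assumptions, not the repository's actual configuration.

```yaml
# Hypothetical excerpt of mkdocs.yml; every entry except Benchmark Results is assumed.
nav:
  - Home: index.md                                   # assumed existing entry
  - Examples: examples/index.md                      # assumed path of the existing Examples page
  - Benchmark Results: benchmark-results/index.md    # new page from this PR (path is relative to docs/)
```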

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Wenyueh · 2026-03-24T19:33:19Z

@@ -0,0 +1,390 @@
+# Benchmark Results
+
+We evaluated **9 models on AWS Bedrock** across **4 benchmarks** using LangGraph-based agents, then ran 8 model selection algorithms to measure how efficiently each finds the best model without exhaustive search. All results use 198–200 samples per benchmark with brute force ground truth. Selector comparisons were run with 50 random seeds.


Don't mention AWS Bedrock.

Wenyueh · 2026-03-24T19:34:00Z

+| 80 | Claude 3 Haiku | Qwen3 32B | 63.86% | $0.27 |
+| 81 | Claude 3 Haiku | Claude 3 Haiku | 59.88% | $0.30 |
+
+??? note "Full 81 Combo Results"


What's this symbol, "???"?

That's MkDocs syntax for a collapsible section!

It keeps the full 81 combo table hidden by default so the page isn't overwhelming, and users can click to expand it.
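
For context, the `???` block syntax comes from the `pymdownx.details` extension in PyMdown Extensions, which is commonly enabled alongside the Material for MkDocs theme, and is not part of MkDocs core. Below is a minimal sketch of the configuration it needs. This PR does not show the project's actual `markdown_extensions` list, so whether these lines are already present is an assumption.

```yaml
# Sketch of the mkdocs.yml settings that let `??? note "..."` render as a
# collapsible block. Not taken from the agentopt repository.
markdown_extensions:
  - admonition          # styles note/warning call-outs
  - pymdownx.details    # adds the collapsible `???` (closed) and `???+` (open) forms
```

The content to collapse is indented four spaces beneath the `???` line; writing `???+` instead renders the block expanded by default.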

Add benchmark results page to website

8854a89

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Sripadkarne requested a review from Wenyueh March 23, 2026 15:58

Wenyueh reviewed Mar 24, 2026

Remove 'AWS Bedrock' from benchmark results description

300b646

Wenyueh merged commit fbde3d3 into main Mar 24, 2026

Wenyueh deleted the package-benchmarkresults branch March 24, 2026 22:23

Wenyueh merged 2 commits into main from package-benchmarkresults

2 participants
