From 2edba90a9e7b8315dbf4bb7989fc6312d393cfe4 Mon Sep 17 00:00:00 2001 From: Sripad Karne Date: Thu, 2 Apr 2026 13:23:58 -0700 Subject: [PATCH 1/5] Update docs with latest benchmark results and blog post fixes - benchmark-results/index.md: All tables updated with corrected numbers, added GPQA thinking effort ablation section - blog technical-deep-dive: Updated budget alternatives, algorithm comparison, selector summary, Opus+Opus fix, cumulative cost wording - mkdocs.yml: Minor config updates Co-Authored-By: Claude Opus 4.6 --- docs/benchmark-results/index.md | 571 +++++++++++++------------ docs/blog/posts/technical-deep-dive.md | 49 +-- mkdocs.yml | 2 + 3 files changed, 320 insertions(+), 302 deletions(-) diff --git a/docs/benchmark-results/index.md b/docs/benchmark-results/index.md index 77725f5..21c9d4c 100644 --- a/docs/benchmark-results/index.md +++ b/docs/benchmark-results/index.md @@ -22,10 +22,10 @@ All models accessed via AWS Bedrock Application Inference Profiles (on-demand pr | Benchmark | Tuple | Samples | Combos | Best Combo | Accuracy | BF Cost | Arm Elim Savings | |:----------|:------|:--------|:-------|:-----------|:---------|:--------|:-----------------| -| GPQA Diamond | 1-tuple | 198 | 9 | Claude Opus 4.6 | **80.30%** | $4.13 | 49% | -| BFCL Multi-Turn | 1-tuple | 200 | 9 | Claude Opus 4.6 | **72.00%** | $85.42 | 11% | -| HotpotQA | 2-tuple | 199 | 81 | planner=Ministral 3 8B + solver=Claude Opus 4.6 | **74.78%** | $51.48 | 64% | -| MathQA | 2-tuple | 200 | 81 | answer=Claude Opus 4.6 + critic=Qwen3 Next 80B A3B | **98.83%** | $113.01 | 46% | +| GPQA Diamond | 1-tuple | 198 | 9 | Claude Opus 4.6 | **74.75%** | $4.71 | 24% | +| BFCL Multi-Turn | 1-tuple | 200 | 9 | Kimi K2.5 (tied with Opus, Qwen3 Next) | **70.00%** | $84.80 | 12% | +| HotpotQA | 2-tuple | 200 | 81 | planner=Ministral 3 8B + solver=Claude Opus 4.6 | **74.27%** | $51.90 | 67% | +| MathQA | 2-tuple | 200 | 81 | answer=Claude Opus 4.6 + critic=Claude Haiku 4.5 | **98.84%** | $123.87 
| 58% | --- @@ -37,28 +37,43 @@ All models accessed via AWS Bedrock Application Inference Profiles (on-demand pr | Rank | Model | Accuracy | Avg Latency (s) | Cost | |:-----|:------|:---------|:-----------------|:-----| -| 1 | Claude Opus 4.6 | **80.30%** | 86.4 | $2.43 | -| 2 | Kimi K2.5 | 72.02% | 97.2 | $0.72 | -| 3 | gpt-oss-120b | 68.02% | 88.5 | $0.19 | -| 4 | Claude Haiku 4.5 | 60.51% | 83.1 | $0.51 | -| 5 | gpt-oss-20b | 52.02% | 85.7 | $0.13 | -| 6 | Qwen3 Next 80B A3B | 51.04% | 90.0 | $0.06 | -| 7 | Qwen3 32B | 46.67% | 88.3 | $0.04 | -| 8 | Claude 3 Haiku | 37.31% | 80.1 | $0.06 | -| 9 | Ministral 3 8B | 36.87% | 84.7 | $0.01 | +| 1 | Claude Opus 4.6 | **74.75%** | 9.16 | $2.48 | +| 2 | Kimi K2.5 | 72.73% | 16.41 | $1.13 | +| 3 | gpt-oss-120b | 68.18% | 6.46 | $0.20 | +| 4 | Claude Haiku 4.5 | 59.60% | 3.70 | $0.51 | +| 5 | Qwen3 Next 80B A3B | 51.01% | 10.33 | $0.14 | +| 6 | gpt-oss-20b | 50.00% | 6.21 | $0.14 | +| 7 | Qwen3 32B | 46.97% | 1.54 | $0.08 | +| 8 | Ministral 3 8B | 36.87% | 0.25 | $0.00 | +| 9 | Claude 3 Haiku | 34.85% | 1.79 | $0.06 | ### Selector Comparison | Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings | |:---------|:----------|:--------------|:------------|:-----|:--------| -| Brute Force | 100% | 80.30% | 1,759 | $4.13 | -- | -| LM Proposal | 100% | 80.30% | 198 | $2.43 | 41% | -| Arm Elimination | 98% | 80.14% | 444 | $2.10 | **49%** | -| Hill Climbing | 92% | 79.64% | 1,552 | $3.75 | 9% | -| Epsilon LUCB | 90% | 79.47% | 361 | $2.32 | 44% | -| Bayesian Opt | 56% | 75.41% | 976 | $2.32 | 44% | -| Random Search | 36% | 70.53% | 587 | $1.53 | 63% | -| Threshold SE | 36% | 65.88% | 294 | $0.26 | 94% | +| Brute Force | 100% | 74.75% | 1,782 | $4.71 | -- | +| LM Proposal | 100% | 74.75% | 198 | $2.47 | 48% | +| Arm Elimination | 94% | 74.10% | 666 | $3.57 | **24%** | +| Hill Climbing | 90% | 74.55% | 1,501 | $4.03 | 14% | +| Epsilon LUCB | 72% | 73.14% | 380 | $2.51 | 47% | +| Bayesian Opt | 56% | 72.43% | 990 | 
$2.59 | 45% | +| Random Search | 36% | 68.57% | 594 | $1.73 | 63% | +| Threshold SE | 16% | 57.48% | 252 | $1.80 | 62% | + +### Thinking Effort Ablation + +Impact of the thinking/reasoning budget on GPQA accuracy for Opus (adaptive `effort` levels) and Haiku 4.5 (explicit `budget_tokens`). The baseline "none" rows reuse the brute-force leaderboard results above; rows are sorted by accuracy. + +| Model | Effort/Budget Tokens | Accuracy | Cost/Sample | Server Latency/Sample (s) | +|:------|:---------------------|:---------|:------------|:--------------------------| +| Opus | high | 83.90% | $0.113 | 70.4 | +| Opus | medium | 79.30% | $0.0341 | 23.9 | +| Opus | none | 74.75% | $0.0125 | 9.16 | +| Haiku 4.5 | 16K | 71.20% | $0.0192 | 30.6 | +| Haiku 4.5 | 32K | 71.20% | $0.0361 | 57.4 | +| Opus | low | 61.60% | $0.00302 | 3.06 | +| Haiku 4.5 | 5K | 60.10% | $0.00925 | 15.0 | +| Haiku 4.5 | none | 59.60% | $0.0026 | 3.70 | --- @@ -73,176 +88,176 @@ All models accessed via AWS Bedrock Application Inference Profiles (on-demand pr | Rank | Model | Accuracy | Avg Latency (s) | Cost | |:-----|:------|:---------|:-----------------|:-----| -| 1 | Claude Opus 4.6 | **72.00%** | 222.8 | $60.78 | -| 2 | Qwen3 Next 80B A3B | 71.00% | 226.3 | $1.87 | -| 3 | Kimi K2.5 | 68.50% | 228.2 | $3.86 | -| 4 | Claude Haiku 4.5 | 65.00% | 208.7 | $11.97 | -| 5 | gpt-oss-120b | 61.00% | 208.9 | $1.13 | -| 6 | Qwen3 32B | 50.00% | 211.3 | $0.97 | -| 7 | Claude 3 Haiku | 45.00% | 205.2 | $3.42 | -| 8 | gpt-oss-20b | 39.00% | 204.2 | $0.44 | -| 9 | Ministral 3 8B | 33.50% | 213.8 | $0.98 | +| 1 | Kimi K2.5 | **70.00%** | 21.30 | $3.86 | +| 2 | Claude Opus 4.6 | 70.00% | 42.35 | $60.14 | +| 3 | Qwen3 Next 80B A3B | 70.00% | 60.54 | $1.90 | +| 4 | Claude Haiku 4.5 | 65.00% | 20.90 | $11.98 | +| 5 | gpt-oss-120b | 58.50% | 20.01 | $1.16 | +| 6 | Qwen3 32B | 47.00% | 10.78 | $1.00 | +| 7 | Claude 3 Haiku | 43.50% | 17.96 | $3.42 | +| 8 | gpt-oss-20b | 42.00% | 10.03 | $0.42 | +| 9 | Ministral 3 8B | 34.00% | 29.03 | $0.92 | ### Selector Comparison | Selector | Find Rate | Mean 
Accuracy | Evaluations | Cost | Savings | |:---------|:----------|:--------------|:------------|:-----|:--------| -| Brute Force | 100% | 72.00% | 1,800 | $85.42 | -- | -| Arm Elimination | 100% | 72.00% | 922 | $76.33 | **11%** | -| Hill Climbing | 94% | 71.94% | 1,652 | $80.38 | 6% | -| Epsilon LUCB | 60% | 71.33% | 407 | $42.60 | 50% | -| Bayesian Opt | 56% | 70.61% | 1,000 | $50.98 | 40% | -| Random Search | 36% | 67.99% | 600 | $31.62 | 63% | -| Threshold SE | 12% | 57.52% | 285 | $6.45 | 92% | -| LM Proposal | 0% | 45.00% | 200 | $3.42 | 96% | +| Brute Force | 100% | 70.00% | 1,800 | $84.80 | -- | +| Hill Climbing | 100% | 70.00% | 1,664 | $72.12 | 15% | +| Arm Elimination | 88% | 69.37% | 912 | $74.39 | **12%** | +| Bayesian Opt | 44% | 69.27% | 1,000 | $50.64 | 40% | +| Random Search | 36% | 67.13% | 600 | $31.39 | 63% | +| Epsilon LUCB | 28% | 69.90% | 399 | $40.03 | 53% | +| Threshold SE | 10% | 58.19% | 186 | $18.82 | 78% | +| LM Proposal | 0% | 44.03% | 200 | $3.39 | 96% | --- ## HotpotQA -**Multi-hop question answering** — 199 samples from the HotpotQA distractor setting. Two-agent architecture: a **planner** proposes search steps, and a **solver** executes them with tool access. 81 model combinations (9 planners x 9 solvers). +**Multi-hop question answering** — 200 samples from the HotpotQA distractor setting. Two-agent architecture: a **planner** proposes search steps, and a **solver** executes them with tool access. 81 model combinations (9 planners x 9 solvers). 
### Top 15 Combos | Rank | Planner | Solver | Accuracy | Cost | |:-----|:--------|:-------|:---------|:-----| -| 1 | Ministral 3 8B | Claude Opus 4.6 | **74.78%** | $2.71 | -| 2 | Qwen3 32B | Claude Opus 4.6 | 72.97% | $2.67 | -| 3 | Claude 3 Haiku | Claude Opus 4.6 | 72.58% | $2.67 | -| 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.32% | $2.66 | -| 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.29% | $0.13 | -| 6 | Kimi K2.5 | Claude Opus 4.6 | 71.09% | $2.53 | -| 7 | Qwen3 32B | gpt-oss-120b | 70.47% | $0.12 | -| 8 | Qwen3 32B | Qwen3 Next 80B A3B | 69.12% | $0.11 | -| 9 | Claude 3 Haiku | Qwen3 Next 80B A3B | 68.60% | $0.15 | -| 10 | Ministral 3 8B | gpt-oss-120b | 68.56% | $0.12 | -| 11 | Claude 3 Haiku | gpt-oss-120b | 68.44% | $0.16 | -| 12 | Kimi K2.5 | Qwen3 Next 80B A3B | 68.41% | $0.28 | -| 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.29% | $0.12 | -| 14 | Qwen3 32B | gpt-oss-20b | 67.75% | $0.09 | -| 15 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.46% | $0.11 | +| 1 | Ministral 3 8B | Claude Opus 4.6 | **74.27%** | $2.64 | +| 2 | Claude 3 Haiku | Claude Opus 4.6 | 73.25% | $2.79 | +| 3 | Qwen3 32B | Claude Opus 4.6 | 73.02% | $2.65 | +| 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.10% | $2.67 | +| 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.83% | $0.13 | +| 6 | Qwen3 32B | gpt-oss-120b | 70.04% | $0.13 | +| 7 | Kimi K2.5 | Claude Opus 4.6 | 69.96% | $2.43 | +| 8 | Claude 3 Haiku | gpt-oss-120b | 69.86% | $0.17 | +| 9 | Ministral 3 8B | gpt-oss-20b | 69.34% | $0.09 | +| 10 | Claude 3 Haiku | Qwen3 Next 80B A3B | 69.27% | $0.16 | +| 11 | Qwen3 Next 80B A3B | gpt-oss-20b | 68.89% | $0.09 | +| 12 | Ministral 3 8B | gpt-oss-120b | 68.70% | $0.12 | +| 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.15% | $0.11 | +| 14 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.98% | $0.11 | +| 15 | Qwen3 32B | Qwen3 Next 80B A3B | 67.53% | $0.11 | ### Bottom 15 Combos | Rank | Planner | Solver | Accuracy | Cost | |:-----|:--------|:-------|:---------|:-----| -| 67 | Claude 
Haiku 4.5 | Qwen3 32B | 35.64% | $0.46 | -| 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.10% | $0.48 | -| 69 | Ministral 3 8B | Claude Haiku 4.5 | 33.95% | $0.74 | -| 70 | Claude Opus 4.6 | Claude Opus 4.6 | 32.70% | $2.00 | -| 71 | Claude Opus 4.6 | Kimi K2.5 | 32.44% | $2.01 | -| 72 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 32.05% | $2.01 | -| 73 | Claude Opus 4.6 | gpt-oss-120b | 32.00% | $2.01 | -| 74 | Claude Opus 4.6 | Ministral 3 8B | 31.80% | $2.01 | -| 75 | Claude Opus 4.6 | Claude 3 Haiku | 31.80% | $2.01 | -| 76 | Claude Opus 4.6 | Qwen3 32B | 31.52% | $2.01 | -| 77 | Claude Opus 4.6 | gpt-oss-20b | 31.31% | $2.01 | -| 78 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 30.85% | $0.71 | -| 79 | Claude Opus 4.6 | Claude Haiku 4.5 | 30.81% | $2.01 | -| 80 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.57% | $0.79 | -| 81 | Qwen3 32B | Claude Haiku 4.5 | 25.11% | $0.72 | +| 67 | Kimi K2.5 | Claude Haiku 4.5 | 37.19% | $0.88 | +| 68 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | $0.46 | +| 69 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | $0.49 | +| 70 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | $0.70 | +| 71 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | $0.72 | +| 72 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | $2.02 | +| 73 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | $2.02 | +| 74 | Claude Opus 4.6 | Qwen3 32B | 31.96% | $2.02 | +| 75 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | $2.02 | +| 76 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | $2.02 | +| 77 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | $2.03 | +| 78 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | $2.02 | +| 79 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | $2.03 | +| 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | $0.69 | +| 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | $0.79 | !!! warning "Capability as Liability" - **Claude Opus 4.6 as planner achieves only ~32% accuracy** regardless of solver — the worst planner in the benchmark. 
Opus is "too smart" for the planner role: it calls `terminate()` and answers directly instead of delegating to the solver. The solver is never invoked. Meanwhile, the cheapest model (Ministral 3 8B) as planner with Opus as solver achieves the **best accuracy at 74.78%**. This demonstrates that stronger models can underperform in multi-agent architectures when the role requires delegation, not direct answering. + **Claude Opus 4.6 as planner achieves only ~32% accuracy** regardless of solver — the worst planner in the benchmark. Opus is "too smart" for the planner role: it calls `terminate()` and answers directly instead of delegating to the solver. The solver is never invoked. Meanwhile, the cheapest model (Ministral 3 8B) as planner with Opus as solver achieves the **best accuracy at 74.27%**. This demonstrates that stronger models can underperform in multi-agent architectures when the role requires delegation, not direct answering. ??? note "Full 81 Combo Results" | Rank | Planner | Solver | Accuracy | Cost | Note | |:-----|:--------|:-------|:---------|:-----|:-----| - | 1 | Ministral 3 8B | Claude Opus 4.6 | 74.78% | $2.71 | | - | 2 | Qwen3 32B | Claude Opus 4.6 | 72.97% | $2.67 | | - | 3 | Claude 3 Haiku | Claude Opus 4.6 | 72.58% | $2.67 | | - | 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.32% | $2.66 | | - | 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.29% | $0.13 | | - | 6 | Kimi K2.5 | Claude Opus 4.6 | 71.09% | $2.53 | | - | 7 | Qwen3 32B | gpt-oss-120b | 70.47% | $0.12 | | - | 8 | Qwen3 32B | Qwen3 Next 80B A3B | 69.12% | $0.11 | | - | 9 | Claude 3 Haiku | Qwen3 Next 80B A3B | 68.60% | $0.15 | | - | 10 | Ministral 3 8B | gpt-oss-120b | 68.56% | $0.12 | | - | 11 | Claude 3 Haiku | gpt-oss-120b | 68.44% | $0.16 | | - | 12 | Kimi K2.5 | Qwen3 Next 80B A3B | 68.41% | $0.28 | | - | 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.29% | $0.12 | | - | 14 | Qwen3 32B | gpt-oss-20b | 67.75% | $0.09 | | - | 15 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.46% | $0.11 
| | - | 16 | gpt-oss-120b | Claude Opus 4.6 | 67.03% | $1.57 | | - | 17 | Kimi K2.5 | Ministral 3 8B | 66.79% | $0.27 | | - | 18 | Qwen3 Next 80B A3B | gpt-oss-20b | 66.79% | $0.09 | | - | 19 | Kimi K2.5 | gpt-oss-120b | 66.76% | $0.29 | | - | 20 | Ministral 3 8B | gpt-oss-20b | 66.55% | $0.09 | | - | 21 | Kimi K2.5 | gpt-oss-20b | 66.49% | $0.25 | | - | 22 | Claude 3 Haiku | gpt-oss-20b | 65.10% | $0.13 | | - | 23 | Claude 3 Haiku | Ministral 3 8B | 64.41% | $0.14 | | - | 24 | Qwen3 Next 80B A3B | Kimi K2.5 | 64.28% | $0.27 | | - | 25 | Ministral 3 8B | Kimi K2.5 | 64.03% | $0.26 | | - | 26 | Qwen3 32B | Kimi K2.5 | 63.77% | $0.26 | | - | 27 | gpt-oss-120b | Qwen3 Next 80B A3B | 63.62% | $0.09 | | - | 28 | Qwen3 Next 80B A3B | Ministral 3 8B | 63.29% | $0.10 | | - | 29 | Claude 3 Haiku | Kimi K2.5 | 62.86% | $0.31 | | - | 30 | gpt-oss-120b | Claude Haiku 4.5 | 62.38% | $0.36 | | - | 31 | Ministral 3 8B | Ministral 3 8B | 62.20% | $0.09 | | - | 32 | Qwen3 32B | Ministral 3 8B | 62.09% | $0.09 | | - | 33 | Kimi K2.5 | Kimi K2.5 | 61.96% | $0.45 | | - | 34 | gpt-oss-120b | Kimi K2.5 | 61.15% | $0.17 | | - | 35 | gpt-oss-120b | Claude 3 Haiku | 60.89% | $0.12 | | - | 36 | gpt-oss-120b | Ministral 3 8B | 60.64% | $0.09 | | - | 37 | gpt-oss-120b | gpt-oss-120b | 60.51% | $0.10 | | - | 38 | gpt-oss-120b | gpt-oss-20b | 59.10% | $0.08 | | - | 39 | gpt-oss-120b | Qwen3 32B | 58.54% | $0.09 | | - | 40 | Kimi K2.5 | Claude 3 Haiku | 57.18% | $0.32 | | - | 41 | Claude 3 Haiku | Qwen3 32B | 56.28% | $0.15 | | - | 42 | Kimi K2.5 | Qwen3 32B | 55.72% | $0.27 | | - | 43 | Ministral 3 8B | Qwen3 32B | 55.30% | $0.11 | | - | 44 | gpt-oss-20b | Claude Opus 4.6 | 55.13% | $0.90 | | - | 45 | gpt-oss-20b | Ministral 3 8B | 54.99% | $0.05 | | - | 46 | Qwen3 Next 80B A3B | Qwen3 32B | 54.88% | $0.11 | | - | 47 | gpt-oss-20b | Kimi K2.5 | 54.69% | $0.11 | | - | 48 | gpt-oss-20b | gpt-oss-120b | 54.26% | $0.06 | | - | 49 | Qwen3 32B | Qwen3 32B | 54.21% | $0.11 | | - | 50 | Claude 3 Haiku 
| Claude 3 Haiku | 54.13% | $0.20 | | - | 51 | gpt-oss-20b | Claude Haiku 4.5 | 54.06% | $0.25 | | - | 52 | gpt-oss-20b | Claude 3 Haiku | 53.08% | $0.08 | | - | 53 | gpt-oss-20b | Qwen3 Next 80B A3B | 52.87% | $0.05 | | - | 54 | gpt-oss-20b | gpt-oss-20b | 52.69% | $0.05 | | - | 55 | Ministral 3 8B | Claude 3 Haiku | 51.65% | $0.16 | | - | 56 | gpt-oss-20b | Qwen3 32B | 49.60% | $0.06 | | - | 57 | Qwen3 Next 80B A3B | Claude 3 Haiku | 48.86% | $0.17 | | - | 58 | Qwen3 32B | Claude 3 Haiku | 48.29% | $0.16 | | - | 59 | Claude 3 Haiku | Claude Haiku 4.5 | 47.28% | $0.71 | | - | 60 | Claude Haiku 4.5 | Claude Opus 4.6 | 43.40% | $1.80 | | - | 61 | Claude Haiku 4.5 | Kimi K2.5 | 41.51% | $0.55 | | - | 62 | Claude Haiku 4.5 | Ministral 3 8B | 41.21% | $0.45 | | - | 63 | Claude Haiku 4.5 | gpt-oss-20b | 41.18% | $0.45 | | - | 64 | Claude Haiku 4.5 | gpt-oss-120b | 40.83% | $0.47 | | - | 65 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 40.54% | $0.46 | | - | 66 | Kimi K2.5 | Claude Haiku 4.5 | 40.37% | $0.87 | | - | 67 | Claude Haiku 4.5 | Qwen3 32B | 35.64% | $0.46 | | - | 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.10% | $0.48 | | - | 69 | Ministral 3 8B | Claude Haiku 4.5 | 33.95% | $0.74 | | - | 70 | Claude Opus 4.6 | Claude Opus 4.6 | 32.70% | $2.00 | | - | 71 | Claude Opus 4.6 | Kimi K2.5 | 32.44% | $2.01 | role2_never_called | - | 72 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 32.05% | $2.01 | role2_never_called | - | 73 | Claude Opus 4.6 | gpt-oss-120b | 32.00% | $2.01 | role2_never_called | - | 74 | Claude Opus 4.6 | Ministral 3 8B | 31.80% | $2.01 | role2_never_called | - | 75 | Claude Opus 4.6 | Claude 3 Haiku | 31.80% | $2.01 | role2_never_called | - | 76 | Claude Opus 4.6 | Qwen3 32B | 31.52% | $2.01 | role2_never_called | - | 77 | Claude Opus 4.6 | gpt-oss-20b | 31.31% | $2.01 | role2_never_called | - | 78 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 30.85% | $0.71 | | - | 79 | Claude Opus 4.6 | Claude Haiku 4.5 | 30.81% | $2.01 | role2_never_called | - | 80 | Claude 
Haiku 4.5 | Claude Haiku 4.5 | 26.57% | $0.79 | | - | 81 | Qwen3 32B | Claude Haiku 4.5 | 25.11% | $0.72 | | + | 1 | Ministral 3 8B | Claude Opus 4.6 | 74.27% | $2.64 | | + | 2 | Claude 3 Haiku | Claude Opus 4.6 | 73.25% | $2.79 | | + | 3 | Qwen3 32B | Claude Opus 4.6 | 73.02% | $2.65 | | + | 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.10% | $2.67 | | + | 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.83% | $0.13 | | + | 6 | Qwen3 32B | gpt-oss-120b | 70.04% | $0.13 | | + | 7 | Kimi K2.5 | Claude Opus 4.6 | 69.96% | $2.43 | | + | 8 | Claude 3 Haiku | gpt-oss-120b | 69.86% | $0.17 | | + | 9 | Ministral 3 8B | gpt-oss-20b | 69.34% | $0.09 | | + | 10 | Claude 3 Haiku | Qwen3 Next 80B A3B | 69.27% | $0.16 | | + | 11 | Qwen3 Next 80B A3B | gpt-oss-20b | 68.89% | $0.09 | | + | 12 | Ministral 3 8B | gpt-oss-120b | 68.70% | $0.12 | | + | 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.15% | $0.11 | | + | 14 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.98% | $0.11 | | + | 15 | Qwen3 32B | Qwen3 Next 80B A3B | 67.53% | $0.11 | | + | 16 | Qwen3 32B | gpt-oss-20b | 66.95% | $0.09 | | + | 17 | Claude 3 Haiku | Ministral 3 8B | 65.98% | $0.14 | | + | 18 | Ministral 3 8B | Kimi K2.5 | 65.24% | $0.26 | | + | 19 | gpt-oss-120b | Qwen3 Next 80B A3B | 64.93% | $0.10 | | + | 20 | Ministral 3 8B | Ministral 3 8B | 64.89% | $0.09 | | + | 21 | Claude 3 Haiku | gpt-oss-20b | 64.79% | $0.13 | | + | 22 | Kimi K2.5 | gpt-oss-120b | 64.70% | $0.29 | | + | 23 | gpt-oss-120b | Claude Opus 4.6 | 64.59% | $1.61 | | + | 24 | gpt-oss-120b | Claude Haiku 4.5 | 64.11% | $0.38 | | + | 25 | Kimi K2.5 | Qwen3 Next 80B A3B | 63.99% | $0.30 | | + | 26 | Kimi K2.5 | Ministral 3 8B | 63.95% | $0.28 | | + | 27 | Claude 3 Haiku | Kimi K2.5 | 63.85% | $0.31 | | + | 28 | gpt-oss-120b | Ministral 3 8B | 63.70% | $0.09 | | + | 29 | Qwen3 Next 80B A3B | Kimi K2.5 | 63.69% | $0.27 | | + | 30 | Kimi K2.5 | gpt-oss-20b | 63.35% | $0.26 | | + | 31 | Qwen3 32B | Kimi K2.5 | 63.17% | $0.28 | | + | 32 | gpt-oss-120b | 
Claude 3 Haiku | 62.72% | $0.13 | | + | 33 | Kimi K2.5 | Kimi K2.5 | 62.28% | $0.44 | | + | 34 | gpt-oss-120b | gpt-oss-120b | 62.15% | $0.10 | | + | 35 | Qwen3 Next 80B A3B | Ministral 3 8B | 62.11% | $0.10 | | + | 36 | gpt-oss-120b | gpt-oss-20b | 61.51% | $0.08 | | + | 37 | Qwen3 32B | Ministral 3 8B | 61.17% | $0.09 | | + | 38 | gpt-oss-120b | Kimi K2.5 | 60.85% | $0.18 | | + | 39 | gpt-oss-120b | Qwen3 32B | 58.80% | $0.10 | | + | 40 | Claude 3 Haiku | Qwen3 32B | 56.02% | $0.15 | | + | 41 | Claude 3 Haiku | Claude 3 Haiku | 55.91% | $0.21 | | + | 42 | gpt-oss-20b | Claude Opus 4.6 | 55.86% | $1.04 | | + | 43 | Ministral 3 8B | Qwen3 32B | 55.02% | $0.11 | | + | 44 | Kimi K2.5 | Claude 3 Haiku | 54.90% | $0.34 | | + | 45 | Qwen3 32B | Qwen3 32B | 54.82% | $0.11 | | + | 46 | Kimi K2.5 | Qwen3 32B | 54.73% | $0.30 | | + | 47 | gpt-oss-20b | Claude Haiku 4.5 | 54.28% | $0.26 | | + | 48 | gpt-oss-20b | Ministral 3 8B | 54.25% | $0.05 | | + | 49 | Qwen3 Next 80B A3B | Qwen3 32B | 54.13% | $0.11 | | + | 50 | gpt-oss-20b | Qwen3 Next 80B A3B | 53.89% | $0.06 | | + | 51 | gpt-oss-20b | Claude 3 Haiku | 52.66% | $0.08 | | + | 52 | gpt-oss-20b | gpt-oss-120b | 52.17% | $0.06 | | + | 53 | Ministral 3 8B | Claude 3 Haiku | 51.33% | $0.16 | | + | 54 | gpt-oss-20b | Kimi K2.5 | 51.01% | $0.12 | | + | 55 | gpt-oss-20b | gpt-oss-20b | 50.09% | $0.05 | | + | 56 | Qwen3 Next 80B A3B | Claude 3 Haiku | 49.98% | $0.17 | | + | 57 | gpt-oss-20b | Qwen3 32B | 49.16% | $0.06 | | + | 58 | Qwen3 32B | Claude 3 Haiku | 48.77% | $0.16 | | + | 59 | Claude 3 Haiku | Claude Haiku 4.5 | 46.50% | $0.71 | | + | 60 | Claude Haiku 4.5 | Claude Opus 4.6 | 43.54% | $1.80 | | + | 61 | Claude Haiku 4.5 | gpt-oss-20b | 41.49% | $0.45 | | + | 62 | Claude Haiku 4.5 | gpt-oss-120b | 41.20% | $0.47 | | + | 63 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 41.17% | $0.46 | | + | 64 | Claude Haiku 4.5 | Ministral 3 8B | 41.09% | $0.45 | | + | 65 | Claude Haiku 4.5 | Kimi K2.5 | 41.00% | $0.54 | | + | 66 | Kimi 
K2.5 | Claude Haiku 4.5 | 37.19% | $0.88 | | + | 67 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | $0.46 | | + | 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | $0.49 | | + | 69 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | $0.70 | | + | 70 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | $0.72 | | + | 71 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | $2.02 | role2_never_called | + | 72 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | $2.02 | role2_never_called | + | 73 | Claude Opus 4.6 | Qwen3 32B | 31.96% | $2.02 | role2_never_called | + | 74 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | $2.02 | role2_never_called | + | 75 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | $2.02 | role2_never_called | + | 76 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | $2.03 | role2_never_called | + | 77 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | $2.02 | role2_never_called | + | 78 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | $2.03 | role2_never_called | + | 79 | Claude Opus 4.6 | Claude Opus 4.6 | 31.71% | $2.02 | | + | 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | $0.69 | | + | 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | $0.79 | | ### Selector Comparison | Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings | |:---------|:----------|:--------------|:------------|:-----|:--------| -| Brute Force | 100% | 74.78% | 16,108 | $51.48 | -- | -| Arm Elimination | 90% | 74.12% | 4,654 | $18.49 | **64%** | -| Hill Climbing | 44% | 73.38% | 5,031 | $19.21 | 63% | -| Bayesian Opt | 8% | 72.78% | 3,979 | $12.13 | 76% | -| Random Search | 30% | 72.34% | 4,176 | $13.26 | 74% | -| Epsilon LUCB | 14% | 69.96% | 477 | $1.86 | 96% | -| Threshold SE | 2% | 63.62% | 1,926 | $3.50 | 93% | -| LM Proposal | 0% | 34.41% | 199 | $1.86 | 96% | +| Brute Force | 100% | 74.27% | 16,168 | $51.90 | -- | +| Arm Elimination | 86% | 73.19% | 4,283 | $16.92 | **67%** | +| Hill Climbing | 52% | 73.13% | 4,635 | $19.39 | 63% | +| Random Search | 30% | 72.25% | 4,192 | $13.37 | 74% | +| Epsilon LUCB 
| 10% | 69.71% | 478 | $1.75 | 97% | +| Bayesian Opt | 8% | 73.33% | 3,996 | $12.29 | 76% | +| Threshold SE | 4% | 65.42% | 1,642 | $6.45 | 88% | +| LM Proposal | 0% | 34.13% | 200 | $1.84 | 96% | --- @@ -254,137 +269,137 @@ All models accessed via AWS Bedrock Application Inference Profiles (on-demand pr | Rank | Answer Model | Critic Model | Accuracy | Cost | |:-----|:-------------|:-------------|:---------|:-----| -| 1 | Claude Opus 4.6 | Qwen3 Next 80B A3B | **98.83%** | $5.89 | -| 2 | Claude Opus 4.6 | Ministral 3 8B | 98.73% | $5.31 | -| 3 | Claude Opus 4.6 | Claude Haiku 4.5 | 98.27% | $6.09 | -| 4 | Claude Opus 4.6 | Qwen3 32B | 97.79% | $6.42 | -| 5 | Claude Opus 4.6 | Claude Opus 4.6 | 97.77% | $6.95 | -| 6 | Claude Opus 4.6 | gpt-oss-120b | 97.73% | $6.14 | -| 7 | Claude Opus 4.6 | Claude 3 Haiku | 97.26% | $5.26 | -| 8 | Claude Opus 4.6 | Kimi K2.5 | 97.25% | $6.66 | -| 9 | Claude Opus 4.6 | gpt-oss-20b | 97.13% | $6.10 | -| 10 | Claude Haiku 4.5 | Ministral 3 8B | 94.47% | $2.59 | -| 11 | Claude Haiku 4.5 | Claude Haiku 4.5 | 94.00% | $3.17 | -| 12 | Claude Haiku 4.5 | Claude Opus 4.6 | 94.00% | $3.89 | -| 13 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 94.00% | $2.50 | -| 14 | Ministral 3 8B | Claude 3 Haiku | 93.98% | $0.05 | -| 15 | Claude Haiku 4.5 | Kimi K2.5 | 93.97% | $2.92 | +| 1 | Claude Opus 4.6 | Claude Haiku 4.5 | **98.84%** | $6.19 | +| 2 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.82% | $5.77 | +| 3 | Claude Opus 4.6 | Ministral 3 8B | 98.72% | $5.26 | +| 4 | Claude Opus 4.6 | gpt-oss-20b | 98.28% | $5.93 | +| 5 | Claude Opus 4.6 | gpt-oss-120b | 97.77% | $6.30 | +| 6 | Claude Opus 4.6 | Qwen3 32B | 97.28% | $6.68 | +| 7 | Claude Opus 4.6 | Claude Opus 4.6 | 97.24% | $6.97 | +| 8 | Claude Opus 4.6 | Kimi K2.5 | 97.24% | $6.58 | +| 9 | Claude Opus 4.6 | Claude 3 Haiku | 95.95% | $5.37 | +| 10 | gpt-oss-20b | Claude Opus 4.6 | 94.57% | $0.97 | +| 11 | gpt-oss-20b | Kimi K2.5 | 94.57% | $0.26 | +| 12 | gpt-oss-20b | gpt-oss-20b | 94.54% | $0.08 
| +| 13 | Claude Haiku 4.5 | Qwen3 32B | 94.50% | $2.51 | +| 14 | gpt-oss-20b | Claude Haiku 4.5 | 94.05% | $0.37 | +| 15 | gpt-oss-20b | gpt-oss-120b | 94.02% | $0.11 | ### Bottom 15 Combos | Rank | Answer Model | Critic Model | Accuracy | Cost | |:-----|:-------------|:-------------|:---------|:-----| -| 67 | Kimi K2.5 | Claude Haiku 4.5 | 78.01% | $1.20 | -| 68 | Claude 3 Haiku | Ministral 3 8B | 77.64% | $0.30 | -| 69 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 77.60% | $2.03 | -| 70 | Kimi K2.5 | Claude Opus 4.6 | 77.55% | $2.55 | -| 71 | Claude 3 Haiku | Kimi K2.5 | 77.48% | $0.46 | -| 72 | gpt-oss-120b | gpt-oss-20b | 77.32% | $0.17 | -| 73 | Kimi K2.5 | Claude 3 Haiku | 76.96% | $0.92 | -| 74 | Qwen3 Next 80B A3B | Kimi K2.5 | 76.96% | $0.46 | -| 75 | Qwen3 Next 80B A3B | gpt-oss-120b | 76.84% | $0.32 | -| 76 | gpt-oss-120b | Qwen3 Next 80B A3B | 74.74% | $0.20 | -| 77 | Claude 3 Haiku | gpt-oss-20b | 72.96% | $0.31 | -| 78 | Kimi K2.5 | Qwen3 32B | 72.77% | $0.67 | -| 79 | Claude 3 Haiku | Qwen3 Next 80B A3B | 68.94% | $0.36 | -| 80 | Claude 3 Haiku | Qwen3 32B | 63.86% | $0.27 | -| 81 | Claude 3 Haiku | Claude 3 Haiku | 59.88% | $0.30 | +| 67 | Qwen3 Next 80B A3B | Kimi K2.5 | 75.50% | $0.79 | +| 68 | Qwen3 Next 80B A3B | gpt-oss-20b | 75.00% | $0.48 | +| 69 | Kimi K2.5 | gpt-oss-120b | 74.49% | $0.95 | +| 70 | Kimi K2.5 | gpt-oss-20b | 74.09% | $0.77 | +| 71 | Kimi K2.5 | Kimi K2.5 | 73.58% | $1.34 | +| 72 | Kimi K2.5 | Claude Opus 4.6 | 73.33% | $2.79 | +| 73 | Kimi K2.5 | Claude Haiku 4.5 | 73.20% | $1.36 | +| 74 | Claude 3 Haiku | gpt-oss-120b | 72.19% | $0.32 | +| 75 | Kimi K2.5 | Qwen3 32B | 72.16% | $0.92 | +| 76 | Claude 3 Haiku | gpt-oss-20b | 71.43% | $0.32 | +| 77 | Claude 3 Haiku | Qwen3 Next 80B A3B | 71.07% | $0.39 | +| 78 | Claude 3 Haiku | Kimi K2.5 | 71.01% | $0.53 | +| 79 | Claude 3 Haiku | Ministral 3 8B | 69.28% | $0.32 | +| 80 | Claude 3 Haiku | Qwen3 32B | 59.30% | $0.29 | +| 81 | Claude 3 Haiku | Claude 3 Haiku | 54.37% | $0.30 | ??? 
note "Full 81 Combo Results" | Rank | Answer Model | Critic Model | Accuracy | Cost | Note | |:-----|:-------------|:-------------|:---------|:-----|:-----| - | 1 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.83% | $5.89 | | - | 2 | Claude Opus 4.6 | Ministral 3 8B | 98.73% | $5.31 | | - | 3 | Claude Opus 4.6 | Claude Haiku 4.5 | 98.27% | $6.09 | | - | 4 | Claude Opus 4.6 | Qwen3 32B | 97.79% | $6.42 | | - | 5 | Claude Opus 4.6 | Claude Opus 4.6 | 97.77% | $6.95 | | - | 6 | Claude Opus 4.6 | gpt-oss-120b | 97.73% | $6.14 | | - | 7 | Claude Opus 4.6 | Claude 3 Haiku | 97.26% | $5.26 | | - | 8 | Claude Opus 4.6 | Kimi K2.5 | 97.25% | $6.66 | | - | 9 | Claude Opus 4.6 | gpt-oss-20b | 97.13% | $6.10 | | - | 10 | Claude Haiku 4.5 | Ministral 3 8B | 94.47% | $2.59 | | - | 11 | Claude Haiku 4.5 | Claude Haiku 4.5 | 94.00% | $3.17 | | - | 12 | Claude Haiku 4.5 | Claude Opus 4.6 | 94.00% | $3.89 | | - | 13 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 94.00% | $2.50 | | - | 14 | Ministral 3 8B | Claude 3 Haiku | 93.98% | $0.05 | | - | 15 | Claude Haiku 4.5 | Kimi K2.5 | 93.97% | $2.92 | | - | 16 | gpt-oss-20b | gpt-oss-120b | 93.96% | $0.12 | | - | 17 | Claude Haiku 4.5 | Qwen3 32B | 93.50% | $2.72 | | - | 18 | Claude Haiku 4.5 | gpt-oss-20b | 93.50% | $2.93 | | - | 19 | gpt-oss-20b | Kimi K2.5 | 93.44% | $0.23 | | - | 20 | gpt-oss-20b | Claude Haiku 4.5 | 92.97% | $0.36 | | - | 21 | Claude 3 Haiku | Claude Opus 4.6 | 92.94% | $2.04 | | - | 22 | Claude Haiku 4.5 | gpt-oss-120b | 92.50% | $2.35 | | - | 23 | gpt-oss-20b | Qwen3 Next 80B A3B | 92.43% | $0.15 | | - | 24 | gpt-oss-20b | Claude Opus 4.6 | 92.35% | $0.99 | | - | 25 | gpt-oss-20b | gpt-oss-20b | 91.94% | $0.09 | | - | 26 | Claude Haiku 4.5 | Claude 3 Haiku | 91.50% | $2.95 | | - | 27 | gpt-oss-20b | Qwen3 32B | 91.21% | $0.08 | | - | 28 | gpt-oss-20b | Claude 3 Haiku | 90.76% | $0.16 | | - | 29 | Ministral 3 8B | gpt-oss-120b | 90.59% | $0.07 | | - | 30 | gpt-oss-20b | Ministral 3 8B | 90.43% | $0.13 | | - | 31 | 
Ministral 3 8B | Qwen3 Next 80B A3B | 90.20% | $0.03 | | - | 32 | Ministral 3 8B | Claude Opus 4.6 | 89.53% | $0.87 | | - | 33 | Ministral 3 8B | Claude Haiku 4.5 | 88.89% | $0.30 | | - | 34 | Ministral 3 8B | Kimi K2.5 | 88.82% | $0.09 | | - | 35 | Ministral 3 8B | gpt-oss-20b | 88.76% | $0.04 | | - | 36 | Qwen3 32B | Qwen3 Next 80B A3B | 88.72% | $0.21 | | - | 37 | Ministral 3 8B | Ministral 3 8B | 88.19% | $0.03 | | - | 38 | Claude 3 Haiku | Claude Haiku 4.5 | 87.21% | $0.69 | | - | 39 | Ministral 3 8B | Qwen3 32B | 86.98% | $0.04 | | - | 40 | Qwen3 32B | Ministral 3 8B | 86.73% | $0.35 | | - | 41 | Qwen3 32B | gpt-oss-120b | 86.67% | $0.25 | | - | 42 | Qwen3 32B | Claude Opus 4.6 | 85.35% | $2.01 | | - | 43 | Qwen3 32B | gpt-oss-20b | 85.05% | $0.19 | | - | 44 | Qwen3 32B | Claude Haiku 4.5 | 84.02% | $0.53 | | - | 45 | Qwen3 32B | Kimi K2.5 | 82.74% | $1.11 | | - | 46 | Qwen3 32B | Qwen3 32B | 82.56% | $0.17 | | - | 47 | Qwen3 32B | Claude 3 Haiku | 82.47% | $0.27 | | - | 48 | Qwen3 Next 80B A3B | Claude 3 Haiku | 82.29% | $0.42 | | - | 49 | Qwen3 Next 80B A3B | Qwen3 32B | 81.87% | $0.29 | | - | 50 | Kimi K2.5 | gpt-oss-120b | 81.44% | $0.82 | | - | 51 | Kimi K2.5 | gpt-oss-20b | 81.35% | $0.85 | | - | 52 | Kimi K2.5 | Kimi K2.5 | 81.25% | $1.13 | | - | 53 | gpt-oss-120b | Qwen3 32B | 80.41% | $0.15 | | - | 54 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 80.32% | $0.31 | | - | 55 | gpt-oss-120b | Claude Haiku 4.5 | 80.31% | $0.47 | | - | 56 | gpt-oss-120b | Kimi K2.5 | 80.10% | $0.27 | | - | 57 | gpt-oss-120b | Ministral 3 8B | 80.00% | $0.18 | | - | 58 | Qwen3 Next 80B A3B | gpt-oss-20b | 79.79% | $0.32 | | - | 59 | gpt-oss-120b | Claude 3 Haiku | 79.69% | $0.19 | | - | 60 | Kimi K2.5 | Ministral 3 8B | 79.49% | $0.86 | | - | 61 | gpt-oss-120b | Claude Opus 4.6 | 79.49% | $1.17 | | - | 62 | Kimi K2.5 | Qwen3 Next 80B A3B | 79.06% | $0.82 | | - | 63 | gpt-oss-120b | gpt-oss-120b | 78.87% | $0.20 | | - | 64 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 78.65% | 
$0.81 | | - | 65 | Qwen3 Next 80B A3B | Ministral 3 8B | 78.65% | $0.38 | | - | 66 | Claude 3 Haiku | gpt-oss-120b | 78.62% | $0.36 | | - | 67 | Kimi K2.5 | Claude Haiku 4.5 | 78.01% | $1.20 | | - | 68 | Claude 3 Haiku | Ministral 3 8B | 77.64% | $0.30 | | - | 69 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 77.60% | $2.03 | | - | 70 | Kimi K2.5 | Claude Opus 4.6 | 77.55% | $2.55 | | - | 71 | Claude 3 Haiku | Kimi K2.5 | 77.48% | $0.46 | | - | 72 | gpt-oss-120b | gpt-oss-20b | 77.32% | $0.17 | | - | 73 | Kimi K2.5 | Claude 3 Haiku | 76.96% | $0.92 | | - | 74 | Qwen3 Next 80B A3B | Kimi K2.5 | 76.96% | $0.46 | | - | 75 | Qwen3 Next 80B A3B | gpt-oss-120b | 76.84% | $0.32 | | - | 76 | gpt-oss-120b | Qwen3 Next 80B A3B | 74.74% | $0.20 | | - | 77 | Claude 3 Haiku | gpt-oss-20b | 72.96% | $0.31 | | - | 78 | Kimi K2.5 | Qwen3 32B | 72.77% | $0.67 | | - | 79 | Claude 3 Haiku | Qwen3 Next 80B A3B | 68.94% | $0.36 | | - | 80 | Claude 3 Haiku | Qwen3 32B | 63.86% | $0.27 | | - | 81 | Claude 3 Haiku | Claude 3 Haiku | 59.88% | $0.30 | | + | 1 | Claude Opus 4.6 | Claude Haiku 4.5 | 98.84% | $6.19 | | + | 2 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.82% | $5.77 | | + | 3 | Claude Opus 4.6 | Ministral 3 8B | 98.72% | $5.26 | | + | 4 | Claude Opus 4.6 | gpt-oss-20b | 98.28% | $5.93 | | + | 5 | Claude Opus 4.6 | gpt-oss-120b | 97.77% | $6.30 | | + | 6 | Claude Opus 4.6 | Qwen3 32B | 97.28% | $6.68 | | + | 7 | Claude Opus 4.6 | Claude Opus 4.6 | 97.24% | $6.97 | | + | 8 | Claude Opus 4.6 | Kimi K2.5 | 97.24% | $6.58 | | + | 9 | Claude Opus 4.6 | Claude 3 Haiku | 95.95% | $5.37 | | + | 10 | gpt-oss-20b | Claude Opus 4.6 | 94.57% | $0.97 | | + | 11 | gpt-oss-20b | Kimi K2.5 | 94.57% | $0.26 | | + | 12 | gpt-oss-20b | gpt-oss-20b | 94.54% | $0.08 | | + | 13 | Claude Haiku 4.5 | Qwen3 32B | 94.50% | $2.51 | | + | 14 | gpt-oss-20b | Claude Haiku 4.5 | 94.05% | $0.37 | | + | 15 | gpt-oss-20b | gpt-oss-120b | 94.02% | $0.11 | | + | 16 | gpt-oss-20b | Qwen3 Next 80B A3B | 94.02% | $0.14 | | + 
| 17 | Claude Haiku 4.5 | Claude Haiku 4.5 | 94.00% | $2.59 | | + | 18 | gpt-oss-20b | Ministral 3 8B | 93.99% | $0.10 | | + | 19 | gpt-oss-120b | Claude Opus 4.6 | 93.81% | $1.25 | | + | 20 | Claude Haiku 4.5 | gpt-oss-20b | 93.50% | $2.20 | | + | 21 | Claude Haiku 4.5 | Claude Opus 4.6 | 93.50% | $3.77 | | + | 22 | Claude Haiku 4.5 | Ministral 3 8B | 93.50% | $2.57 | | + | 23 | Claude Haiku 4.5 | Kimi K2.5 | 93.50% | $2.60 | | + | 24 | gpt-oss-20b | Qwen3 32B | 93.48% | $0.09 | | + | 25 | gpt-oss-20b | Claude 3 Haiku | 93.44% | $0.15 | | + | 26 | gpt-oss-120b | Ministral 3 8B | 93.26% | $0.19 | | + | 27 | gpt-oss-120b | Qwen3 32B | 93.26% | $0.16 | | + | 28 | Claude Haiku 4.5 | gpt-oss-120b | 93.00% | $2.90 | | + | 29 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 93.00% | $7.81 | | + | 30 | gpt-oss-120b | Claude Haiku 4.5 | 92.82% | $0.47 | | + | 31 | gpt-oss-120b | gpt-oss-20b | 92.78% | $0.18 | | + | 32 | gpt-oss-120b | gpt-oss-120b | 92.78% | $0.19 | | + | 33 | gpt-oss-120b | Kimi K2.5 | 92.78% | $0.32 | | + | 34 | gpt-oss-120b | Qwen3 Next 80B A3B | 92.78% | $0.23 | | + | 35 | gpt-oss-120b | Claude 3 Haiku | 92.75% | $0.20 | | + | 36 | Claude Haiku 4.5 | Claude 3 Haiku | 92.50% | $2.46 | | + | 37 | Claude 3 Haiku | Claude Opus 4.6 | 89.66% | $2.26 | | + | 38 | Qwen3 32B | Qwen3 Next 80B A3B | 88.83% | $0.24 | | + | 39 | Ministral 3 8B | Claude 3 Haiku | 88.15% | $0.05 | | + | 40 | Qwen3 32B | gpt-oss-120b | 87.83% | $0.47 | | + | 41 | Ministral 3 8B | Qwen3 Next 80B A3B | 87.82% | $0.03 | | + | 42 | Qwen3 32B | Claude Opus 4.6 | 87.56% | $3.43 | | + | 43 | Ministral 3 8B | Kimi K2.5 | 87.04% | $0.09 | | + | 44 | Ministral 3 8B | gpt-oss-120b | 86.63% | $0.07 | | + | 45 | Claude 3 Haiku | Claude Haiku 4.5 | 86.55% | $0.69 | | + | 46 | Ministral 3 8B | Ministral 3 8B | 86.52% | $0.03 | | + | 47 | Ministral 3 8B | Claude Opus 4.6 | 86.47% | $0.93 | | + | 48 | Qwen3 32B | Claude Haiku 4.5 | 86.46% | $0.90 | | + | 49 | Ministral 3 8B | Claude Haiku 4.5 | 86.23% | 
$0.30 | | + | 50 | Ministral 3 8B | gpt-oss-20b | 86.13% | $0.05 | | + | 51 | Qwen3 32B | Ministral 3 8B | 86.10% | $0.21 | | + | 52 | Qwen3 32B | Kimi K2.5 | 85.94% | $0.78 | | + | 53 | Qwen3 32B | gpt-oss-20b | 85.86% | $0.49 | | + | 54 | Ministral 3 8B | Qwen3 32B | 85.80% | $0.04 | | + | 55 | Qwen3 32B | Qwen3 32B | 84.82% | $0.62 | | + | 56 | Kimi K2.5 | Claude 3 Haiku | 80.41% | $0.98 | | + | 57 | Qwen3 32B | Claude 3 Haiku | 80.00% | $0.67 | | + | 58 | Qwen3 Next 80B A3B | Claude 3 Haiku | 80.00% | $0.59 | | + | 59 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 78.00% | $2.96 | | + | 60 | Kimi K2.5 | Ministral 3 8B | 77.84% | $0.97 | | + | 61 | Kimi K2.5 | Qwen3 Next 80B A3B | 77.20% | $1.00 | | + | 62 | Qwen3 Next 80B A3B | Ministral 3 8B | 77.00% | $0.55 | | + | 63 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 76.50% | $1.21 | | + | 64 | Qwen3 Next 80B A3B | gpt-oss-120b | 76.50% | $0.52 | | + | 65 | Qwen3 Next 80B A3B | Qwen3 32B | 76.00% | $0.42 | | + | 66 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 76.00% | $0.54 | | + | 67 | Qwen3 Next 80B A3B | Kimi K2.5 | 75.50% | $0.79 | | + | 68 | Qwen3 Next 80B A3B | gpt-oss-20b | 75.00% | $0.48 | | + | 69 | Kimi K2.5 | gpt-oss-120b | 74.49% | $0.95 | | + | 70 | Kimi K2.5 | gpt-oss-20b | 74.09% | $0.77 | | + | 71 | Kimi K2.5 | Kimi K2.5 | 73.58% | $1.34 | | + | 72 | Kimi K2.5 | Claude Opus 4.6 | 73.33% | $2.79 | | + | 73 | Kimi K2.5 | Claude Haiku 4.5 | 73.20% | $1.36 | | + | 74 | Claude 3 Haiku | gpt-oss-120b | 72.19% | $0.32 | | + | 75 | Kimi K2.5 | Qwen3 32B | 72.16% | $0.92 | | + | 76 | Claude 3 Haiku | gpt-oss-20b | 71.43% | $0.32 | | + | 77 | Claude 3 Haiku | Qwen3 Next 80B A3B | 71.07% | $0.39 | | + | 78 | Claude 3 Haiku | Kimi K2.5 | 71.01% | $0.53 | | + | 79 | Claude 3 Haiku | Ministral 3 8B | 69.28% | $0.32 | | + | 80 | Claude 3 Haiku | Qwen3 32B | 59.30% | $0.29 | | + | 81 | Claude 3 Haiku | Claude 3 Haiku | 54.37% | $0.30 | | ### Selector Comparison | Selector | Find Rate | Mean Accuracy | Evaluations | Cost | 
Savings | |:---------|:----------|:--------------|:------------|:-----|:--------| -| Brute Force | 100% | 98.83% | 14,855 | $113.01 | -- | -| Arm Elimination | 96% | 98.80% | 3,632 | $61.22 | **46%** | -| Random Search | 28% | 98.04% | 3,850 | $28.83 | 74% | -| Hill Climbing | 72% | 97.81% | 4,058 | $45.72 | 60% | -| Epsilon LUCB | 0% | 97.46% | 443 | $5.55 | 95% | -| LM Proposal | 0% | 96.87% | 149 | $5.15 | 95% | -| Bayesian Opt | 4% | 95.39% | 3,608 | $31.05 | 73% | -| Threshold SE | 0% | 77.23% | 369 | $1.95 | 98% | +| Brute Force | 100% | 98.84% | 14,961 | $123.87 | -- | +| Arm Elimination | 86% | 98.83% | 3,356 | $51.86 | **58%** | +| Hill Climbing | 80% | 98.76% | 3,926 | $54.22 | 56% | +| Random Search | 28% | 98.17% | 3,880 | $31.77 | 74% | +| Epsilon LUCB | 4% | 96.99% | 447 | $6.10 | 95% | +| Bayesian Opt | 4% | 95.41% | 3,666 | $35.56 | 71% | +| LM Proposal | 0% | 95.82% | 158 | $5.61 | 95% | +| Threshold SE | 0% | 74.52% | 1,355 | $6.90 | 94% | diff --git a/docs/blog/posts/technical-deep-dive.md b/docs/blog/posts/technical-deep-dive.md index ab3cb19..5258d7e 100644 --- a/docs/blog/posts/technical-deep-dive.md +++ b/docs/blog/posts/technical-deep-dive.md @@ -59,11 +59,12 @@ And the impact is enormous. 
Here's what we found across three benchmarks, compar

| Benchmark | Expensive Combo | Acc | Cost | Budget Combo | Acc | Cost | Savings |
|-----------|----------------|-----|------|-------------|-----|------|---------|
-| HotpotQA | Opus + Opus | ~73% | $2.71 | Qwen3 Next + gpt-oss-120b | 71.3% | $0.13 | **21x** |
-| MathQA | Opus + Opus | ~98.5% | $5.89 | Ministral + C3 Haiku | 94.0% | $0.05 | **118x** |
-| BFCL | Opus | 72% | $60.78 | Qwen3 Next | 71% | $1.87 | **32x** |
+| GPQA | Opus | 74.75% | $2.47 | gpt-oss-120b | 68.18% | $0.19 | **13x** |
+| HotpotQA | Opus + Opus | ~32% | $2.02 | Qwen3 Next + gpt-oss-120b | 71.8% | $0.13 | **16x** |
+| MathQA | Opus + Haiku 4.5 | 98.8% | $6.19 | gpt-oss-20b + Kimi | 94.6% | $0.26 | **24x** |
+| BFCL | Opus | 70% | $60.13 | Qwen3 Next | 70% | $1.90 | **32x** |

-These are real numbers from real benchmarks. Same accuracy band, 20-100x cost difference. No amount of caching or request batching can close a 32x gap. The model choice *is* the optimization.
+These are real numbers from real benchmarks. Comparable accuracy (on HotpotQA the budget combo wins outright), 13-32x cost difference. No amount of caching or request batching can close a 32x gap. The model choice *is* the optimization.

## Agent Routing Is Not LLM Routing

@@ -83,7 +84,7 @@ And the results prove why this matters. On HotpotQA (multi-hop question answerin

**The weakest planner + the strongest solver beats the strongest planner + any solver.**

-Ministral 3 8B (the cheapest, smallest model) as planner paired with Claude Opus as solver achieves 74.8% accuracy. Claude Opus as *both* planner and solver? Only ~73%. Why? Because Opus as planner is *too capable*: it answers the question directly, bypassing the solver's search tools entirely. The "worse" planner correctly delegates to the tool-augmented solver, producing better results.
+Ministral 3 8B (the cheapest, smallest model) as planner paired with Claude Opus as solver achieves 74.3% accuracy. Claude Opus as *both* planner and solver? Only ~32%. Why?
Because Opus as planner is *too capable*: it answers the question directly, bypassing the solver's search tools entirely. The "worse" planner correctly delegates to the tool-augmented solver, producing better results. You'd never find this by picking "the best model" for each layer independently. The best combo doesn't contain the best individual models. This is the credit assignment problem in action. @@ -128,14 +129,14 @@ The catch: Hill Climbing requires **topology information**. It needs a notion of ### How Much Do These Save? -Across our four benchmarks, Arm Elimination consistently achieves near-optimal accuracy while using 40-60% less budget than brute force: +Across our four benchmarks, Arm Elimination consistently achieves near-optimal accuracy while using up to 67% less budget than brute force: | Benchmark | Brute Force Accuracy | Arm Elimination Accuracy | Cumulative cost savings | |-----------|---------------------|------------------------|-------------| -| HotpotQA | 74.78% | 74.12% | 64% | -| GPQA | 80.30% | 80.14% | 49% | -| MathQA | 98.83% | 98.80% | 46% | -| BFCL | 72.00% | 72.00% | 11% | +| HotpotQA | 74.27% | 73.19% | 67% | +| MathQA | 98.84% | 98.83% | 58% | +| GPQA | 74.75% | 74.10% | 24% | +| BFCL | 70.00% | 69.37% | 12% | Nearly identical accuracy to exhaustive search, at roughly half the cumulative evaluation cost. These algorithms don't just save budget. They find the right combo with statistical guarantees. @@ -154,32 +155,32 @@ We validated AgentOpt across four diverse benchmarks using 9 models on Amazon Be | Benchmark | Best Combo | Why It's Surprising | |-----------|-----------|-------------------| -| HotpotQA | Ministral 3 8B + Opus | Weakest planner wins. Opus as planner bypasses search tools | -| MathQA | Opus + Qwen3 Next | Critic barely matters. Opus solves math correctly on the first try | -| BFCL | Opus (single) | Qwen3 Next ties at 32x lower cost. Statistical difference is ~1% | -| GPQA | Opus (single) | Straightforward. 
Raw capability wins for grad-level science |
+| HotpotQA | Ministral 3 8B + Opus | Weakest planner wins. Opus as planner bypasses search tools and scores only ~32% |
+| MathQA | Opus + Haiku 4.5 | Critic barely matters. Opus solves math correctly on the first try |
+| BFCL | Opus / Kimi / Qwen3 Next (tied) | Three models tie at 70%. Qwen3 Next costs 32x less than Opus |
+| GPQA | Opus | Kimi is within 2pp at less than half the cost |

### Algorithm Comparison (50 random seeds each)

-| Algorithm | HotpotQA | GPQA | MathQA | BFCL |
+| Algorithm | GPQA | BFCL | HotpotQA | MathQA |
-| Brute Force | 74.78% / 0% | 80.30% / 0% | 98.83% / 0% | 72.00% / 0% |
-| Arm Elimination | 74.12% / 64% | 80.14% / 49% | 98.80% / 46% | 72.00% / 11% |
-| Hill Climbing | 73.38% / 63% | 79.64% / 9% | 97.81% / 60% | 71.94% / 6% |
-| Bayesian Opt | 72.78% / 76% | 75.41% / 44% | 95.39% / 73% | 70.61% / 40% |
-| Random Search | 72.34% / 74% | 70.53% / 63% | 98.04% / 74% | 67.99% / 63% |
-| Epsilon-LUCB | 69.96% / 96% | 79.47% / 44% | 97.46% / 95% | 71.33% / 50% |
-| Threshold SE | 63.62% / 93% | 65.88% / 94% | 77.23% / 98% | 57.52% / 92% |
-| LM Proposal | 34.41% / 96% | 80.30% / 41% | 96.87% / 95% | 45.00% / 96% |
+| Brute Force | 74.75% / 0% | 70.00% / 0% | 74.27% / 0% | 98.84% / 0% |
+| Arm Elimination | 74.10% / 24% | 69.37% / 12% | 73.19% / 67% | 98.83% / 58% |
+| Hill Climbing | 74.55% / 14% | 70.00% / 15% | 73.13% / 63% | 98.76% / 56% |
+| Bayesian Opt | 72.43% / 45% | 69.27% / 40% | 73.33% / 76% | 95.41% / 71% |
+| Random Search | 68.57% / 63% | 67.13% / 63% | 72.25% / 74% | 98.17% / 74% |
+| Epsilon-LUCB | 73.14% / 47% | 69.90% / 53% | 69.71% / 97% | 96.99% / 95% |
+| Threshold SE | 57.83% / 62% | 58.19% / 78% | 65.42% / 88% | 74.52% / 94% |
+| LM Proposal | 74.75% / 48% | 44.03% / 96% | 34.13% / 97% | 95.82% / 96% |
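For context on what the Arm Elimination rows above are measuring, here is a minimal self-contained sketch of the successive-elimination idea behind that selector. The function name, batch sizes, and confidence radius are illustrative assumptions for this sketch, not AgentOpt's actual API:

```python
import random

def arm_elimination(arms, evaluate, batch=20, rounds=4):
    """Successive elimination: evaluate surviving arms in equal batches,
    then drop any arm whose upper confidence bound falls below the best
    surviving arm's lower confidence bound."""
    stats = {a: [0, 0] for a in arms}   # arm -> [successes, trials]
    alive = sorted(arms)                # sorted for deterministic order
    for _ in range(rounds):
        for a in alive:
            for _ in range(batch):
                stats[a][0] += bool(evaluate(a))
                stats[a][1] += 1

        def bounds(a):
            wins, n = stats[a]
            mean = wins / n
            radius = (2.0 / n) ** 0.5   # loose Hoeffding-style radius
            return mean - radius, mean + radius

        best_lower = max(bounds(a)[0] for a in alive)
        alive = [a for a in alive if bounds(a)[1] >= best_lower]
        if len(alive) == 1:
            break
    # Among the survivors, return the best empirical mean.
    return max(alive, key=lambda a: stats[a][0] / stats[a][1])

# Toy usage with simulated per-arm accuracies (hypothetical numbers):
random.seed(0)
truth = {"opus": 0.9, "mid": 0.5, "small": 0.2}
best = arm_elimination(list(truth), lambda a: random.random() < truth[a])
```

With well-separated arms this converges on the top arm after a few rounds while skipping most evaluations of dominated arms, which is the intuition behind the budget savings reported in the table.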
*Format: obtained accuracy / cumulative evaluation cost savings vs brute force. Averaged over 50 seeds. Green = within 5% of brute force accuracy AND >50% savings on that metric.* -Arm Elimination is the consistent winner: it achieves near-brute-force accuracy across all four benchmarks while using 40-60% less budget. LM Proposal (asking GPT-4.1 to predict the best combo) matches brute force on GPQA (where the answer is intuitive) but collapses to 34% on HotpotQA and 45% on BFCL. It can't predict that Ministral outperforms Opus as a planner. +Arm Elimination and Hill Climbing achieve comparable mean accuracy (within 1 percentage point of brute force across all four benchmarks), with Arm Elimination offering modestly higher cost savings on average (40% vs 37%). No single selector dominates all benchmarks. Hill Climbing excels when top models are tightly clustered (BFCL), while Arm Elimination performs best when there is clear separation between the best combo and the rest (HotpotQA, MathQA). LM Proposal (asking GPT-4.1 to predict the best combo) matches brute force on GPQA (where the answer is intuitive) but collapses to 34% on HotpotQA and 44% on BFCL. It cannot predict that Ministral outperforms Opus as a planner. ## Get Started diff --git a/mkdocs.yml b/mkdocs.yml index 48d35f7..32b32e4 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -58,6 +58,8 @@ nav: - LangChain: examples/langchain.md - CrewAI: examples/crewai.md - LlamaIndex: examples/llamaindex.md + - Benchmark Results: + - benchmark-results/index.md - API Reference: - Selectors: api/selectors.md - Results: api/results.md From a12ac9134eb9d60b141c66ce28fdf8712520d1ca Mon Sep 17 00:00:00 2001 From: Sripad Karne Date: Thu, 2 Apr 2026 13:31:12 -0700 Subject: [PATCH 2/5] Add topology assumption tradeoff for Hill Climbing vs Arm Elimination Arm Elimination is assumption-free (uses only observed data), while Hill Climbing requires a hand-crafted model ranking upfront. Also split LM Proposal into its own paragraph. 
Co-Authored-By: Claude Opus 4.6 --- docs/blog/posts/technical-deep-dive.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/blog/posts/technical-deep-dive.md b/docs/blog/posts/technical-deep-dive.md index 5258d7e..4a34834 100644 --- a/docs/blog/posts/technical-deep-dive.md +++ b/docs/blog/posts/technical-deep-dive.md @@ -180,7 +180,9 @@ We validated AgentOpt across four diverse benchmarks using 9 models on Amazon Be *Format: obtained accuracy / cumulative evaluation cost savings vs brute force. Averaged over 50 seeds. Green = within 5% of brute force accuracy AND >50% savings on that metric.* -Arm Elimination and Hill Climbing achieve comparable mean accuracy (within 1 percentage point of brute force across all four benchmarks), with Arm Elimination offering modestly higher cost savings on average (40% vs 37%). No single selector dominates all benchmarks. Hill Climbing excels when top models are tightly clustered (BFCL), while Arm Elimination performs best when there is clear separation between the best combo and the rest (HotpotQA, MathQA). LM Proposal (asking GPT-4.1 to predict the best combo) matches brute force on GPQA (where the answer is intuitive) but collapses to 34% on HotpotQA and 44% on BFCL. It cannot predict that Ministral outperforms Opus as a planner. +Arm Elimination and Hill Climbing achieve comparable mean accuracy (within 1 percentage point of brute force across all four benchmarks), with Arm Elimination offering modestly higher cost savings on average (40% vs 37%). No single selector dominates all benchmarks. Hill Climbing excels when top models are tightly clustered (BFCL), while Arm Elimination performs best when there is clear separation between the best combo and the rest (HotpotQA, MathQA). However, Hill Climbing requires a hand-crafted topology ranking of the models upfront — you need prior knowledge about model quality and speed ordering for it to search effectively. 
Arm Elimination is fully assumption-free: it uses only the observed evaluation data to eliminate dominated combos, making it more practical when you don't have reliable priors about model capabilities. + +LM Proposal (asking GPT-4.1 to predict the best combo) matches brute force on GPQA (where the answer is intuitive) but collapses to 34% on HotpotQA and 44% on BFCL. It cannot predict that Ministral outperforms Opus as a planner. ## Get Started From 1aebd34934e15a6ff6f06a5aad5aea595031223a Mon Sep 17 00:00:00 2001 From: Sripad Karne Date: Thu, 2 Apr 2026 19:14:40 -0700 Subject: [PATCH 3/5] Add server latency column to HotpotQA and MathQA tables Added Avg Latency (s) column to all 6 tables (Top 15, Bottom 15, Full 81) for both 2-tuple benchmarks using server-side latency from cache.db results. Co-Authored-By: Claude Opus 4.6 --- docs/benchmark-results/index.md | 468 ++++++++++++++++---------------- 1 file changed, 234 insertions(+), 234 deletions(-) diff --git a/docs/benchmark-results/index.md b/docs/benchmark-results/index.md index 21c9d4c..2b4009a 100644 --- a/docs/benchmark-results/index.md +++ b/docs/benchmark-results/index.md @@ -119,132 +119,132 @@ Impact of thinking/reasoning budget on GPQA accuracy for Opus (adaptive effort) ### Top 15 Combos -| Rank | Planner | Solver | Accuracy | Cost | -|:-----|:--------|:-------|:---------|:-----| -| 1 | Ministral 3 8B | Claude Opus 4.6 | **74.27%** | $2.64 | -| 2 | Claude 3 Haiku | Claude Opus 4.6 | 73.25% | $2.79 | -| 3 | Qwen3 32B | Claude Opus 4.6 | 73.02% | $2.65 | -| 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.10% | $2.67 | -| 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.83% | $0.13 | -| 6 | Qwen3 32B | gpt-oss-120b | 70.04% | $0.13 | -| 7 | Kimi K2.5 | Claude Opus 4.6 | 69.96% | $2.43 | -| 8 | Claude 3 Haiku | gpt-oss-120b | 69.86% | $0.17 | -| 9 | Ministral 3 8B | gpt-oss-20b | 69.34% | $0.09 | -| 10 | Claude 3 Haiku | Qwen3 Next 80B A3B | 69.27% | $0.16 | -| 11 | Qwen3 Next 80B A3B | gpt-oss-20b | 68.89% | $0.09 
| -| 12 | Ministral 3 8B | gpt-oss-120b | 68.70% | $0.12 | -| 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.15% | $0.11 | -| 14 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.98% | $0.11 | -| 15 | Qwen3 32B | Qwen3 Next 80B A3B | 67.53% | $0.11 | +| Rank | Planner | Solver | Accuracy | Avg Latency (s) | Cost | +|:-----|:--------|:-------|:---------|:----------------|:-----| +| 1 | Ministral 3 8B | Claude Opus 4.6 | **74.27%** | 4.97 | $2.64 | +| 2 | Claude 3 Haiku | Claude Opus 4.6 | 73.25% | 4.52 | $2.79 | +| 3 | Qwen3 32B | Claude Opus 4.6 | 73.02% | 4.26 | $2.65 | +| 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.10% | 4.67 | $2.67 | +| 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.83% | 3.07 | $0.13 | +| 6 | Qwen3 32B | gpt-oss-120b | 70.04% | 2.66 | $0.13 | +| 7 | Kimi K2.5 | Claude Opus 4.6 | 69.96% | 4.49 | $2.43 | +| 8 | Claude 3 Haiku | gpt-oss-120b | 69.86% | 3.21 | $0.17 | +| 9 | Ministral 3 8B | gpt-oss-20b | 69.34% | 5.66 | $0.09 | +| 10 | Claude 3 Haiku | Qwen3 Next 80B A3B | 69.27% | 3.00 | $0.16 | +| 11 | Qwen3 Next 80B A3B | gpt-oss-20b | 68.89% | 2.82 | $0.09 | +| 12 | Ministral 3 8B | gpt-oss-120b | 68.70% | 3.65 | $0.12 | +| 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.15% | 2.69 | $0.11 | +| 14 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.98% | 3.85 | $0.11 | +| 15 | Qwen3 32B | Qwen3 Next 80B A3B | 67.53% | 3.51 | $0.11 | ### Bottom 15 Combos -| Rank | Planner | Solver | Accuracy | Cost | -|:-----|:--------|:-------|:---------|:-----| -| 67 | Kimi K2.5 | Claude Haiku 4.5 | 37.19% | $0.88 | -| 68 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | $0.46 | -| 69 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | $0.49 | -| 70 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | $0.70 | -| 71 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | $0.72 | -| 72 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | $2.02 | -| 73 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | $2.02 | -| 74 | Claude Opus 4.6 | Qwen3 32B | 31.96% | $2.02 | -| 75 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 
31.96% | $2.02 | -| 76 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | $2.02 | -| 77 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | $2.03 | -| 78 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | $2.02 | -| 79 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | $2.03 | -| 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | $0.69 | -| 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | $0.79 | +| Rank | Planner | Solver | Accuracy | Avg Latency (s) | Cost | +|:-----|:--------|:-------|:---------|:----------------|:-----| +| 67 | Kimi K2.5 | Claude Haiku 4.5 | 37.19% | 4.23 | $0.88 | +| 68 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | 2.89 | $0.46 | +| 69 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | 2.63 | $0.49 | +| 70 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | 4.14 | $0.70 | +| 71 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | 3.92 | $0.72 | +| 72 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | 4.72 | $2.02 | +| 73 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | 4.72 | $2.02 | +| 74 | Claude Opus 4.6 | Qwen3 32B | 31.96% | 4.72 | $2.02 | +| 75 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | 4.72 | $2.02 | +| 76 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | 4.60 | $2.02 | +| 77 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | 4.57 | $2.03 | +| 78 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | 4.22 | $2.02 | +| 79 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | 4.16 | $2.03 | +| 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | 3.47 | $0.69 | +| 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | 3.40 | $0.79 | !!! warning "Capability as Liability" **Claude Opus 4.6 as planner achieves only ~32% accuracy** regardless of solver — the worst planner in the benchmark. Opus is "too smart" for the planner role: it calls `terminate()` and answers directly instead of delegating to the solver. The solver is never invoked. Meanwhile, the cheapest model (Ministral 3 8B) as planner with Opus as solver achieves the **best accuracy at 74.27%**. 
This demonstrates that stronger models can underperform in multi-agent architectures when the role requires delegation, not direct answering. ??? note "Full 81 Combo Results" - | Rank | Planner | Solver | Accuracy | Cost | Note | - |:-----|:--------|:-------|:---------|:-----|:-----| - | 1 | Ministral 3 8B | Claude Opus 4.6 | 74.27% | $2.64 | | - | 2 | Claude 3 Haiku | Claude Opus 4.6 | 73.25% | $2.79 | | - | 3 | Qwen3 32B | Claude Opus 4.6 | 73.02% | $2.65 | | - | 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.10% | $2.67 | | - | 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.83% | $0.13 | | - | 6 | Qwen3 32B | gpt-oss-120b | 70.04% | $0.13 | | - | 7 | Kimi K2.5 | Claude Opus 4.6 | 69.96% | $2.43 | | - | 8 | Claude 3 Haiku | gpt-oss-120b | 69.86% | $0.17 | | - | 9 | Ministral 3 8B | gpt-oss-20b | 69.34% | $0.09 | | - | 10 | Claude 3 Haiku | Qwen3 Next 80B A3B | 69.27% | $0.16 | | - | 11 | Qwen3 Next 80B A3B | gpt-oss-20b | 68.89% | $0.09 | | - | 12 | Ministral 3 8B | gpt-oss-120b | 68.70% | $0.12 | | - | 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.15% | $0.11 | | - | 14 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.98% | $0.11 | | - | 15 | Qwen3 32B | Qwen3 Next 80B A3B | 67.53% | $0.11 | | - | 16 | Qwen3 32B | gpt-oss-20b | 66.95% | $0.09 | | - | 17 | Claude 3 Haiku | Ministral 3 8B | 65.98% | $0.14 | | - | 18 | Ministral 3 8B | Kimi K2.5 | 65.24% | $0.26 | | - | 19 | gpt-oss-120b | Qwen3 Next 80B A3B | 64.93% | $0.10 | | - | 20 | Ministral 3 8B | Ministral 3 8B | 64.89% | $0.09 | | - | 21 | Claude 3 Haiku | gpt-oss-20b | 64.79% | $0.13 | | - | 22 | Kimi K2.5 | gpt-oss-120b | 64.70% | $0.29 | | - | 23 | gpt-oss-120b | Claude Opus 4.6 | 64.59% | $1.61 | | - | 24 | gpt-oss-120b | Claude Haiku 4.5 | 64.11% | $0.38 | | - | 25 | Kimi K2.5 | Qwen3 Next 80B A3B | 63.99% | $0.30 | | - | 26 | Kimi K2.5 | Ministral 3 8B | 63.95% | $0.28 | | - | 27 | Claude 3 Haiku | Kimi K2.5 | 63.85% | $0.31 | | - | 28 | gpt-oss-120b | Ministral 3 8B | 63.70% | $0.09 | | - | 29 | Qwen3 Next 
80B A3B | Kimi K2.5 | 63.69% | $0.27 | | - | 30 | Kimi K2.5 | gpt-oss-20b | 63.35% | $0.26 | | - | 31 | Qwen3 32B | Kimi K2.5 | 63.17% | $0.28 | | - | 32 | gpt-oss-120b | Claude 3 Haiku | 62.72% | $0.13 | | - | 33 | Kimi K2.5 | Kimi K2.5 | 62.28% | $0.44 | | - | 34 | gpt-oss-120b | gpt-oss-120b | 62.15% | $0.10 | | - | 35 | Qwen3 Next 80B A3B | Ministral 3 8B | 62.11% | $0.10 | | - | 36 | gpt-oss-120b | gpt-oss-20b | 61.51% | $0.08 | | - | 37 | Qwen3 32B | Ministral 3 8B | 61.17% | $0.09 | | - | 38 | gpt-oss-120b | Kimi K2.5 | 60.85% | $0.18 | | - | 39 | gpt-oss-120b | Qwen3 32B | 58.80% | $0.10 | | - | 40 | Claude 3 Haiku | Qwen3 32B | 56.02% | $0.15 | | - | 41 | Claude 3 Haiku | Claude 3 Haiku | 55.91% | $0.21 | | - | 42 | gpt-oss-20b | Claude Opus 4.6 | 55.86% | $1.04 | | - | 43 | Ministral 3 8B | Qwen3 32B | 55.02% | $0.11 | | - | 44 | Kimi K2.5 | Claude 3 Haiku | 54.90% | $0.34 | | - | 45 | Qwen3 32B | Qwen3 32B | 54.82% | $0.11 | | - | 46 | Kimi K2.5 | Qwen3 32B | 54.73% | $0.30 | | - | 47 | gpt-oss-20b | Claude Haiku 4.5 | 54.28% | $0.26 | | - | 48 | gpt-oss-20b | Ministral 3 8B | 54.25% | $0.05 | | - | 49 | Qwen3 Next 80B A3B | Qwen3 32B | 54.13% | $0.11 | | - | 50 | gpt-oss-20b | Qwen3 Next 80B A3B | 53.89% | $0.06 | | - | 51 | gpt-oss-20b | Claude 3 Haiku | 52.66% | $0.08 | | - | 52 | gpt-oss-20b | gpt-oss-120b | 52.17% | $0.06 | | - | 53 | Ministral 3 8B | Claude 3 Haiku | 51.33% | $0.16 | | - | 54 | gpt-oss-20b | Kimi K2.5 | 51.01% | $0.12 | | - | 55 | gpt-oss-20b | gpt-oss-20b | 50.09% | $0.05 | | - | 56 | Qwen3 Next 80B A3B | Claude 3 Haiku | 49.98% | $0.17 | | - | 57 | gpt-oss-20b | Qwen3 32B | 49.16% | $0.06 | | - | 58 | Qwen3 32B | Claude 3 Haiku | 48.77% | $0.16 | | - | 59 | Claude 3 Haiku | Claude Haiku 4.5 | 46.50% | $0.71 | | - | 60 | Claude Haiku 4.5 | Claude Opus 4.6 | 43.54% | $1.80 | | - | 61 | Claude Haiku 4.5 | gpt-oss-20b | 41.49% | $0.45 | | - | 62 | Claude Haiku 4.5 | gpt-oss-120b | 41.20% | $0.47 | | - | 63 | Claude Haiku 4.5 | Qwen3 
Next 80B A3B | 41.17% | $0.46 | | - | 64 | Claude Haiku 4.5 | Ministral 3 8B | 41.09% | $0.45 | | - | 65 | Claude Haiku 4.5 | Kimi K2.5 | 41.00% | $0.54 | | - | 66 | Kimi K2.5 | Claude Haiku 4.5 | 37.19% | $0.88 | | - | 67 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | $0.46 | | - | 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | $0.49 | | - | 69 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | $0.70 | | - | 70 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | $0.72 | | - | 71 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | $2.02 | role2_never_called | - | 72 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | $2.02 | role2_never_called | - | 73 | Claude Opus 4.6 | Qwen3 32B | 31.96% | $2.02 | role2_never_called | - | 74 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | $2.02 | role2_never_called | - | 75 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | $2.02 | role2_never_called | - | 76 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | $2.03 | role2_never_called | - | 77 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | $2.02 | role2_never_called | - | 78 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | $2.03 | role2_never_called | - | 79 | Claude Opus 4.6 | Claude Opus 4.6 | 31.71% | $2.02 | | - | 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | $0.69 | | - | 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | $0.79 | | + | Rank | Planner | Solver | Accuracy | Avg Latency (s) | Cost | Note | + |:-----|:--------|:-------|:---------|:----------------|:-----|:-----| + | 1 | Ministral 3 8B | Claude Opus 4.6 | 74.27% | 4.97 | $2.64 | | + | 2 | Claude 3 Haiku | Claude Opus 4.6 | 73.25% | 4.52 | $2.79 | | + | 3 | Qwen3 32B | Claude Opus 4.6 | 73.02% | 4.26 | $2.65 | | + | 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.10% | 4.67 | $2.67 | | + | 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.83% | 3.07 | $0.13 | | + | 6 | Qwen3 32B | gpt-oss-120b | 70.04% | 2.66 | $0.13 | | + | 7 | Kimi K2.5 | Claude Opus 4.6 | 69.96% | 4.49 | $2.43 | | + | 8 | Claude 3 Haiku | gpt-oss-120b | 69.86% | 3.21 | $0.17 | | + | 9 | 
Ministral 3 8B | gpt-oss-20b | 69.34% | 5.66 | $0.09 | | + | 10 | Claude 3 Haiku | Qwen3 Next 80B A3B | 69.27% | 3.00 | $0.16 | | + | 11 | Qwen3 Next 80B A3B | gpt-oss-20b | 68.89% | 2.82 | $0.09 | | + | 12 | Ministral 3 8B | gpt-oss-120b | 68.70% | 3.65 | $0.12 | | + | 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.15% | 2.69 | $0.11 | | + | 14 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.98% | 3.85 | $0.11 | | + | 15 | Qwen3 32B | Qwen3 Next 80B A3B | 67.53% | 3.51 | $0.11 | | + | 16 | Qwen3 32B | gpt-oss-20b | 66.95% | 2.48 | $0.09 | | + | 17 | Claude 3 Haiku | Ministral 3 8B | 65.98% | 3.73 | $0.14 | | + | 18 | Ministral 3 8B | Kimi K2.5 | 65.24% | 3.27 | $0.26 | | + | 19 | gpt-oss-120b | Qwen3 Next 80B A3B | 64.93% | 4.68 | $0.10 | | + | 20 | Ministral 3 8B | Ministral 3 8B | 64.89% | 3.55 | $0.09 | | + | 21 | Claude 3 Haiku | gpt-oss-20b | 64.79% | 2.90 | $0.13 | | + | 22 | Kimi K2.5 | gpt-oss-120b | 64.70% | 4.16 | $0.29 | | + | 23 | gpt-oss-120b | Claude Opus 4.6 | 64.59% | 4.57 | $1.61 | | + | 24 | gpt-oss-120b | Claude Haiku 4.5 | 64.11% | 4.26 | $0.38 | | + | 25 | Kimi K2.5 | Qwen3 Next 80B A3B | 63.99% | 4.39 | $0.30 | | + | 26 | Kimi K2.5 | Ministral 3 8B | 63.95% | 6.42 | $0.28 | | + | 27 | Claude 3 Haiku | Kimi K2.5 | 63.85% | 2.89 | $0.31 | | + | 28 | gpt-oss-120b | Ministral 3 8B | 63.70% | 7.37 | $0.09 | | + | 29 | Qwen3 Next 80B A3B | Kimi K2.5 | 63.69% | 2.89 | $0.27 | | + | 30 | Kimi K2.5 | gpt-oss-20b | 63.35% | 6.80 | $0.26 | | + | 31 | Qwen3 32B | Kimi K2.5 | 63.17% | 3.26 | $0.28 | | + | 32 | gpt-oss-120b | Claude 3 Haiku | 62.72% | 3.72 | $0.13 | | + | 33 | Kimi K2.5 | Kimi K2.5 | 62.28% | 4.56 | $0.44 | | + | 34 | gpt-oss-120b | gpt-oss-120b | 62.15% | 4.59 | $0.10 | | + | 35 | Qwen3 Next 80B A3B | Ministral 3 8B | 62.11% | 4.27 | $0.10 | | + | 36 | gpt-oss-120b | gpt-oss-20b | 61.51% | 2.71 | $0.08 | | + | 37 | Qwen3 32B | Ministral 3 8B | 61.17% | 2.89 | $0.09 | | + | 38 | gpt-oss-120b | Kimi K2.5 | 60.85% | 4.09 | $0.18 | | + | 39 | 
gpt-oss-120b | Qwen3 32B | 58.80% | 4.06 | $0.10 | | + | 40 | Claude 3 Haiku | Qwen3 32B | 56.02% | 2.87 | $0.15 | | + | 41 | Claude 3 Haiku | Claude 3 Haiku | 55.91% | 2.41 | $0.21 | | + | 42 | gpt-oss-20b | Claude Opus 4.6 | 55.86% | 2.84 | $1.04 | | + | 43 | Ministral 3 8B | Qwen3 32B | 55.02% | 3.63 | $0.11 | | + | 44 | Kimi K2.5 | Claude 3 Haiku | 54.90% | 3.42 | $0.34 | | + | 45 | Qwen3 32B | Qwen3 32B | 54.82% | 2.53 | $0.11 | | + | 46 | Kimi K2.5 | Qwen3 32B | 54.73% | 4.57 | $0.30 | | + | 47 | gpt-oss-20b | Claude Haiku 4.5 | 54.28% | 2.19 | $0.26 | | + | 48 | gpt-oss-20b | Ministral 3 8B | 54.25% | 4.35 | $0.05 | | + | 49 | Qwen3 Next 80B A3B | Qwen3 32B | 54.13% | 2.83 | $0.11 | | + | 50 | gpt-oss-20b | Qwen3 Next 80B A3B | 53.89% | 2.11 | $0.06 | | + | 51 | gpt-oss-20b | Claude 3 Haiku | 52.66% | 2.04 | $0.08 | | + | 52 | gpt-oss-20b | gpt-oss-120b | 52.17% | 2.11 | $0.06 | | + | 53 | Ministral 3 8B | Claude 3 Haiku | 51.33% | 4.10 | $0.16 | | + | 54 | gpt-oss-20b | Kimi K2.5 | 51.01% | 1.96 | $0.12 | | + | 55 | gpt-oss-20b | gpt-oss-20b | 50.09% | 2.12 | $0.05 | | + | 56 | Qwen3 Next 80B A3B | Claude 3 Haiku | 49.98% | 2.56 | $0.17 | | + | 57 | gpt-oss-20b | Qwen3 32B | 49.16% | 2.05 | $0.06 | | + | 58 | Qwen3 32B | Claude 3 Haiku | 48.77% | 2.23 | $0.16 | | + | 59 | Claude 3 Haiku | Claude Haiku 4.5 | 46.50% | 3.35 | $0.71 | | + | 60 | Claude Haiku 4.5 | Claude Opus 4.6 | 43.54% | 4.06 | $1.80 | | + | 61 | Claude Haiku 4.5 | gpt-oss-20b | 41.49% | 3.03 | $0.45 | | + | 62 | Claude Haiku 4.5 | gpt-oss-120b | 41.20% | 3.14 | $0.47 | | + | 63 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 41.17% | 2.95 | $0.46 | | + | 64 | Claude Haiku 4.5 | Ministral 3 8B | 41.09% | 3.75 | $0.45 | | + | 65 | Claude Haiku 4.5 | Kimi K2.5 | 41.00% | 6.16 | $0.54 | | + | 66 | Kimi K2.5 | Claude Haiku 4.5 | 37.19% | 4.23 | $0.88 | | + | 67 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | 2.89 | $0.46 | | + | 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | 2.63 | $0.49 | | + | 69 | 
Ministral 3 8B | Claude Haiku 4.5 | 32.42% | 4.14 | $0.70 | | + | 70 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | 3.92 | $0.72 | | + | 71 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | 4.72 | $2.02 | role2_never_called | + | 72 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | 4.72 | $2.02 | role2_never_called | + | 73 | Claude Opus 4.6 | Qwen3 32B | 31.96% | 4.72 | $2.02 | role2_never_called | + | 74 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | 4.72 | $2.02 | role2_never_called | + | 75 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | 4.60 | $2.02 | role2_never_called | + | 76 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | 4.57 | $2.03 | role2_never_called | + | 77 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | 4.22 | $2.02 | role2_never_called | + | 78 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | 4.16 | $2.03 | role2_never_called | + | 79 | Claude Opus 4.6 | Claude Opus 4.6 | 31.71% | 4.19 | $2.02 | | + | 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | 3.47 | $0.69 | | + | 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | 3.40 | $0.79 | | ### Selector Comparison @@ -267,129 +267,129 @@ Impact of thinking/reasoning budget on GPQA accuracy for Opus (adaptive effort) ### Top 15 Combos -| Rank | Answer Model | Critic Model | Accuracy | Cost | -|:-----|:-------------|:-------------|:---------|:-----| -| 1 | Claude Opus 4.6 | Claude Haiku 4.5 | **98.84%** | $6.19 | -| 2 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.82% | $5.77 | -| 3 | Claude Opus 4.6 | Ministral 3 8B | 98.72% | $5.26 | -| 4 | Claude Opus 4.6 | gpt-oss-20b | 98.28% | $5.93 | -| 5 | Claude Opus 4.6 | gpt-oss-120b | 97.77% | $6.30 | -| 6 | Claude Opus 4.6 | Qwen3 32B | 97.28% | $6.68 | -| 7 | Claude Opus 4.6 | Claude Opus 4.6 | 97.24% | $6.97 | -| 8 | Claude Opus 4.6 | Kimi K2.5 | 97.24% | $6.58 | -| 9 | Claude Opus 4.6 | Claude 3 Haiku | 95.95% | $5.37 | -| 10 | gpt-oss-20b | Claude Opus 4.6 | 94.57% | $0.97 | -| 11 | gpt-oss-20b | Kimi K2.5 | 94.57% | $0.26 | -| 12 | gpt-oss-20b | gpt-oss-20b | 94.54% | $0.08 | 
-| 13 | Claude Haiku 4.5 | Qwen3 32B | 94.50% | $2.51 | -| 14 | gpt-oss-20b | Claude Haiku 4.5 | 94.05% | $0.37 | -| 15 | gpt-oss-20b | gpt-oss-120b | 94.02% | $0.11 | +| Rank | Answer Model | Critic Model | Accuracy | Avg Latency (s) | Cost | +|:-----|:-------------|:-------------|:---------|:----------------|:-----| +| 1 | Claude Opus 4.6 | Claude Haiku 4.5 | **98.84%** | 16.15 | $6.19 | +| 2 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.82% | 14.30 | $5.77 | +| 3 | Claude Opus 4.6 | Ministral 3 8B | 98.72% | 14.03 | $5.26 | +| 4 | Claude Opus 4.6 | gpt-oss-20b | 98.28% | 16.50 | $5.93 | +| 5 | Claude Opus 4.6 | gpt-oss-120b | 97.77% | 15.40 | $6.30 | +| 6 | Claude Opus 4.6 | Qwen3 32B | 97.28% | 15.05 | $6.68 | +| 7 | Claude Opus 4.6 | Claude Opus 4.6 | 97.24% | 15.94 | $6.97 | +| 8 | Claude Opus 4.6 | Kimi K2.5 | 97.24% | 18.37 | $6.58 | +| 9 | Claude Opus 4.6 | Claude 3 Haiku | 95.95% | 14.85 | $5.37 | +| 10 | gpt-oss-20b | Claude Opus 4.6 | 94.57% | 6.81 | $0.97 | +| 11 | gpt-oss-20b | Kimi K2.5 | 94.57% | 12.45 | $0.26 | +| 12 | gpt-oss-20b | gpt-oss-20b | 94.54% | 4.04 | $0.08 | +| 13 | Claude Haiku 4.5 | Qwen3 32B | 94.50% | 12.68 | $2.51 | +| 14 | gpt-oss-20b | Claude Haiku 4.5 | 94.05% | 6.19 | $0.37 | +| 15 | gpt-oss-20b | gpt-oss-120b | 94.02% | 4.94 | $0.11 | ### Bottom 15 Combos -| Rank | Answer Model | Critic Model | Accuracy | Cost | -|:-----|:-------------|:-------------|:---------|:-----| -| 67 | Qwen3 Next 80B A3B | Kimi K2.5 | 75.50% | $0.79 | -| 68 | Qwen3 Next 80B A3B | gpt-oss-20b | 75.00% | $0.48 | -| 69 | Kimi K2.5 | gpt-oss-120b | 74.49% | $0.95 | -| 70 | Kimi K2.5 | gpt-oss-20b | 74.09% | $0.77 | -| 71 | Kimi K2.5 | Kimi K2.5 | 73.58% | $1.34 | -| 72 | Kimi K2.5 | Claude Opus 4.6 | 73.33% | $2.79 | -| 73 | Kimi K2.5 | Claude Haiku 4.5 | 73.20% | $1.36 | -| 74 | Claude 3 Haiku | gpt-oss-120b | 72.19% | $0.32 | -| 75 | Kimi K2.5 | Qwen3 32B | 72.16% | $0.92 | -| 76 | Claude 3 Haiku | gpt-oss-20b | 71.43% | $0.32 | -| 77 | Claude 3 Haiku | 
Qwen3 Next 80B A3B | 71.07% | $0.39 | -| 78 | Claude 3 Haiku | Kimi K2.5 | 71.01% | $0.53 | -| 79 | Claude 3 Haiku | Ministral 3 8B | 69.28% | $0.32 | -| 80 | Claude 3 Haiku | Qwen3 32B | 59.30% | $0.29 | -| 81 | Claude 3 Haiku | Claude 3 Haiku | 54.37% | $0.30 | +| Rank | Answer Model | Critic Model | Accuracy | Avg Latency (s) | Cost | +|:-----|:-------------|:-------------|:---------|:----------------|:-----| +| 67 | Qwen3 Next 80B A3B | Kimi K2.5 | 75.50% | 36.37 | $0.79 | +| 68 | Qwen3 Next 80B A3B | gpt-oss-20b | 75.00% | 32.70 | $0.48 | +| 69 | Kimi K2.5 | gpt-oss-120b | 74.49% | 32.23 | $0.95 | +| 70 | Kimi K2.5 | gpt-oss-20b | 74.09% | 25.65 | $0.77 | +| 71 | Kimi K2.5 | Kimi K2.5 | 73.58% | 44.39 | $1.34 | +| 72 | Kimi K2.5 | Claude Opus 4.6 | 73.33% | 28.62 | $2.79 | +| 73 | Kimi K2.5 | Claude Haiku 4.5 | 73.20% | 26.98 | $1.36 | +| 74 | Claude 3 Haiku | gpt-oss-120b | 72.19% | 8.39 | $0.32 | +| 75 | Kimi K2.5 | Qwen3 32B | 72.16% | 30.32 | $0.92 | +| 76 | Claude 3 Haiku | gpt-oss-20b | 71.43% | 8.42 | $0.32 | +| 77 | Claude 3 Haiku | Qwen3 Next 80B A3B | 71.07% | 17.12 | $0.39 | +| 78 | Claude 3 Haiku | Kimi K2.5 | 71.01% | 14.23 | $0.53 | +| 79 | Claude 3 Haiku | Ministral 3 8B | 69.28% | 12.40 | $0.32 | +| 80 | Claude 3 Haiku | Qwen3 32B | 59.30% | 6.29 | $0.29 | +| 81 | Claude 3 Haiku | Claude 3 Haiku | 54.37% | 7.28 | $0.30 | ??? 
note "Full 81 Combo Results" - | Rank | Answer Model | Critic Model | Accuracy | Cost | Note | - |:-----|:-------------|:-------------|:---------|:-----|:-----| - | 1 | Claude Opus 4.6 | Claude Haiku 4.5 | 98.84% | $6.19 | | - | 2 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.82% | $5.77 | | - | 3 | Claude Opus 4.6 | Ministral 3 8B | 98.72% | $5.26 | | - | 4 | Claude Opus 4.6 | gpt-oss-20b | 98.28% | $5.93 | | - | 5 | Claude Opus 4.6 | gpt-oss-120b | 97.77% | $6.30 | | - | 6 | Claude Opus 4.6 | Qwen3 32B | 97.28% | $6.68 | | - | 7 | Claude Opus 4.6 | Claude Opus 4.6 | 97.24% | $6.97 | | - | 8 | Claude Opus 4.6 | Kimi K2.5 | 97.24% | $6.58 | | - | 9 | Claude Opus 4.6 | Claude 3 Haiku | 95.95% | $5.37 | | - | 10 | gpt-oss-20b | Claude Opus 4.6 | 94.57% | $0.97 | | - | 11 | gpt-oss-20b | Kimi K2.5 | 94.57% | $0.26 | | - | 12 | gpt-oss-20b | gpt-oss-20b | 94.54% | $0.08 | | - | 13 | Claude Haiku 4.5 | Qwen3 32B | 94.50% | $2.51 | | - | 14 | gpt-oss-20b | Claude Haiku 4.5 | 94.05% | $0.37 | | - | 15 | gpt-oss-20b | gpt-oss-120b | 94.02% | $0.11 | | - | 16 | gpt-oss-20b | Qwen3 Next 80B A3B | 94.02% | $0.14 | | - | 17 | Claude Haiku 4.5 | Claude Haiku 4.5 | 94.00% | $2.59 | | - | 18 | gpt-oss-20b | Ministral 3 8B | 93.99% | $0.10 | | - | 19 | gpt-oss-120b | Claude Opus 4.6 | 93.81% | $1.25 | | - | 20 | Claude Haiku 4.5 | gpt-oss-20b | 93.50% | $2.20 | | - | 21 | Claude Haiku 4.5 | Claude Opus 4.6 | 93.50% | $3.77 | | - | 22 | Claude Haiku 4.5 | Ministral 3 8B | 93.50% | $2.57 | | - | 23 | Claude Haiku 4.5 | Kimi K2.5 | 93.50% | $2.60 | | - | 24 | gpt-oss-20b | Qwen3 32B | 93.48% | $0.09 | | - | 25 | gpt-oss-20b | Claude 3 Haiku | 93.44% | $0.15 | | - | 26 | gpt-oss-120b | Ministral 3 8B | 93.26% | $0.19 | | - | 27 | gpt-oss-120b | Qwen3 32B | 93.26% | $0.16 | | - | 28 | Claude Haiku 4.5 | gpt-oss-120b | 93.00% | $2.90 | | - | 29 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 93.00% | $7.81 | | - | 30 | gpt-oss-120b | Claude Haiku 4.5 | 92.82% | $0.47 | | - | 31 | gpt-oss-120b | 
gpt-oss-20b | 92.78% | $0.18 | | - | 32 | gpt-oss-120b | gpt-oss-120b | 92.78% | $0.19 | | - | 33 | gpt-oss-120b | Kimi K2.5 | 92.78% | $0.32 | | - | 34 | gpt-oss-120b | Qwen3 Next 80B A3B | 92.78% | $0.23 | | - | 35 | gpt-oss-120b | Claude 3 Haiku | 92.75% | $0.20 | | - | 36 | Claude Haiku 4.5 | Claude 3 Haiku | 92.50% | $2.46 | | - | 37 | Claude 3 Haiku | Claude Opus 4.6 | 89.66% | $2.26 | | - | 38 | Qwen3 32B | Qwen3 Next 80B A3B | 88.83% | $0.24 | | - | 39 | Ministral 3 8B | Claude 3 Haiku | 88.15% | $0.05 | | - | 40 | Qwen3 32B | gpt-oss-120b | 87.83% | $0.47 | | - | 41 | Ministral 3 8B | Qwen3 Next 80B A3B | 87.82% | $0.03 | | - | 42 | Qwen3 32B | Claude Opus 4.6 | 87.56% | $3.43 | | - | 43 | Ministral 3 8B | Kimi K2.5 | 87.04% | $0.09 | | - | 44 | Ministral 3 8B | gpt-oss-120b | 86.63% | $0.07 | | - | 45 | Claude 3 Haiku | Claude Haiku 4.5 | 86.55% | $0.69 | | - | 46 | Ministral 3 8B | Ministral 3 8B | 86.52% | $0.03 | | - | 47 | Ministral 3 8B | Claude Opus 4.6 | 86.47% | $0.93 | | - | 48 | Qwen3 32B | Claude Haiku 4.5 | 86.46% | $0.90 | | - | 49 | Ministral 3 8B | Claude Haiku 4.5 | 86.23% | $0.30 | | - | 50 | Ministral 3 8B | gpt-oss-20b | 86.13% | $0.05 | | - | 51 | Qwen3 32B | Ministral 3 8B | 86.10% | $0.21 | | - | 52 | Qwen3 32B | Kimi K2.5 | 85.94% | $0.78 | | - | 53 | Qwen3 32B | gpt-oss-20b | 85.86% | $0.49 | | - | 54 | Ministral 3 8B | Qwen3 32B | 85.80% | $0.04 | | - | 55 | Qwen3 32B | Qwen3 32B | 84.82% | $0.62 | | - | 56 | Kimi K2.5 | Claude 3 Haiku | 80.41% | $0.98 | | - | 57 | Qwen3 32B | Claude 3 Haiku | 80.00% | $0.67 | | - | 58 | Qwen3 Next 80B A3B | Claude 3 Haiku | 80.00% | $0.59 | | - | 59 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 78.00% | $2.96 | | - | 60 | Kimi K2.5 | Ministral 3 8B | 77.84% | $0.97 | | - | 61 | Kimi K2.5 | Qwen3 Next 80B A3B | 77.20% | $1.00 | | - | 62 | Qwen3 Next 80B A3B | Ministral 3 8B | 77.00% | $0.55 | | - | 63 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 76.50% | $1.21 | | - | 64 | Qwen3 Next 80B A3B | 
gpt-oss-120b | 76.50% | $0.52 | | - | 65 | Qwen3 Next 80B A3B | Qwen3 32B | 76.00% | $0.42 | | - | 66 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 76.00% | $0.54 | | - | 67 | Qwen3 Next 80B A3B | Kimi K2.5 | 75.50% | $0.79 | | - | 68 | Qwen3 Next 80B A3B | gpt-oss-20b | 75.00% | $0.48 | | - | 69 | Kimi K2.5 | gpt-oss-120b | 74.49% | $0.95 | | - | 70 | Kimi K2.5 | gpt-oss-20b | 74.09% | $0.77 | | - | 71 | Kimi K2.5 | Kimi K2.5 | 73.58% | $1.34 | | - | 72 | Kimi K2.5 | Claude Opus 4.6 | 73.33% | $2.79 | | - | 73 | Kimi K2.5 | Claude Haiku 4.5 | 73.20% | $1.36 | | - | 74 | Claude 3 Haiku | gpt-oss-120b | 72.19% | $0.32 | | - | 75 | Kimi K2.5 | Qwen3 32B | 72.16% | $0.92 | | - | 76 | Claude 3 Haiku | gpt-oss-20b | 71.43% | $0.32 | | - | 77 | Claude 3 Haiku | Qwen3 Next 80B A3B | 71.07% | $0.39 | | - | 78 | Claude 3 Haiku | Kimi K2.5 | 71.01% | $0.53 | | - | 79 | Claude 3 Haiku | Ministral 3 8B | 69.28% | $0.32 | | - | 80 | Claude 3 Haiku | Qwen3 32B | 59.30% | $0.29 | | - | 81 | Claude 3 Haiku | Claude 3 Haiku | 54.37% | $0.30 | | + | Rank | Answer Model | Critic Model | Accuracy | Avg Latency (s) | Cost | Note | + |:-----|:-------------|:-------------|:---------|:----------------|:-----|:-----| + | 1 | Claude Opus 4.6 | Claude Haiku 4.5 | 98.84% | 16.15 | $6.19 | | + | 2 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.82% | 14.30 | $5.77 | | + | 3 | Claude Opus 4.6 | Ministral 3 8B | 98.72% | 14.03 | $5.26 | | + | 4 | Claude Opus 4.6 | gpt-oss-20b | 98.28% | 16.50 | $5.93 | | + | 5 | Claude Opus 4.6 | gpt-oss-120b | 97.77% | 15.40 | $6.30 | | + | 6 | Claude Opus 4.6 | Qwen3 32B | 97.28% | 15.05 | $6.68 | | + | 7 | Claude Opus 4.6 | Claude Opus 4.6 | 97.24% | 15.94 | $6.97 | | + | 8 | Claude Opus 4.6 | Kimi K2.5 | 97.24% | 18.37 | $6.58 | | + | 9 | Claude Opus 4.6 | Claude 3 Haiku | 95.95% | 14.85 | $5.37 | | + | 10 | gpt-oss-20b | Claude Opus 4.6 | 94.57% | 6.81 | $0.97 | | + | 11 | gpt-oss-20b | Kimi K2.5 | 94.57% | 12.45 | $0.26 | | + | 12 | gpt-oss-20b | gpt-oss-20b | 
94.54% | 4.04 | $0.08 | | + | 13 | Claude Haiku 4.5 | Qwen3 32B | 94.50% | 12.68 | $2.51 | | + | 14 | gpt-oss-20b | Claude Haiku 4.5 | 94.05% | 6.19 | $0.37 | | + | 15 | gpt-oss-20b | gpt-oss-120b | 94.02% | 4.94 | $0.11 | | + | 16 | gpt-oss-20b | Qwen3 Next 80B A3B | 94.02% | 8.67 | $0.14 | | + | 17 | Claude Haiku 4.5 | Claude Haiku 4.5 | 94.00% | 14.31 | $2.59 | | + | 18 | gpt-oss-20b | Ministral 3 8B | 93.99% | 8.27 | $0.10 | | + | 19 | gpt-oss-120b | Claude Opus 4.6 | 93.81% | 9.10 | $1.25 | | + | 20 | Claude Haiku 4.5 | gpt-oss-20b | 93.50% | 12.51 | $2.20 | | + | 21 | Claude Haiku 4.5 | Claude Opus 4.6 | 93.50% | 15.82 | $3.77 | | + | 22 | Claude Haiku 4.5 | Ministral 3 8B | 93.50% | 14.70 | $2.57 | | + | 23 | Claude Haiku 4.5 | Kimi K2.5 | 93.50% | 17.50 | $2.60 | | + | 24 | gpt-oss-20b | Qwen3 32B | 93.48% | 4.30 | $0.09 | | + | 25 | gpt-oss-20b | Claude 3 Haiku | 93.44% | 6.10 | $0.15 | | + | 26 | gpt-oss-120b | Ministral 3 8B | 93.26% | 10.42 | $0.19 | | + | 27 | gpt-oss-120b | Qwen3 32B | 93.26% | 5.53 | $0.16 | | + | 28 | Claude Haiku 4.5 | gpt-oss-120b | 93.00% | 14.65 | $2.90 | | + | 29 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 93.00% | 20.98 | $7.81 | | + | 30 | gpt-oss-120b | Claude Haiku 4.5 | 92.82% | 7.77 | $0.47 | | + | 31 | gpt-oss-120b | gpt-oss-20b | 92.78% | 6.45 | $0.18 | | + | 32 | gpt-oss-120b | gpt-oss-120b | 92.78% | 6.94 | $0.19 | | + | 33 | gpt-oss-120b | Kimi K2.5 | 92.78% | 12.09 | $0.32 | | + | 34 | gpt-oss-120b | Qwen3 Next 80B A3B | 92.78% | 10.98 | $0.23 | | + | 35 | gpt-oss-120b | Claude 3 Haiku | 92.75% | 6.42 | $0.20 | | + | 36 | Claude Haiku 4.5 | Claude 3 Haiku | 92.50% | 13.43 | $2.46 | | + | 37 | Claude 3 Haiku | Claude Opus 4.6 | 89.66% | 13.32 | $2.26 | | + | 38 | Qwen3 32B | Qwen3 Next 80B A3B | 88.83% | 8.02 | $0.24 | | + | 39 | Ministral 3 8B | Claude 3 Haiku | 88.15% | 10.24 | $0.05 | | + | 40 | Qwen3 32B | gpt-oss-120b | 87.83% | 7.11 | $0.47 | | + | 41 | Ministral 3 8B | Qwen3 Next 80B A3B | 87.82% | 9.22 | $0.03 
| | + | 42 | Qwen3 32B | Claude Opus 4.6 | 87.56% | 12.33 | $3.43 | | + | 43 | Ministral 3 8B | Kimi K2.5 | 87.04% | 14.43 | $0.09 | | + | 44 | Ministral 3 8B | gpt-oss-120b | 86.63% | 10.58 | $0.07 | | + | 45 | Claude 3 Haiku | Claude Haiku 4.5 | 86.55% | 9.32 | $0.69 | | + | 46 | Ministral 3 8B | Ministral 3 8B | 86.52% | 7.29 | $0.03 | | + | 47 | Ministral 3 8B | Claude Opus 4.6 | 86.47% | 11.46 | $0.93 | | + | 48 | Qwen3 32B | Claude Haiku 4.5 | 86.46% | 7.47 | $0.90 | | + | 49 | Ministral 3 8B | Claude Haiku 4.5 | 86.23% | 11.66 | $0.30 | | + | 50 | Ministral 3 8B | gpt-oss-20b | 86.13% | 12.33 | $0.05 | | + | 51 | Qwen3 32B | Ministral 3 8B | 86.10% | 17.57 | $0.21 | | + | 52 | Qwen3 32B | Kimi K2.5 | 85.94% | 13.50 | $0.78 | | + | 53 | Qwen3 32B | gpt-oss-20b | 85.86% | 6.43 | $0.49 | | + | 54 | Ministral 3 8B | Qwen3 32B | 85.80% | 9.41 | $0.04 | | + | 55 | Qwen3 32B | Qwen3 32B | 84.82% | 5.98 | $0.62 | | + | 56 | Kimi K2.5 | Claude 3 Haiku | 80.41% | 35.09 | $0.98 | | + | 57 | Qwen3 32B | Claude 3 Haiku | 80.00% | 7.86 | $0.67 | | + | 58 | Qwen3 Next 80B A3B | Claude 3 Haiku | 80.00% | 35.17 | $0.59 | | + | 59 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 78.00% | 31.01 | $2.96 | | + | 60 | Kimi K2.5 | Ministral 3 8B | 77.84% | 40.79 | $0.97 | | + | 61 | Kimi K2.5 | Qwen3 Next 80B A3B | 77.20% | 37.64 | $1.00 | | + | 62 | Qwen3 Next 80B A3B | Ministral 3 8B | 77.00% | 38.55 | $0.55 | | + | 63 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 76.50% | 32.33 | $1.21 | | + | 64 | Qwen3 Next 80B A3B | gpt-oss-120b | 76.50% | 34.72 | $0.52 | | + | 65 | Qwen3 Next 80B A3B | Qwen3 32B | 76.00% | 30.64 | $0.42 | | + | 66 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 76.00% | 36.44 | $0.54 | | + | 67 | Qwen3 Next 80B A3B | Kimi K2.5 | 75.50% | 36.37 | $0.79 | | + | 68 | Qwen3 Next 80B A3B | gpt-oss-20b | 75.00% | 32.70 | $0.48 | | + | 69 | Kimi K2.5 | gpt-oss-120b | 74.49% | 32.23 | $0.95 | | + | 70 | Kimi K2.5 | gpt-oss-20b | 74.09% | 25.65 | $0.77 | | + | 71 | Kimi K2.5 | Kimi 
K2.5 | 73.58% | 44.39 | $1.34 | | + | 72 | Kimi K2.5 | Claude Opus 4.6 | 73.33% | 28.62 | $2.79 | | + | 73 | Kimi K2.5 | Claude Haiku 4.5 | 73.20% | 26.98 | $1.36 | | + | 74 | Claude 3 Haiku | gpt-oss-120b | 72.19% | 8.39 | $0.32 | | + | 75 | Kimi K2.5 | Qwen3 32B | 72.16% | 30.32 | $0.92 | | + | 76 | Claude 3 Haiku | gpt-oss-20b | 71.43% | 8.42 | $0.32 | | + | 77 | Claude 3 Haiku | Qwen3 Next 80B A3B | 71.07% | 17.12 | $0.39 | | + | 78 | Claude 3 Haiku | Kimi K2.5 | 71.01% | 14.23 | $0.53 | | + | 79 | Claude 3 Haiku | Ministral 3 8B | 69.28% | 12.40 | $0.32 | | + | 80 | Claude 3 Haiku | Qwen3 32B | 59.30% | 6.29 | $0.29 | | + | 81 | Claude 3 Haiku | Claude 3 Haiku | 54.37% | 7.28 | $0.30 | | ### Selector Comparison From edb84ecfe319a03367351680b77b1e5fb437a992 Mon Sep 17 00:00:00 2001 From: Sripad Karne Date: Thu, 2 Apr 2026 23:38:41 -0700 Subject: [PATCH 4/5] Fix HotpotQA Bottom 15 rank offset (was off by 1 vs Full 81) Kimi + Haiku 4.5 is rank 66, not 67. Bottom 15 now correctly starts at rank 67 and matches the Full 81 table. 
Co-Authored-By: Claude Opus 4.6 --- docs/benchmark-results/index.md | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/benchmark-results/index.md b/docs/benchmark-results/index.md index 2b4009a..e4932f7 100644 --- a/docs/benchmark-results/index.md +++ b/docs/benchmark-results/index.md @@ -141,19 +141,19 @@ Impact of thinking/reasoning budget on GPQA accuracy for Opus (adaptive effort) | Rank | Planner | Solver | Accuracy | Avg Latency (s) | Cost | |:-----|:--------|:-------|:---------|:----------------|:-----| -| 67 | Kimi K2.5 | Claude Haiku 4.5 | 37.19% | 4.23 | $0.88 | -| 68 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | 2.89 | $0.46 | -| 69 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | 2.63 | $0.49 | -| 70 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | 4.14 | $0.70 | -| 71 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | 3.92 | $0.72 | -| 72 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | 4.72 | $2.02 | -| 73 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | 4.72 | $2.02 | -| 74 | Claude Opus 4.6 | Qwen3 32B | 31.96% | 4.72 | $2.02 | -| 75 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | 4.72 | $2.02 | -| 76 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | 4.60 | $2.02 | -| 77 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | 4.57 | $2.03 | -| 78 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | 4.22 | $2.02 | -| 79 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | 4.16 | $2.03 | +| 67 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | 2.89 | $0.46 | +| 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | 2.63 | $0.49 | +| 69 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | 4.14 | $0.70 | +| 70 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | 3.92 | $0.72 | +| 71 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | 4.72 | $2.02 | +| 72 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | 4.72 | $2.02 | +| 73 | Claude Opus 4.6 | Qwen3 32B | 31.96% | 4.72 | $2.02 | +| 74 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | 4.72 | $2.02 | +| 75 | Claude Opus 4.6 | gpt-oss-120b | 
31.95% | 4.60 | $2.02 | +| 76 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | 4.57 | $2.03 | +| 77 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | 4.22 | $2.02 | +| 78 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | 4.16 | $2.03 | +| 79 | Claude Opus 4.6 | Claude Opus 4.6 | 31.71% | 4.19 | $2.02 | | 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | 3.47 | $0.69 | | 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | 3.40 | $0.79 | From b57a3aa75de75d690c1270650eaa73805a1a5aa1 Mon Sep 17 00:00:00 2001 From: Sripad Karne Date: Thu, 2 Apr 2026 23:44:25 -0700 Subject: [PATCH 5/5] Sort all selector comparison tables by Mean Accuracy descending Co-Authored-By: Claude Opus 4.6 --- docs/benchmark-results/index.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/benchmark-results/index.md b/docs/benchmark-results/index.md index e4932f7..e18a3d8 100644 --- a/docs/benchmark-results/index.md +++ b/docs/benchmark-results/index.md @@ -53,8 +53,8 @@ All models accessed via AWS Bedrock Application Inference Profiles (on-demand pr |:---------|:----------|:--------------|:------------|:-----|:--------| | Brute Force | 100% | 74.75% | 1,782 | $4.71 | -- | | LM Proposal | 100% | 74.75% | 198 | $2.47 | 48% | -| Arm Elimination | 94% | 74.10% | 666 | $3.57 | **24%** | | Hill Climbing | 90% | 74.55% | 1,501 | $4.03 | 14% | +| Arm Elimination | 94% | 74.10% | 666 | $3.57 | **24%** | | Epsilon LUCB | 72% | 73.14% | 380 | $2.51 | 47% | | Bayesian Opt | 56% | 72.43% | 990 | $2.59 | 45% | | Random Search | 36% | 68.57% | 594 | $1.73 | 63% | @@ -104,10 +104,10 @@ Impact of thinking/reasoning budget on GPQA accuracy for Opus (adaptive effort) |:---------|:----------|:--------------|:------------|:-----|:--------| | Brute Force | 100% | 70.00% | 1,800 | $84.80 | -- | | Hill Climbing | 100% | 70.00% | 1,664 | $72.12 | 15% | +| Epsilon LUCB | 28% | 69.90% | 399 | $40.03 | 53% | | Arm Elimination | 88% | 69.37% | 912 | $74.39 | **12%** | | Bayesian Opt | 44% | 69.27% | 1,000 | 
$50.64 | 40% | | Random Search | 36% | 67.13% | 600 | $31.39 | 63% | -| Epsilon LUCB | 28% | 69.90% | 399 | $40.03 | 53% | | Threshold SE | 10% | 58.19% | 186 | $18.82 | 78% | | LM Proposal | 0% | 44.03% | 200 | $3.39 | 96% | @@ -251,11 +251,11 @@ Impact of thinking/reasoning budget on GPQA accuracy for Opus (adaptive effort) | Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings | |:---------|:----------|:--------------|:------------|:-----|:--------| | Brute Force | 100% | 74.27% | 16,168 | $51.90 | -- | +| Bayesian Opt | 8% | 73.33% | 3,996 | $12.29 | 76% | | Arm Elimination | 86% | 73.19% | 4,283 | $16.92 | **67%** | | Hill Climbing | 52% | 73.13% | 4,635 | $19.39 | 63% | | Random Search | 30% | 72.25% | 4,192 | $13.37 | 74% | | Epsilon LUCB | 10% | 69.71% | 478 | $1.75 | 97% | -| Bayesian Opt | 8% | 73.33% | 3,996 | $12.29 | 76% | | Threshold SE | 4% | 65.42% | 1,642 | $6.45 | 88% | | LM Proposal | 0% | 34.13% | 200 | $1.84 | 96% | @@ -400,6 +400,6 @@ Impact of thinking/reasoning budget on GPQA accuracy for Opus (adaptive effort) | Hill Climbing | 80% | 98.76% | 3,926 | $54.22 | 56% | | Random Search | 28% | 98.17% | 3,880 | $31.77 | 74% | | Epsilon LUCB | 4% | 96.99% | 447 | $6.10 | 95% | -| Bayesian Opt | 4% | 95.41% | 3,666 | $35.56 | 71% | | LM Proposal | 0% | 95.82% | 158 | $5.61 | 95% | +| Bayesian Opt | 4% | 95.41% | 3,666 | $35.56 | 71% | | Threshold SE | 0% | 74.52% | 1,355 | $6.90 | 94% |