diff --git a/docs/benchmark-results/index.md b/docs/benchmark-results/index.md
index 77725f5..e18a3d8 100644
--- a/docs/benchmark-results/index.md
+++ b/docs/benchmark-results/index.md
@@ -22,10 +22,10 @@ All models accessed via AWS Bedrock Application Inference Profiles (on-demand pr
 
 | Benchmark | Tuple | Samples | Combos | Best Combo | Accuracy | BF Cost | Arm Elim Savings |
 |:----------|:------|:--------|:-------|:-----------|:---------|:--------|:-----------------|
-| GPQA Diamond | 1-tuple | 198 | 9 | Claude Opus 4.6 | **80.30%** | $4.13 | 49% |
-| BFCL Multi-Turn | 1-tuple | 200 | 9 | Claude Opus 4.6 | **72.00%** | $85.42 | 11% |
-| HotpotQA | 2-tuple | 199 | 81 | planner=Ministral 3 8B + solver=Claude Opus 4.6 | **74.78%** | $51.48 | 64% |
-| MathQA | 2-tuple | 200 | 81 | answer=Claude Opus 4.6 + critic=Qwen3 Next 80B A3B | **98.83%** | $113.01 | 46% |
+| GPQA Diamond | 1-tuple | 198 | 9 | Claude Opus 4.6 | **74.75%** | $4.71 | 24% |
+| BFCL Multi-Turn | 1-tuple | 200 | 9 | Kimi K2.5 (tied with Opus, Qwen3 Next) | **70.00%** | $84.80 | 12% |
+| HotpotQA | 2-tuple | 200 | 81 | planner=Ministral 3 8B + solver=Claude Opus 4.6 | **74.27%** | $51.90 | 67% |
+| MathQA | 2-tuple | 200 | 81 | answer=Claude Opus 4.6 + critic=Claude Haiku 4.5 | **98.84%** | $123.87 | 58% |
 
 ---
 
@@ -37,28 +37,43 @@ All models accessed via AWS Bedrock Application Inference Profiles (on-demand pr
 
 | Rank | Model | Accuracy | Avg Latency (s) | Cost |
 |:-----|:------|:---------|:-----------------|:-----|
-| 1 | Claude Opus 4.6 | **80.30%** | 86.4 | $2.43 |
-| 2 | Kimi K2.5 | 72.02% | 97.2 | $0.72 |
-| 3 | gpt-oss-120b | 68.02% | 88.5 | $0.19 |
-| 4 | Claude Haiku 4.5 | 60.51% | 83.1 | $0.51 |
-| 5 | gpt-oss-20b | 52.02% | 85.7 | $0.13 |
-| 6 | Qwen3 Next 80B A3B | 51.04% | 90.0 | $0.06 |
-| 7 | Qwen3 32B | 46.67% | 88.3 | $0.04 |
-| 8 | Claude 3 Haiku | 37.31% | 80.1 | $0.06 |
-| 9 | Ministral 3 8B | 36.87% | 84.7 | $0.01 |
+| 1 | Claude Opus 4.6 | **74.75%** | 9.16 | $2.48 |
+| 2 | Kimi K2.5 | 72.73% | 16.41 | $1.13 |
+| 3 | gpt-oss-120b | 68.18% | 6.46 | $0.20 |
+| 4 | Claude Haiku 4.5 | 59.60% | 3.70 | $0.51 |
+| 5 | Qwen3 Next 80B A3B | 51.01% | 10.33 | $0.14 |
+| 6 | gpt-oss-20b | 50.00% | 6.21 | $0.14 |
+| 7 | Qwen3 32B | 46.97% | 1.54 | $0.08 |
+| 8 | Ministral 3 8B | 36.87% | 0.25 | $0.00 |
+| 9 | Claude 3 Haiku | 34.85% | 1.79 | $0.06 |
 
 ### Selector Comparison
 
 | Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings |
 |:---------|:----------|:--------------|:------------|:-----|:--------|
-| Brute Force | 100% | 80.30% | 1,759 | $4.13 | -- |
-| LM Proposal | 100% | 80.30% | 198 | $2.43 | 41% |
-| Arm Elimination | 98% | 80.14% | 444 | $2.10 | **49%** |
-| Hill Climbing | 92% | 79.64% | 1,552 | $3.75 | 9% |
-| Epsilon LUCB | 90% | 79.47% | 361 | $2.32 | 44% |
-| Bayesian Opt | 56% | 75.41% | 976 | $2.32 | 44% |
-| Random Search | 36% | 70.53% | 587 | $1.53 | 63% |
-| Threshold SE | 36% | 65.88% | 294 | $0.26 | 94% |
+| Brute Force | 100% | 74.75% | 1,782 | $4.71 | -- |
+| LM Proposal | 100% | 74.75% | 198 | $2.47 | 48% |
+| Hill Climbing | 90% | 74.55% | 1,501 | $4.03 | 14% |
+| Arm Elimination | 94% | 74.10% | 666 | $3.57 | **24%** |
+| Epsilon LUCB | 72% | 73.14% | 380 | $2.51 | 47% |
+| Bayesian Opt | 56% | 72.43% | 990 | $2.59 | 45% |
+| Random Search | 36% | 68.57% | 594 | $1.73 | 63% |
+| Threshold SE | 16% | 57.48% | 252 | $1.80 | 62% |
+
+### Thinking Effort Ablation
+
+Impact of the thinking/reasoning budget on GPQA accuracy for Opus (adaptive `effort`) and Haiku 4.5 (`budget_tokens`). Baseline "none" rows use the brute-force results.
+
+| Model | Effort/Budget Tokens | Accuracy | Cost/Sample | Server Latency/Sample (s) |
+|:------|:---------------------|:---------|:------------|:--------------------------|
+| Opus | high | 83.90% | $0.113 | 70.4 |
+| Opus | medium | 79.30% | $0.0341 | 23.9 |
+| Opus | none | 74.75% | $0.0125 | 9.16 |
+| Haiku 4.5 | 16K | 71.20% | $0.0192 | 30.6 |
+| Haiku 4.5 | 32K | 71.20% | $0.0361 | 57.4 |
+| Opus | low | 61.60% | $0.00302 | 3.06 |
+| Haiku 4.5 | 5K | 60.10% | $0.00925 | 15.0 |
+| Haiku 4.5 | none | 59.60% | $0.0026 | 3.70 |
 
 ---
 
@@ -73,176 +88,176 @@ All models accessed via AWS Bedrock Application Inference Profiles (on-demand pr
 
 | Rank | Model | Accuracy | Avg Latency (s) | Cost |
 |:-----|:------|:---------|:-----------------|:-----|
-| 1 | Claude Opus 4.6 | **72.00%** | 222.8 | $60.78 |
-| 2 | Qwen3 Next 80B A3B | 71.00% | 226.3 | $1.87 |
-| 3 | Kimi K2.5 | 68.50% | 228.2 | $3.86 |
-| 4 | Claude Haiku 4.5 | 65.00% | 208.7 | $11.97 |
-| 5 | gpt-oss-120b | 61.00% | 208.9 | $1.13 |
-| 6 | Qwen3 32B | 50.00% | 211.3 | $0.97 |
-| 7 | Claude 3 Haiku | 45.00% | 205.2 | $3.42 |
-| 8 | gpt-oss-20b | 39.00% | 204.2 | $0.44 |
-| 9 | Ministral 3 8B | 33.50% | 213.8 | $0.98 |
+| 1 | Kimi K2.5 | **70.00%** | 21.30 | $3.86 |
+| 2 | Claude Opus 4.6 | 70.00% | 42.35 | $60.14 |
+| 3 | Qwen3 Next 80B A3B | 70.00% | 60.54 | $1.90 |
+| 4 | Claude Haiku 4.5 | 65.00% | 20.90 | $11.98 |
+| 5 | gpt-oss-120b | 58.50% | 20.01 | $1.16 |
+| 6 | Qwen3 32B | 47.00% | 10.78 | $1.00 |
+| 7 | Claude 3 Haiku | 43.50% | 17.96 | $3.42 |
+| 8 | gpt-oss-20b | 42.00% | 10.03 | $0.42 |
+| 9 | Ministral 3 8B | 34.00% | 29.03 | $0.92 |
 
 ### Selector Comparison
 
 | Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings |
 |:---------|:----------|:--------------|:------------|:-----|:--------|
-| Brute Force | 100% | 72.00% | 1,800 | $85.42 | -- |
-| Arm Elimination | 100% | 72.00% | 922 | $76.33 | **11%** |
-| Hill Climbing | 94% | 71.94% | 1,652 | $80.38 | 6% |
-| Epsilon LUCB | 60% | 71.33% | 407 | $42.60 | 50% |
-| Bayesian Opt | 56% | 70.61% | 1,000 | $50.98 | 40% |
-| Random Search | 36% | 67.99% | 600 | $31.62 | 63% |
-| Threshold SE | 12% | 57.52% | 285 | $6.45 | 92% |
-| LM Proposal | 0% | 45.00% | 200 | $3.42 | 96% |
+| Brute Force | 100% | 70.00% | 1,800 | $84.80 | -- |
+| Hill Climbing | 100% | 70.00% | 1,664 | $72.12 | 15% |
+| Epsilon LUCB | 28% | 69.90% | 399 | $40.03 | 53% |
+| Arm Elimination | 88% | 69.37% | 912 | $74.39 | **12%** |
+| Bayesian Opt | 44% | 69.27% | 1,000 | $50.64 | 40% |
+| Random Search | 36% | 67.13% | 600 | $31.39 | 63% |
+| Threshold SE | 10% | 58.19% | 186 | $18.82 | 78% |
+| LM Proposal | 0% | 44.03% | 200 | $3.39 | 96% |
 
 ---
 
 ## HotpotQA
 
-**Multi-hop question answering** — 199 samples from the HotpotQA distractor setting. Two-agent architecture: a **planner** proposes search steps, and a **solver** executes them with tool access. 81 model combinations (9 planners x 9 solvers).
+**Multi-hop question answering** — 200 samples from the HotpotQA distractor setting. Two-agent architecture: a **planner** proposes search steps, and a **solver** executes them with tool access. 81 model combinations (9 planners x 9 solvers).
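The planner/solver loop above can be sketched as follows. This is a minimal, hypothetical illustration with stubbed model calls — `call_model`, the step format, and the `terminate` convention are stand-ins, not the benchmark's actual prompts or harness:

```python
# Hypothetical sketch of the two-agent planner/solver loop.
# `call_model` is a stub standing in for a real LLM invocation
# (e.g. via Bedrock), so the control flow runs standalone.

def call_model(role: str, prompt: str) -> str:
    """Stubbed LLM call; a real run would route to the planner/solver models."""
    if role == "planner":
        # Toy planner: propose one search step, then terminate.
        return "terminate" if "Observation" in prompt else "search: who wrote X"
    return "final answer: Author Y"

def run_two_agent(question: str, max_steps: int = 5) -> str:
    """Planner proposes search steps; solver executes them with tool access."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = call_model("planner", transcript)
        if step == "terminate":  # planner stops planning and hands off
            break
        observation = call_model("solver", f"{transcript}\nStep: {step}")
        transcript += f"\nStep: {step}\nObservation: {observation}"
    return call_model("solver", f"{transcript}\nGive the final answer.")
```

Note that a planner which emits `terminate` on its very first turn never invokes the solver at all — the failure mode discussed under "Capability as Liability" below.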
 
 ### Top 15 Combos
 
-| Rank | Planner | Solver | Accuracy | Cost |
-|:-----|:--------|:-------|:---------|:-----|
-| 1 | Ministral 3 8B | Claude Opus 4.6 | **74.78%** | $2.71 |
-| 2 | Qwen3 32B | Claude Opus 4.6 | 72.97% | $2.67 |
-| 3 | Claude 3 Haiku | Claude Opus 4.6 | 72.58% | $2.67 |
-| 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.32% | $2.66 |
-| 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.29% | $0.13 |
-| 6 | Kimi K2.5 | Claude Opus 4.6 | 71.09% | $2.53 |
-| 7 | Qwen3 32B | gpt-oss-120b | 70.47% | $0.12 |
-| 8 | Qwen3 32B | Qwen3 Next 80B A3B | 69.12% | $0.11 |
-| 9 | Claude 3 Haiku | Qwen3 Next 80B A3B | 68.60% | $0.15 |
-| 10 | Ministral 3 8B | gpt-oss-120b | 68.56% | $0.12 |
-| 11 | Claude 3 Haiku | gpt-oss-120b | 68.44% | $0.16 |
-| 12 | Kimi K2.5 | Qwen3 Next 80B A3B | 68.41% | $0.28 |
-| 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.29% | $0.12 |
-| 14 | Qwen3 32B | gpt-oss-20b | 67.75% | $0.09 |
-| 15 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.46% | $0.11 |
+| Rank | Planner | Solver | Accuracy | Avg Latency (s) | Cost |
+|:-----|:--------|:-------|:---------|:----------------|:-----|
+| 1 | Ministral 3 8B | Claude Opus 4.6 | **74.27%** | 4.97 | $2.64 |
+| 2 | Claude 3 Haiku | Claude Opus 4.6 | 73.25% | 4.52 | $2.79 |
+| 3 | Qwen3 32B | Claude Opus 4.6 | 73.02% | 4.26 | $2.65 |
+| 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.10% | 4.67 | $2.67 |
+| 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.83% | 3.07 | $0.13 |
+| 6 | Qwen3 32B | gpt-oss-120b | 70.04% | 2.66 | $0.13 |
+| 7 | Kimi K2.5 | Claude Opus 4.6 | 69.96% | 4.49 | $2.43 |
+| 8 | Claude 3 Haiku | gpt-oss-120b | 69.86% | 3.21 | $0.17 |
+| 9 | Ministral 3 8B | gpt-oss-20b | 69.34% | 5.66 | $0.09 |
+| 10 | Claude 3 Haiku | Qwen3 Next 80B A3B | 69.27% | 3.00 | $0.16 |
+| 11 | Qwen3 Next 80B A3B | gpt-oss-20b | 68.89% | 2.82 | $0.09 |
+| 12 | Ministral 3 8B | gpt-oss-120b | 68.70% | 3.65 | $0.12 |
+| 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.15% | 2.69 | $0.11 |
+| 14 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.98% | 3.85 | $0.11 |
+| 15 | Qwen3 32B | Qwen3 Next 80B A3B | 67.53% | 3.51 | $0.11 |
 
 ### Bottom 15 Combos
 
-| Rank | Planner | Solver | Accuracy | Cost |
-|:-----|:--------|:-------|:---------|:-----|
-| 67 | Claude Haiku 4.5 | Qwen3 32B | 35.64% | $0.46 |
-| 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.10% | $0.48 |
-| 69 | Ministral 3 8B | Claude Haiku 4.5 | 33.95% | $0.74 |
-| 70 | Claude Opus 4.6 | Claude Opus 4.6 | 32.70% | $2.00 |
-| 71 | Claude Opus 4.6 | Kimi K2.5 | 32.44% | $2.01 |
-| 72 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 32.05% | $2.01 |
-| 73 | Claude Opus 4.6 | gpt-oss-120b | 32.00% | $2.01 |
-| 74 | Claude Opus 4.6 | Ministral 3 8B | 31.80% | $2.01 |
-| 75 | Claude Opus 4.6 | Claude 3 Haiku | 31.80% | $2.01 |
-| 76 | Claude Opus 4.6 | Qwen3 32B | 31.52% | $2.01 |
-| 77 | Claude Opus 4.6 | gpt-oss-20b | 31.31% | $2.01 |
-| 78 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 30.85% | $0.71 |
-| 79 | Claude Opus 4.6 | Claude Haiku 4.5 | 30.81% | $2.01 |
-| 80 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.57% | $0.79 |
-| 81 | Qwen3 32B | Claude Haiku 4.5 | 25.11% | $0.72 |
+| Rank | Planner | Solver | Accuracy | Avg Latency (s) | Cost |
+|:-----|:--------|:-------|:---------|:----------------|:-----|
+| 67 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | 2.89 | $0.46 |
+| 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | 2.63 | $0.49 |
+| 69 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | 4.14 | $0.70 |
+| 70 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | 3.92 | $0.72 |
+| 71 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | 4.72 | $2.02 |
+| 72 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | 4.72 | $2.02 |
+| 73 | Claude Opus 4.6 | Qwen3 32B | 31.96% | 4.72 | $2.02 |
+| 74 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | 4.72 | $2.02 |
+| 75 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | 4.60 | $2.02 |
+| 76 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | 4.57 | $2.03 |
+| 77 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | 4.22 | $2.02 |
+| 78 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | 4.16 | $2.03 |
+| 79 | Claude Opus 4.6 | Claude Opus 4.6 | 31.71% | 4.19 | $2.02 |
+| 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | 3.47 | $0.69 |
+| 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | 3.40 | $0.79 |
 
 !!! warning "Capability as Liability"
-    **Claude Opus 4.6 as planner achieves only ~32% accuracy** regardless of solver — the worst planner in the benchmark. Opus is "too smart" for the planner role: it calls `terminate()` and answers directly instead of delegating to the solver. The solver is never invoked. Meanwhile, the cheapest model (Ministral 3 8B) as planner with Opus as solver achieves the **best accuracy at 74.78%**. This demonstrates that stronger models can underperform in multi-agent architectures when the role requires delegation, not direct answering.
+    **Claude Opus 4.6 as planner achieves only ~32% accuracy** regardless of solver — the worst planner in the benchmark. Opus is "too smart" for the planner role: it calls `terminate()` and answers directly instead of delegating to the solver. The solver is never invoked. Meanwhile, the cheapest model (Ministral 3 8B) as planner with Opus as solver achieves the **best accuracy at 74.27%**. This demonstrates that stronger models can underperform in multi-agent architectures when the role requires delegation, not direct answering.
 
 ??? note "Full 81 Combo Results"
-    | Rank | Planner | Solver | Accuracy | Cost | Note |
-    |:-----|:--------|:-------|:---------|:-----|:-----|
-    | 1 | Ministral 3 8B | Claude Opus 4.6 | 74.78% | $2.71 | |
-    | 2 | Qwen3 32B | Claude Opus 4.6 | 72.97% | $2.67 | |
-    | 3 | Claude 3 Haiku | Claude Opus 4.6 | 72.58% | $2.67 | |
-    | 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.32% | $2.66 | |
-    | 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.29% | $0.13 | |
-    | 6 | Kimi K2.5 | Claude Opus 4.6 | 71.09% | $2.53 | |
-    | 7 | Qwen3 32B | gpt-oss-120b | 70.47% | $0.12 | |
-    | 8 | Qwen3 32B | Qwen3 Next 80B A3B | 69.12% | $0.11 | |
-    | 9 | Claude 3 Haiku | Qwen3 Next 80B A3B | 68.60% | $0.15 | |
-    | 10 | Ministral 3 8B | gpt-oss-120b | 68.56% | $0.12 | |
-    | 11 | Claude 3 Haiku | gpt-oss-120b | 68.44% | $0.16 | |
-    | 12 | Kimi K2.5 | Qwen3 Next 80B A3B | 68.41% | $0.28 | |
-    | 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.29% | $0.12 | |
-    | 14 | Qwen3 32B | gpt-oss-20b | 67.75% | $0.09 | |
-    | 15 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.46% | $0.11 | |
-    | 16 | gpt-oss-120b | Claude Opus 4.6 | 67.03% | $1.57 | |
-    | 17 | Kimi K2.5 | Ministral 3 8B | 66.79% | $0.27 | |
-    | 18 | Qwen3 Next 80B A3B | gpt-oss-20b | 66.79% | $0.09 | |
-    | 19 | Kimi K2.5 | gpt-oss-120b | 66.76% | $0.29 | |
-    | 20 | Ministral 3 8B | gpt-oss-20b | 66.55% | $0.09 | |
-    | 21 | Kimi K2.5 | gpt-oss-20b | 66.49% | $0.25 | |
-    | 22 | Claude 3 Haiku | gpt-oss-20b | 65.10% | $0.13 | |
-    | 23 | Claude 3 Haiku | Ministral 3 8B | 64.41% | $0.14 | |
-    | 24 | Qwen3 Next 80B A3B | Kimi K2.5 | 64.28% | $0.27 | |
-    | 25 | Ministral 3 8B | Kimi K2.5 | 64.03% | $0.26 | |
-    | 26 | Qwen3 32B | Kimi K2.5 | 63.77% | $0.26 | |
-    | 27 | gpt-oss-120b | Qwen3 Next 80B A3B | 63.62% | $0.09 | |
-    | 28 | Qwen3 Next 80B A3B | Ministral 3 8B | 63.29% | $0.10 | |
-    | 29 | Claude 3 Haiku | Kimi K2.5 | 62.86% | $0.31 | |
-    | 30 | gpt-oss-120b | Claude Haiku 4.5 | 62.38% | $0.36 | |
-    | 31 | Ministral 3 8B | Ministral 3 8B | 62.20% | $0.09 | |
-    | 32 | Qwen3 32B | Ministral 3 8B | 62.09% | $0.09 | |
-    | 33 | Kimi K2.5 | Kimi K2.5 | 61.96% | $0.45 | |
-    | 34 | gpt-oss-120b | Kimi K2.5 | 61.15% | $0.17 | |
-    | 35 | gpt-oss-120b | Claude 3 Haiku | 60.89% | $0.12 | |
-    | 36 | gpt-oss-120b | Ministral 3 8B | 60.64% | $0.09 | |
-    | 37 | gpt-oss-120b | gpt-oss-120b | 60.51% | $0.10 | |
-    | 38 | gpt-oss-120b | gpt-oss-20b | 59.10% | $0.08 | |
-    | 39 | gpt-oss-120b | Qwen3 32B | 58.54% | $0.09 | |
-    | 40 | Kimi K2.5 | Claude 3 Haiku | 57.18% | $0.32 | |
-    | 41 | Claude 3 Haiku | Qwen3 32B | 56.28% | $0.15 | |
-    | 42 | Kimi K2.5 | Qwen3 32B | 55.72% | $0.27 | |
-    | 43 | Ministral 3 8B | Qwen3 32B | 55.30% | $0.11 | |
-    | 44 | gpt-oss-20b | Claude Opus 4.6 | 55.13% | $0.90 | |
-    | 45 | gpt-oss-20b | Ministral 3 8B | 54.99% | $0.05 | |
-    | 46 | Qwen3 Next 80B A3B | Qwen3 32B | 54.88% | $0.11 | |
-    | 47 | gpt-oss-20b | Kimi K2.5 | 54.69% | $0.11 | |
-    | 48 | gpt-oss-20b | gpt-oss-120b | 54.26% | $0.06 | |
-    | 49 | Qwen3 32B | Qwen3 32B | 54.21% | $0.11 | |
-    | 50 | Claude 3 Haiku | Claude 3 Haiku | 54.13% | $0.20 | |
-    | 51 | gpt-oss-20b | Claude Haiku 4.5 | 54.06% | $0.25 | |
-    | 52 | gpt-oss-20b | Claude 3 Haiku | 53.08% | $0.08 | |
-    | 53 | gpt-oss-20b | Qwen3 Next 80B A3B | 52.87% | $0.05 | |
-    | 54 | gpt-oss-20b | gpt-oss-20b | 52.69% | $0.05 | |
-    | 55 | Ministral 3 8B | Claude 3 Haiku | 51.65% | $0.16 | |
-    | 56 | gpt-oss-20b | Qwen3 32B | 49.60% | $0.06 | |
-    | 57 | Qwen3 Next 80B A3B | Claude 3 Haiku | 48.86% | $0.17 | |
-    | 58 | Qwen3 32B | Claude 3 Haiku | 48.29% | $0.16 | |
-    | 59 | Claude 3 Haiku | Claude Haiku 4.5 | 47.28% | $0.71 | |
-    | 60 | Claude Haiku 4.5 | Claude Opus 4.6 | 43.40% | $1.80 | |
-    | 61 | Claude Haiku 4.5 | Kimi K2.5 | 41.51% | $0.55 | |
-    | 62 | Claude Haiku 4.5 | Ministral 3 8B | 41.21% | $0.45 | |
-    | 63 | Claude Haiku 4.5 | gpt-oss-20b | 41.18% | $0.45 | |
-    | 64 | Claude Haiku 4.5 | gpt-oss-120b | 40.83% | $0.47 | |
-    | 65 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 40.54% | $0.46 | |
-    | 66 | Kimi K2.5 | Claude Haiku 4.5 | 40.37% | $0.87 | |
-    | 67 | Claude Haiku 4.5 | Qwen3 32B | 35.64% | $0.46 | |
-    | 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.10% | $0.48 | |
-    | 69 | Ministral 3 8B | Claude Haiku 4.5 | 33.95% | $0.74 | |
-    | 70 | Claude Opus 4.6 | Claude Opus 4.6 | 32.70% | $2.00 | |
-    | 71 | Claude Opus 4.6 | Kimi K2.5 | 32.44% | $2.01 | role2_never_called |
-    | 72 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 32.05% | $2.01 | role2_never_called |
-    | 73 | Claude Opus 4.6 | gpt-oss-120b | 32.00% | $2.01 | role2_never_called |
-    | 74 | Claude Opus 4.6 | Ministral 3 8B | 31.80% | $2.01 | role2_never_called |
-    | 75 | Claude Opus 4.6 | Claude 3 Haiku | 31.80% | $2.01 | role2_never_called |
-    | 76 | Claude Opus 4.6 | Qwen3 32B | 31.52% | $2.01 | role2_never_called |
-    | 77 | Claude Opus 4.6 | gpt-oss-20b | 31.31% | $2.01 | role2_never_called |
-    | 78 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 30.85% | $0.71 | |
-    | 79 | Claude Opus 4.6 | Claude Haiku 4.5 | 30.81% | $2.01 | role2_never_called |
-    | 80 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.57% | $0.79 | |
-    | 81 | Qwen3 32B | Claude Haiku 4.5 | 25.11% | $0.72 | |
+    | Rank | Planner | Solver | Accuracy | Avg Latency (s) | Cost | Note |
+    |:-----|:--------|:-------|:---------|:----------------|:-----|:-----|
+    | 1 | Ministral 3 8B | Claude Opus 4.6 | 74.27% | 4.97 | $2.64 | |
+    | 2 | Claude 3 Haiku | Claude Opus 4.6 | 73.25% | 4.52 | $2.79 | |
+    | 3 | Qwen3 32B | Claude Opus 4.6 | 73.02% | 4.26 | $2.65 | |
+    | 4 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 72.10% | 4.67 | $2.67 | |
+    | 5 | Qwen3 Next 80B A3B | gpt-oss-120b | 71.83% | 3.07 | $0.13 | |
+    | 6 | Qwen3 32B | gpt-oss-120b | 70.04% | 2.66 | $0.13 | |
+    | 7 | Kimi K2.5 | Claude Opus 4.6 | 69.96% | 4.49 | $2.43 | |
+    | 8 | Claude 3 Haiku | gpt-oss-120b | 69.86% | 3.21 | $0.17 | |
+    | 9 | Ministral 3 8B | gpt-oss-20b | 69.34% | 5.66 | $0.09 | |
+    | 10 | Claude 3 Haiku | Qwen3 Next 80B A3B | 69.27% | 3.00 | $0.16 | |
+    | 11 | Qwen3 Next 80B A3B | gpt-oss-20b | 68.89% | 2.82 | $0.09 | |
+    | 12 | Ministral 3 8B | gpt-oss-120b | 68.70% | 3.65 | $0.12 | |
+    | 13 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 68.15% | 2.69 | $0.11 | |
+    | 14 | Ministral 3 8B | Qwen3 Next 80B A3B | 67.98% | 3.85 | $0.11 | |
+    | 15 | Qwen3 32B | Qwen3 Next 80B A3B | 67.53% | 3.51 | $0.11 | |
+    | 16 | Qwen3 32B | gpt-oss-20b | 66.95% | 2.48 | $0.09 | |
+    | 17 | Claude 3 Haiku | Ministral 3 8B | 65.98% | 3.73 | $0.14 | |
+    | 18 | Ministral 3 8B | Kimi K2.5 | 65.24% | 3.27 | $0.26 | |
+    | 19 | gpt-oss-120b | Qwen3 Next 80B A3B | 64.93% | 4.68 | $0.10 | |
+    | 20 | Ministral 3 8B | Ministral 3 8B | 64.89% | 3.55 | $0.09 | |
+    | 21 | Claude 3 Haiku | gpt-oss-20b | 64.79% | 2.90 | $0.13 | |
+    | 22 | Kimi K2.5 | gpt-oss-120b | 64.70% | 4.16 | $0.29 | |
+    | 23 | gpt-oss-120b | Claude Opus 4.6 | 64.59% | 4.57 | $1.61 | |
+    | 24 | gpt-oss-120b | Claude Haiku 4.5 | 64.11% | 4.26 | $0.38 | |
+    | 25 | Kimi K2.5 | Qwen3 Next 80B A3B | 63.99% | 4.39 | $0.30 | |
+    | 26 | Kimi K2.5 | Ministral 3 8B | 63.95% | 6.42 | $0.28 | |
+    | 27 | Claude 3 Haiku | Kimi K2.5 | 63.85% | 2.89 | $0.31 | |
+    | 28 | gpt-oss-120b | Ministral 3 8B | 63.70% | 7.37 | $0.09 | |
+    | 29 | Qwen3 Next 80B A3B | Kimi K2.5 | 63.69% | 2.89 | $0.27 | |
+    | 30 | Kimi K2.5 | gpt-oss-20b | 63.35% | 6.80 | $0.26 | |
+    | 31 | Qwen3 32B | Kimi K2.5 | 63.17% | 3.26 | $0.28 | |
+    | 32 | gpt-oss-120b | Claude 3 Haiku | 62.72% | 3.72 | $0.13 | |
+    | 33 | Kimi K2.5 | Kimi K2.5 | 62.28% | 4.56 | $0.44 | |
+    | 34 | gpt-oss-120b | gpt-oss-120b | 62.15% | 4.59 | $0.10 | |
+    | 35 | Qwen3 Next 80B A3B | Ministral 3 8B | 62.11% | 4.27 | $0.10 | |
+    | 36 | gpt-oss-120b | gpt-oss-20b | 61.51% | 2.71 | $0.08 | |
+    | 37 | Qwen3 32B | Ministral 3 8B | 61.17% | 2.89 | $0.09 | |
+    | 38 | gpt-oss-120b | Kimi K2.5 | 60.85% | 4.09 | $0.18 | |
+    | 39 | gpt-oss-120b | Qwen3 32B | 58.80% | 4.06 | $0.10 | |
+    | 40 | Claude 3 Haiku | Qwen3 32B | 56.02% | 2.87 | $0.15 | |
+    | 41 | Claude 3 Haiku | Claude 3 Haiku | 55.91% | 2.41 | $0.21 | |
+    | 42 | gpt-oss-20b | Claude Opus 4.6 | 55.86% | 2.84 | $1.04 | |
+    | 43 | Ministral 3 8B | Qwen3 32B | 55.02% | 3.63 | $0.11 | |
+    | 44 | Kimi K2.5 | Claude 3 Haiku | 54.90% | 3.42 | $0.34 | |
+    | 45 | Qwen3 32B | Qwen3 32B | 54.82% | 2.53 | $0.11 | |
+    | 46 | Kimi K2.5 | Qwen3 32B | 54.73% | 4.57 | $0.30 | |
+    | 47 | gpt-oss-20b | Claude Haiku 4.5 | 54.28% | 2.19 | $0.26 | |
+    | 48 | gpt-oss-20b | Ministral 3 8B | 54.25% | 4.35 | $0.05 | |
+    | 49 | Qwen3 Next 80B A3B | Qwen3 32B | 54.13% | 2.83 | $0.11 | |
+    | 50 | gpt-oss-20b | Qwen3 Next 80B A3B | 53.89% | 2.11 | $0.06 | |
+    | 51 | gpt-oss-20b | Claude 3 Haiku | 52.66% | 2.04 | $0.08 | |
+    | 52 | gpt-oss-20b | gpt-oss-120b | 52.17% | 2.11 | $0.06 | |
+    | 53 | Ministral 3 8B | Claude 3 Haiku | 51.33% | 4.10 | $0.16 | |
+    | 54 | gpt-oss-20b | Kimi K2.5 | 51.01% | 1.96 | $0.12 | |
+    | 55 | gpt-oss-20b | gpt-oss-20b | 50.09% | 2.12 | $0.05 | |
+    | 56 | Qwen3 Next 80B A3B | Claude 3 Haiku | 49.98% | 2.56 | $0.17 | |
+    | 57 | gpt-oss-20b | Qwen3 32B | 49.16% | 2.05 | $0.06 | |
+    | 58 | Qwen3 32B | Claude 3 Haiku | 48.77% | 2.23 | $0.16 | |
+    | 59 | Claude 3 Haiku | Claude Haiku 4.5 | 46.50% | 3.35 | $0.71 | |
+    | 60 | Claude Haiku 4.5 | Claude Opus 4.6 | 43.54% | 4.06 | $1.80 | |
+    | 61 | Claude Haiku 4.5 | gpt-oss-20b | 41.49% | 3.03 | $0.45 | |
+    | 62 | Claude Haiku 4.5 | gpt-oss-120b | 41.20% | 3.14 | $0.47 | |
+    | 63 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 41.17% | 2.95 | $0.46 | |
+    | 64 | Claude Haiku 4.5 | Ministral 3 8B | 41.09% | 3.75 | $0.45 | |
+    | 65 | Claude Haiku 4.5 | Kimi K2.5 | 41.00% | 6.16 | $0.54 | |
+    | 66 | Kimi K2.5 | Claude Haiku 4.5 | 37.19% | 4.23 | $0.88 | |
+    | 67 | Claude Haiku 4.5 | Qwen3 32B | 36.13% | 2.89 | $0.46 | |
+    | 68 | Claude Haiku 4.5 | Claude 3 Haiku | 34.34% | 2.63 | $0.49 | |
+    | 69 | Ministral 3 8B | Claude Haiku 4.5 | 32.42% | 4.14 | $0.70 | |
+    | 70 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 32.19% | 3.92 | $0.72 | |
+    | 71 | Claude Opus 4.6 | Kimi K2.5 | 31.96% | 4.72 | $2.02 | role2_never_called |
+    | 72 | Claude Opus 4.6 | Ministral 3 8B | 31.96% | 4.72 | $2.02 | role2_never_called |
+    | 73 | Claude Opus 4.6 | Qwen3 32B | 31.96% | 4.72 | $2.02 | role2_never_called |
+    | 74 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 31.96% | 4.72 | $2.02 | role2_never_called |
+    | 75 | Claude Opus 4.6 | gpt-oss-120b | 31.95% | 4.60 | $2.02 | role2_never_called |
+    | 76 | Claude Opus 4.6 | gpt-oss-20b | 31.88% | 4.57 | $2.03 | role2_never_called |
+    | 77 | Claude Opus 4.6 | Claude 3 Haiku | 31.78% | 4.22 | $2.02 | role2_never_called |
+    | 78 | Claude Opus 4.6 | Claude Haiku 4.5 | 31.77% | 4.16 | $2.03 | role2_never_called |
+    | 79 | Claude Opus 4.6 | Claude Opus 4.6 | 31.71% | 4.19 | $2.02 | |
+    | 80 | Qwen3 32B | Claude Haiku 4.5 | 26.63% | 3.47 | $0.69 | |
+    | 81 | Claude Haiku 4.5 | Claude Haiku 4.5 | 26.49% | 3.40 | $0.79 | |
 
 ### Selector Comparison
 
 | Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings |
 |:---------|:----------|:--------------|:------------|:-----|:--------|
-| Brute Force | 100% | 74.78% | 16,108 | $51.48 | -- |
-| Arm Elimination | 90% | 74.12% | 4,654 | $18.49 | **64%** |
-| Hill Climbing | 44% | 73.38% | 5,031 | $19.21 | 63% |
-| Bayesian Opt | 8% | 72.78% | 3,979 | $12.13 | 76% |
-| Random Search | 30% | 72.34% | 4,176 | $13.26 | 74% |
-| Epsilon LUCB | 14% | 69.96% | 477 | $1.86 | 96% |
-| Threshold SE | 2% | 63.62% | 1,926 | $3.50 | 93% |
-| LM Proposal | 0% | 34.41% | 199 | $1.86 | 96% |
+| Brute Force | 100% | 74.27% | 16,168 | $51.90 | -- |
+| Bayesian Opt | 8% | 73.33% | 3,996 | $12.29 | 76% |
+| Arm Elimination | 86% | 73.19% | 4,283 | $16.92 | **67%** |
+| Hill Climbing | 52% | 73.13% | 4,635 | $19.39 | 63% |
+| Random Search | 30% | 72.25% | 4,192 | $13.37 | 74% |
+| Epsilon LUCB | 10% | 69.71% | 478 | $1.75 | 97% |
+| Threshold SE | 4% | 65.42% | 1,642 | $6.45 | 88% |
+| LM Proposal | 0% | 34.13% | 200 | $1.84 | 96% |
 
 ---
 
@@ -252,139 +267,139 @@ All models accessed via AWS Bedrock Application Inference Profiles (on-demand pr
 
 ### Top 15 Combos
 
-| Rank | Answer Model | Critic Model | Accuracy | Cost |
-|:-----|:-------------|:-------------|:---------|:-----|
-| 1 | Claude Opus 4.6 | Qwen3 Next 80B A3B | **98.83%** | $5.89 |
-| 2 | Claude Opus 4.6 | Ministral 3 8B | 98.73% | $5.31 |
-| 3 | Claude Opus 4.6 | Claude Haiku 4.5 | 98.27% | $6.09 |
-| 4 | Claude Opus 4.6 | Qwen3 32B | 97.79% | $6.42 |
-| 5 | Claude Opus 4.6 | Claude Opus 4.6 | 97.77% | $6.95 |
-| 6 | Claude Opus 4.6 | gpt-oss-120b | 97.73% | $6.14 |
-| 7 | Claude Opus 4.6 | Claude 3 Haiku | 97.26% | $5.26 |
-| 8 | Claude Opus 4.6 | Kimi K2.5 | 97.25% | $6.66 |
-| 9 | Claude Opus 4.6 | gpt-oss-20b | 97.13% | $6.10 |
-| 10 | Claude Haiku 4.5 | Ministral 3 8B | 94.47% | $2.59 |
-| 11 | Claude Haiku 4.5 | Claude Haiku 4.5 | 94.00% | $3.17 |
-| 12 | Claude Haiku 4.5 | Claude Opus 4.6 | 94.00% | $3.89 |
-| 13 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 94.00% | $2.50 |
-| 14 | Ministral 3 8B | Claude 3 Haiku | 93.98% | $0.05 |
-| 15 | Claude Haiku 4.5 | Kimi K2.5 | 93.97% | $2.92 |
+| Rank | Answer Model | Critic Model | Accuracy | Avg Latency (s) | Cost |
+|:-----|:-------------|:-------------|:---------|:----------------|:-----|
+| 1 | Claude Opus 4.6 | Claude Haiku 4.5 | **98.84%** | 16.15 | $6.19 |
+| 2 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.82% | 14.30 | $5.77 |
+| 3 | Claude Opus 4.6 | Ministral 3 8B | 98.72% | 14.03 | $5.26 |
+| 4 | Claude Opus 4.6 | gpt-oss-20b | 98.28% | 16.50 | $5.93 |
+| 5 | Claude Opus 4.6 | gpt-oss-120b | 97.77% | 15.40 | $6.30 |
+| 6 | Claude Opus 4.6 | Qwen3 32B | 97.28% | 15.05 | $6.68 |
+| 7 | Claude Opus 4.6 | Claude Opus 4.6 | 97.24% | 15.94 | $6.97 |
+| 8 | Claude Opus 4.6 | Kimi K2.5 | 97.24% | 18.37 | $6.58 |
+| 9 | Claude Opus 4.6 | Claude 3 Haiku | 95.95% | 14.85 | $5.37 |
+| 10 | gpt-oss-20b | Claude Opus 4.6 | 94.57% | 6.81 | $0.97 |
+| 11 | gpt-oss-20b | Kimi K2.5 | 94.57% | 12.45 | $0.26 |
+| 12 | gpt-oss-20b | gpt-oss-20b | 94.54% | 4.04 | $0.08 |
+| 13 | Claude Haiku 4.5 | Qwen3 32B | 94.50% | 12.68 | $2.51 |
+| 14 | gpt-oss-20b | Claude Haiku 4.5 | 94.05% | 6.19 | $0.37 |
+| 15 | gpt-oss-20b | gpt-oss-120b | 94.02% | 4.94 | $0.11 |
 
 ### Bottom 15 Combos
 
-| Rank | Answer Model | Critic Model | Accuracy | Cost |
-|:-----|:-------------|:-------------|:---------|:-----|
-| 67 | Kimi K2.5 | Claude Haiku 4.5 | 78.01% | $1.20 |
-| 68 | Claude 3 Haiku | Ministral 3 8B | 77.64% | $0.30 |
-| 69 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 77.60% | $2.03 |
-| 70 | Kimi K2.5 | Claude Opus 4.6 | 77.55% | $2.55 |
-| 71 | Claude 3 Haiku | Kimi K2.5 | 77.48% | $0.46 |
-| 72 | gpt-oss-120b | gpt-oss-20b | 77.32% | $0.17 |
-| 73 | Kimi K2.5 | Claude 3 Haiku | 76.96% | $0.92 |
-| 74 | Qwen3 Next 80B A3B | Kimi K2.5 | 76.96% | $0.46 |
-| 75 | Qwen3 Next 80B A3B | gpt-oss-120b | 76.84% | $0.32 |
-| 76 | gpt-oss-120b | Qwen3 Next 80B A3B | 74.74% | $0.20 |
-| 77 | Claude 3 Haiku | gpt-oss-20b | 72.96% | $0.31 |
-| 78 | Kimi K2.5 | Qwen3 32B | 72.77% | $0.67 |
-| 79 | Claude 3 Haiku | Qwen3 Next 80B A3B | 68.94% | $0.36 |
-| 80 | Claude 3 Haiku | Qwen3 32B | 63.86% | $0.27 |
-| 81 | Claude 3 Haiku | Claude 3 Haiku | 59.88% | $0.30 |
+| Rank | Answer Model | Critic Model | Accuracy | Avg Latency (s) | Cost |
+|:-----|:-------------|:-------------|:---------|:----------------|:-----|
+| 67 | Qwen3 Next 80B A3B | Kimi K2.5 | 75.50% | 36.37 | $0.79 |
+| 68 | Qwen3 Next 80B A3B | gpt-oss-20b | 75.00% | 32.70 | $0.48 |
+| 69 | Kimi K2.5 | gpt-oss-120b | 74.49% | 32.23 | $0.95 |
+| 70 | Kimi K2.5 | gpt-oss-20b | 74.09% | 25.65 | $0.77 |
+| 71 | Kimi K2.5 | Kimi K2.5 | 73.58% | 44.39 | $1.34 |
+| 72 | Kimi K2.5 | Claude Opus 4.6 | 73.33% | 28.62 | $2.79 |
+| 73 | Kimi K2.5 | Claude Haiku 4.5 | 73.20% | 26.98 | $1.36 |
+| 74 | Claude 3 Haiku | gpt-oss-120b | 72.19% | 8.39 | $0.32 |
+| 75 | Kimi K2.5 | Qwen3 32B | 72.16% | 30.32 | $0.92 |
+| 76 | Claude 3 Haiku | gpt-oss-20b | 71.43% | 8.42 | $0.32 |
+| 77 | Claude 3 Haiku | Qwen3 Next 80B A3B | 71.07% | 17.12 | $0.39 |
+| 78 | Claude 3 Haiku | Kimi K2.5 | 71.01% | 14.23 | $0.53 |
+| 79 | Claude 3 Haiku | Ministral 3 8B | 69.28% | 12.40 | $0.32 |
+| 80 | Claude 3 Haiku | Qwen3 32B | 59.30% | 6.29 | $0.29 |
+| 81 | Claude 3 Haiku | Claude 3 Haiku | 54.37% | 7.28 | $0.30 |
 
 ??? note "Full 81 Combo Results"
-    | Rank | Answer Model | Critic Model | Accuracy | Cost | Note |
-    |:-----|:-------------|:-------------|:---------|:-----|:-----|
-    | 1 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.83% | $5.89 | |
-    | 2 | Claude Opus 4.6 | Ministral 3 8B | 98.73% | $5.31 | |
-    | 3 | Claude Opus 4.6 | Claude Haiku 4.5 | 98.27% | $6.09 | |
-    | 4 | Claude Opus 4.6 | Qwen3 32B | 97.79% | $6.42 | |
-    | 5 | Claude Opus 4.6 | Claude Opus 4.6 | 97.77% | $6.95 | |
-    | 6 | Claude Opus 4.6 | gpt-oss-120b | 97.73% | $6.14 | |
-    | 7 | Claude Opus 4.6 | Claude 3 Haiku | 97.26% | $5.26 | |
-    | 8 | Claude Opus 4.6 | Kimi K2.5 | 97.25% | $6.66 | |
-    | 9 | Claude Opus 4.6 | gpt-oss-20b | 97.13% | $6.10 | |
-    | 10 | Claude Haiku 4.5 | Ministral 3 8B | 94.47% | $2.59 | |
-    | 11 | Claude Haiku 4.5 | Claude Haiku 4.5 | 94.00% | $3.17 | |
-    | 12 | Claude Haiku 4.5 | Claude Opus 4.6 | 94.00% | $3.89 | |
-    | 13 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 94.00% | $2.50 | |
-    | 14 | Ministral 3 8B | Claude 3 Haiku | 93.98% | $0.05 | |
-    | 15 | Claude Haiku 4.5 | Kimi K2.5 | 93.97% | $2.92 | |
-    | 16 | gpt-oss-20b | gpt-oss-120b | 93.96% | $0.12 | |
-    | 17 | Claude Haiku 4.5 | Qwen3 32B | 93.50% | $2.72 | |
-    | 18 | Claude Haiku 4.5 | gpt-oss-20b | 93.50% | $2.93 | |
-    | 19 | gpt-oss-20b | Kimi K2.5 | 93.44% | $0.23 | |
-    | 20 | gpt-oss-20b | Claude Haiku 4.5 | 92.97% | $0.36 | |
-    | 21 | Claude 3 Haiku | Claude Opus 4.6 | 92.94% | $2.04 | |
-    | 22 | Claude Haiku 4.5 | gpt-oss-120b | 92.50% | $2.35 | |
-    | 23 | gpt-oss-20b | Qwen3 Next 80B A3B | 92.43% | $0.15 | |
-    | 24 | gpt-oss-20b | Claude Opus 4.6 | 92.35% | $0.99 | |
-    | 25 | gpt-oss-20b | gpt-oss-20b | 91.94% | $0.09 | |
-    | 26 | Claude Haiku 4.5 | Claude 3 Haiku | 91.50% | $2.95 | |
-    | 27 | gpt-oss-20b | Qwen3 32B | 91.21% | $0.08 | |
-    | 28 | gpt-oss-20b | Claude 3 Haiku | 90.76% | $0.16 | |
-    | 29 | Ministral 3 8B | gpt-oss-120b | 90.59% | $0.07 | |
-    | 30 | gpt-oss-20b | Ministral 3 8B | 90.43% | $0.13 | |
-    | 31 | Ministral 3 8B | Qwen3 Next 80B A3B | 90.20% | $0.03 | |
-    | 32 | Ministral 3 8B | Claude Opus 4.6 | 89.53% | $0.87 | |
-    | 33 | Ministral 3 8B | Claude Haiku 4.5 | 88.89% | $0.30 | |
-    | 34 | Ministral 3 8B | Kimi K2.5 | 88.82% | $0.09 | |
-    | 35 | Ministral 3 8B | gpt-oss-20b | 88.76% | $0.04 | |
-    | 36 | Qwen3 32B | Qwen3 Next 80B A3B | 88.72% | $0.21 | |
-    | 37 | Ministral 3 8B | Ministral 3 8B | 88.19% | $0.03 | |
-    | 38 | Claude 3 Haiku | Claude Haiku 4.5 | 87.21% | $0.69 | |
-    | 39 | Ministral 3 8B | Qwen3 32B | 86.98% | $0.04 | |
-    | 40 | Qwen3 32B | Ministral 3 8B | 86.73% | $0.35 | |
-    | 41 | Qwen3 32B | gpt-oss-120b | 86.67% | $0.25 | |
-    | 42 | Qwen3 32B | Claude Opus 4.6 | 85.35% | $2.01 | |
-    | 43 | Qwen3 32B | gpt-oss-20b | 85.05% | $0.19 | |
-    | 44 | Qwen3 32B | Claude Haiku 4.5 | 84.02% | $0.53 | |
-    | 45 | Qwen3 32B | Kimi K2.5 | 82.74% | $1.11 | |
-    | 46 | Qwen3 32B | Qwen3 32B | 82.56% | $0.17 | |
-    | 47 | Qwen3 32B | Claude 3 Haiku | 82.47% | $0.27 | |
-    | 48 | Qwen3 Next 80B A3B | Claude 3 Haiku | 82.29% | $0.42 | |
-    | 49 | Qwen3 Next 80B A3B | Qwen3 32B | 81.87% | $0.29 | |
-    | 50 | Kimi K2.5 | gpt-oss-120b | 81.44% | $0.82 | |
-    | 51 | Kimi K2.5 | gpt-oss-20b | 81.35% | $0.85 | |
-    | 52 | Kimi K2.5 | Kimi K2.5 | 81.25% | $1.13 | |
-    | 53 | gpt-oss-120b | Qwen3 32B | 80.41% | $0.15 | |
-    | 54 | Qwen3 Next 80B A3B | Qwen3 Next 80B A3B | 80.32% | $0.31 | |
-    | 55 | gpt-oss-120b | Claude Haiku 4.5 | 80.31% | $0.47 | |
-    | 56 | gpt-oss-120b | Kimi K2.5 | 80.10% | $0.27 | |
-    | 57 | gpt-oss-120b | Ministral 3 8B | 80.00% | $0.18 | |
-    | 58 | Qwen3 Next 80B A3B | gpt-oss-20b | 79.79% | $0.32 | |
-    | 59 | gpt-oss-120b | Claude 3 Haiku | 79.69% | $0.19 | |
-    | 60 | Kimi K2.5 | Ministral 3 8B | 79.49% | $0.86 | |
-    | 61 | gpt-oss-120b | Claude Opus 4.6 | 79.49% | $1.17 | |
-    | 62 | Kimi K2.5 | Qwen3 Next 80B A3B | 79.06% | $0.82 | |
-    | 63 | gpt-oss-120b | gpt-oss-120b | 78.87% | $0.20 | |
-    | 64 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 78.65% | $0.81 | |
-    | 65 | Qwen3 Next 80B A3B | Ministral 3 8B | 78.65% | $0.38 | |
-    | 66 | Claude 3 Haiku | gpt-oss-120b | 78.62% | $0.36 | |
-    | 67 | Kimi K2.5 | Claude Haiku 4.5 | 78.01% | $1.20 | |
-    | 68 | Claude 3 Haiku | Ministral 3 8B | 77.64% | $0.30 | |
-    | 69 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 77.60% | $2.03 | |
-    | 70 | Kimi K2.5 | Claude Opus 4.6 | 77.55% | $2.55 | |
-    | 71 | Claude 3 Haiku | Kimi K2.5 | 77.48% | $0.46 | |
-    | 72 | gpt-oss-120b | gpt-oss-20b | 77.32% | $0.17 | |
-    | 73 | Kimi K2.5 | Claude 3 Haiku | 76.96% | $0.92 | |
-    | 74 | Qwen3 Next 80B A3B | Kimi K2.5 | 76.96% | $0.46 | |
-    | 75 | Qwen3 Next 80B A3B | gpt-oss-120b | 76.84% | $0.32 | |
-    | 76 | gpt-oss-120b | Qwen3 Next 80B A3B | 74.74% | $0.20 | |
-    | 77 | Claude 3 Haiku | gpt-oss-20b | 72.96% | $0.31 | |
-    | 78 | Kimi K2.5 | Qwen3 32B | 72.77% | $0.67 | |
-    | 79 | Claude 3 Haiku | Qwen3 Next 80B A3B | 68.94% | $0.36 | |
-    | 80 | Claude 3 Haiku | Qwen3 32B | 63.86% | $0.27 | |
-    | 81 | Claude 3 Haiku | Claude 3 Haiku | 59.88% | $0.30 | |
+    | Rank | Answer Model | Critic Model | Accuracy | Avg Latency (s) | Cost | Note |
+    |:-----|:-------------|:-------------|:---------|:----------------|:-----|:-----|
+    | 1 | Claude Opus 4.6 | Claude Haiku 4.5 | 98.84% | 16.15 | $6.19 | |
+    | 2 | Claude Opus 4.6 | Qwen3 Next 80B A3B | 98.82% | 14.30 | $5.77 | |
+    | 3 | Claude Opus 4.6 | Ministral 3 8B | 98.72% | 14.03 | $5.26 | |
+    | 4 | Claude Opus 4.6 | gpt-oss-20b | 98.28% | 16.50 | $5.93 | |
+    | 5 | Claude Opus 4.6 | gpt-oss-120b | 97.77% | 15.40 | $6.30 | |
+    | 6 | Claude Opus 4.6 | Qwen3 32B | 97.28% | 15.05 | $6.68 | |
+    | 7 | Claude Opus 4.6 | Claude Opus 4.6 | 97.24% | 15.94 | $6.97 | |
+    | 8 | Claude Opus 4.6 | Kimi K2.5 | 97.24% | 18.37 | $6.58 | |
+    | 9 | Claude Opus 4.6 | Claude 3 Haiku | 95.95% | 14.85 | $5.37 | |
+    | 10 | gpt-oss-20b | Claude Opus 4.6 | 94.57% | 6.81 | $0.97 | |
+    | 11 | gpt-oss-20b | Kimi K2.5 | 94.57% | 12.45 | $0.26 | |
+    | 12 | gpt-oss-20b | gpt-oss-20b | 94.54% | 4.04 | $0.08 | |
+    | 13 | Claude Haiku 4.5 | Qwen3 32B | 94.50% | 12.68 | $2.51 | |
+    | 14 | gpt-oss-20b | Claude Haiku 4.5 | 94.05% | 6.19 | $0.37 | |
+    | 15 | gpt-oss-20b | gpt-oss-120b | 94.02% | 4.94 | $0.11 | |
+    | 16 | gpt-oss-20b | Qwen3 Next 80B A3B | 94.02% | 8.67 | $0.14 | |
+    | 17 | Claude Haiku 4.5 | Claude Haiku 4.5 | 94.00% | 14.31 | $2.59 | |
+    | 18 | gpt-oss-20b | Ministral 3 8B | 93.99% | 8.27 | $0.10 | |
+    | 19 | gpt-oss-120b | Claude Opus 4.6 | 93.81% | 9.10 | $1.25 | |
+    | 20 | Claude Haiku 4.5 | gpt-oss-20b | 93.50% | 12.51 | $2.20 | |
+    | 21 | Claude Haiku 4.5 | Claude Opus 4.6 | 93.50% | 15.82 | $3.77 | |
+    | 22 | Claude Haiku 4.5 | Ministral 3 8B | 93.50% | 14.70 | $2.57 | |
+    | 23 | Claude Haiku 4.5 | Kimi K2.5 | 93.50% | 17.50 | $2.60 | |
+    | 24 | gpt-oss-20b | Qwen3 32B | 93.48% | 4.30 | $0.09 | |
+    | 25 | gpt-oss-20b | Claude 3 Haiku | 93.44% | 6.10 | $0.15 | |
+    | 26 | gpt-oss-120b | Ministral 3 8B | 93.26% | 10.42 | $0.19 | |
+    | 27 | gpt-oss-120b | Qwen3 32B | 93.26% | 5.53 | $0.16 | |
+    | 28 | Claude Haiku 4.5 | gpt-oss-120b | 93.00% | 14.65 | $2.90 | |
+    | 29 | Claude Haiku 4.5 | Qwen3 Next 80B A3B | 93.00% | 20.98 | $7.81 | |
+    | 30 | gpt-oss-120b | Claude Haiku 4.5 | 92.82% | 7.77 | $0.47 | |
+    | 31 | gpt-oss-120b | gpt-oss-20b | 92.78% | 6.45 | $0.18 | |
+    | 32 | gpt-oss-120b | gpt-oss-120b | 92.78% | 6.94 | $0.19 | |
+    | 33 | gpt-oss-120b | Kimi K2.5 | 92.78% | 12.09 | $0.32 | |
+    | 34 | gpt-oss-120b | Qwen3 Next 80B A3B | 92.78% | 10.98 | $0.23 | |
+    | 35 | gpt-oss-120b | Claude 3 Haiku | 92.75% | 6.42 | $0.20 | |
+    | 36 | Claude Haiku 4.5 | Claude 3 Haiku | 92.50% | 13.43 | $2.46 | |
+    | 37 | Claude 3
Haiku | Claude Opus 4.6 | 89.66% | 13.32 | $2.26 | | + | 38 | Qwen3 32B | Qwen3 Next 80B A3B | 88.83% | 8.02 | $0.24 | | + | 39 | Ministral 3 8B | Claude 3 Haiku | 88.15% | 10.24 | $0.05 | | + | 40 | Qwen3 32B | gpt-oss-120b | 87.83% | 7.11 | $0.47 | | + | 41 | Ministral 3 8B | Qwen3 Next 80B A3B | 87.82% | 9.22 | $0.03 | | + | 42 | Qwen3 32B | Claude Opus 4.6 | 87.56% | 12.33 | $3.43 | | + | 43 | Ministral 3 8B | Kimi K2.5 | 87.04% | 14.43 | $0.09 | | + | 44 | Ministral 3 8B | gpt-oss-120b | 86.63% | 10.58 | $0.07 | | + | 45 | Claude 3 Haiku | Claude Haiku 4.5 | 86.55% | 9.32 | $0.69 | | + | 46 | Ministral 3 8B | Ministral 3 8B | 86.52% | 7.29 | $0.03 | | + | 47 | Ministral 3 8B | Claude Opus 4.6 | 86.47% | 11.46 | $0.93 | | + | 48 | Qwen3 32B | Claude Haiku 4.5 | 86.46% | 7.47 | $0.90 | | + | 49 | Ministral 3 8B | Claude Haiku 4.5 | 86.23% | 11.66 | $0.30 | | + | 50 | Ministral 3 8B | gpt-oss-20b | 86.13% | 12.33 | $0.05 | | + | 51 | Qwen3 32B | Ministral 3 8B | 86.10% | 17.57 | $0.21 | | + | 52 | Qwen3 32B | Kimi K2.5 | 85.94% | 13.50 | $0.78 | | + | 53 | Qwen3 32B | gpt-oss-20b | 85.86% | 6.43 | $0.49 | | + | 54 | Ministral 3 8B | Qwen3 32B | 85.80% | 9.41 | $0.04 | | + | 55 | Qwen3 32B | Qwen3 32B | 84.82% | 5.98 | $0.62 | | + | 56 | Kimi K2.5 | Claude 3 Haiku | 80.41% | 35.09 | $0.98 | | + | 57 | Qwen3 32B | Claude 3 Haiku | 80.00% | 7.86 | $0.67 | | + | 58 | Qwen3 Next 80B A3B | Claude 3 Haiku | 80.00% | 35.17 | $0.59 | | + | 59 | Qwen3 Next 80B A3B | Claude Opus 4.6 | 78.00% | 31.01 | $2.96 | | + | 60 | Kimi K2.5 | Ministral 3 8B | 77.84% | 40.79 | $0.97 | | + | 61 | Kimi K2.5 | Qwen3 Next 80B A3B | 77.20% | 37.64 | $1.00 | | + | 62 | Qwen3 Next 80B A3B | Ministral 3 8B | 77.00% | 38.55 | $0.55 | | + | 63 | Qwen3 Next 80B A3B | Claude Haiku 4.5 | 76.50% | 32.33 | $1.21 | | + | 64 | Qwen3 Next 80B A3B | gpt-oss-120b | 76.50% | 34.72 | $0.52 | | + | 65 | Qwen3 Next 80B A3B | Qwen3 32B | 76.00% | 30.64 | $0.42 | | + | 66 | Qwen3 Next 80B A3B | Qwen3 Next 80B 
A3B | 76.00% | 36.44 | $0.54 | | + | 67 | Qwen3 Next 80B A3B | Kimi K2.5 | 75.50% | 36.37 | $0.79 | | + | 68 | Qwen3 Next 80B A3B | gpt-oss-20b | 75.00% | 32.70 | $0.48 | | + | 69 | Kimi K2.5 | gpt-oss-120b | 74.49% | 32.23 | $0.95 | | + | 70 | Kimi K2.5 | gpt-oss-20b | 74.09% | 25.65 | $0.77 | | + | 71 | Kimi K2.5 | Kimi K2.5 | 73.58% | 44.39 | $1.34 | | + | 72 | Kimi K2.5 | Claude Opus 4.6 | 73.33% | 28.62 | $2.79 | | + | 73 | Kimi K2.5 | Claude Haiku 4.5 | 73.20% | 26.98 | $1.36 | | + | 74 | Claude 3 Haiku | gpt-oss-120b | 72.19% | 8.39 | $0.32 | | + | 75 | Kimi K2.5 | Qwen3 32B | 72.16% | 30.32 | $0.92 | | + | 76 | Claude 3 Haiku | gpt-oss-20b | 71.43% | 8.42 | $0.32 | | + | 77 | Claude 3 Haiku | Qwen3 Next 80B A3B | 71.07% | 17.12 | $0.39 | | + | 78 | Claude 3 Haiku | Kimi K2.5 | 71.01% | 14.23 | $0.53 | | + | 79 | Claude 3 Haiku | Ministral 3 8B | 69.28% | 12.40 | $0.32 | | + | 80 | Claude 3 Haiku | Qwen3 32B | 59.30% | 6.29 | $0.29 | | + | 81 | Claude 3 Haiku | Claude 3 Haiku | 54.37% | 7.28 | $0.30 | | ### Selector Comparison | Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings | |:---------|:----------|:--------------|:------------|:-----|:--------| -| Brute Force | 100% | 98.83% | 14,855 | $113.01 | -- | -| Arm Elimination | 96% | 98.80% | 3,632 | $61.22 | **46%** | -| Random Search | 28% | 98.04% | 3,850 | $28.83 | 74% | -| Hill Climbing | 72% | 97.81% | 4,058 | $45.72 | 60% | -| Epsilon LUCB | 0% | 97.46% | 443 | $5.55 | 95% | -| LM Proposal | 0% | 96.87% | 149 | $5.15 | 95% | -| Bayesian Opt | 4% | 95.39% | 3,608 | $31.05 | 73% | -| Threshold SE | 0% | 77.23% | 369 | $1.95 | 98% | +| Brute Force | 100% | 98.84% | 14,961 | $123.87 | -- | +| Arm Elimination | 86% | 98.83% | 3,356 | $51.86 | **58%** | +| Hill Climbing | 80% | 98.76% | 3,926 | $54.22 | 56% | +| Random Search | 28% | 98.17% | 3,880 | $31.77 | 74% | +| Epsilon LUCB | 4% | 96.99% | 447 | $6.10 | 95% | +| LM Proposal | 0% | 95.82% | 158 | $5.61 | 95% | +| Bayesian Opt | 4% | 
95.41% | 3,666 | $35.56 | 71% | +| Threshold SE | 0% | 74.52% | 1,355 | $6.90 | 94% | diff --git a/docs/blog/posts/technical-deep-dive.md b/docs/blog/posts/technical-deep-dive.md index ab3cb19..4a34834 100644 --- a/docs/blog/posts/technical-deep-dive.md +++ b/docs/blog/posts/technical-deep-dive.md @@ -59,11 +59,12 @@ And the impact is enormous. Here's what we found across three benchmarks, compar | Benchmark | Expensive Combo | Acc | Cost | Budget Combo | Acc | Cost | Savings | |-----------|----------------|-----|------|-------------|-----|------|---------| -| HotpotQA | Opus + Opus | ~73% | $2.71 | Qwen3 Next + gpt-oss-120b | 71.3% | $0.13 | **21x** | -| MathQA | Opus + Opus | ~98.5% | $5.89 | Ministral + C3 Haiku | 94.0% | $0.05 | **118x** | -| BFCL | Opus | 72% | $60.78 | Qwen3 Next | 71% | $1.87 | **32x** | +| GPQA | Opus | 74.75% | $2.47 | gpt-oss-120b | 68.18% | $0.19 | **13x** | +| HotpotQA | Opus + Opus | ~32% | $2.02 | Qwen3 Next + gpt-oss-120b | 71.8% | $0.13 | **16x** | +| MathQA | Opus + Haiku 4.5 | 98.8% | $6.19 | gpt-oss-20b + Kimi | 94.6% | $0.26 | **24x** | +| BFCL | Opus | 70% | $60.13 | Qwen3 Next | 70% | $1.90 | **32x** | -These are real numbers from real benchmarks. Same accuracy band, 20-100x cost difference. No amount of caching or request batching can close a 32x gap. The model choice *is* the optimization. +These are real numbers from real benchmarks. Same accuracy band, 13-32x cost difference. No amount of caching or request batching can close a 32x gap. The model choice *is* the optimization. ## Agent Routing Is Not LLM Routing @@ -83,7 +84,7 @@ And the results prove why this matters. On HotpotQA (multi-hop question answerin **The weakest planner + the strongest solver beats the strongest planner + any solver.** -Ministral 3 8B (the cheapest, smallest model) as planner paired with Claude Opus as solver achieves 74.8% accuracy. Claude Opus as *both* planner and solver? Only ~73%. Why? 
Because Opus as planner is *too capable*: it answers the question directly, bypassing the solver's search tools entirely. The "worse" planner correctly delegates to the tool-augmented solver, producing better results. +Ministral 3 8B (the cheapest, smallest model) as planner paired with Claude Opus as solver achieves 74.3% accuracy. Claude Opus as *both* planner and solver? Only ~32%. Why? Because Opus as planner is *too capable*: it answers the question directly, bypassing the solver's search tools entirely. The "worse" planner correctly delegates to the tool-augmented solver, producing better results. You'd never find this by picking "the best model" for each layer independently. The best combo doesn't contain the best individual models. This is the credit assignment problem in action. @@ -128,14 +129,14 @@ The catch: Hill Climbing requires **topology information**. It needs a notion of ### How Much Do These Save? -Across our four benchmarks, Arm Elimination consistently achieves near-optimal accuracy while using 40-60% less budget than brute force: +Across our four benchmarks, Arm Elimination consistently achieves near-optimal accuracy while using up to 67% less budget than brute force: | Benchmark | Brute Force Accuracy | Arm Elimination Accuracy | Cumulative cost savings | |-----------|---------------------|------------------------|-------------| -| HotpotQA | 74.78% | 74.12% | 64% | -| GPQA | 80.30% | 80.14% | 49% | -| MathQA | 98.83% | 98.80% | 46% | -| BFCL | 72.00% | 72.00% | 11% | +| HotpotQA | 74.27% | 73.19% | 67% | +| MathQA | 98.84% | 98.83% | 58% | +| GPQA | 74.75% | 74.10% | 24% | +| BFCL | 70.00% | 69.37% | 12% | Nearly identical accuracy to exhaustive search, at roughly half the cumulative evaluation cost. These algorithms don't just save budget. They find the right combo with statistical guarantees. 
@@ -154,32 +155,34 @@ We validated AgentOpt across four diverse benchmarks using 9 models on Amazon Be | Benchmark | Best Combo | Why It's Surprising | |-----------|-----------|-------------------| -| HotpotQA | Ministral 3 8B + Opus | Weakest planner wins. Opus as planner bypasses search tools | -| MathQA | Opus + Qwen3 Next | Critic barely matters. Opus solves math correctly on the first try | -| BFCL | Opus (single) | Qwen3 Next ties at 32x lower cost. Statistical difference is ~1% | -| GPQA | Opus (single) | Straightforward. Raw capability wins for grad-level science | +| HotpotQA | Ministral 3 8B + Opus | Weakest planner wins. Opus as planner bypasses search tools and scores only ~32% | +| MathQA | Opus + Haiku 4.5 | Critic barely matters. Opus solves math correctly on the first try | +| BFCL | Opus / Kimi / Qwen3 Next (tied) | Three models tie at 70%. Qwen3 Next costs 32x less than Opus | +| GPQA | Opus | Kimi is within 2pp at less than half the cost | ### Algorithm Comparison (50 random seeds each)
-| Algorithm | HotpotQA | GPQA | MathQA | BFCL |
+| Algorithm | GPQA | BFCL | HotpotQA | MathQA |
|---|---|---|---|---|
-| Brute Force | 74.78% / 0% | 80.30% / 0% | 98.83% / 0% | 72.00% / 0% |
-| Arm Elimination | 74.12% / 64% | 80.14% / 49% | 98.80% / 46% | 72.00% / 11% |
-| Hill Climbing | 73.38% / 63% | 79.64% / 9% | 97.81% / 60% | 71.94% / 6% |
-| Bayesian Opt | 72.78% / 76% | 75.41% / 44% | 95.39% / 73% | 70.61% / 40% |
-| Random Search | 72.34% / 74% | 70.53% / 63% | 98.04% / 74% | 67.99% / 63% |
-| Epsilon-LUCB | 69.96% / 96% | 79.47% / 44% | 97.46% / 95% | 71.33% / 50% |
-| Threshold SE | 63.62% / 93% | 65.88% / 94% | 77.23% / 98% | 57.52% / 92% |
-| LM Proposal | 34.41% / 96% | 80.30% / 41% | 96.87% / 95% | 45.00% / 96% |
+| Brute Force | 74.75% / 0% | 70.00% / 0% | 74.27% / 0% | 98.84% / 0% |
+| Arm Elimination | 74.10% / 24% | 69.37% / 12% | 73.19% / 67% | 98.83% / 58% |
+| Hill Climbing | 74.55% / 14% | 70.00% / 15% | 73.13% / 63% | 98.76% / 56% |
+| Bayesian Opt | 72.43% / 45% | 69.27% / 40% | 73.33% / 76% | 95.41% / 71% |
+| Random Search | 68.57% / 63% | 67.13% / 63% | 72.25% / 74% | 98.17% / 74% |
+| Epsilon-LUCB | 73.14% / 47% | 69.90% / 53% | 69.71% / 97% | 96.99% / 95% |
+| Threshold SE | 57.83% / 62% | 58.19% / 78% | 65.42% / 88% | 74.52% / 94% |
+| LM Proposal | 74.75% / 48% | 34.13% / 97% | 44.03% / 96% | 95.82% / 96% |
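The selector tables above compare budget-aware search strategies against brute force. As a rough illustration of the Arm Elimination idea only (a minimal successive-elimination sketch, not AgentOpt's actual implementation; the combo names and accuracies below are made up for the demo):

```python
import math
import random

def arm_elimination(arms, evaluate, batch=20, rounds=10, delta=0.05):
    """Successive elimination sketch: sample every surviving combo in
    batches, then drop any combo whose upper confidence bound falls below
    the best combo's lower confidence bound (Hoeffding-style radius)."""
    stats = {a: [0, 0] for a in arms}           # arm -> [successes, trials]
    alive = list(arms)

    def bounds(a):
        wins, n = stats[a]
        mean = wins / n
        radius = math.sqrt(math.log(2 * len(arms) * rounds / delta) / (2 * n))
        return mean - radius, mean + radius

    for _ in range(rounds):
        if len(alive) == 1:
            break
        for a in alive:                          # budget spent only on survivors
            for _ in range(batch):
                stats[a][0] += evaluate(a)       # 1 if the sample was solved
                stats[a][1] += 1
        best_lower = max(bounds(a)[0] for a in alive)
        alive = [a for a in alive if bounds(a)[1] >= best_lower]
    return max(alive, key=lambda a: stats[a][0] / stats[a][1])

# Demo on three hypothetical combos with invented per-sample accuracies.
random.seed(0)
true_acc = {"opus": 0.80, "haiku": 0.60, "ministral": 0.40}
best = arm_elimination(list(true_acc),
                       lambda a: int(random.random() < true_acc[a]),
                       batch=30, rounds=8)
print(best)
```

Weak combos stop consuming evaluation budget as soon as the statistics rule them out, which is why the tables show near-brute-force accuracy at a fraction of the evaluations.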