# Reforge — Thesis Benchmark: Experiment Configuration & Execution

This notebook generates, validates, budgets, and executes the thesis benchmark experiments.

### Design
- **3 models** × **3 tiers** (GOLD/SILVER/BRONZE) × **4 opt levels** (O0–O3) × **L2 context** × **top-3 JSON output** = **36 experiments**
- Models: gpt-4o-mini (instant), deepseek-v3-0324 (coding), gpt-5.1 (thinking)
- Top-k(3): ranked candidate names with confidence, enables analyst-shortlist analysis
- All v2 prompt templates, no function limit

### Model-Aware Routing
> The LLM runner uses `workers.llm.model_router` to automatically adapt API
> calls per provider: response_format handling, Anthropic beta headers,
> DeepSeek reasoning token stripping, and `require_parameters` provider routing.

### Non-Negotiable Constraint
> **LLM input MUST contain ONLY Ghidra-derived artefacts.**  
> Ground truth (DWARF / source / join metadata) is used ONLY post-hoc for scoring.  
> The `/llm/functions` endpoint enforces this by construction.

### Prerequisites
- Docker stack running: `docker compose up -d` in `reforge/docker/`
- Services: `api` (port 8080), `redis`, `postgres`
- `OPENROUTER_API_KEY` set in `docker/.env`

## §1 — Setup & Health Check

In [None]:
import sys, os, json, time, importlib, textwrap
from pathlib import Path
from datetime import datetime
from collections import Counter

import requests

# Ensure reforge root is on sys.path
REFORGE_ROOT = Path(".").resolve().parent
if str(REFORGE_ROOT) not in sys.path:
    sys.path.insert(0, str(REFORGE_ROOT))

API = "http://localhost:8080"
OPENROUTER_KEY = os.environ.get(
    "OPENROUTER_API_KEY",
    "sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
)

def api(path, **kw):  return requests.get(f"{API}{path}", **kw).json()
def post(url, body):  return requests.post(url, json=body).json()

health = api("/health")
print(f"API: {health}")
print(f"Key: ...{OPENROUTER_KEY[-8:]}")

API: {'status': 'healthy', 'service': 'reforge-api', 'version': '0.1.0'}
Key: ...e0ea6829


## §2 — Review Legacy Experiments

In [2]:
experiments = api("/data/experiments")

legacy = [e for e in experiments if e.get('status') == 'legacy']
active = [e for e in experiments if e.get('status') != 'legacy']

print(f"Total experiments: {len(experiments)}")
print(f"Legacy (pilot):    {len(legacy)}")
print(f"Active:            {len(active)}")
print()
for e in legacy:
    print(f"  OLD  {e['id']:50s}  {e['model']}")
print()
for e in active[:20]:
    status_icon = {'ready': 'RDY', 'draft': 'DFT', 'running': 'RUN', 'completed': 'DON'}.get(e['status'], '???')
    print(f"  {status_icon}  {e['id']:50s}  {e['model']:35s}  {e.get('context_level', 'L0')}")

Total experiments: 53
Legacy (pilot):    5
Active:            48

  OLD  exp01_funcnaming_gpt4omini_gold_O0                  openai/gpt-4o-mini
  OLD  exp02_funcnaming_gpt4o_gold_O0                      openai/gpt-4o
  OLD  exp03_funcnaming_claude_gold_O0                     anthropic/claude-3.5-sonnet
  OLD  exp04_funcnaming_gpt4omini_gold_O2                  openai/gpt-4o-mini
  OLD  exp05_funcnaming_gpt4omini_silver_O0                openai/gpt-4o-mini

  RDY  bench_gpt4o-mini_gold_O0_L2_topk3                   openai/gpt-4o-mini                   L2
  RDY  bench_gpt4o-mini_gold_O1_L2_topk3                   openai/gpt-4o-mini                   L2
  RDY  bench_gpt4o-mini_gold_O2_L2_topk3                   openai/gpt-4o-mini                   L2
  RDY  bench_gpt4o-mini_gold_O3_L2_topk3                   openai/gpt-4o-mini                   L2
  RDY  bench_gpt4o-mini_silver_O0_L2_topk3                 openai/gpt-4o-mini                   L2
  RDY  bench_gpt4o-mini_silver_O1_L2_topk3  

## §3 — Build the Thesis Benchmark Matrix

Generate all 36 experiment configs:
- 3 models x 3 tiers x 4 opts x L2 x top-k(3)

In [2]:
from data.experiments import (
    build_thesis_matrix,
    build_benchmark_matrix,
    estimate_benchmark_cost,
    THESIS_MODELS,
    BENCHMARK_TIERS,
    BENCHMARK_OPTS,
    REGISTRY,
)

# Build the thesis matrix (36 experiments)
matrix = build_thesis_matrix(register=True)

print(f"Thesis benchmark matrix: {len(matrix)} experiments")
print(f"  Models:  {list(THESIS_MODELS.keys())}")
print(f"  Tiers:   GOLD, SILVER, BRONZE")
print(f"  Opts:    O0, O1, O2, O3")
print(f"  Context: L2 (max Ghidra structural data)")
print(f"  Top-k:   3 (JSON structured output)")
print()
for cfg in matrix:
    print(f"  {cfg.id:55s} -> {cfg.model:35s} top_k={cfg.top_k}")

# Push matrix to the API server
print()
print("Registering experiments with API server...")
resp = requests.post(
    f"{API}/data/experiments/bulk",
    json=[cfg.model_dump() for cfg in matrix],
)
if resp.status_code in (200, 201):
    reg = resp.json()
    print(f"  Registered {reg['registered']} experiments ({reg['created']} new, {reg['updated']} updated)")
else:
    print(f"  Registration failed: {resp.status_code} - {resp.text}")

Thesis benchmark matrix: 84 experiments
  Models:  ['gpt4o-mini', 'deepseek-v3', 'claude-sonnet45', 'llama31-70b', 'deepseek-r1', 'qwen3-coder', 'gpt51']
  Tiers:   GOLD, SILVER, BRONZE
  Opts:    O0, O1, O2, O3
  Context: L2 (max Ghidra structural data)
  Top-k:   3 (JSON structured output)

  bench_gpt4o-mini_gold_O0_L2_topk3                       -> openai/gpt-4o-mini                  top_k=3
  bench_gpt4o-mini_gold_O1_L2_topk3                       -> openai/gpt-4o-mini                  top_k=3
  bench_gpt4o-mini_gold_O2_L2_topk3                       -> openai/gpt-4o-mini                  top_k=3
  bench_gpt4o-mini_gold_O3_L2_topk3                       -> openai/gpt-4o-mini                  top_k=3
  bench_gpt4o-mini_silver_O0_L2_topk3                     -> openai/gpt-4o-mini                  top_k=3
  bench_gpt4o-mini_silver_O1_L2_topk3                     -> openai/gpt-4o-mini                  top_k=3
  bench_gpt4o-mini_silver_O2_L2_topk3                     -> openai/gpt-4o-m

## §4 — Budget Estimation

In [4]:
# Count functions per tier/opt
func_counts = {}
for tier in ["GOLD", "SILVER", "BRONZE"]:
    for opt_lvl in ["O0", "O1", "O2", "O3"]:
        try:
            fns = api("/llm/functions", params={"opt": opt_lvl, "tier": tier, "limit": 5000})
            func_counts[(tier, opt_lvl)] = len(fns)
        except Exception:
            func_counts[(tier, opt_lvl)] = 0

total_functions_per_exp = sum(func_counts.values()) // max(len(func_counts), 1)
print("Functions per (tier, opt):")
for (t, o), n in sorted(func_counts.items()):
    print(f"  {t:8s} {o:3s}: {n:4d} functions")
print(f"\nAverage per experiment: ~{total_functions_per_exp} functions")
print()

# top-k uses ~80 completion tokens (JSON overhead) vs ~20 for single-name
est = estimate_benchmark_cost(
    matrix,
    avg_prompt_tokens=800,
    avg_completion_tokens=80,
    functions_per_experiment=max(total_functions_per_exp, 50),
)

print(f"Total experiments:   {est['total_experiments']}")
print(f"Total LLM calls:    {est['total_calls']:,}")
print(f"Total input tokens:  {est['total_input_tokens']:,}")
print(f"Estimated cost:      ${est['estimated_cost_usd']:.2f}")
print()

BUDGET = 300.0
if est['estimated_cost_usd'] > BUDGET:
    print(f"OVER BUDGET (${BUDGET:.0f}). Narrow the matrix or reduce models.")
else:
    print(f"Within ${BUDGET:.0f} budget - ${BUDGET - est['estimated_cost_usd']:.2f} remaining")

# Per-model breakdown
model_costs = {}
for item in est['breakdown']:
    m = item['model']
    model_costs[m] = model_costs.get(m, 0) + item['est_cost_usd']
print("\nPer-model cost:")
for m, c in sorted(model_costs.items(), key=lambda x: -x[1]):
    print(f"  ${c:8.2f}  {m}")

Functions per (tier, opt):
  BRONZE   O0 :    0 functions
  BRONZE   O1 :   17 functions
  BRONZE   O2 :   30 functions
  BRONZE   O3 :   33 functions
  GOLD     O0 :  197 functions
  GOLD     O1 :  121 functions
  GOLD     O2 :   92 functions
  GOLD     O3 :   85 functions
  SILVER   O0 :   29 functions
  SILVER   O1 :   23 functions
  SILVER   O2 :   32 functions
  SILVER   O3 :   34 functions

Average per experiment: ~57 functions

Total experiments:   48
Total LLM calls:    2,736
Total input tokens:  2,188,800
Estimated cost:      $1.61

Within $300 budget - $298.39 remaining

Per-model cost:
  $    1.37  anthropic/claude-sonnet-4.5
  $    0.08  meta-llama/llama-3.1-70b-instruct
  $    0.08  deepseek/deepseek-r1-0528
  $    0.08  qwen/qwen3-coder


## §5 — Dry-Run Validation

Validate prompt rendering, JSON parsing, and API round-trip **without** making real LLM calls.

In [3]:
## Preview: render one full top-k prompt at L2

from workers.llm.prompt import load_template, render_prompt
from workers.llm.response_parser import parse_topk_response

# Fetch one L2 function from the API
sample = api("/llm/functions", params={
    "opt": "O0",
    "tier": "GOLD",
    "context_level": "L2",
    "limit": 1,
})

if not sample:
    print("No functions returned - is the data loaded?")
else:
    fn = sample[0]
    print(f"Function: {fn['dwarf_function_id']}")
    print(f"  ghidra_name:  {fn.get('ghidra_name')}")
    print(f"  test_case:    {fn.get('test_case')}")
    print()

    # Load the top-k L2 template and render
    template = load_template("function_naming_topk_L2")
    prompt = render_prompt(
        template,
        fn.get("c_raw", ""),
        calls=fn.get("calls_text"),
        cfg_summary=fn.get("cfg_text"),
        variables=fn.get("variables_text"),
    )

    print("=" * 80)
    print("FULL RENDERED PROMPT (top-k L2)")
    print("=" * 80)
    print(prompt)
    print("=" * 80)
    print(f"\nPrompt length: {len(prompt)} chars  (~{len(prompt)//4} tokens)")

    # Test the parser on a simulated response
    print("\n-- Parser test --")
    fake_response = '{"predictions": [{"name": "parse_header", "confidence": 0.9}, {"name": "read_input", "confidence": 0.6}, {"name": "process_data", "confidence": 0.3}]}'
    parsed = parse_topk_response(fake_response)
    print(f"  parse_ok: {parsed.parse_ok}")
    print(f"  predictions: {parsed.predictions}")
    print(f"  top-1: {parsed.predictions[0]['name']}")

Function: cu0x0:die0x12d
  ghidra_name:  FUN_004011fa
  test_case:    t01_crossfile_calls

FULL RENDERED PROMPT (top-k L2)
You are an expert reverse engineer analyzing decompiled binary code.

A function has been decompiled from a stripped binary using Ghidra. The original
symbol names have been removed by the strip tool. Your task is to analyze the
decompiled C code along with its structural metadata, then suggest multiple
candidate function names ranked by your confidence.

Guidelines:
- Use snake_case naming convention
- Be specific but concise — prefer 2-4 words
- Focus on the function's PURPOSE, not its implementation details
- Use call relationships and variable information to understand context
- Consider control-flow complexity when reasoning about function role
- If the function is a standard library wrapper, name it accordingly
- If you cannot determine the purpose, use a descriptive structural name

Return EXACTLY 3 candidate names ranked from most to least confident.
Respon

In [5]:
import asyncio
from workers.llm.runner import run_experiment

# Pick one experiment for dry-run validation
dry_run_id = matrix[0].id

print(f"DRY RUN: {dry_run_id}")
print("=" * 60)
summary = await run_experiment(
    dry_run_id,
    api_base=API,
    dry_run=True,
)
print(f"  Total: {summary['total']}, New: {summary['new']}, Errors: {summary['errors']}")
print(f"  {'PASS' if summary['errors'] == 0 else 'FAIL'}")

DRY RUN: bench_claude-sonnet45_gold_O0_L2_topk3
  Total: 197, New: 197, Errors: 0
  PASS


## §6 — Execute Experiments

### Execution Strategy
- **Phase 1**: gpt-4o-mini (12 experiments) — cheapest, validates full pipeline
- **Phase 2**: deepseek-v3-0324 (12 experiments) — code specialist (replaces discontinued coder-v2)
- **Phase 3**: gpt-5.1 (12 experiments) — premium thinking model

Each phase can be run independently. Results are idempotent (resume support).
The runner now pre-checks model availability and uses model-aware routing.

In [7]:
# Phase selector - uncomment ONE phase at a time

    # "claude-sonnet45":  "anthropic/claude-sonnet-4.5", 
    # "llama31-70b":      "meta-llama/llama-3.1-70b-instruct",
    # "deepseek-r1":      "deepseek/deepseek-r1-0528",
    # "qwen3-coder":      "qwen/qwen3-coder",


# Phase 1: cheap (validate pipeline)
#phase_filter = {"gpt4o-mini"}
# Phase 2: code specialist (was deepseek-coder2, now deepseek-v3)
# phase_filter = {"deepseek-v3"}
# Phase 3: premium
# phase_filter = {"gpt51"}
# All at once:
phase_filter = {"qwen3-coder", "deepseek-r1", "llama31-70b", "claude-sonnet45"}


phase_experiments = [
    cfg for cfg in matrix
    if any(label in cfg.id for label in phase_filter)
]
print(f"Phase experiments: {len(phase_experiments)}")
for cfg in phase_experiments:
    print(f"  {cfg.id}")

Phase experiments: 48
  bench_claude-sonnet45_gold_O0_L2_topk3
  bench_claude-sonnet45_gold_O1_L2_topk3
  bench_claude-sonnet45_gold_O2_L2_topk3
  bench_claude-sonnet45_gold_O3_L2_topk3
  bench_claude-sonnet45_silver_O0_L2_topk3
  bench_claude-sonnet45_silver_O1_L2_topk3
  bench_claude-sonnet45_silver_O2_L2_topk3
  bench_claude-sonnet45_silver_O3_L2_topk3
  bench_claude-sonnet45_bronze_O0_L2_topk3
  bench_claude-sonnet45_bronze_O1_L2_topk3
  bench_claude-sonnet45_bronze_O2_L2_topk3
  bench_claude-sonnet45_bronze_O3_L2_topk3
  bench_llama31-70b_gold_O0_L2_topk3
  bench_llama31-70b_gold_O1_L2_topk3
  bench_llama31-70b_gold_O2_L2_topk3
  bench_llama31-70b_gold_O3_L2_topk3
  bench_llama31-70b_silver_O0_L2_topk3
  bench_llama31-70b_silver_O1_L2_topk3
  bench_llama31-70b_silver_O2_L2_topk3
  bench_llama31-70b_silver_O3_L2_topk3
  bench_llama31-70b_bronze_O0_L2_topk3
  bench_llama31-70b_bronze_O1_L2_topk3
  bench_llama31-70b_bronze_O2_L2_topk3
  bench_llama31-70b_bronze_O3_L2_topk3
  bench_de

In [8]:
# Execute the selected phase
# Each experiment uses internal async concurrency (5 parallel LLM calls).

from workers.llm.runner import run_experiment

summaries = []
errors_total = 0

for i, cfg in enumerate(phase_experiments, 1):
    print(f"\n[{i}/{len(phase_experiments)}] {cfg.id}")
    print(f"  Model: {cfg.model}  Tier: {cfg.tier}  Opt: {cfg.opt}  Top-k: {cfg.top_k}")
    
    try:
        summary = await run_experiment(
            cfg.id,
            api_base=API,
            openrouter_key=OPENROUTER_KEY,
            concurrency=5,
        )
        summaries.append(summary)
        
        new = summary.get('new', 0)
        errs = summary.get('errors', 0)
        errors_total += errs
        status = 'OK' if errs == 0 else 'WARN'
        print(f"  {status} Completed: {summary.get('completed', 0)}, New: {new}, Errors: {errs}")
    except Exception as exc:
        print(f"  FAILED: {exc}")
        errors_total += 1
        summaries.append({"experiment_id": cfg.id, "error": str(exc)})

print(f"\n{'='*60}")
print(f"Phase complete: {len(summaries)} experiments, {errors_total} total errors")


[1/48] bench_claude-sonnet45_gold_O0_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: GOLD  Opt: O0  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 197/197 [01:41<00:00,  1.95fn/s]


  OK Completed: 196, New: 196, Errors: 0

[2/48] bench_claude-sonnet45_gold_O1_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: GOLD  Opt: O1  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 121/121 [01:02<00:00,  1.93fn/s]


  OK Completed: 121, New: 121, Errors: 0

[3/48] bench_claude-sonnet45_gold_O2_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: GOLD  Opt: O2  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 92/92 [00:47<00:00,  1.92fn/s]


  OK Completed: 92, New: 92, Errors: 0

[4/48] bench_claude-sonnet45_gold_O3_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: GOLD  Opt: O3  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 85/85 [00:44<00:00,  1.91fn/s]


  OK Completed: 85, New: 85, Errors: 0

[5/48] bench_claude-sonnet45_silver_O0_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: SILVER  Opt: O0  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 29/29 [00:15<00:00,  1.83fn/s]


  OK Completed: 29, New: 29, Errors: 0

[6/48] bench_claude-sonnet45_silver_O1_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: SILVER  Opt: O1  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 23/23 [00:12<00:00,  1.79fn/s]


  OK Completed: 23, New: 23, Errors: 0

[7/48] bench_claude-sonnet45_silver_O2_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: SILVER  Opt: O2  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 32/32 [00:17<00:00,  1.78fn/s]


  OK Completed: 32, New: 32, Errors: 0

[8/48] bench_claude-sonnet45_silver_O3_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: SILVER  Opt: O3  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 34/34 [00:20<00:00,  1.66fn/s]


  OK Completed: 34, New: 34, Errors: 0

[9/48] bench_claude-sonnet45_bronze_O0_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: BRONZE  Opt: O0  Top-k: 3


No functions to process — exiting


  OK Completed: 0, New: 0, Errors: 0

[10/48] bench_claude-sonnet45_bronze_O1_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: BRONZE  Opt: O1  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 17/17 [00:09<00:00,  1.73fn/s]


  OK Completed: 17, New: 17, Errors: 0

[11/48] bench_claude-sonnet45_bronze_O2_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: BRONZE  Opt: O2  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 30/30 [00:15<00:00,  1.89fn/s]


  OK Completed: 30, New: 30, Errors: 0

[12/48] bench_claude-sonnet45_bronze_O3_L2_topk3
  Model: anthropic/claude-sonnet-4.5  Tier: BRONZE  Opt: O3  Top-k: 3


LLM calls (anthropic/claude-sonnet-4.5): 100%|██████████| 33/33 [00:18<00:00,  1.78fn/s]


  OK Completed: 33, New: 33, Errors: 0

[13/48] bench_llama31-70b_gold_O0_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: GOLD  Opt: O0  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 197/197 [01:33<00:00,  2.11fn/s]


  OK Completed: 196, New: 196, Errors: 0

[14/48] bench_llama31-70b_gold_O1_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: GOLD  Opt: O1  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 121/121 [00:55<00:00,  2.16fn/s]


  OK Completed: 121, New: 121, Errors: 0

[15/48] bench_llama31-70b_gold_O2_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: GOLD  Opt: O2  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 92/92 [00:47<00:00,  1.94fn/s]


  OK Completed: 92, New: 92, Errors: 0

[16/48] bench_llama31-70b_gold_O3_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: GOLD  Opt: O3  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 85/85 [00:39<00:00,  2.15fn/s]


  OK Completed: 85, New: 85, Errors: 0

[17/48] bench_llama31-70b_silver_O0_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: SILVER  Opt: O0  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 29/29 [00:22<00:00,  1.31fn/s]


  OK Completed: 29, New: 29, Errors: 0

[18/48] bench_llama31-70b_silver_O1_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: SILVER  Opt: O1  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 23/23 [00:11<00:00,  2.00fn/s]


  OK Completed: 23, New: 23, Errors: 0

[19/48] bench_llama31-70b_silver_O2_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: SILVER  Opt: O2  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 32/32 [00:13<00:00,  2.30fn/s]


  OK Completed: 32, New: 32, Errors: 0

[20/48] bench_llama31-70b_silver_O3_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: SILVER  Opt: O3  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 34/34 [00:28<00:00,  1.20fn/s]


  OK Completed: 34, New: 34, Errors: 0

[21/48] bench_llama31-70b_bronze_O0_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: BRONZE  Opt: O0  Top-k: 3


No functions to process — exiting


  OK Completed: 0, New: 0, Errors: 0

[22/48] bench_llama31-70b_bronze_O1_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: BRONZE  Opt: O1  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 17/17 [00:08<00:00,  1.92fn/s]


  OK Completed: 17, New: 17, Errors: 0

[23/48] bench_llama31-70b_bronze_O2_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: BRONZE  Opt: O2  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 30/30 [00:17<00:00,  1.73fn/s]


  OK Completed: 30, New: 30, Errors: 0

[24/48] bench_llama31-70b_bronze_O3_L2_topk3
  Model: meta-llama/llama-3.1-70b-instruct  Tier: BRONZE  Opt: O3  Top-k: 3


LLM calls (meta-llama/llama-3.1-70b-instruct): 100%|██████████| 33/33 [00:19<00:00,  1.73fn/s]


  OK Completed: 33, New: 33, Errors: 0

[25/48] bench_deepseek-r1_gold_O0_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: GOLD  Opt: O0  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 197/197 [37:00<00:00, 11.27s/fn] 


  OK Completed: 196, New: 196, Errors: 0

[26/48] bench_deepseek-r1_gold_O1_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: GOLD  Opt: O1  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528):  86%|████████▌ | 104/121 [37:16<06:33, 23.17s/fn]LLM call failed for cu0x1167:die0x1442: peer closed connection without sending complete message body (incomplete chunked read)
LLM calls (deepseek/deepseek-r1-0528):  90%|█████████ | 109/121 [39:39<04:44, 23.68s/fn]LLM call failed for cu0x87f:die0xd40: peer closed connection without sending complete message body (incomplete chunked read)
LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 121/121 [46:28<00:00, 23.05s/fn]


  WARN Completed: 119, New: 119, Errors: 2

[27/48] bench_deepseek-r1_gold_O2_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: GOLD  Opt: O2  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528):  37%|███▋      | 34/92 [15:56<29:03, 30.07s/fn] LLM call failed for cu0x1391:die0x166c: peer closed connection without sending complete message body (incomplete chunked read)
LLM calls (deepseek/deepseek-r1-0528):  38%|███▊      | 35/92 [16:24<27:56, 29.41s/fn]LLM call failed for cu0x0:die0x547: peer closed connection without sending complete message body (incomplete chunked read)
LLM calls (deepseek/deepseek-r1-0528):  92%|█████████▏| 85/92 [35:06<02:03, 17.66s/fn]LLM call failed for cu0xda4:die0x10ef: peer closed connection without sending complete message body (incomplete chunked read)
LLM calls (deepseek/deepseek-r1-0528):  93%|█████████▎| 86/92 [35:49<02:32, 25.35s/fn]LLM call failed for cu0x0:die0x41c: peer closed connection without sending complete message body (incomplete chunked read)
LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 92/92 [40:58<00:00, 26.72s/fn]


  WARN Completed: 88, New: 88, Errors: 4

[28/48] bench_deepseek-r1_gold_O3_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: GOLD  Opt: O3  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528):  80%|████████  | 68/85 [20:22<04:32, 16.06s/fn]LLM call failed for cu0x7cc:die0x9fd: peer closed connection without sending complete message body (incomplete chunked read)
LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 85/85 [35:18<00:00, 24.93s/fn] 


  WARN Completed: 84, New: 84, Errors: 1

[29/48] bench_deepseek-r1_silver_O0_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: SILVER  Opt: O0  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 29/29 [11:19<00:00, 23.43s/fn]


  OK Completed: 29, New: 29, Errors: 0

[30/48] bench_deepseek-r1_silver_O1_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: SILVER  Opt: O1  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 23/23 [08:27<00:00, 22.08s/fn]


  OK Completed: 23, New: 23, Errors: 0

[31/48] bench_deepseek-r1_silver_O2_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: SILVER  Opt: O2  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 32/32 [17:23<00:00, 32.60s/fn]


  OK Completed: 32, New: 32, Errors: 0

[32/48] bench_deepseek-r1_silver_O3_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: SILVER  Opt: O3  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 34/34 [13:26<00:00, 23.73s/fn]


  OK Completed: 34, New: 34, Errors: 0

[33/48] bench_deepseek-r1_bronze_O0_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: BRONZE  Opt: O0  Top-k: 3


No functions to process — exiting


  OK Completed: 0, New: 0, Errors: 0

[34/48] bench_deepseek-r1_bronze_O1_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: BRONZE  Opt: O1  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 17/17 [04:09<00:00, 14.69s/fn]


  OK Completed: 17, New: 17, Errors: 0

[35/48] bench_deepseek-r1_bronze_O2_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: BRONZE  Opt: O2  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 30/30 [11:51<00:00, 23.73s/fn]


  OK Completed: 30, New: 30, Errors: 0

[36/48] bench_deepseek-r1_bronze_O3_L2_topk3
  Model: deepseek/deepseek-r1-0528  Tier: BRONZE  Opt: O3  Top-k: 3


LLM calls (deepseek/deepseek-r1-0528): 100%|██████████| 33/33 [12:01<00:00, 21.87s/fn]


  OK Completed: 33, New: 33, Errors: 0

[37/48] bench_qwen3-coder_gold_O0_L2_topk3
  Model: qwen/qwen3-coder  Tier: GOLD  Opt: O0  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 197/197 [00:32<00:00,  6.04fn/s]


  OK Completed: 196, New: 196, Errors: 0

[38/48] bench_qwen3-coder_gold_O1_L2_topk3
  Model: qwen/qwen3-coder  Tier: GOLD  Opt: O1  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 121/121 [00:17<00:00,  6.83fn/s]


  OK Completed: 121, New: 121, Errors: 0

[39/48] bench_qwen3-coder_gold_O2_L2_topk3
  Model: qwen/qwen3-coder  Tier: GOLD  Opt: O2  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 92/92 [00:13<00:00,  6.81fn/s]


  OK Completed: 92, New: 92, Errors: 0

[40/48] bench_qwen3-coder_gold_O3_L2_topk3
  Model: qwen/qwen3-coder  Tier: GOLD  Opt: O3  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 85/85 [00:12<00:00,  6.84fn/s]


  OK Completed: 85, New: 85, Errors: 0

[41/48] bench_qwen3-coder_silver_O0_L2_topk3
  Model: qwen/qwen3-coder  Tier: SILVER  Opt: O0  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 29/29 [00:04<00:00,  6.72fn/s]


  OK Completed: 29, New: 29, Errors: 0

[42/48] bench_qwen3-coder_silver_O1_L2_topk3
  Model: qwen/qwen3-coder  Tier: SILVER  Opt: O1  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 23/23 [00:03<00:00,  6.35fn/s]


  OK Completed: 23, New: 23, Errors: 0

[43/48] bench_qwen3-coder_silver_O2_L2_topk3
  Model: qwen/qwen3-coder  Tier: SILVER  Opt: O2  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 32/32 [00:05<00:00,  6.28fn/s]


  OK Completed: 32, New: 32, Errors: 0

[44/48] bench_qwen3-coder_silver_O3_L2_topk3
  Model: qwen/qwen3-coder  Tier: SILVER  Opt: O3  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 34/34 [00:05<00:00,  6.66fn/s]


  OK Completed: 34, New: 34, Errors: 0

[45/48] bench_qwen3-coder_bronze_O0_L2_topk3
  Model: qwen/qwen3-coder  Tier: BRONZE  Opt: O0  Top-k: 3


No functions to process — exiting


  OK Completed: 0, New: 0, Errors: 0

[46/48] bench_qwen3-coder_bronze_O1_L2_topk3
  Model: qwen/qwen3-coder  Tier: BRONZE  Opt: O1  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 17/17 [00:02<00:00,  5.87fn/s]


  OK Completed: 17, New: 17, Errors: 0

[47/48] bench_qwen3-coder_bronze_O2_L2_topk3
  Model: qwen/qwen3-coder  Tier: BRONZE  Opt: O2  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 30/30 [00:04<00:00,  7.08fn/s]


  OK Completed: 30, New: 30, Errors: 0

[48/48] bench_qwen3-coder_bronze_O3_L2_topk3
  Model: qwen/qwen3-coder  Tier: BRONZE  Opt: O3  Top-k: 3


LLM calls (qwen/qwen3-coder): 100%|██████████| 33/33 [00:05<00:00,  6.59fn/s]


  OK Completed: 33, New: 33, Errors: 0

Phase complete: 48 experiments, 7 total errors


## §7 — Score All Results

Trigger the scorer for all benchmark experiments. The scorer:
1. Joins ground truth post-hoc (leak-proof)
2. Enriches with stable keys for cross-opt pairing
3. Computes top-1 AND top-k metrics

In [9]:
all_exps = api("/data/experiments")
benchmark_exps = [e for e in all_exps if 'benchmark-v2' in e.get('tags', [])]

scored = 0
score_errors = 0
for e in benchmark_exps:
    exp_id = e['id']
    try:
        results = api(f"/results/{exp_id}")
        if not results or not results.get('rows'):
            continue
        
        resp = requests.post(f"{API}/results/{exp_id}/score")
        if resp.status_code == 200:
            score_data = resp.json()
            n = score_data.get('scored', 0)
            f1 = score_data.get('mean_token_f1', 0)
            print(f"  OK {exp_id}: {n} scored (F1={f1:.3f})")
            scored += 1
        else:
            print(f"  WARN {exp_id}: {resp.status_code}")
    except Exception as exc:
        print(f"  ERR {exp_id}: {exc}")
        score_errors += 1

print(f"\nScored: {scored}, Errors: {score_errors}")

  OK bench_gpt4o-mini_gold_O0_L2_topk3: 196 scored (F1=0.208)
  OK bench_gpt4o-mini_gold_O1_L2_topk3: 121 scored (F1=0.169)
  OK bench_gpt4o-mini_gold_O2_L2_topk3: 92 scored (F1=0.137)
  OK bench_gpt4o-mini_gold_O3_L2_topk3: 85 scored (F1=0.158)
  OK bench_gpt4o-mini_silver_O0_L2_topk3: 29 scored (F1=0.200)
  OK bench_gpt4o-mini_silver_O1_L2_topk3: 23 scored (F1=0.325)
  OK bench_gpt4o-mini_silver_O2_L2_topk3: 32 scored (F1=0.232)
  OK bench_gpt4o-mini_silver_O3_L2_topk3: 34 scored (F1=0.184)
  OK bench_gpt4o-mini_bronze_O1_L2_topk3: 17 scored (F1=0.161)
  OK bench_gpt4o-mini_bronze_O2_L2_topk3: 30 scored (F1=0.098)
  OK bench_gpt4o-mini_bronze_O3_L2_topk3: 33 scored (F1=0.089)
  OK bench_deepseek-v3_gold_O0_L2_topk3: 196 scored (F1=0.241)
  OK bench_deepseek-v3_gold_O1_L2_topk3: 121 scored (F1=0.223)
  OK bench_deepseek-v3_gold_O2_L2_topk3: 92 scored (F1=0.182)
  OK bench_deepseek-v3_gold_O3_L2_topk3: 85 scored (F1=0.184)
  OK bench_deepseek-v3_silver_O0_L2_topk3: 29 scored (F1=0.391)

## §8 — Progress Overview

In [8]:
all_exps = api("/data/experiments")
benchmark_exps = [e for e in all_exps if 'benchmark-v2' in e.get('tags', [])]

progress = {"no_results": 0, "has_results": 0, "scored": 0}
model_progress = {}

for e in benchmark_exps:
    exp_id = e['id']
    model = e['model']
    if model not in model_progress:
        model_progress[model] = {"total": 0, "done": 0}
    model_progress[model]["total"] += 1
    
    try:
        results = api(f"/results/{exp_id}")
        if results and results.get('rows'):
            progress["has_results"] += 1
            model_progress[model]["done"] += 1
            try:
                resp = api(f"/results/{exp_id}/scores")
                rows = resp.get("rows", []) if isinstance(resp, dict) else resp
                if rows and any(s.get('token_f1') is not None for s in rows):
                    progress["scored"] += 1
            except Exception:
                pass
        else:
            progress["no_results"] += 1
    except Exception:
        progress["no_results"] += 1

print(f"Benchmark progress ({len(benchmark_exps)} experiments):")
print(f"  No results yet: {progress['no_results']}")
print(f"  Has results:    {progress['has_results']}")
print(f"  Scored:         {progress['scored']}")
print()
print("Per-model progress:")
for m in sorted(model_progress.keys()):
    p = model_progress[m]
    pct = (p['done'] / p['total'] * 100) if p['total'] else 0
    done_blocks = int(pct // 5)
    bar = '#' * done_blocks + '.' * (20 - done_blocks)
    print(f"  {m:40s} [{bar}] {p['done']:3d}/{p['total']:3d} ({pct:.0f}%)")

Benchmark progress (36 experiments):
  No results yet: 3
  Has results:    33
  Scored:         33

Per-model progress:
  deepseek/deepseek-chat-v3-0324           [##################..]  11/ 12 (92%)
  openai/gpt-4o-mini                       [##################..]  11/ 12 (92%)
  openai/gpt-5.1                           [##################..]  11/ 12 (92%)


## §9 — Leak-Proof Assertion

In [None]:
from data.llm_contract import validate_no_leakage, FORBIDDEN_KEYS

sample_fns = api("/llm/functions", params={"opt": "O0", "tier": "GOLD", "limit": 50, "context_level": "L2"})

violations = 0
for fn in sample_fns:
    leaked = validate_no_leakage(fn)
    if leaked:
        print(f"  LEAK: {fn.get('dwarf_function_id', '?')}: {leaked}")
        violations += 1

if violations == 0:
    print(f"LEAK-PROOF: {len(sample_fns)} functions checked at L2 context - zero GT fields")
    print(f"   Forbidden keys blocked: {len(FORBIDDEN_KEYS)}")
else:
    print(f"!! {violations} VIOLATIONS DETECTED - fix immediately")

---
## Quick Reference

### CLI execution (alternative to notebook)
```bash
cd reforge
python -m workers.llm.runner --experiment bench_gpt4o-mini_gold_O0_L2_topk3 --api-base http://localhost:8080 --concurrency 5
python -m workers.llm.runner --experiment bench_gpt4o-mini_gold_O0_L2_topk3 --api-base http://localhost:8080 --dry-run
```

### Analysis
See `analysis.ipynb` for all figures, tables, and statistical tests.