# Reforge Benchmark v2: Experiment Configuration & Execution

This notebook generates, validates, budgets, and executes the full benchmark experiment matrix.

### Design
- **13 models** × **3 tiers** (GOLD/SILVER/BRONZE) × **4 opt levels** (O0–O3) × **3 context levels** (L0/L1/L2) = **468 experiments**
- Context ablation: L0 (code-only) → L1 (+calls) → L2 (+calls+CFG+vars)
- All v2 prompt templates, no function limit
- Budget ceiling: **$300**
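
The matrix size follows directly from the Cartesian product of the four axes. A minimal sketch, assuming placeholder model labels (`model_00` through `model_12`) and an ID layout inferred from the experiment IDs printed in this notebook (`bench_<label>_<tier>_<opt>_<ctx>`); the real generator is `build_benchmark_matrix`, whose internals are not shown here:

```python
from itertools import product

# Placeholder labels (assumption); the real ones live in BENCHMARK_MODELS.
models = [f"model_{i:02d}" for i in range(13)]
tiers = ["GOLD", "SILVER", "BRONZE"]
opts = ["O0", "O1", "O2", "O3"]
context_levels = ["L0", "L1", "L2"]

# One experiment per (model, tier, opt, context level) combination.
experiment_ids = [
    f"bench_{m}_{t.lower()}_{o}_{c}"
    for m, t, o, c in product(models, tiers, opts, context_levels)
]

print(len(experiment_ids))  # 13 * 3 * 4 * 3 = 468
print(experiment_ids[0])    # bench_model_00_gold_O0_L0
```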

### Non-Negotiable Constraint
> **LLM input MUST contain ONLY Ghidra-derived artefacts.**  
> Ground truth (DWARF / source / join metadata) is used ONLY post-hoc for scoring.  
> The `/llm/functions` endpoint enforces this by construction.
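
On the client side, one way to make this constraint hard to violate is to whitelist the fields that may reach the prompt. A minimal sketch, assuming the four field names passed to `render_prompt` in this notebook (`c_raw`, `calls_text`, `cfg_text`, `variables_text`); the helper `prompt_inputs` is hypothetical and not part of Reforge:

```python
# Fields assumed to be Ghidra-derived and therefore safe to show the model.
ALLOWED_PROMPT_FIELDS = ("c_raw", "calls_text", "cfg_text", "variables_text")

def prompt_inputs(fn: dict) -> dict:
    """Return only whitelisted, non-empty fields; everything else is dropped."""
    return {k: fn[k] for k in ALLOWED_PROMPT_FIELDS if fn.get(k) is not None}

record = {
    "dwarf_function_id": "cu0x0:die0x12d",  # join key, must never reach the LLM
    "ghidra_name": "FUN_004011fa",
    "c_raw": "void FUN_004011fa(void) { return; }",
    "calls_text": None,
}
safe = prompt_inputs(record)
print(sorted(safe))
```

A whitelist fails closed: a new ground-truth column added to the record later is excluded unless someone deliberately adds it to `ALLOWED_PROMPT_FIELDS`.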

### Prerequisites
- Docker stack running: `docker compose up -d` in `reforge/docker/`
- Services: `api` (port 8080), `redis`, `postgres`
- `OPENROUTER_API_KEY` set in `docker/.env`
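
Because the containers can take a few seconds to come up, it can help to poll the health endpoint before running any cell. A minimal sketch; `wait_until_healthy` is a hypothetical helper, and the probe is injected so the retry logic can be exercised without a running stack:

```python
import time
from typing import Callable

def wait_until_healthy(probe: Callable[[], dict], attempts: int = 10,
                       delay_s: float = 1.0) -> dict:
    """Call `probe` until it reports status 'healthy' or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            health = probe()
            if health.get("status") == "healthy":
                return health
            last_error = RuntimeError(f"unhealthy response: {health}")
        except Exception as exc:  # e.g. connection refused, timeout, bad JSON
            last_error = exc
        time.sleep(delay_s)
    raise RuntimeError(f"API not healthy after {attempts} attempts") from last_error

# Stand-in probe: fails once, reports 'starting' once, then succeeds.
responses = iter([ConnectionError("refused"), {"status": "starting"}, {"status": "healthy"}])

def fake_probe() -> dict:
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item

result = wait_until_healthy(fake_probe, attempts=5, delay_s=0.0)
print(result)
```

Against the stack described above, the probe would be something like `lambda: requests.get("http://localhost:8080/health", timeout=5).json()`.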

## §1: Setup & Health Check

In [None]:
import sys, os, json, time, importlib, textwrap
from pathlib import Path
from datetime import datetime
from collections import Counter

import requests

# Ensure reforge root is on sys.path
REFORGE_ROOT = Path(".").resolve().parent
if str(REFORGE_ROOT) not in sys.path:
    sys.path.insert(0, str(REFORGE_ROOT))

API = "http://localhost:8080"
OPENROUTER_KEY = os.environ.get(
    "OPENROUTER_API_KEY",
    "sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
)

def api(path, **kw):  return requests.get(f"{API}{path}", **kw).json()
def post(url, body):  return requests.post(url, json=body).json()

health = api("/health")
print(f"API: {health}")
print(f"Key: ...{OPENROUTER_KEY[-8:]}")

API: {'status': 'healthy', 'service': 'reforge-api', 'version': '0.1.0'}
Key: ...e0ea6829


## §2: Review Legacy Experiments

In [2]:
experiments = api("/data/experiments")

legacy = [e for e in experiments if e.get('status') == 'legacy']
active = [e for e in experiments if e.get('status') != 'legacy']

print(f"Total experiments: {len(experiments)}")
print(f"Legacy (pilot):    {len(legacy)}")
print(f"Active:            {len(active)}")
print()
for e in legacy:
    print(f"  🗄️  {e['id']:50s}  {e['model']}")
print()
for e in active[:20]:
    status_icon = {'ready': '✅', 'draft': '📝', 'running': '⏳', 'completed': '✔️'}.get(e['status'], '?')
    print(f"  {status_icon}  {e['id']:50s}  {e['model']:35s}  {e.get('context_level', 'L0')}")

Total experiments: 5
Legacy (pilot):    0
Active:            5


  ✅  exp01_funcnaming_gpt4omini_gold_O0                  openai/gpt-4o-mini                   L0
  ✅  exp02_funcnaming_gpt4o_gold_O0                      openai/gpt-4o                        L0
  ✅  exp03_funcnaming_claude_gold_O0                     anthropic/claude-3.5-sonnet          L0
  ✅  exp04_funcnaming_gpt4omini_gold_O2                  openai/gpt-4o-mini                   L0
  📝  exp05_funcnaming_gpt4omini_silver_O0                openai/gpt-4o-mini                   L0


## §3: Build the Benchmark Matrix

Generate all experiment configs programmatically.  
Edit the slices below to control scope; the full matrix is 468 experiments.

In [17]:
from data.experiments import (
    build_benchmark_matrix,
    estimate_benchmark_cost,
    BENCHMARK_MODELS,
    BENCHMARK_TIERS,
    BENCHMARK_OPTS,
    BENCHMARK_CONTEXT_LEVELS,
    REGISTRY,
)

# ── Customise the slice here ──────────────────────────────────────────
# Test run (narrow slice). Comment these four lines out for the full benchmark:
selected_models = {"gpt4o-mini": BENCHMARK_MODELS["gpt4o-mini"]}
selected_tiers  = ["GOLD"]
selected_opts   = ["O3"]
selected_ctx    = ["L2"]
#
# Full benchmark:
# selected_models = BENCHMARK_MODELS
# selected_tiers  = BENCHMARK_TIERS
# selected_opts   = BENCHMARK_OPTS
# selected_ctx    = BENCHMARK_CONTEXT_LEVELS

matrix = build_benchmark_matrix(
    models=selected_models,
    tiers=selected_tiers,
    opts=selected_opts,
    context_levels=selected_ctx,
    register=True,
)

print(f"Benchmark matrix: {len(matrix)} experiments")
print(f"  Models:  {len(selected_models)}")
print(f"  Tiers:   {selected_tiers}")
print(f"  Opts:    {selected_opts}")
print(f"  Context: {selected_ctx}")
print()
# Show first 10
for cfg in matrix[:10]:
    print(f"  {cfg.id:55s} → {cfg.model:35s}  {cfg.context_level}")
if len(matrix) > 10:
    print(f"  ... and {len(matrix) - 10} more")

# ── Push matrix to the API server so /data/experiments/{id} works ─────
print()
print("Registering experiments with API server...")
resp = requests.post(
    f"{API}/data/experiments/bulk",
    json=[cfg.model_dump() for cfg in matrix],
)
if resp.status_code in (200, 201):
    reg = resp.json()
    print(f"  ✅ Registered {reg['registered']} experiments ({reg['created']} new, {reg['updated']} updated)")
else:
    print(f"  ⚠️  Registration failed: {resp.status_code} - {resp.text}")
    print("  Dry-run / execution cells will fail with 404 until experiments are registered.")

Benchmark matrix: 1 experiments
  Models:  1
  Tiers:   ['GOLD']
  Opts:    ['O3']
  Context: ['L2']

  bench_gpt4o-mini_gold_O3_L2                             → openai/gpt-4o-mini                   L2

Registering experiments with API server...
  ✅ Registered 1 experiments (1 new, 0 updated)


## §4: Budget Estimation

Estimate the cost before committing to a run. The estimate uses rough OpenRouter pricing tiers.

In [5]:
# Count how many functions we actually have per tier/opt
func_counts = {}
for tier in selected_tiers:
    for opt_lvl in selected_opts:
        try:
            fns = api("/llm/functions", params={"opt": opt_lvl, "tier": tier, "limit": 5000})
            func_counts[(tier, opt_lvl)] = len(fns)
        except Exception:
            func_counts[(tier, opt_lvl)] = 0

total_functions_per_exp = sum(func_counts.values()) // (len(selected_tiers) * len(selected_opts))
print("Functions per (tier, opt):")
for (t, o), n in sorted(func_counts.items()):
    print(f"  {t:8s} {o:3s}: {n:4d} functions")
print(f"\nAverage per experiment: ~{total_functions_per_exp} functions")
print()

# Estimate cost
est = estimate_benchmark_cost(
    matrix,
    avg_prompt_tokens=800,
    avg_completion_tokens=20,
    functions_per_experiment=max(total_functions_per_exp, 50),
)

print(f"Total experiments:   {est['total_experiments']}")
print(f"Total LLM calls:    {est['total_calls']:,}")
print(f"Total input tokens:  {est['total_input_tokens']:,}")
print(f"Estimated cost:      ${est['estimated_cost_usd']:.2f}")
print()

BUDGET = 300.0
if est['estimated_cost_usd'] > BUDGET:
    print(f"⚠️  OVER BUDGET (${BUDGET:.0f}). Narrow the matrix or reduce models.")
else:
    print(f"✅ Within ${BUDGET:.0f} budget - {BUDGET - est['estimated_cost_usd']:.2f} remaining")
    
# Per-model breakdown
model_costs = {}
for item in est['breakdown']:
    m = item['model']
    model_costs[m] = model_costs.get(m, 0) + item['est_cost_usd']
print("\nPer-model cost:")
for m, c in sorted(model_costs.items(), key=lambda x: -x[1]):
    print(f"  ${c:8.2f}  {m}")

Functions per (tier, opt):
  GOLD     O0 :  197 functions

Average per experiment: ~197 functions

Total experiments:   1
Total LLM calls:    197
Total input tokens:  157,600
Estimated cost:      $0.02

✅ Within $300 budget - 299.98 remaining

Per-model cost:
  $    0.02  openai/gpt-4o-mini
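
The estimate boils down to token counts multiplied by per-token prices. A minimal sketch of that arithmetic; the function `estimate_cost_usd` and both prices are illustrative assumptions, not the values used by `estimate_benchmark_cost`:

```python
def estimate_cost_usd(calls: int, prompt_tokens: int, completion_tokens: int,
                      usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of `calls` requests given average token counts and prices per million tokens."""
    input_tokens = calls * prompt_tokens
    output_tokens = calls * completion_tokens
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

# 197 calls at 800 prompt / 20 completion tokens, with made-up prices.
cost = estimate_cost_usd(197, 800, 20, usd_per_m_input=0.10, usd_per_m_output=0.50)
print(f"{197 * 800:,} input tokens, ${cost:.4f}")
```

With short completions such as function names, input tokens dominate the bill, so the context level (L0 to L2) is the main cost lever per call.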


## §5: Dry-Run Validation

Pick one experiment per context level. Validate prompt rendering and API round-trip  
**without** making real LLM calls.

In [6]:
## Preview: render one full prompt at L2 (code + calls + CFG + variables)

from workers.llm.prompt import load_template, render_prompt

# Fetch one L2 function from the API
sample = api("/llm/functions", params={
    "opt": "O0",
    "tier": "GOLD",
    "context_level": "L2",
    "limit": 1,
})

if not sample:
    print("⚠️  No functions returned - is the data loaded?")
else:
    fn = sample[0]
    print(f"Function: {fn['dwarf_function_id']}")
    print(f"  ghidra_name:  {fn.get('ghidra_name')}")
    print(f"  test_case:    {fn.get('test_case')}")
    print(f"  loc:          {fn.get('loc_decompiled')}")
    print(f"  cyclomatic:   {fn.get('cyclomatic')}")
    print(f"  bb_count:     {fn.get('bb_count')}")
    print()

    # Load the L2 template and render
    template = load_template("function_naming_v2_L2")
    prompt = render_prompt(
        template,
        fn.get("c_raw", ""),
        calls=fn.get("calls_text"),
        cfg_summary=fn.get("cfg_text"),
        variables=fn.get("variables_text"),
    )

    print("=" * 80)
    print("FULL RENDERED PROMPT (L2)")
    print("=" * 80)
    print(prompt)
    print("=" * 80)
    print(f"\nPrompt length: {len(prompt)} chars  (~{len(prompt)//4} tokens)")


Function: cu0x0:die0x12d
  ghidra_name:  FUN_004011fa
  test_case:    t01_crossfile_calls
  loc:          17
  cyclomatic:   None
  bb_count:     None

FULL RENDERED PROMPT (L2)
You are an expert reverse engineer analyzing decompiled binary code.

A function has been decompiled from a stripped binary using Ghidra. The original
symbol names have been removed by the strip tool. Your task is to analyze the
decompiled C code along with its structural metadata, then suggest a meaningful,
descriptive function name that reflects what the function does.

Guidelines:
- Use snake_case naming convention
- Be specific but concise - prefer 2-4 words
- Focus on the function's PURPOSE, not its implementation details
- Use call relationships and variable information to understand context
- Consider control-flow complexity when reasoning about function role
- If the function is a standard library wrapper, name it accordingly
- If you cannot determine the purpose, use a descriptive structural name

Re

In [18]:
import asyncio
from workers.llm.runner import run_experiment

# Pick one cheap-model experiment per context level for validation
dry_run_ids = [
    cfg.id for cfg in matrix
    if "gpt4o-mini" in cfg.id and "gold" in cfg.id and "O3" in cfg.id
][:3]  # Should be L0, L1, L2

print(f"Dry-run targets: {dry_run_ids}")
print()

for exp_id in dry_run_ids:
    print(f"\n{'='*60}")
    print(f"DRY RUN: {exp_id}")
    print(f"{'='*60}")
    summary = await run_experiment(
        exp_id,
        api_base=API,
        dry_run=True,
    )
    print(f"  Total: {summary['total']}, New: {summary['new']}, Errors: {summary['errors']}")
    print(f"  {'✅ PASS' if summary['errors'] == 0 else '❌ FAIL'}")

Dry-run targets: ['bench_gpt4o-mini_gold_O3_L2']


DRY RUN: bench_gpt4o-mini_gold_O3_L2
  Total: 85, New: 85, Errors: 0
  ✅ PASS


## §6: Execute Experiments

### Execution Strategy
- **Phase 1**: Cheap models first (gpt-4o-mini, llama, deepseek) to validate the pipeline
- **Phase 2**: Mid-tier models (gpt-4o, claude-3.5-sonnet)
- **Phase 3**: Premium models (gpt-5.1, claude-opus-4.6, codex-max)

Each phase can be run independently. Runs are idempotent (resume support), so an interrupted phase can simply be re-run.
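
Resume support combined with bounded concurrency can be sketched as follows. This is not the code in `workers.llm.runner`; `run_with_resume`, `call_llm`, and the in-memory `done` set are hypothetical stand-ins for the runner, the OpenRouter call, and the stored results:

```python
import asyncio

async def run_with_resume(function_ids, done, call_llm, concurrency=5):
    """Process only IDs missing from `done`, with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    pending = [fid for fid in function_ids if fid not in done]
    errors = 0

    async def worker(fid):
        nonlocal errors
        async with semaphore:
            try:
                await call_llm(fid)
                done.add(fid)
            except Exception:
                errors += 1  # fid stays out of `done`, so a re-run retries it

    await asyncio.gather(*(worker(fid) for fid in pending))
    return {"total": len(function_ids), "new": len(pending) - errors, "errors": errors}

async def fake_llm(fid):
    await asyncio.sleep(0)  # stand-in for the network round-trip

done = {"f1", "f2"}  # results stored by an earlier, interrupted run
first = asyncio.run(run_with_resume(["f1", "f2", "f3", "f4"], done, fake_llm))
second = asyncio.run(run_with_resume(["f1", "f2", "f3", "f4"], done, fake_llm))
print(first)
print(second)
```

Inside Jupyter an event loop is already running, so `asyncio.run(...)` would be replaced by a plain `await`. The second call finds nothing left to do, which is what makes re-running a phase safe.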

In [19]:
# ── Phase selector ────────────────────────────────────────────────────
# Uncomment ONE phase at a time, or set custom filter.

CHEAP_MODELS = {"gpt4o-mini", "llama31-70b"} #, "deepseek-coder2", "deepseek-v32", "deepseek-r1", "qwen3-coder"
MID_MODELS   = {"gpt4o", "claude35sonnet", "claude-sonnet45", "gemini3-pro"}
PREMIUM_MODELS = {"gpt51", "gpt51-codex-max", "claude-opus46"}

# Phase 1: cheap
phase_filter = CHEAP_MODELS
# Phase 2: mid
# phase_filter = MID_MODELS
# Phase 3: premium
# phase_filter = PREMIUM_MODELS
# All at once (⚠️ expensive):
# phase_filter = CHEAP_MODELS | MID_MODELS | PREMIUM_MODELS

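# Match on the exact model label. A plain substring test would let "gpt4o"
# also select the "gpt4o-mini" experiments.
phase_experiments = [
    cfg for cfg in matrix
    if any(cfg.id.startswith(f"bench_{label}_") for label in phase_filter)
]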
print(f"Phase experiments: {len(phase_experiments)}")
for cfg in phase_experiments[:5]:
    print(f"  {cfg.id}")
if len(phase_experiments) > 5:
    print(f"  ... and {len(phase_experiments)-5} more")

Phase experiments: 1
  bench_gpt4o-mini_gold_O3_L2


In [20]:
# ── Execute the selected phase ────────────────────────────────────────
# This cell runs all experiments in the phase sequentially.
# Each experiment uses internal async concurrency (5 parallel LLM calls).

summaries = []
errors_total = 0

for i, cfg in enumerate(phase_experiments, 1):
    print(f"\n[{i}/{len(phase_experiments)}] {cfg.id}")
    print(f"  Model: {cfg.model}  Tier: {cfg.tier}  Opt: {cfg.opt}  Ctx: {cfg.context_level}")
    
    try:
        summary = await run_experiment(
            cfg.id,
            api_base=API,
            openrouter_key=OPENROUTER_KEY,
            concurrency=5,
        )
        summaries.append(summary)
        
        new = summary.get('new', 0)
        errs = summary.get('errors', 0)
        errors_total += errs
        status = '✅' if errs == 0 else '⚠️'
        print(f"  {status} Completed: {summary.get('completed', 0)}, New: {new}, Errors: {errs}")
    except Exception as exc:
        print(f"  ❌ FAILED: {exc}")
        errors_total += 1
        summaries.append({"experiment_id": cfg.id, "error": str(exc)})

print(f"\n{'='*60}")
print(f"Phase complete: {len(summaries)} experiments, {errors_total} total errors")


[1/1] bench_gpt4o-mini_gold_O3_L2
  Model: openai/gpt-4o-mini  Tier: GOLD  Opt: O3  Ctx: L2


LLM calls (openai/gpt-4o-mini): 100%|██████████| 85/85 [00:10<00:00,  8.01fn/s]


  ✅ Completed: 85, New: 85, Errors: 0

Phase complete: 1 experiments, 0 total errors


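## §7: Score All Results

Trigger the scorer for every benchmark experiment that has results. The cell does not check whether an experiment was already scored, so re-running it triggers scoring again for all of them.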

In [21]:
# Fetch all experiments and trigger scoring for those with results
all_exps = api("/data/experiments")
benchmark_exps = [e for e in all_exps if 'benchmark-v2' in e.get('tags', [])]

scored = 0
score_errors = 0
for e in benchmark_exps:
    exp_id = e['id']
    try:
        # Check if results exist
        results = api(f"/results/{exp_id}")
        if not results:
            continue
        
        # Trigger scoring
        resp = requests.post(f"{API}/results/{exp_id}/score")
        if resp.status_code == 200:
            score_data = resp.json()
            n = score_data.get('scored', 0)
            print(f"  ✅ {exp_id}: {n} scored")
            scored += 1
        else:
            print(f"  ⚠️  {exp_id}: {resp.status_code}")
    except Exception as exc:
        print(f"  ❌ {exp_id}: {exc}")
        score_errors += 1

print(f"\nScored: {scored}, Errors: {score_errors}")

  ✅ bench_gpt4o-mini_gold_O0_L2: 196 scored
  ✅ bench_gpt4o-mini_gold_O1_L2: 115 scored
  ✅ bench_gpt4o-mini_gold_O2_L2: 92 scored
  ✅ bench_gpt4o-mini_gold_O3_L2: 85 scored

Scored: 4, Errors: 0
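
The progress cell in this notebook reads a `token_f1` field from the score rows. The scorer's implementation is not shown here; a common way to define a token-level F1 for identifier names, given as a hedged sketch with a hypothetical `token_f1` helper that splits snake_case names on underscores, is:

```python
from collections import Counter

def token_f1(predicted: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over snake_case tokens."""
    pred = Counter(t for t in predicted.lower().split("_") if t)
    ref = Counter(t for t in reference.lower().split("_") if t)
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1("parse_config_file", "read_config_file"))
print(token_f1("init", "shutdown_server"))
```

A metric of this shape gives partial credit when the model recovers some of the words in the ground-truth name, which an exact-match score would not.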


## §8: Quick Progress Overview

Show how many experiments have results and their scoring status.

In [None]:
all_exps = api("/data/experiments")
benchmark_exps = [e for e in all_exps if 'benchmark-v2' in e.get('tags', [])]

progress = {"no_results": 0, "has_results": 0, "scored": 0}
model_progress = {}

for e in benchmark_exps:
    exp_id = e['id']
    model = e['model']
    if model not in model_progress:
        model_progress[model] = {"total": 0, "done": 0}
    model_progress[model]["total"] += 1
    
    try:
        results = api(f"/results/{exp_id}")
        if results:
            progress["has_results"] += 1
            model_progress[model]["done"] += 1
            # Check if scored
            try:
                resp = api(f"/results/{exp_id}/scores")
                rows = resp.get("rows", []) if isinstance(resp, dict) else resp
                if rows and any(s.get('token_f1') is not None for s in rows):
                    progress["scored"] += 1
            except Exception:
                pass
        else:
            progress["no_results"] += 1
    except Exception:
        progress["no_results"] += 1

print(f"Benchmark progress ({len(benchmark_exps)} experiments):")
print(f"  No results yet: {progress['no_results']}")
print(f"  Has results:    {progress['has_results']}")
print(f"  Scored:         {progress['scored']}")
print()
print("Per-model progress:")
for m in sorted(model_progress.keys()):
    p = model_progress[m]
    pct = (p['done'] / p['total'] * 100) if p['total'] else 0
    bar = '█' * int(pct // 5) + '░' * (20 - int(pct // 5))
    print(f"  {m:40s} {bar} {p['done']:3d}/{p['total']:3d} ({pct:.0f}%)")

Benchmark progress (1 experiments):
  No results yet: 0
  Has results:    1
  Scored:         0

Per-model progress:
  openai/gpt-4o-mini                       ████████████████████   1/  1 (100%)


---
## Quick Reference

### CLI execution (alternative to notebook)
```bash
# Single experiment
cd reforge
python -m workers.llm.runner --experiment bench_gpt4o-mini_gold_O0_L0 --api-base http://localhost:8080 --concurrency 5

# Dry run
python -m workers.llm.runner --experiment bench_gpt4o-mini_gold_O0_L0 --api-base http://localhost:8080 --dry-run

# Batch (bash loop)
for exp in bench_gpt4o-mini_gold_O0_L0 bench_gpt4o-mini_gold_O0_L1 bench_gpt4o-mini_gold_O0_L2; do
    python -m workers.llm.runner --experiment "$exp" --api-base http://localhost:8080 -v
done
```

### Analysis
See `analysis.ipynb` for all figures, tables, and statistical tests.