# BenchMAC Analysis V2

This notebook provides a comprehensive analysis of all BenchMAC experiments, including both successful and failed runs.

Unlike the first analysis notebook, which considered only completed experiments, this version:
- Analyzes failed experiments and categorizes failure reasons
- Handles experiments with empty diffs (now treated as completed with zero metrics)
- Identifies missing experiments and evaluations
- Provides actionable recommendations for re-running experiments with adjusted limits

## Approach

1. Load all experiments and evaluations from disk
2. Compute experiment tasks as the Cartesian product: instances × agent_configs
3. Match each task to experiments and evaluations
4. Identify and categorize failures (steps limit, cost limit, other errors)
5. Filter out agent configs with excessive cost-related failures
6. Create final dataset: one experiment + evaluation per task
7. Perform analysis and visualization
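
Steps 2 and 3 above can be sketched on toy data. This is a minimal illustration rather than the notebook's actual code: `Task` and `Run` are hypothetical stand-ins for `ExperimentTask` and the experiment result models, and the instance and agent names are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for ExperimentTask (frozen, so usable as a dict key)."""

    instance_id: str
    agent_key: str


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for a completed or failed experiment."""

    task: Task
    completed: bool


instance_ids = ["app_v11_to_v12", "app_v12_to_v13"]
agent_keys = ["agent-a", "agent-b"]

# Step 2: every (instance, agent) pair is an expected task.
expected_tasks = [Task(i, a) for i, a in product(instance_ids, agent_keys)]

# Toy results: one task was run twice, one was never run.
runs = [
    Run(Task("app_v11_to_v12", "agent-a"), completed=True),
    Run(Task("app_v11_to_v12", "agent-b"), completed=False),
    Run(Task("app_v11_to_v12", "agent-b"), completed=True),
    Run(Task("app_v12_to_v13", "agent-a"), completed=False),
]

# Step 3: group results by task.
runs_by_task: dict[Task, list[Run]] = defaultdict(list)
for run in runs:
    runs_by_task[run.task].append(run)

# .get() keeps lookups from inserting empty lists into the defaultdict.
missing = [t for t in expected_tasks if not runs_by_task.get(t)]
repeated = [t for t in expected_tasks if len(runs_by_task.get(t, [])) > 1]

print(f"{len(expected_tasks)} tasks, {len(missing)} missing, {len(repeated)} repeated")
```

Comparing results against the full product, instead of only against the tasks that happen to have results, is what makes missing experiments visible.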

## Setup and Imports

In [1]:
import json
from collections import Counter, defaultdict
from pathlib import Path

import pandas as pd
from pydantic import TypeAdapter

from bench_mac.core.models import (
    EvaluationCompleted,
    EvaluationFailed,
    EvaluationResult,
)
from experiments.models import (
    AgentConfig,
    CompletedExperiment,
    ExperimentResult,
    FailedExperiment,
    MiniSweAgentConfig,
)

## 1. Load All Experiments and Evaluations

In [2]:
# Configuration
BENCHMAC_DIR = Path("../.benchmac")
EXPERIMENTS_DIR = BENCHMAC_DIR / "experiments" / "results"
EVALUATIONS_DIR = BENCHMAC_DIR / "evaluations"
assert EXPERIMENTS_DIR.exists()
assert EVALUATIONS_DIR.exists()

# Instances to exclude from analysis
EXCLUDED_INSTANCES = {
    "akveo__ngx-admin_v15_to_v16",  # Known problematic instance
}

In [3]:
def load_experiments(
    experiments_dir: Path,
) -> tuple[list[CompletedExperiment], list[FailedExperiment]]:
    """Load all experiments from JSON files."""
    completed: list[CompletedExperiment] = []
    failed: list[FailedExperiment] = []

    for json_file in experiments_dir.rglob("*.json"):
        with json_file.open("r") as f:
            experiment = ExperimentResult.model_validate_json(f.read()).root

            match experiment:
                case CompletedExperiment():
                    completed.append(experiment)
                case FailedExperiment():
                    failed.append(experiment)

    return completed, failed


def load_evaluations(
    evaluations_dir: Path,
) -> tuple[list[EvaluationCompleted], list[EvaluationFailed]]:
    """Load all evaluations from JSONL files."""
    completed: list[EvaluationCompleted] = []
    failed: list[EvaluationFailed] = []

    eval_adapter = TypeAdapter(EvaluationResult)

    for jsonl_file in evaluations_dir.rglob("*.jsonl"):
        with jsonl_file.open("r") as f:
            for line in f:
                if line.strip():
                    eval_result = eval_adapter.validate_python(json.loads(line))
                    match eval_result:
                        case EvaluationCompleted():
                            completed.append(eval_result)
                        case EvaluationFailed():
                            failed.append(eval_result)

    return completed, failed

In [4]:
# Load all data
print("Loading experiments...")
completed_experiments, failed_experiments = load_experiments(EXPERIMENTS_DIR)

print("Loading evaluations...")
completed_evaluations, failed_evaluations = load_evaluations(EVALUATIONS_DIR)

print(f"\nLoaded {len(completed_experiments)} completed experiments")
print(f"Loaded {len(failed_experiments)} failed experiments")
print(f"Loaded {len(completed_evaluations)} completed evaluations")
print(f"Loaded {len(failed_evaluations)} failed evaluations")

Loading experiments...
Loading evaluations...

Loaded 168 completed experiments
Loaded 35 failed experiments
Loaded 172 completed evaluations
Loaded 0 failed evaluations


## 2. Initial Filtering and Statistics

In [5]:
# Filter out excluded instances
completed_experiments = [
    e for e in completed_experiments if e.task.instance_id not in EXCLUDED_INSTANCES
]
failed_experiments = [
    e for e in failed_experiments if e.task.instance_id not in EXCLUDED_INSTANCES
]

print("After filtering excluded instances:")
print(f"  {len(completed_experiments)} completed experiments")
print(f"  {len(failed_experiments)} failed experiments")

After filtering excluded instances:
  168 completed experiments
  35 failed experiments


In [6]:
# Count experiments with empty diffs
empty_diff_experiments = [
    e for e in completed_experiments if e.submission.model_patch == ""
]

print(f"\nExperiments with empty diffs: {len(empty_diff_experiments)}")


Experiments with empty diffs: 7


## 3. Extract Unique Instances and Agent Configs

In [7]:
# Extract unique instances
all_instance_ids = {
    exp.task.instance_id for exp in completed_experiments + failed_experiments
}

instance_ids = sorted(all_instance_ids)

print(f"Found {len(instance_ids)} unique benchmark instances:")
for iid in instance_ids:
    print(f"  - {iid}")

Found 9 unique benchmark instances:
  - gothinkster__angular-realworld-example-app_v11_to_v12
  - gothinkster__angular-realworld-example-app_v12_to_v13
  - gothinkster__angular-realworld-example-app_v13_to_v14
  - gothinkster__angular-realworld-example-app_v14_to_v15
  - gothinkster__angular-realworld-example-app_v15_to_v16
  - gothinkster__angular-realworld-example-app_v16_to_v17
  - gothinkster__angular-realworld-example-app_v17_to_v18
  - gothinkster__angular-realworld-example-app_v18_to_v19
  - gothinkster__angular-realworld-example-app_v19_to_v20


In [8]:
# Extract unique agent configs
type AgentConfigKey = str
agent_configs_dict: dict[AgentConfigKey, AgentConfig] = {}

for exp in completed_experiments + failed_experiments:
    config = exp.task.agent_config
    key = config.key
    if key not in agent_configs_dict:
        agent_configs_dict[key] = config

agent_configs = sorted(agent_configs_dict.values(), key=lambda x: x.key)

print(f"\nFound {len(agent_configs)} unique agent configurations:")
for ac in agent_configs:
    print(f"  - {ac.key}")
    if isinstance(ac, MiniSweAgentConfig):
        print(f"    - {ac.model_name}")
        model_kwargs = ac.model_kwargs
        if model_kwargs:
            print(f"    - {model_kwargs}")


Found 21 unique agent configurations:
  - angular-schematics/789e301f
  - swe-agent-mini/anthropic/claude-opus-4-1-20250805@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
    - anthropic/claude-opus-4-1-20250805
    - {'temperature': 0.0}
  - swe-agent-mini/anthropic/claude-sonnet-4-20250514@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
    - anthropic/claude-sonnet-4-20250514
    - {'temperature': 0.0}
  - swe-agent-mini/anthropic/claude-sonnet-4-5-20250929@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
    - anthropic/claude-sonnet-4-5-20250929
    - {'temperature': 0.0}
  - swe-agent-mini/gemini/gemini-2.5-flash-lite@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
    - gemini/gemini-2.5-flash-lite
    - {'temperature': 0.0}
  - swe-agent-mini/gemini/gemini-2.5-flash@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
    - gemini/gemini-2

## 4. Compute Experiment Tasks and Match to Results

In [9]:
# Compute all possible experiment tasks (Cartesian product)
from experiments.models import ExperimentTask

experiment_tasks = [
    ExperimentTask(instance_id=instance_id, agent_config=agent_config)
    for instance_id in instance_ids
    for agent_config in agent_configs
]

print(f"Total experiment tasks: {len(experiment_tasks)}")
print(f"  = {len(instance_ids)} instances × {len(agent_configs)} agent configs")  # noqa: RUF001

Total experiment tasks: 189
  = 9 instances × 21 agent configs


In [10]:
# Group experiments by task
experiments_by_task: dict[
    ExperimentTask, list[FailedExperiment | CompletedExperiment]
] = defaultdict(list)

for exp in completed_experiments:
    task = ExperimentTask(
        instance_id=exp.task.instance_id, agent_config=exp.task.agent_config
    )
    experiments_by_task[task].append(exp)

for exp in failed_experiments:
    task = ExperimentTask(
        instance_id=exp.task.instance_id, agent_config=exp.task.agent_config
    )
    experiments_by_task[task].append(exp)

assert sum(len(exps) for exps in experiments_by_task.values()) == len(
    completed_experiments
) + len(failed_experiments)

In [11]:
# Analyze task coverage
tasks_with_no_experiments = [
    task
    for task in experiment_tasks
    if task not in experiments_by_task or not experiments_by_task[task]
]

tasks_with_multiple_experiments = [
    (task, len(experiments_by_task[task]))
    for task in experiment_tasks
    # .get() avoids inserting empty lists into the defaultdict for missing tasks
    if len(experiments_by_task.get(task, [])) > 1
]

print(f"\nTasks with 0 experiments: {len(tasks_with_no_experiments)}")
if tasks_with_no_experiments:
    print("  Consider running these experiments:")
    for task in tasks_with_no_experiments[:5]:  # Show first 5
        print(f"    - {task.instance_id} with {task.agent_config.key}")
    if len(tasks_with_no_experiments) > 5:
        print(f"    ... and {len(tasks_with_no_experiments) - 5} more")

print(f"\nTasks with >1 experiments: {len(tasks_with_multiple_experiments)}")
if tasks_with_multiple_experiments:
    print("  (This is not a problem - we keep the latest completed experiment)")
    for task, count in tasks_with_multiple_experiments:
        print(
            f"    - {task.instance_id} with {task.agent_config.key}: {count} experiments"
        )
        failed, completed = (
            [e for e in experiments_by_task[task] if isinstance(e, FailedExperiment)],
            [
                e
                for e in experiments_by_task[task]
                if isinstance(e, CompletedExperiment)
            ],
        )
        print(f"      - {len(failed)} failed, {len(completed)} completed")


Tasks with 0 experiments: 0

Tasks with >1 experiments: 12
  (This is not a problem - we keep the latest completed experiment)
    - gothinkster__angular-realworld-example-app_v11_to_v12 with swe-agent-mini/anthropic/claude-opus-4-1-20250805@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc: 2 experiments
      - 2 failed, 0 completed
    - gothinkster__angular-realworld-example-app_v11_to_v12 with swe-agent-mini/gemini/gemini-2.5-flash-lite@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc: 2 experiments
      - 1 failed, 1 completed
    - gothinkster__angular-realworld-example-app_v11_to_v12 with swe-agent-mini/mistral/devstral-small-2507@modelkw-2b574f61@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc: 2 experiments
      - 2 failed, 0 completed
    - gothinkster__angular-realworld-example-app_v11_to_v12 with swe-agent-mini/mistral/magistral-small-2509@modelkw-813f7a1c@minisweagent-1.13.0@tasktpl-73947524@agentsettin

## 5. Analyze Failed Experiments

In [12]:
from typing import cast

# Find tasks with only failed experiments
tasks_with_only_failures: list[tuple[ExperimentTask, list[FailedExperiment]]] = []

for task in experiment_tasks:
    if task not in experiments_by_task:
        continue

    exps = experiments_by_task[task]
    if not exps:
        print(f"Task {(task.instance_id, task.agent_config.key)} has no experiments")
        continue

    if all(isinstance(e, FailedExperiment) for e in exps):
        tasks_with_only_failures.append((task, cast(list[FailedExperiment], exps)))

print(f"Tasks with only failed experiments: {len(tasks_with_only_failures)}")

Tasks with only failed experiments: 23


In [13]:
STEP_LIMIT_MULTIPLE = 100  # ⚠️ Assumption: every configured step limit is a multiple of 100


# Categorize failure reasons
def categorize_failure(exp: FailedExperiment) -> str:
    """Categorize the reason for experiment failure."""

    # Check if steps limit was exceeded
    if (
        "LimitsExceeded" in exp.error
        and exp.artifacts
        and exp.artifacts.execution_trace
    ):
        num_steps = len(exp.artifacts.execution_trace.steps)
        if num_steps % STEP_LIMIT_MULTIPLE == 0:
            return "steps_limit_exceeded"

    # Check error message for cost-related failures
    if "LimitsExceeded" in exp.error:
        return (
            "cost_limit_exceeded"  # Assume if it's not a step limit, it's a cost limit
        )

    return "other_error"


failure_categories: Counter[str] = Counter()
failures_by_category: dict[str, list[tuple[ExperimentTask, FailedExperiment]]] = (
    defaultdict(list)
)  # category -> list[(task, FailedExperiment)]

for task, exps in tasks_with_only_failures:
    latest_failure = max(exps, key=lambda e: e.ended_at)
    category = categorize_failure(latest_failure)
    failure_categories[category] += 1
    failures_by_category[category].append((task, latest_failure))

print("\nFailure categories:")
for category, count in failure_categories.most_common():
    print(f"  {category}: {count}")


Failure categories:
  steps_limit_exceeded: 14
  cost_limit_exceeded: 9


## 6. Handle Cost Limit Failures

Rule:
- If more than `COST_FAILURE_DISCARD_THRESHOLD_PERCENT` percent (33% in this notebook) of an agent config's tasks ended in a cost-limit failure, discard all runs from that agent config
- For remaining cost-exceeded runs, provide recommendations to re-run with higher limits
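
The rule can be sketched as a small function over per-agent counts. The agent names and counts below are hypothetical, and `triage` is an illustrative helper rather than part of the notebook.

```python
THRESHOLD_PERCENT = 33  # mirrors COST_FAILURE_DISCARD_THRESHOLD_PERCENT


def triage(
    cost_failures: dict[str, int],
    total_runs: dict[str, int],
) -> tuple[set[str], dict[str, int]]:
    """Split agents into 'discard' and 'needs re-run' based on cost failures."""
    discard: set[str] = set()
    rerun: dict[str, int] = {}
    for agent, total in total_runs.items():
        fails = cost_failures.get(agent, 0)
        if total > 0 and fails / total * 100 > THRESHOLD_PERCENT:
            discard.add(agent)
        elif fails > 0:
            rerun[agent] = fails
    return discard, rerun


discard, rerun = triage(
    cost_failures={"agent-a": 5, "agent-b": 2, "agent-d": 3},
    total_runs={"agent-a": 9, "agent-b": 9, "agent-c": 9, "agent-d": 9},
)

print("discard:", sorted(discard))
print("re-run:", rerun)
```

With 9 instances per agent and a threshold of 33, an agent with 3 cost failures (33.3%) is already discarded, so the effective cut-off is "3 or more failures out of 9".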

In [14]:
# Set the threshold for discarding agents due to cost failures
COST_FAILURE_DISCARD_THRESHOLD_PERCENT = 33

# Count cost failures per agent config
cost_failures_by_agent = defaultdict(int)
total_runs_by_agent = defaultdict(int)

for task in experiment_tasks:
    agent_key = task.agent_config.key
    total_runs_by_agent[agent_key] += 1

    exps = experiments_by_task.get(task)
    if exps:  # skip tasks with no experiments (max() raises on an empty list)
        latest = max(
            exps, key=lambda e: e.ended_at if hasattr(e, "ended_at") else e.started_at
        )

        if (
            isinstance(latest, FailedExperiment)
            and categorize_failure(latest) == "cost_limit_exceeded"
        ):
            cost_failures_by_agent[agent_key] += 1

# Identify agents to discard (>COST_FAILURE_DISCARD_THRESHOLD_PERCENT% cost failures)
agents_to_discard: set[AgentConfigKey] = set()
agents_needing_rerun: dict[AgentConfigKey, int] = {}

for agent_key in total_runs_by_agent:
    total = total_runs_by_agent[agent_key]
    cost_fails = cost_failures_by_agent[agent_key]

    if (
        total > 0
        and (cost_fails / total * 100) > COST_FAILURE_DISCARD_THRESHOLD_PERCENT
    ):
        agents_to_discard.add(agent_key)
    elif cost_fails > 0:
        agents_needing_rerun[agent_key] = cost_fails

print(
    f"Agent configs to discard (>{COST_FAILURE_DISCARD_THRESHOLD_PERCENT}% cost failures): {len(agents_to_discard)}"
)
for agent_key in agents_to_discard:
    total = total_runs_by_agent[agent_key]
    cost_fails = cost_failures_by_agent[agent_key]
    print(
        f"  {agent_key[:60]}...: {cost_fails}/{total}={cost_fails / total * 100:2.0f}% failures"
    )

print(
    f"\nAgent configs needing re-runs (some cost failures): {len(agents_needing_rerun)}"
)
for agent_key, count in agents_needing_rerun.items():
    print(f"  {agent_key[:60]}...: {count} failures")
    # Show the actual tasks for this agent_key that failed due to cost
    failed_tasks = [
        task
        for task in experiment_tasks
        if task.agent_config.key == agent_key
        and task in experiments_by_task
        and any(
            isinstance(exp, FailedExperiment)
            and categorize_failure(exp) == "cost_limit_exceeded"
            for exp in experiments_by_task[task]
        )
    ]
    for task in failed_tasks:
        print(f"    - Task instance_id: {task.instance_id}")

Agent configs to discard (>33% cost failures): 2
  swe-agent-mini/xai/grok-4-0709@modelkw-916a2d40@minisweagent...: 4/9=44% failures
  swe-agent-mini/anthropic/claude-opus-4-1-20250805@modelkw-91...: 5/9=56% failures

Agent configs needing re-runs (some cost failures): 0


In [15]:
# Show specific instances that need re-running with higher cost limits
if agents_needing_rerun:
    print("\n" + "=" * 80)
    print("ACTION REQUIRED: Re-run these experiments with higher cost limits")
    print("=" * 80)

    for task, _exp in failures_by_category["cost_limit_exceeded"]:
        agent_key = task.agent_config.key
        if agent_key not in agents_to_discard:
            print(f"\nInstance: {task.instance_id}")
            print(f"Agent config: {agent_key}")

    print("\n" + "=" * 80)
    print("After re-running, return to this notebook to continue analysis.")
    print("=" * 80)

## 7. Create Final Dataset: One Experiment Per Task

For each experiment task, we select the latest completed experiment (if available).
Tasks with only failed experiments keep their latest failed experiment; tasks from discarded agents are excluded.

In [16]:
# Filter out discarded agents
valid_tasks = [
    task for task in experiment_tasks if task.agent_config.key not in agents_to_discard
]
agent_configs = sorted(
    [ac for ac in agent_configs if ac.key not in agents_to_discard], key=lambda x: x.key
)
assert set(agent_configs) == {task.agent_config for task in valid_tasks}

print(f"Valid tasks after filtering: {len(valid_tasks)}")
print(
    f"  (Removed {len(experiment_tasks) - len(valid_tasks)} tasks from {len(agents_to_discard)} discarded agents)"
)
# Show final number of agent configs
print(f"Final number of agent configs: {len(agent_configs)}")
for ac in agent_configs:
    print(f"  - {ac.key}")

Valid tasks after filtering: 171
  (Removed 18 tasks from 2 discarded agents)
Final number of agent configs: 19
  - angular-schematics/789e301f
  - swe-agent-mini/anthropic/claude-sonnet-4-20250514@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
  - swe-agent-mini/anthropic/claude-sonnet-4-5-20250929@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
  - swe-agent-mini/gemini/gemini-2.5-flash-lite@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
  - swe-agent-mini/gemini/gemini-2.5-flash@modelkw-916a2d40@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
  - swe-agent-mini/gemini/gemini-2.5-pro@modelkw-06f54fcf@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
  - swe-agent-mini/mistral/devstral-medium-2507@modelkw-2b574f61@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc
  - swe-agent-mini/mistral/devstral-small-2507@modelkw-2b574f61@minisweagent-1.13.0@tasktpl-73947524@age

In [17]:
# Map each valid task to its latest completed experiment, falling back to the latest failed one
task_to_experiment: dict[ExperimentTask, CompletedExperiment | FailedExperiment] = {}

for task in valid_tasks:
    task_pretty = f"{task.instance_id} with {task.agent_config.key}"
    if task not in experiments_by_task:
        print(f"Task {task_pretty} not found in experiments_by_task")
        continue
    exps = experiments_by_task[task]
    if not exps:
        print(f"Task {task_pretty} has no experiments")
        continue

    failed = sorted(
        [e for e in exps if isinstance(e, FailedExperiment)], key=lambda e: e.ended_at
    )
    completed = sorted(
        [e for e in exps if isinstance(e, CompletedExperiment)],
        key=lambda e: e.ended_at,
    )
    if not completed:
        print(
            f"Task {task_pretty} has no completed experiments, so we assign the latest failed experiment"
        )
        assert len(failed) >= 1
        exp = failed[-1]
        execution_trace = (
            exp.artifacts.execution_trace
            if exp.artifacts and exp.artifacts.execution_trace
            else None
        )
        num_steps = len(execution_trace.steps) if execution_trace else None
        print(f"  Failed experiment: {exp.error} (num_steps: {num_steps})")
    else:
        exp = completed[-1]

    # Keep the selected experiment (latest completed, else latest failed)
    task_to_experiment[task] = exp

print(f"\nTasks mapped to experiments: {len(task_to_experiment)}")

Task gothinkster__angular-realworld-example-app_v11_to_v12 with swe-agent-mini/mistral/devstral-small-2507@modelkw-2b574f61@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc has no completed experiments, so we assign the latest failed experiment
  Failed experiment: Mini SWE Agent stopped before submission: LimitsExceeded:  (num_steps: 100)
Task gothinkster__angular-realworld-example-app_v12_to_v13 with swe-agent-mini/mistral/devstral-small-2507@modelkw-2b574f61@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc has no completed experiments, so we assign the latest failed experiment
  Failed experiment: Mini SWE Agent stopped before submission: LimitsExceeded:  (num_steps: 100)
Task gothinkster__angular-realworld-example-app_v13_to_v14 with swe-agent-mini/mistral/mistral-small-2506@modelkw-7322c1f7@minisweagent-1.13.0@tasktpl-73947524@agentsettings-e3def7dc has no completed experiments, so we assign the latest failed experiment
  Failed experiment: Mini SWE Agent sto

## 8. Match Experiments to Evaluations

In [18]:
# Group evaluations by submission ID
from bench_mac.core.models import SubmissionID

evaluations_by_submission: dict[SubmissionID, list[EvaluationResult]] = defaultdict(
    list
)

for eval_result in completed_evaluations:
    evaluations_by_submission[eval_result.result.submission_id].append(eval_result)

for eval_result in failed_evaluations:
    evaluations_by_submission[eval_result.submission_id].append(eval_result)

assert len(evaluations_by_submission) == len(completed_evaluations) + len(
    failed_evaluations
)

In [19]:
# Match each experiment to evaluations
task_to_evaluation: dict[ExperimentTask, EvaluationResult] = {}
tasks_without_evaluation: list[ExperimentTask] = []
tasks_with_only_failed_evals: dict[ExperimentTask, list[EvaluationFailed]] = (
    defaultdict(list)
)

for task, exp in task_to_experiment.items():
    match exp:
        case CompletedExperiment():
            submission_id = exp.submission.submission_id
        case FailedExperiment():
            # FailedExperiments have no submission
            continue

    if (
        submission_id not in evaluations_by_submission
        or not evaluations_by_submission[submission_id]
    ):
        tasks_without_evaluation.append(task)
        continue

    evals = evaluations_by_submission[submission_id]
    assert len(evals) > 0

    completed_evals = [e for e in evals if isinstance(e, EvaluationCompleted)]
    failed_evals = [e for e in evals if isinstance(e, EvaluationFailed)]

    if not completed_evals:
        tasks_with_only_failed_evals[task] = failed_evals
        continue

    # Keep latest completed evaluation
    latest_eval = max(completed_evals, key=lambda e: e.ended_at)
    task_to_evaluation[task] = latest_eval

print(f"Tasks with completed evaluations: {len(task_to_evaluation)}")
print(f"Tasks without any evaluation: {len(tasks_without_evaluation)}")
print(f"Tasks with only failed evaluations: {len(tasks_with_only_failed_evals)}")

Tasks with completed evaluations: 157
Tasks without any evaluation: 0
Tasks with only failed evaluations: 0
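
The matching step can be sketched independently of the BenchMAC models. In the sketch below, `Evaluation` is a hypothetical stand-in for `EvaluationCompleted` / `EvaluationFailed`, and the submission IDs are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Evaluation:
    """Hypothetical stand-in for an evaluation result."""

    submission_id: str
    completed: bool
    ended_at: datetime


def match_evaluations(
    submission_ids: list[str],
    evaluations: list[Evaluation],
) -> tuple[dict[str, Evaluation], list[str], list[str]]:
    """Return (latest completed evaluation per submission, missing, only-failed)."""
    by_submission: dict[str, list[Evaluation]] = defaultdict(list)
    for ev in evaluations:
        by_submission[ev.submission_id].append(ev)

    matched: dict[str, Evaluation] = {}
    missing: list[str] = []
    only_failed: list[str] = []
    for sid in submission_ids:
        evals = by_submission.get(sid, [])
        if not evals:
            missing.append(sid)
            continue
        done = [e for e in evals if e.completed]
        if not done:
            only_failed.append(sid)
            continue
        matched[sid] = max(done, key=lambda e: e.ended_at)
    return matched, missing, only_failed


day = datetime(2025, 10, 2)
evaluations = [
    Evaluation("sub-1", True, day),
    Evaluation("sub-1", True, day.replace(hour=5)),
    Evaluation("sub-2", False, day),
]
matched, missing, only_failed = match_evaluations(
    ["sub-1", "sub-2", "sub-3"], evaluations
)
print(len(matched), missing, only_failed)
```

Keeping "missing" and "only failed" as separate buckets makes it clear whether an evaluation still has to be run or has to be debugged.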


In [20]:
# Show failed evaluation errors
if tasks_with_only_failed_evals:
    print("\nFailed evaluations:")
    for task, evals in tasks_with_only_failed_evals.items():
        task_pretty = f"{task.instance_id} with {task.agent_config.key}"
        print(f"  {task_pretty}:")
        for evaluation in evals:  # avoid shadowing the built-in eval()
            print(f"    - {evaluation.error}")
        print()

In [21]:
# Prompt user to run missing evaluations
if tasks_without_evaluation or tasks_with_only_failed_evals:
    print("\n" + "=" * 80)
    print("ACTION REQUIRED: Run evaluations for experiments")
    print("=" * 80)
    print("\nRun: uv run benchmac eval")
    print(
        f"\nMissing evaluations: {len(tasks_without_evaluation) + len(tasks_with_only_failed_evals)}"
    )
    print("=" * 80)

In [22]:
EvaluatedExperiment = tuple[CompletedExperiment, EvaluationCompleted]
data: dict[ExperimentTask, FailedExperiment | EvaluatedExperiment] = {}

for task in valid_tasks:
    match task_to_experiment[task]:
        case CompletedExperiment() as exp:
            # Named `evaluation` to avoid shadowing the built-in eval()
            evaluation = task_to_evaluation[task]
            assert isinstance(evaluation, EvaluationCompleted)
            data[task] = (exp, evaluation)
        case FailedExperiment() as exp:
            data[task] = exp

assert not any(task in data for task in tasks_without_evaluation)
assert not any(task in data for task in tasks_with_only_failed_evals)

tasks = list(data.keys())
experiments = [x if isinstance(x, FailedExperiment) else x[0] for x in data.values()]
failed_experiments = [x for x in experiments if isinstance(x, FailedExperiment)]
completed_experiments = [x for x in experiments if isinstance(x, CompletedExperiment)]
evaluations = [x[1] for x in data.values() if isinstance(x, tuple)]

In [23]:
from collections import defaultdict

# Group data by agent_config.key
grouped_by_agent: dict[str, list[FailedExperiment | EvaluatedExperiment]] = defaultdict(
    list
)
for task, result in data.items():
    agent_key = task.agent_config.key
    grouped_by_agent[agent_key].append(result)

assert all(len(results) == len(instance_ids) for results in grouped_by_agent.values())

At this point, every valid experiment task is matched to its result: either a completed experiment with its evaluation, or the latest failed experiment.

## Next Steps

From here, we can proceed with:
- Computing success metrics and outcome categories
- Analyzing patch characteristics
- Creating leaderboards and visualizations
- Detecting execution loops in failed experiments

# Analysis

Now that the data is loaded and prepared, what do we want to analyze?
Some open questions:

- Which models perform best? Worst?
- Which models are the most token-efficient?
- Which models are the most step-efficient?
- Which models are the fastest? The slowest?
- Which models are the most expensive, and which are the cheapest?
  Price per token alone is not enough, because:
  - A reasoning model with a low per-token price can cost more overall than a non-reasoning model with a high per-token price, since reasoning models can generate far more tokens.
  - Some models take more steps than others on the same task.

  Because we cannot predict how many tokens a model will generate, we need a more reliable cost metric. One candidate is the average cost per instance, although on its own it says nothing about the cost/performance ratio.

Notes for the dataframe construction:

- evaluation metrics, taken from the evaluation results
- percentage of steps succeeded
- number of steps


## 9. Build Analysis Dataset

To enable downstream tables and leaderboards we consolidate the `data` mapping
into a tabular structure that keeps one row per `(instance_id, agent_config)`
pair and captures experiment/evaluation outcomes side by side.



In [24]:
from typing import Any


def _get_agent_display(agent_config: AgentConfig) -> str:
    """Return a human-friendly label for the agent configuration."""

    if isinstance(agent_config, MiniSweAgentConfig):
        return agent_config.model_name
    return agent_config.display_name


def _extract_experiment_common(
    experiment: CompletedExperiment | FailedExperiment,
) -> dict[str, Any]:
    """Collect fields shared by completed and failed experiments."""

    artifacts = experiment.artifacts
    execution_trace = artifacts.execution_trace if artifacts else None
    step_count = len(execution_trace.steps) if execution_trace else None

    return {
        "experiment_id": experiment.id,
        "experiment_started_at": experiment.started_at,
        "experiment_ended_at": experiment.ended_at,
        "experiment_duration_seconds": experiment.duration.total_seconds(),
        "experiment_cost_usd": artifacts.cost_usd if artifacts else None,
        "experiment_n_calls": artifacts.n_calls if artifacts else None,
        "experiment_step_count": step_count,
    }


records: list[dict[str, Any]] = []

for task, result in data.items():
    agent_cfg = task.agent_config
    base: dict[str, Any] = {
        "instance_id": task.instance_id,
        "agent_key": agent_cfg.key,
        "agent_display_name": _get_agent_display(agent_cfg),
        "agent_scaffold": agent_cfg.scaffold,
    }

    if isinstance(result, FailedExperiment):
        record = {
            **base,
            "experiment_status": "failed",
            "failure_category": categorize_failure(result),
            "submission_id": None,
            "empty_diff": None,
            "evaluation_status": None,
            "evaluation_id": None,
            "evaluation_started_at": None,
            "evaluation_ended_at": None,
            "evaluation_duration_seconds": None,
            "evaluation_step_count": None,
            "evaluation_total_duration_seconds": None,
            "patch_application_success": None,  # TODO: shouldn't we set it to False?
            "install_success": None,  # TODO: shouldn't we set it to False?
            "build_success": None,  # TODO: shouldn't we set it to False?
            "target_version_achieved": None,  # TODO: shouldn't we set it to False?
        }
        record.update(_extract_experiment_common(result))
        records.append(record)
        continue

    experiment, evaluation = result
    metrics = evaluation.result.metrics

    record = {
        **base,
        "experiment_status": "completed",
        "failure_category": None,
        "submission_id": experiment.submission.submission_id,
        "empty_diff": experiment.submission.model_patch.strip() == "",
        "evaluation_status": "completed",
        "evaluation_id": evaluation.id,
        "patch_application_success": metrics.patch_application_success,
        "install_success": metrics.install_success,
        "build_success": metrics.build_success,
        "target_version_achieved": metrics.target_version_achieved,
    }
    record.update(_extract_experiment_common(experiment))
    records.append(record)

analysis_df = (
    pd.DataFrame.from_records(records)
    .sort_values(["instance_id", "agent_key"])
    .reset_index(drop=True)
)

analysis_df.head()

Unnamed: 0,instance_id,agent_key,agent_display_name,agent_scaffold,experiment_status,failure_category,submission_id,empty_diff,evaluation_status,evaluation_id,...,experiment_ended_at,experiment_duration_seconds,experiment_cost_usd,experiment_n_calls,experiment_step_count,evaluation_started_at,evaluation_ended_at,evaluation_duration_seconds,evaluation_step_count,evaluation_total_duration_seconds
0,gothinkster__angular-realworld-example-app_v11...,angular-schematics/789e301f,angular-schematics/789e301f,angular-schematics,completed,,0199a233-1775-71a6-8535-6ebf930e8d18,False,completed,0199a24b-d760-7d66-b2a0-bc006af02001,...,2025-10-01 23:56:36.294169+00:00,114.877647,,,2,,,,,
1,gothinkster__angular-realworld-example-app_v11...,swe-agent-mini/anthropic/claude-sonnet-4-20250...,anthropic/claude-sonnet-4-20250514,swe-agent-mini,completed,,0199a45b-a964-757e-9fc8-6524883b29da,False,completed,0199a76a-55ce-7684-9dd2-dc3e09760e18,...,2025-10-02 10:40:13.332620+00:00,446.549921,0.920712,29.0,29,,,,,
2,gothinkster__angular-realworld-example-app_v11...,swe-agent-mini/anthropic/claude-sonnet-4-5-202...,anthropic/claude-sonnet-4-5-20250929,swe-agent-mini,completed,,0199a45b-a965-7cad-81e6-5a240a9eb661,False,completed,0199a76e-dd8d-7726-844f-bce9d3e0f941,...,2025-10-02 10:47:41.814046+00:00,445.896804,0.552558,20.0,20,,,,,
3,gothinkster__angular-realworld-example-app_v11...,swe-agent-mini/gemini/gemini-2.5-flash-lite@mo...,gemini/gemini-2.5-flash-lite,swe-agent-mini,completed,,0199a90c-ed54-78d2-8916-96c6d2db864f,False,completed,0199a99e-4774-73d7-bad6-7227d16f2faf,...,2025-10-03 08:12:29.190901+00:00,353.443725,0.020955,29.0,29,,,,,
4,gothinkster__angular-realworld-example-app_v11...,swe-agent-mini/gemini/gemini-2.5-flash@modelkw...,gemini/gemini-2.5-flash,swe-agent-mini,completed,,0199a45b-a961-760f-8b6e-2b4230c123a4,False,completed,0199a767-95b9-730e-8e2e-6eb481e46b9f,...,2025-10-02 10:36:10.979403+00:00,331.32046,0.042373,23.0,23,,,,,


In [None]:
total_cost = analysis_df["experiment_cost_usd"].dropna().sum()
print(f"Total cost (USD): {total_cost:.4f}")

Total cost (USD): 21.6799
