# Agent Colab: Model Comparison Benchmark

Compares **Gemini 3 Pro Preview** vs **Gemini 2.5 Pro** on an agentic, data-driven reasoning task.

Each model is run **N times** on the same task and data. Per-test pass rates are collected so that consistent differences between the models can be told apart from run-to-run noise, especially on the "headroom" challenges (citation rings, temporal anomalies, typo correction, ambiguous author disambiguation).
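
With only a handful of runs per model, a pass rate is a coarse estimate. The sketch below is not part of the `benchmark` package; it is a hypothetical helper (`wilson_interval`) that computes a Wilson score interval for one test's pass rate, to show how wide the uncertainty is at small N.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (hypothetical helper)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(passes=5, runs=5)
print(f"5/5 passes: pass rate is plausibly between {low:.2f} and {high:.2f}")
```

Even a perfect 5/5 leaves a lower bound of roughly 0.57, which is why raising the number of runs helps when two models are close.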

**Setup:** Google Colab Pro with `google.colab.ai` (no API keys needed)

**Agent must:**
1. Load data from files (environment interaction)
2. Extract entities and resolve ambiguities (multi-step reasoning)
3. Analyze the citation network and detect anomalies (graph reasoning)
4. Save `final_report.json` to disk (artifact generation)
5. Pass all unit tests
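
Step 4 can be sanity-checked before any unit test runs. The sketch below is a hypothetical pre-check (`check_report` is not a function from the repository) that only confirms the artifact exists and parses as JSON. It assumes nothing about the report's schema, which the benchmark's own tests define.

```python
import json
from pathlib import Path


def check_report(path: str) -> tuple[bool, str]:
    """Return (ok, message) for a report file; hypothetical helper, schema-agnostic."""
    report = Path(path)
    if not report.is_file():
        return False, f"missing file: {report}"
    try:
        data = json.loads(report.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    return True, f"valid JSON with top-level type {type(data).__name__}"


# Usage with an assumed location; the real path depends on where the agent writes.
print(check_report("final_report.json"))
```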

## Setup

In [None]:
import subprocess, sys, os, shutil

REPO_URL = "https://github.com/EhsanKA/agentic_task.git"
REPO_DIR = "/content/agentic_task"

if os.path.exists(REPO_DIR):
    shutil.rmtree(REPO_DIR)
subprocess.run(["git", "clone", REPO_URL, REPO_DIR], check=True)
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "--force-reinstall", "--no-cache-dir", REPO_DIR], check=True)

# Purge ALL cached benchmark submodules so fresh code is loaded
for mod_name in list(sys.modules):
    if mod_name == "benchmark" or mod_name.startswith("benchmark."):
        del sys.modules[mod_name]

# Verify correct version loaded
from benchmark.evaluation.agent import build_agent_context
import inspect
print("build_agent_context signature:", inspect.signature(build_agent_context))

In [None]:
from google.colab import ai
import json

available_models = ai.list_models()
print("Available models:", available_models)

## Configuration & Data Generation

In [None]:
from benchmark.data.loader import setup_data
from benchmark.evaluation.prompt import BENCHMARK_PROMPT
from benchmark.evaluation.agent import build_agent_context
from benchmark.evaluation.comparison import (
    run_comparison, pass_rate_table, summary_table, run_log_table, print_verdict,
)

# --- Comparison configuration ---
MODELS_TO_COMPARE = [
    "google/gemini-3-pro-preview",
    "google/gemini-2.5-pro",
]
NUM_RUNS = 5  # runs per model (increase to 10 for stronger statistical signal)

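# Filter to models actually available in this Colab session
requested_models = list(MODELS_TO_COMPARE)
MODELS_TO_COMPARE = [m for m in requested_models if m in available_models]
assert len(MODELS_TO_COMPARE) >= 1, f"None of the target models available. Have: {available_models}"
# Warn when a requested model was dropped: with one model left there is nothing to compare against
skipped_models = [m for m in requested_models if m not in MODELS_TO_COMPARE]
if skipped_models:
    print(f"Warning: skipping unavailable models: {skipped_models}")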
print(f"Models to compare: {MODELS_TO_COMPARE}")
print(f"Runs per model:    {NUM_RUNS}")

# Generate data once (fixed across all runs for fair comparison)
_, _, _, DATA_DIR = setup_data()
print(f"Data directory:    {DATA_DIR}")

# Adapter: wrap google.colab.ai into the signature comparison.py expects
def generate_fn(prompt, model_name):
    return ai.generate_text(prompt=prompt, model_name=model_name)
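
A long multi-run comparison can be derailed by a single transient failure in the model call. Whether `run_comparison` already handles such errors is not stated here, so the following is only a sketch of one option: a hypothetical `with_retries` wrapper that keeps the `(prompt, model_name)` signature and retries with exponential backoff. The retry counts and delays are illustrative assumptions.

```python
import time
from typing import Callable


def with_retries(
    fn: Callable[[str, str], str],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str, str], str]:
    """Wrap a (prompt, model_name) -> text function with simple retries (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapped(prompt: str, model_name: str) -> str:
        for attempt in range(attempts):
            try:
                return fn(prompt, model_name)
            except Exception:
                # Re-raise on the final attempt; otherwise back off and try again.
                if attempt == attempts - 1:
                    raise
                sleep(base_delay * 2**attempt)
        raise RuntimeError("unreachable")  # loop always returns or raises

    return wrapped
```

In this notebook the wrapper could be applied as `generate_fn = with_retries(generate_fn)` before the comparison cell runs. Note that retried calls make a run slower but do not change what counts as a pass.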

## Multi-Run Comparison

In [None]:
context = build_agent_context(BENCHMARK_PROMPT, DATA_DIR)

all_run_results = run_comparison(
    models=MODELS_TO_COMPARE,
    num_runs=NUM_RUNS,
    context=context,
    data_dir=DATA_DIR,
    generate_fn=generate_fn,
)

In [None]:
df_rates = pass_rate_table(all_run_results, MODELS_TO_COMPARE)
print("=" * 70)
print("PER-TEST PASS RATES")
print("=" * 70)
print(df_rates.to_string())
print("\n*** = headroom challenge test (designed to differentiate models)")
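
When reading the pass-rate table, it helps to know how large a gap must be before it is unlikely to be noise. The sketch below is independent of the `benchmark` package: a hypothetical `fisher_exact_two_sided` helper that applies Fisher's exact test to the pass counts of two models on one test, using only the standard library.

```python
from math import comb


def fisher_exact_two_sided(pass_a: int, runs_a: int, pass_b: int, runs_b: int) -> float:
    """Two-sided Fisher exact p-value for two pass counts (hypothetical helper)."""
    total_pass = pass_a + pass_b
    total_runs = runs_a + runs_b

    def prob(k: int) -> float:
        # Hypergeometric probability that model A gets exactly k of the passes.
        return comb(runs_a, k) * comb(runs_b, total_pass - k) / comb(total_runs, total_pass)

    k_min = max(0, total_pass - runs_b)
    k_max = min(runs_a, total_pass)
    observed = prob(pass_a)
    # Sum every outcome at least as unlikely as the observed one (small tolerance for float ties).
    p_value = sum(prob(k) for k in range(k_min, k_max + 1) if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, p_value)


print(f"5/5 vs 2/5: p = {fisher_exact_two_sided(5, 5, 2, 5):.3f}")
print(f"5/5 vs 0/5: p = {fisher_exact_two_sided(5, 5, 0, 5):.3f}")
```

With 5 runs per model, a 5/5 versus 2/5 split gives p of about 0.167, so only near-total separation on a single test reaches the conventional 0.05 threshold. More runs, or pooling across the headroom tests, would be needed to confirm smaller gaps.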

## Overall Summary

In [None]:
df_summary = summary_table(all_run_results, MODELS_TO_COMPARE)
print("=" * 70)
print("MODEL COMPARISON SUMMARY")
print("=" * 70)
print(df_summary.to_string(index=False))

## Detailed Run Log

In [None]:
df_runs = run_log_table(all_run_results)
print(df_runs.to_string(index=False))

## Verdict

In [None]:
print_verdict(all_run_results, MODELS_TO_COMPARE)