# Vyapti Probe Benchmark — Interactive Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SharathSPhD/pramana/blob/main/notebooks/03_vyapti_evaluation.ipynb)

This notebook provides an interactive interface for running the **Vyapti Probe Benchmark** — a 100-problem evaluation suite testing whether LLMs can distinguish statistical regularity from *vyapti* (invariable concomitance).

**What's inside:**
- Browse all 100 problems (50 probes + 50 controls) across 5 Hetvabhasa categories
- Optionally run a live evaluation against a local Ollama server or Hugging Face `transformers` models
- View 5-tier scoring results (outcome, structure, vyapti, Z3, hetvabhasa)
- Statistical analysis with bootstrap confidence intervals
- Visualizations: probe-vs-control bar charts by category and by model, plus the hetvabhasa failure distribution
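
The bootstrap confidence intervals listed above can be illustrated with a minimal percentile-bootstrap sketch for the control-minus-probe accuracy gap. This shows the general technique only and is not taken from `scripts/run_vyapti_analysis.py`, whose method may differ. The helper name `bootstrap_gap_ci` and the toy outcome lists are hypothetical.

```python
import random


def bootstrap_gap_ci(probe_correct, control_correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (control accuracy - probe accuracy).

    Both inputs are lists of 0/1 outcomes. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample each group independently, with replacement.
        p = rng.choices(probe_correct, k=len(probe_correct))
        c = rng.choices(control_correct, k=len(control_correct))
        gaps.append(sum(c) / len(c) - sum(p) / len(p))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    point = (sum(control_correct) / len(control_correct)
             - sum(probe_correct) / len(probe_correct))
    return point, lo, hi


# Toy data (made up): 20 of 50 probes correct, 40 of 50 controls correct.
probe_outcomes = [1] * 20 + [0] * 30
control_outcomes = [1] * 40 + [0] * 10
point, lo, hi = bootstrap_gap_ci(probe_outcomes, control_outcomes)
print(f"gap={point:+.3f}, 95% CI=[{lo:+.3f}, {hi:+.3f}]")
```

A positive gap whose interval excludes zero would suggest that the model handles controls better than probes, which is the pattern the benchmark is designed to detect.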

In [None]:
#@title 1. Environment Setup
import os
import sys
import subprocess
from pathlib import Path

# Detect environment
IN_COLAB = "google.colab" in sys.modules or os.path.exists("/content")

if IN_COLAB:
    REPO_URL = "https://github.com/SharathSPhD/pramana.git"
    REPO_DIR = "/content/pramana"
    if not os.path.exists(REPO_DIR):
        print("Cloning repository...")
        subprocess.run(
            ["git", "clone", "--depth", "1", REPO_URL, REPO_DIR],
            check=True,
            capture_output=True,
        )
        print("Done.")
    os.chdir(REPO_DIR)
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", "-r", "notebooks/requirements.txt"],
        check=True,
        capture_output=True,
    )
else:
    # Local: support running from repo root or notebooks/
    cwd = Path.cwd().resolve()
    if (cwd / "data" / "vyapti_probe").exists():
        os.chdir(cwd)
    elif (cwd.parent / "data" / "vyapti_probe").exists():
        os.chdir(cwd.parent)

PROJECT_ROOT = Path.cwd().resolve()
DATA_DIR = PROJECT_ROOT / "data" / "vyapti_probe"
RESULTS_REAL_DIR = DATA_DIR / "results_real"

if not DATA_DIR.exists():
    raise FileNotFoundError(
        f"Could not find benchmark data directory at {DATA_DIR}. "
        "Run this notebook from the pramana repo root (or Colab setup)."
    )

# Add project paths for imports
sys.path.insert(0, str(PROJECT_ROOT / "src"))
sys.path.insert(0, str(PROJECT_ROOT / "notebooks"))

# GPU detection
try:
    import torch

    GPU_AVAILABLE = torch.cuda.is_available()
    if GPU_AVAILABLE:
        gpu_name = torch.cuda.get_device_name(0)
        print(f"GPU detected: {gpu_name}")
    else:
        print("No GPU detected. CPU mode.")
except ImportError:
    GPU_AVAILABLE = False
    print("PyTorch not available. Using CPU-only backends.")

print(f"Project root: {PROJECT_ROOT}")
print(f"Results path: {RESULTS_REAL_DIR}")
print("Setup complete.")

In [None]:
#@title 2. Load Benchmark Data
import json
from collections import Counter

with open(DATA_DIR / "problems.json") as f:
    problems = json.load(f)

with open(DATA_DIR / "solutions.json") as f:
    solutions_list = json.load(f)

solutions = {s["id"]: s for s in solutions_list}
problems_by_id = {p["id"]: p for p in problems}

probes = [p for p in problems if p["type"] == "probe"]
controls = [p for p in problems if p["type"] == "control"]

print(f"Loaded {len(problems)} problems ({len(probes)} probes + {len(controls)} controls)")
print(f"Loaded {len(solutions)} solutions")
print()

cat_counts = Counter(p["category"] for p in probes)
print("Probes by category:")
for cat, count in sorted(cat_counts.items()):
    print(f"  {cat}: {count}")

missing_solutions = [p["id"] for p in problems if p["id"] not in solutions]
if missing_solutions:
    print(f"\nWARNING: {len(missing_solutions)} problems missing solutions")

## 3. Browse Problems

Explore benchmark content by category and type (independent of model execution).
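
Before picking a `problem_index`, it can help to see how many problems each category/type pair contains. The sketch below counts them on toy records that use the same keys the browser cell reads (`id`, `category`, `type`). The helper `group_sizes` and every id other than `SAV-01` are hypothetical.

```python
from collections import Counter


def group_sizes(problems):
    """Count problems per (category, type) pair. Hypothetical helper."""
    return Counter((p["category"], p["type"]) for p in problems)


# Toy records (made up) with the keys used by the browser cell.
toy_problems = [
    {"id": "SAV-01", "category": "savyabhichara", "type": "probe"},
    {"id": "SAV-02", "category": "savyabhichara", "type": "probe"},
    {"id": "SAV-C01", "category": "savyabhichara", "type": "control"},
    {"id": "VIR-01", "category": "viruddha", "type": "probe"},
]

sizes = group_sizes(toy_problems)
for (cat, ptype), n in sorted(sizes.items()):
    print(f"{cat:<16} {ptype:<8} n={n}  valid problem_index: 0..{n - 1}")
```

In the notebook, the same call on the loaded `problems` list would show the valid slider range for each selection.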

In [None]:
#@title 3. Problem Browser
from IPython.display import display, Markdown, HTML

#@markdown **Select category and problem:**
category = 'savyabhichara' #@param ['savyabhichara', 'viruddha', 'prakaranasama', 'sadhyasama', 'kalatita']
problem_type = 'probe' #@param ['probe', 'control']
problem_index = 0 #@param {type: "slider", min: 0, max: 14}

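filtered = [p for p in problems if p['category'] == category and p['type'] == problem_type]
if not filtered:
    raise ValueError(f"No problems found for category='{category}', type='{problem_type}'.")
# Clamp the slider value to the valid index range for this category/type.
problem_index = max(0, min(problem_index, len(filtered) - 1))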

p = filtered[problem_index]
sol = solutions.get(p['id'], {})

md = f"""### {p['id']} | {p['logic_type']} | Difficulty: {p['difficulty']}

**Problem:**

{p['problem_text']}

---

**Correct Answer:** {p['correct_answer']}

"""

if p['type'] == 'probe':
    md += f"""**Trap Answer:** {p['trap_answer']}

**Vyapti Under Test:** {p['vyapti_under_test']}

**Why It Fails:** {p['why_it_fails']}
"""

if sol:
    md += f"""\n---\n\n**Solution Details:**
- Vyapti Status: {sol.get('vyapti_status', 'N/A')}
- Counterexample: {sol.get('counterexample', 'N/A')}
- Hetvabhasa: {sol.get('hetvabhasa_type', 'N/A')}
- Z3 Note: {sol.get('z3_verification', 'N/A')}
"""

display(Markdown(md))

## 4. Load Pre-computed Evaluation Results (Primary Mode)

This is the default workflow. It reads real evaluation outputs from `data/vyapti_probe/results_real/` and does **not** require loading any model backend.
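
The cells in this section read a handful of keys from `summary.json`: `accuracy`, `probe_accuracy`, `control_accuracy`, and per-category counts under `by_category`. As a rough sanity check, the sketch below recomputes probe and control accuracy from the per-category counts and compares them with the top-level values. The helper `recompute_from_categories` and the toy numbers are hypothetical, and whether the real file keeps the two views consistent is an assumption.

```python
def recompute_from_categories(model_summary):
    """Recompute (probe_acc, control_acc) from by_category counts. Hypothetical helper."""
    by_cat = model_summary.get("by_category", {})
    pc = sum(c.get("probe_correct", 0) for c in by_cat.values())
    pt = sum(c.get("probe_total", 0) for c in by_cat.values())
    cc = sum(c.get("control_correct", 0) for c in by_cat.values())
    ct = sum(c.get("control_total", 0) for c in by_cat.values())
    probe_acc = pc / pt if pt else 0.0
    control_acc = cc / ct if ct else 0.0
    return probe_acc, control_acc


# Toy summary (made up) shaped like the keys the notebook reads.
toy_summary = {
    "toy_model": {
        "accuracy": 0.6,
        "probe_accuracy": 0.4,
        "control_accuracy": 0.8,
        "by_category": {
            "savyabhichara": {"probe_correct": 3, "probe_total": 5,
                              "control_correct": 4, "control_total": 5},
            "viruddha": {"probe_correct": 1, "probe_total": 5,
                         "control_correct": 4, "control_total": 5},
        },
    }
}

for model, data in toy_summary.items():
    probe_acc, control_acc = recompute_from_categories(data)
    consistent = (abs(probe_acc - data["probe_accuracy"]) < 1e-9
                  and abs(control_acc - data["control_accuracy"]) < 1e-9)
    print(f"{model}: probe={probe_acc:.1%} control={control_acc:.1%} consistent={consistent}")
```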

In [None]:
#@title 4a. Load Results Artifacts
import json
from collections import defaultdict

RESULTS_DIR = RESULTS_REAL_DIR
summary_path = RESULTS_DIR / "summary.json"
comparisons_path = RESULTS_DIR / "comparisons.json"
taxonomy_path = RESULTS_DIR / "taxonomy_coverage.json"
report_path = RESULTS_DIR / "report.md"

if not summary_path.exists():
    raise FileNotFoundError(
        f"Missing {summary_path}. Run scripts/run_vyapti_real.py and scripts/run_vyapti_analysis.py first."
    )

with open(summary_path) as f:
    summary = json.load(f)

comparisons = []
if comparisons_path.exists():
    with open(comparisons_path) as f:
        comparisons = json.load(f)

taxonomy = None
if taxonomy_path.exists():
    with open(taxonomy_path) as f:
        taxonomy = json.load(f)

MODEL_RESULTS = defaultdict(dict)
PER_MODEL_COUNTS = {}

for model in summary.keys():
    model_dir = RESULTS_DIR / model.replace("/", "_").replace(" ", "_")
    if not model_dir.exists():
        PER_MODEL_COUNTS[model] = 0
        continue

    for fp in sorted(model_dir.glob("*.json")):
        if fp.name == "summary.json":
            continue
        with open(fp) as f:
            row = json.load(f)
        pid = row.get("problem_id", fp.stem)
        MODEL_RESULTS[model][pid] = row

    PER_MODEL_COUNTS[model] = len(MODEL_RESULTS[model])

AVAILABLE_MODELS = list(summary.keys())

print(f"Loaded summary for {len(summary)} models from {RESULTS_DIR}")
for model in AVAILABLE_MODELS:
    count = PER_MODEL_COUNTS.get(model, 0)
    flag = "OK" if count >= 100 else "WARN"
    print(f"  [{flag}] {model}: {count} per-problem records")

if comparisons:
    print(f"Loaded {len(comparisons)} statistical comparisons")
if taxonomy:
    print("Loaded taxonomy coverage metrics")

In [None]:
#@title 4b. Overview Table (All Models)
rows = []
for model, data in summary.items():
    rows.append(
        {
            "model": model,
            "accuracy": data.get("accuracy", 0.0),
            "probe_accuracy": data.get("probe_accuracy", 0.0),
            "control_accuracy": data.get("control_accuracy", 0.0),
            "gap_control_minus_probe": data.get("control_accuracy", 0.0) - data.get("probe_accuracy", 0.0),
            "records_loaded": PER_MODEL_COUNTS.get(model, 0),
        }
    )

rows = sorted(rows, key=lambda r: r["accuracy"], reverse=True)

try:
    import pandas as pd
    from IPython.display import display

    df = pd.DataFrame(rows)
    display(df.style.format({
        "accuracy": "{:.1%}",
        "probe_accuracy": "{:.1%}",
        "control_accuracy": "{:.1%}",
        "gap_control_minus_probe": "{:+.1%}",
    }))
except Exception:
    print(f'\n{"Model":<28} {"Accuracy":>10} {"Probe":>8} {"Control":>8} {"Gap":>8} {"Rows":>6}')
    print('-' * 80)
    for row in rows:
        print(
            f"{row['model']:<28} "
            f"{row['accuracy']:>9.1%} "
            f"{row['probe_accuracy']:>7.1%} "
            f"{row['control_accuracy']:>7.1%} "
            f"{row['gap_control_minus_probe']:>+7.1%} "
            f"{row['records_loaded']:>6}"
        )

print("\nPrimary mode note: this analysis uses pre-computed JSON artifacts only; no model is loaded.")

## 5. Browse Model Responses (Input/Output Data)

In [None]:
#@title 5a. Response Browser
from IPython.display import Markdown, display
import html

#@markdown Select a model and problem id to inspect the saved model response and tier scores.
selected_model = "deepseek_8b_base" #@param ["llama_3b_base", "deepseek_8b_base", "stage0_pramana", "stage1_pramana", "base_with_cot", "base_with_nyaya_template"]
selected_problem_id = "SAV-01" #@param {type:"string"}

model_records = MODEL_RESULTS.get(selected_model, {})
record = model_records.get(selected_problem_id.strip())
problem = problems_by_id.get(selected_problem_id.strip())
solution = solutions.get(selected_problem_id.strip(), {})

if problem is None:
    print(f"Problem '{selected_problem_id}' not found in benchmark.")
elif record is None:
    print(f"No saved record found for model='{selected_model}', problem='{selected_problem_id}'.")
    sample_ids = sorted(list(model_records.keys()))[:15]
    if sample_ids:
        print(f"Sample available ids for {selected_model}: {', '.join(sample_ids)}")
else:
    tiers = record.get("tiers", [])
    tier_lines = []
    for t in tiers:
        status = "PASS" if t.get("passed", False) else "FAIL"
        tier_lines.append(f"- Tier {t.get('tier', '?')} ({t.get('name', 'unknown')}): {status} | score={t.get('score', 0):.2f}")

    md = f"""
### {selected_problem_id} | {problem.get('category', 'N/A')} | {problem.get('type', 'N/A')}

**Model:** `{selected_model}`  
**Final answer correct:** `{record.get('final_answer_correct', False)}`  
**Hetvabhasa classification:** `{record.get('hetvabhasa_classification', 'N/A')}`

**Problem text**

{problem.get('problem_text', '')}

---

**Ground-truth answer**

{solution.get('answer', problem.get('correct_answer', 'N/A'))}

---

**Tier breakdown**

{chr(10).join(tier_lines)}

---

**Model raw response**

```text
{record.get('raw_response', '')}
```
"""
    display(Markdown(md))

In [None]:
#@title 6. Statistical Analysis (from saved artifacts)
CATEGORIES = ["savyabhichara", "viruddha", "prakaranasama", "sadhyasama", "kalatita"]

print("=== C1-C4 Comparisons ===")
if comparisons:
    for c in comparisons:
        print(f"\n{c.get('name', 'Unnamed comparison')}")
        print(f"  {c.get('description', '')}")
        print(f"  Difference: {c.get('difference', 0):+.3f}")
        if c.get("p_value_approx", -1) >= 0:
            print(f"  95% CI: [{c.get('ci_lower', 0):+.3f}, {c.get('ci_upper', 0):+.3f}]")
            print(f"  p-value: {c.get('p_value_approx', 1.0):.4f}")
            print(f"  Significant: {c.get('significant', False)}")
        else:
            print("  Descriptive metric (no inferential p-value)")
        print(f"  N: {c.get('n_samples', 0)}")
else:
    print("No comparisons.json loaded.")

print("\n=== Hetvabhasa Taxonomy ===")
if taxonomy:
    print(f"Total failures: {taxonomy.get('total_failures', 0)}")
    print(f"Coverage: {taxonomy.get('coverage_pct', 0):.1f}%")
    print(f"Assisted predictive accuracy: {taxonomy.get('predictive_accuracy', 0):.1f}%")
    print(f"Strict predictive accuracy: {taxonomy.get('strict_predictive_accuracy', 0):.1f}%")
    print(f"Fallback classifications: {taxonomy.get('fallback_count', 0)}")
    print("\nDistribution:")
    dist = taxonomy.get("distribution", {})
    total_failures = max(1, taxonomy.get("total_failures", 1))
    for htype, count in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        print(f"  {htype:<16} {count:>4} ({count/total_failures:>5.1%})")
else:
    print("No taxonomy_coverage.json loaded.")

print("\n=== Category-wise Probe/Control by Model ===")
for model, data in summary.items():
    print(f"\n{model}")
    print(f"  {'Category':<16} {'Probe':>9} {'Control':>9} {'Gap':>8}")
    for cat in CATEGORIES:
        cat_data = data.get("by_category", {}).get(cat, {})
        pc = cat_data.get("probe_correct", 0)
        pt = cat_data.get("probe_total", 0)
        cc = cat_data.get("control_correct", 0)
        ct = cat_data.get("control_total", 0)
        # Guard only the divisions so an empty category prints as 0/0, not 0/1.
        gap = (cc / max(1, ct)) - (pc / max(1, pt))
        print(f"  {cat:<16} {pc:>2}/{pt:<5} {cc:>2}/{ct:<5} {gap:>+7.1%}")

In [None]:
#@title 7. Visualization (from pre-computed data)
#@markdown Choose a model for category-level probe/control visualization.
viz_model = "deepseek_8b_base" #@param ["llama_3b_base", "deepseek_8b_base", "stage0_pramana", "stage1_pramana", "base_with_cot", "base_with_nyaya_template"]

try:
    import matplotlib.pyplot as plt
    import numpy as np

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    ax1, ax2, ax3 = axes

    # Panel 1: Probe vs control by category for selected model
    cat_short = {
        "savyabhichara": "SAV",
        "viruddha": "VIR",
        "prakaranasama": "PRA",
        "sadhyasama": "SAD",
        "kalatita": "KAL",
    }
    model_data = summary.get(viz_model, {})
    by_cat = model_data.get("by_category", {})

    cats = ["savyabhichara", "viruddha", "prakaranasama", "sadhyasama", "kalatita"]
    probe_accs = []
    control_accs = []
    labels = []

    for cat in cats:
        c = by_cat.get(cat, {})
        pt = max(1, c.get("probe_total", 0))
        ct = max(1, c.get("control_total", 0))
        probe_accs.append(c.get("probe_correct", 0) / pt)
        control_accs.append(c.get("control_correct", 0) / ct)
        labels.append(cat_short[cat])

    x = np.arange(len(labels))
    w = 0.36
    ax1.bar(x - w / 2, probe_accs, w, label="Probes", alpha=0.85)
    ax1.bar(x + w / 2, control_accs, w, label="Controls", alpha=0.85)
    ax1.set_xticks(x)
    ax1.set_xticklabels(labels)
    ax1.set_ylim(0, 1)
    ax1.set_ylabel("Accuracy")
    ax1.set_title(f"{viz_model}: Probe vs Control by category")
    ax1.legend()

    # Panel 2: Probe and control accuracy across all models
    models = list(summary.keys())
    probe_all = [summary[m].get("probe_accuracy", 0.0) for m in models]
    control_all = [summary[m].get("control_accuracy", 0.0) for m in models]
    x2 = np.arange(len(models))

    ax2.bar(x2 - w / 2, probe_all, w, label="Probe", alpha=0.85)
    ax2.bar(x2 + w / 2, control_all, w, label="Control", alpha=0.85)
    ax2.set_xticks(x2)
    ax2.set_xticklabels(models, rotation=40, ha="right")
    ax2.set_ylim(0, 1)
    ax2.set_title("All models: Probe vs Control")
    ax2.legend()

    # Panel 3: Hetvabhasa failure distribution from taxonomy_coverage.json
    if taxonomy and taxonomy.get("distribution"):
        dist = taxonomy["distribution"]
        labels3 = list(dist.keys())
        values3 = [dist[k] for k in labels3]
        ax3.bar(labels3, values3, alpha=0.9)
        ax3.set_title("Failure distribution by hetvabhasa")
        ax3.set_ylabel("Count")
        ax3.tick_params(axis="x", labelrotation=30)
    else:
        ax3.text(0.5, 0.5, "No taxonomy data", ha="center", va="center")
        ax3.set_title("Failure distribution by hetvabhasa")

    plt.tight_layout()
    plt.show()
except ImportError:
    print("matplotlib not available. Install with: pip install matplotlib")

## 8. Optional Live Evaluation (Advanced)

Use this only if you explicitly want to run fresh inference. This section loads a real backend (`ollama` or `transformers`) and runs the benchmark runner on a subset or full set.
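
When running on a subset, the cell below takes the first `live_subset_size` problems in file order. If `problems.json` happens to group problems by type or category (an assumption, not verified here), a small subset may contain only probes, which makes the probe/control gap uninformative. One possible alternative is to interleave probes and controls. The helper `balanced_subset_ids` and the toy ids are hypothetical; its output could be passed as `problem_ids`.

```python
from itertools import chain, zip_longest


def balanced_subset_ids(problems, n):
    """Return up to n problem ids, alternating probe and control. Hypothetical helper."""
    probes = [p["id"] for p in problems if p["type"] == "probe"]
    controls = [p["id"] for p in problems if p["type"] == "control"]
    # Interleave probe, control, probe, control, ... and drop zip_longest padding.
    interleaved = [
        pid
        for pid in chain.from_iterable(zip_longest(probes, controls))
        if pid is not None
    ]
    n = max(1, min(int(n), len(interleaved)))
    return interleaved[:n]


# Toy records (made up) with the keys the live cell relies on.
toy = [
    {"id": "P1", "type": "probe"},
    {"id": "P2", "type": "probe"},
    {"id": "P3", "type": "probe"},
    {"id": "C1", "type": "control"},
    {"id": "C2", "type": "control"},
]
subset = balanced_subset_ids(toy, 4)
print(subset)
```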

In [None]:
#@title 8a. Run Live Evaluation (Optional)
#@markdown This is optional and requires real model loading. Keep `run_live_eval` unchecked for viewer-only mode.

live_backend = "ollama" #@param ["ollama", "transformers"]
live_model_name = "llama3.2:3b-instruct-q4_K_M" #@param {type:"string"}
ollama_base_url = "http://localhost:11434" #@param {type:"string"}
hf_token = "" #@param {type:"string"}

#@markdown **Generation parameters:**
live_max_new_tokens = 1024 #@param {type:"integer"}
live_temperature = 0.5 #@param {type:"number"}
live_top_p = 0.75 #@param {type:"number"}
live_top_k = 5 #@param {type:"integer"}

#@markdown **Execution controls:**
run_live_eval = False #@param {type:"boolean"}
live_evaluate_all = False #@param {type:"boolean"}
live_subset_size = 5 #@param {type:"integer"}

if not run_live_eval:
    print("Live evaluation is disabled.")
    print("Set run_live_eval=True only when you want fresh inference.")
else:
    try:
        from pramana_backend import create_backend
        from pramana.benchmarks.vyapti_runner import VyaptiEvaluationRunner

        if live_backend == "ollama":
            backend_obj = create_backend(
                "ollama",
                model_name=live_model_name,
                base_url=ollama_base_url,
            )
        elif live_backend == "transformers":
            backend_obj = create_backend(
                "transformers",
                model_id=live_model_name,
                hf_token=hf_token or None,
            )
        else:
            raise ValueError(f"Unsupported backend: {live_backend}")

        def model_fn(prompt: str) -> str:
            return backend_obj.generate(
                prompt,
                max_new_tokens=live_max_new_tokens,
                temperature=live_temperature,
                top_p=live_top_p,
                top_k=live_top_k,
            )

        runner = VyaptiEvaluationRunner(
            {
                "benchmark_path": "data/vyapti_probe/problems.json",
                "solutions_path": "data/vyapti_probe/solutions.json",
            }
        )

        if live_evaluate_all:
            live_results = runner.evaluate_model(live_model_name, model_fn)
        else:
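            # Takes the first n problems in file order, so a small subset
            # may not contain both probes and controls.
            n = max(1, min(int(live_subset_size), len(problems)))
            subset_ids = [p["id"] for p in problems[:n]]
            live_results = runner.evaluate_model(live_model_name, model_fn, problem_ids=subset_ids)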

        total = len(live_results)
        correct = sum(1 for r in live_results if r.final_answer_correct)
        probes_live = [r for r in live_results if r.problem_type == "probe"]
        controls_live = [r for r in live_results if r.problem_type == "control"]
        probe_acc = (sum(1 for r in probes_live if r.final_answer_correct) / max(1, len(probes_live)))
        control_acc = (sum(1 for r in controls_live if r.final_answer_correct) / max(1, len(controls_live)))

        print(f"\n=== Live Results: {live_model_name} ({live_backend}) ===")
        print(f"Accuracy: {correct}/{total} ({correct / max(1, total):.1%})")
        print(f"Probe accuracy: {probe_acc:.1%}")
        print(f"Control accuracy: {control_acc:.1%}")
        print(f"Vyapti gap (control - probe): {(control_acc - probe_acc):+.1%}")
    except Exception as e:
        print(f"Live evaluation failed: {e}")
        print("\nTips:")
        print("- For ollama: ensure local server is running and model exists")
        print("- For transformers: provide a valid HF model id and token (if gated)")

---

## Summary

This notebook is **viewer-first**:

1. Primary workflow analyzes pre-computed real campaign artifacts from `data/vyapti_probe/results_real/`.
2. You can browse benchmark content and inspect model input/output behavior per problem.
3. You can run optional fresh inference with real backends only (`ollama` or `transformers`).

**Core data files:**
- `data/vyapti_probe/problems.json`
- `data/vyapti_probe/solutions.json`
- `data/vyapti_probe/results_real/summary.json`
- `data/vyapti_probe/results_real/comparisons.json`
- `data/vyapti_probe/results_real/taxonomy_coverage.json`

**Core scripts:**
- `scripts/run_vyapti_real.py` — Full real evaluation campaign
- `scripts/run_vyapti_analysis.py` — Statistical analysis + plots

Use the optional live section only when you intentionally want to load models and run inference.