# Vyapti Probe Benchmark — Interactive Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SharathSPhD/pramana/blob/main/notebooks/03_vyapti_evaluation.ipynb)

This notebook provides an interactive interface for running the **Vyapti Probe Benchmark** — a 100-problem evaluation suite testing whether LLMs can distinguish statistical regularity from *vyapti* (invariable concomitance).
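
The distinction can be made concrete with a toy example. The sketch below is illustrative only and is not part of the benchmark code: the observation pairs and both helper names are hypothetical. It shows that a rule can have a high support rate and still fail as a vyapti, because a single counterexample is enough to falsify a universal claim.

```python
# Hypothetical observations: (has_smoke, has_fire) pairs.
# 99 confirming cases plus one smoky case with no fire.
observations = [(True, True)] * 99 + [(True, False)]

def support_rate(obs):
    """Fraction of cases with the reason (hetu) where the inferred property (sadhya) also holds."""
    relevant = [fire for smoke, fire in obs if smoke]
    return sum(relevant) / len(relevant)

def holds_universally(obs):
    """True only if no case shows the reason without the inferred property."""
    return all(fire for smoke, fire in obs if smoke)

print(f'support rate: {support_rate(observations):.2f}')
print(f'vyapti holds: {holds_universally(observations)}')
```

A probe problem in this spirit rewards the second check rather than the first.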

**What's inside:**
- Browse all 100 problems (50 probes + 50 controls) across 5 Hetvabhasa categories
- Run evaluation with local models (Ollama/llama.cpp) or Hugging Face models
- View 5-tier scoring results (outcome, structure, vyapti, Z3, hetvabhasa)
- Statistical analysis with bootstrap confidence intervals
- Visualizations: probe-vs-control heatmaps, failure distributions, CI plots

In [None]:
#@title 1. Environment Setup
import os, sys, subprocess, importlib

# Detect environment
IN_COLAB = 'google.colab' in sys.modules or os.path.exists('/content')

if IN_COLAB:
    REPO_URL = 'https://github.com/SharathSPhD/pramana.git'
    REPO_DIR = '/content/pramana'
    if not os.path.exists(REPO_DIR):
        print('Cloning repository...')
        subprocess.run(['git', 'clone', '--depth', '1', REPO_URL, REPO_DIR],
                       check=True, capture_output=True)
        print('Done.')
    os.chdir(REPO_DIR)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', '-r', 'notebooks/requirements.txt'],
                   check=True, capture_output=True)
    sys.path.insert(0, os.path.join(REPO_DIR, 'src'))
else:
    # Local: assume running from repo root or notebooks/
    root = os.path.abspath(os.path.join(os.getcwd(), '..'))
    if os.path.exists(os.path.join(root, 'data', 'vyapti_probe')):
        os.chdir(root)
    sys.path.insert(0, os.path.join(os.getcwd(), 'src'))

# GPU detection
try:
    import torch
    GPU_AVAILABLE = torch.cuda.is_available()
    if GPU_AVAILABLE:
        gpu_name = torch.cuda.get_device_name(0)
        print(f'GPU detected: {gpu_name}')
    else:
        print('No GPU detected. CPU mode.')
except ImportError:
    GPU_AVAILABLE = False
    print('PyTorch not available. Using CPU-only backends.')

print('Setup complete.')

In [None]:
#@title 2. Load Benchmark Data
import json
from pathlib import Path

DATA_DIR = Path('data/vyapti_probe')

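# Read as UTF-8 explicitly so the files load the same way on every platform
with open(DATA_DIR / 'problems.json', encoding='utf-8') as f:
    problems = json.load(f)
with open(DATA_DIR / 'solutions.json', encoding='utf-8') as f:
    solutions_list = json.load(f)
# Index solutions by problem id for direct lookup
solutions = {s['id']: s for s in solutions_list}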

probes = [p for p in problems if p['type'] == 'probe']
controls = [p for p in problems if p['type'] == 'control']

print(f'Loaded {len(problems)} problems ({len(probes)} probes + {len(controls)} controls)')
print(f'Loaded {len(solutions)} solutions')
print()

# Summary by category
from collections import Counter
cat_counts = Counter(p['category'] for p in probes)
print('Probes by category:')
for cat, count in sorted(cat_counts.items()):
    print(f'  {cat}: {count}')

## 3. Browse Problems

Explore the benchmark problems by category and type.
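
Before picking an index in the browser, it can help to know how many problems each (category, type) pair contains. The following is a minimal sketch on toy records; the ids are made up, and only the `category` and `type` field names are taken from the loading cell above.

```python
from collections import Counter

# Toy records with the same 'category' and 'type' fields used by the notebook.
toy_problems = [
    {'id': 'p1', 'category': 'savyabhichara', 'type': 'probe'},
    {'id': 'p2', 'category': 'savyabhichara', 'type': 'control'},
    {'id': 'p3', 'category': 'viruddha', 'type': 'probe'},
    {'id': 'p4', 'category': 'viruddha', 'type': 'probe'},
]

def group_sizes(records):
    """Count records per (category, type) pair."""
    return Counter((r['category'], r['type']) for r in records)

sizes = group_sizes(toy_problems)
for (cat, kind), n in sorted(sizes.items()):
    print(f'{cat:<15} {kind:<8} {n}  (valid indices: 0..{n - 1})')
```

Calling `group_sizes(problems)` on the loaded benchmark would give the real index range for each slider setting.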

In [None]:
#@title 3. Problem Browser
from IPython.display import display, Markdown, HTML

#@markdown **Select category and problem:**
category = 'savyabhichara' #@param ['savyabhichara', 'viruddha', 'prakaranasama', 'sadhyasama', 'kalatita']
problem_type = 'probe' #@param ['probe', 'control']
problem_index = 0 #@param {type: "slider", min: 0, max: 14}

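filtered = [p for p in problems if p['category'] == category and p['type'] == problem_type]
if not filtered:
    raise ValueError(f'No {problem_type} problems found in category {category!r}')
# Clamp the slider value to the number of problems actually available
problem_index = min(problem_index, len(filtered) - 1)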

p = filtered[problem_index]
sol = solutions.get(p['id'], {})

md = f"""### {p['id']} | {p['logic_type']} | Difficulty: {p['difficulty']}

**Problem:**

{p['problem_text']}

---

**Correct Answer:** {p['correct_answer']}

"""

if p['type'] == 'probe':
    md += f"""**Trap Answer:** {p['trap_answer']}

**Vyapti Under Test:** {p['vyapti_under_test']}

**Why It Fails:** {p['why_it_fails']}
"""

if sol:
    md += f"""\n---\n\n**Solution Details:**
- Vyapti Status: {sol.get('vyapti_status', 'N/A')}
- Counterexample: {sol.get('counterexample', 'N/A')}
- Hetvabhasa: {sol.get('hetvabhasa_type', 'N/A')}
- Z3 Note: {sol.get('z3_verification', 'N/A')}
"""

display(Markdown(md))

## 4. Run Evaluation

Evaluate a model on the full benchmark or a subset.
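
When evaluating a subset, the probe-vs-control gap is only defined if the subset contains both problem types. The sketch below is one hypothetical way to build such a subset by alternating probes and controls; `balanced_subset_ids` is not part of the `pramana` package, and the toy ids are invented for illustration.

```python
def balanced_subset_ids(records, size):
    """Pick up to `size` ids, alternating probes and controls in file order."""
    probes = [r['id'] for r in records if r['type'] == 'probe']
    controls = [r['id'] for r in records if r['type'] == 'control']
    picked = []
    for probe_id, control_id in zip(probes, controls):
        picked.extend([probe_id, control_id])
    return picked[:size]

# Toy records where every probe precedes every control.
toy = ([{'id': f'probe-{i}', 'type': 'probe'} for i in range(3)]
       + [{'id': f'control-{i}', 'type': 'control'} for i in range(3)])

subset = balanced_subset_ids(toy, 4)
print(subset)
```

A list built this way could be passed as `problem_ids` in the evaluation cell in place of a plain slice of `problems`.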

In [None]:
#@title 4a. Configure Model Backend

backend = 'simulation' #@param ['simulation', 'ollama', 'transformers']
model_name = 'simulated_base' #@param {type: 'string'}

#@markdown **Generation parameters:**
max_new_tokens = 2048 #@param {type: 'integer'}
temperature = 0.5 #@param {type: 'number'}
top_p = 0.75 #@param {type: 'number'}
top_k = 5 #@param {type: 'integer'}

def get_model_fn(backend, model_name):
    if backend == 'simulation':
        import random
        rng = random.Random(42)
        def sim_fn(prompt):
            if rng.random() < 0.4:
                return 'No, the conclusion cannot be drawn. There is a counterexample that falsifies the universal.'
            return 'Yes, based on the pattern observed in the premises, we can conclude the answer is affirmative.'
        return sim_fn
    elif backend == 'ollama':
        import requests
        def ollama_fn(prompt):
            resp = requests.post('http://localhost:11434/api/generate', json={
                'model': model_name,
                'prompt': prompt,
                'options': {'num_predict': max_new_tokens, 'temperature': temperature, 'top_p': top_p, 'top_k': top_k},
                'stream': False,
            }, timeout=300)
            resp.raise_for_status()  # surface HTTP errors instead of returning an empty answer
            return resp.json().get('response', '')
        return ollama_fn
    elif backend == 'transformers':
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')
        def hf_fn(prompt):
            inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
            with torch.no_grad():
                outputs = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                         temperature=temperature, top_p=top_p, top_k=top_k, do_sample=True)
            return tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
        return hf_fn
    else:
        raise ValueError(f'Unknown backend: {backend}')

model_fn = get_model_fn(backend, model_name)
print(f'Backend: {backend}, Model: {model_name}')

In [None]:
#@title 4b. Run Evaluation
from pramana.benchmarks.vyapti_runner import VyaptiEvaluationRunner

config = {
    'benchmark_path': 'data/vyapti_probe/problems.json',
    'solutions_path': 'data/vyapti_probe/solutions.json',
}

runner = VyaptiEvaluationRunner(config)

#@markdown **Evaluate subset or all?**
evaluate_all = True #@param {type: 'boolean'}
subset_size = 10 #@param {type: 'integer'}

if evaluate_all:
    results = runner.evaluate_model(model_name, model_fn)
else:
    subset_ids = [p['id'] for p in problems[:subset_size]]
    results = runner.evaluate_model(model_name, model_fn, problem_ids=subset_ids)

# Summary
def pct(num, den):
    """Format num/den as a percentage, or 'n/a' when the denominator is zero."""
    return f'{num/den:.1%}' if den else 'n/a'

correct = sum(1 for r in results if r.final_answer_correct)
total = len(results)
print(f'\n=== Results: {model_name} ===')
print(f'Accuracy: {correct}/{total} ({pct(correct, total)})')

probe_results = [r for r in results if r.problem_type == 'probe']
control_results = [r for r in results if r.problem_type == 'control']
probe_correct = sum(1 for r in probe_results if r.final_answer_correct)
control_correct = sum(1 for r in control_results if r.final_answer_correct)
print(f'Probe accuracy: {probe_correct}/{len(probe_results)} ({pct(probe_correct, len(probe_results))})')
print(f'Control accuracy: {control_correct}/{len(control_results)} ({pct(control_correct, len(control_results))})')
# The gap is only defined when the evaluated set contains both probes and controls
if probe_results and control_results:
    vyapti_gap = control_correct/len(control_results) - probe_correct/len(probe_results)
    print(f'Vyapti gap: {vyapti_gap:.1%}')
else:
    print('Vyapti gap: n/a (evaluated set lacks probes or controls)')

## 5. Analysis and Visualization
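
The overview mentions bootstrap confidence intervals. As a rough illustration of the idea, the sketch below computes a percentile bootstrap interval for the control-minus-probe accuracy gap from 0/1 correctness flags. It is not the analysis implemented in `scripts/run_vyapti_analysis.py`: the function name, the toy data, and the choice to resample probes and controls independently are all assumptions made for this example.

```python
import random

def bootstrap_gap_ci(probe_hits, control_hits, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(control_hits) - mean(probe_hits)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size
        p = rng.choices(probe_hits, k=len(probe_hits))
        c = rng.choices(control_hits, k=len(control_hits))
        gaps.append(sum(c) / len(c) - sum(p) / len(p))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy correctness flags (1 = correct answer, 0 = incorrect)
probe_hits = [1] * 20 + [0] * 30     # 40% correct
control_hits = [1] * 40 + [0] * 10   # 80% correct

observed = sum(control_hits) / len(control_hits) - sum(probe_hits) / len(probe_hits)
lower, upper = bootstrap_gap_ci(probe_hits, control_hits)
print(f'observed gap: {observed:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]')
```

With real results, the flags could be taken from `r.final_answer_correct` for each problem type. An interval that excludes zero suggests the gap is unlikely to be a resampling artifact.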

In [None]:
#@title 5a. Category-wise Breakdown

CATEGORIES = ['savyabhichara', 'viruddha', 'prakaranasama', 'sadhyasama', 'kalatita']

print(f'\n{"Category":<20} {"Probe":>8} {"Control":>8} {"Gap":>8}')
print('-' * 48)

for cat in CATEGORIES:
    cat_probes = [r for r in results if r.category == cat and r.problem_type == 'probe']
    cat_controls = [r for r in results if r.category == cat and r.problem_type == 'control']
    pc = sum(1 for r in cat_probes if r.final_answer_correct)
    cc = sum(1 for r in cat_controls if r.final_answer_correct)
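    pt, ct = len(cat_probes), len(cat_controls)
    # Guard against empty groups (e.g. when only a subset was evaluated)
    gap = (cc / ct if ct else 0.0) - (pc / pt if pt else 0.0)
    print(f'{cat:<20} {f"{pc}/{pt}":>8} {f"{cc}/{ct}":>8} {gap:>+8.0%}')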

In [None]:
#@title 5b. Hetvabhasa Failure Distribution
from collections import Counter

failures = [r for r in results if not r.final_answer_correct]
h_dist = Counter(r.hetvabhasa_classification for r in failures)

print(f'\nTotal failures: {len(failures)}')
print(f'\n{"Hetvabhasa Type":<20} {"Count":>6} {"Pct":>6}')
print('-' * 34)
for htype, count in h_dist.most_common():
    # str() keeps the row printable if a failure has no classification (None)
    print(f'{str(htype):<20} {count:>6} {count/len(failures):>6.1%}')

In [None]:
#@title 5c. Visualization
try:
    import matplotlib.pyplot as plt
    import numpy as np

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # Probe vs Control by category
    # Three-letter tick labels derived from CATEGORIES so the two stay in sync
    cats = [c[:3].upper() for c in CATEGORIES]
    probe_accs = []
    control_accs = []
    for cat in CATEGORIES:
        cp = [r for r in results if r.category == cat and r.problem_type == 'probe']
        cc = [r for r in results if r.category == cat and r.problem_type == 'control']
        probe_accs.append(sum(1 for r in cp if r.final_answer_correct) / max(len(cp), 1))
        control_accs.append(sum(1 for r in cc if r.final_answer_correct) / max(len(cc), 1))

    x = np.arange(len(cats))
    w = 0.35
    ax1.bar(x - w/2, probe_accs, w, label='Probes', color='#e74c3c', alpha=0.8)
    ax1.bar(x + w/2, control_accs, w, label='Controls', color='#2ecc71', alpha=0.8)
    ax1.set_xticks(x)
    ax1.set_xticklabels(cats)
    ax1.set_ylabel('Accuracy')
    ax1.set_title(f'Probe vs Control Accuracy ({model_name})')
    ax1.legend()
    ax1.set_ylim(0, 1)

    # Hetvabhasa distribution
    if h_dist:
        labels = list(h_dist.keys())
        values = list(h_dist.values())
        colors = plt.cm.Set2(range(len(labels)))
        ax2.pie(values, labels=labels, autopct='%1.0f%%', colors=colors)
        ax2.set_title('Failure Type Distribution')
    else:
        ax2.text(0.5, 0.5, 'No failures!', ha='center', va='center', fontsize=14)
        ax2.set_title('Failure Type Distribution')

    plt.tight_layout()
    plt.show()
except ImportError:
    print('matplotlib not available. Install with: pip install matplotlib')

## 6. Load Pre-computed Results

If you have previously run the full evaluation campaign, load and analyze those results.
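
The loader cell below reads three fields per model from `summary.json`: `accuracy`, `probe_accuracy`, and `control_accuracy`. The sketch here builds a dictionary of that shape from toy correctness flags. The shape is inferred from the loader alone, so the file written by the evaluation script may contain additional fields; `summarize` and `toy_model` are hypothetical names.

```python
import json

def summarize(flags_by_type):
    """Build one summary entry from 0/1 correctness flags per problem type."""
    probe = flags_by_type['probe']
    control = flags_by_type['control']
    combined = probe + control
    return {
        'accuracy': sum(combined) / len(combined),
        'probe_accuracy': sum(probe) / len(probe),
        'control_accuracy': sum(control) / len(control),
    }

# One entry per model name, mirroring how the loader iterates summary.items()
toy_summary = {'toy_model': summarize({'probe': [1, 0, 0, 0], 'control': [1, 1, 1, 0]})}
print(json.dumps(toy_summary, indent=2))
```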

In [None]:
#@title 6. Load and Compare Pre-computed Results
results_dir = Path('data/vyapti_probe/results')

if (results_dir / 'summary.json').exists():
    with open(results_dir / 'summary.json') as f:
        summary = json.load(f)
    
    print('Pre-computed results found!')
    print(f'\n{"Model":<28} {"Accuracy":>10} {"Probe":>8} {"Control":>8} {"Gap":>8}')
    print('-' * 66)
    for model, data in summary.items():
        gap = data['control_accuracy'] - data['probe_accuracy']
        print(f'{model:<28} {data["accuracy"]:>9.1%} {data["probe_accuracy"]:>7.1%} {data["control_accuracy"]:>7.1%} {gap:>+7.1%}')
    
    # Load comparisons if available
    if (results_dir / 'comparisons.json').exists():
        with open(results_dir / 'comparisons.json') as f:
            comparisons = json.load(f)
        print('\n--- Statistical Comparisons ---')
        for c in comparisons:
            sig = 'SIGNIFICANT' if c['significant'] else 'not significant'
            print(f"\n{c['name']}")
            print(f"  Difference: {c['difference']:+.3f} (95% CI: [{c['ci_lower']:+.3f}, {c['ci_upper']:+.3f}])")
            print(f"  {sig} (p = {c['p_value_approx']:.4f})")
else:
    print('No pre-computed results found.')
    print('Run the evaluation first (Section 4) or execute:')
    print('  python scripts/run_vyapti_evaluation.py --simulate')

---

## Summary

This notebook provides the interactive interface for the **Vyapti Probe Benchmark**.

**Key scripts:**
- `scripts/run_vyapti_evaluation.py` — Full evaluation campaign (real or simulated)
- `scripts/run_vyapti_analysis.py` — Statistical analysis and visualization
- `scripts/publish_vyapti_dataset.py` — Publish to Hugging Face Hub

**Key files:**
- `data/vyapti_probe/problems.json` — 100 benchmark problems
- `data/vyapti_probe/solutions.json` — Ground truth solutions
- `data/vyapti_probe/z3_encodings/` — Formal Z3 verification modules
- `docs/paper_vyapti/what_nyaya_reveals.md` — Draft diagnosis paper