# Universal VLM Runner Demo

This notebook shows how to run neuropsychological benchmark tasks on vision-language models (VLMs) from different providers through a single universal runner.

## Features
- Support for multiple providers (OpenAI, Anthropic, Google, Hugging Face)
- Unified interface across all providers
- Easy configuration management
- Batch comparisons between providers

In [1]:
%load_ext autoreload
%autoreload 2

import json
import os
from evaluator import Evaluator
from universal_runner import create_runner, ModelConfig, OpenAIModelRunner
from loaders import TaskLoader
import pandas as pd

## Configuration Setup

Load API keys and configure model parameters.

In [2]:
# Load API keys
with open("utils/api_keys.json", "r") as f:
    api_keys = json.load(f)

# Load task paths
with open("test_specs/test_list.json", 'r') as file:
    all_task_paths = []
    for stage in json.load(file):
        all_task_paths.extend(stage['task_paths'])

print(f"Found {len(all_task_paths)} tasks")
print("Sample tasks:")
for i, path in enumerate(all_task_paths[:3]):
    print(f"  {i}: {path}")

Found 31 tasks
Sample tasks:
  0: test_specs/low/borb_orientation_meta.json
  1: test_specs/low/borb_line_length_comparison_meta.json
  2: test_specs/low/borb_size_comparison_meta.json
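
Keeping keys in `utils/api_keys.json` is convenient for a local demo, but the file is easy to commit by accident. As a hedged alternative, the sketch below reads the JSON file if it exists and then lets environment variables override individual entries. The `load_api_keys` helper and the variable name `OPENAI_API_KEY` are assumptions for illustration, not part of the runner.

```python
import json
import os
from pathlib import Path


def load_api_keys(path="utils/api_keys.json", env_map=None):
    """Read keys from a JSON file if present, then apply environment overrides.

    `env_map` maps the key names used in this notebook (e.g. "open_ai") to
    environment variable names. The variable names are assumptions.
    """
    env_map = env_map or {"open_ai": "OPENAI_API_KEY"}
    keys = {}
    key_file = Path(path)
    if key_file.exists():
        with key_file.open("r") as f:
            keys.update(json.load(f))
    for name, env_var in env_map.items():
        value = os.environ.get(env_var)
        if value:  # override only when the variable is set and non-empty
            keys[name] = value
    return keys
```

With this in place, `api_keys = load_api_keys()` could replace the `json.load` call above without changing how later cells index `api_keys["open_ai"]`.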


## OpenAI Demo

Run a single task on OpenAI's GPT-4o through the universal runner.

In [3]:
# Configure OpenAI model
openai_config = ModelConfig(
    model_name="gpt-4o",
    api_key=api_keys["open_ai"],
    max_tokens=100,
    temperature=0.0
)

# Create runner using factory function
runner = create_runner("openai", openai_config)

# Run a single task
task_index = 12  # Same as original test.ipynb
print(f"Running task: {all_task_paths[task_index]}")

loader = TaskLoader(all_task_paths[task_index])
task_info, results = runner.generate_response(loader)

print("\nTask completed!")
print(f"Task: {task_info['task']}")
print(f"Stage: {task_info['stage']}")
print(f"Process: {task_info['process']}")
print(f"Number of trials: {len(results)}")

Running task: test_specs/mid/mindset_decomposition_meta.json


Getting model responses: 100%|██████████| 30/30 [00:42<00:00,  1.41s/it]


Task completed!
Task: mindset_decomposition
Stage: mid
Process: property_biases
Number of trials: 30





In [4]:
# Evaluate results
evaluator = Evaluator()
evaluator.evaluate((task_info, results))

# Display results
results_df = evaluator.get_result()
print("Evaluation Results:")
results_df

Evaluation Results:


| | task | task_type | stage | process | num_trials | raw_score | percent_score |
|---|---|---|---|---|---|---|---|
| 0 | mindset_decomposition | match_to_sample | mid | property_biases | 30 | 27 | 0.9 |


## Inspect Individual Responses

Let's look at some individual model responses to understand the task better.

In [5]:
# Show first few responses
print("Sample Model Responses:")
print("=" * 50)

for i, trial in enumerate(results[:3]):
    print(f"\nTrial {trial['trial_id']}:")
    print(f"Prompt: {trial['prompt'][:100]}...")
    print(f"Model Response: {trial['model_response']}")
    print(f"Correct Answer: {trial['answer_key']}")
    print("-" * 30)

Sample Model Responses:

Trial trial_000:
Prompt: Here's a task. I will present you with three pictures. The first one is the target/reference image. ...
Model Response: I'm unable to choose which option resembles the target image.
Correct Answer: 1
------------------------------

Trial trial_001:
Prompt: Here's a task. I will present you with three pictures. The first one is the target/reference image. ...
Model Response: The second option most closely resembles the target image.

{2}
Correct Answer: 2
------------------------------

Trial trial_002:
Prompt: Here's a task. I will present you with three pictures. The first one is the target/reference image. ...
Model Response: The second option most closely resembles the target image.

{2}
Correct Answer: 1
------------------------------
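
The responses above suggest that the model is expected to wrap its final choice in curly braces (`{2}`), and that a response without one, such as trial_000, cannot be counted as correct. The sketch below shows one way such responses could be parsed and scored. It is an assumption about the scoring logic, not the actual `Evaluator` implementation; `extract_choice` and `score_trials` are hypothetical helpers.

```python
import re


def extract_choice(response):
    """Return the last brace-wrapped integer in the response as a string, or None."""
    matches = re.findall(r"\{\s*(\d+)\s*\}", response)
    return matches[-1] if matches else None


def score_trials(trials):
    """Return (number correct, fraction correct) for a list of trial dicts."""
    correct = sum(
        1
        for trial in trials
        if extract_choice(trial["model_response"]) == str(trial["answer_key"])
    )
    fraction = correct / len(trials) if trials else 0.0
    return correct, fraction


# The three trials printed above, reduced to the fields the sketch needs.
sample_trials = [
    {"model_response": "I'm unable to choose which option resembles the target image.", "answer_key": 1},
    {"model_response": "The second option most closely resembles the target image.\n\n{2}", "answer_key": 2},
    {"model_response": "The second option most closely resembles the target image.\n\n{2}", "answer_key": 1},
]
raw_score, percent_score = score_trials(sample_trials)
print(f"{raw_score}/{len(sample_trials)} correct")
```

Under this rule only trial_001 counts as correct: trial_000 has no parsable answer and trial_002 picks the wrong option.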


## Batch Processing Demo

Run multiple tasks with the same configuration.

In [6]:
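# Run multiple tasks (first 3 for demo)
batch_evaluator = Evaluator()
batch_results = []
batch_paths = all_task_paths[:3]

for i, task_path in enumerate(batch_paths, start=1):
    # Derive the total from the slice so the progress message stays correct if the slice changes
    print(f"\nProcessing task {i}/{len(batch_paths)}: {task_path}")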
    
    try:
        loader = TaskLoader(task_path)
        runner = create_runner("openai", openai_config)
        task_info, task_results = runner.generate_response(loader)
        
        # Evaluate
        batch_evaluator.evaluate((task_info, task_results))
        batch_results.append({
            'task_path': task_path,
            'task_info': task_info,
            'status': 'success'
        })
        print(f"✓ Completed: {task_info['task']}")
        
    except Exception as e:
        print(f"✗ Failed: {e}")
        batch_results.append({
            'task_path': task_path,
            'status': 'failed',
            'error': str(e)
        })

print("\nBatch processing completed!")


Processing task 1/3: test_specs/low/borb_orientation_meta.json


Getting model responses: 100%|██████████| 30/30 [00:25<00:00,  1.16it/s]


✓ Completed: borb_orientation

Processing task 2/3: test_specs/low/borb_line_length_comparison_meta.json


Getting model responses: 100%|██████████| 30/30 [00:26<00:00,  1.13it/s]


✓ Completed: borb_line_length_comparison

Processing task 3/3: test_specs/low/borb_size_comparison_meta.json


Getting model responses: 100%|██████████| 30/30 [00:29<00:00,  1.02it/s]

✓ Completed: borb_size_comparison

Batch processing completed!
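
Each entry appended to `batch_results` carries a `status`, so failures can be collected after the loop rather than read off the printed log. The following is a small sketch using only the standard library; `summarize_batch` is a hypothetical helper and the sample entries are made up to match the shape used above.

```python
from collections import Counter


def summarize_batch(batch_results):
    """Return status counts and a list of (task_path, error) pairs for failed tasks."""
    counts = Counter(entry["status"] for entry in batch_results)
    failures = [
        (entry["task_path"], entry.get("error", ""))
        for entry in batch_results
        if entry["status"] == "failed"
    ]
    return dict(counts), failures


# Illustrative entries only; real entries come from the batch loop.
example_results = [
    {"task_path": "task_a.json", "task_info": {"task": "a"}, "status": "success"},
    {"task_path": "task_b.json", "status": "failed", "error": "timeout"},
]
counts, failures = summarize_batch(example_results)
print(counts)
for path, error in failures:
    print(f"{path}: {error}")
```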





In [None]:
# Show batch results
batch_df = batch_evaluator.get_result()
print("Batch Evaluation Results:")
batch_df

## Configuration Comparison

Compare different model configurations on the same task.

In [None]:
# Different configurations to compare
configs = {
    "gpt-4o_temp0": ModelConfig(
        model_name="gpt-4o",
        api_key=api_keys["open_ai"],
        max_tokens=100,
        temperature=0.0
    ),
    "gpt-4o_temp0.3": ModelConfig(
        model_name="gpt-4o",
        api_key=api_keys["open_ai"],
        max_tokens=100,
        temperature=0.3
    ),
    "gpt-4o-mini": ModelConfig(
        model_name="gpt-4o-mini",
        api_key=api_keys["open_ai"],
        max_tokens=100,
        temperature=0.0
    )
}

# Run comparison on a single task
comparison_task = all_task_paths[12]
comparison_results = {}

print(f"Comparing configurations on: {comparison_task}")
print("=" * 60)

for config_name, config in configs.items():
    print(f"\nRunning {config_name}...")
    
    try:
        loader = TaskLoader(comparison_task)
        runner = create_runner("openai", config)
        task_info, task_results = runner.generate_response(loader)
        
        # Evaluate
        temp_evaluator = Evaluator()
        temp_evaluator.evaluate((task_info, task_results))
        result_df = temp_evaluator.get_result()
        
        comparison_results[config_name] = {
            'score': result_df['percent_score'].iloc[0],
            'raw_score': result_df['raw_score'].iloc[0],
            'num_trials': result_df['num_trials'].iloc[0],
            'sample_responses': [r['model_response'] for r in task_results[:2]]
        }
        
        print(f"✓ Score: {result_df['percent_score'].iloc[0]:.2%}")
        
    except Exception as e:
        print(f"✗ Failed: {e}")
        comparison_results[config_name] = {'error': str(e)}

In [None]:
# Summary of comparison results
print("Configuration Comparison Summary:")
print("=" * 40)

comparison_df = pd.DataFrame([
    {
        'Configuration': config_name,
        'Score': outcome.get('score', 'Error'),
        'Raw Score': f"{outcome.get('raw_score', 'N/A')}/{outcome.get('num_trials', 'N/A')}",
        'Status': 'Success' if 'error' not in outcome else 'Failed'
    }
    # `outcome` avoids reusing the name `results` from the single-task cell above
    for config_name, outcome in comparison_results.items()
])

comparison_df
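
Because failed runs are stored as `{'error': ...}` without a score, any ranking has to skip them first. The sketch below orders the successful configurations by score; `rank_configs` is a hypothetical helper and the numbers in the example are invented for illustration, not measured results.

```python
def rank_configs(comparison_results):
    """Return (config_name, score) pairs for successful runs, best score first."""
    scored = [
        (name, outcome["score"])
        for name, outcome in comparison_results.items()
        if "error" not in outcome
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented values in the same shape as `comparison_results`.
example = {
    "config_a": {"score": 0.80, "raw_score": 24, "num_trials": 30},
    "config_b": {"error": "rate limit"},
    "config_c": {"score": 0.90, "raw_score": 27, "num_trials": 30},
}
for name, score in rank_configs(example):
    print(f"{name}: {score:.2%}")
```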

## Advanced Features

Pass extra sampling parameters through `additional_params`, and instantiate `OpenAIModelRunner` directly instead of going through `create_runner`.

In [None]:
# Using additional parameters
advanced_config = ModelConfig(
    model_name="gpt-4o",
    api_key=api_keys["open_ai"],
    max_tokens=150,
    temperature=0.1,
    additional_params={
        "top_p": 0.9,
        "frequency_penalty": 0.1,
        "presence_penalty": 0.1
    }
)

print("Running with advanced parameters:")
print(f"- Temperature: {advanced_config.temperature}")
print(f"- Max tokens: {advanced_config.max_tokens}")
print(f"- Additional params: {advanced_config.additional_params}")

# Run with advanced config
loader = TaskLoader(all_task_paths[12])
advanced_runner = OpenAIModelRunner(advanced_config)
task_info, advanced_results = advanced_runner.generate_response(loader)

print(f"\nCompleted with {len(advanced_results)} trials")
print("Sample response:")
print(f"\"{advanced_results[0]['model_response']}\"")
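
How `universal_runner` forwards `additional_params` to the provider is not shown in this notebook. One plausible approach is to merge them into the request keyword arguments, with the explicit fields taking precedence. The sketch below illustrates that idea with a stand-in `SketchConfig` dataclass and a `build_request_kwargs` helper, both hypothetical and not the real `ModelConfig` or runner code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchConfig:
    """Stand-in for a model configuration (hypothetical, for illustration only)."""
    model_name: str
    max_tokens: int = 100
    temperature: float = 0.0
    additional_params: Dict[str, Any] = field(default_factory=dict)


def build_request_kwargs(config):
    """Merge additional_params with the explicit fields; explicit fields win on conflict."""
    kwargs = dict(config.additional_params)  # copy so the config is not mutated
    kwargs.update(
        model=config.model_name,
        max_tokens=config.max_tokens,
        temperature=config.temperature,
    )
    return kwargs


sketch = SketchConfig(
    model_name="gpt-4o",
    max_tokens=150,
    temperature=0.1,
    additional_params={"top_p": 0.9, "temperature": 0.7},
)
print(build_request_kwargs(sketch))
```

Letting the explicit fields win means a stray `temperature` inside `additional_params` cannot silently change the sampling settings reported by the cell above.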

## Save Results

Save evaluation results to CSV for further analysis.

In [None]:
# Save batch results
if 'batch_evaluator' in locals():
    batch_evaluator.save_as_csv("universal_runner_batch_results.csv")
    print("✓ Batch results saved to 'universal_runner_batch_results.csv'")

# Save comparison results
if 'comparison_df' in locals():
    comparison_df.to_csv("config_comparison_results.csv", index=False)
    print("✓ Comparison results saved to 'config_comparison_results.csv'")

print("\nDemo completed successfully! 🎉")

## Next Steps

To extend this demo:

1. **Try other providers**: Uncomment the example below and add `anthropic` or `google` keys to `utils/api_keys.json`
2. **Multi-provider comparison**: Run the same task across different providers
3. **Custom evaluation metrics**: Extend the evaluator for specific analysis
4. **Batch processing**: Process all tasks in the test suite
5. **Error handling**: Add robust error handling for production use

### Example: Multi-provider comparison (requires additional API keys)

```python
# providers = {
#     "openai": ModelConfig(model_name="gpt-4o", api_key=api_keys["open_ai"]),
#     "anthropic": ModelConfig(model_name="claude-3-5-sonnet-20241022", api_key=api_keys["anthropic"]),
#     "google": ModelConfig(model_name="gemini-1.5-pro", api_key=api_keys["google"])
# }
# 
# for provider_name, config in providers.items():
#     runner = create_runner(provider_name, config)
#     # ... run comparison
```
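
### Example: Retrying transient failures

For the error-handling item in the list above, one common pattern is to retry a failed call with exponential backoff before giving up. The sketch below is generic and makes no assumption about which exceptions the runners raise, so it retries on any `Exception`; `with_retries` is a hypothetical helper, not part of `universal_runner`.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again.

    The last exception is re-raised once max_attempts calls have failed.
    `sleep` is injectable so the function can be exercised without real delays.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Possible use with the names from this notebook (not executed here):
# task_info, results = with_retries(lambda: runner.generate_response(loader))
```

In practice it is usually better to narrow the `except` clause to the provider's rate-limit and timeout errors, so that configuration mistakes fail immediately instead of being retried.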