# Code demo with OpenAI API

This Jupyter notebook demonstrates the benchmark workflow, using the OpenAI API as an example. It also breaks down the core components and functions used in this repository, including loading task data, running models, and evaluating responses.

In [1]:
# Import dependencies

import json
import os
import pandas as pd

from loaders import TaskLoader # for loading tasks from the test_specs directory
from runner import ModelConfig, OpenAIModelRunner # for setting up OpenAI config and runner
from evaluator import Evaluator # for automatically evaluating responses

## Load API Keys

To run the models, we first need to load the API keys. These should be stored in a separate file (e.g. `utils/api_keys.json`, the path read by the cell below). The file should be structured as follows:

```json
{
    "openai": "your_openai_api_key",
    "anthropic": "your_anthropic_api_key",
    "google": "your_google_api_key"
}
```
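
Before calling a provider, it can help to check that the key file actually contains the entries a run needs. The following is a minimal sketch of such a check; `load_api_keys` and its `required` parameter are hypothetical names introduced for illustration and are not part of this repository.

```python
import json
from pathlib import Path


def load_api_keys(path, required=()):
    """Load a JSON key file and check that the required entries are present.

    Hypothetical helper for illustration; not part of this repository.
    """
    keys = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(keys, dict):
        raise ValueError(f"{path} must contain a JSON object")
    # Treat empty strings the same as absent entries
    missing = [name for name in required if not keys.get(name)]
    if missing:
        raise KeyError(f"Missing API keys in {path}: {', '.join(missing)}")
    return keys
```

Pass the provider names you plan to use as `required`, so that a missing key fails early rather than halfway through a run.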

In [4]:
# Get API keys
with open("utils/api_keys.json", "r") as f:
    api_keys = json.load(f)


## Inspect Available Tests

The code below shows a sample of the tests available in the test_specs directory.
In the paper (https://arxiv.org/abs/2504.10786v1), neuropsychological tests are categorized by the visual processing stage they are designed to tap into: low, mid, or high. The tests in test_specs are organized by these stages.

Each task metadata file (e.g., `borb_line_length_comparison_meta.json`) contains information about the task, such as task groupings, task name, task type, number of stimuli (used to determine the prompt format), the prompt, and trial-by-trial information (e.g., stimuli paths and answer keys).
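
A quick way to get a feel for a metadata file is to summarize it rather than print it whole. The sketch below assumes the field names that appear in the metadata printed in this notebook (`task`, `task_type`, `trials`, `answer_key`); `summarize_task` is a hypothetical helper, and the example dictionary is abbreviated (the prompt is shortened and image paths are omitted).

```python
from collections import Counter


def summarize_task(task_payload):
    """Return a compact summary of a task metadata dictionary."""
    trials = task_payload["trials"]
    return {
        "task": task_payload["task"],
        "task_type": task_payload["task_type"],
        "num_trials": len(trials),
        # How often each answer appears among the answer keys
        "answer_counts": dict(Counter(t["answer_key"] for t in trials)),
    }


# Abbreviated example: only the first three trials, without image paths
example = {
    "stage": "low",
    "process": "simple_element_judgements",
    "task": "borb_line_length_comparison",
    "task_type": "same_different",
    "num_stim": "two",
    "prompt": "...",
    "trials": [
        {"trial": "trial_001", "answer_key": "different"},
        {"trial": "trial_002", "answer_key": "same"},
        {"trial": "trial_003", "answer_key": "same"},
    ],
}
print(summarize_task(example))
```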

In [5]:
# Load task paths
with open("test_specs/test_list.json", 'r') as file:
    all_task_paths = []
    for stage in json.load(file):
        all_task_paths.extend(stage['task_paths'])

# Print the number of tasks available
print(f"Found {len(all_task_paths)} tasks")

# Print sample tasks
print("Sample tasks:")
for i, path in enumerate(all_task_paths[:3]): # print first 3 tasks
    print(f"  {i}: {path}")

Found 31 tasks
Sample tasks:
  0: test_specs/low/borb_orientation_meta.json
  1: test_specs/low/borb_line_length_comparison_meta.json
  2: test_specs/low/borb_size_comparison_meta.json


In [8]:
# Show metadata fields
with open("test_specs/low/borb_line_length_comparison_meta.json", "r") as f:
    task_payload = json.load(f)

print(task_payload)

{'stage': 'low', 'process': 'simple_element_judgements', 'task': 'borb_line_length_comparison', 'task_type': 'same_different', 'num_stim': 'two', 'prompt': "This is a task. In this task you will tell me if these two lines are the same length or not. Answer with only 'same' or 'different'. Please put your final answer in {}.", 'trials': [{'trial': 'trial_001', 'images': {'options': ['datasets/low/borb_line_length_comparison/trial_001/A.png', 'datasets/low/borb_line_length_comparison/trial_001/B.png']}, 'answer_key': 'different'}, {'trial': 'trial_002', 'images': {'options': ['datasets/low/borb_line_length_comparison/trial_002/A.png', 'datasets/low/borb_line_length_comparison/trial_002/B.png']}, 'answer_key': 'same'}, {'trial': 'trial_003', 'images': {'options': ['datasets/low/borb_line_length_comparison/trial_003/A.png', 'datasets/low/borb_line_length_comparison/trial_003/B.png']}, 'answer_key': 'same'}, {'trial': 'trial_004', 'images': {'options': ['datasets/low/borb_line_length_compar

## Pipeline Demo with OpenAI API

The following walks you through running and evaluating a single test using the OpenAI API.

The pipeline involves the following core components:

1. **TaskLoader**: Loads the task data based on the task metadata in the test_specs folder. It also contains the function used to encode the stimulus images.
2. **OpenAIModelRunner**: A wrapper around the OpenAI API that handles the API calls. This wrapper ensures that the API settings and prompt format are consistent with those used in the original paper.
3. **Evaluator**: The class that evaluates the model responses against the answer keys stored in the task metadata in the test_specs folder.
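
The image-encoding step mentioned for TaskLoader can be sketched as follows. This assumes base64 encoding, which is a common way to pass images to vision APIs; the function names and the data-URL wrapping are illustrative assumptions, and the repository's own encoding function may differ.

```python
import base64
from pathlib import Path


def encode_image_base64(image_path):
    """Read an image file and return its bytes as a base64 ASCII string."""
    return base64.b64encode(Path(image_path).read_bytes()).decode("ascii")


def to_data_url(image_path, mime_type="image/png"):
    """Wrap the base64 string in a data URL (an assumed payload format)."""
    return f"data:{mime_type};base64,{encode_image_base64(image_path)}"
```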

### Loading a task

TaskLoader is initialized with the path to a task's JSON metadata file.

For instance:

```python
loader = TaskLoader("test_specs/low/borb_line_length_comparison_meta.json")
```

The attribute `loader.task_data` contains all the data stored in the json file.

```python
loader.task_data
```

The function `loader.get_task_info()` returns a dictionary containing the task information.

```python
loader.get_task_info()
```

The function `loader.get_trials()` returns a list of dictionaries containing the trial-by-trial information.

```python
loader.get_trials()
```
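
To make the relationship between these three accessors concrete, here is a small stand-in class. It is only a sketch: `SketchTaskLoader` is not the repository's TaskLoader, and it assumes that the task information is every top-level field except `trials`.

```python
import json


class SketchTaskLoader:
    """Illustrative stand-in; the real TaskLoader may behave differently."""

    def __init__(self, task_data):
        # The full metadata dictionary, as parsed from the JSON file
        self.task_data = task_data

    @classmethod
    def from_json(cls, path):
        with open(path, "r", encoding="utf-8") as f:
            return cls(json.load(f))

    def get_task_info(self):
        # Assumption: task info is every top-level field except the trials
        return {k: v for k, v in self.task_data.items() if k != "trials"}

    def get_trials(self):
        # One dictionary per trial (stimulus paths and answer key)
        return list(self.task_data.get("trials", []))
```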


In [22]:
# Load one of the tests
task_index = 0 
print(f"We will run task: {all_task_paths[task_index]}")

## LOAD TASK
# The task loader requires a path to the task metadata
# For example: `test_specs/low/borb_orientation_meta.json`
loader = TaskLoader(all_task_paths[task_index])

# Inspect task data stored in the loader
loader.task_data # This attribute contains all the data

We will run task: test_specs/low/borb_orientation_meta.json


{'stage': 'low',
 'process': 'simple_element_judgements',
 'task': 'borb_orientation',
 'task_type': 'yes_no',
 'num_stim': 'one',
 'prompt': "This is a task. In this task you will tell me if these two lines are parallel or not. Answer with only 'yes' or 'no'. Please put your final answer in {}.",
 'trials': [{'trial': 'trial_001',
   'images': {'target': ['datasets/low/borb_orientation/001.png']},
   'answer_key': 'yes'},
  {'trial': 'trial_002',
   'images': {'target': ['datasets/low/borb_orientation/002.png']},
   'answer_key': 'no'},
  {'trial': 'trial_003',
   'images': {'target': ['datasets/low/borb_orientation/003.png']},
   'answer_key': 'yes'},
  {'trial': 'trial_004',
   'images': {'target': ['datasets/low/borb_orientation/004.png']},
   'answer_key': 'yes'},
  {'trial': 'trial_005',
   'images': {'target': ['datasets/low/borb_orientation/005.png']},
   'answer_key': 'no'},
  {'trial': 'trial_006',
   'images': {'target': ['datasets/low/borb_orientation/006.png']},
   'answer

### Initialize the Model Runner and Run the Test

ModelConfig is a dataclass that stores the configuration for the model runner. It contains the model name, max tokens, temperature, api key, and any additional parameters. This is used to initialize the model runner.

We have three model runners already available: OpenAIModelRunner, AnthropicModelRunner, and GoogleModelRunner. These are the providers of models used in the original paper. If you want to use a different model, you can create a new model runner by inheriting from the ModelRunner class.

Once the model runner is initialized, we can use it to run the tests. The runner's `generate_response()` method takes a TaskLoader object as input and returns a tuple of two elements: the task information from the test metadata, and the trial-by-trial data with the model's response added for each trial.
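
The shape of this design can be sketched with stand-in classes. Everything below is hypothetical: the class names, the `query` method, and the `additional_params` field name are assumptions for illustration, and the sketch's `generate_response()` takes the task info and trials directly rather than a TaskLoader. Check the repository's ModelRunner class for the actual interface before subclassing it.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchModelConfig:
    """Illustrative config with the fields described above (names assumed)."""
    model_name: str
    api_key: str
    max_tokens: int = 1000
    temperature: float = 1.0
    additional_params: Dict[str, Any] = field(default_factory=dict)


class SketchModelRunner:
    """Stand-in base class: subclasses only implement query()."""

    def __init__(self, config):
        self.config = config

    def query(self, prompt, images):
        raise NotImplementedError

    def generate_response(self, task_info, trials):
        results = []
        for trial in trials:
            response = self.query(task_info["prompt"], trial.get("images", {}))
            # Keep the original trial fields and add the model's response
            results.append({**trial, "model_response": response})
        return task_info, results


class AlwaysYesRunner(SketchModelRunner):
    """Toy provider that answers every trial with {yes}."""

    def query(self, prompt, images):
        return "{yes}"
```

A toy runner like `AlwaysYesRunner` can also serve as a cheap baseline for checking the rest of a pipeline without spending API calls.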

In [None]:
## INITIALIZE
# Configure OpenAI model
openai_config = ModelConfig(
    model_name="gpt-4o-2024-05-13", # model name
    api_key=api_keys["openai"], # API key; make sure it is set in utils/api_keys.json
    max_tokens=1000, # max tokens used to generate the response
    temperature=1.0 # default temperature for the model; also used in the original paper
)

# Initialize runner
runner = OpenAIModelRunner(openai_config)

## RUN THE TEST
# Run the task
results = runner.generate_response(loader) # return task info and results (payload + model responses)

# Here `results` is a tuple containing the task info and the results
print(f"Task: {results[0]['task']}")
print(f"Stage: {results[0]['stage']}") # stage of the task: low, mid, or high
print(f"Process: {results[0]['process']}") # which finer-grained process the task belongs to
print(f"Number of trials: {len(results[1])}") # number of trials in the task

# Print example model response
print(results[1][0]['model_response']) #[1] returns trial-by-trial data, [0] reflects the first trial

Getting model responses: 100%|██████████| 30/30 [00:29<00:00,  1.03it/s]

Task: borb_orientation
Stage: low
Process: simple_element_judgements
Number of trials: 30





In [32]:
# Show example of trial-by-trial details

# Show first few responses
print("Sample Model Responses:")
print("=" * 50)

# Look at the results
for i, trial in enumerate(results[1][:3]):
    print(f"\nTrial {trial['trial_id']}:")
    print(f"Prompt: {trial['prompt'][:100]}...") # The prompt is the same for all trials within a task
    print(f"Model Response: {trial['model_response']}") # The final answer is the text the model places inside {}
    print(f"Correct Answer: {trial['answer_key']}")
    print("-" * 30)

Sample Model Responses:

Trial trial_001:
Prompt: This is a task. In this task you will tell me if these two lines are parallel or not. Answer with on...
Model Response: {no}
Correct Answer: yes
------------------------------

Trial trial_002:
Prompt: This is a task. In this task you will tell me if these two lines are parallel or not. Answer with on...
Model Response: {no}
Correct Answer: no
------------------------------

Trial trial_003:
Prompt: This is a task. In this task you will tell me if these two lines are parallel or not. Answer with on...
Model Response: {no}
Correct Answer: yes
------------------------------
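
Because the prompt asks the model to put its final answer in `{}`, the answer has to be pulled out of the braces before it can be compared with the answer key. A minimal sketch of that step is shown below; `extract_braced_answer` is a hypothetical helper, and taking the last braced span and lowercasing it are assumptions rather than the repository's documented parsing rules.

```python
import re


def extract_braced_answer(response):
    """Return the text inside the last {...} in a response, or None."""
    matches = re.findall(r"\{([^{}]*)\}", response)
    if not matches:
        return None
    # Assumption: the last braced span is the final answer
    return matches[-1].strip().lower()
```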


### Evaluate the Results

The current repository provides a class that helps automatically mark the model responses based on the criteria used in the paper.

The evaluator takes the output tuple of the runner, which contains both the overall task information and the trial-by-trial information. The evaluator reads the evaluation config file from `utils/evaluator_config.json`* to determine the method of evaluation for each task.

Passing the runner output to the evaluator's `.evaluate()` method starts the marking process. The results are stored as a table and can be accessed using `.get_result()`.

The results table can be saved as a CSV file using `.save_as_csv()`. A path can be specified as an argument. If no path is specified, the results are saved as `results.csv` in the current directory.

For every task evaluated the evaluator will store the corresponding result as a row in the table. Running and evaluating through all tasks will result in a table with the results for all tasks. *See below for how to run a batch of tasks*.

<br>

*`evaluator_config.json` is a file that contains the list of tasks that use each evaluation method. 

In [34]:
# Evaluate results
evaluator = Evaluator()
evaluator.evaluate(results)

# Display results
results_df = evaluator.get_result() # get pandas dataframe
results_df

| | task | task_type | stage | process | num_trials | raw_score | percent_score |
|---|---|---|---|---|---|---|---|
| 0 | borb_orientation | yes_no | low | simple_element_judgements | 30 | 17 | 0.566667 |
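
In the row above, `raw_score` is the number of correct trials and `percent_score` is that count divided by `num_trials`, expressed as a proportion (17 / 30 ≈ 0.566667). The sketch below reproduces this arithmetic under the assumption of case-insensitive exact matching; the Evaluator selects its method per task from the config file, so its actual rules may differ, and `score_trials` is a hypothetical name.

```python
def score_trials(model_answers, answer_keys):
    """Count exact matches and return the raw and proportional score."""
    if len(model_answers) != len(answer_keys) or not answer_keys:
        raise ValueError("need two non-empty lists of equal length")
    raw_score = sum(
        # A missing answer (None) is counted as incorrect
        given is not None and given.strip().lower() == key.strip().lower()
        for given, key in zip(model_answers, answer_keys)
    )
    return {
        "num_trials": len(answer_keys),
        "raw_score": raw_score,
        "percent_score": raw_score / len(answer_keys),
    }
```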


### Running a Batch of Tasks

We can loop through multiple tasks and pass each set of results to the same evaluator.

In [35]:
# Run a subset of tasks
n_tasks = 5

# Initialize evaluator
batch_evaluator = Evaluator()
batch_results = []

# Loop through tasks in our list of task paths
# For each task
for i, task_path in enumerate(all_task_paths[:n_tasks]): 
    print(f"\nProcessing task {i+1}/{len(all_task_paths)}: {task_path}")
    
    loader = TaskLoader(task_path) # Load the task data
    runner = OpenAIModelRunner(openai_config) # Initialize the runner
    results = runner.generate_response(loader) # Get the model response
    batch_evaluator.evaluate(results) # Evaluate the result; this appends a row to the results table stored in the evaluator
    print(f"✓ Completed: {results[0]['task']}")

# Save results
batch_evaluator.save_as_csv(f"results_{openai_config.model_name}.csv")


Processing task 1/31: test_specs/low/borb_orientation_meta.json


Getting model responses: 100%|██████████| 30/30 [00:28<00:00,  1.06it/s]


✓ Completed: borb_orientation

Processing task 2/31: test_specs/low/borb_line_length_comparison_meta.json


Getting model responses: 100%|██████████| 30/30 [00:30<00:00,  1.03s/it]


✓ Completed: borb_line_length_comparison

Processing task 3/31: test_specs/low/borb_size_comparison_meta.json


Getting model responses: 100%|██████████| 30/30 [00:30<00:00,  1.02s/it]


✓ Completed: borb_size_comparison

Processing task 4/31: test_specs/low/borb_position_of_gap_meta.json


Getting model responses: 100%|██████████| 40/40 [00:44<00:00,  1.11s/it]


✓ Completed: borb_position_of_gap

Processing task 5/31: test_specs/low/mindset_weber_law_meta.json


Getting model responses: 100%|██████████| 40/40 [00:40<00:00,  1.02s/it]

✓ Completed: mindset_weber_law
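
Each task in the batch takes roughly half a minute of API calls, so for longer runs it may be worth making the loop tolerant of individual failures. The following is a sketch only: `run_batch` and the `run_one` callable are hypothetical, with `run_one` standing in for the load, run, and evaluate steps for a single task path.

```python
def run_batch(task_paths, run_one):
    """Run run_one on each path, collecting failures instead of stopping."""
    completed, failed = [], []
    for i, path in enumerate(task_paths, start=1):
        print(f"Processing task {i}/{len(task_paths)}: {path}")
        try:
            completed.append((path, run_one(path)))
        except Exception as exc:
            # Record the error and continue with the next task
            failed.append((path, repr(exc)))
    return completed, failed
```

The `failed` list can then be retried separately once the cause (for example a rate limit or a missing stimulus file) has been addressed.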



