# Promptfoo-style eval without promptfoo

Goal: Run test-suite-style eval (like Promptfoo) with completely custom components, i.e. without using Promptfoo.

In this case, you have 2 options:
1. Run with inputs. The library runs the AIConfig for you first.
2. Run with outputs only. You run the AIConfig yourself and save the outputs for eval.

Run the notebook in order for an example of each.

Assumptions:
* You have a parametrized AIConfig with a test input called "the_query", like this: 
`"input": "{{the_query}}"`
* You have some evaluation criteria in mind for the AIConfig's text output.
* The Promptfoo integration does not meet your needs, e.g.
  * You want to run the AIConfig yourself instead of handing control to Promptfoo
  * You need to scale beyond what Promptfoo can reasonably handle
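
To make the first assumption concrete, here is a minimal sketch of what filling a `{{the_query}}` placeholder amounts to. The helper `fill_template` is hypothetical and written only for illustration; it is not how the AIConfig library resolves parameters internally. The prompt text is the one used by the travel example in this notebook.

```python
import re


def fill_template(template: str, params: dict[str, str]) -> str:
    """Replace each {{name}} placeholder with params[name] (illustrative helper)."""

    def lookup(match: re.Match) -> str:
        # group(1) is the parameter name between the double braces
        return params[match.group(1)]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lookup, template)


prompt_template = "Tell me 3 fun attractions related to {{the_query}} to do in NYC."
filled = fill_template(prompt_template, {"the_query": "iconic midtown skyscrapers"})
print(filled)
```

Each test input in a suite plays the role of the value bound to `the_query` here.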

In [3]:
print("Imports and set log level")

import logging

import pandas as pd
import lastmile_utils.lib.jupyter as jupyter_utils

pd.set_option("display.max_colwidth", None)

from aiconfig.eval.lib import (
    brevity,
    substring_match,
    run_test_suite_with_inputs,
    TestSuiteWithInputsSettings,
)

jupyter_utils.set_log_level(logging.WARNING)



Imports and set log level





In [4]:
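import openai

# Read the API key from a local file and close the file handle afterwards.
with open("/home/jacobjensen/secrets/openai_api_key.txt", "r") as key_file:
    openai.api_key = key_file.read().strip()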

## Option 1: Provide inputs, library runs AIConfig for you

In [5]:
print(
    """
    Define test suite with inputs (option 1), 
      as opposed to using pre-computed AIConfig outputs (option 2)
      
      Define list of inputs and test criteria.
      In this case, we are checking brevity for each test case
      as well as checking that each output contains a specific expected substring.
"""
)


ts_settings = TestSuiteWithInputsSettings(
    prompt_name="gen_itinerary",
    aiconfig_path="./travel_parametrized.aiconfig.json",
)

# Each of these pairs will be used to construct a test case just below.
# For each pair (input, expected_substring) we define a test case that says,
# "When I run this input through this AIConfig,
# I expect the output to contain this particular substring".
#
# Each test input is substituted for "the_query" in the input prompt:
# "Tell me 3 fun attractions related to {{the_query}} to do in NYC."
# See the aiconfig (python/src/aiconfig/eval/custom_eval/examples/travel/travel_parametrized.aiconfig.json).
#
# For example, when we call `substring_match(substring, case_sensitive=False)` below
# with substring == "Empire State Building", we are telling the library to create a
# boolean metric (i.e. a pass/fail test case) that passes (value == 1.0) if the substring
# "empire state building" appears (ignoring case) in the AIConfig output
# when the AIConfig is given the input "iconic midtown skyscrapers".
test_inputs_with_substrings = [
    ("different kinds of cuisines", "Magnolia Bakery"),
    ("iconic midtown skyscrapers", "Empire State Building"),
]

test_suite_with_inputs = []
for test_input, substring in test_inputs_with_substrings:
    # Add the brevity metric
    test_fn1 = brevity
    test_suite_with_inputs.append((test_input, test_fn1))
    # Add substring check function
    test_fn2 = substring_match(substring, case_sensitive=False)
    test_suite_with_inputs.append((test_input, test_fn2))


    Define test suite with inputs (option 1), 
      as opposed to using pre-computed AIConfig outputs (option 2)
      
      Define list of inputs and test criteria.
      In this case, we are checking brevity for each test case
      as well as checking that each output contains a specific expected substring.



In [6]:
print("If you like, you can inspect the test suite before passing it to the evaluation library.")

for test_input, fn in test_suite_with_inputs:
    print("\nTest input:\n", test_input, "\nFunction:\n", fn)

If you like, you can inspect the test suite before passing it to the evaluation library.

Test input:
 different kinds of cuisines 
Function:
 <function brevity at 0x7f94dc1767a0>

Test input:
 different kinds of cuisines 
Function:
 functools.partial(<function contains_substring at 0x7f94dc176a70>, substring='Magnolia Bakery', case_sensitive=False)

Test input:
 iconic midtown skyscrapers 
Function:
 <function brevity at 0x7f94dc1767a0>

Test input:
 iconic midtown skyscrapers 
Function:
 functools.partial(<function contains_substring at 0x7f94dc176a70>, substring='Empire State Building', case_sensitive=False)
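
The `functools.partial` objects printed above suggest how a parametrized metric can be built: bind the expected substring up front and leave the AIConfig output as the only remaining argument. The sketch below illustrates that pattern with stand-in names (`contains_substring_sketch`, `substring_match_sketch`, `word_count`), all of which are hypothetical. It is not the library's implementation, and it assumes a metric is simply a callable that takes the output string and returns a number.

```python
from functools import partial


def contains_substring_sketch(output: str, substring: str, case_sensitive: bool) -> float:
    """Return 1.0 (pass) if `substring` occurs in `output`, else 0.0 (fail)."""
    if not case_sensitive:
        output, substring = output.lower(), substring.lower()
    return 1.0 if substring in output else 0.0


def substring_match_sketch(substring: str, case_sensitive: bool = True):
    """Bind the expected substring now; the output is supplied later."""
    return partial(
        contains_substring_sketch,
        substring=substring,
        case_sensitive=case_sensitive,
    )


def word_count(output: str) -> float:
    """A second custom metric: number of whitespace-separated words."""
    return float(len(output.split()))


check = substring_match_sketch("Empire State Building", case_sensitive=False)
sample = "1. Admire NYC from Empire State Building's decks."
print(check(sample), word_count(sample))
```

Under that assumption, a custom function such as `word_count` could be appended to the test suite in the same `(test_input, fn)` form as the built-in metrics.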


In [7]:
print("Run the eval interface (option 1, with inputs)")

df_result = await run_test_suite_with_inputs(
    test_suite=test_suite_with_inputs,
    settings=ts_settings,
)

print("Raw output")
df_result

Run the eval interface (option 1, with inputs)
Raw output


| | input | aiconfig_output | value | metric_name | metric_description | best_value | worst_value |
|---|---|---|---|---|---|---|---|
| 0 | different kinds of cuisines | 1. Chinatown Food Tour: Explore cuisine, learn cultural history. 2. Little Italy Pizza Class: Hands-on pizza-making, enjoy your creation. 3. Harlem Soul Food Tour: Dive into culture, savor traditional dishes. | 209.0 | brevity | Absolute text length | 1.0 | inf |
| 1 | different kinds of cuisines | 1. Chinatown Food Tour: Explore cuisine, learn cultural history. 2. Little Italy Pizza Class: Hands-on pizza-making, enjoy your creation. 3. Harlem Soul Food Tour: Dive into culture, savor traditional dishes. | 0.0 | contains_substring | 1.0 (pass) if contains given substring | 1.0 | 0.0 |
| 2 | iconic midtown skyscrapers | 1. Admire NYC from Empire State Building's decks. 2. Experience Top of the Rock's sky-high vistas. 3. Soak in panoramic views at One World Observatory. | 151.0 | brevity | Absolute text length | 1.0 | inf |
| 3 | iconic midtown skyscrapers | 1. Admire NYC from Empire State Building's decks. 2. Experience Top of the Rock's sky-high vistas. 3. Soak in panoramic views at One World Observatory. | 1.0 | contains_substring | 1.0 (pass) if contains given substring | 1.0 | 0.0 |
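
Because the result is in long format (one row per test case and metric), a per-metric summary is a single `groupby`. The sketch below rebuilds only the `input`, `metric_name`, and `value` columns from the table above, so that it runs on its own; with the real `df_result` you would skip the construction step. For a pass/fail metric, the mean of `value` is the pass rate.

```python
import pandas as pd

# Values copied from the raw output above (other columns omitted for brevity).
long_df = pd.DataFrame(
    {
        "input": [
            "different kinds of cuisines",
            "different kinds of cuisines",
            "iconic midtown skyscrapers",
            "iconic midtown skyscrapers",
        ],
        "metric_name": ["brevity", "contains_substring", "brevity", "contains_substring"],
        "value": [209.0, 0.0, 151.0, 1.0],
    }
)

# Mean value per metric: average length for brevity, pass rate for contains_substring.
summary = long_df.groupby("metric_name")["value"].mean()
print(summary)
```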


In [8]:
print("Unstack for nicer manual review")
df_result.set_index(["input", "aiconfig_output", "metric_name"]).value.unstack("metric_name")

Unstack for nicer manual review


| input | aiconfig_output | brevity | contains_substring |
|---|---|---|---|
| different kinds of cuisines | 1. Chinatown Food Tour: Explore cuisine, learn cultural history.\n2. Little Italy Pizza Class: Hands-on pizza-making, enjoy your creation. \n3. Harlem Soul Food Tour: Dive into culture, savor traditional dishes. | 209.0 | 0.0 |
| iconic midtown skyscrapers | 1. Admire NYC from Empire State Building's decks.\n2. Experience Top of the Rock's sky-high vistas.\n3. Soak in panoramic views at One World Observatory. | 151.0 | 1.0 |


## Option 2: Run eval on already-computed AIConfig outputs

In [9]:
print("Define outputs to test and criteria, similar to option 1.")


from aiconfig.eval.lib import (
    brevity,
    substring_match,
    run_test_suite_outputs_only,
)


# This is similar to "test_inputs_with_substrings" above, but we have the AIConfig *outputs*
# in the test cases, rather than the inputs. The library will evaluate these strings directly
# because there is no need to run the AIConfig to generate the outputs.
test_outputs_with_substrings = [
    (
        "Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience",
        "Magnolia Bakery"
    ),
    (
        "1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit \"Top of the Rock\", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",
        "Empire State Building"
    )
]


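# A metric that always raises. While it is in the test suite,
# the run below stops with a ValueError.
def error_metric(o):
    raise ValueError("This is an error metric")

test_suite_outputs_only = []
for test_output, substring in test_outputs_with_substrings:
    # `error_metric` stands in for the brevity metric here.
    # To get brevity values instead of an error, use: test_fn1 = brevity
    test_fn1 = error_metric
    test_suite_outputs_only.append((test_output, test_fn1))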
    # Add substring check function
    test_fn2 = substring_match(substring, case_sensitive=False)
    test_suite_outputs_only.append(
        (test_output, test_fn2)
    )

Define outputs to test and criteria, similar to option 1.
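
Option 2 presumes you ran the AIConfig yourself and kept the outputs somewhere. One possible way to do that, sketched below, is to save them as JSON and load them back as `(output, expected_substring)` pairs. The file name, the record keys, and the shortened output strings are all hypothetical choices made for this illustration; the library does not prescribe a storage format here.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record layout: one dict per saved AIConfig output.
# The output strings are shortened stand-ins, not real model output.
saved = [
    {
        "output": "Begin at Chelsea Market for diverse food options.",
        "expected_substring": "Magnolia Bakery",
    },
    {
        "output": "1. Empire State Building: Observation deck visit.",
        "expected_substring": "Empire State Building",
    },
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "aiconfig_outputs.json"
    # Save step: done once, right after running the AIConfig.
    path.write_text(json.dumps(saved, indent=2), encoding="utf-8")
    # Load step: done later, when building the outputs-only test suite.
    loaded = json.loads(path.read_text(encoding="utf-8"))

loaded_pairs = [(r["output"], r["expected_substring"]) for r in loaded]
print(loaded_pairs)
```

The resulting list has the same shape as `test_outputs_with_substrings`, so the loop that builds the test suite can consume it unchanged.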


In [10]:
print("If you like, you can inspect the test suite before passing it to the evaluation library.")

for test_output, fn in test_suite_outputs_only:
    print("\nTest output:\n", test_output, "\nFunction:\n", fn)

If you like, you can inspect the test suite before passing it to the evaluation library.

Test output:
 Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience 
Function:
 <function error_metric at 0x7f94dc06e680>

Test output:
 Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience 
Function:
 functools.partial(<function contains_substring at 0x7f94dc176a70>, substring='Magnolia Bakery', case_sensitive=False)

Test output:
 1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit "Top of the Rock", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities. 
Function:
 <function error_metric at 0x7f94dc06e680>

Test output:
 1. Empire State Buildin

In [11]:
print("Run the eval library")
df_result = await run_test_suite_outputs_only(
    test_suite=test_suite_outputs_only,
)
print("Raw output")
df_result

Run the eval library


ValueError: This is an error metric
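
As the traceback shows, one raising metric aborts the run. If you would rather record a failure and keep going, one option is to wrap each metric so that an exception becomes NaN. The wrapper `safe_metric` below is a hypothetical helper, not part of the library. It assumes that a metric is a plain callable on the output string, and whether the eval library treats NaN values sensibly downstream is not verified here.

```python
import math
from typing import Callable


def safe_metric(metric: Callable[[str], float]) -> Callable[[str], float]:
    """Wrap `metric` so that any exception yields NaN instead of propagating."""

    def wrapped(output: str) -> float:
        try:
            return float(metric(output))
        except Exception as exc:
            # Report the failure, then fall back to NaN for this test case.
            name = getattr(metric, "__name__", repr(metric))
            print(f"metric {name} failed: {exc}")
            return math.nan

    return wrapped


def error_metric(o):
    raise ValueError("This is an error metric")


def length_metric(o):
    return len(o)


results = [safe_metric(m)("some output") for m in (error_metric, length_metric)]
print(results)
```

The trade-off is that errors become easy to overlook, so it is worth checking the result for NaN values before trusting any aggregate.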

In [None]:
print("Unstack for nicer manual review")
df_result.set_index(["aiconfig_output", "metric_name"]).value.unstack("metric_name")

Unstack for nicer manual review


| aiconfig_output | brevity | contains_substring |
|---|---|---|
| 1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit "Top of the Rock", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities. | 267.0 | 1.0 |
| Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience | 160.0 | 0.0 |
