<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# <center>Using Arize with AI agents</center>

This guide shows you how to create and evaluate agents with Arize to improve performance. We'll go through the following steps:

* Create a customer support agent using a router template

* Trace the agent activity, including function calling

* Create a dataset to benchmark performance

* Evaluate agent performance using code, human annotation, and LLM as a judge

* Experiment with different prompts and models

# Initial setup


We'll set up our libraries, API keys, and OpenAI tracing to Arize.

### Install Libraries

In [None]:
!pip install -qq arize-otel openai openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp gcsfs nest_asyncio arize-phoenix "arize[Datasets]"


### Setup Keys

In [None]:
import os
from getpass import getpass
import nest_asyncio
nest_asyncio.apply()

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key



### Setup Tracing

To follow along with this tutorial, you'll need to sign up for Arize and get your Space ID and API key. You can see the [guide here](https://docs.arize.com/arize/llm-tracing/quickstart-llm).

In [None]:
# Import open-telemetry dependencies
from arize_otel import register_otel, Endpoints

# Setup OTEL via our convenience function
register_otel(
    endpoints = Endpoints.ARIZE,
    space_id = getpass("🔑 Enter your Arize Space ID: "),
    api_key = getpass("🔑 Enter your Arize API key: "),
    model_id = "agents-cookbook", # name this to whatever you would like
)
# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.openai import OpenAIInstrumentor

# Finish automatic instrumentation
OpenAIInstrumentor().instrument()

# Create customer support agent

We'll be creating a customer support agent using function calling following the architecture below:

<img src="https://storage.cloud.google.com/arize-assets/tutorials/images/agent_architecture.png" width="800"/>

### Setup functions and create customer support agent

We have 6 functions that we define below.

1. product_comparison
2. product_search
3. customer_support
4. track_package
5. product_details
6. apply_discount_code



In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "product_comparison",
            "description": "Compare features of two products.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_a_id": {
                        "type": "string",
                        "description": "The unique identifier of Product A."
                    },
                    "product_b_id": {
                        "type": "string",
                        "description": "The unique identifier of Product B."
                    }
                },
                "required": ["product_a_id", "product_b_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "product_search",
            "description": "Search for products based on criteria.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query string."
                    },
                    "category": {
                        "type": "string",
                        "description": "The category to filter the search."
                    },
                    "min_price": {
                        "type": "number",
                        "description": "The minimum price of the products to search.",
                        "default": 0
                    },
                    "max_price": {
                        "type": "number",
                        "description": "The maximum price of the products to search."
                    },
                    "page": {
                        "type": "integer",
                        "description": "The page number for pagination.",
                        "default": 1
                    },
                    "page_size": {
                        "type": "integer",
                        "description": "The number of results per page.",
                        "default": 20
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "customer_support",
            "description": "Get contact information for customer support regarding an issue.",
            "parameters": {
                "type": "object",
                "properties": {
                    "issue_type": {
                        "type": "string",
                        "description": "The type of issue (e.g., billing, technical support)."
                    }
                },
                "required": ["issue_type"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Track the status of a package based on the tracking number.",
            "parameters": {
                "type": "object",
                "properties": {
                    "tracking_number": {
                        "type": "integer",
                        "description": "The tracking number of the package."
                    }
                },
                "required": ["tracking_number"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "product_details",
            "description": "Returns details for a given product id",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_id": {
                        "type": "string",
                        "description": "The id of a product to look up."
                    }
                },
                "required": ["product_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "apply_discount_code",
            "description": "Applies the discount code to a given order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "integer",
                        "description": "The id of the order to apply the discount code to."
                    },
                    "discount_code": {
                        "type": "string",
                        "description": "The discount code to apply"
                    }
                },
                "required": ["order_id", "discount_code"]
            }
        }
    }
]
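
Each tool definition lists its `required` parameters, so a tool call produced by the model can be checked against that list before anything runs. The sketch below is one possible way to do that check. `example_tools` is a trimmed, hypothetical copy of a single entry from the list above, and `missing_required_args` is a helper name introduced here for illustration, not part of any library.

```python
import json

# Hypothetical, trimmed-down copy of one entry from the tools list above.
example_tools = [
    {
        "type": "function",
        "function": {
            "name": "apply_discount_code",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "integer"},
                    "discount_code": {"type": "string"},
                },
                "required": ["order_id", "discount_code"],
            },
        },
    }
]


def missing_required_args(tool_specs, function_name, arguments_json):
    """Return the required parameter names that are absent from a tool call."""
    specs_by_name = {spec["function"]["name"]: spec["function"] for spec in tool_specs}
    if function_name not in specs_by_name:
        raise KeyError(f"Unknown function: {function_name}")
    required = specs_by_name[function_name]["parameters"].get("required", [])
    arguments = json.loads(arguments_json)
    return [name for name in required if name not in arguments]


# The model supplied an order id but no discount code.
print(missing_required_args(example_tools, "apply_discount_code", '{"order_id": 42}'))
```

A check like this only confirms that required keys are present. It does not validate types or values, which a JSON Schema validator would be better suited for.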

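We define a function below called `run_prompt`, which calls OpenAI's chat completions API with the tools defined above. It returns the question, any tool calls the model selected, and the model's text response.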

In [None]:
import os
from textwrap import dedent
import json

import openai

client = openai.Client()

def run_prompt(input):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        tools=tools,
        tool_choice="auto",
        messages=[
            {
                "role": "system",
                "content": " ",
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )
    print(response)

    message = response.choices[0].message
    # tool_calls is None when the model answers directly, so fall back to an empty list
    tool_calls = getattr(message, "tool_calls", None) or []
    # content is None when the model returns only tool calls, so fall back to an empty string
    content = message.content or ""
    return {"question": input, "tools": tool_calls, "response": content}

Let's test it and see if it returns the right function! The router behaves differently depending on `tool_choice`: with "auto" the model may answer directly instead of calling a function, while with "required" it must call at least one function.

In [None]:
run_prompt("Hi, I'd like to apply to apply a discount code to my order.")

ChatCompletion(id='chatcmpl-AOSjYkYXUiUSLOjyAd21f5BtA0BZn', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Please provide me with your order ID and the discount code you'd like to apply.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1730393688, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_0ba0d124f1', usage=CompletionUsage(completion_tokens=17, prompt_tokens=347, total_tokens=364, completion_tokens_details=CompletionTokensDetails(audio_tokens=None, reasoning_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0)))


{'question': "Hi, I'd like to apply to apply a discount code to my order.",
 'tools': [],
 'response': "Please provide me with your order ID and the discount code you'd like to apply."}
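
Note that `run_prompt` only returns the tool calls the model selected; it does not execute them. A minimal sketch of how those calls could be dispatched to Python functions follows. The handlers `track_package` and `product_details` are hypothetical stubs written for this example, and `SimpleNamespace` stands in for the OpenAI tool-call object, assuming only its `.function.name` and `.function.arguments` attributes.

```python
import json
from types import SimpleNamespace


def track_package(tracking_number):
    # Hypothetical stub; a real handler would query a shipping system.
    return {"tracking_number": tracking_number, "status": "in transit"}


def product_details(product_id):
    # Hypothetical stub; a real handler would query a product catalog.
    return {"product_id": product_id, "name": "placeholder product"}


# Map function names from the tool definitions to local handlers.
HANDLERS = {"track_package": track_package, "product_details": product_details}


def execute_tool_calls(tool_calls):
    """Run each tool call through its handler and collect the results."""
    results = []
    for call in tool_calls:
        handler = HANDLERS.get(call.function.name)
        if handler is None:
            results.append({"name": call.function.name, "error": "no handler registered"})
            continue
        # The model returns arguments as a JSON string.
        arguments = json.loads(call.function.arguments)
        results.append({"name": call.function.name, "result": handler(**arguments)})
    return results


# Stand-in object with the same attribute layout as an OpenAI tool call.
fake_call = SimpleNamespace(
    function=SimpleNamespace(name="track_package", arguments='{"tracking_number": 12345}')
)
print(execute_tool_calls([fake_call]))
```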

Now that we have a basic agent, let's generate a dataset of questions and run the prompt against it!

# Create synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 25 questions we can use to test our customer support agent.

In [None]:
GEN_TEMPLATE = """
You are an assistant that generates complex customer service questions.
The questions should often involve:

Multiple Categories: Questions that could logically fall into more than one category (e.g., combining product details with a discount code).
Vague Details: Questions with limited or vague information that require clarification to categorize correctly.
Mixed Intentions: Queries where the customer’s goal or need is unclear or seems to conflict within the question itself.
Indirect Language: Use of indirect or polite phrasing that obscures the direct need or request (e.g., using "I was wondering if..." or "Perhaps you could help me with...").
For specific categories:

Track Package: Include vague timing references (e.g., "recently" or "a while ago") instead of specific dates.
Product Comparison and Product Search: Include generic descriptors without specific product names or IDs (e.g., "high-end smartphones" or "energy-efficient appliances").
Apply Discount Code: Include questions about discounts that might apply to hypothetical or past situations, or without mentioning if they have made a purchase.
Product Details: Ask for comparisons or details that involve multiple products or categories ambiguously (e.g., "Tell me about your range of electronics that are good for home office setups").

Examples of More Challenging Questions
"There's an issue with one of the items I think I bought last month—what should I do?"
"I need help with something I ordered, or maybe I'm just looking for something new. Can you help?"

Some questions should be straightforward uses of the provided functions

Respond with a list, one question per line. Do not include any numbering at the beginning of each line. Do not include any category headings.
Generate 25 questions. Be sure there are no duplicate questions.
"""

In [None]:
import nest_asyncio
import pandas as pd
nest_asyncio.apply()
from phoenix.evals import OpenAIModel
pd.set_option('display.max_colwidth', 500)

model = OpenAIModel(model="gpt-4o", max_tokens=1300)

In [None]:
resp = model(GEN_TEMPLATE)

In [None]:
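# Split the response into one question per line, dropping blank lines
split_response = [line.strip() for line in resp.strip().split('\n') if line.strip()]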

questions_df = pd.DataFrame(split_response, columns=['question'])
questions_df["tools"] = ""
questions_df["response"] = ""
print(questions_df)

                                                                                                                                     question  \
0     I was wondering if you could help me find a good deal on something I might have ordered recently, or maybe suggest something similar?     
1                               Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen?     
2                          I think I used a discount code a while back, but I'm not sure if it applied correctly—can you check that for me?     
3                                         Perhaps you could assist me with tracking a package that I believe was sent out not too long ago?     
4                   I'm curious about your range of high-performance laptops, especially those that might be good for both gaming and work.     
5                                 Is there a way to apply a discount code to an order I placed a few weeks ago, or is it too late 

Now let's run it and manually inspect the traces! Change the value in `.head(2)` to any number between 1 and 25 to run the agent on that many of the questions we generated earlier.

Then manually inspect the outputs in Arize.

In [None]:
response_df = questions_df['question'].head(2).apply(run_prompt).apply(pd.Series)

ChatCompletion(id='chatcmpl-AOSju94LxVnNLHy7q8Hxg8qkoaL66', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Could you please provide me with more details about the item you ordered recently? This could include the product name, category, or any specific features you're looking for. This will help me find a good deal or suggest something similar.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1730393710, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_0ba0d124f1', usage=CompletionUsage(completion_tokens=46, prompt_tokens=359, total_tokens=405, completion_tokens_details=CompletionTokensDetails(audio_tokens=None, reasoning_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0)))
ChatCompletion(id='chatcmpl-AOSjvCXXBe1AlxNMZ9zibwnIUUUMx', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, m

In [None]:
response_df

Unnamed: 0,question,tools,response
0,"I was wondering if you could help me find a good deal on something I might have ordered recently, or maybe suggest something similar?",[],"Could you please provide me with more details about the item you ordered recently? This could include the product name, category, or any specific features you're looking for. This will help me find a good deal or suggest something similar."
1,Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen?,"[ChatCompletionMessageToolCall(id='call_YHcjrv9WnsBZx84Zo56VCvMW', function=Function(arguments='{""query"":""eco-friendly kitchen gadgets"",""category"":""kitchen"",""page"":1,""page_size"":5}', name='product_search'), type='function')]",
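
The `tools` column above holds raw `ChatCompletionMessageToolCall` objects, which are verbose to read and to pass into an evaluation prompt. One option is to reduce each call to its function name and parsed arguments. The sketch below assumes only the `.function.name` and `.function.arguments` attributes visible in the output above; `summarize_tool_calls` is a hypothetical helper, and `SimpleNamespace` is used as a stand-in for the real object.

```python
import json
from types import SimpleNamespace


def summarize_tool_calls(tool_calls):
    """Convert tool-call objects into plain dicts with parsed arguments."""
    summary = []
    for call in tool_calls:
        try:
            arguments = json.loads(call.function.arguments)
        except json.JSONDecodeError:
            # Keep the raw string if the model produced invalid JSON.
            arguments = call.function.arguments
        summary.append({"name": call.function.name, "arguments": arguments})
    return summary


# Stand-in mirroring the product_search call shown in the table above.
fake_call = SimpleNamespace(
    id="call_example",
    type="function",
    function=SimpleNamespace(
        name="product_search",
        arguments='{"query": "eco-friendly kitchen gadgets", "category": "kitchen", "page": 1, "page_size": 5}',
    ),
)
print(summarize_tool_calls([fake_call]))
```

With a helper like this, a column of summaries could be produced with `response_df["tools"].apply(summarize_tool_calls)`.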


# Evaluating your agent

Now that we have a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.

Here, we define three evaluation templates. They judge whether the router was right to call a function or respond directly, whether the selected function can answer the question, and whether the arguments were extracted correctly.

In [None]:
ROUTER_EVAL_TEMPLATE = ''' You are comparing a response to a question, and verifying whether that response should have made a function call instead of responding directly. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Response]: {tools}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the Question above to the response. You must determine whether the response
decided to call the correct function.
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the agent should have made function call instead of responding directly and did not, or the function call chosen was the incorrect one.
"correct" means the selected function would correctly and fully answer the user's question.

Here is more information on each function:
product_comparison: Compare features of two products. Should include either the product id or name. If the name or id is present in the question and not present in the generated function, the response is incorrect.
product_search: Search for products based on criteria.
track_package: Track the status of a package based on the tracking number.
customer_support: Get contact information for customer support regarding an issue. The response should always include an email or phone number.
apply_discount_code: Applies a discount code to an order.
product_details: Get detailed features on one product.
'''

FUNCTION_SELECTION_EVAL_TEMPLATE = ''' You are comparing a function call response to a question and trying to determine if the generated call is correct. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Response]: {tools}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the Question above to the function call. You must determine whether the function call
will return the answer to the Question. Please focus on whether the very specific
question can be answered by the function call.
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the function call will not provide an answer to the Question.
"correct" means the function call will definitely provide an answer to the Question.

Here is more information on each function:
product_comparison: Compare features of two products. Should include either the product id or name. If the name or id is present in the question and not present in the generated function, the response is incorrect.
product_search: Search for products based on criteria.
track_package: Track the status of a package based on the tracking number.
customer_support: Get contact information for customer support regarding an issue. The response should always include an email or phone number.
apply_discount_code: Applies a discount code to an order.
product_details: Get detailed features on one product.
'''

PARAMETER_EXTRACTION_EVAL_TEMPLATE = ''' You are comparing a function call response to a question and trying to determine if the generated call has extracted the exact right parameters from the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Response]: {tools}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the parameters in the generated function against the JSON provided below.
The parameters extracted from the question must match the JSON below exactly.
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the parameters in the function do not match the JSON schema below exactly, or the generated function does not correctly answer the user's question.
You should also respond with "incorrect" if the response makes up information that is not in the JSON schema.
"correct" means the function call parameters match the JSON below and provides only relevant information.

Here is more information on each function:
product_comparison: Compare features of two products. Should include either the product id or name. If the name or id is present in the question and not present in the generated function, the response is incorrect.
product_search: Search for products based on criteria.
track_package: Track the status of a package based on the tracking number.
customer_support: Get contact information for customer support regarding an issue. The response should always include an email or phone number.
apply_discount_code: Applies a discount code to an order.
product_details: Get detailed features on one product.
'''
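
Each template contains the placeholders `{question}`, `{tools}`, and `{response}`, which match the column names of `response_df`, so every row can be substituted into the template to form one evaluation prompt. The sketch below approximates that substitution with plain `str.format`. It is not Phoenix's actual implementation, and `EXAMPLE_TEMPLATE` and `row` are shortened, hypothetical stand-ins.

```python
# Shortened, hypothetical stand-in for ROUTER_EVAL_TEMPLATE.
EXAMPLE_TEMPLATE = """[BEGIN DATA]
[Question]: {question}
[Tool Response]: {tools}
[LLM Response]: {response}
[END DATA]"""

# One row of inputs, keyed by the same names as the placeholders.
row = {
    "question": "Where is my package?",
    "tools": "[]",
    "response": "Could you share your tracking number?",
}

prompt = EXAMPLE_TEMPLATE.format(**row)
print(prompt)
```

Because the mapping is by name, a dataframe passed to the evaluator needs columns named exactly like the placeholders.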

Let's run the evaluations with Phoenix's `llm_classify` function on the `response_df` dataframe we generated above!

In [None]:
from phoenix.evals import (
    OpenAIModel,
    llm_classify
)

rails = ["incorrect", "correct"]

router_eval_df = llm_classify(
    dataframe=response_df,
    template=ROUTER_EVAL_TEMPLATE,
    model=OpenAIModel('gpt-4o'),
    rails=rails,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4
)

function_selection_eval_df = llm_classify(
    dataframe=response_df,
    template=FUNCTION_SELECTION_EVAL_TEMPLATE,
    model=OpenAIModel('gpt-4o'),
    rails=rails,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4
)

parameter_extraction_eval_df = llm_classify(
    dataframe=response_df,
    template=PARAMETER_EXTRACTION_EVAL_TEMPLATE,
    model=OpenAIModel('gpt-4o'),
    rails=rails,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4
)


Let's inspect the results of our evaluation!

In [None]:
router_eval_df

Unnamed: 0,label,explanation,prompt,exceptions,execution_status,execution_seconds
0,correct,"The LLM Response is asking for more details about the item the user ordered recently. It did not make a function call, but it also did not need to at this point, as it is still gathering information to understand the user's request better. Therefore, the response is correct.","You are comparing a response to a question, and verifying whether that response should have made a function call instead of responding directly. Here is the data:\n [BEGIN DATA]\n ************\n [Question]: I was wondering if you could help me find a good deal on something I might have ordered recently, or maybe suggest something similar? \n ************\n [Tool Response]: []\n ************\n [LLM Response]: Could you please provide me with more details about the item ...",[],COMPLETED,3.774273
1,correct,"The user asked for information about eco-friendly kitchen gadgets. The tool's response was to call the 'product_search' function with the query 'eco-friendly kitchen gadgets', which is the correct function to find products based on specific criteria. Therefore, the tool's response is correct.","You are comparing a response to a question, and verifying whether that response should have made a function call instead of responding directly. Here is the data:\n [BEGIN DATA]\n ************\n [Question]: Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen? \n ************\n [Tool Response]: [ChatCompletionMessageToolCall(id='call_YHcjrv9WnsBZx84Zo56VCvMW', function=Function(arguments='{""query"":""eco-friendly kitchen ga...",[],COMPLETED,5.063081


In [None]:
function_selection_eval_df

Unnamed: 0,label,explanation,prompt,exceptions,execution_status,execution_seconds
0,incorrect,"The tool response is empty, meaning no function call was generated to answer the question. Therefore, it's incorrect.","You are comparing a function call response to a question and trying to determine if the generated call is correct. Here is the data:\n [BEGIN DATA]\n ************\n [Question]: I was wondering if you could help me find a good deal on something I might have ordered recently, or maybe suggest something similar? \n ************\n [Tool Response]: []\n ************\n [LLM Response]: Could you please provide me with more details about the item you ordered recently? This cou...",[],COMPLETED,2.356433
1,correct,"The function call is 'product_search' with arguments for 'query' as 'eco-friendly kitchen gadgets', 'category' as 'kitchen', 'page' as 1, and 'page_size' as 5. This matches the question asking about eco-friendly gadgets suitable for a modern kitchen. Therefore, the function call is correct.","You are comparing a function call response to a question and trying to determine if the generated call is correct. Here is the data:\n [BEGIN DATA]\n ************\n [Question]: Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen? \n ************\n [Tool Response]: [ChatCompletionMessageToolCall(id='call_YHcjrv9WnsBZx84Zo56VCvMW', function=Function(arguments='{""query"":""eco-friendly kitchen gadgets"",""category"":""kitchen"",""p...",[],COMPLETED,4.51092


In [None]:
parameter_extraction_eval_df

Unnamed: 0,label,explanation,prompt,exceptions,execution_status,execution_seconds
0,incorrect,"The tool response does not provide any function call or parameters, hence it does not match the JSON schema. Therefore, the response is incorrect.","You are comparing a function call response to a question and trying to determine if the generated call has extracted the exact right parameters from the question. Here is the data:\n [BEGIN DATA]\n ************\n [Question]: I was wondering if you could help me find a good deal on something I might have ordered recently, or maybe suggest something similar? \n ************\n [Tool Response]: []\n ************\n [LLM Response]: Could you please provide me with more detai...",[],COMPLETED,3.070908
1,correct,"The function call parameters match the JSON schema and provide only relevant information. The function 'product_search' is used correctly with the arguments 'query', 'category', 'page', and 'page_size' which are all appropriate for the user's question about eco-friendly kitchen gadgets.","You are comparing a function call response to a question and trying to determine if the generated call has extracted the exact right parameters from the question. Here is the data:\n [BEGIN DATA]\n ************\n [Question]: Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen? \n ************\n [Tool Response]: [ChatCompletionMessageToolCall(id='call_YHcjrv9WnsBZx84Zo56VCvMW', function=Function(arguments='{""query"":""eco-f...",[],COMPLETED,4.250648
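
Reading explanations row by row works for two examples, but a summary number is easier to compare across runs. The sketch below computes the share of "correct" labels per evaluator. The label lists are copied from the three result tables above, and `summarize_labels` is a hypothetical helper name.

```python
from collections import Counter


def summarize_labels(labels, positive_label="correct"):
    """Return label counts and the fraction of rows with the positive label."""
    labels = list(labels)
    counts = Counter(labels)
    accuracy = counts[positive_label] / len(labels) if labels else 0.0
    return {"counts": dict(counts), "accuracy": accuracy}


# Labels copied from the three result tables above.
results = {
    "router": ["correct", "correct"],
    "function_selection": ["incorrect", "correct"],
    "parameter_extraction": ["incorrect", "correct"],
}

for name, labels in results.items():
    print(name, summarize_labels(labels)["accuracy"])
```

The same helper accepts a dataframe column directly, for example `summarize_labels(router_eval_df["label"])`.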


# Create an experiment

With the dataset of questions we generated above, we can use the experiments feature to track how changes to models, prompts, and parameters affect our agent.

Let's create this dataset and upload it into the platform.

In [None]:
import os
from getpass import getpass
from uuid import uuid1

import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator,
)
from arize.experimental.datasets.utils.constants import GENERATIVE
from openai import OpenAI

# Prompt for credentials instead of hardcoding them in the notebook
developer_key = getpass("🔑 Enter your Arize developer key: ")
api_key = getpass("🔑 Enter your Arize API key: ")
space_id = getpass("🔑 Enter your Arize Space ID: ")

# Set up the arize client
arize_client = ArizeDatasetsClient(developer_key=developer_key, api_key=api_key)


dataset_id = arize_client.create_dataset(space_id=space_id, dataset_name="agents-cookbook-"+str(uuid1()), dataset_type=GENERATIVE, data=questions_df.head(2))
dataset = arize_client.get_dataset(space_id=space_id, dataset_id=dataset_id)
print(dataset)

                                                                                                                                  question  \
0  I was wondering if you could help me find a good deal on something I might have ordered recently, or maybe suggest something similar?     
1                            Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen?     

  tools response     created_at     updated_at  \
0                 1730401895132  1730401895132   
1                 1730401895132  1730401895132   

                                     id  
0  1139f54b-5bd1-4c10-8484-30c4bd97651d  
1  360bd645-35b3-47e7-915f-a405237b965f  


In [None]:
import nest_asyncio
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)

from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator
)

from arize.experimental.datasets.experiments.types import (
    Example,
    ExperimentRun,
)

class RouterEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        question = output.get("question")
        tools = output.get("tools")
        response = output.get("response")

        df_in = pd.DataFrame({"question": question, "tools": str(tools), "response": response}, index=[0])

        rails = ["correct", "incorrect"]
        expect_df = llm_classify(
            dataframe=df_in,
            template= ROUTER_EVAL_TEMPLATE,
            model=OpenAIModel(model="gpt-4o"),
            rails=rails,
            provide_explanation=True,
            run_sync=True
        )

        label = expect_df['label'][0]
        score = 1 if rails and label == rails[0] else 0 # Choose the 0 item in rails as the correct "1" label
        explanation = expect_df['explanation'][0]
        return EvaluationResult(score=score, label=label, explanation=explanation)


class FunctionSelectionEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        question = output.get("question")
        tools = output.get("tools")
        response = output.get("response")

        df_in = pd.DataFrame({"question": question, "tools": str(tools), "response": response}, index=[0])
        rails = ["correct", "incorrect"]
        expect_df = llm_classify(
            dataframe=df_in,
            template= FUNCTION_SELECTION_EVAL_TEMPLATE,
            model=OpenAIModel(model="gpt-4o"),
            rails=rails,
            provide_explanation=True,
            run_sync=True
        )

        label = expect_df['label'][0]
        score = 1 if rails and label == rails[0] else 0 # Choose the 0 item in rails as the correct "1" label
        explanation = expect_df['explanation'][0]
        return EvaluationResult(score=score, label=label, explanation=explanation)



class ParameterExtractionEvaluator(Evaluator):
    """Scores the task output with an LLM judge using PARAMETER_EXTRACTION_EVAL_TEMPLATE."""

    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        # The task output is a dict holding the question, tools, and response.
        question = output.get("question")
        tools = output.get("tools")
        response = output.get("response")

        # llm_classify takes a dataframe, so wrap the single output in a one-row frame.
        df_in = pd.DataFrame({"question": question, "tools": str(tools), "response": response}, index=[0])

        rails = ["correct", "incorrect"]
        expect_df = llm_classify(
            dataframe=df_in,
            template=PARAMETER_EXTRACTION_EVAL_TEMPLATE,
            model=OpenAIModel(model="gpt-4o"),
            rails=rails,
            provide_explanation=True,
            run_sync=True,
        )

        label = expect_df["label"][0]
        score = 1 if label == rails[0] else 0  # rails[0] ("correct") scores 1; any other label scores 0
        explanation = expect_df["explanation"][0]
        return EvaluationResult(score=score, label=label, explanation=explanation)
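
The three evaluators above differ only in the template they pass to `llm_classify`; the step that turns the judge's label into a score is the same in each. As a minimal sketch of that step in isolation, the hypothetical helper `label_to_score` below (not part of Arize or Phoenix) maps a label to the binary score. It assumes, as the classes do, that the first rail is the passing label, so any other value, including a missing label, scores 0.

```python
from typing import Optional, Sequence


def label_to_score(label: Optional[str], rails: Sequence[str]) -> int:
    """Return 1 when `label` equals the first rail, else 0.

    Hypothetical helper for illustration; it mirrors the scoring line
    used in the evaluators.
    """
    return 1 if rails and label == rails[0] else 0


rails = ["correct", "incorrect"]
# Score a passing label, a failing label, and a missing label.
scores = [label_to_score(label, rails) for label in ("correct", "incorrect", None)]
print(scores)
```

Keeping the mapping in one place would make it easier to change the passing label or add a third rail later without editing each evaluator.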

In [None]:
dataset

Unnamed: 0,questions,tools,response,created_at,updated_at,id
0,"I was wondering if you could help me find a good deal on something I might have ordered recently, or maybe suggest something similar?",,,1730394219421,1730394219421,053edb60-265d-4e40-93b9-e2033e34dc53
1,Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen?,,,1730394219421,1730394219421,923bd867-6235-46f5-bef7-3d5dfa759851
2,"I think I used a discount code a while back, but I'm not sure if it applied correctly—can you check that for me?",,,1730394219421,1730394219421,092ec16b-4454-44e6-aa5a-8fc936e8a36e
3,Perhaps you could assist me with tracking a package that I believe was sent out not too long ago?,,,1730394219421,1730394219421,ce1bd7b5-84a9-4323-a146-605d25c68ac0
4,"I'm curious about your range of high-performance laptops, especially those that might be good for both gaming and work.",,,1730394219421,1730394219421,7646c9d7-76d6-4f24-a879-48bf4e764a13
5,"Is there a way to apply a discount code to an order I placed a few weeks ago, or is it too late for that?",,,1730394219421,1730394219421,37a19952-a569-49f4-b0b8-e73138a92e81
6,"I was hoping you could help me compare some of your top-rated home appliances, but I'm not sure which ones to look at.",,,1730394219421,1730394219421,daed4514-9ad2-4bf1-8f5c-4ef102cfbc29
7,Can you assist me with finding a package that I think was supposed to arrive recently?,,,1730394219421,1730394219421,d08eecfc-8bf6-48ff-a4ba-9611ca2ed0ea
8,"I'm interested in learning more about your collection of energy-efficient devices, particularly those that could enhance a home office.",,,1730394219421,1730394219421,5c48f983-743f-493c-96c3-ef44bbcf80f4
9,"I need some information on a product I might have purchased, or maybe I'm just considering buying it—can you help?",,,1730394219421,1730394219421,c09f9a97-b36f-45e8-89bf-8d6560e54c18


In [None]:
def prompt_gen_task(example):
    return run_prompt(example.dataset_row.get("question"))

## Run Experiment
arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=prompt_gen_task,
    evaluators=[RouterEvaluator(), FunctionSelectionEvaluator(), ParameterExtractionEvaluator()],
    experiment_name="agents-cookbook-exp-"+str(uuid1())[:5]
)
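
Conceptually, an experiment runs the task once per dataset row and then applies every evaluator to each task output, which is consistent with the progress output in this cell: 2 tasks and 6 evaluations (2 rows x 3 evaluators). The sketch below imitates that flow locally with stubbed pieces. `run_local_experiment`, `stub_task`, `has_response`, and `called_tool` are hypothetical names for illustration, the stub task takes a plain dict rather than an Arize example object, and nothing here reflects how `arize_client.run_experiment` is implemented. It makes no network or LLM calls.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, str]
Output = Dict[str, Any]


def stub_task(row: Row) -> Output:
    """Stand-in for prompt_gen_task: returns the dict shape the evaluators read."""
    return {"question": row["question"], "tools": [], "response": "stub response"}


def has_response(output: Output) -> int:
    """Stub evaluator: 1 if the task produced a non-empty response."""
    return 1 if output.get("response") else 0


def called_tool(output: Output) -> int:
    """Stub evaluator: 1 if at least one tool was recorded."""
    return 1 if output.get("tools") else 0


def run_local_experiment(
    rows: List[Row],
    task: Callable[[Row], Output],
    evaluators: Dict[str, Callable[[Output], int]],
) -> List[Dict[str, Any]]:
    """Run the task on every row, then score each output with every evaluator."""
    results = []
    for row in rows:
        output = task(row)
        scores = {name: evaluate(output) for name, evaluate in evaluators.items()}
        results.append({"output": output, "scores": scores})
    return results


rows = [{"question": "Where is my package?"}, {"question": "Any laptop deals?"}]
results = run_local_experiment(
    rows, stub_task, {"has_response": has_response, "called_tool": called_tool}
)
# Total evaluations = number of rows x number of evaluators.
total_evaluations = sum(len(r["scores"]) for r in results)
print(total_evaluations)
```

A dry run like this can catch a mismatch between the keys the task returns and the keys the evaluators read before any LLM calls are spent.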

arize.utils.logging | INFO | 🧪 Experiment started.


running tasks |          | 0/2 (0.0%) | ⏳ 00:00<? | ?it/s

ChatCompletion(id='chatcmpl-AOVTUiNSEXv52SiGgRAF0aI4DTa6X', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Could you please provide me with more details about the item you ordered recently? This could include the product name, category, or any specific features you're looking for. This will help me find a good deal or suggest something similar.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1730404224, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_f59a81427f', usage=CompletionUsage(completion_tokens=46, prompt_tokens=359, total_tokens=405, completion_tokens_details=CompletionTokensDetails(audio_tokens=None, reasoning_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0)))
ChatCompletion(id='chatcmpl-AOVTVUN2hxPa4TsDl3DbHjoI8B2qp', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, m

running experiment evaluations |          | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s

llm_classify |          | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s

arize.utils.logging | INFO | ✅ All evaluators completed.


('RXhwZXJpbWVudDoyMDA3OkRZUHQ=',
               id                            example_id  \
 0  EXP_ID_e5db1b  1139f54b-5bd1-4c10-8484-30c4bd97651d   
 1  EXP_ID_178be7  360bd645-35b3-47e7-915f-a405237b965f   
 
                                                                                                                                                                                                                                                                                                                                                                                                                                 result  \
 0  {"question": "I was wondering if you could help me find a good deal on something I might have ordered recently, or maybe suggest something similar?  ", "tools": [], "response": "Could you please provide me with more details about the item you ordered recently? This could include the product name, category, or any specific features you're looking for. This