<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# <center>Using Arize with AI agents</center>

This guide shows you how to create and evaluate agents with Arize to improve performance. We'll go through the following steps:

* Create a customer support agent using a router template

* Trace the agent activity, including function calling

* Create a dataset to benchmark performance

* Evaluate agent performance using code, human annotation, and LLM as a judge

* Experiment with different prompts and models

# Initial setup


We'll set up our libraries, keys, and OpenAI tracing with Arize.

### Install Libraries

In [None]:
!pip install -qq arize-otel openai openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp gcsfs nest_asyncio arize-phoenix "arize[Datasets]" 'httpx<0.28'

### Setup Keys

In [None]:
import os
from getpass import getpass
import nest_asyncio

nest_asyncio.apply()

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

### Setup Tracing

To follow along with this tutorial, you'll need to sign up for Arize and get your Space ID and API key. You can see the [guide here](https://docs.arize.com/arize/llm-tracing/quickstart-llm).

In [None]:
# Import open-telemetry dependencies
from arize_otel import register_otel, Endpoints

# Setup OTEL via our convenience function
register_otel(
    endpoints=Endpoints.ARIZE,
    space_id=getpass("🔑 Enter your Arize Space ID: "),
    api_key=getpass("🔑 Enter your Arize API key: "),
    project_name="agents-cookbook",  # name this to whatever you would like
)
# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.openai import OpenAIInstrumentor

# Finish automatic instrumentation
OpenAIInstrumentor().instrument()

# Create customer support agent

We'll create a customer support agent that uses function calling, following the architecture below:

<img src="https://storage.cloud.google.com/arize-assets/tutorials/images/agent_architecture.png" width="800"/>

### Setup functions and create customer support agent

We have 6 functions that we define below.

1. product_comparison
2. product_search
3. customer_support
4. track_package
5. product_details
6. apply_discount_code



In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "product_comparison",
            "description": "Compare features of two products.",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_a_id": {
                        "type": "string",
                        "description": "The unique identifier of Product A.",
                    },
                    "product_b_id": {
                        "type": "string",
                        "description": "The unique identifier of Product B.",
                    },
                },
                "required": ["product_a_id", "product_b_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "product_search",
            "description": "Search for products based on criteria.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query string.",
                    },
                    "category": {
                        "type": "string",
                        "description": "The category to filter the search.",
                    },
                    "min_price": {
                        "type": "number",
                        "description": "The minimum price of the products to search.",
                        "default": 0,
                    },
                    "max_price": {
                        "type": "number",
                        "description": "The maximum price of the products to search.",
                    },
                    "page": {
                        "type": "integer",
                        "description": "The page number for pagination.",
                        "default": 1,
                    },
                    "page_size": {
                        "type": "integer",
                        "description": "The number of results per page.",
                        "default": 20,
                    },
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "customer_support",
            "description": "Get contact information for customer support regarding an issue.",
            "parameters": {
                "type": "object",
                "properties": {
                    "issue_type": {
                        "type": "string",
                        "description": "The type of issue (e.g., billing, technical support).",
                    }
                },
                "required": ["issue_type"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Track the status of a package based on the tracking number.",
            "parameters": {
                "type": "object",
                "properties": {
                    "tracking_number": {
                        "type": "integer",
                        "description": "The tracking number of the package.",
                    }
                },
                "required": ["tracking_number"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "product_details",
            "description": "Returns details for a given product id",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_id": {
                        "type": "string",
                        "description": "The id of a product to look up.",
                    }
                },
                "required": ["product_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "apply_discount_code",
            "description": "Applies the discount code to a given order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "integer",
                        "description": "The id of the order to apply the discount code to.",
                    },
                    "discount_code": {
                        "type": "string",
                        "description": "The discount code to apply",
                    },
                },
                "required": ["order_id", "discount_code"],
            },
        },
    },
]
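
Because the router's behavior depends entirely on these schemas, it can help to sanity-check them before sending them to the model. The sketch below is one possible check, not part of the OpenAI SDK or Arize: `find_schema_problems` is a hypothetical helper that verifies tool names are unique and that every name listed under `required` is also declared under `properties`. It uses a trimmed copy of one tool (`sample_tools`, also an assumption) so it runs on its own.

```python
def find_schema_problems(tool_defs):
    """Return a list of human-readable problems found in tool definitions."""
    problems = []
    seen_names = set()
    for tool in tool_defs:
        fn = tool["function"]
        name = fn["name"]
        # Tool names must be unique so the router's choice is unambiguous.
        if name in seen_names:
            problems.append(f"duplicate tool name: {name}")
        seen_names.add(name)
        params = fn.get("parameters", {})
        declared = set(params.get("properties", {}))
        # Every required parameter should also be described under "properties".
        for required_name in params.get("required", []):
            if required_name not in declared:
                problems.append(
                    f"{name}: required parameter '{required_name}' is not declared"
                )
    return problems


# Hypothetical, trimmed copy of one tool definition from the list above.
sample_tools = [
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Track the status of a package based on the tracking number.",
            "parameters": {
                "type": "object",
                "properties": {"tracking_number": {"type": "integer"}},
                "required": ["tracking_number"],
            },
        },
    }
]

print(find_schema_problems(sample_tools))
```

In the notebook, the same helper could be called as `find_schema_problems(tools)`; an empty list means no problems were found by these two checks.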

Below, we define a function called `run_prompt`, which makes an OpenAI chat completion call with the tools defined above.

In [None]:
import os

import openai

client = openai.Client()


def run_prompt(input):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        tools=tools,
        tool_choice="auto",
        messages=[
            {
                "role": "system",
                "content": " ",
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )
    print(response)

    if (
        hasattr(response.choices[0].message, "tool_calls")
        and response.choices[0].message.tool_calls is not None
        and len(response.choices[0].message.tool_calls) > 0
    ):
        tool_calls = response.choices[0].message.tool_calls
    else:
        tool_calls = []

    if response.choices[0].message.content is None:
        response.choices[0].message.content = ""
    ret = {
        "question": input,
        "tools": tool_calls,
        "response": response.choices[0].message.content,
    }
    return ret

Let's test it and see if it returns the right function! Based on whether we set tool_choice to "auto" or "required", the router will have different behavior.

In [None]:
run_prompt("Hi, I'd like to apply a discount code to my order.")
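
`run_prompt` returns the raw tool-call objects under the `tools` key, which are awkward to read at a glance. The following is a minimal sketch of turning them into plain name and argument pairs. It assumes each tool call exposes `function.name` and a JSON string in `function.arguments`, as in the OpenAI chat completions response; `summarize_tool_calls` is a hypothetical helper, and the `SimpleNamespace` object stands in for a real response so the example runs offline.

```python
import json
from types import SimpleNamespace


def summarize_tool_calls(tool_calls):
    """Convert tool-call objects into (function_name, arguments_dict) pairs."""
    summary = []
    for call in tool_calls:
        try:
            # Arguments arrive as a JSON-encoded string, not a dict.
            arguments = json.loads(call.function.arguments or "{}")
        except json.JSONDecodeError:
            # Keep the entry, but flag arguments that could not be parsed.
            arguments = None
        summary.append((call.function.name, arguments))
    return summary


# Stand-in for one element of response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    function=SimpleNamespace(
        name="apply_discount_code",
        arguments='{"order_id": 123, "discount_code": "SAVE10"}',
    )
)

print(summarize_tool_calls([fake_call]))
```

With `tool_choice="auto"` the list may be empty when the model answers directly, so an empty summary is a valid outcome rather than an error.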

Now that we have a basic agent, let's generate a dataset of questions and run the prompt against this dataset!

# Create synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 25 questions we can use to test our customer support agent.

In [None]:
GEN_TEMPLATE = """
You are an assistant that generates complex customer service questions.
The questions should often involve:

Multiple Categories: Questions that could logically fall into more than one category (e.g., combining product details with a discount code).
Vague Details: Questions with limited or vague information that require clarification to categorize correctly.
Mixed Intentions: Queries where the customer’s goal or need is unclear or seems to conflict within the question itself.
Indirect Language: Use of indirect or polite phrasing that obscures the direct need or request (e.g., using "I was wondering if..." or "Perhaps you could help me with...").
For specific categories:

Track Package: Include vague timing references (e.g., "recently" or "a while ago") instead of specific dates.
Product Comparison and Product Search: Include generic descriptors without specific product names or IDs (e.g., "high-end smartphones" or "energy-efficient appliances").
Apply Discount Code: Include questions about discounts that might apply to hypothetical or past situations, or without mentioning if they have made a purchase.
Product Details: Ask for comparisons or details that involve multiple products or categories ambiguously (e.g., "Tell me about your range of electronics that are good for home office setups").

Examples of More Challenging Questions
"There's an issue with one of the items I think I bought last month—what should I do?"
"I need help with something I ordered, or maybe I'm just looking for something new. Can you help?"

Some questions should be straightforward uses of the provided functions

Respond with a list, one question per line. Do not include any numbering at the beginning of each line. Do not include any category headings.
Generate 25 questions. Be sure there are no duplicate questions.
"""

In [None]:
import nest_asyncio
import pandas as pd

nest_asyncio.apply()
from phoenix.evals import OpenAIModel

pd.set_option("display.max_colwidth", 500)

model = OpenAIModel(model="gpt-4o", max_tokens=1300)

In [None]:
resp = model(GEN_TEMPLATE)

In [None]:
split_response = resp.strip().split("\n")

questions_df = pd.DataFrame(split_response, columns=["question"])
questions_df["tools"] = ""
questions_df["response"] = ""
print(questions_df)
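
The template asks for one question per line, but model output can still contain blank lines, stray numbering, or duplicates. Below is a hedged sketch of a cleanup step that could run before building `questions_df`. `clean_question_lines` is a hypothetical helper, not part of Phoenix or pandas, and the sample text is made up for illustration.

```python
import re


def clean_question_lines(raw_text):
    """Split model output into unique, non-empty question strings."""
    cleaned = []
    seen = set()
    for line in raw_text.splitlines():
        # Drop leading bullets or numbering such as "1. ", "2) ", "- ", "* ".
        text = re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", line).strip()
        if not text:
            continue
        # Compare case-insensitively so near-identical repeats are dropped.
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


sample = "1. Where is my package?\n\n- Where is my package?\n2) Can I use a discount code?"
print(clean_question_lines(sample))
```

In the notebook this could replace the plain split, for example `pd.DataFrame(clean_question_lines(resp), columns=["question"])`.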

Now let's run it and manually inspect the traces! Change the value in `.head(2)` to any number between 1 and 25 to run it on that many of the questions we generated earlier.

Then manually inspect the outputs in Arize.

In [None]:
response_df = (
    questions_df["question"].head(2).apply(run_prompt).apply(pd.Series)
)

In [None]:
response_df

# Evaluating your agent

Now that we have a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.

Here, we define evaluation templates to judge whether the router correctly decided to call a function, whether it selected the right function, and whether it filled in the arguments correctly.

In [None]:
ROUTER_EVAL_TEMPLATE = """ You are comparing a response to a question, and verifying whether that response should have made a function call instead of responding directly. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Response]: {tools}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the Question above to the response. You must determine whether the response
decided to call the correct function.
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the agent should have made a function call instead of responding directly and did not, or the function call chosen was the incorrect one.
"correct" means the selected function would correctly and fully answer the user's question.

Here is more information on each function:
product_comparison: Compare features of two products. Should include either the product id or name. If the name or id is present in the question and not present in the generated function, the response is incorrect.
product_search: Search for products based on criteria.
track_package: Track the status of a package based on the tracking number.
customer_support: Get contact information for customer support regarding an issue. The response should always include an email or phone number.
apply_discount_code: Applies a discount code to an order.
product_details: Get detailed features on one product.
"""

FUNCTION_SELECTION_EVAL_TEMPLATE = """ You are comparing a function call response to a question and trying to determine if the generated call is correct. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Response]: {tools}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the Question above to the function call. You must determine whether the function call
will return the answer to the Question. Please focus on whether the very specific
question can be answered by the function call.
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the function call will not provide an answer to the Question.
"correct" means the function call will definitely provide an answer to the Question.

Here is more information on each function:
product_comparison: Compare features of two products. Should include either the product id or name. If the name or id is present in the question and not present in the generated function, the response is incorrect.
product_search: Search for products based on criteria.
track_package: Track the status of a package based on the tracking number.
customer_support: Get contact information for customer support regarding an issue. The response should always include an email or phone number.
apply_discount_code: Applies a discount code to an order.
product_details: Get detailed features on one product.
"""

PARAMETER_EXTRACTION_EVAL_TEMPLATE = """ You are comparing a function call response to a question and trying to determine if the generated call has extracted the exact right parameters from the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Response]: {tools}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the parameters in the generated function against the JSON provided below.
The parameters extracted from the question must match the JSON below exactly.
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the parameters in the function do not match the JSON schema below exactly, or the generated function does not correctly answer the user's question.
You should also respond with "incorrect" if the response makes up information that is not in the JSON schema.
"correct" means the function call parameters match the JSON below and provides only relevant information.

Here is more information on each function:
product_comparison: Compare features of two products. Should include either the product id or name. If the name or id is present in the question and not present in the generated function, the response is incorrect.
product_search: Search for products based on criteria.
track_package: Track the status of a package based on the tracking number.
customer_support: Get contact information for customer support regarding an issue. The response should always include an email or phone number.
apply_discount_code: Applies a discount code to an order.
product_details: Get detailed features on one product.
"""

Let's run evaluations using Phoenix's llm_classify function for our responses dataframe we generated above!

In [None]:
from phoenix.evals import OpenAIModel, llm_classify

rails = ["incorrect", "correct"]

router_eval_df = llm_classify(
    dataframe=response_df,
    template=ROUTER_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=rails,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)

function_selection_eval_df = llm_classify(
    dataframe=response_df,
    template=FUNCTION_SELECTION_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=rails,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)

parameter_extraction_eval_df = llm_classify(
    dataframe=response_df,
    template=PARAMETER_EXTRACTION_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=rails,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)

Let's inspect the results of our evaluation!

In [None]:
router_eval_df

In [None]:
function_selection_eval_df

In [None]:
parameter_extraction_eval_df
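
Reading the dataframes row by row stops scaling once you run more than a handful of questions. One possible summary is the share of rows labeled "correct" per evaluator. The sketch below uses only the standard library; `label_accuracy` is a hypothetical helper, and the hard-coded labels stand in for a real `label` column.

```python
from collections import Counter


def label_accuracy(labels, positive_label="correct"):
    """Return (fraction of labels equal to positive_label, label counts)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero when no rows were evaluated.
        return 0.0, counts
    return counts[positive_label] / total, counts


# Made-up labels for illustration; any label other than "correct" counts against accuracy.
accuracy, counts = label_accuracy(["correct", "incorrect", "correct", "correct"])
print(f"accuracy={accuracy:.2f}", dict(counts))
```

In the notebook, the helper could be applied to each result, for example `label_accuracy(router_eval_df["label"])`, since any iterable of labels is accepted.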

# Create an experiment

With the dataset of questions we generated above, we can use Arize's experiments feature to track changes across models, prompts, and parameters for our agent.

Let's create this dataset and upload it into the platform.

In [None]:
from arize.experimental.datasets import ArizeDatasetsClient
import os
from uuid import uuid1
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator,
)
from arize.experimental.datasets.utils.constants import GENERATIVE

import pandas as pd

# Your Arize credentials, entered at runtime so that no secrets are hardcoded in the notebook
from getpass import getpass

developer_key = getpass("🔑 Enter your Arize developer key: ")
api_key = getpass("🔑 Enter your Arize API key: ")
space_id = getpass("🔑 Enter your Arize Space ID: ")

# Set up the arize client
arize_client = ArizeDatasetsClient(developer_key=developer_key, api_key=api_key)


dataset_id = arize_client.create_dataset(
    space_id=space_id,
    dataset_name="agents-cookbook-" + str(uuid1()),
    dataset_type=GENERATIVE,
    data=questions_df.head(2),
)
dataset = arize_client.get_dataset(space_id=space_id, dataset_id=dataset_id)
print(dataset)

In [None]:
import nest_asyncio
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)


class RouterEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        question = output.get("question")
        tools = output.get("tools")
        response = output.get("response")

        df_in = pd.DataFrame(
            {"question": question, "tools": str(tools), "response": response},
            index=[0],
        )

        rails = ["correct", "incorrect"]
        expect_df = llm_classify(
            dataframe=df_in,
            template=ROUTER_EVAL_TEMPLATE,
            model=OpenAIModel(model="gpt-4o"),
            rails=rails,
            provide_explanation=True,
            run_sync=True,
        )

        label = expect_df["label"][0]
        score = (
            1 if rails and label == rails[0] else 0
        )  # Choose the 0 item in rails as the correct "1" label
        explanation = expect_df["explanation"][0]
        return EvaluationResult(
            score=score, label=label, explanation=explanation
        )


class FunctionSelectionEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        question = output.get("question")
        tools = output.get("tools")
        response = output.get("response")

        df_in = pd.DataFrame(
            {"question": question, "tools": str(tools), "response": response},
            index=[0],
        )
        rails = ["correct", "incorrect"]
        expect_df = llm_classify(
            dataframe=df_in,
            template=FUNCTION_SELECTION_EVAL_TEMPLATE,
            model=OpenAIModel(model="gpt-4o"),
            rails=rails,
            provide_explanation=True,
            run_sync=True,
        )

        label = expect_df["label"][0]
        score = (
            1 if rails and label == rails[0] else 0
        )  # Choose the 0 item in rails as the correct "1" label
        explanation = expect_df["explanation"][0]
        return EvaluationResult(
            score=score, label=label, explanation=explanation
        )


class ParameterExtractionEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        question = output.get("question")
        tools = output.get("tools")
        response = output.get("response")

        df_in = pd.DataFrame(
            {"question": question, "tools": str(tools), "response": response},
            index=[0],
        )
        rails = ["correct", "incorrect"]
        expect_df = llm_classify(
            dataframe=df_in,
            template=PARAMETER_EXTRACTION_EVAL_TEMPLATE,
            model=OpenAIModel(model="gpt-4o"),
            rails=rails,
            provide_explanation=True,
            run_sync=True,
        )

        label = expect_df["label"][0]
        score = (
            1 if rails and label == rails[0] else 0
        )  # Choose the 0 item in rails as the correct "1" label
        explanation = expect_df["explanation"][0]
        return EvaluationResult(
            score=score, label=label, explanation=explanation
        )
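
The three evaluators above all rely on an LLM judge. A code-based check can complement them for properties that are mechanically verifiable, such as whether a generated call names a known tool and supplies every required argument. The sketch below is plain Python and independent of the Arize `Evaluator` base class; `check_tool_call` and `sample_tools` are hypothetical names introduced for illustration.

```python
import json


def check_tool_call(name, arguments_json, tool_defs):
    """Return (label, explanation) for one generated function call."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tool_defs}
    if name not in schemas:
        return "incorrect", f"unknown tool: {name}"
    try:
        arguments = json.loads(arguments_json)
    except json.JSONDecodeError:
        return "incorrect", "arguments are not valid JSON"
    if not isinstance(arguments, dict):
        return "incorrect", "arguments must be a JSON object"
    schema = schemas[name]
    # Every required parameter must be present in the generated call.
    missing = [p for p in schema.get("required", []) if p not in arguments]
    if missing:
        return "incorrect", f"missing required arguments: {missing}"
    # Arguments that the schema does not declare suggest hallucinated parameters.
    unexpected = [a for a in arguments if a not in schema.get("properties", {})]
    if unexpected:
        return "incorrect", f"unexpected arguments: {unexpected}"
    return "correct", "known tool with all required arguments"


# Hypothetical, trimmed copy of one tool definition from earlier in the notebook.
sample_tools = [
    {
        "type": "function",
        "function": {
            "name": "apply_discount_code",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "integer"},
                    "discount_code": {"type": "string"},
                },
                "required": ["order_id", "discount_code"],
            },
        },
    }
]

print(check_tool_call("apply_discount_code", '{"order_id": 42}', sample_tools))
```

A check like this only confirms that the call is well formed; whether the chosen tool actually answers the question still needs the LLM judge or human annotation.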

In [None]:
dataset

In [None]:
def prompt_gen_task(example):
    return run_prompt(example.dataset_row.get("question"))


## Run Experiment
arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=prompt_gen_task,
    evaluators=[
        RouterEvaluator(),
        FunctionSelectionEvaluator(),
        ParameterExtractionEvaluator(),
    ],
    experiment_name="agents-cookbook-exp-" + str(uuid1())[:5],
)