[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Arize-ai/phoenix/blob/main/tutorials/experiments/tool_calling_eval_dataset.ipynb)


<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# Tool Calling Evaluation — Dataset Preparation

This notebook is a companion resource to the **How to Evaluate Tool-Calling Agents with Phoenix** tutorial (link here).

It uploads the `travel-assistant-tool-calling` dataset and `travel-assistant` prompt to your Phoenix instance — the starting point for the full evaluation workflow covered in the tutorial.

## Install Dependencies

In [None]:
!pip install "arize-phoenix>=13" pandas

# Section 1: Define the Tool Set

Six tools define the capabilities of the travel planning assistant used in the tutorial.

| Tool | Description |
|---|---|
| `search_flights` | Search available flights between two cities on a given date |
| `get_weather` | Get current weather or forecast for a location |
| `search_hotels` | Find hotels in a city for given dates and guest count |
| `get_directions` | Get travel directions and estimated time between two locations |
| `convert_currency` | Convert an amount from one currency to another |
| `search_restaurants` | Find restaurants in a location by cuisine or criteria |

In [None]:
TRAVEL_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_flights",
            "description": "Search available flights between two cities on a given date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "origin": {
                        "type": "string",
                        "description": "Departure city or airport code (e.g. New York or JFK)",
                    },
                    "destination": {
                        "type": "string",
                        "description": "Arrival city or airport code (e.g. Los Angeles or LAX)",
                    },
                    "date": {"type": "string", "description": "Travel date in YYYY-MM-DD format"},
                    "cabin_class": {
                        "type": "string",
                        "enum": ["economy", "business", "first"],
                        "description": "Cabin class preference",
                    },
                    "num_passengers": {"type": "integer", "description": "Number of passengers"},
                },
                "required": ["origin", "destination", "date"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather or forecast for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name or location (e.g. Paris, France)",
                    },
                    "date": {
                        "type": "string",
                        "description": "Date for forecast in YYYY-MM-DD format. Omit for current weather.",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_hotels",
            "description": "Find hotels in a city for given check-in and check-out dates.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City to search for hotels"},
                    "check_in": {
                        "type": "string",
                        "description": "Check-in date in YYYY-MM-DD format",
                    },
                    "check_out": {
                        "type": "string",
                        "description": "Check-out date in YYYY-MM-DD format",
                    },
                    "guests": {"type": "integer", "description": "Number of guests"},
                    "max_price_per_night": {
                        "type": "number",
                        "description": "Maximum price per night in USD",
                    },
                },
                "required": ["city", "check_in", "check_out"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_directions",
            "description": "Get travel directions and estimated travel time between two locations.",
            "parameters": {
                "type": "object",
                "properties": {
                    "origin": {"type": "string", "description": "Starting location or address"},
                    "destination": {
                        "type": "string",
                        "description": "Destination location or address",
                    },
                    "mode": {
                        "type": "string",
                        "enum": ["driving", "walking", "transit", "cycling"],
                        "description": "Mode of transport",
                    },
                },
                "required": ["origin", "destination"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "convert_currency",
            "description": "Convert an amount from one currency to another using current exchange rates.",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number", "description": "Amount to convert"},
                    "from_currency": {
                        "type": "string",
                        "description": "Source currency code (e.g. USD, EUR)",
                    },
                    "to_currency": {
                        "type": "string",
                        "description": "Target currency code (e.g. GBP, JPY)",
                    },
                },
                "required": ["amount", "from_currency", "to_currency"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_restaurants",
            "description": "Find restaurants in a location by cuisine type or other criteria.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City or neighborhood to search"},
                    "cuisine": {
                        "type": "string",
                        "description": "Cuisine type (e.g. Italian, Japanese, vegan)",
                    },
                    "min_rating": {
                        "type": "number",
                        "description": "Minimum rating on a 1.0 to 5.0 scale",
                    },
                    "price_range": {
                        "type": "string",
                        "enum": ["$", "$$", "$$$", "$$$$"],
                        "description": "Price range",
                    },
                },
                "required": ["location"],
            },
        },
    },
]

print("Defined tools:", [t["function"]["name"] for t in TRAVEL_TOOLS])
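
Before attaching schemas like these to a prompt, it can help to confirm that every argument listed under `required` is also declared under `properties`. The following is a minimal sketch of such a check; `check_tool_schema` and `sample_tools` are hypothetical names introduced for illustration and are not part of Phoenix. The sample deliberately contains one mistake so the check has something to report.

```python
def check_tool_schema(tool: dict) -> list:
    """Return a list of problems found in one OpenAI-style tool definition."""
    problems = []
    fn = tool.get("function", {})
    name = fn.get("name", "<unnamed>")
    params = fn.get("parameters", {})
    properties = params.get("properties", {})

    if tool.get("type") != "function":
        problems.append(f"{name}: top-level type should be 'function'")
    if params.get("type") != "object":
        problems.append(f"{name}: parameters.type should be 'object'")
    # Every required argument must be declared under properties.
    for arg in params.get("required", []):
        if arg not in properties:
            problems.append(f"{name}: required argument '{arg}' is not declared")
    return problems


# Hypothetical sample with a deliberate mistake: 'date' is required but undeclared.
sample_tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather or forecast for a location.",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location", "date"],
            },
        },
    }
]

all_problems = [p for tool in sample_tools for p in check_tool_schema(tool)]
print(all_problems)
```

Running the same loop over `TRAVEL_TOOLS` instead of `sample_tools` should produce an empty list if the definitions above are internally consistent.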

# Section 2: Load the Evaluation Dataset

The evaluation dataset contains 30 travel assistant queries with ground truth tool calls, covering three scenarios:

| Pattern | Count | Description |
|---|---|---|
| Single-tool | 18 | One tool needed; tests parameter extraction, implicit dates, ambiguous phrasing |
| Parallel (2 tools) | 10 | Two tools needed simultaneously; 10 distinct two-tool combinations represented (six tools allow 15 possible pairs) |
| No tool needed | 2 | General travel questions the assistant should answer directly |

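Each example has an `input` query, an `expected_tool_name` field (a single tool name, comma-separated names for parallel calls, or `none`), and an `expected_tool_calls` label with the full tool name and arguments.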

In [None]:
import json
import urllib.request

# Load the curated dataset from Google Cloud Storage
_dataset_url = "https://storage.googleapis.com/arize-phoenix-assets/assets/datasets/travel_assistant_tool_calling_eval_dataset.json"
with urllib.request.urlopen(_dataset_url) as response:
    examples = json.load(response)

print(f"Loaded {len(examples)} examples")

# Show distribution

single = [
    e for e in examples if "," not in e["expected_tool_name"] and e["expected_tool_name"] != "none"
]
parallel = [e for e in examples if "," in e["expected_tool_name"]]
no_tool = [e for e in examples if e["expected_tool_name"] == "none"]

print(f"  Single-tool: {len(single)}")
print(f"  Parallel:    {len(parallel)}")
print(f"  No tool:     {len(no_tool)}")
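
The three list comprehensions above all branch on the same field, so the rule can also be written once as a small function and counted with `collections.Counter`. This is a sketch only: `classify_pattern` and `sample_examples` are hypothetical names, and the sample records merely imitate the `expected_tool_name` layout implied by the cell above.

```python
from collections import Counter


def classify_pattern(expected_tool_name: str) -> str:
    """Map an expected_tool_name value to one of the three dataset patterns."""
    if expected_tool_name == "none":
        return "no_tool"
    if "," in expected_tool_name:
        return "parallel"
    return "single"


# Hypothetical records that imitate the field layout used above.
sample_examples = [
    {"expected_tool_name": "get_directions"},
    {"expected_tool_name": "search_flights,get_weather"},
    {"expected_tool_name": "none"},
    {"expected_tool_name": "convert_currency"},
]

pattern_counts = Counter(classify_pattern(e["expected_tool_name"]) for e in sample_examples)
print(dict(pattern_counts))
```

Keeping the rule in one place makes it harder for the three categories to overlap or miss an example by accident.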

In [None]:
# Preview a few examples
print("=== Single-tool example ===")
ex = next(e for e in examples if e["expected_tool_name"] == "get_directions")
print("Input:          ", ex["input"])
print("Expected tool:  ", ex["expected_tool_name"])
print("Expected call:  ", json.dumps(ex["expected_tool_calls"], indent=2))

print("\n=== Parallel example ===")
ex = next(e for e in examples if "," in e["expected_tool_name"])
print("Input:          ", ex["input"])
print("Expected tools: ", ex["expected_tool_name"])
print("Expected calls: ", json.dumps(ex["expected_tool_calls"], indent=2))

print("\n=== No-tool example ===")
ex = next(e for e in examples if e["expected_tool_name"] == "none")
print("Input:          ", ex["input"])
print("Expected tools: none")

# Section 3: Build the Dataset DataFrame

The dataset has two columns:

| Column | Type | Purpose |
|---|---|---|
| `query` | string | User's travel query — mapped to `{{query}}` in the experiment prompt |
| `expected_tool_calls` | JSON string | Full name + arguments for each call — used for invocation alignment evaluation |


In [None]:
import pandas as pd

df = pd.DataFrame(
    {
        "query": [e["input"] for e in examples],
        "expected_tool_calls": [json.dumps(e["expected_tool_calls"]) for e in examples],
    }
)

print(f"Shape: {df.shape}")
df.head(10)
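
Because `expected_tool_calls` is stored as a JSON string rather than a nested object, any downstream code has to decode it with `json.loads` before inspecting names or arguments. A minimal round-trip sketch follows; the call structure shown (a `name` plus an `arguments` object) is an assumption for illustration, and the real dataset's layout may differ.

```python
import json

# Hypothetical expected call; the real dataset's exact structure may differ.
expected_calls = [
    {"name": "get_directions", "arguments": {"origin": "Louvre", "destination": "Eiffel Tower"}}
]

encoded = json.dumps(expected_calls)  # the string form stored in the DataFrame column
decoded = json.loads(encoded)         # the structured form an evaluator would work with

print(type(encoded).__name__, type(decoded).__name__)
```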

# Section 4: Upload to Phoenix

The cell below launches an in-process Phoenix server — no additional setup required. Run it and open the printed URL to access the UI.

If you'd prefer to connect to an existing instance, skip that cell and set your connection details before running the upload cells:

- **Phoenix Cloud**: Set `PHOENIX_COLLECTOR_ENDPOINT` to your workspace URL and `PHOENIX_API_KEY` to your API key (both available at [phoenix.arize.com](https://phoenix.arize.com) under **Settings → API Keys**).
- **Existing local server**: If Phoenix is already running (e.g. `python -m phoenix.server.main serve`), `Client()` connects to `http://localhost:6006` automatically — no env vars needed.

In [None]:
import phoenix as px

# Launch an in-process Phoenix server and open the UI inline.
# Skip this cell if you're connecting to Phoenix Cloud or an existing local server.
session = px.launch_app()
session.view()
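
If you skipped the launch cell to use Phoenix Cloud, the two environment variables named above need to be set before `Client()` is created. A minimal sketch, assuming placeholder values (the URL and key below are stand-ins, not real endpoints or credentials):

```python
import os

# Placeholder values (assumptions): substitute your own workspace URL and API key.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://your-workspace.example.com"
os.environ["PHOENIX_API_KEY"] = "your-api-key"

# Confirm both variables are visible to this process without printing the secret.
for var in ("PHOENIX_COLLECTOR_ENDPOINT", "PHOENIX_API_KEY"):
    print(var, "is set" if os.environ.get(var) else "is missing")
```

In a shared notebook, reading the key from a prompt or a secrets manager is usually safer than hard-coding it in a cell.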

In [None]:
from phoenix.client import Client

client = Client()

dataset = client.datasets.create_dataset(
    name="travel-assistant-tool-calling",
    dataframe=df,
    input_keys=["query"],
    output_keys=["expected_tool_calls"],
)

print("Dataset uploaded successfully.")
print(f"Dataset ID: {dataset.id}")
print(f"Examples:   {len(df)}")

# Section 5: Create a Phoenix Prompt with the Tool Set

This creates a versioned `travel-assistant` prompt in Phoenix with all six tool schemas attached. Once pushed, it will be available in **Phoenix UI → Prompts → `travel-assistant`** and can be selected directly when creating an experiment.

In [None]:
from phoenix.client.__generated__ import v1
from phoenix.client.types.prompts import PromptVersion

PROMPT_NAME = "travel-assistant"

# Wrap TRAVEL_TOOLS (defined in Section 1) in Phoenix's PromptTools structure.
# tool_choice "zero_or_more" is equivalent to OpenAI's "auto": call any number
# of tools (including none) depending on the query.
prompt_tools: v1.PromptTools = {
    "type": "tools",
    "tools": [
        {
            "type": "function",
            "function": {
                "name": tool["function"]["name"],
                "description": tool["function"]["description"],
                "parameters": tool["function"]["parameters"],
            },
        }
        for tool in TRAVEL_TOOLS
    ],
    "tool_choice": {"type": "zero_or_more"},
}

# Two messages: a fixed system prompt and a user turn with {{query}}.
# template_format="MUSTACHE" enables {{query}} substitution at experiment time,
# where each dataset row's `query` column is injected into the user message.
prompt_version = PromptVersion(
    [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "You are a helpful travel planning assistant. Use the available tools "
                        "to help the user with their travel-related requests. If no tool is "
                        "needed to answer the question, respond directly."
                    ),
                }
            ],
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "{{query}}"}],
        },
    ],
    model_name="gpt-4o",
    model_provider="OPENAI",
    template_format="MUSTACHE",
    description="Travel assistant with all six tool schemas attached",
)

# Tools are not part of the PromptVersion constructor, so this notebook attaches
# them through the private `_tools` attribute. Each call to client.prompts.create()
# pushes a new immutable version; existing versions are never modified.
prompt_version._tools = prompt_tools  # noqa: SLF001

new_version = client.prompts.create(
    name=PROMPT_NAME,
    version=prompt_version,
    prompt_description="Travel planning assistant with tool-calling capabilities",
)

tools_out = new_version._tools  # noqa: SLF001
tool_names = [t["function"]["name"] for t in tools_out["tools"]] if tools_out else []
print(f"Prompt '{PROMPT_NAME}' — new version id: {new_version.id}")
print(f"Tools attached ({len(tool_names)}): {tool_names}")
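
To make the `{{query}}` placeholder concrete, the sketch below imitates Mustache-style variable substitution with a regular expression. It is an illustration of the idea only, not Phoenix's actual rendering code, and it covers plain variables rather than the full Mustache syntax. `render_mustache` and `row` are hypothetical names.

```python
import re


def render_mustache(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with the matching value from variables."""

    def substitute(match):
        key = match.group(1)
        # Leave unknown placeholders untouched so a missing column is easy to spot.
        return str(variables.get(key, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Hypothetical dataset row using the same column name as this notebook.
row = {"query": "How long does it take to walk from the Louvre to Notre-Dame?"}
rendered = render_mustache("{{query}}", row)
print(rendered)
```

The placeholder name has to match the dataset's input key (`query` here); a mismatch would leave the literal `{{...}}` text in the user message in this sketch.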

# Next Steps

The `travel-assistant-tool-calling` dataset and `travel-assistant` prompt are now in Phoenix. The full walkthrough is covered in the tutorial — here's a quick reference for the steps that happen in the UI.

### 1. Run an experiment

Open Phoenix → **Datasets** → `travel-assistant-tool-calling` → **New Experiment**.

Select the `travel-assistant` prompt in the playground and run the experiment.

### 2. Add evaluators

After the experiment completes, click **Add Evaluator**:

- **Tool Selection**: from the built-in template; map the evaluator's `input` to your dataset's `query` column
- **Tool Invocation**: same input mapping
- **Matches Expected** (optional): create a custom LLM evaluator to compare output tool calls against the labeled `expected_tool_calls` column
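
As a point of comparison for the optional **Matches Expected** evaluator, a deterministic exact-match check can serve as a strict baseline. Note that this swaps in a different technique from the LLM evaluator described above: it compares decoded calls literally and ignores only ordering. The function names and the call structure (`name` plus `arguments`) are assumptions for illustration.

```python
import json


def normalize(calls: list) -> list:
    """Serialize each call with sorted keys so argument order does not matter."""
    return sorted(json.dumps(call, sort_keys=True) for call in calls)


def matches_expected(actual_calls: list, expected_json: str) -> bool:
    """Return True when both sides contain the same calls, in any order."""
    return normalize(actual_calls) == normalize(json.loads(expected_json))


# Hypothetical call structure (name + arguments); the real dataset may differ.
expected_json = json.dumps(
    [
        {"name": "get_weather", "arguments": {"location": "Tokyo"}},
        {
            "name": "search_hotels",
            "arguments": {"city": "Tokyo", "check_in": "2025-05-01", "check_out": "2025-05-04"},
        },
    ]
)
actual_calls = [
    {
        "name": "search_hotels",
        "arguments": {"check_out": "2025-05-04", "city": "Tokyo", "check_in": "2025-05-01"},
    },
    {"name": "get_weather", "arguments": {"location": "Tokyo"}},
]

print(matches_expected(actual_calls, expected_json))
```

An exact match treats `"Paris"` and `"Paris, France"` as different values, which is one reason an LLM-based comparison may be the better fit for free-text arguments.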

### 3. Inspect and iterate

Review per-example explanations to identify failure patterns. Look for:
- Systematic issues (like date assumptions) → fix the system prompt
- Evaluator over-strictness → adjust the evaluator prompt
- Missing capabilities (like "current date") → extend the tool set

Rerun the experiment and compare versions side by side.