# Evaluate your AI Agent using Vertex AI Gen AI Evaluation service
## Overview

This notebook shows you how to evaluate an ADK (Agent Development Kit) agent using the Vertex AI Gen AI Evaluation service.

## Learning Goals

By the end of this notebook, you will understand how to:
* Set up a local ADK agent for evaluation with the Vertex AI Gen AI Evaluation service
* Prepare an agent evaluation dataset
* Set up and use single-tool usage evaluation
* Use the Trajectory evaluation
* Use the Response evaluation

## Setup
This lab needs a special kernel. Run the following cell to install it.
**NOTE: You can skip this step if you have already built the ADK Kernel in the previous lab.**

In [None]:
!echo "Kernel installation started."
!cd ../../.. && make adk_kernel > /dev/null 2>&1
!echo "Kernel installation completed."

When the installation completes, select the **`ADK Kernel`** at the top right before continuing in the notebook.<br>
It may take about a minute for the kernel to appear after the installation.

## Import Packages

In [None]:
import asyncio
import importlib
import json
import os
import warnings

import pandas as pd
from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm  # For multi-model support
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.adk.tools.tool_context import ToolContext
from google.genai import types  # For creating message Content/Parts
from IPython.display import HTML, Markdown, display

# Ignore all warnings
warnings.filterwarnings("ignore")

import logging

logging.basicConfig(level=logging.ERROR)

In [None]:
LOCATION = "us-central1"
os.environ["GOOGLE_CLOUD_LOCATION"] = LOCATION
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "TRUE"  # Use Vertex AI API

In [None]:
%%bash
echo > adk_agents/.env "GOOGLE_CLOUD_LOCATION=$GOOGLE_CLOUD_LOCATION
GOOGLE_GENAI_USE_VERTEXAI=$GOOGLE_GENAI_USE_VERTEXAI
"

In [None]:
MODEL = "gemini-2.0-flash"

## Define helper functions

Define a set of helper functions to display and visualize the evaluation results,
including tables, bar charts, and radar charts.

In [None]:
import plotly.graph_objects as go
import plotly.io as pio

pio.renderers.default = "notebook"


def plot_bar_plot(
    eval_result, title: str, metrics: list[str] | None = None
) -> None:
    """Displays a bar chart of selected summary metrics from an evaluation result."""
    data = []

    summary_metrics = eval_result.summary_metrics
    if metrics:
        summary_metrics = {
            k: summary_metrics[k]
            for k, v in summary_metrics.items()
            if any(selected_metric in k for selected_metric in metrics)
        }

    data.append(
        go.Bar(
            x=list(summary_metrics.keys()),
            y=list(summary_metrics.values()),
            name=title,
        )
    )

    fig = go.Figure(data=data)

    # Change the bar mode
    fig.update_layout(barmode="group")
    fig.show()


def format_output_as_markdown(output: dict) -> str:
    """Convert the output dictionary to a formatted markdown string."""
    markdown = "### AI Response\n"
    markdown += f"{output['response']}\n\n"

    if output["predicted_trajectory"]:
        output["predicted_trajectory"] = json.loads(
            output["predicted_trajectory"]
        )
        markdown += "### Function Calls\n"
        for call in output["predicted_trajectory"]:
            markdown += f"- **Function**: `{call['tool_name']}`\n"
            markdown += "  - **Arguments**:\n"
            for key, value in call["tool_input"].items():
                markdown += f"    - `{key}`: `{value}`\n"

    return markdown


def display_dataframe_rows(
    df: pd.DataFrame,
    columns: list[str] | None = None,
    num_rows: int = 3,
    display_drilldown: bool = False,
) -> None:
    """Displays a subset of rows from a DataFrame, optionally including a drill-down view."""

    if columns:
        df = df[columns]

    base_style = "font-family: monospace; font-size: 14px; white-space: pre-wrap; width: auto; overflow-x: auto;"
    header_style = base_style + "font-weight: bold;"

    for _, row in df.head(num_rows).iterrows():
        for column in df.columns:
            display(
                HTML(
                    f"<span style='{header_style}'>{column.replace('_', ' ').title()}: </span>"
                )
            )
            display(
                HTML(f"<span style='{base_style}'>{row[column]}</span><br>")
            )

        display(HTML("<hr>"))

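        if (
            display_drilldown
            and "predicted_trajectory" in df.columns
            and "reference_trajectory" in df.columns
        ):
            # `display_drilldown` is a bool flag, so it cannot be called.
            # Show both trajectories together for a simple comparison instead.
            display(
                HTML(
                    f"<pre>Predicted: {row['predicted_trajectory']}\n"
                    f"Reference: {row['reference_trajectory']}</pre>"
                )
            )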


def display_radar_plot(eval_results, title: str, metrics=None):
    """Plot the radar plot."""
    fig = go.Figure()
    summary_metrics = eval_results.summary_metrics
    if metrics:
        summary_metrics = {
            k: summary_metrics[k]
            for k, v in summary_metrics.items()
            if any(selected_metric in k for selected_metric in metrics)
        }

    min_val = 0  # = min(summary_metrics.values())
    max_val = max(summary_metrics.values())

    fig.add_trace(
        go.Scatterpolar(
            r=list(summary_metrics.values()),
            theta=list(summary_metrics.keys()),
            fill="toself",
            name=title,
        )
    )
    fig.update_layout(
        title=title,
        polar=dict(radialaxis=dict(visible=True, range=[min_val, max_val])),
        showlegend=True,
    )
    fig.show()

## Basic App: Weather Lookup
**NOTE: You can skip this step if you have already created "./adk_agents/agent1_weather_lookup/tools.py" file from the previous Lab**

Let's begin by building the fundamental component of our Weather Bot: a single agent capable of performing a specific task – looking up weather information. This involves creating two core pieces:

- A Tool: A Python function that equips the agent with the ability to fetch weather data.
- An Agent: The AI "brain" that understands the user's request, knows it has a weather tool, and decides when and how to use it.

### Define the Tool (get_weather)

In ADK, **Tools** are the building blocks that give agents concrete capabilities beyond just text generation. They are typically regular Python functions that perform specific actions, like calling an API, querying a database, or performing calculations.

Our first tool will provide a *mock* weather report. This allows us to focus on the agent structure without needing external API keys yet. Later, you could easily swap this mock function with one that calls a real weather service.

**Key Concept: Docstrings are Crucial!** The agent's LLM relies heavily on the function's **docstring** to understand:

* *What* the tool does.  
* *When* to use it.  
* *What arguments* it requires (`city: str`).  
* *What information* it returns.

**Best Practice:** Write clear, descriptive, and accurate docstrings for your tools. This is essential for the LLM to use the tool correctly.
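To make this concrete, the sketch below uses the standard library's `inspect` module to show the raw material available when a function is described to an LLM: its name, its typed parameters, and its docstring. It only illustrates the idea. How ADK actually builds its tool schema is not shown here, and `describe_tool` is a hypothetical helper, not an ADK function.

```python
import inspect


def get_weather(city: str) -> dict:
    """Retrieves the current weather report for a specified city.

    Args:
        city (str): The name of the city (e.g., "New York").
    """
    return {"status": "success", "report": f"Mock report for {city}."}


def describe_tool(func) -> dict:
    """Hypothetical helper: collect name, parameter types, and docstring summary."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "parameters": {
            name: param.annotation.__name__
            for name, param in signature.parameters.items()
        },
        # First docstring line: the part that says what the tool does.
        "summary": inspect.getdoc(func).splitlines()[0],
    }


tool_description = describe_tool(get_weather)
print(tool_description)
```

If the docstring were empty or misleading, the summary above would be too, which is why accurate docstrings matter.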

In [None]:
%%writefile ./adk_agents/agent1_weather_lookup/tools.py
def get_weather(city: str) -> dict:
    """Retrieves the current weather report for a specified city.

    Args:
        city (str): The name of the city (e.g., "New York", "London", "Tokyo").

    Returns:
        dict: A dictionary containing the weather information.
              Includes a 'status' key ('success' or 'error').
              If 'success', includes a 'report' key with weather details.
              If 'error', includes an 'error_message' key.
    """
    print(f"--- Tool: get_weather called for city: {city} ---") # Log tool execution
    city_normalized = city.lower().replace(" ", "") # Basic normalization

    # Mock weather data
    mock_weather_db = {
        "newyork": {"status": "success", "report": "The weather in New York is sunny with a temperature of 25°C."},
        "london": {"status": "success", "report": "It's cloudy in London with a temperature of 15°C."},
        "tokyo": {"status": "success", "report": "Tokyo is experiencing light rain and a temperature of 18°C."},
    }

    if city_normalized in mock_weather_db:
        return mock_weather_db[city_normalized]
    else:
        return {"status": "error", "error_message": f"Sorry, I don't have weather information for '{city}'."}

### Define the Agent (`weather_agent`)

**NOTE: You can skip this step if you have already created "./adk_agents/agent1_weather_lookup/agent.py" file from the previous Lab**

Now, let's create the **Agent** itself. An `Agent` in ADK orchestrates the interaction between the user, the LLM, and the available tools.

We configure it with several key parameters:

* `name`: A unique identifier for this agent (e.g., "weather\_agent\_v1").  
* `model`: Specifies which LLM to use (e.g., `gemini-2.0-flash`).
* `description`: A concise summary of the agent's overall purpose. This becomes crucial later when other agents need to decide whether to delegate tasks to *this* agent.  
* `instruction`: Detailed guidance for the LLM on how to behave, its persona, its goals, and specifically *how and when* to utilize its assigned `tools`.  
* `tools`: A list containing the actual Python tool functions the agent is allowed to use (e.g., `[get_weather]`).

**Best Practices:** 
- Choose descriptive `name` and `description` values. These are used internally by ADK and are vital for features like automatic delegation (covered later).
- Provide clear and specific `instruction` prompts. The more detailed the instructions, the better the LLM can understand its role and how to use its tools effectively. Be explicit about error handling if needed.

In [None]:
%%writefile ./adk_agents/agent1_weather_lookup/agent.py
from google.adk.agents import Agent
MODEL = "gemini-2.0-flash"

from .tools import get_weather

root_agent = Agent(
    name="weather_agent_v1",
    model=MODEL, # Can be a string for Gemini or a LiteLlm object
    description="Provides weather information for specific cities.",
    instruction="You are a helpful weather assistant. "
                "When the user asks for the weather in a specific city, "
                "use the 'get_weather' tool to find the information. "
                "If the tool returns an error, inform the user politely. "
                "If the tool is successful, present the weather report clearly.",
    tools=[get_weather], # Pass the function directly
)

In [None]:
from adk_agents.agent1_weather_lookup import agent

importlib.reload(agent)  # Force reload

# Example tool usage (optional test)
print(agent.get_weather("New York"))
print(agent.get_weather("Paris"))

### Setup Runner and Session Service

To manage conversations and execute the agent, we need two more components:

* `SessionService`: Responsible for managing conversation history and state for different users and sessions. The `InMemorySessionService` is a simple implementation that stores everything in memory, suitable for testing and simple applications. It keeps track of the messages exchanged.
* `Runner`: The engine that orchestrates the interaction flow. It takes user input, routes it to the appropriate agent, manages calls to the LLM and tools based on the agent's logic, handles session updates via the `SessionService`, and yields events representing the progress of the interaction.

Let's define some constants first.

In [None]:
APP_NAME = "weather_info_app"
USER_ID = "user_1"
SESSION_ID = "session_001"  # Using a fixed ID for simplicity

#### Define a function for Session and Runner

In [None]:
async def setup_session_and_runner():
    session_service = InMemorySessionService()
    example_session = await session_service.create_session(
        app_name=APP_NAME, user_id=USER_ID, session_id=SESSION_ID
    )

    print(f"--- Examining Session Properties ---")
    print(f"ID (`id`):                {example_session.id}")
    print(f"Application Name (`app_name`): {example_session.app_name}")
    print(f"User ID (`user_id`):         {example_session.user_id}")
    print(
        f"State (`state`):           {example_session.state}"
    )  # Note: Only shows initial state here
    print(
        f"Events (`events`):         {example_session.events}"
    )  # Initially empty
    print(
        f"Last Update (`last_update_time`): {example_session.last_update_time:.2f}"
    )
    print(f"---------------------------------")

    runner = Runner(
        agent=agent.root_agent,
        app_name=APP_NAME,
        session_service=session_service,
    )
    return example_session, runner

### Interact with the Agent

We need a way to send messages to our agent and receive its responses. Since LLM calls and tool executions can take time, ADK's `Runner` operates asynchronously.

We'll define an `async` helper function (`call_agent_async`) that:

1. Takes a user query string.  
2. Packages it into the ADK `Content` format.  
3. Calls `runner.run_async`, providing the user/session context and the new message.  
4. Iterates through the **Events** yielded by the runner. Events represent steps in the agent's execution (e.g., tool call requested, tool result received, intermediate LLM thought, final response).  
5. Identifies and prints the **final response** event using `event.is_final_response()`.

**Why `async`?** Interactions with LLMs and potentially tools (like external APIs) are I/O-bound operations. Using `asyncio` allows the program to handle these operations efficiently without blocking execution.
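The event-iteration pattern described above can be sketched without ADK. In this illustration, `FakeEvent` and `fake_run_async` are made-up stand-ins for the runner's events and `run_async`; only the `async for` loop and the final-response check mirror the steps listed above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeEvent:
    """Stand-in for a runner event; not an ADK class."""

    kind: str
    text: str = ""

    def is_final_response(self) -> bool:
        return self.kind == "final"


async def fake_run_async(query: str):
    """Yield events one at a time, with awaits where real I/O would happen."""
    yield FakeEvent(kind="tool_call", text="get_weather")
    await asyncio.sleep(0)  # placeholder for a slow tool or LLM call
    yield FakeEvent(kind="tool_result", text="sunny, 25°C")
    await asyncio.sleep(0)
    yield FakeEvent(kind="final", text=f"Answer to: {query}")


async def collect_final_response(query: str) -> str:
    """Consume the event stream and keep only the final response text."""
    final_response = ""
    async for event in fake_run_async(query):
        if event.is_final_response():
            final_response = event.text
    return final_response


final_text = asyncio.run(collect_final_response("Tell me the weather in New York"))
print(final_text)
```

Inside a notebook cell, where an event loop is already running, you would write `await collect_final_response(...)` instead of `asyncio.run(...)`.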

### Run the Conversation

Finally, let's test our setup by sending queries to the agent. The next cell defines the `async` helper described above, and the cell after it calls the helper with `await`, which works at the top level of a notebook cell.

Watch the output:

* See the user queries.  
* Notice the `--- Tool: get_weather called... ---` logs when the agent uses the tool.  
* Observe the agent's final responses, including how it handles the case where weather data isn't available (for Paris).

In [None]:
# Agent Interaction


async def call_agent_async(query):

    print(f"\n>>> User Query: {query}")

    content = types.Content(role="user", parts=[types.Part(text=query)])
    session, runner = await setup_session_and_runner()
    events = runner.run_async(
        user_id=USER_ID, session_id=SESSION_ID, new_message=content
    )

    final_response = ""
    predicted_trajectory_list = []

    async for event in events:
        # Ensure content and parts exist before accessing them
        if not event.content or not event.content.parts:
            continue

        # Iterate through ALL parts in the event's content
        for part in event.content.parts:
            if part.function_call:
                tool_info = {
                    "tool_name": part.function_call.name,
                    "tool_input": dict(part.function_call.args),
                }
                # Ensure we don't add duplicates if the same call appears somehow
                if tool_info not in predicted_trajectory_list:
                    predicted_trajectory_list.append(tool_info)

            # The final text response is usually in the last event from the model
            if event.content.role == "model" and part.text:
                # Overwrite response; the last text response found is likely the final one
                final_response = part.text.strip()

        if event.is_final_response():
            final_response = event.content.parts[0].text
            print("Agent Response: ", final_response)

    # Dump the collected trajectory list into a JSON string
    final_output = {
        "response": str(final_response),
        "predicted_trajectory": json.dumps(predicted_trajectory_list),
    }
    return final_output

In [None]:
await call_agent_async("Tell me the weather in New York")

## Evaluating an ADK agent with Vertex AI Gen AI Evaluation

When working with AI agents, it's important to keep track of their performance and how well they're working. You can look at this in two main ways: **monitoring** and **observability**.

Monitoring focuses on how well your agent is performing specific tasks:

* **Single Tool Selection**: Is the agent choosing the right tools for the job?

* **Multiple Tool Selection (or Trajectory)**: Is the agent making logical choices in the order it uses tools?

* **Response generation**: Is the agent's output good, and does it make sense based on the tools it used?

Observability is about understanding the overall health of the agent:

* **Latency**: How long does it take the agent to respond?

* **Failure Rate**: How often does the agent fail to produce a response?
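As a rough illustration of the two observability measures, the sketch below times a callable over a list of prompts and counts how many calls raise an exception. `measure_runs` and `flaky_agent` are hypothetical names introduced for this example, and the sketch assumes a non-empty prompt list. It is not part of the Vertex AI Gen AI Evaluation service.

```python
import time


def measure_runs(runnable, prompts):
    """Return mean latency in seconds and the fraction of prompts that raised."""
    latencies = []
    failures = 0
    for prompt in prompts:
        start = time.perf_counter()
        try:
            runnable(prompt)
        except Exception:
            failures += 1
        latencies.append(time.perf_counter() - start)
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "failure_rate": failures / len(prompts),
    }


def flaky_agent(prompt: str) -> str:
    """Hypothetical stand-in for an agent call; fails when the prompt mentions Paris."""
    if "Paris" in prompt:
        raise RuntimeError("no data")
    return "ok"


stats = measure_runs(flaky_agent, ["Weather in London?", "Weather in Paris?"])
print(stats)
```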

The Vertex AI Gen AI Evaluation service helps you assess all of these aspects, both while you are prototyping the agent and after you deploy it in production. It provides [pre-built evaluation criteria and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval) so you can see exactly how your agents are doing and identify areas for improvement.

### Prepare Agent Evaluation dataset

To evaluate your AI agent with the Vertex AI Gen AI Evaluation service, you need a dataset whose columns depend on which aspects of the agent you want to evaluate.

The dataset must include the prompts given to the agent. It can also contain the ideal or expected response (ground truth) and the reference trajectory, that is, the sequence of tool calls you expect the agent to make for each prompt.

Below is an example dataset for the weather agent, with user prompts and their reference trajectories.

In [None]:
eval_data = {
    "prompt": [
        "Tell me the weather in New York",
        "What is the weather like in London?",
        "How about Paris?",
        "Tell me the weather in New York and London?",
        "What is the weather like in New York and London?",
        "Tell me the weather in New York, London and Paris?",
    ],
    "reference_trajectory": [
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            }
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            }
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "Paris"},
            }
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            },
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            },
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "Paris"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            },
        ],
    ],
}

eval_sample_dataset = pd.DataFrame(eval_data)

Print some samples from the dataset.

In [None]:
display(eval_sample_dataset)

In [None]:
display_dataframe_rows(eval_sample_dataset, num_rows=3)
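Before running an evaluation, it can help to check that every reference trajectory has the shape used above: a list of steps, each with a string `tool_name` and a dict `tool_input`. The following is a minimal sketch of such a check. `validate_trajectory` is a hypothetical helper, not part of the evaluation SDK.

```python
def validate_trajectory(trajectory) -> list[str]:
    """Return a list of problems found in one trajectory (empty if none)."""
    problems = []
    for index, step in enumerate(trajectory):
        if not isinstance(step.get("tool_name"), str):
            problems.append(f"step {index}: 'tool_name' must be a string")
        if not isinstance(step.get("tool_input"), dict):
            problems.append(f"step {index}: 'tool_input' must be a dict")
    return problems


good = [{"tool_name": "get_weather", "tool_input": {"city": "London"}}]
bad = [{"tool_name": "get_weather", "tool_input": "London"}]

print(validate_trajectory(good))
print(validate_trajectory(bad))
```

You could apply the same function to each entry of the `reference_trajectory` column and inspect any rows that report problems.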

### Single tool usage evaluation

After you've set up your AI agent and the evaluation dataset, you can start by evaluating whether the agent chooses the correct single tool for a given task.

#### Set single tool usage metrics

The `trajectory_single_tool_use` metric in Vertex AI Gen AI Evaluation gives you a quick way to evaluate whether your agent is using the tool you expect it to use, regardless of any specific tool order. It's a basic but useful way to start evaluating if the right tool was used at some point during the agent's process.

To use the `trajectory_single_tool_use` metric, you need to specify which tool should have been used for a particular user request. For example, if a user asks to "send an email", you might expect the agent to use a `send_email` tool, and you'd specify that tool's name when using this metric.
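The idea behind the metric, as described above, can be sketched in a few lines of plain Python. This illustrates the described behavior only; it is not the service's implementation, and `single_tool_use` is a hypothetical name.

```python
def single_tool_use(predicted_trajectory, tool_name: str) -> int:
    """Return 1 if `tool_name` appears anywhere in the trajectory, else 0."""
    return int(any(step["tool_name"] == tool_name for step in predicted_trajectory))


trajectory = [
    {"tool_name": "get_weather", "tool_input": {"city": "London"}},
    {"tool_name": "get_weather", "tool_input": {"city": "New York"}},
]

score_used = single_tool_use(trajectory, "get_weather")
score_missing = single_tool_use(trajectory, "send_email")
print(score_used, score_missing)
```

Note that the position of the call and its arguments play no role in this check, which is why trajectory metrics are a useful follow-up.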

In [None]:
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import (
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
    TrajectorySingleToolUse,
)

In [None]:
single_tool_usage_metrics = [TrajectorySingleToolUse(tool_name="get_weather")]

#### Run an evaluation task

To run the evaluation, you create an `EvalTask` using the pre-defined dataset (`eval_sample_dataset`) and metrics (`single_tool_usage_metrics` in this case) within an experiment. Then, you call its `evaluate` method with the `agent_parsed_outcome` function and a unique name for this specific evaluation run. The results are stored so that you can display and visualize them.

In [None]:
EXPERIMENT_NAME = "evaluate-adk-agent-v1"
PROJECT = !gcloud config list --format 'value(core.project)'
PROJECT = PROJECT[0]
BUCKET_NAME = f"agent-evaluation-{PROJECT}-bucket"
BUCKET_URI = f"gs://{BUCKET_NAME}"

**Checking for the existence of BUCKET. Creating it if it doesn't exist:**

In [None]:
!gsutil ls $BUCKET_URI || gsutil mb -l $LOCATION $BUCKET_URI

In [None]:
import random
import string
from typing import Any


def get_id(length: int = 8) -> str:
    """Generate a random lowercase alphanumeric ID of the given length (default=8)."""
    return "".join(
        random.choices(string.ascii_lowercase + string.digits, k=length)
    )

Here we wrap the `call_agent_async` in a synchronous function so that we can pass it to the evaluation service.

In [None]:
def agent_parsed_outcome(query):
    return asyncio.run(call_agent_async(query))
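The wrapper pattern can be tried in isolation with a dummy coroutine. In this sketch, `fake_agent_call` is a hypothetical stand-in for `call_agent_async`, returning the same two keys (`response` and `predicted_trajectory`).

```python
import asyncio


async def fake_agent_call(query: str) -> dict:
    """Hypothetical async agent call returning the two fields the evaluation uses."""
    await asyncio.sleep(0)  # placeholder for LLM and tool I/O
    return {"response": f"Echo: {query}", "predicted_trajectory": "[]"}


def sync_agent_call(query: str) -> dict:
    """Run the coroutine to completion on a fresh event loop."""
    return asyncio.run(fake_agent_call(query))


outcome = sync_agent_call("Tell me the weather in London")
print(outcome)
```

`asyncio.run` raises `RuntimeError` when the calling thread already has a running event loop, which is typically the case if you call such a wrapper directly in a notebook cell. It works when the wrapper is invoked from a thread without a running loop.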

In [None]:
EXPERIMENT_RUN = f"single-metric-eval-{get_id()}"

single_tool_call_eval_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=single_tool_usage_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/single-metric-eval",
)

single_tool_call_eval_result = single_tool_call_eval_task.evaluate(
    runnable=agent_parsed_outcome, experiment_run_name=EXPERIMENT_RUN
)

print(single_tool_call_eval_result)

In [None]:
def display_eval_report(eval_result) -> None:
    """Display the summary and row-wise metrics of an evaluation result."""
    metrics_df = pd.DataFrame.from_dict(
        eval_result.summary_metrics, orient="index"
    ).T
    display(Markdown("### Summary Metrics"))
    display(metrics_df)

    display(Markdown("### Row-wise Metrics"))
    display(eval_result.metrics_table)

In [None]:
display_eval_report(single_tool_call_eval_result)

### Trajectory Evaluation

After evaluating the agent's ability to select the single most appropriate tool for a given task, you generalize the evaluation by analyzing the tool sequence choices with respect to the user input (trajectory). This assesses whether the agent not only chooses the right tools but also utilizes them in a rational and effective order.

#### Set trajectory metrics

To evaluate an agent's trajectory, Vertex AI Gen AI Evaluation provides several ground-truth-based metrics:

* `trajectory_exact_match`: identical trajectories (same actions, same order)

* `trajectory_in_order_match`: reference actions present in predicted trajectory, in order (extras allowed)

* `trajectory_any_order_match`: all reference actions present in predicted trajectory (order, extras don't matter).

* `trajectory_precision`: proportion of predicted actions present in reference

* `trajectory_recall`: proportion of reference actions present in predicted.  

All metrics score 0 or 1, except `trajectory_precision` and `trajectory_recall` which range from 0 to 1.
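To build intuition for how these metrics differ, here is a plain-Python sketch that follows the definitions above. It is an approximation written for illustration, not the service's implementation. The function names are hypothetical, and details such as how duplicate steps are counted may differ in the real metrics.

```python
def exact_match(predicted, reference) -> int:
    """1 if both trajectories contain the same steps in the same order."""
    return int(predicted == reference)


def in_order_match(predicted, reference) -> int:
    """1 if the reference steps appear in the prediction in order (extras allowed)."""
    remaining = iter(predicted)
    # `step in remaining` advances the iterator, so order is enforced.
    return int(all(step in remaining for step in reference))


def any_order_match(predicted, reference) -> int:
    """1 if every reference step appears somewhere in the prediction."""
    return int(all(step in predicted for step in reference))


def precision(predicted, reference) -> float:
    """Share of predicted steps that also appear in the reference."""
    if not predicted:
        return 0.0
    return sum(step in reference for step in predicted) / len(predicted)


def recall(predicted, reference) -> float:
    """Share of reference steps that also appear in the prediction."""
    if not reference:
        return 0.0
    return sum(step in predicted for step in reference) / len(reference)


london = {"tool_name": "get_weather", "tool_input": {"city": "London"}}
new_york = {"tool_name": "get_weather", "tool_input": {"city": "New York"}}

reference = [london, new_york]
predicted = [new_york, london]  # same calls, opposite order

scores = {
    "exact": exact_match(predicted, reference),
    "in_order": in_order_match(predicted, reference),
    "any_order": any_order_match(predicted, reference),
    "precision": precision(predicted, reference),
    "recall": recall(predicted, reference),
}
print(scores)
```

In this example the agent made the right calls in a different order, so only the order-sensitive metrics drop to 0. This is worth keeping in mind for prompts where the order of tool calls does not really matter.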

In [None]:
trajectory_metrics = [
    "trajectory_exact_match",
    "trajectory_in_order_match",
    "trajectory_any_order_match",
    "trajectory_precision",
    "trajectory_recall",
]

#### Run an evaluation task

Submit an evaluation by calling the `evaluate` method of a new `EvalTask`.

In [None]:
EXPERIMENT_RUN = f"trajectory-{get_id()}"

trajectory_eval_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=trajectory_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/multiple-metric-eval",
)

trajectory_eval_result = trajectory_eval_task.evaluate(
    runnable=agent_parsed_outcome, experiment_run_name=EXPERIMENT_RUN
)

In [None]:
display_eval_report(trajectory_eval_result)

#### Visualize evaluation results

Print and visualize a sample of evaluation results.

In [None]:
display_dataframe_rows(trajectory_eval_result.metrics_table, num_rows=3)

In [None]:
plot_bar_plot(
    trajectory_eval_result,
    title="Trajectory Metrics",
    metrics=[f"{metric}/mean" for metric in trajectory_metrics],
)
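The `metrics` argument works as a substring filter on the keys of the summary metrics, as implemented in `plot_bar_plot`. The sketch below reproduces that filter on a hand-written dictionary. The keys and values here are invented for illustration and are not real evaluation output.

```python
# Invented summary metrics, shaped like "<metric>/<statistic>" keys.
summary_metrics = {
    "row_count": 6,
    "trajectory_exact_match/mean": 0.5,
    "trajectory_exact_match/std": 0.5,
    "trajectory_recall/mean": 0.9,
}

selected = ["trajectory_exact_match/mean", "trajectory_recall/mean"]

# Keep only the entries whose key contains one of the selected names.
filtered = {
    key: value
    for key, value in summary_metrics.items()
    if any(name in key for name in selected)
}
print(filtered)
```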

### Evaluate final response

Similar to model evaluation, you can evaluate the final response of the agent using Vertex AI Gen AI Evaluation.

#### Set response metrics

After agent inference, Vertex AI Gen AI Evaluation provides several metrics to evaluate generated responses. You can use computation-based metrics to compare the response to a reference (if needed) and use existing or custom model-based metrics to determine the quality of the final response.

Check out the [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval) to learn more.


In [None]:
response_metrics = ["safety", "coherence"]

#### Run an evaluation task

To evaluate the agent's generated responses, use the `evaluate` method of the `EvalTask` class.

In [None]:
EXPERIMENT_RUN = f"response-{get_id()}"

response_eval_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=response_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/response-metric-eval",
)

response_eval_result = response_eval_task.evaluate(
    runnable=agent_parsed_outcome, experiment_run_name=EXPERIMENT_RUN
)

display_eval_report(response_eval_result)

#### Visualize evaluation results


Print new evaluation result sample.

In [None]:
display_dataframe_rows(response_eval_result.metrics_table, num_rows=3)

### Evaluate generated response conditioned on tool choice

When evaluating AI agents that interact with environments, standard text generation metrics like coherence may not be sufficient. This is because these metrics primarily focus on text structure, while agent responses should be assessed based on their effectiveness within the environment.

Instead, use custom metrics that assess whether the agent's response logically follows from its tool choices, like the one defined in this section.

#### Define a custom metric

According to the [documentation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval#model-based-metrics), you can define a prompt template for evaluating whether an AI agent's response follows logically from its actions by setting up criteria and a rating system for this evaluation.

Define a `criteria` to set the evaluation guidelines and a `pointwise_rating_rubric` to provide a scoring system (1 or 0). Then use a `PointwiseMetricPromptTemplate` to create the template using these components.


In [None]:
criteria = {
    "Follows trajectory": (
        "Evaluate whether the agent's response logically follows from the "
        "sequence of actions it took. Consider these sub-points:\n"
        "  - Does the response reflect the information gathered during the trajectory?\n"
        "  - Is the response consistent with the goals and constraints of the task?\n"
        "  - Are there any unexpected or illogical jumps in reasoning?\n"
        "Provide specific examples from the trajectory and response to support your evaluation."
    )
}

pointwise_rating_rubric = {
    "1": "Follows trajectory",
    "0": "Does not follow trajectory",
}

response_follows_trajectory_prompt_template = PointwiseMetricPromptTemplate(
    criteria=criteria,
    rating_rubric=pointwise_rating_rubric,
    input_variables=["prompt", "predicted_trajectory"],
)

Print the prompt_data of this template containing the combined criteria and rubric information ready for use in an evaluation.

In [None]:
print(response_follows_trajectory_prompt_template.prompt_data)

After you define the evaluation prompt template, set up the associated metric to evaluate how well a response follows a specific trajectory. `PointwiseMetric` creates a metric where `response_follows_trajectory` is the metric's name and `response_follows_trajectory_prompt_template` is the prompt template you defined above.

In [None]:
response_follows_trajectory_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=response_follows_trajectory_prompt_template,
)

#### Set response metrics

Define a new set of response evaluation metrics that includes the custom metric.


In [None]:
response_tool_metrics = [
    "trajectory_exact_match",
    "trajectory_in_order_match",
    "safety",
    response_follows_trajectory_metric,
]

#### Run an evaluation task

Run a new evaluation of the agent.

In [None]:
EXPERIMENT_RUN = f"response-over-tools-{get_id()}"

response_eval_tool_task = EvalTask(
    dataset=eval_sample_dataset,
    metrics=response_tool_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/reasoning-metric-eval",
)

response_eval_tool_result = response_eval_tool_task.evaluate(
    runnable=agent_parsed_outcome, experiment_run_name=EXPERIMENT_RUN
)

display_eval_report(response_eval_tool_result)

#### Visualize evaluation results

Visualize evaluation result sample.

In [None]:
display_dataframe_rows(response_eval_tool_result.metrics_table, num_rows=3)

In [None]:
plot_bar_plot(
    response_eval_tool_result,
    title="Response Metrics",
    metrics=[f"{metric}/mean" for metric in response_tool_metrics],
)

## Bonus: Bring-Your-Own-Dataset (BYOD) and evaluate an ADK agent using Vertex AI Gen AI Evaluation

In Bring Your Own Dataset (BYOD) [scenarios](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-dataset), you provide both the predicted trajectory and the generated response from the agent.


### Bring your own evaluation dataset

Define the evaluation dataset with the predicted trajectory and the generated response.

In [None]:
byod_eval_data = {
    "prompt": [
        "Tell me the weather in New York",
        "What is the weather like in London?",
        "How about Paris?",
        "Tell me the weather in New York and London?",
        "What is the weather like in New York and London?",
        "Tell me the weather in New York, London and Paris?",
    ],
    "reference_trajectory": [
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            }
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            }
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "Paris"},
            }
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            },
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            },
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "Paris"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            },
        ],
    ],
    "predicted_trajectory": [
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            }
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            }
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "Paris"},
            }
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            },
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            },
        ],
        [
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "London"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "Paris"},
            },
            {
                "tool_name": "get_weather",
                "tool_input": {"city": "New York"},
            },
        ],
    ],
    "response": [
        "The weather in New York is sunny with a temperature of 25°C.",
        "The weather in London is sunny with a temperature of 25°C.",
        "The weather in New York is sunny with a temperature of 25°C.",
        "The weather in New York is sunny with a temperature of 25°C.",
        "The weather in New York is sunny with a temperature of 25°C.",
        "The weather in New York is sunny with a temperature of 25°C.",
    ],
}

byod_eval_sample_dataset = pd.DataFrame(byod_eval_data)
byod_eval_sample_dataset["predicted_trajectory"] = byod_eval_sample_dataset[
    "predicted_trajectory"
].apply(json.dumps)
byod_eval_sample_dataset["reference_trajectory"] = byod_eval_sample_dataset[
    "reference_trajectory"
].apply(json.dumps)
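The `json.dumps` step stores each trajectory as a JSON string, the same format that `call_agent_async` returns for `predicted_trajectory`. A small round-trip sketch with the standard library shows that nothing is lost in the conversion:

```python
import json

trajectory = [{"tool_name": "get_weather", "tool_input": {"city": "Paris"}}]

serialized = json.dumps(trajectory)  # list of dicts -> JSON string
restored = json.loads(serialized)    # JSON string -> list of dicts

print(type(serialized).__name__)
print(restored == trajectory)
```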

### Run an evaluation task

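Run a new evaluation using your own dataset and the same metrics as in the previous evaluation. No `runnable` is passed, because the dataset already contains the predicted trajectories and responses.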

In [None]:
EXPERIMENT_RUN_NAME = f"response-over-tools-byod-{get_id()}"

byod_response_eval_tool_task = EvalTask(
    dataset=byod_eval_sample_dataset,
    metrics=response_tool_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/byod-eval",
)

byod_response_eval_tool_result = byod_response_eval_tool_task.evaluate(
    experiment_run_name=EXPERIMENT_RUN_NAME
)

display_eval_report(byod_response_eval_tool_result)

### Visualize evaluation results

Visualize evaluation result sample.

In [None]:
display_dataframe_rows(byod_response_eval_tool_result.metrics_table, num_rows=3)

In [None]:
display_radar_plot(
    byod_response_eval_tool_result,
    title="ADK agent evaluation",
    metrics=[f"{metric}/mean" for metric in response_tool_metrics],
)

Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.