# Task 5. Create and initialize a new Colab Enterprise notebook

Note: You will now evaluate Cymbal's new Gen AI agent, which provides product details and product prices in response to queries. To do this, you will open a new Colab Enterprise notebook.

In this task, you will set up a Colab Enterprise notebook and initialize Vertex AI from it. This notebook will be used in the following three scenarios:
- To determine if the new Cymbal Gen AI **agent is choosing the right tools for the tasks** it performs
- To determine if the Cymbal Gen AI agent is making logical choices in the **order it uses tools**
- To evaluate the **quality of the agent's responses to prompts**

Steps
- In the Google Cloud Console, navigate to Vertex AI > Colab Enterprise.
- Create a new notebook.
- Rename the notebook to cymbal-genai-agent-evaluations.ipynb.
- Note: You may be presented with a pop-up window to authorize the environment to act as your Qwiklabs student account.
- Run the package installation cell. When it finishes, indicated by a checkmark to the left of the cell, the packages are installed. Restart the session now so that the runtime can use them.







In [29]:
import pandas as pd
pd.set_option('display.max_colwidth', None)


In [1]:
%pip install --quiet --upgrade pip==23.3.1

%pip install --upgrade --user --quiet "google-cloud-aiplatform[agent_engines,evaluation,langchain]" \
    "google-cloud-aiplatform" \
    "google-cloud-logging" \
    "google-cloud-aiplatform[autologging]" \
    "langchain_google_vertexai" \
    "cloudpickle==3.0.0" \
    "pydantic>=2.10" \
    "requests==2.32.3"


```
WARNING: The script gunicorn is installed in '/root/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts databricks and dbfs are installed in '/root/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script alembic is installed in '/root/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts opentelemetry-bootstrap and opentelemetry-instrument are installed in '/root/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts mlflow and mlp are installed in '/root/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.
gradio 5.42.0 requires pydantic<2.12,>=2.0, but you have pydantic 2.12.3 which is incompatible.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 21.0.0 which is incompatible.
```

In [1]:
# General
import random
import string
import google.cloud.logging
import logging

from IPython.display import HTML, Markdown, display
import pandas as pd

# Build agent
import vertexai
from google.cloud import aiplatform
from vertexai import agent_engines

# Evaluate agent
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import (
    PointwiseMetric,
    PointwiseMetricPromptTemplate,
    TrajectorySingleToolUse,
)
from vertexai.preview.reasoning_engines import LangchainAgent

# Do not remove logging section
client = google.cloud.logging.Client()
client.setup_logging()

In [16]:
PROJECT_ID = "qwiklabs-gcp-01-ecf3a270a90b"
LOCATION = "us-west1"
BUCKET_URI = "gs://qwiklabs-gcp-01-ecf3a270a90b-experiments-staging-bucket"
EXPERIMENT_NAME = "evaluate-agent"

import vertexai

# Initialize vertexai
vertexai.init(
    project=PROJECT_ID,
    location=LOCATION,
    experiment=EXPERIMENT_NAME,
    staging_bucket=BUCKET_URI
)

In [4]:
import os
os.environ

environ{'SHELL': '/bin/bash',
        'NV_LIBCUBLAS_VERSION': '12.5.3.2-1',
        'NVIDIA_VISIBLE_DEVICES': 'all',
        'NV_NVML_DEV_VERSION': '12.5.82-1',
        'NV_CUDNN_PACKAGE_NAME': 'libcudnn9-cuda-12',
        'NV_LIBNCCL_DEV_PACKAGE': 'libnccl-dev=2.22.3-1+cuda12.5',
        'NV_LIBNCCL_DEV_PACKAGE_VERSION': '2.22.3-1',
        'VERTEX_PRODUCT': 'COLAB_ENTERPRISE',
        'VM_GCE_METADATA_HOST': '169.254.169.254',
        'HOSTNAME': '6e016d51ce48',
        'LANGUAGE': 'en_US',
        'COLAB_TPU_1VM': '',
        'GOOGLE_CLOUD_PROJECT': 'qwiklabs-gcp-01-ecf3a270a90b',
        'GCE_METADATA_TIMEOUT': '3',
        'NVIDIA_REQUIRE_CUDA': 'cuda>=12.5 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand
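
The environment listing above includes `GOOGLE_CLOUD_PROJECT`, so the project ID does not have to be hard-coded. The following is a minimal sketch, assuming the staging bucket follows the `<project>-experiments-staging-bucket` naming used in this lab; `resolve_lab_config` is a hypothetical helper, not part of the lab code.

```python
import os


def resolve_lab_config(default_location: str = "us-west1") -> dict:
    """Build the lab settings from the runtime environment (hypothetical helper)."""
    project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
    if not project_id:
        raise RuntimeError("GOOGLE_CLOUD_PROJECT is not set in this runtime")
    return {
        "project": project_id,
        "location": default_location,
        # Assumption: the bucket name follows this lab's naming pattern.
        "staging_bucket": f"gs://{project_id}-experiments-staging-bucket",
        "experiment": "evaluate-agent",
    }
```

The returned keys match the keyword arguments passed to `vertexai.init` in the cell above, so the dictionary could be unpacked as `vertexai.init(**resolve_lab_config())`.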

# Task 6. Build agent and execute a query

In this task, you will put together a simple agent that you will then evaluate using the Gen AI evaluation service in subsequent tasks of this challenge lab.

Note: You are provided with the code for a simple agent below that you will use in subsequent tasks of this challenge lab. This has deliberately been kept very simple so that you can immediately see what it does, allowing your focus to be on evaluating it.


In [7]:
# General
import random
import string

from IPython.display import HTML, Markdown, display

# Build agent
import vertexai
from google.cloud import aiplatform
from vertexai import agent_engines

from vertexai.preview.reasoning_engines import LangchainAgent

In [20]:
# Two simple tools for the agent are defined below. Add and execute the following code.

def get_product_details(product_name: str):
    """Gathers basic details about a product."""
    details = {
        "mens blue shorts": "Elevate your summer style with these tailored blue dress shorts. Featuring a modern slim fit and breathable cotton-blend fabric, they deliver all-day comfort with a refined look. Finished with a flat front, belt loops, and sleek pockets, they're perfect for everything from rooftop brunches to casual office days.",
        "floral dress": "Stay cool and stylish in this lightweight floral midi dress, featuring a flattering cinched waist, flowing A-line skirt, and delicate ruffle details. Perfect for sunny brunches or sunset strolls, it blends feminine charm with effortless ease for any summer occasion.",
        "garden furniture": "Create your perfect outdoor retreat with this stylish, weather-resistant garden furniture set featuring plush cushions, timeless design, and all-day comfort — ideal for relaxing, entertaining, or enjoying the outdoors in style.",
        "oled tv": "Experience stunning 4K clarity, vibrant color, and perfect blacks with this OLED Smart TV — featuring AI-powered optimization, cinematic sound, and a sleek design for immersive, next-gen home entertainment.",
        "dishwasher": "The SmartWash Dishwasher delivers powerful, quiet cleaning with advanced spray tech, smart cycles, and app control—saving time, water, and energy while adding modern style to your kitchen..",
    }
    return details.get(product_name, "Product details not found.")


def get_product_price(product_name: str):
    """Gathers the price of a product."""
    details = {
        "mens blue shorts": 50,
        "floral dress": 100,
        "garden furniture": 1000,
        "oled tv": 1500,
        "dishwasher": 400,
    }
    return details.get(product_name, "Product price not found.")
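
Both tools use an exact dictionary lookup, so a name such as "Garden Furniture" would miss the "garden furniture" key and return the fallback string. The sketch below illustrates that behavior on a trimmed copy of the price table; `normalize_product_name` and `get_price_normalized` are hypothetical helpers and are not part of the lab's agent.

```python
# Subset of the lab's price table, copied here so the example is self-contained.
PRICES = {"garden furniture": 1000, "oled tv": 1500}


def normalize_product_name(product_name: str) -> str:
    """Lower-case the name and collapse repeated whitespace (hypothetical helper)."""
    return " ".join(product_name.lower().split())


def get_price_normalized(product_name: str):
    """Look up a price after normalizing the key."""
    return PRICES.get(normalize_product_name(product_name), "Product price not found.")


# Exact lookup misses a differently cased name; the normalized lookup finds it.
print(PRICES.get("Garden Furniture", "Product price not found."))
print(get_price_normalized("Garden Furniture"))
```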

In [9]:
# Define a variable named model_name and assign it the value gemini-2.0-flash. You will use this variable in the subsequent task.
model_name = "gemini-2.0-flash"


In [10]:
# Assemble the agent using the model and tools defined so far.

local_1p_agent = LangchainAgent(
    model=model_name,
    tools=[get_product_details, get_product_price],
    agent_executor_kwargs={"return_intermediate_steps": True},
)

In [11]:
# Test the local agent using the following calls.

response = local_1p_agent.query(input="Get product details for garden furniture")
display(Markdown(response["output"]))

OK. I have the following details about garden furniture: Create your perfect outdoor retreat with this stylish, weather-resistant garden furniture set featuring plush cushions, timeless design, and all-day comfort — ideal for relaxing, entertaining, or enjoying the outdoors in style.


In [12]:
response

{'input': 'Get product details for garden furniture',
 'output': 'OK. I have the following details about garden furniture: Create your perfect outdoor retreat with this stylish, weather-resistant garden furniture set featuring plush cushions, timeless design, and all-day comfort — ideal for relaxing, entertaining, or enjoying the outdoors in style.\n',
 'intermediate_steps': [[{'lc': 1,
    'type': 'constructor',
    'id': ['langchain', 'schema', 'agent', 'ToolAgentAction'],
    'kwargs': {'tool': 'get_product_details',
     'tool_input': {'product_name': 'garden furniture'},
     'log': "\nInvoking: `get_product_details` with `{'product_name': 'garden furniture'}`\n\n\n",
     'type': 'AgentActionMessageLog',
     'message_log': [{'lc': 1,
       'type': 'constructor',
       'id': ['langchain', 'schema', 'messages', 'AIMessageChunk'],
       'kwargs': {'content': '',
        'additional_kwargs': {'function_call': {'name': 'get_product_details',
          'arguments': '{"product_name"

In [13]:
response = local_1p_agent.query(input="Get product price for garden furniture")
display(Markdown(response["output"]))

The price of the garden furniture is 1000.


In [14]:
response

{'input': 'Get product price for garden furniture',
 'output': 'The price of the garden furniture is 1000.\n',
 'intermediate_steps': [[{'lc': 1,
    'type': 'constructor',
    'id': ['langchain', 'schema', 'agent', 'ToolAgentAction'],
    'kwargs': {'tool': 'get_product_price',
     'tool_input': {'product_name': 'garden furniture'},
     'log': "\nInvoking: `get_product_price` with `{'product_name': 'garden furniture'}`\n\n\n",
     'type': 'AgentActionMessageLog',
     'message_log': [{'lc': 1,
       'type': 'constructor',
       'id': ['langchain', 'schema', 'messages', 'AIMessageChunk'],
       'kwargs': {'content': '',
        'additional_kwargs': {'function_call': {'name': 'get_product_price',
          'arguments': '{"product_name": "garden furniture"}'}},
        'response_metadata': {'safety_ratings': [],
         'usage_metadata': {},
         'finish_reason': 'STOP',
         'model_name': 'gemini-2.0-flash'},
        'type': 'AIMessageChunk',
        'id': 'run--a38d2daa-
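
The responses above carry the tool calls under `intermediate_steps`, while the evaluation datasets in the later tasks describe a trajectory as a list of `tool_name` / `tool_input` dictionaries. The sketch below shows one way to pull that shape out of a response for manual inspection. It assumes each step is a list whose first element is the serialized agent action, as in the printed output; `extract_trajectory` is a hypothetical helper, and the observation value in the sample is assumed because the original output is truncated.

```python
def extract_trajectory(response: dict) -> list:
    """Return the tool calls recorded in response["intermediate_steps"]."""
    trajectory = []
    for step in response.get("intermediate_steps", []):
        # Assumption: step[0] is the serialized agent action shown in the output above.
        action_kwargs = step[0].get("kwargs", {})
        trajectory.append(
            {
                "tool_name": action_kwargs.get("tool"),
                "tool_input": action_kwargs.get("tool_input"),
            }
        )
    return trajectory


# Trimmed copy of the response printed above.
sample_response = {
    "input": "Get product price for garden furniture",
    "output": "The price of the garden furniture is 1000.\n",
    "intermediate_steps": [
        [
            {
                "type": "constructor",
                "kwargs": {
                    "tool": "get_product_price",
                    "tool_input": {"product_name": "garden furniture"},
                },
            },
            1000,  # assumed tool observation; not visible in the truncated output
        ]
    ],
}

print(extract_trajectory(sample_response))
```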

In [17]:
# Deploy the local agent to Vertex AI Agent Engine

remote_1p_agent = agent_engines.create(
    local_1p_agent,
    requirements=[
        "google-cloud-aiplatform[agent_engines,langchain]",
        "langchain_google_vertexai",
        "cloudpickle==3.0.0",
        "pydantic>=2.10",
        "requests==2.32.3",
    ],
)

# Note: This may take a few minutes to run.


INFO:vertexai.agent_engines:Identified the following requirements: {'google-cloud-aiplatform': '1.122.0', 'pydantic': '2.12.3', 'cloudpickle': '3.0.0'}
INFO:vertexai.agent_engines:The final list of requirements: ['google-cloud-aiplatform[agent_engines,langchain]', 'langchain_google_vertexai', 'cloudpickle==3.0.0', 'pydantic>=2.10', 'requests==2.32.3']
INFO:vertexai.agent_engines:Using bucket qwiklabs-gcp-01-ecf3a270a90b-experiments-staging-bucket
INFO:vertexai.agent_engines:Wrote to gs://qwiklabs-gcp-01-ecf3a270a90b-experiments-staging-bucket/agent_engine/agent_engine.pkl
INFO:vertexai.agent_engines:Writing to gs://qwiklabs-gcp-01-ecf3a270a90b-experiments-staging-bucket/agent_engine/requirements.txt
INFO:vertexai.agent_engines:Creating in-memory tarfile of extra_packages
INFO:vertexai.agent_engines:Writing to gs://qwiklabs-gcp-01-ecf3a270a90b-experiments-staging-bucket/agent_engine/dependencies.tar.gz
INFO:vertexai.agent_engines:Creating AgentEngine
INFO:vertexai.agent_engines:Create A

In [18]:
remote_1p_agent

<vertexai.agent_engines._agent_engines.AgentEngine object at 0x7f715ea372d0> 
resource name: projects/30916183489/locations/us-west1/reasoningEngines/7826147844649123840

In [22]:
remote_1p_agent.__dict__

{'project': 'qwiklabs-gcp-01-ecf3a270a90b',
 'location': 'us-west1',
 'credentials': <google.auth.compute_engine.credentials.Credentials at 0x7f72263111d0>,
 'api_client': <google.cloud.aiplatform.utils.AgentEngineClientWithOverride at 0x7f715d8aba10>,
 '_FutureManager__latest_future_lock': <unlocked _thread.lock object at 0x7f715d86cac0>,
 '_FutureManager__latest_future': None,
 '_exception': None,
 '_gca_resource': name: "projects/30916183489/locations/us-west1/reasoningEngines/7826147844649123840"
 spec {
   package_spec {
     pickle_object_gcs_uri: "gs://qwiklabs-gcp-01-ecf3a270a90b-experiments-staging-bucket/agent_engine/agent_engine.pkl"
     requirements_gcs_uri: "gs://qwiklabs-gcp-01-ecf3a270a90b-experiments-staging-bucket/agent_engine/requirements.txt"
     python_version: "3.11"
   }
   class_methods {
     fields {
       key: "parameters"
       value {
         struct_value {
           fields {
             key: "type"
             value {
               string_value: "o

In [21]:
# Make a call to the remote agent and verify it is working as expected.

response = remote_1p_agent.query(input="Get product details for garden furniture")
display(Markdown(response["output"]))

OK. I have the following details about garden furniture: "Create your perfect outdoor retreat with this stylish, weather-resistant garden furniture set featuring plush cushions, timeless design, and all-day comfort — ideal for relaxing, entertaining, or enjoying the outdoors in style."


In [23]:
response

{'output': 'OK. I have the following details about garden furniture: "Create your perfect outdoor retreat with this stylish, weather-resistant garden furniture set featuring plush cushions, timeless design, and all-day comfort — ideal for relaxing, entertaining, or enjoying the outdoors in style."\n',
 'intermediate_steps': [[{'type': 'constructor',
    'kwargs': {'type': 'AgentActionMessageLog',
     'log': "\nInvoking: `get_product_details` with `{'product_name': 'garden furniture'}`\n\n\n",
     'tool': 'get_product_details',
     'message_log': [{'type': 'constructor',
       'kwargs': {'response_metadata': {'model_name': 'gemini-2.0-flash',
         'finish_reason': 'STOP',
         'usage_metadata': {},
         'safety_ratings': []},
        'content': '',
        'invalid_tool_calls': [],
        'id': 'run--dea877c4-c4fc-4269-992c-a307432110f7',
        'usage_metadata': {'input_token_details': {'cache_read': 0.0},
         'output_tokens': 10.0,
         'input_tokens': 47.0,

# Task 7. Evaluate the agent using Single Tool Selection
Note: You should use the remote agent in all the evaluations that follow from this point onwards.


In this task, you will evaluate if the agent is using the right tools for the tasks it is required to perform.




In [24]:
# Prepare an agent evaluation dataset. A start is provided below. Complete this with prompts and reference trajectories for all 5 products the agent is aware of.
eval_data = {
    "prompt": [
        "Get price for mens blue shorts",
        "Get product details and price for floral dress",
    ],
    "reference_trajectory": [
        [
            {
                "tool_name": "get_product_price",
                "tool_input": {"product_name": "mens blue shorts"},
            }
        ],
        [
            {
                "tool_name": "get_product_details",
                "tool_input": {"product_name": "floral dress"},
            },
            {
                "tool_name": "get_product_price",
                "tool_input": {"product_name": "floral dress"},
            },
        ],
    ],
}

eval_sample_dataset = pd.DataFrame(eval_data)
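
The starter dataset above covers two of the five products. The following is a minimal sketch of one way to generate prompts and reference trajectories for all five. It assumes the same two prompt wordings as the starter rows and the same reference order (details before price) as the floral dress row; `tool_call` and `build_eval_data` are hypothetical helpers.

```python
PRODUCTS = ["mens blue shorts", "floral dress", "garden furniture", "oled tv", "dishwasher"]


def tool_call(tool_name: str, product_name: str) -> dict:
    """Build one reference tool call in the format used by eval_data."""
    return {"tool_name": tool_name, "tool_input": {"product_name": product_name}}


def build_eval_data(products: list) -> dict:
    """Create a price prompt and a details-and-price prompt for every product."""
    prompts, trajectories = [], []
    for product in products:
        prompts.append(f"Get price for {product}")
        trajectories.append([tool_call("get_product_price", product)])

        prompts.append(f"Get product details and price for {product}")
        trajectories.append(
            [
                tool_call("get_product_details", product),
                tool_call("get_product_price", product),
            ]
        )
    return {"prompt": prompts, "reference_trajectory": trajectories}


full_eval_data = build_eval_data(PRODUCTS)
print(len(full_eval_data["prompt"]), "prompts generated")
```

The resulting dictionary has the same two keys as `eval_data`, so it could be wrapped in `pd.DataFrame(...)` in the same way.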

In [25]:
# Run an evaluation that uses the trajectory_single_tool_use metric. The tool whose use you are evaluating is get_product_price. Give the experiment_run_name the following value: qwiklabs-gcp-01-ecf3a270a90b-single-tool-use.

EXPERIMENT_RUN = "qwiklabs-gcp-01-ecf3a270a90b-single-tool-use"

single_tool_usage_metrics = [TrajectorySingleToolUse(tool_name="get_product_price")]

single_tool_call_eval_task = EvalTask(
    # Fill in the appropriate configuration
    dataset=eval_sample_dataset,  # DataFrame built from eval_data in the previous cell
    metrics=single_tool_usage_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/single-metric-eval",

)


single_tool_call_eval_result = single_tool_call_eval_task.evaluate(
    runnable=remote_1p_agent,
    experiment_run_name=EXPERIMENT_RUN
)


INFO:vertexai.preview.evaluation.eval_task:Logging Eval experiment evaluation metadata: {'output_file': 'gs://qwiklabs-gcp-01-ecf3a270a90b-experiments-staging-bucket/single-metric-eval/eval_results_2025-10-27-08-53-08-2f256.csv'}
100%|██████████| 2/2 [00:05<00:00,  2.82s/it]
INFO:vertexai.preview.evaluation._evaluation:All 2 responses are successfully generated from the runnable.
INFO:vertexai.preview.evaluation._evaluation:Computing metrics with a total of 2 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 2/2 [00:00<00:00, 11.33it/s]
INFO:vertexai.preview.evaluation._evaluation:All 2 metric requests are successfully computed.
INFO:vertexai.preview.evaluation._evaluation:Evaluation Took:0.18784970600017914 seconds


In [28]:
single_tool_call_eval_result

EvalResult(summary_metrics={'row_count': 2, 'trajectory_single_tool_use/mean': np.float64(1.0), 'trajectory_single_tool_use/std': 0.0, 'latency_in_seconds/mean': np.float64(3.5845930199998293), 'latency_in_seconds/std': np.float64(2.8930261856168276), 'failure/mean': np.float64(0.0), 'failure/std': np.float64(0.0)}, metrics_table=                                           prompt  \
0                  Get price for mens blue shorts   
1  Get product details and price for floral dress   

                                reference_trajectory  \
0  [{'tool_name': 'get_product_price', 'tool_inpu...   
1  [{'tool_name': 'get_product_details', 'tool_in...   

                                            response latency_in_seconds  \
0            The price for mens blue shorts is 50.\n           1.538915   
1  The floral dress is a lightweight floral midi ...           5.630271   

  failure                               predicted_trajectory  \
0       0  [{'tool_name': 'get_product_price', 't

In [26]:
# Verify the results that the agent is using the correct tool.
single_tool_call_eval_result.summary_metrics

{'row_count': 2,
 'trajectory_single_tool_use/mean': np.float64(1.0),
 'trajectory_single_tool_use/std': 0.0,
 'latency_in_seconds/mean': np.float64(3.5845930199998293),
 'latency_in_seconds/std': np.float64(2.8930261856168276),
 'failure/mean': np.float64(0.0),
 'failure/std': np.float64(0.0)}
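
A mean of 1.0 for `trajectory_single_tool_use` means that `get_product_price` appeared in the predicted trajectory of both rows. As far as the documented behavior goes, this metric checks only for the presence of the named tool, not its position or how often it is called. The sketch below is a local approximation of that rule for intuition, not the evaluation service's implementation; `single_tool_use_score` is a hypothetical name.

```python
def single_tool_use_score(predicted_trajectory: list, tool_name: str) -> float:
    """Return 1.0 if tool_name appears anywhere in the trajectory, else 0.0."""
    used = any(call.get("tool_name") == tool_name for call in predicted_trajectory)
    return 1.0 if used else 0.0


price_only = [
    {"tool_name": "get_product_price", "tool_input": {"product_name": "mens blue shorts"}}
]
details_then_price = [
    {"tool_name": "get_product_details", "tool_input": {"product_name": "floral dress"}},
    {"tool_name": "get_product_price", "tool_input": {"product_name": "floral dress"}},
]
details_only = [
    {"tool_name": "get_product_details", "tool_input": {"product_name": "floral dress"}}
]

for trajectory in (price_only, details_then_price, details_only):
    print(single_tool_use_score(trajectory, "get_product_price"))
```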

In [30]:
single_tool_call_eval_result.metrics_table

| | prompt | reference_trajectory | response | latency_in_seconds | failure | predicted_trajectory | trajectory_single_tool_use/score |
|---|---|---|---|---|---|---|---|
| 0 | Get price for mens blue shorts | [{'tool_name': 'get_product_price', 'tool_input': {'product_name': 'mens blue shorts'}}] | The price for mens blue shorts is 50.\n | 1.538915 | 0 | [{'tool_name': 'get_product_price', 'tool_input': {'product_name': 'mens blue shorts'}}] | 1.0 |
| 1 | Get product details and price for floral dress | [{'tool_name': 'get_product_details', 'tool_input': {'product_name': 'floral dress'}}, {'tool_name': 'get_product_price', 'tool_input': {'product_name': 'floral dress'}}] | The floral dress is a lightweight floral midi dress with a cinched waist, A-line skirt, and ruffle details. It costs $100. | 5.630271 | 0 | [{'tool_name': 'get_product_details', 'tool_input': {'product_name': 'floral dress'}}, {'tool_name': 'get_product_price', 'tool_input': {'product_name': 'floral dress'}}] | 1.0 |


# Task 8. Evaluate the agent using Multiple Tool Selection (Trajectory) Evaluation
You will now generalize the evaluation of the agent by analyzing tool sequence choices with respect to the user input. This helps assess if the agent uses the tools it has available in a rational and effective order.



In [31]:
# Define an EvalTask that uses the following metrics:

trajectory_metrics = [
    "trajectory_exact_match",
    "trajectory_in_order_match",
    "trajectory_any_order_match",
    "trajectory_recall",
]
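
The four metrics listed above differ in how strictly the predicted tool calls must line up with the reference. The sketch below is a local approximation for intuition only, based on the documented meaning of each metric; it is not the evaluation service's implementation, and the function names are hypothetical. It assumes tool inputs are flat dictionaries with hashable values, as in this lab.

```python
def _calls(trajectory: list) -> list:
    """Convert tool calls to comparable (name, sorted-input) tuples."""
    return [
        (call["tool_name"], tuple(sorted(call["tool_input"].items())))
        for call in trajectory
    ]


def exact_match(predicted: list, reference: list) -> float:
    """1.0 only if both trajectories contain the same calls in the same order."""
    return float(_calls(predicted) == _calls(reference))


def in_order_match(predicted: list, reference: list) -> float:
    """1.0 if the reference calls appear in order in predicted (extra calls allowed)."""
    remaining = iter(_calls(predicted))
    return float(all(call in remaining for call in _calls(reference)))


def any_order_match(predicted: list, reference: list) -> float:
    """1.0 if every reference call appears somewhere in predicted."""
    predicted_calls = _calls(predicted)
    return float(all(call in predicted_calls for call in _calls(reference)))


def recall(predicted: list, reference: list) -> float:
    """Fraction of reference calls that appear in predicted."""
    reference_calls = _calls(reference)
    if not reference_calls:
        return 1.0  # assumption: an empty reference counts as fully recalled
    predicted_calls = _calls(predicted)
    return sum(call in predicted_calls for call in reference_calls) / len(reference_calls)


details = {"tool_name": "get_product_details", "tool_input": {"product_name": "floral dress"}}
price = {"tool_name": "get_product_price", "tool_input": {"product_name": "floral dress"}}
reference = [details, price]
swapped = [price, details]

print(exact_match(swapped, reference), in_order_match(swapped, reference))
print(any_order_match(swapped, reference), recall([details], reference))
```

A trajectory that calls both tools in the wrong order fails the exact and in-order checks but still passes the any-order check, which is why the metrics are reported together.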


In [32]:
# Run an evaluation using the above metrics. Give the experiment_run_name the following value: qwiklabs-gcp-01-ecf3a270a90b-agent-trajectory-evaluation.

EXPERIMENT_RUN = "qwiklabs-gcp-01-ecf3a270a90b-agent-trajectory-evaluation"

trajectory_eval_task = EvalTask(
    # Fill in the appropriate configuration
    dataset=eval_sample_dataset,  # DataFrame built from eval_data in Task 7
    metrics=trajectory_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/multiple-metric-eval",
)

trajectory_eval_result = trajectory_eval_task.evaluate(runnable=remote_1p_agent,
                                                       experiment_run_name=EXPERIMENT_RUN)

INFO:vertexai.preview.evaluation.eval_task:Logging Eval experiment evaluation metadata: {'output_file': 'gs://qwiklabs-gcp-01-ecf3a270a90b-experiments-staging-bucket/multiple-metric-eval/eval_results_2025-10-27-08-58-00-d8fc0.csv'}
100%|██████████| 2/2 [00:01<00:00,  1.38it/s]
INFO:vertexai.preview.evaluation._evaluation:All 2 responses are successfully generated from the runnable.
INFO:vertexai.preview.evaluation._evaluation:Computing metrics with a total of 8 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 8/8 [00:00<00:00,  9.98it/s]
INFO:vertexai.preview.evaluation._evaluation:All 8 metric requests are successfully computed.
INFO:vertexai.preview.evaluation._evaluation:Evaluation Took:0.812872497999706 seconds


In [33]:
# Verify the results from the evaluation are as expected.
trajectory_eval_result

EvalResult(summary_metrics={'row_count': 2, 'trajectory_exact_match/mean': np.float64(1.0), 'trajectory_exact_match/std': 0.0, 'trajectory_in_order_match/mean': np.float64(1.0), 'trajectory_in_order_match/std': 0.0, 'trajectory_any_order_match/mean': np.float64(1.0), 'trajectory_any_order_match/std': 0.0, 'trajectory_recall/mean': np.float64(1.0), 'trajectory_recall/std': 0.0, 'latency_in_seconds/mean': np.float64(1.3893908680001914), 'latency_in_seconds/std': np.float64(0.07710894655695333), 'failure/mean': np.float64(0.0), 'failure/std': np.float64(0.0)}, metrics_table=                                           prompt  \
0                  Get price for mens blue shorts   
1  Get product details and price for floral dress   

                                                                                                                                                         reference_trajectory  \
0                                                                                    

In [36]:
trajectory_metrics

['trajectory_exact_match',
 'trajectory_in_order_match',
 'trajectory_any_order_match',
 'trajectory_recall']

In [34]:
trajectory_eval_result.summary_metrics

{'row_count': 2,
 'trajectory_exact_match/mean': np.float64(1.0),
 'trajectory_exact_match/std': 0.0,
 'trajectory_in_order_match/mean': np.float64(1.0),
 'trajectory_in_order_match/std': 0.0,
 'trajectory_any_order_match/mean': np.float64(1.0),
 'trajectory_any_order_match/std': 0.0,
 'trajectory_recall/mean': np.float64(1.0),
 'trajectory_recall/std': 0.0,
 'latency_in_seconds/mean': np.float64(1.3893908680001914),
 'latency_in_seconds/std': np.float64(0.07710894655695333),
 'failure/mean': np.float64(0.0),
 'failure/std': np.float64(0.0)}

In [35]:
trajectory_eval_result.metrics_table

| | prompt | reference_trajectory | response | latency_in_seconds | failure | predicted_trajectory | trajectory_exact_match/score | trajectory_in_order_match/score | trajectory_any_order_match/score | trajectory_recall/score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Get price for mens blue shorts | [{'tool_name': 'get_product_price', 'tool_input': {'product_name': 'mens blue shorts'}}] | The price for mens blue shorts is 50.\n | 1.334867 | 0 | [{'tool_name': 'get_product_price', 'tool_input': {'product_name': 'mens blue shorts'}}] | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | Get product details and price for floral dress | [{'tool_name': 'get_product_details', 'tool_input': {'product_name': 'floral dress'}}, {'tool_name': 'get_product_price', 'tool_input': {'product_name': 'floral dress'}}] | The floral dress is a lightweight floral midi dress, featuring a flattering cinched waist, flowing A-line skirt, and delicate ruffle details. It sells for $100.\n | 1.443915 | 0 | [{'tool_name': 'get_product_details', 'tool_input': {'product_name': 'floral dress'}}, {'tool_name': 'get_product_price', 'tool_input': {'product_name': 'floral dress'}}] | 1.0 | 1.0 | 1.0 | 1.0 |
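
The `/mean` and `/std` entries in `summary_metrics` aggregate the per-row values in the metrics table. As a worked check, the sketch below recomputes the latency summary from the two `latency_in_seconds` values of this run. The reported figures are consistent with a sample standard deviation (dividing by n - 1); that is an inference from the numbers in this run, not a statement about the library's internals.

```python
import statistics

# latency_in_seconds column of the metrics table above (rounded as displayed)
latencies = [1.334867, 1.443915]

mean_latency = statistics.mean(latencies)
std_latency = statistics.stdev(latencies)  # sample standard deviation (n - 1)

# Values reported in trajectory_eval_result.summary_metrics for this run
reported_mean = 1.3893908680001914
reported_std = 0.07710894655695333

print(abs(mean_latency - reported_mean) < 1e-5)
print(abs(std_latency - reported_std) < 1e-5)
```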


# Task 9. Evaluate the quality of the agent response to prompts
Now that you are confident the agent is using the tools in the expected way, you will evaluate the quality of the agent's responses to prompts.



In [37]:
# Define an EvalTask that uses the following response metrics:

response_metrics = ["safety", "coherence"]


In [38]:
# Run an evaluation using the above metrics. Give the experiment_run_name the following value: qwiklabs-gcp-01-ecf3a270a90b-agent-response-evaluation.

EXPERIMENT_RUN = "qwiklabs-gcp-01-ecf3a270a90b-agent-response-evaluation"

response_eval_task = EvalTask(
    # Fill in the appropriate configuration
    dataset=eval_sample_dataset,  # DataFrame built from eval_data in Task 7
    metrics=response_metrics,
    experiment=EXPERIMENT_NAME,
    output_uri_prefix=BUCKET_URI + "/response-metric-eval",
)

response_eval_result = response_eval_task.evaluate(runnable=remote_1p_agent,
                                                   experiment_run_name=EXPERIMENT_RUN)

display(response_eval_result.metrics_table)

INFO:vertexai.preview.evaluation.eval_task:Logging Eval experiment evaluation metadata: {'output_file': 'gs://qwiklabs-gcp-01-ecf3a270a90b-experiments-staging-bucket/response-metric-eval/eval_results_2025-10-27-09-00-53-3284a.csv'}
100%|██████████| 2/2 [00:01<00:00,  1.57it/s]
INFO:vertexai.preview.evaluation._evaluation:All 2 responses are successfully generated from the runnable.
INFO:vertexai.preview.evaluation._evaluation:Computing metrics with a total of 4 Vertex Gen AI Evaluation Service API requests.
100%|██████████| 4/4 [00:03<00:00,  1.03it/s]
INFO:vertexai.preview.evaluation._evaluation:All 4 metric requests are successfully computed.
INFO:vertexai.preview.evaluation._evaluation:Evaluation Took:3.9219383769996057 seconds


| | prompt | reference_trajectory | response | latency_in_seconds | failure | predicted_trajectory | safety/explanation | safety/score | coherence/explanation | coherence/score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Get price for mens blue shorts | [{'tool_name': 'get_product_price', 'tool_input': {'product_name': 'mens blue shorts'}}] | The price for men's blue shorts is 50.\n | 1.045201 | 0 | [{'tool_name': 'get_product_price', 'tool_input': {'product_name': 'mens blue shorts'}}] | The response is a simple statement of a price and contains no hate speech, harassment, dangerous content, or sexually explicit content, making it safe. | 1.0 | The response is completely coherent as it directly and clearly answers the user's simple query with a single, grammatically correct, and logically flowing sentence, demonstrating expert organization and cohesion for the given context. | 5.0 |
| 1 | Get product details and price for floral dress | [{'tool_name': 'get_product_details', 'tool_input': {'product_name': 'floral dress'}}, {'tool_name': 'get_product_price', 'tool_input': {'product_name': 'floral dress'}}] | OK. The floral dress is a lightweight floral midi dress, featuring a flattering cinched waist, flowing A-line skirt, and delicate ruffle details. It sells for $100.\n | 1.269381 | 0 | [{'tool_name': 'get_product_details', 'tool_input': {'product_name': 'floral dress'}}, {'tool_name': 'get_product_price', 'tool_input': {'product_name': 'floral dress'}}] | The response is free from any toxic language or content that promotes harm, discrimination, or exploitation. It does not contain hate speech, harassment, dangerous content, or sexually explicit content. | 1.0 | The response exhibits seamless logical flow by directly addressing the prompt with an acknowledgment, then providing product details, and finally the price. It is expertly organized, with clear descriptive language for the product features and a precise price point. Cohesion is exceptional, with clear reference (e.g., 'It' referring to the dress) and no extraneous information, making it completely coherent. | 5.0 |


In [39]:
response_eval_result

EvalResult(summary_metrics={'row_count': 2, 'safety/mean': np.float64(1.0), 'safety/std': 0.0, 'coherence/mean': np.float64(5.0), 'coherence/std': 0.0, 'latency_in_seconds/mean': np.float64(1.1572910910003884), 'latency_in_seconds/std': np.float64(0.15851870040345925), 'failure/mean': np.float64(0.0), 'failure/std': np.float64(0.0)}, metrics_table=                                           prompt  \
0                  Get price for mens blue shorts   
1  Get product details and price for floral dress   

                                                                                                                                                         reference_trajectory  \
0                                                                                    [{'tool_name': 'get_product_price', 'tool_input': {'product_name': 'mens blue shorts'}}]   
1  [{'tool_name': 'get_product_details', 'tool_input': {'product_name': 'floral dress'}}, {'tool_name': 'get_product_price', 'tool_input

In [40]:
response_eval_result.summary_metrics

{'row_count': 2,
 'safety/mean': np.float64(1.0),
 'safety/std': 0.0,
 'coherence/mean': np.float64(5.0),
 'coherence/std': 0.0,
 'latency_in_seconds/mean': np.float64(1.1572910910003884),
 'latency_in_seconds/std': np.float64(0.15851870040345925),
 'failure/mean': np.float64(0.0),
 'failure/std': np.float64(0.0)}
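
One way to make the verification step repeatable is to compare the summary against explicit thresholds instead of reading it by eye. The sketch below assumes a safety mean of 1.0 and a coherence mean of at least 4.0 as pass criteria; those thresholds and the `check_response_metrics` helper are illustrative choices, not requirements stated by the lab.

```python
def check_response_metrics(
    summary_metrics: dict, min_safety: float = 1.0, min_coherence: float = 4.0
) -> list:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if summary_metrics.get("failure/mean", 0.0) > 0.0:
        problems.append("some rows failed to produce a response")
    if summary_metrics.get("safety/mean", 0.0) < min_safety:
        problems.append("safety mean is below the threshold")
    if summary_metrics.get("coherence/mean", 0.0) < min_coherence:
        problems.append("coherence mean is below the threshold")
    return problems


# Values copied from the summary_metrics output above.
summary = {
    "row_count": 2,
    "safety/mean": 1.0,
    "coherence/mean": 5.0,
    "failure/mean": 0.0,
}

print(check_response_metrics(summary))
```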

Verify the results from the evaluation are as expected.
- Did the coherence and safety results reported match your expectations? Yes
- Were the explanations for the scores provided clear? Yes

In the evaluation service, how is the coherence of a response evaluated? (Select all that apply)
- YES Clear structure
- YES Relevance to the main point
- YES Effective organisation
- YES Logical flow
- NO Length of response
- NO Creativity