# Debugging and Optimizing Agents: A Guide to Tracing in Agent Engine

## Overview

[Agent Engine](https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/overview) helps you build and deploy agent-based AI applications that use LLMs and custom tools. Understanding your agent's decision-making process is essential for debugging and optimization, and [Cloud Trace](https://cloud.google.com/trace) lets you explore the tracing data your agent emits, step by step.

<img src="https://storage.googleapis.com/github-repo/generative-ai/gemini/agent-engine/images/cloud-trace-agent.png">

This notebook demonstrates how to:

- **Learn Key Concepts**: Understand the fundamental building blocks of tracing.
- **Enable Tracing**: Turn on tracing in a simple agent.
- **Deploy Your Agent**: Make your tracing-enabled agent available in a production-like environment on Agent Engine.
- **Examine Traces**: Use the Cloud Console and Cloud Trace SDK to access and analyze a specific trace.

By the end of this notebook, you'll be able to leverage tracing to build more robust and efficient AI agents on Vertex AI.

## Concepts

This section introduces the key concepts and terminology related to tracing, which will help as you explore the traces generated by an agent in Agent Engine.

The example below shows a trace in JSON format containing a single span. This span represents a call to a large language model (LLM) and captures details such as the span's identifiers, timing, status, and input and output messages.

### Example trace

```json
{
   "name": "llm",
   "context": {
       "trace_id": "ed7b336d-e71a-46f0-a334-5f2e87cb6cfc",
       "span_id": "ad67332a-38bd-428e-9f62-538ba2fa90d4"
   },
   "span_kind": "LLM",
   "parent_id": "f89ebb7c-10f6-4bf8-8a74-57324d2556ef",
   "start_time": "2023-09-07T12:54:47.597121-06:00",
   "end_time": "2023-09-07T12:54:49.321811-06:00",
   "status_code": "OK",
   "status_message": "",
   "attributes": {
       "llm.input_messages": [
           {
               "message.role": "system",
               "message.content": "You are an expert Q&A system that is trusted around the world.\nAlways answer the query using the provided context information, and not prior knowledge.\nSome rules to follow:\n1. Never directly reference the given context in your answer.\n2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines."
           },
           {
               "message.role": "user",
               "message.content": "Hello?"
           }
       ],
       "output.value": "assistant: Yes I am here",
       "output.mime_type": "text/plain"
   },
   "events": []
}
```
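As a quick illustration of reading this data, a span's duration can be derived from its `start_time` and `end_time`. The sketch below uses only the Python standard library; the `span` dictionary copies the two timestamp fields from the JSON above, and the helper name `span_duration_seconds` is hypothetical.

```python
from datetime import datetime


def span_duration_seconds(span: dict) -> float:
    """Return the elapsed time of a span in seconds.

    Assumes ISO 8601 timestamps with a UTC offset, as in the example trace.
    """
    start = datetime.fromisoformat(span["start_time"])
    end = datetime.fromisoformat(span["end_time"])
    return (end - start).total_seconds()


# Timestamps copied from the example span above.
span = {
    "name": "llm",
    "start_time": "2023-09-07T12:54:47.597121-06:00",
    "end_time": "2023-09-07T12:54:49.321811-06:00",
}

duration = span_duration_seconds(span)
print(f"{span['name']} took {duration:.3f} s")
```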

### Trace

You can think of a [trace](https://opentelemetry.io/docs/concepts/signals/traces/) as a timeline of a request as it travels through your application. A trace is composed of individual spans, with the first (root) span representing the overall request. Each span provides details about a specific operation within the request.

### Span

A [span](https://opentelemetry.io/docs/concepts/signals/traces/#spans) represents a single unit of work, like a function call or an interaction with an LLM. It captures information such as the operation's name, start and end times, and any relevant attributes (metadata). Spans can be nested, showing parent-child relationships between operations.
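To make the parent-child nesting concrete, here is a minimal sketch that rebuilds a span tree from parent IDs. The span names are borrowed from the agent built later in this notebook, but the IDs, the record layout, and the `render_tree` helper are hypothetical and much simpler than real span data.

```python
from collections import defaultdict

# Hypothetical, simplified span records; real spans carry many more fields.
spans = [
    {"span_id": "1", "parent_span_id": None, "name": "AgentExecutor"},
    {"span_id": "2", "parent_span_id": "1", "name": "RunnableSequence"},
    {"span_id": "3", "parent_span_id": "2", "name": "ChatVertexAI"},
    {"span_id": "4", "parent_span_id": "1", "name": "classify_ticket"},
]


def render_tree(spans: list[dict]) -> list[str]:
    """Return one indented line per span, with children under their parent."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_span_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # Root spans have no parent.
    return lines


tree = render_tree(spans)
print("\n".join(tree))
```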

### Span Attribute

[Span attributes](https://opentelemetry.io/docs/concepts/signals/traces/#attributes) are key-value pairs that provide additional context about a span. For instance, an LLM span might have attributes like the model name, prompt text, and token count.
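As an example of putting attributes to work, the sketch below sums a token-count attribute across several LLM spans. The attribute dictionaries are hypothetical, and the key `llm.token_count.total` is an assumption based on the attribute names that appear later in this notebook's output. Values are kept as strings because Cloud Trace stores label values as strings.

```python
# Hypothetical attribute dictionaries for LLM spans. The key name is an
# assumption taken from the trace output shown later in this notebook.
llm_span_attributes = [
    {"llm.token_count.total": "152"},
    {"llm.token_count.total": "127"},
    {"llm.token_count.total": "223"},
    {},  # A span without a token count attribute is skipped.
]


def total_tokens(attribute_dicts, key="llm.token_count.total"):
    """Sum a numeric attribute across spans, ignoring spans that lack it."""
    return sum(int(attrs[key]) for attrs in attribute_dicts if key in attrs)


print(total_tokens(llm_span_attributes))
```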

### Span Kind

[Span kind](https://opentelemetry.io/docs/concepts/signals/traces/#span-kind) categorizes the type of operation a span represents. The traces in this notebook record it in the `openinference.span.kind` label, where common values include:

- `CHAIN`: Links between LLM application steps or the start of a request.
- `LLM`: A call to a large language model.
- `TOOL`: An interaction with an external tool (API, database, etc.).
- `AGENT`: A reasoning block that combines LLM and tool interactions.
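Once spans are tagged with a kind, a simple tally shows where an agent spends its steps. The following sketch is illustrative only: the span list and the kind assigned to each name are hypothetical.

```python
from collections import Counter

# Hypothetical spans labeled with the kinds listed above.
spans = [
    {"name": "AgentExecutor", "kind": "AGENT"},
    {"name": "RunnableSequence", "kind": "CHAIN"},
    {"name": "ChatVertexAI", "kind": "LLM"},
    {"name": "classify_ticket", "kind": "TOOL"},
    {"name": "ChatVertexAI", "kind": "LLM"},
]

# Count how many spans of each kind the trace contains.
kind_counts = Counter(span["kind"] for span in spans)
print(kind_counts.most_common())
```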

## Get started

### Install Vertex AI SDK and other required packages

In [None]:
%pip install --upgrade --user --quiet \
"google-cloud-aiplatform[agent_engines,adk,langchain,ag2,llama_index]>=1.88.0" \
cloudpickle==3.0.0 \
"pydantic>=2.10" \
google-cloud-trace

### Restart runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>

In [1]:
PROJECT_ID = !(gcloud config get-value project)
PROJECT_ID = PROJECT_ID[0]
BUCKET_NAME = PROJECT_ID
STAGING_BUCKET = f"gs://{BUCKET_NAME}"
LOCATION = "us-central1"

import vertexai

vertexai.init(
    project=PROJECT_ID, location=LOCATION, staging_bucket=STAGING_BUCKET
)

## Build and deploy an agent

Let's dive into building a simple agent that utilizes tracing. This agent will use a few custom tools to demonstrate how tracing can provide insights into its workflow.

### Import libraries

Before you start building your agent, you'll import the necessary libraries. These include the Vertex AI SDK, pandas for data analysis, and the Cloud Trace SDK for working with trace data.

In [2]:
from datetime import datetime, timedelta

import pandas as pd
from google.cloud import trace_v1 as trace
from vertexai import agent_engines
from vertexai.agent_engines._agent_engines import _utils
from vertexai.preview.reasoning_engines import LangchainAgent

### Define tools

You'll define a few Python functions to act as tools for your agent. These tools will simulate actions or API calls that a real-world agent might perform. For this example, you'll create tools to classify a customer support ticket, query a knowledge base, and escalate a ticket to a human agent.

In [3]:
def classify_ticket(ticket_text: str) -> str:
    """Classifies a support ticket into a category."""
    # Simulate a call to a classification model
    categories = {
        "general": "Questions and information",
        "billing": "Payment and invoices",
        "technical": "API and SDK developer documentation",
    }
    if "payment" in ticket_text:
        category = "billing"
        description = categories[category]
    elif "settings" in ticket_text:
        category = "technical"
        description = categories[category]
    else:
        category = "general"
        description = categories[category]

    return f"This ticket is in the {category} category for questions related to {description}"


def search_knowledge_base(category: str) -> list[dict]:
    """Searches a knowledge base for relevant articles and documentation links."""
    # Simulate a knowledge base search
    articles = {
        "general": [
            {
                "title": "Contacting support",
                "url": "https://example.com/contact",
            }
        ],
        "billing": [
            {
                "title": "How to update your payment information",
                "url": "https://example.com/billing/update",
            },
        ],
        "technical": [
            {
                "title": "Troubleshooting common login issues",
                "url": "https://example.com/technical/help",
            },
        ],
    }
    return articles.get(category, [])


def escalate_to_human(ticket_text: str) -> str:
    """Initiates escalation to a human agent for outage reports."""
    return "Your ticket has been escalated to a human agent. Please expect a response within 1-2 hours."

### Define agent and enable tracing

Now, let's define your agent using the LangChain template in Agent Engine and the Vertex AI SDK. Enable tracing by setting the `enable_tracing` parameter to `True`, which allows you to capture detailed information about the agent's execution.

In [4]:
agent = LangchainAgent(
    model="gemini-2.0-flash",
    model_kwargs={"temperature": 0},
    tools=[classify_ticket, search_knowledge_base, escalate_to_human],
    enable_tracing=True,
)

### Test your agent locally (with traces!)

Let's test your agent locally by sending it a query. Since you've enabled tracing, you'll be able to see how the agent processes this request and interacts with its tools.

In [None]:
agent.query(
    input="""
    Classify the following ticket into a category and give me a relevant documentation link.

    Support ticket text:
    I need to update my billing information since my payment method has expired.
    """
)

{'input': '\n    Classify the following ticket into a category and give me a relevant documentation link.\n\n    Support ticket text:\n    I need to update my billing information since my payment method has expired.\n    ',
 'output': 'OK. I have classified your ticket as being in the billing category. You can find documentation on how to update your payment information here: https://example.com/billing/update.\n'}

### Get your first trace

Before diving deeper into trace analysis, let's use the Cloud Trace SDK to retrieve a specific trace generated by your local agent. This will give you a concrete example to work with.

In [None]:
client = trace.TraceServiceClient()

In [None]:
result = [
    r
    for r in client.list_traces(
        request=trace.types.ListTracesRequest(
            project_id=PROJECT_ID,
            # Return all traces containing `labels {key: "openinference.span.kind" value: "AGENT"}`
            filter="openinference.span.kind:AGENT",
        )
    )
]

In [None]:
trace_data = client.get_trace(
    project_id=PROJECT_ID, trace_id=result[0].trace_id
).spans[0]
trace_data

span_id: 12699927002695008802
name: "AgentExecutor"
start_time {
  seconds: 1747669286
  nanos: 156548096
}
end_time {
  seconds: 1747669292
  nanos: 186034944
}
labels {
  key: "output.value"
  value: "OK. I have classified your ticket as being in the billing category. You can find documentation on how to update your payment information here: https://example.com/billing/update.\n"
}
labels {
  key: "openinference.span.kind"
  value: "AGENT"
}
labels {
  key: "input.value"
  value: "\n    Classify the following ticket into a category and give me a relevant documentation link.\n\n    Support ticket text:\n    I need to update my billing information since my payment method has expired.\n    "
}
labels {
  key: "g.co/agent"
  value: "opentelemetry-python 1.33.1; google-cloud-trace-exporter 1.9.0"
}
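The printed span reports its timestamps as `seconds` and `nanos` pairs. As a rough sketch, the values can be combined into a duration as shown below. Plain dictionaries stand in for the timestamp fields here, so this runs without a Cloud Trace client; the helper name `to_nanos` is hypothetical.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000

# Values copied from the printed span above.
start = {"seconds": 1747669286, "nanos": 156548096}
end = {"seconds": 1747669292, "nanos": 186034944}


def to_nanos(timestamp: dict) -> int:
    """Collapse a seconds/nanos pair into integer nanoseconds since the epoch."""
    return timestamp["seconds"] * NANOS_PER_SECOND + timestamp["nanos"]


# Integer arithmetic avoids floating-point rounding at nanosecond precision.
duration_ns = to_nanos(end) - to_nanos(start)
started_at = datetime.fromtimestamp(start["seconds"], tz=timezone.utc)

print(f"Started at {started_at.isoformat()}")
print(f"Duration: {duration_ns / NANOS_PER_SECOND:.3f} s")
```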

After you deploy your agent and make remote queries in the following sections, you'll dive into the details for working with trace data in the Cloud Console or using the Python SDK for Cloud Trace.

### Deploy your agent

Now that you've seen how tracing works locally, let's deploy your agent to Agent Engine. This will allow you to send it queries in a production-like environment and observe its behavior through traces.

In [None]:
remote_agent = agent_engines.create(
    agent,
    requirements=[
        "google-cloud-aiplatform[agent_engines,adk,langchain,ag2,llama_index]>=1.88.0",
        "cloudpickle==3.0.0",
        "pydantic>=2.10",
        "google-cloud-trace",
    ],
    display_name="Agent Tracing",
)

INFO:vertexai.agent_engines:Identified the following requirements: {'pydantic': '2.11.4', 'cloudpickle': '3.0.0', 'google-cloud-aiplatform': '1.93.0'}
INFO:vertexai.agent_engines:The final list of requirements: ['google-cloud-aiplatform[agent_engines,adk,langchain,ag2,llama_index]>=1.88.0', 'cloudpickle==3.0.0', 'pydantic>=2.10', 'google-cloud-trace']
INFO:vertexai.agent_engines:Using bucket condiaz-demo
INFO:vertexai.agent_engines:Wrote to gs://condiaz-demo/agent_engine/agent_engine.pkl
INFO:vertexai.agent_engines:Writing to gs://condiaz-demo/agent_engine/requirements.txt
INFO:vertexai.agent_engines:Creating in-memory tarfile of extra_packages
INFO:vertexai.agent_engines:Writing to gs://condiaz-demo/agent_engine/dependencies.tar.gz
INFO:vertexai.agent_engines:Creating AgentEngine
INFO:vertexai.agent_engines:Create AgentEngine backing LRO: projects/502975277769/locations/us-central1/reasoningEngines/2357572832277299200/operations/8354692258171191296
INFO:vertexai.agent_engines:View pro

### Query your deployed agent

With your agent deployed, you can interact with it remotely. Let's send a query and generate some trace data to explore.

In [None]:
# List all agent engines
all_agent_engines = agent_engines.list()
print("All Agent Engines:")
# Use a distinct loop variable so the local `agent` defined earlier is not overwritten
for engine in all_agent_engines:
    print(f"- {engine.display_name} : {engine.resource_name}")

All Agent Engines:
- Agent Tracing : projects/502975277769/locations/us-central1/reasoningEngines/2357572832277299200
- Agent Evaluation : projects/502975277769/locations/us-central1/reasoningEngines/2130141051095089152
- Currency Exchange Agent : projects/502975277769/locations/us-central1/reasoningEngines/2598515412341620736
- ADK Agent : projects/502975277769/locations/us-central1/reasoningEngines/8146668678285361152
- Agent Engine with LangGraph : projects/502975277769/locations/us-central1/reasoningEngines/8027956606857641984
-  : projects/502975277769/locations/us-central1/reasoningEngines/6115052665132023808
-  : projects/502975277769/locations/us-central1/reasoningEngines/6261419653021564928
- Agent Engine with LangGraph : projects/502975277769/locations/us-central1/reasoningEngines/8024271043881336832
-  : projects/502975277769/locations/us-central1/reasoningEngines/416565373345726464
-  : projects/502975277769/locations/us-central1/reasoningEngines/1442260188479356928
-  : pr
-  : projects/502975277769/locations/us-central1/reasoningEngines/4996339166288543744
-  : projects/502975277769/locations/us-central1/reasoningEngines/2895409940120207360
-  : projects/502975277769/locations/us-central1/reasoningEngines/1529640576561971200

In [None]:
from vertexai import agent_engines

# ID of the "Agent Tracing" engine from the listing above
RESOURCE_ID = "2357572832277299200"

remote_agent = agent_engines.get(RESOURCE_ID)
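Rather than copying the ID by hand, you could extract it from the full resource name printed in the listing. This is a minimal sketch; the helper `resource_id_from_name` is hypothetical and assumes the `projects/.../locations/.../reasoningEngines/ID` layout seen in the output above.

```python
def resource_id_from_name(resource_name: str) -> str:
    """Return the trailing ID from a reasoningEngines resource name."""
    parts = resource_name.split("/")
    if len(parts) < 2 or parts[-2] != "reasoningEngines":
        raise ValueError(f"Unexpected resource name: {resource_name!r}")
    return parts[-1]


# Resource name copied from the listing output above.
name = "projects/502975277769/locations/us-central1/reasoningEngines/2357572832277299200"
print(resource_id_from_name(name))
```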

In [None]:
remote_agent.query(
    input="""
    Classify the following ticket into a category and route the customer accordingly:

    Support ticket text:
    I am unable to make any API calls and I need to report an outage in the system
    """,
)

{'input': '\n    Classify the following ticket into a category and route the customer accordingly:\n\n    Support ticket text:\n    I am unable to make any API calls and I need to report an outage in the system\n    ',
 'output': 'OK. I have classified the ticket as "Questions and information" but I am also escalating to a human agent due to the outage report.\n'}

## Working with traces using `pandas`

For more programmatic analysis, you can use the pandas library to work with trace data. You'll fetch traces, convert them to DataFrames, and then use pandas' functionality to explore the trace data.

In [None]:
result = [
    r
    for r in client.list_traces(
        request=trace.types.ListTracesRequest(
            project_id=PROJECT_ID,
            # Return all traces containing `labels {key: "openinference.span.kind" value: "AGENT"}`
            filter="openinference.span.kind:AGENT",
        )
    )
]

In [None]:
trace_data = client.get_trace(
    project_id=PROJECT_ID, trace_id=result[0].trace_id
)

In [None]:
spans = pd.DataFrame.from_records(
    [_utils.to_dict(span) for span in trace_data.spans]
)
spans.head()

| | span_id | name | start_time | end_time | labels | parent_span_id |
|---|---|---|---|---|---|---|
| 0 | 12699927002695008802 | AgentExecutor | 2025-05-19T15:41:26.156548096Z | 2025-05-19T15:41:32.186034944Z | {'g.co/agent': 'opentelemetry-python 1.33.1; g... | NaN |
| 1 | 14898676473061143836 | RunnableSequence | 2025-05-19T15:41:26.157769984Z | 2025-05-19T15:41:28.300080896Z | {'g.co/agent': 'opentelemetry-python 1.33.1; g... | 12699927002695008802 |
| 2 | 16344268206725054949 | classify_ticket | 2025-05-19T15:41:28.505606912Z | 2025-05-19T15:41:28.506562048Z | {'g.co/agent': 'opentelemetry-python 1.33.1; g... | 12699927002695008802 |
| 3 | 11306840170046162036 | `RunnableParallel<input,agent_scratchpad>` | 2025-05-19T15:41:28.795064832Z | 2025-05-19T15:41:29.114667776Z | {'g.co/agent': 'opentelemetry-python 1.33.1; g... | 8151783624996438012 |
| 4 | 10066278194696982697 | ChatPromptTemplate | 2025-05-19T15:41:29.253547008Z | 2025-05-19T15:41:29.254246912Z | {'g.co/agent': 'opentelemetry-python 1.33.1; g... | 8151783624996438012 |
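One common next step is ranking spans by duration to find the slowest operations. The sketch below does this with the standard library on three rows copied from the output above. Because the timestamps carry nanosecond precision, which `datetime` cannot represent, the hypothetical helper `to_epoch_nanos` parses the fractional part separately.

```python
import calendar
import time

# Rows copied from the DataFrame output above (name, start_time, end_time).
rows = [
    ("AgentExecutor", "2025-05-19T15:41:26.156548096Z", "2025-05-19T15:41:32.186034944Z"),
    ("RunnableSequence", "2025-05-19T15:41:26.157769984Z", "2025-05-19T15:41:28.300080896Z"),
    ("classify_ticket", "2025-05-19T15:41:28.505606912Z", "2025-05-19T15:41:28.506562048Z"),
]


def to_epoch_nanos(timestamp: str) -> int:
    """Parse an RFC 3339 UTC timestamp with up to nanosecond precision."""
    whole, _, fraction = timestamp.rstrip("Z").partition(".")
    seconds = calendar.timegm(time.strptime(whole, "%Y-%m-%dT%H:%M:%S"))
    nanos = int(fraction.ljust(9, "0")[:9]) if fraction else 0
    return seconds * 1_000_000_000 + nanos


durations_ms = {
    name: (to_epoch_nanos(end) - to_epoch_nanos(start)) / 1_000_000
    for name, start, end in rows
}

# Slowest span first.
for name, ms in sorted(durations_ms.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18}{ms:>12.3f} ms")
```

With a DataFrame in hand, the same idea applies to the `start_time` and `end_time` columns directly.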


In [None]:
spans[spans["name"] == "ChatVertexAI"]

| | span_id | name | start_time | end_time | labels | parent_span_id |
|---|---|---|---|---|---|---|
| 5 | 8420433669230240208 | ChatVertexAI | 2025-05-19T15:41:29.458071040Z | 2025-05-19T15:41:29.873081088Z | {'g.co/agent': 'opentelemetry-python 1.33.1; g... | 8151783624996438012 |
| 11 | 9725482379096883415 | ChatVertexAI | 2025-05-19T15:41:27.196755968Z | 2025-05-19T15:41:27.940526080Z | {'g.co/agent': 'opentelemetry-python 1.33.1; g... | 14898676473061143836 |
| 22 | 18026471713224543574 | ChatVertexAI | 2025-05-19T15:41:31.251062016Z | 2025-05-19T15:41:31.731382016Z | {'llm.invocation_parameters': '{"model_name": ... | 18317904460723700631 |


In [None]:
spans[spans["name"] == "ChatVertexAI"].labels.apply(pd.Series)

Unnamed: 0,g.co/agent,llm.input_messages.1.message.role,llm.invocation_parameters,metadata,llm.input_messages.2.message.role,output.mime_type,llm.input_messages.0.message.role,llm.token_count.total,output.value,llm.input_messages.0.message.content,...,llm.output_messages.0.message.role,llm.input_messages.1.message.function_call_name,llm.input_messages.2.message.tool_call_id,llm.input_messages.4.message.role,llm.output_messages.0.message.content,llm.input_messages.3.message.function_call_name,llm.input_messages.3.message.function_call_arguments_json,llm.input_messages.4.message.content,llm.input_messages.4.message.tool_call_id,llm.input_messages.3.message.role
5,opentelemetry-python 1.33.1; google-cloud-trac...,assistant,"{""model_name"": ""gemini-2.0-flash"", ""temperatur...","{""ls_provider"": ""google_vertexai"", ""ls_model_n...",tool,application/json,user,152,"{""generations"": [[{""text"": """", ""generation_inf...",\n Classify the following ticket into a cat...,...,assistant,classify_ticket,791a4459-e2ae-4dae-9c5f-bacfc05307eb,,,,,,,
11,opentelemetry-python 1.33.1; google-cloud-trac...,,"{""model_name"": ""gemini-2.0-flash"", ""temperatur...","{""ls_provider"": ""google_vertexai"", ""ls_model_n...",,application/json,user,127,"{""generations"": [[{""text"": """", ""generation_inf...",\n Classify the following ticket into a cat...,...,assistant,,,,,,,,,
22,opentelemetry-python 1.33.1; google-cloud-trac...,assistant,"{""model_name"": ""gemini-2.0-flash"", ""temperatur...","{""ls_provider"": ""google_vertexai"", ""ls_model_n...",tool,application/json,user,223,"{""generations"": [[{""text"": ""OK. I have classif...",\n Classify the following ticket into a cat...,...,assistant,classify_ticket,791a4459-e2ae-4dae-9c5f-bacfc05307eb,tool,OK. I have classified your ticket as being in ...,search_knowledge_base,"{""category"": ""billing""}","[{""title"": ""How to update your payment informa...",2df1c81e-339b-48ce-819e-b6349ddb828c,assistant
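The column names above show that each message is flattened into indexed keys such as `llm.input_messages.1.message.role`. As a sketch of how these could be regrouped into one dictionary per message, consider the following. The sample labels use values from the first row of the output, and the helper `collect_input_messages` is hypothetical.

```python
import re

# A few flattened labels, with values taken from the output above.
labels = {
    "llm.input_messages.0.message.role": "user",
    "llm.input_messages.1.message.role": "assistant",
    "llm.input_messages.1.message.function_call_name": "classify_ticket",
    "llm.input_messages.2.message.role": "tool",
    "llm.token_count.total": "152",
}

# Captures the message index and the field name that follows "message.".
MESSAGE_KEY = re.compile(r"^llm\.input_messages\.(\d+)\.message\.(.+)$")


def collect_input_messages(labels: dict) -> list[dict]:
    """Group flattened message labels into one dict per message, in order."""
    messages = {}
    for key, value in labels.items():
        match = MESSAGE_KEY.match(key)
        if match:
            index, field = int(match.group(1)), match.group(2)
            messages.setdefault(index, {})[field] = value
    return [messages[index] for index in sorted(messages)]


for message in collect_input_messages(labels):
    print(message)
```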


## Exploring traces with the Python SDK for Cloud Trace

The Cloud Trace Python SDK provides even more flexibility for working with trace data. You'll use it to narrow down traces by date and time, by filter expression (labels or span names), and by view type.

**Filter by date and time**

In [None]:
# Calculate the start and end times
now = datetime.utcnow()
yesterday = now - timedelta(hours=24)

# Format the dates as ISO 8601 strings with 'Z' for UTC
end_time = now.isoformat() + "Z"
start_time = yesterday.isoformat() + "Z"

# Request a filtered list of traces by date and time
result = client.list_traces(
    request=trace.types.ListTracesRequest(
        project_id=PROJECT_ID,
        start_time=start_time,
        end_time=end_time,
    )
)

for count, r in enumerate(result):
    if count >= 5:
        break
    print(r)

project_id: "condiaz-demo"
trace_id: "03f094f3b09a93b145824346fb3810a5"
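Note that `datetime.utcnow()` is deprecated as of Python 3.12. A timezone-aware alternative for building the same time window might look like the sketch below; the helper `utc_window` is hypothetical, and it only produces the timestamp strings without calling Cloud Trace.

```python
from datetime import datetime, timedelta, timezone


def utc_window(hours: int, now: datetime) -> tuple[str, str]:
    """Return (start, end) as RFC 3339 strings ending in 'Z'.

    `now` must be timezone-aware; it is converted to UTC before formatting.
    """
    end = now.astimezone(timezone.utc)
    start = end - timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    return start.strftime(fmt), end.strftime(fmt)


# Window covering the last 24 hours.
window_start, window_end = utc_window(24, datetime.now(timezone.utc))
print(window_start, window_end)
```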



**Filter by root span name**

In [None]:
result = client.list_traces(
    request=trace.types.ListTracesRequest(
        project_id=PROJECT_ID,
        # Return traces where any root span's name starts with AgentExecutor
        filter="root:AgentExecutor",
    )
)

for count, r in enumerate(result):
    if count >= 5:
        break
    print(r)

project_id: "condiaz-demo"
trace_id: "03f094f3b09a93b145824346fb3810a5"

project_id: "condiaz-demo"
trace_id: "f42fd3a1ab0c8bfe7c4a1c42c015396b"
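Filter terms can also be combined. The sketch below composes a filter string from a root span name and label pairs, based on my reading of the Cloud Trace filter syntax: space-separated terms are ANDed, and a leading `+` requests an exact match instead of a prefix match. The helper `build_trace_filter` is hypothetical, does not quote values containing spaces, and should be checked against the API reference before use.

```python
def build_trace_filter(root_name=None, labels=None, exact=False):
    """Compose a Cloud Trace filter string from a root span name and labels.

    Terms are joined with spaces, which the filter syntax treats as AND.
    A leading "+" requests an exact match instead of a prefix match.
    """
    prefix = "+" if exact else ""
    terms = []
    if root_name:
        terms.append(f"{prefix}root:{root_name}")
    for key, value in (labels or {}).items():
        terms.append(f"{prefix}{key}:{value}")
    return " ".join(terms)


trace_filter = build_trace_filter(
    root_name="AgentExecutor",
    labels={"openinference.span.kind": "AGENT"},
)
print(trace_filter)
```

The resulting string could then be passed as the `filter` argument of `ListTracesRequest`.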



**Choose a view type**

In [None]:
result = client.list_traces(
    request=trace.types.ListTracesRequest(
        project_id=PROJECT_ID,
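        # MINIMAL returns only the project and trace IDs. Swap in ROOTSPAN or
        # COMPLETE to include the root span or all spans, respectively.
        view=trace.types.ListTracesRequest.ViewType.MINIMAL,
        # view=trace.types.ListTracesRequest.ViewType.ROOTSPAN,
        # view=trace.types.ListTracesRequest.ViewType.COMPLETE,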
    )
)

for count, r in enumerate(result):
    if count >= 5:
        break
    print(r)

project_id: "condiaz-demo"
trace_id: "03f094f3b09a93b145824346fb3810a5"

project_id: "condiaz-demo"
trace_id: "f42fd3a1ab0c8bfe7c4a1c42c015396b"



## Cleaning up

After you've finished experimenting, it's good practice to clean up your cloud resources. Delete the deployed Agent Engine instance to avoid unexpected charges on your Google Cloud account.

In [None]:
remote_agent.delete()