[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/drive/folders/1IrwoNrb3AWLAhAqjlAkJNYa39p9eT9ui?usp=sharing)


# Flotorch Agent Evaluation - All Metrics Combined

This comprehensive notebook demonstrates how to measure and analyze **all agent evaluation metrics** for **Flotorch ADK agents** using the **Flotorch Eval** framework.

The evaluation relies on **OpenTelemetry Traces** generated during the agent's run to provide comprehensive insights across multiple dimensions of agent performance.

**Key Feature**: This notebook uses **ONE agent** and **ONE AgentEvaluator** to evaluate all metrics, making it efficient and easy to use.

---

## Metrics Covered

This notebook includes evaluation for the following metrics:

1. **Latency Metric** (`latency_summary`) - Measures agent response time and performance breakdown
2. **Trajectory Evaluation with LLM** (`trajectory_evaluation_with_llm`) - Assesses overall trajectory quality using LLM-based evaluation
3. **Trajectory Evaluation with Reference** (`trajectory_evaluation_with_reference`) - Compares trajectory against a reference trajectory
4. **Tool Call Accuracy** (`toolcall_accuracy`) - Evaluates accuracy and appropriateness of tool usage decisions
5. **Agent Goal Accuracy** (`agent_goal_accuracy`) - Assesses whether the agent successfully accomplished the user's true goal
6. **Usage Metric** (`usage_summary`) - Provides detailed breakdown of token usage and costs

---

### Architecture Overview

![Workflow Diagram](diagrams/09_unified-agent_Workflow_Diagram.drawio.png)
*Figure 2: Detailed workflow diagram showing the step-by-step process of unified agent evaluation with all metrics from agent execution through trace collection to comprehensive metric computations.*

---


## Requirements

* Flotorch account with configured models
* Valid Flotorch API key and gateway base URL
* Agent configured with OpenTelemetry tracing enabled
* For Latency Metric: Knowledge Base containing Flotorch documentation (files provided in `agents/flotorch_assistant_kb/`)

---


## Agent Setup in Flotorch Console

**Important**: Before running this notebook, you need to create an agent in the Flotorch Console. This section provides step-by-step instructions on how to set up the agent.

### Step 1: Access Flotorch Console

1. **Log in to Flotorch Console**:
   - Navigate to your Flotorch Console (e.g., `https://dev-console.flotorch.cloud`)
   - Ensure you have the necessary permissions to create agents

2. **Navigate to Agents Section**:
   - Click on **"Agents"** in the left sidebar
   - You should see the "Agent Builder" option selected

### Step 2: Create New Agent

1. **Click "Create FloTorch Agent"**:
   - Look for the blue **"+ Create FloTorch Agent"** button in the top right corner
   - Click it to start creating a new agent

2. **Agent Configuration**:
   - **Agent Name**: Choose a unique name for your agent (e.g., `unified-agent`, `all-metrics-agent`)
     - **Important**: The name should only contain alphanumeric characters and dashes (a-z, A-Z, 0-9, -)
     - **Note**: Copy this agent name - you'll need to use it in the `agent_name` variable later
   - **Description** (Optional): Add a description if desired
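
The naming rule above can be checked locally before the name is pasted into the notebook. The following is a minimal sketch: `is_valid_agent_name` is a hypothetical helper, not part of the Flotorch SDK, and it encodes only the character rule stated here. The console may enforce further constraints, such as length limits.

```python
import re

# Hypothetical helper: encodes only the rule stated above
# (letters, digits, and dashes). The console may enforce more.
_AGENT_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def is_valid_agent_name(name: str) -> bool:
    """Return True if the name uses only a-z, A-Z, 0-9, and '-'."""
    return bool(_AGENT_NAME_PATTERN.fullmatch(name))


for candidate in ["unified-agent", "all_metrics agent"]:
    print(candidate, "->", is_valid_agent_name(candidate))
```

An underscore or a space fails the check, which is the most common slip when reusing a name from another system.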

### Step 3: Configure Agent Details

After creating the agent, you'll be directed to the agent configuration page. Configure the following:

#### Required Configuration:

1. **Model** (`* Model`):
   - **Required**: Select a model from the available options
   - Example: `gpt-model` or any available model from your Flotorch gateway
   - Click the edit icon to configure

2. **Agent Details** (`* Agent Details`):
   - **Required**: Configure agent details
   - **System Prompt**: Copy and paste the following system prompt:

You are a versatile AI assistant capable of handling multiple types of tasks:

1. **Flotorch Documentation Assistant**: When users ask about Flotorch, search the knowledge base to provide accurate, context-aware responses based on Flotorch documentation.

2. **Research Assistant**: When users need information from the web, use web search to find current, relevant information and synthesize comprehensive answers.

3. **Code Reviewer**: When users provide code for review, analyze it thoroughly for correctness, code quality, readability, performance, and provide structured feedback with ratings.

4. **Weather Assistant**: When users ask about weather, use the weather tool to fetch real-time weather data including temperature, wind speed, humidity, and location details.

5. **Travel Planner**: When users request travel planning, help create comprehensive travel plans including flight options, hotel recommendations, daily itineraries, and cost estimates. Use your knowledge and reasoning to provide detailed, practical travel advice.

6. **News Assistant**: When users request news, fetch the latest news articles and provide summaries with key information, prioritizing relevant and recent content.

Always:

- Use the appropriate tools when needed (knowledge base search, web search, weather API, news API)
- Provide clear, accurate, and helpful responses
- Structure your answers logically
- Cite sources when using external information
- Be concise but comprehensive

   - **Goal**: Copy and paste the following goal:

To be a versatile, multi-capability AI assistant that can:

- Answer questions about Flotorch using knowledge base search (RAG)
- Perform web-based research and synthesize information
- Review code and provide structured feedback
- Provide real-time weather information
- Create comprehensive travel plans
- Fetch and summarize news articles

The agent should intelligently select and use the appropriate tools based on user queries, providing accurate, helpful, and well-structured responses for each type of task.

#### Optional Configuration:

1. **Tools**:
   - Tools will be added programmatically via the notebook (see Section 8)
   - You can leave this as "Not Configured" in the console

2. **Input Schema**:
   - Optional: Leave as "Not Configured" for this use case

3. **Output Schema**:
   - Optional: Leave as "Not Configured" for this use case

### Step 4: Publish the Agent

1. **Review Configuration**:
   - Ensure the Model and Agent Details are configured correctly
   - Verify the System Prompt and Goal are set

2. **Publish Agent**:
   - After configuration, click **"Publish"** or **"Make a revision"** to publish the agent
   - Once published, the agent will have a version number (e.g., v1)

3. **Note the Agent Name**:
   - **Important**: Copy the exact agent name you used when creating the agent
   - You will need to replace `<your_agent_name>` in the `agent_name` variable in Section 2.1 (Global Provider Models and Agent Configuration)

### Step 5: Update Notebook Configuration

1. **Update Agent Name**:
   - Navigate to Section 2.1 in this notebook
   - Find the `agent_name` variable
   - Replace `<your_agent_name>` with the exact agent name you created in the console

**Example**:
- If you created an agent named `unified-agent` in the console
- Set `agent_name = "unified-agent"` in the notebook

### Summary of Required vs Optional Settings

| Setting | Required/Optional | Value |
|---------|------------------|-------|
| **Agent Name** | **Required** | Choose a unique name (copy it for notebook) |
| **Model** | **Required** | Select from available models |
| **System Prompt** | **Required** | Use the system prompt provided above |
| **Goal** | **Required** | Use the goal provided above |
| **Tools** | **Optional** | Will be added via notebook code |
| **Input Schema** | **Optional** | Can leave as "Not Configured" |
| **Output Schema** | **Optional** | Can leave as "Not Configured" |

**Note**: The tools (Knowledge Base, Web Search, Weather, News) will be added to the agent programmatically in the notebook code, so you don't need to configure them manually in the console.

---


## 1. Setup and Installation

### Purpose
Install the necessary packages for the Flotorch Evaluation framework required for all agent evaluation metrics.

### Key Components
- **`flotorch-eval`**: Flotorch evaluation framework with all dependencies for all metrics
- **`flotorch[adk]`**: Flotorch ADK (Agent Development Kit) package with all dependencies for building and running Flotorch agents


In [None]:
# Install Flotorch Eval packages
# flotorch-eval: Flotorch evaluation framework with all dependencies
# flotorch[adk]: Flotorch ADK package with all dependencies

%pip install flotorch-eval==2.0.0b1 flotorch[adk]==3.1.0b1


## 2. Authentication and Credentials

### Purpose
Configure your Flotorch API credentials and gateway URL for authentication.

### Key Components
This cell configures the essential authentication and connection parameters:

**Authentication Parameters**:

| Parameter | Description | Example |
|-----------|-------------|---------|
| `FLOTORCH_API_KEY` | Your API authentication key (found in your Flotorch Console). Securely entered using `getpass` to avoid displaying in the notebook | `sk_...` |
| `FLOTORCH_BASE_URL` | Your Flotorch gateway endpoint URL | `https://dev-gateway.flotorch.cloud` |

**Note**: Use secure credential management in production environments.
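
For non-interactive runs, one common option is to read the same two values from environment variables instead of prompting. The sketch below assumes the variables are named `FLOTORCH_API_KEY` and `FLOTORCH_BASE_URL`; that naming is a convention chosen for this example, and `load_credentials` is a hypothetical helper rather than a Flotorch API.

```python
import os
from typing import Mapping, Tuple


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the API key and base URL from an environment-style mapping."""
    api_key = env.get("FLOTORCH_API_KEY", "").strip()
    base_url = env.get("FLOTORCH_BASE_URL", "").strip()
    missing = [
        name
        for name, value in (
            ("FLOTORCH_API_KEY", api_key),
            ("FLOTORCH_BASE_URL", base_url),
        )
        if not value
    ]
    if missing:
        # Fail early instead of sending empty credentials to the gateway
        raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
    return api_key, base_url


# Demonstration with an explicit mapping (placeholder values only)
demo_env = {
    "FLOTORCH_API_KEY": "sk_example",
    "FLOTORCH_BASE_URL": "https://gateway.example.com",
}
api_key, base_url = load_credentials(demo_env)
print("Base URL:", base_url)
```

Passing the mapping as a parameter keeps the helper easy to test without touching the real process environment.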


In [None]:
import getpass  # Prompts for secrets without echoing the input

# Authentication for Flotorch access
FLOTORCH_API_KEY = getpass.getpass("Paste your API key here: ").strip()
if FLOTORCH_API_KEY:
    print("✓ FLOTORCH_API_KEY set successfully")
else:
    print("✗ FLOTORCH_API_KEY not set")

FLOTORCH_BASE_URL = input("Paste your Flotorch Base URL here: ").strip()  # e.g. https://dev-gateway.flotorch.cloud
print(f"✓ FLOTORCH_BASE_URL set: {FLOTORCH_BASE_URL}")

if FLOTORCH_API_KEY and FLOTORCH_BASE_URL:
    print("✓ All credentials configured successfully!")
else:
    print("✗ Credentials incomplete; re-run this cell")


### 2.1. Global Provider Models and Agent Configuration

### Purpose
Define available models from the Flotorch gateway and configure agent-specific parameters.

### Key Components

**Global Provider Models**: These are the available models from the Flotorch gateway that can be used for evaluation and agent operations:

| Model Variable | Model Name | Description |
|----------------|------------|-------------|
| `MODEL_CLAUDE_HAIKU` | `flotorch/flotorch-claude-haiku-4-5` | Claude Haiku model via Flotorch gateway |
| `MODEL_CLAUDE_SONNET` | `flotorch/flotorch-claude-sonnet-3-5-v2` | Claude Sonnet model via Flotorch gateway |
| `MODEL_AWS_NOVA_PRO` | `flotorch/flotorch-aws-nova-pro` | AWS Nova Pro model via Flotorch gateway |
| `MODEL_AWS_NOVA_LITE` | `flotorch/flotorch-aws-nova-lite` | AWS Nova Lite model via Flotorch gateway |
| `MODEL_AWS_NOVA_MICRO` | `flotorch/flotorch-aws-nova-micro` | AWS Nova Micro model via Flotorch gateway |

**Agent Configuration Parameters**:

| Parameter | Description | Example |
|-----------|-------------|---------|
| `default_evaluator` | The LLM model used for evaluation (can use MODEL_* variables above) | `MODEL_CLAUDE_SONNET` or `flotorch/flotorch-model` |
| `agent_name` | The name of your Flotorch ADK agent | `flotorch-agent` |
| `app_name` | The application name identifier | `agent-evaluation-app-name` |
| `user_id` | The user identifier | `agent-evaluation-user` |


In [None]:
# ============================================================================
# Global Provider Models (Flotorch Gateway Models)
# ============================================================================
# These models are available from the Flotorch gateway and can be used
# for evaluation, agent operations, and other tasks.

MODEL_CLAUDE_HAIKU = "flotorch/flotorch-claude-haiku-4-5"
MODEL_CLAUDE_SONNET = "flotorch/flotorch-claude-sonnet-3-5-v2"
MODEL_AWS_NOVA_PRO = "flotorch/flotorch-aws-nova-pro"
MODEL_AWS_NOVA_LITE = "flotorch/flotorch-aws-nova-lite"
MODEL_AWS_NOVA_MICRO = "flotorch/flotorch-aws-nova-micro"

print("✓ Global provider models defined")

# The LLM model used for evaluation. 
# Can be modified to use any MODEL_* constant above (e.g., MODEL_CLAUDE_SONNET, MODEL_AWS_NOVA_PRO)
# You can use your own models from Flotorch Console as well
default_evaluator = MODEL_CLAUDE_HAIKU

# Agent configuration - can be customized per metric section
agent_name = "<your_agent_name>"  # Name of your Flotorch ADK agent, e.g. "flotorch-agent"
app_name = "<your_app_name>"  # Application name identifier, e.g. "agent-evaluation-app-name"
user_id = "<your_user_id>"  # User identifier, e.g. "agent-evaluation-user"

print("✓ Agent configuration parameters defined")


## 3. Import Required Libraries

### Purpose
Import all required components for evaluating Flotorch ADK agents across all metrics using Flotorch Eval.

### Key Components
- **`AgentEvaluator`**: Core client for agent evaluation orchestration and trace fetching
- **`LatencyMetric`**: Flotorch Eval metric that measures agent response time and performance breakdown
- **`TrajectoryEvalWithLLM`**: Flotorch Eval metric that assesses overall trajectory quality using LLM-based evaluation
- **`TrajectoryEvalWithLLMWithReference`**: Flotorch Eval metric that compares trajectory against a reference trajectory using LLM-based evaluation
- **`ToolCallAccuracy`**: Flotorch Eval metric that evaluates the accuracy and appropriateness of tool usage decisions
- **`AgentGoalAccuracy`**: Flotorch Eval metric that assesses whether the agent successfully accomplished the user's true goal
- **`UsageMetric`**: Flotorch Eval metric that provides detailed breakdown of token usage and costs
- **`ReferenceTrajectory`**: Schema for defining the ideal "golden path" trajectory for comparison in TrajectoryEvalWithLLMWithReference
- **`MetricConfig`**: Configuration class for customizing metric parameters and settings
- **`FlotorchADKAgent`**: Creates and configures Flotorch ADK agents with custom tools and tracing
- **`FlotorchADKSession`**: Manages agent sessions for multi-turn conversations
- **`FlotorchVectorStore`**: Provides access to Flotorch knowledge bases for RAG-based searches
- **`extract_vectorstore_texts`**: Utility function for extracting text results from vector store searches
- **`Runner`**: Executes agent queries and coordinates the agent execution flow
- **`FunctionTool`**: Wraps Python functions as tools that can be used by the agent
- **`types`**: Google ADK types for creating message content and handling agent events
- **`pandas`**: Data manipulation and display for formatted results tables
- **`display`**: IPython display utility for rendering formatted outputs in notebooks
- **`json`**: JSON parsing utilities for handling metric details and results

In [None]:
# Required imports
# Flotorch Eval components
from flotorch_eval.agent_eval.core.client import AgentEvaluator
from flotorch_eval.agent_eval.metrics.latency_metrics import LatencyMetric
from flotorch_eval.agent_eval.metrics.llm_evaluators import (
    TrajectoryEvalWithLLM,
    TrajectoryEvalWithLLMWithReference,
    ToolCallAccuracy,
    AgentGoalAccuracy
)
from flotorch_eval.agent_eval.metrics.usage_metrics import UsageMetric
from flotorch_eval.agent_eval.core.schemas import ReferenceTrajectory
from flotorch_eval import MetricConfig

# Flotorch ADK components
from flotorch.adk.agent import FlotorchADKAgent
from flotorch.adk.sessions import FlotorchADKSession
from flotorch.sdk.memory import FlotorchVectorStore
from flotorch.sdk.utils.memory_utils import extract_vectorstore_texts

# Google ADK components
from google.adk.runners import Runner
from google.adk.tools import FunctionTool
from google.genai import types

# Utilities
import pandas as pd
from IPython.display import display
import json

print("✓ Imported necessary libraries successfully")


## Knowledge Base Setup:

**Important**: Before using this tool, you need to create a Knowledge Base in the Flotorch Console.

We have provided the files needed to create the Knowledge Base for this use case in the `agents/flotorch_assistant_kb/` directory. This folder contains all the files required to set up the knowledge base.

**Instructions**:
1. **Navigate to the Knowledge Base folder**: Go to the `agents/flotorch_assistant_kb/` directory
2. **Create Knowledge Base in Flotorch Console**:
   - Log in to your Flotorch Console
   - Navigate to the Knowledge Bases section
   - Create a new Knowledge Base (e.g., name it "flotorch-documentation" or "flotorch-assistant-kb")
   - Upload all the files from the `agents/flotorch_assistant_kb/` folder
3. **Get Knowledge Base ID**:
   - After creating the Knowledge Base in the console, copy its ID
   - Update the `KNOWLEDGE_BASE` variable in the configuration section (Section 4) with your Knowledge Base ID

**Note**: Make sure to replace `<your_knowledge_base_id>` with your actual Knowledge Base ID in the configuration cell.

## 4. Knowledge Base Tool (for Latency Metric)

**Purpose**: This tool searches the Flotorch knowledge base for information. It's used for the Latency Metric to enable RAG-based knowledge base search.

**Functionality**:
- Takes a search query as input
- Searches the Flotorch Vector Store (Knowledge Base)
- Returns relevant search results extracted from the knowledge base


In [None]:
# ============================================================================
# Tool 1: Knowledge Base Tool (for Latency Metric)
# ============================================================================
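# Knowledge Base ID copied from the Flotorch Console (see "Knowledge Base Setup")
KNOWLEDGE_BASE = "<your_knowledge_base_id>"

# Initialize the Flotorch Vector Store (Knowledge Base)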
kb = FlotorchVectorStore(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    vectorstore_id=KNOWLEDGE_BASE
)

def kb_tool(query: str) -> dict:
    """Custom tool for searching the Flotorch knowledge base."""
    try:
        if not query:
            return {"success": False, "results": [], "error": "Empty query provided."}
        context = kb.search(query)
        results = extract_vectorstore_texts(context)
        return {"success": True, "results": results, "error": None}
    except Exception as e:
        return {"success": False, "results": [], "error": str(e)}

print("✓ Knowledge Base Tool (for Latency Metric)")


## 5. Web Search Tool (for Trajectory Evaluation)

**Purpose**: This tool performs Google searches and returns top results. It's used for Trajectory Evaluation metrics to enable the agent to search for information.

**Functionality**:
- Takes a search query as input
- Uses Google Custom Search API to retrieve results
- Returns formatted search results with titles, snippets, and links


In [None]:
# ============================================================================
# Tool 2: Web Search Tool (for Trajectory Evaluation)
# ============================================================================
import requests

def web_search(query: str) -> str:
    """Perform a Google search and return top results."""

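    api_key = "<YOUR_GOOGLE_API_KEY>"
    cse_id = "<YOUR_GOOGLE_CSE_ID>"
    url = "https://www.googleapis.com/customsearch/v1"

    params = {"key": api_key, "cx": cse_id, "q": query, "num": 5}
    try:
        # A timeout keeps the agent from hanging on a stalled request
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        data = response.json()
    except (requests.RequestException, ValueError) as e:
        return f"Search failed: {e}"
    if "items" not in data:
        return "No results found."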
    results = []
    for item in data["items"]:
        title = item.get("title", "")
        snippet = item.get("snippet", "")
        link = item.get("link", "")
        results.append(f"🔹 {title}\n{snippet}\n{link}")
    return "\n\n".join(results)

print("✓ Web Search Tool (for Trajectory Evaluation)")


## 6. Weather Tool (for Tool Call Accuracy)

**Purpose**: This tool retrieves current weather information for a given city. It's used for the Tool Call Accuracy metric to test the agent's ability to correctly select and use appropriate tools.

**Functionality**:
- Takes a city name as input
- Uses geocoding API to get latitude and longitude
- Fetches current weather data including temperature, wind speed, and humidity
- Returns comprehensive weather information for the city


In [None]:
# ============================================================================
# Tool 3: Weather Tool (for Tool Call Accuracy)
# ============================================================================
def get_weather(city_name: str) -> dict:
    """Return latitude, longitude, and current weather for a given city name."""
    geo_url = "https://geocoding-api.open-meteo.com/v1/search"
    geo_params = {"name": city_name, "count": 1, "language": "en", "format": "json"}
    geo_res = requests.get(geo_url, params=geo_params, timeout=10).json()
    if "results" not in geo_res:
        raise ValueError(f"City '{city_name}' not found")
    city = geo_res["results"][0]
    lat, lon = city["latitude"], city["longitude"]
    weather_url = "https://api.open-meteo.com/v1/forecast"
    weather_params = {
        "latitude": lat,
        "longitude": lon,
        "current": "temperature_2m,wind_speed_10m,relative_humidity_2m"
    }
    weather_res = requests.get(weather_url, params=weather_params, timeout=10).json()
    current_weather = weather_res.get("current", {})
    return {
        "city": city["name"],
        "country": city.get("country"),
        "latitude": lat,
        "longitude": lon,
        "weather": current_weather
    }

print("✓ Weather Tool (for Tool Call Accuracy)")


## 7. News Tool (for Usage Metric)

**Purpose**: This tool fetches the latest news articles from around the world. It's used for the Usage Metric to measure token usage and costs when the agent processes news-related queries.

**Functionality**:
- Takes a limit parameter for the number of articles to fetch
- Parses RSS feeds from Google News, with priority on India news
- Returns formatted news articles with title, description, URL, and metadata
- Handles errors gracefully and returns structured data for agent processing
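
The parsing step described above can be tried offline against a small RSS sample, which is useful when the network is unavailable. The sketch below is illustrative: `SAMPLE_RSS` is made-up data, `parse_rss_items` is a hypothetical helper, and the "Headline - Publisher" title layout is an assumption about the feed format.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample used only for this illustration
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item>
    <title>Example headline one - Example Times</title>
    <link>https://news.example.com/1</link>
    <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Example headline two - Example Post</title>
    <link>https://news.example.com/2</link>
  </item>
</channel></rss>"""


def parse_rss_items(xml_text: str, max_items: int) -> list:
    """Extract up to max_items articles from RSS 2.0 text."""
    articles = []
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        if len(articles) >= max_items:
            break
        title = (item.findtext("title") or "No title").strip()
        # Assumption: titles look like "Headline - Publisher"; keep the headline
        headline = title.rsplit(" - ", 1)[0]
        articles.append({
            "title": headline,
            "url": (item.findtext("link") or "").strip(),
            # Empty string when the feed omits a publication date
            "publishedAt": (item.findtext("pubDate") or "").strip(),
        })
    return articles


for article in parse_rss_items(SAMPLE_RSS, max_items=5):
    print(article["title"], "|", article["url"])
```

Separating parsing from fetching makes the logic testable without a live request.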


In [None]:
# ============================================================================
# Tool 4: News Tool (for Usage Metric)
# ============================================================================
from datetime import datetime
import xml.etree.ElementTree as ET

def get_top_news(limit: int = 7) -> dict:
    """Get the latest top news articles from around the world, with priority on India."""
    try:
        articles = []
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
        
        def parse_rss_feed(url: str, max_items: int) -> list:
            parsed_articles = []
            try:
                response = requests.get(url, timeout=10, headers=headers)
                if response.status_code == 200:
                    root = ET.fromstring(response.content)
                    for item in root.findall('.//item'):
                        if len(parsed_articles) >= max_items:
                            break
                        title_elem = item.find('title')
                        link_elem = item.find('link')
                        desc_elem = item.find('description')
                        title = (title_elem.text if title_elem is not None else None) or "No title"
                        # Google News titles look like "Headline - Publisher"; keep the headline
                        if " - " in title:
                            title = title.rsplit(" - ", 1)[0]
                        if not any(a["title"] == title for a in articles + parsed_articles):
                            parsed_articles.append({
                                "title": title,
                                "description": desc_elem.text if desc_elem is not None else "",
                                "url": link_elem.text if link_elem is not None else "",
                                # Use the feed's own timestamp; fall back to fetch time
                                "publishedAt": item.findtext('pubDate') or str(datetime.now()),
                                "source": {"name": "Google News"}
                            })
            except Exception:
                pass
            return parsed_articles
        
        india_news_urls = ["https://news.google.com/rss/search?q=india+when:1d&hl=en-IN&gl=IN&ceid=IN:en"]
        for url in india_news_urls:
            if len(articles) >= limit:
                break
            new_articles = parse_rss_feed(url, limit - len(articles))
            articles.extend(new_articles)
        articles = articles[:limit]
        return {"status": "success", "totalResults": len(articles), "articles": articles, "fetchedAt": str(datetime.now())}
    except Exception as e:
        return {"status": "error", "message": f"Failed to fetch news: {str(e)}", "articles": []}

# ============================================================================
# Combine All Tools
# ============================================================================
all_tools = [
    FunctionTool(kb_tool),
    FunctionTool(web_search),
    FunctionTool(get_weather),
    FunctionTool(get_top_news)
]

print("✓ News Tool (for Usage Metric)")


## 8. Agent, Session, and Evaluator Initialization

### Purpose
Initialize all the necessary components to run and evaluate the Flotorch ADK agent with OpenTelemetry tracing enabled.

### Key Components
1. **FlotorchADKAgent** (`agent_client`):
   - Creates the agent client with all four custom tools (Knowledge Base, Web Search, Weather, News)
   - Configures `tracer_config` with `enabled: True` and `sampling_rate: 1` to capture 100% of traces
   - Sets tracing endpoint: `"https://dev-observability.flotorch.cloud/v1/traces"`
   - Essential for evaluation as traces contain complete execution information
2. **FlotorchADKSession** (`session_service`): Manages agent sessions for multi-turn conversations
3. **Runner** (`runner`): Executes agent queries and coordinates the agent execution flow
4. **AgentEvaluator** (`evaluator`):
   - Initializes the evaluation client for metric assessment
   - Configures the default evaluator model (LLM) for metric evaluation
   - Used to evaluate agent performance across all metrics

These components work together to run the Flotorch ADK Agent and generate OpenTelemetry traces for evaluation across all metrics.
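
The `tracer_config` dictionary can be assembled with a small helper that rejects obviously wrong values before the agent is created. This is a sketch under stated assumptions: `build_tracer_config` is a hypothetical helper, and it assumes `sampling_rate` is a fraction between 0 and 1, which matches the notebook's use of `1` for 100% but is not confirmed for other values.

```python
from urllib.parse import urlparse


def build_tracer_config(endpoint: str, sampling_rate: float = 1.0,
                        enabled: bool = True) -> dict:
    """Build a tracer_config dict with the three keys used in this notebook."""
    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Endpoint must be an http(s) URL, got {endpoint!r}")
    # Assumption: sampling_rate is a fraction, where 1 means 100% of traces
    if not 0 <= sampling_rate <= 1:
        raise ValueError(f"sampling_rate must be between 0 and 1, got {sampling_rate}")
    return {"enabled": enabled, "endpoint": endpoint, "sampling_rate": sampling_rate}


tracer_config = build_tracer_config(
    "https://dev-observability.flotorch.cloud/v1/traces"
)
print(tracer_config)
```

For evaluation runs, sampling every trace matters because each metric is computed from a single trace ID.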


In [None]:
# Initialize ONE Flotorch ADK Agent with all tools
agent_client = FlotorchADKAgent(
    agent_name=agent_name,
    custom_tools=all_tools,
    base_url=FLOTORCH_BASE_URL,
    api_key=FLOTORCH_API_KEY,
    tracer_config={
        "enabled": True,  # Enable tracing for metrics
        "endpoint": "https://dev-observability.flotorch.cloud/v1/traces",
        "sampling_rate": 1  # Sample 100% of traces
    }
)
agent = agent_client.get_agent()

# Initialize session service
session_service = FlotorchADKSession(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
)

# Create the ADK Runner to execute agent queries
runner = Runner(
    agent=agent,
    app_name=app_name,
    session_service=session_service
)

# Initialize ONE AgentEvaluator for all metrics
evaluator = AgentEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    default_evaluator=default_evaluator
)

print("✓ This agent and evaluator will be used for all metrics")


## 9. Helper Function for Running a Query

### Purpose
Define a helper function that executes a single-turn query with the agent and extracts the final response. The agent execution is automatically traced for evaluation.

### Functionality
The `run_single_turn` function:
- Accepts a `Runner`, query string, session ID, and user ID as parameters
- Creates a user message using Google ADK types (`types.Content` and `types.Part`)
- Executes the query through the runner which coordinates the agent execution flow
- Iterates through events returned by the runner to find and return the final agent response
- Returns a fallback message ("No response from agent.") if no response is found
- Automatically generates OpenTelemetry traces during execution for metric evaluation
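
The event-scanning step can be exercised without a live agent by using stand-in objects. In the sketch below, `FakePart`, `FakeContent`, and `FakeEvent` are hypothetical classes that mimic only the attributes the helper reads (`is_final_response()`, `content.parts`, `text`); they are not Google ADK types. As a variation on the helper, this version joins all text parts instead of reading only the first one.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class FakePart:
    text: Optional[str] = None


@dataclass
class FakeContent:
    parts: List[FakePart] = field(default_factory=list)


@dataclass
class FakeEvent:
    content: Optional[FakeContent] = None
    final: bool = False

    def is_final_response(self) -> bool:
        return self.final


def extract_final_text(events: Iterable[FakeEvent]) -> str:
    """Return the text of the first final event that carries any text."""
    for event in events:
        if event.is_final_response() and event.content and event.content.parts:
            # Join every text part; skip parts without text (e.g. tool calls)
            texts = [part.text for part in event.content.parts if part.text]
            if texts:
                return "".join(texts)
    return "No response from agent."


demo_events = [
    FakeEvent(content=FakeContent([FakePart("calling a tool")]), final=False),
    FakeEvent(content=FakeContent([FakePart("Hello "), FakePart("world")]), final=True),
]
print(extract_final_text(demo_events))
```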

This function simplifies the process of running queries and ensures trace generation during execution, which is essential for evaluating all metrics (LatencyMetric, TrajectoryEvalWithLLM, TrajectoryEvalWithLLMWithReference, ToolCallAccuracy, AgentGoalAccuracy, UsageMetric).

In [None]:
def run_single_turn(runner: Runner, query: str, session_id: str, user_id: str) -> str:
    """
    Execute a single-turn query with the agent and return the final response.
    The agent execution is traced automatically.
    """
    content = types.Content(role="user", parts=[types.Part(text=query)])
    events = runner.run(user_id=user_id, session_id=session_id, new_message=content)

    # Extract the final response
    for event in events:
        if event.is_final_response() and event.content and event.content.parts:
            return event.content.parts[0].text
    return "No response from agent."

print("✓ Helper function 'run_single_turn' defined successfully")


## 10. Define Queries for Different Metrics

### Purpose
Define sample queries for different metrics that will be executed by the Flotorch ADK agent to generate OpenTelemetry traces for evaluation.

### Key Components
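- **`queries`**: A dictionary containing one sample question per metric
  - **`"latency"`**: A query for LatencyMetric that triggers a knowledge base search (e.g., "How to create agents in flotorch console?")
  - **`"trajectory"`**: A query for TrajectoryEvalWithLLM that triggers the agent to search and provide information (e.g., "Tell me about Google ADK")
  - **`"reference"`**: A query for TrajectoryEvalWithLLMWithReference that requires code review and analysis (e.g., "Review this Python function and tell me if it works and how good the code quality is")
  - **`"toolcall"`**: A query for ToolCallAccuracy that requires the agent to use a specific tool (e.g., "what is the weather in the hyderabad")
  - **`"goal"`**: A query for AgentGoalAccuracy that states a multi-part goal (e.g., "Plan a 5-day trip to Singapore from Bangalore in March, ...")
  - **`"usage"`**: A query for UsageMetric that triggers the news tool (e.g., "Get me the top 7 latest news articles from around the world, especially from India")
  - Each query triggers an agent execution that is automatically traced to capture execution information
  - Different queries are used so that each metric is evaluated with the most appropriate scenario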

The queries can be modified to test different scenarios and measure performance for various types of questions. Each query generates a separate trace that is used for evaluating its corresponding metric.
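
When the queries are edited, it helps to confirm that every metric still has a query. The sketch below pairs each query key with the metric class name it feeds, following the names used in this notebook; `check_query_coverage` is a hypothetical helper, and `demo_queries` is shortened sample data.

```python
# Query key -> metric class name, as used in this notebook
QUERY_KEY_TO_METRIC = {
    "latency": "LatencyMetric",
    "trajectory": "TrajectoryEvalWithLLM",
    "reference": "TrajectoryEvalWithLLMWithReference",
    "toolcall": "ToolCallAccuracy",
    "goal": "AgentGoalAccuracy",
    "usage": "UsageMetric",
}


def check_query_coverage(queries: dict) -> dict:
    """Report missing, unknown, and empty entries in a queries dictionary."""
    return {
        "missing": sorted(set(QUERY_KEY_TO_METRIC) - set(queries)),
        "unknown": sorted(set(queries) - set(QUERY_KEY_TO_METRIC)),
        "empty": sorted(k for k, q in queries.items() if not str(q).strip()),
    }


demo_queries = {
    "latency": "How to create agents in flotorch console?",
    "trajectory": "Tell me about Google ADK",
    "toolcall": "",
}
print(check_query_coverage(demo_queries))
```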


In [None]:
# Define queries for different metrics
queries = {
    "latency": "How to create agents in flotorch console?",
    "trajectory": "Tell me about Google ADK",
    "reference": """Review this Python function and tell me if it works and how good the code quality is.

def add_items(items):
    total = 0
    for i in range(len(items)):
        total = total + items[i]
    return total""",
    "toolcall": "what is the weather in the hyderabad",
    "goal": "Plan a 5-day trip to Singapore from Bangalore in March, including flight options, hotel recommendations, daily itinerary, and estimated total cost.",
    "usage": "Get me the top 7 latest news articles from around the world, especially from India"
}

print(f"✓ Queries defined for {len(queries)} metrics")


## 11. Run Queries and Get Trace IDs

### Purpose
Execute sample queries for different metrics to generate OpenTelemetry traces. All queries use the same agent and each query execution generates a separate trace ID that will be used for metric evaluation.

### Functionality
This section:
- Creates a `trace_ids` dictionary to store trace IDs for each metric
- Iterates through each query in the `queries` dictionary (6 queries total: latency, trajectory, reference, toolcall, goal, usage)
- For each query:
  - Creates a new session using the session service
  - Tracks existing tracer IDs before executing the query
  - Executes the query using the agent via `run_single_turn` function
  - Waits 4 seconds for trace generation to complete
  - Compares tracer IDs after execution to identify new trace IDs
  - Stores the most recent new trace ID in the `trace_ids` dictionary
- Verifies that all 6 required trace IDs are collected successfully:
  - `latency` - for Latency Metric
  - `trajectory` - for Trajectory Evaluation with LLM
  - `reference` - for Trajectory Evaluation with Reference
  - `toolcall` - for Tool Call Accuracy
  - `goal` - for Agent Goal Accuracy
  - `usage` - for Usage Metric
- Prints warnings if any trace IDs are missing

The trace IDs are essential for evaluation as they contain the complete execution information needed by each metric.
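
The before/after comparison can also be combined with polling, so the cell returns as soon as a new trace ID appears instead of always waiting a fixed interval. This is a sketch, not the notebook's method: `new_trace_ids` and `wait_for_new_trace` are hypothetical helpers, and `get_ids` stands in for a callable such as `agent_client.get_tracer_ids`.

```python
import time
from typing import Callable, Iterable, List, Optional


def new_trace_ids(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return IDs present in `after` but not in `before`, keeping order."""
    seen = set(before)
    return [tid for tid in after if tid not in seen]


def wait_for_new_trace(get_ids: Callable[[], Optional[List[str]]],
                       before: Iterable[str],
                       timeout: float = 10.0,
                       interval: float = 0.5) -> Optional[str]:
    """Poll get_ids until a new trace ID appears or the timeout expires."""
    known = set(before)
    deadline = time.monotonic() + timeout
    while True:
        fresh = new_trace_ids(known, get_ids() or [])
        if fresh:
            return fresh[-1]  # most recent new ID, as in the loop above
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Demonstration with a fake ID source that "produces" a trace on the third poll
_polls = {"count": 0}


def fake_get_ids() -> List[str]:
    _polls["count"] += 1
    return ["trace-1"] if _polls["count"] < 3 else ["trace-1", "trace-2"]


print(wait_for_new_trace(fake_get_ids, before={"trace-1"}, timeout=2.0, interval=0.01))
```

Polling trades a slightly more complex loop for shorter waits on fast traces and more tolerance for slow ones.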


In [None]:
import time

# Store trace IDs for each metric
trace_ids = {}

# Run queries and collect trace IDs
for metric_name, query in queries.items():
    print(f"\n--- Running query for {metric_name.upper()} metric ---")
    
    # Create session and execute query
    session = await runner.session_service.create_session(
        app_name=app_name, 
        user_id=user_id
    )
    tracer_ids_before = set(agent_client.get_tracer_ids() or [])
    
    response = run_single_turn(
        runner=runner, 
        query=query, 
        session_id=session.id, 
        user_id=user_id
    )
    
    # Wait for trace generation and collect trace ID
    time.sleep(4)
    tracer_ids_after = agent_client.get_tracer_ids() or []
    new_trace_ids = [tid for tid in tracer_ids_after if tid not in tracer_ids_before]
    
    if new_trace_ids:
        trace_ids[metric_name] = new_trace_ids[-1]
        print(f"✓ {metric_name.upper()} trace ID: {trace_ids[metric_name]}")
    else:
        print(f"✗ No trace ID found for {metric_name}")

# Verify all required trace IDs collected
required_keys = ["latency", "trajectory", "reference", "toolcall", "goal", "usage"]
missing = [k for k in required_keys if k not in trace_ids]

if missing:
    print(f"\nWARNING: Missing trace IDs: {missing}")
else:
    print(f"\n✓ All {len(trace_ids)} trace IDs collected successfully!")
    for name, tid in trace_ids.items():
        print(f"  - {name}: {tid}")


## 12. Reference Trajectory Setup (for `TrajectoryEvalWithLLMWithReference`)

### Key Components
- **`REFERENCE_TRAJECTORY`**: A dictionary that defines the ideal trajectory for the agent
  - **`input`**: The user's query or prompt that will trigger the agent's response
  - **`expected_steps`**: A list of expected steps the agent should take, where each step contains:
    - **`thought`**: The agent's reasoning or thought process at this step
    - **`final_response`**: The expected final response from the agent
  - Example: A code review query with expected analysis covering correctness, code quality, readability, performance, and suggestions

- **`ReferenceTrajectory`**: Schema validation class that ensures the reference trajectory structure is correct
  - Validates that all required fields are present and properly formatted
  - Used to ensure the reference trajectory meets the requirements for TrajectoryEvalWithLLMWithReference metric

This reference trajectory will be compared against the actual agent execution to evaluate how closely the agent follows the expected behavior and reasoning path.
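
To make the expected structure concrete, the sketch below checks a dictionary for the fields listed above. It is a stand-in built on standard-library dataclasses, not the real `ReferenceTrajectory` class: the names `StepSketch`, `ReferenceTrajectorySketch`, and `parse_reference` are hypothetical, and the actual schema may enforce different or additional rules.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class StepSketch:
    thought: str
    final_response: str


@dataclass
class ReferenceTrajectorySketch:
    input: str
    expected_steps: List[StepSketch]


def parse_reference(raw: Dict[str, Any]) -> ReferenceTrajectorySketch:
    """Build the sketch object, raising ValueError on missing or empty fields."""
    if not isinstance(raw.get("input"), str) or not raw["input"].strip():
        raise ValueError("'input' must be a non-empty string")
    steps = raw.get("expected_steps")
    if not isinstance(steps, list) or not steps:
        raise ValueError("'expected_steps' must be a non-empty list")
    parsed = []
    for i, step in enumerate(steps):
        # Every step needs both keys described in the list above
        missing = [k for k in ("thought", "final_response") if k not in step]
        if missing:
            raise ValueError(f"step {i} is missing {missing}")
        parsed.append(StepSketch(step["thought"], step["final_response"]))
    return ReferenceTrajectorySketch(raw["input"], parsed)
```

Failing early on a malformed reference is cheaper than discovering the problem after the agent run and the LLM-based comparison have already been paid for.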


In [None]:
# Define reference trajectory for TrajectoryEvalWithLLMWithReference metric
REFERENCE_TRAJECTORY = {
    "input": """Review this Python function and tell me if it works and how good the code quality is.

def add_items(items):
    total = 0
    for i in range(len(items)):
        total = total + items[i]
    return total""",
    "expected_steps": [
        {
            "thought": "The user wants a review of the Python function, focusing on whether it works and the quality of the implementation. I will analyze correctness, edge cases, style, performance, and then provide a structured review.",
            "final_response": """### Summary

- The function `add_items` sums all elements in the `items` list.
- It likely works correctly when `items` is a list of numeric values.
- The implementation is not very idiomatic Python and can be simplified.
- There is no validation for invalid or unexpected input types.

### Correctness

- For a typical case like `add_items([1, 2, 3])`, the function will return `6`, which is correct.
- If `items` contains non-numeric values (e.g., strings or `None`), it will raise a `TypeError` when trying to add them.
- If `items` is not indexable or does not implement `__len__` (e.g., a generator), the function will also fail.
- There is no explicit handling of `None` or other invalid inputs.

### Code Quality & Readability

- The logic is straightforward and easy to understand.
- Using `range(len(items))` is not idiomatic Python for iterating over a list.
- A more Pythonic version would iterate directly over the elements or use the built-in `sum` function.
- Variable naming is acceptable, but the function could be shorter and clearer.

### Performance

- Time complexity is O(n), where n is the length of `items`, which is optimal for summing a list.
- There is a minor overhead from indexing (`items[i]`) instead of iterating directly over the elements.
- For most real-world cases, this overhead is negligible, but the idiomatic version is both clearer and slightly more efficient.

### Improved Version

A more Pythonic and concise implementation would be:

```python
def add_items(items):
    return sum(items)
```

### Ratings

- Correctness: 8/10 (works for standard numeric lists, no validation)
- Code Quality: 6/10 (clear but non-idiomatic and slightly verbose)
- Overall: 7/10"""
        }
    ]
}

validated_ref = ReferenceTrajectory(**REFERENCE_TRAJECTORY)
print("✓ Reference trajectory defined for TrajectoryEvalWithLLMWithReference metric")


## 13. Initialize All Metrics

### Purpose
Initialize all metrics that will be evaluated. Each metric is configured with its specific requirements and will be evaluated using the traces collected from the agent execution.

### Key Components
- **`all_metrics`**: A list containing instances of all 6 metrics
  - **`LatencyMetric()`**: Metric that measures agent response time and performance breakdown
  - **`TrajectoryEvalWithLLM()`**: Metric that assesses overall trajectory quality using LLM-based evaluation
  - **`TrajectoryEvalWithLLMWithReference()`**: Metric that compares trajectory against a reference trajectory using LLM-based evaluation
    - Takes the evaluator LLM through the `llm` argument (here `default_evaluator`)
    - Takes a `MetricConfig` whose `metric_params` carries the `REFERENCE_TRAJECTORY` defined earlier under the `reference` key
  - **`ToolCallAccuracy()`**: Metric that evaluates the accuracy and appropriateness of tool usage decisions
  - **`AgentGoalAccuracy()`**: Metric that assesses whether the agent successfully accomplished the user's true goal
  - **`UsageMetric()`**: Metric that provides detailed breakdown of token usage and costs

All metrics are initialized together so they can be evaluated in a unified manner using the same evaluator and traces. The code prints each metric name for verification.


In [None]:
# Initialize ALL metrics at once
all_metrics = [
    LatencyMetric(),
    TrajectoryEvalWithLLM(),
    TrajectoryEvalWithLLMWithReference(
        llm=default_evaluator,
        config=MetricConfig(
            metric_params={"reference": REFERENCE_TRAJECTORY}
        )
    ),
    ToolCallAccuracy(),
    AgentGoalAccuracy(),
    UsageMetric()
]

print(f"✓ Initialized {len(all_metrics)} metrics:")
for i, metric in enumerate(all_metrics, 1):
    print(f"  {i}. {metric.__class__.__name__}")


## 14. Configure Metric-to-Trace Mapping

### Purpose
Define the mapping between each metric and its corresponding trace. Each metric uses a different trace (from different queries) to ensure accurate evaluation.

### Key Components
- **`metric_configs`**: A list of dictionaries, each mapping a metric to its trace. Contains 6 metric configurations:
  - **`name`**: Human-readable metric name
  - **`trace_key`**: Key from the `trace_ids` dictionary that identifies which trace to use
  - **`metric`**: The metric instance (created fresh for each configuration)
  - **`reference`**: Reference trajectory (only for TrajectoryEvalWithLLMWithReference, otherwise None)

**Metric-to-Trace Mapping**:
1. **Latency Metric** → Uses `"latency"` trace (Flotorch console question)
2. **Trajectory Evaluation** → Uses `"trajectory"` trace (Google ADK question)
3. **Trajectory Evaluation with Reference** → Uses `"reference"` trace (code review question)
   - Includes `REFERENCE_TRAJECTORY` in the reference field
4. **Tool Call Accuracy** → Uses `"toolcall"` trace (weather question)
5. **Agent Goal Accuracy** → Uses `"goal"` trace (travel planning question)
6. **Usage Metric** → Uses `"usage"` trace (news question)

This mapping ensures each metric is evaluated with the most appropriate trace for accurate assessment. The code also imports `EvaluationResult` schema which will be used to combine all evaluation results.
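
Before evaluating, it can help to confirm that every `trace_key` in the mapping actually has a collected trace ID. The sketch below shows one way to do that with plain dictionaries. The helper names `missing_trace_keys` and `resolve_traces` are hypothetical and are not part of Flotorch Eval.

```python
from typing import Dict, List, Tuple


def missing_trace_keys(metric_configs: List[dict], trace_ids: Dict[str, str]) -> List[str]:
    """Return the names of configured metrics whose trace key has no trace ID."""
    return [c["name"] for c in metric_configs if c["trace_key"] not in trace_ids]


def resolve_traces(metric_configs: List[dict], trace_ids: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each metric name with its trace ID, skipping metrics without one."""
    return [
        (c["name"], trace_ids[c["trace_key"]])
        for c in metric_configs
        if c["trace_key"] in trace_ids
    ]
```

Calling `missing_trace_keys(metric_configs, trace_ids)` right after the mapping is defined surfaces gaps before any LLM-based metric is run.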


In [None]:
# Map each metric to the trace it will be evaluated against
# Note: The evaluate() method accepts ONE trace per call, so each metric is evaluated separately

from flotorch_eval.agent_eval.core.schemas import EvaluationResult

# Define metric-to-trace mapping and metric instances
metric_configs = [
    {
        "name": "Latency",
        "trace_key": "latency",
        "metric": LatencyMetric(),
        "reference": None
    },
    {
        "name": "Trajectory",
        "trace_key": "trajectory",
        "metric": TrajectoryEvalWithLLM(),
        "reference": None
    },
    {
        "name": "Reference",
        "trace_key": "reference",
        "metric": TrajectoryEvalWithLLMWithReference(
            llm=default_evaluator,
            config=MetricConfig(metric_params={"reference": REFERENCE_TRAJECTORY})
        ),
        "reference": REFERENCE_TRAJECTORY
    },
    {
        "name": "ToolCall",
        "trace_key": "toolcall",
        "metric": ToolCallAccuracy(),
        "reference": None
    },
    {
        "name": "Goal",
        "trace_key": "goal",
        "metric": AgentGoalAccuracy(),
        "reference": None
    },
    {
        "name": "Usage",
        "trace_key": "usage",
        "metric": UsageMetric(),
        "reference": None
    }
]

print("Evaluating ALL metrics, each with its corresponding trace...\n")

## 15. Run Evaluation for All Metrics

### Purpose
Evaluate all 6 metrics, each with its own trace (different question). The `evaluate()` method accepts ONE trace per call, so we loop through each metric and evaluate it separately with its corresponding trace, then combine all results.

### Functionality
This section:
- Initializes empty lists: `all_scores` to collect all metric scores and `all_trajectory_ids` to track trajectory IDs
- Iterates through each metric configuration in `metric_configs` (6 total metrics)
- For each metric:
  - Checks if the required trace ID exists, skipping with a warning if missing
  - Fetches the trace for the metric using `evaluator.fetch_traces(trace_id)`
  - Prepares evaluation arguments with the trace and metric instance
  - Adds reference trajectory for metrics that require it (TrajectoryEvalWithLLMWithReference)
  - Evaluates the metric using `evaluator.evaluate()` with error handling
  - Extends `all_scores` with the result scores and appends trajectory ID info
  - Prints success or failure messages for each metric
- After all evaluations:
  - Creates a combined `EvaluationResult` object with all scores if any were collected
  - Uses the first trajectory ID from the collected IDs or "combined" as fallback
  - Sets timestamp to current UTC time
  - Prints summary message with total number of metrics evaluated successfully

The evaluation results are stored in `all_results` for display in the next section. Each metric is evaluated independently, so if one fails, others can still complete successfully.
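
The error-isolation pattern described above can be sketched independently of the Flotorch classes. In this sketch `evaluate_each` is a hypothetical helper, and `evaluate_one` is an assumed async callable standing in for the fetch-then-evaluate step; neither is part of the library. A failure in one metric is recorded and the loop moves on to the next.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple


async def evaluate_each(
    configs: List[Dict[str, Any]],
    trace_ids: Dict[str, str],
    evaluate_one: Callable[[str, Dict[str, Any]], Awaitable[List[Any]]],
) -> Tuple[List[Any], List[Tuple[str, str]]]:
    """Evaluate every config independently; return (scores, failures)."""
    scores: List[Any] = []
    failures: List[Tuple[str, str]] = []
    for config in configs:
        trace_id = trace_ids.get(config["trace_key"])
        if trace_id is None:
            failures.append((config["name"], "no trace ID"))
            continue
        try:
            scores.extend(await evaluate_one(trace_id, config))
        except Exception as exc:  # keep going so other metrics still complete
            failures.append((config["name"], str(exc)))
    return scores, failures


async def demo() -> Tuple[List[Any], List[Tuple[str, str]]]:
    # Hypothetical stand-in for fetching a trace and evaluating one metric
    async def fake_evaluate(trace_id: str, config: Dict[str, Any]) -> List[Any]:
        if config["name"] == "Goal":
            raise RuntimeError("judge model unavailable")
        return [f"{config['name']}:{trace_id}"]

    configs = [
        {"name": "Latency", "trace_key": "latency"},
        {"name": "Goal", "trace_key": "goal"},
        {"name": "Usage", "trace_key": "usage"},
    ]
    return await evaluate_each(configs, {"latency": "t1", "goal": "t2"}, fake_evaluate)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Inside a notebook, where an event loop is already running, `await demo()` would be used instead of `asyncio.run(demo())`.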


In [None]:
all_scores = []
all_trajectory_ids = []

# Evaluate each metric with its own trace
for i, config in enumerate(metric_configs, 1):
    trace_key = config["trace_key"]
    metric_name = config["name"]
    
    if trace_key not in trace_ids:
        print(f"Skipping {metric_name}: No trace ID for '{trace_key}'")
        continue
    
    trace_id = trace_ids[trace_key]
    print(f"{i}. Evaluating {metric_name} Metric...")
    
    # Fetch and evaluate
    trace = evaluator.fetch_traces(trace_id)
    eval_kwargs = {"trace": trace, "metrics": [config["metric"]]}
    
    if config["reference"]:
        eval_kwargs["reference"] = config["reference"]
    
    try:
        result = await evaluator.evaluate(**eval_kwargs)
        all_scores.extend(result.scores)
        all_trajectory_ids.append((metric_name, result.trajectory_id))
        print(f"   ✓ {metric_name} Metric completed")
    except Exception as e:
        print(f"   ✗ {metric_name} Metric failed: {str(e)[:100]}")

# Create combined results
if all_scores:
    from datetime import datetime, timezone
    all_results = EvaluationResult(
        trajectory_id=all_trajectory_ids[0][1] if all_trajectory_ids else "combined",
        timestamp=datetime.now(timezone.utc),  # timezone-aware; datetime.utcnow() is deprecated since Python 3.12
        scores=all_scores
    )
    print(f"\n✓ All {len(all_scores)} metrics evaluated successfully!")
else:
    print("✗ No metrics evaluated. Please check trace IDs.")



## 16. Display All Metrics Results

### Purpose
Define a helper function that formats and displays the evaluation output, showing all six metric results in a readable table.

### Functionality
The `display_all_metrics` function:
- Iterates over every score in the evaluation results (`latency_summary`, `trajectory_evaluation`, `trajectory_evaluation_with_reference`, `toolcall_accuracy`, `agent_goal_accuracy`, `usage_summary`)
- Formats the score and details for each metric type appropriately
- Creates a structured display showing:
  - Metric name
  - Score (shown to two decimal places; the usage metric shows total cost in dollars instead)
  - Detailed evaluation results for each metric
- Uses pandas DataFrame with styled formatting for clean presentation

This function provides a user-friendly way to visualize all six metric evaluation results in a single formatted table.
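
One alternative to a long `if`/`elif` chain for per-metric formatting is a dispatch table keyed by metric name. The sketch below illustrates the idea with two formatters and a generic fallback. The function names and the `FORMATTERS` table are hypothetical; the detail keys (`total_latency_ms`, `latency_breakdown`, `step_name`, `latency_ms`) mirror the ones read by the display code in this notebook.

```python
from typing import Callable, Dict


def format_latency(details: dict) -> str:
    """Render total latency plus one line per step."""
    steps = "\n".join(
        f"    - {s['step_name']}: {s['latency_ms']} ms"
        for s in details.get("latency_breakdown", [])
    )
    return f"Total Latency: {details.get('total_latency_ms')} ms\nBreakdown:\n{steps}"


def format_key_values(details: dict) -> str:
    """Render each detail as a `key: value` line."""
    return "\n".join(f"{k}: {v}" for k, v in details.items())


# Metric name -> formatter; names not listed fall back to str()
FORMATTERS: Dict[str, Callable[[dict], str]] = {
    "latency_summary": format_latency,
    "trajectory_evaluation_with_reference": format_key_values,
}


def format_details(metric_name: str, details: dict) -> str:
    return FORMATTERS.get(metric_name, str)(details)
```

Adding a new metric then means registering one more entry in the table rather than extending the branch chain.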



In [None]:
def display_all_metrics(result):
    """
    Display ALL metrics in a single formatted table.
    This function handles all 6 metrics and formats them appropriately.
    """
    if not result or not result.scores:
        print("No metrics found in results.")
        return
    
    print(f"Trajectory ID: {result.trajectory_id}")
    print(f"Timestamp: {result.timestamp}\n")
    
    # Prepare data for all metrics
    metrics_data = []
    
    for metric in result.scores:
        metric_name = metric.name
        details_text = ""
        
        # Format details based on metric type
        if metric_name == "latency_summary":
            d = metric.details
            breakdown = "\n".join(
                f"    - {s['step_name']}: {s['latency_ms']} ms"
                for s in d.get("latency_breakdown", [])
            )
            details_text = (
                f"Total Latency: {d.get('total_latency_ms')} ms\n"
                f"Average Step Latency: {d.get('average_step_latency_ms')} ms\n"
                f"Breakdown:\n{breakdown}"
            )
        
        elif metric_name == "trajectory_evaluation":
            d = metric.details
            details_text = f"Score: {metric.score:.2f} / 1.0\n\nDetails:\n{d.get('details', 'No details available.')}"
        
        elif metric_name == "trajectory_evaluation_with_reference":
            details_text = "\n".join(f"{k}: {v}" for k, v in metric.details.items())
        
        elif metric_name == "toolcall_accuracy":
            d = metric.details
            details_text = f"Score: {metric.score:.2f} / 1.0\n\nDetails:\n{d.get('details', 'No details available.')}"
        
        elif metric_name == "agent_goal_accuracy":
            raw_details = metric.details.get("details", "{}")
            try:
                parsed = json.loads(raw_details)
                details_text = "\n".join(f"{k}: {v}" for k, v in parsed.items())
            except (ValueError, TypeError, AttributeError):
                # Details are not a JSON object string: show them as-is
                details_text = raw_details
        
        elif metric_name == "usage_summary":
            d = metric.details
            total_cost = d.get("total_cost", "0.000000")
            avg_cost = d.get("average_cost_per_call", "0.000000")
            cost_breakdown = d.get("cost_breakdown", [])
            
            if cost_breakdown:
                breakdown_lines = []
                for item in cost_breakdown:
                    if isinstance(item, dict):
                        item_str = f"    - {item.get('operation', 'Unknown')}: ${item.get('cost', '0.000000')}"
                        if 'count' in item:
                            item_str += f" ({item['count']} calls)"
                        breakdown_lines.append(item_str)
                    else:
                        breakdown_lines.append(f"    - {item}")
                breakdown_text = "\n".join(breakdown_lines)
            else:
                breakdown_text = "    No breakdown available."
            
            details_text = (
                f"Total Cost: ${total_cost}\n"
                f"Average Cost per Call: ${avg_cost}\n"
                f"Cost Breakdown:\n{breakdown_text}"
            )
        
        else:
            # Generic formatting for unknown metrics
            details_text = str(metric.details)
        
        # Format score based on metric type
        if metric_name == "usage_summary":
            score_display = f"${metric.details.get('total_cost', '0.000000')}"
        else:
            score_display = f"{metric.score:.2f}"
        
        metrics_data.append({
            "Metric": metric_name.replace("_", " ").title(),
            "Score": score_display,
            "Details": details_text
        })
    
    # Create DataFrame and display
    df = pd.DataFrame(metrics_data)
    
    display(df.style.set_properties(
        subset=['Details'],
        **{'white-space': 'pre-wrap', 'text-align': 'left'}
    ).set_table_styles([
        {'selector': 'th', 'props': [('text-align', 'left')]}
    ]))

# Display all metrics
if 'all_results' in locals():
    display_all_metrics(all_results)
else:
    print("No results found. Please run the evaluation section first.")


---

# Summary

This comprehensive notebook demonstrates how to evaluate Flotorch ADK agents across **all 6 evaluation metrics** using a **unified approach**:

## Key Features

1. **Single Agent**: ONE `FlotorchADKAgent` with all necessary tools (KB, Web Search, Weather, News)
2. **Single Evaluator**: ONE `AgentEvaluator` instance for all metrics
3. **Unified Evaluation**: All 6 metrics evaluated in one loop, with one `evaluate()` call per metric and trace, and the scores combined into a single `EvaluationResult`
4. **Unified Display**: ONE display function shows all metric results together

## Metrics Evaluated

1. **Latency Metric** (`latency_summary`) - Performance and response time analysis
2. **Trajectory Evaluation with LLM** (`trajectory_evaluation`) - Overall trajectory quality assessment
3. **Trajectory Evaluation with Reference** (`trajectory_evaluation_with_reference`) - Comparison against reference trajectories
4. **Tool Call Accuracy** (`toolcall_accuracy`) - Tool usage decision evaluation
5. **Agent Goal Accuracy** (`agent_goal_accuracy`) - Goal accomplishment assessment
6. **Usage Metric** (`usage_summary`) - Token usage and cost analysis

## Benefits

- **Efficiency**: Single agent and evaluator reduce setup overhead
- **Consistency**: All metrics use the same agent configuration
- **Simplicity**: One evaluation loop and one display function
- **Comprehensive**: All metrics displayed together for easy comparison

The notebook provides comprehensive evaluation capabilities for monitoring and optimizing Flotorch ADK agents across multiple dimensions of performance and quality.
