[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/drive/folders/1IrwoNrb3AWLAhAqjlAkJNYa39p9eT9ui?usp=sharing)

# Flotorch Agent Trajectory Evaluation with Reference (Code Review Agent Use Case)

This notebook demonstrates how to measure and analyze the **trajectory evaluation with reference** of a **Flotorch ADK agent** (configured as a **Code Review Agent** that analyzes code for issues and suggests improvements and optimizations) using the **Flotorch Eval** framework.

The evaluation relies on **OpenTelemetry Traces** generated during the agent's run to assess the overall quality and effectiveness of the agent's trajectory by comparing it against a reference trajectory using LLM-based evaluation.

---

## Key Concepts

* **Code Review Agent**: An agent designed to analyze code for issues and suggest improvements and optimizations.
* **OpenTelemetry Traces**: Detailed records of the agent's execution steps (spans) used to analyze the complete agent trajectory.
* **TrajectoryEvalWithLLMWithReference**: A Flotorch Eval metric that uses LLM-based evaluation to assess trajectory quality by comparing against a reference trajectory. The evaluation metric used is **trajectory_evaluation_with_reference**.
* **Reference Trajectory**: A predefined expected trajectory that serves as a benchmark for evaluating the agent's actual performance.

---

### Architecture Overview

![Workflow Diagram](diagrams/03_TrajectoryEvalWithLLMWithReference_Workflow_Diagram.drawio.png)
*Figure: Detailed workflow diagram showing the step-by-step process of trajectory evaluation with reference, from agent execution through trace collection to metric computation.*

---

## Requirements

* Flotorch account with configured models.
* Valid Flotorch API key and gateway base URL.
* Agent configured with OpenTelemetry tracing enabled.
* Reference trajectory definition for comparison.

---

## Agent Setup in Flotorch Console

**Important**: Before running this notebook, you need to create an agent in the Flotorch Console. This section provides step-by-step instructions on how to set up the agent.

### Step 1: Access Flotorch Console

1. **Log in to Flotorch Console**:
   - Navigate to your Flotorch Console (e.g., `https://dev-console.flotorch.cloud`)
   - Ensure you have the necessary permissions to create agents

2. **Navigate to Agents Section**:
   - Click on **"Agents"** in the left sidebar
   - You should see the "Agent Builder" option selected

### Step 2: Create New Agent

1. **Click "Create FloTorch Agent"**:
   - Look for the blue **"+ Create FloTorch Agent"** button in the top right corner
   - Click it to start creating a new agent

2. **Agent Configuration**:
   - **Agent Name**: Choose a unique name for your agent (e.g., `code-reviewer-agent`)
     - **Important**: The name should only contain alphanumeric characters and dashes (a-z, A-Z, 0-9, -)
     - **Note**: Copy this agent name - you'll need to use it in the `agent_name` variable later
   - **Description** (Optional): Add a description if desired

### Step 3: Configure Agent Details

After creating the agent, you'll be directed to the agent configuration page. Configure the following:

#### Required Configuration:

1. **Model** (`* Model`):
   - **Required**: Select a model from the available options
   - Example: `gpt-model` or any available model from your Flotorch gateway
   - Click the edit icon to configure

2. **Agent Details** (`* Agent Details`):
   - **Required**: Configure agent details
   - **System Prompt**: Copy and paste the following system prompt:

You are SeniorCodeReviewer, an expert software engineer.
Your job is to review any code the user provides and give:

Whether the code likely works (based on static analysis; do not claim certainty).

The quality of the code: readability, maintainability, structure, performance, and security.

A numeric rating of the code (Correctness / Quality / Overall, each out of 10).

Actionable improvements and best-practice suggestions.

Guidelines:

Be direct, technical, and professional.

Point out bugs, edge cases, smells, and anti-patterns.

Suggest better patterns or small rewritten snippets only when useful.

Follow language-appropriate conventions (PEP8 for Python, etc.).

Output in clear sections: Summary, Correctness, Quality, Performance, Security (if relevant), Ratings, Improvements, Suggested Tests.


   - **Goal**: Copy and paste the following goal:
   
Review any given source code and provide an accurate assessment of whether it is likely to work as intended, how good its quality is (readability, maintainability, performance, security), a numeric rating, and specific, actionable improvement suggestions.


#### Optional Configuration:

1. **Tools**:
   - No tools are required for this use case; the code in this notebook does not register any
   - You can leave this as "Not Configured" in the console

2. **Input Schema**:
   - Optional: Leave as "Not Configured" for this use case

3. **Output Schema**:
   - Optional: Leave as "Not Configured" for this use case

### Step 4: Publish the Agent

1. **Review Configuration**:
   - Ensure the Model and Agent Details are configured correctly
   - Verify the System Prompt and Goal are set

2. **Publish Agent**:
   - After configuration, click **"Publish"** or **"Make a revision"** to publish the agent
   - Once published, the agent will have a version number (e.g., v1)

3. **Note the Agent Name**:
   - **Important**: Copy the exact agent name you used when creating the agent
   - You will need to replace `<your_agent_name>` in the `agent_name` variable in Section 2.1 (Global Provider Models and Agent Configuration)

### Step 5: Update Notebook Configuration

1. **Update Agent Name**:
   - Navigate to Section 2.1 in this notebook
   - Find the `agent_name` variable
   - Replace `<your_agent_name>` with the exact agent name you created in the console

**Example**:
- If you created an agent named `code-reviewer-agent` in the console
- Set `agent_name = "code-reviewer-agent"` in the notebook
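The naming rule from Step 2 (alphanumeric characters and dashes only) can be checked before the name is used in the notebook. The following is a minimal sketch; `is_valid_agent_name` and `AGENT_NAME_PATTERN` are hypothetical names introduced here for illustration and are not part of the Flotorch SDK.

```python
import re

# Pattern for the naming rule stated in Step 2: letters, digits, and dashes only.
AGENT_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def is_valid_agent_name(name: str) -> bool:
    """Return True if the name uses only a-z, A-Z, 0-9, and '-' (hypothetical helper)."""
    return bool(AGENT_NAME_PATTERN.fullmatch(name))


print(is_valid_agent_name("code-reviewer-agent"))  # True
print(is_valid_agent_name("<your_agent_name>"))    # False
```

A check like this also catches the case where the `<your_agent_name>` placeholder was never replaced.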

### Summary of Required vs Optional Settings

| Setting | Required/Optional | Value |
|---------|------------------|-------|
| **Agent Name** | **Required** | Choose a unique name (copy it for notebook) |
| **Model** | **Required** | Select from available models |
| **System Prompt** | **Required** | Use the system prompt provided above |
| **Goal** | **Required** | Use the goal provided above |
| **Tools** | **Optional** | Not needed for this use case; can leave as "Not Configured" |
| **Input Schema** | **Optional** | Can leave as "Not Configured" |
| **Output Schema** | **Optional** | Can leave as "Not Configured" |

**Note**: This Code Review Agent works from its system prompt and goal alone. The notebook code does not attach any tools, so you don't need to configure them in the console.

---


## 1. Setup and Installation

### Purpose
Install the necessary packages for the Flotorch Evaluation framework required for agent trajectory evaluation with reference.

### Key Components
- **`flotorch-eval`**: Flotorch evaluation framework with all dependencies for trajectory evaluation with reference metrics



In [None]:
# Install Flotorch Eval packages
# flotorch-eval: Flotorch evaluation framework with all dependencies

%pip install flotorch-eval==2.0.0b1 flotorch[adk]==3.1.0b1

## 2. Authentication and Credentials

### Purpose
Configure your Flotorch API credentials and gateway URL for authentication.

### Key Components
This cell configures the essential authentication and connection parameters:

**Authentication Parameters**:

| Parameter | Description | Example |
|-----------|-------------|---------|
| `FLOTORCH_API_KEY` | Your API authentication key (found in your Flotorch Console). Securely entered using `getpass` to avoid displaying in the notebook | `sk_...` |
| `FLOTORCH_BASE_URL` | Your Flotorch gateway endpoint URL | `https://dev-console.flotorch.cloud` |

**Note**: Use secure credential management in production environments.
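One common way to follow the note above is to read credentials from environment variables and only prompt when they are missing. This is a sketch under assumptions: `load_credential` is a hypothetical helper, and using `FLOTORCH_API_KEY` as an environment variable name is a convention chosen here, not something the Flotorch SDK is known to read on its own.

```python
import getpass
import os


def load_credential(env_var, prompt, secret=True, env=None):
    """Return the value of env_var if it is set, otherwise prompt for it.

    Hypothetical helper: the environment variable names are an assumption
    made for this example.
    """
    env = os.environ if env is None else env
    value = env.get(env_var, "").strip()
    if value:
        return value
    # Fall back to an interactive prompt; getpass hides the typed secret.
    reader = getpass.getpass if secret else input
    return reader(prompt).strip()
```

In the notebook this could be used as `FLOTORCH_API_KEY = load_credential("FLOTORCH_API_KEY", "Paste your API key here: ")`, which keeps the key out of the notebook source and its saved outputs.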


In [None]:
import getpass  # Prompt for the API key without echoing it in the notebook

# Authentication for Flotorch access
FLOTORCH_API_KEY = getpass.getpass("Paste your API key here: ").strip()
if FLOTORCH_API_KEY:
    print("✓ FLOTORCH_API_KEY set successfully")
else:
    print("✗ FLOTORCH_API_KEY not set")

FLOTORCH_BASE_URL = input("Paste your Flotorch Base URL here: ").strip()  # e.g. https://dev-console.flotorch.cloud
print(f"✓ FLOTORCH_BASE_URL set: {FLOTORCH_BASE_URL}")

# Report success only when both values were actually provided
if FLOTORCH_API_KEY and FLOTORCH_BASE_URL:
    print("✓ All credentials configured successfully!")

### 2.1. Global Provider Models and Agent Configuration

### Purpose
Define available models from the Flotorch gateway and configure agent-specific parameters.

### Key Components

**Global Provider Models**: These are the available models from the Flotorch gateway that can be used for evaluation and agent operations:

| Model Variable | Model Name | Description |
|----------------|------------|-------------|
| `MODEL_CLAUDE_HAIKU` | `flotorch/flotorch-claude-haiku-4-5` | Claude Haiku model via Flotorch gateway |
| `MODEL_CLAUDE_SONNET` | `flotorch/flotorch-claude-sonnet-3-5-v2` | Claude Sonnet model via Flotorch gateway |
| `MODEL_AWS_NOVA_PRO` | `flotorch/flotorch-aws-nova-pro` | AWS Nova Pro model via Flotorch gateway |
| `MODEL_AWS_NOVA_LITE` | `flotorch/flotorch-aws-nova-lite` | AWS Nova Lite model via Flotorch gateway |
| `MODEL_AWS_NOVA_MICRO` | `flotorch/flotorch-aws-nova-micro` | AWS Nova Micro model via Flotorch gateway |

**Agent Configuration Parameters**:

| Parameter | Description | Example |
|-----------|-------------|---------|
| `default_evaluator` | The LLM model used for evaluation (can use MODEL_* variables above) | `MODEL_CLAUDE_SONNET` or `flotorch/flotorch-model` |
| `agent_name` | The name of your Flotorch ADK agent | `code-reviewer-agent` |
| `app_name` | The application name identifier | `agent-evaluation-app-name_03` |
| `user_id` | The user identifier | `agent-evaluation-user-03` |


In [None]:
# ============================================================================
# Global Provider Models (Flotorch Gateway Models)
# ============================================================================
# These models are available from the Flotorch gateway and can be used
# for evaluation, agent operations, and other tasks.

MODEL_CLAUDE_HAIKU = "flotorch/flotorch-claude-haiku-4-5"
MODEL_CLAUDE_SONNET = "flotorch/flotorch-claude-sonnet-3-5-v2"
MODEL_AWS_NOVA_PRO = "flotorch/flotorch-aws-nova-pro"
MODEL_AWS_NOVA_LITE = "flotorch/flotorch-aws-nova-lite"
MODEL_AWS_NOVA_MICRO = "flotorch/flotorch-aws-nova-micro"

print("✓ Global provider models defined")

# The LLM model used for evaluation.
# Can be modified to use any MODEL_* constant above (e.g., MODEL_CLAUDE_SONNET, MODEL_AWS_NOVA_PRO)
# You can use your own models from Flotorch Console as well
default_evaluator = MODEL_CLAUDE_HAIKU

agent_name = "<your_agent_name>"  # Name of your Flotorch ADK agent, e.g. code-reviewer-agent
app_name = "<your_app_name>"      # Application name identifier, e.g. agent-evaluation-app-name_03
user_id = "<your_user_id>"        # User identifier, e.g. agent-evaluation-user-03

print("✓ Agent configuration parameters defined")



## 3. Import Required Libraries

### Purpose
Import all required components for evaluating the Code Review Agent trajectory with reference using Flotorch Eval.

### Key Components
- **`AgentEvaluator`**: Core client for agent evaluation orchestration and trace fetching
- **`TrajectoryEvalWithLLMWithReference`**: Flotorch Eval metric that uses LLM-based evaluation to assess trajectory quality by comparing against a reference trajectory
- **`ReferenceTrajectory`**: Schema for defining reference trajectories for comparison
- **`MetricConfig`**: Configuration for metric parameters including reference trajectory
- **`FlotorchADKAgent`**: Creates and configures Flotorch ADK agents with tracing
- **`FlotorchADKSession`**: Manages agent sessions for multi-turn conversations
- **`Runner`**: Executes agent queries and coordinates the agent execution flow
- **`types`**: Google ADK types for creating message content and handling agent events
- **`pandas`**: Data manipulation and display for formatted results tables
- **`display`**: IPython display utility for rendering formatted outputs

In [None]:
# Required imports
# Flotorch Eval components
from flotorch_eval.agent_eval.core.client import AgentEvaluator
from flotorch_eval.agent_eval.metrics.llm_evaluators import TrajectoryEvalWithLLMWithReference
from flotorch_eval.agent_eval.core.schemas import ReferenceTrajectory
from flotorch_eval import MetricConfig

# Flotorch ADK components
from flotorch.adk.agent import FlotorchADKAgent
from flotorch.adk.sessions import FlotorchADKSession

# Google ADK components
from google.adk.runners import Runner
from google.genai import types

# Utilities
import pandas as pd
from IPython.display import display

print("✓ Imported necessary libraries successfully")


## 4. Code Review Agent Setup

### Purpose
Set up the Code Review Agent with OpenTelemetry tracing enabled to capture detailed execution data for trajectory evaluation with reference.

### Key Components
1. **FlotorchADKAgent** (`agent_client`):
   - Initializes the agent for code review tasks
   - Configures `tracer_config` with `enabled: True` and `sampling_rate: 1` to capture 100% of traces
   - Essential for evaluation as traces contain complete trajectory information
2. **FlotorchADKSession** (`session_service`): Manages agent sessions for multi-turn conversations
3. **Runner** (`runner`): Executes agent queries and coordinates the agent execution flow

These components work together to run the Code Review Agent and generate OpenTelemetry traces for trajectory evaluation with reference analysis.

In [None]:
# Initialize Flotorch ADK Agent with tracing enabled
agent_client = FlotorchADKAgent(
    agent_name=agent_name,
    base_url=FLOTORCH_BASE_URL,
    api_key=FLOTORCH_API_KEY,
    tracer_config={
        "enabled": True,                                                   # Enable tracing so the agent trajectory is recorded
        "endpoint": "https://dev-observability.flotorch.cloud/v1/traces",  # OTLP HTTP endpoint of the dev observability service
        "sampling_rate": 1                                                 # Sample 100% of traces
    }
)
agent = agent_client.get_agent()

# Initialize session service
session_service = FlotorchADKSession(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
)

# Create the ADK Runner to execute agent queries
runner = Runner(
    agent=agent,
    app_name=app_name,
    session_service=session_service
)

print("✓ Agent, session service, and runner initialized successfully")

## 5. Helper Function for Running a Query

### Purpose
Define a helper function that executes a single-turn query with the agent and extracts the final response. The agent execution is automatically traced for trajectory evaluation.

### Functionality
The `run_single_turn` function:
- Accepts a `Runner`, query string, session ID, and user ID as parameters
- Creates a user message using Google ADK types
- Executes the query through the runner
- Iterates through events to find and return the final agent response
- Returns a fallback message if no response is found

This function simplifies the process of running queries and ensures trace generation during execution.

In [None]:
def run_single_turn(runner: Runner, query: str, session_id: str, user_id: str) -> str:
    """
    Execute a single-turn query with the agent and return the final response.
    The agent execution is traced automatically.
    """
    content = types.Content(role="user", parts=[types.Part(text=query)])
    events = runner.run(user_id=user_id, session_id=session_id, new_message=content)

    # Extract the final response
    for event in events:
        if event.is_final_response() and event.content and event.content.parts:
            return event.content.parts[0].text
    return "No response from agent."

print("✓ Helper function defined successfully")
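`run_single_turn` returns only the text of the first part of the final response. If a response is split across several parts, the remaining text would be dropped. The sketch below joins all text parts instead. `extract_final_text` and `fake_event` are hypothetical names; the helper relies only on the event attributes that `run_single_turn` already uses, and the stand-in events exist purely for a local check.

```python
from types import SimpleNamespace


def extract_final_text(events, fallback="No response from agent."):
    """Join the text of every part in the first final-response event that has text.

    Hypothetical helper: it uses the same attributes as run_single_turn
    (is_final_response, content, parts, text).
    """
    for event in events:
        if event.is_final_response() and event.content and event.content.parts:
            texts = [part.text for part in event.content.parts if getattr(part, "text", None)]
            if texts:
                return "\n".join(texts)
    return fallback


def fake_event(final, *texts):
    """Build a stand-in event object for local testing (not a real ADK event)."""
    parts = [SimpleNamespace(text=t) for t in texts]
    return SimpleNamespace(
        is_final_response=lambda: final,
        content=SimpleNamespace(parts=parts),
    )


demo_events = [fake_event(False, "thinking"), fake_event(True, "Summary", "Ratings")]
print(extract_final_text(demo_events))
```

Inside `run_single_turn`, the loop over `events` could be replaced by `return extract_final_text(events)` if multi-part responses matter for your agent.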

## 6. Define Query

### Purpose
Define the sample code review query that will be executed by the Code Review Agent to generate OpenTelemetry traces for trajectory evaluation with reference.

### Key Components
- **`query`**: A sample code review request that will be processed by the agent
  - This query contains a Python function that needs to be reviewed
  - The query will trigger the agent to analyze the code, check correctness, evaluate code quality, and provide feedback
  - The execution will be automatically traced to capture the complete agent trajectory
  - The trajectory will be evaluated using LLM-based assessment against a reference trajectory to measure quality and effectiveness
  - Example: A code review request with a Python function to analyze

The query can be modified to test different code review scenarios and evaluate trajectory quality against reference trajectories for various types of code.
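When testing several code review scenarios, it can help to build the query from an instruction and a code snippet rather than editing one long string by hand. A minimal sketch follows; `build_review_query`, `SAMPLE_CODE`, and `demo_query` are hypothetical names used only for illustration.

```python
import textwrap


def build_review_query(instruction: str, code: str) -> str:
    """Combine a review instruction and a code snippet into one query string (hypothetical helper)."""
    # Remove common indentation and surrounding blank lines from the snippet.
    snippet = textwrap.dedent(code).strip("\n")
    return f"{instruction.strip()}\n\n{snippet}"


SAMPLE_CODE = """
def add_items(items):
    total = 0
    for i in range(len(items)):
        total = total + items[i]
    return total
"""

demo_query = build_review_query(
    "Review this Python function and tell me if it works and how good the code quality is.",
    SAMPLE_CODE,
)
print(demo_query)
```

Building the query this way also makes it easy to reuse the exact same string as the `input` of a reference trajectory.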


In [None]:
# Define the code review query that the agent will process

query = """Review this Python function and tell me if it works and how good the code quality is.

def add_items(items):
    total = 0
    for i in range(len(items)):
        total = total + items[i]
    return total"""

print(f"✓ Query defined: {query}")

## 7. Run the Query and Get Trace ID

### Purpose
Execute a sample code review query with the Code Review Agent to generate OpenTelemetry traces that contain trajectory data for evaluation.

### Process
1. **Create Session**: Initialize a new session for the agent interaction
2. **Execute Query**: Run a sample code review query through the agent
3. **Retrieve Trace IDs**: Extract the generated trace IDs from the agent client
4. **Display Results**: Print the agent response and trace ID for verification

The execution automatically generates OpenTelemetry traces that record the complete agent trajectory, which will be used for trajectory evaluation with reference.

In [None]:
# Create a new session
session = await runner.session_service.create_session(
    app_name=app_name,
    user_id=user_id,
)
print(f"Session created: {session.id}")

response = run_single_turn(
    runner=runner,
    query=query,
    session_id=session.id,
    user_id=user_id
)

# Retrieve the generated trace IDs
trace_ids = agent_client.get_tracer_ids()
print("Agent Response:")
print(response[:200] + "..." if len(response) > 200 else response)
print(f"Found {len(trace_ids)} trace(s). First trace ID: {trace_ids[0] if trace_ids else 'N/A'}")
print(f"✓ Query execution completed successfully")

## 8. Define Reference Trajectory

### Purpose
Define a reference trajectory that represents the expected behavior for the code review task. This reference serves as a benchmark for evaluating the agent's actual performance.

### Key Components
1. **Reference Trajectory Structure**:
   - **`input`**: The expected input query (code review request)
   - **`expected_steps`**: Array of expected steps, each containing:
     - **`thought`**: Expected reasoning/thought process for the step
     - **`final_response`**: Expected final response structure and content
2. **ReferenceTrajectory Schema**: Validates the reference trajectory structure using the `ReferenceTrajectory` schema
3. **Validation**: Ensures the reference trajectory is properly formatted before use in evaluation

This reference trajectory will be used by the TrajectoryEvalWithLLMWithReference metric to compare the agent's actual trajectory against the expected behavior.
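Before handing the dictionary to the schema, a quick structural pre-check can give friendlier error messages. The sketch below mirrors the fields listed above and assumes every step carries both `thought` and `final_response`, as in this notebook. `check_reference_shape` is a hypothetical helper and does not replace validation by `ReferenceTrajectory`.

```python
def check_reference_shape(reference: dict) -> list:
    """Return a list of problems found in a reference trajectory dict (empty if none).

    Hypothetical pre-check based on the structure described in this section.
    """
    problems = []
    if not isinstance(reference.get("input"), str) or not reference["input"].strip():
        problems.append("'input' must be a non-empty string")

    steps = reference.get("expected_steps")
    if not isinstance(steps, list) or not steps:
        problems.append("'expected_steps' must be a non-empty list")
        return problems

    for index, step in enumerate(steps):
        for key in ("thought", "final_response"):
            if not isinstance(step, dict) or not step.get(key):
                problems.append(f"expected_steps[{index}] is missing '{key}'")
    return problems


demo_reference = {
    "input": "Review this function.",
    "expected_steps": [{"thought": "Analyze correctness."}],
}
print(check_reference_shape(demo_reference))
```

An empty list means the dictionary has the expected shape; any returned strings name the field that needs attention.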

In [None]:
REFERENCE_TRAJECTORY = {
    "input": """Review this Python function and tell me if it works and how good the code quality is.

def add_items(items):
    total = 0
    for i in range(len(items)):
        total = total + items[i]
    return total
""",
    "expected_steps": [
        {
            "thought": "The user wants a review of the Python function, focusing on whether it works and the quality of the implementation. I will analyze correctness, edge cases, style, performance, and then provide a structured review.",
            "final_response": """### Summary

- The function `add_items` sums all elements in the `items` list.
- It likely works correctly when `items` is a list of numeric values.
- The implementation is not very idiomatic Python and can be simplified.
- There is no validation for invalid or unexpected input types.

### Correctness

- For a typical case like `add_items([1, 2, 3])`, the function will return `6`, which is correct.
- If `items` contains non-numeric values (e.g., strings or `None`), it will raise a `TypeError` when trying to add them.
- If `items` is not indexable or does not implement `__len__` (e.g., a generator), the function will also fail.
- There is no explicit handling of `None` or other invalid inputs.

### Code Quality & Readability

- The logic is straightforward and easy to understand.
- Using `range(len(items))` is not idiomatic Python for iterating over a list.
- A more Pythonic version would iterate directly over the elements or use the built-in `sum` function.
- Variable naming is acceptable, but the function could be shorter and clearer.

### Performance

- Time complexity is O(n), where n is the length of `items`, which is optimal for summing a list.
- There is a minor overhead from indexing (`items[i]`) instead of iterating directly over the elements.
- For most real-world cases, this overhead is negligible, but the idiomatic version is both clearer and slightly more efficient.

### Improved Version

A more Pythonic and concise implementation would be:

```python
def add_items(items):
    return sum(items)
```

If you want to be safer about input types, you could add basic validation:

```python
def add_items(items):
    if items is None:
        return 0
    return sum(items)
```

### Ratings

- Correctness: 8/10 (works for standard numeric lists, no validation)
- Code Quality: 6/10 (clear but non-idiomatic and slightly verbose)
- Overall: 7/10"""
}
]
}

validated_ref = ReferenceTrajectory(**REFERENCE_TRAJECTORY)
print(validated_ref)
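Because the metric compares the agent's run against the reference, it is worth confirming that the query sent to the agent and the reference `input` describe the same request. The sketch below compares the two after collapsing whitespace, so trailing newlines or indentation differences do not count as mismatches. `normalize_whitespace` and `inputs_match` are hypothetical helpers, and whether the metric itself is sensitive to such differences is not something this notebook states.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return " ".join(text.split())


def inputs_match(query_text: str, reference: dict) -> bool:
    """Return True if the query and the reference input are equal up to whitespace."""
    return normalize_whitespace(query_text) == normalize_whitespace(reference["input"])


demo_ref = {"input": "Review this function.\n\ndef f():\n    return 1\n"}
print(inputs_match("Review this function.\n\ndef f():\n    return 1", demo_ref))   # True
print(inputs_match('Review this function."\n\ndef f():\n    return 1', demo_ref))  # False
```

In the notebook, `inputs_match(query, REFERENCE_TRAJECTORY)` would flag a stray character in either string before any evaluation is run.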

## 9. Trajectory Evaluation with Reference

### Purpose
Initialize the `AgentEvaluator`, pass the reference trajectory defined in Section 8 to the metric, fetch the OpenTelemetry trace, and prepare the `TrajectoryEvalWithLLMWithReference` metric that evaluates trajectory quality against that reference. The evaluation metric **trajectory_evaluation_with_reference** assesses how closely the Code Review Agent's trajectory follows the expected reference.

### Key Components
1. **Reference Trajectory**: Defines the expected trajectory with input, expected steps, thoughts, and final response
2. **TrajectoryEvalWithLLMWithReference**: Initializes the trajectory evaluation metric that uses LLM-based evaluation to compare actual trajectory against reference
3. **AgentEvaluator** (`client`):
   - Connects to the Flotorch gateway using API credentials
   - Configured with a default evaluator model
   - Provides methods to fetch and evaluate traces
4. **Trace Fetching**: Retrieves the complete trace data using the trace ID generated during agent execution

The fetched trace contains detailed information about the complete agent trajectory, which will be analyzed by the TrajectoryEvalWithLLMWithReference metric to compute the trajectory_evaluation_with_reference score by comparing against the reference trajectory.

In [None]:
# Initialize the TrajectoryEvalWithLLMWithReference metric with reference trajectory
metrics = [TrajectoryEvalWithLLMWithReference(
    llm=default_evaluator,
    config=MetricConfig(
        metric_params={"reference": REFERENCE_TRAJECTORY}
    )
)]

# Initialize the AgentEvaluator client
client = AgentEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    default_evaluator=default_evaluator
)

traces = None
if trace_ids:
    # Fetch the trace data from the Flotorch gateway
    traces = client.fetch_traces(trace_ids[0])
    print(f"✓ Trace fetched successfully")
else:
    print("✗ No trace IDs found to fetch.")
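The cell above fetches the trace once. Tracing backends often export spans in batches, so a trace may not be available the instant the agent run finishes. The sketch below retries a fetch a few times with a pause in between. `fetch_with_retry` is a hypothetical helper, and it assumes the fetch callable returns a falsy value or raises while the trace is not yet available; the actual behavior of `client.fetch_traces` in that situation is not documented in this notebook.

```python
import time


def fetch_with_retry(fetch, trace_id, attempts=5, delay_seconds=2.0):
    """Call fetch(trace_id) until it returns a truthy value or attempts run out.

    Hypothetical helper. Returns None if every attempt came back empty and
    re-raises the last error if the final attempt raised.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = fetch(trace_id)
            last_error = None
            if result:
                return result
        except Exception as error:  # broad on purpose: the failure mode is an assumption
            last_error = error
        if attempt < attempts:
            time.sleep(delay_seconds)
    if last_error is not None:
        raise last_error
    return None
```

With the objects from this notebook, the call would look like `traces = fetch_with_retry(client.fetch_traces, trace_ids[0])`.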

## 10. Run Evaluation

### Purpose
Execute the trajectory evaluation with reference by processing the fetched OpenTelemetry trace using the TrajectoryEvalWithLLMWithReference metric to assess trajectory quality against the reference.

### Process
- Calls `client.evaluate()` with the trace data, TrajectoryEvalWithLLMWithReference metric, and reference trajectory
- The evaluator processes the trace to analyze the complete agent trajectory
- Compares the actual trajectory against the reference trajectory
- Computes the **trajectory_evaluation_with_reference** metric which includes:
  - Quality score (0.0 to 1.0) indicating how well the trajectory matches the reference
  - Detailed LLM-based evaluation explanation comparing actual vs. expected trajectory
  - Assessment of alignment with reference in terms of reasoning, tool usage, and response quality
- Returns evaluation results with trajectory ID and metric scores

This step generates the trajectory evaluation with reference analysis that will be displayed in the next section.

In [None]:
if traces:
    # Evaluate the trace using the TrajectoryEvalWithLLMWithReference metric
    results = await client.evaluate(
        trace=traces,
        metrics=metrics,
        reference=REFERENCE_TRAJECTORY
    )

    print("✓ Evaluation completed successfully!")
else:
    print("Cannot evaluate: No traces were available.")

## 11. Display and Interpret Results

### Purpose
Define helper functions to format and display the evaluation output clearly, showing the trajectory_evaluation_with_reference metric results in a readable format.

### Functionality
The `display_metrics` function:
- Extracts the `trajectory_evaluation_with_reference` metric from evaluation results
- Formats the quality score and evaluation details
- Creates a structured display showing:
  - Trajectory Quality Score (0.0 to 1.0) compared to reference
  - Detailed LLM-based evaluation explanation comparing actual vs. expected trajectory
- Uses pandas DataFrame with styled formatting for clean presentation

This function provides a user-friendly way to visualize trajectory evaluation with reference metrics.

In [None]:
import pandas as pd
from IPython.display import display

def display_metrics(result):
    """
    Display only the 'trajectory_evaluation_with_reference' metric.
    """
    print(f"Trajectory ID: {result.trajectory_id}")
    print(f"Timestamp    : {result.timestamp}\n")

    # Find the metric
    metric = next(
        (m for m in result.scores if m.name == "trajectory_evaluation_with_reference"),
        None
    )
    if not metric:
        print("Metric 'trajectory_evaluation_with_reference' not found.")
        return

    # Format details (simple key:value lines)
    details = "\n".join(f"{k}: {v}" for k, v in metric.details.items())

    # Build DataFrame
    df = pd.DataFrame([{
        "Metric": metric.name,
        "Score": f"{metric.score:.2f}",
        "Details": details
    }])

    # Display with multiline support
    display(
        df.style.set_properties(
            subset=["Details"],
            **{"white-space": "pre-wrap", "text-align": "left"}
        )
    )

print("✓ Display metrics function defined successfully")

## 12. View Trajectory Evaluation Results

### Purpose
Display the trajectory evaluation with reference results in a formatted table showing the complete assessment for the Code Review Agent.

### Output
The displayed table includes:
- **Metric**: The evaluation metric name (trajectory_evaluation_with_reference)
- **Score**: The trajectory quality score (0.0 to 1.0) compared to reference
- **Details**: Comprehensive evaluation showing:
  - Quality score out of 1.0
  - Detailed LLM-based explanation comparing actual vs. expected trajectory
  - Assessment of alignment with reference in reasoning, tool usage, and response quality

This visualization helps identify trajectory quality issues and optimize the agent's code review capabilities to better match expected behavior.

In [None]:
if 'results' in locals():
    display_metrics(results)
else:
    print("No results object found. Please run Sections 7, 9, and 10 first.")

### Interpreting the Trajectory Evaluation with Reference Results

The **trajectory_evaluation_with_reference** metric is a vital tool for quality monitoring of the Code Review Agent:

* **Quality Score (0.0 to 1.0)**: Indicates how well the agent's trajectory matches the reference trajectory:
    * **1.0**: Excellent alignment. The trajectory closely matches the reference in reasoning, analysis, and response quality
    * **0.5 to below 1.0**: Good alignment with minor deviations from the reference in some aspects
    * **Below 0.5**: Poor alignment. There are significant differences from the reference in reasoning or response quality
* **Evaluation Details**: Provides a detailed LLM-based explanation of:
    * **Reasoning Alignment**: Whether the agent's thought process matches the expected reasoning in the reference
    * **Analysis Quality**: Assessment of how well the code analysis compares to the reference (correctness, code quality, performance analysis)
    * **Response Quality**: Evaluation of how well the final response matches the reference in structure, completeness, and accuracy
    * **Overall Trajectory Comparison**: Comprehensive evaluation of how the actual trajectory aligns with the expected reference trajectory

For a Code Review Agent, understanding trajectory evaluation with reference helps identify:
- **Reasoning gaps**: If the agent's analysis process deviates from expected code review methodology
- **Analysis completeness**: If the agent misses important aspects (correctness, code quality, performance) compared to the reference
- **Response structure**: If the agent's response format and organization differ from the expected reference format
- **Overall effectiveness**: Monitor trajectory alignment to ensure the agent delivers comprehensive and structured code reviews that match expected quality standards
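For dashboards or automated checks, the score bands above can be turned into labels. The sketch below is one possible mapping; the exact thresholds are an assumption (1.0 is "excellent", anything from 0.5 up to 1.0 is "good", anything below 0.5 is "poor"), and `describe_alignment` is a hypothetical helper rather than part of Flotorch Eval.

```python
def describe_alignment(score: float) -> str:
    """Map a trajectory score in [0.0, 1.0] to a coarse label (assumed thresholds)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 1.0:
        return "excellent"
    if score >= 0.5:
        return "good"
    return "poor"


for value in (1.0, 0.7, 0.3):
    print(value, describe_alignment(value))
```

The label is only a summary; the LLM-written details remain the place to look for why a trajectory diverged from the reference.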

## 13. Summary of Agent Trajectory Evaluation with Reference Notebook

This notebook demonstrates a methodology for evaluating the trajectory quality of a **Flotorch ADK Agent** (configured as a **Code Review Agent** that analyzes code for issues and suggests improvements and optimizations) using the **Flotorch Eval framework** with reference trajectory comparison.

**Use Case**: Code Review Agent - Analyzes code for issues and suggests improvements and optimizations.

**Evaluation Metric**: trajectory_evaluation_with_reference

## Core Process

### 1. Setup and Instrumentation
- Configure a `FlotorchADKAgent` for code review tasks.
- Enable **OpenTelemetry Tracing** via the `tracer_config`.
- This instrumentation allows detailed capture of the complete agent trajectory and decision-making process.

### 2. Reference Trajectory Definition
- Define a reference trajectory that represents the expected behavior for the code review task.
- Include expected input, thought process, and final response structure.
- This reference serves as a benchmark for evaluating the agent's actual performance.

### 3. Execution and Data Generation
- Run a sample code review query through the agent using the **Runner**.
- This automatically generates an **Agent Trajectory** in the form of OpenTelemetry traces.
- The trace records the complete execution path, including:
  - Code analysis and reasoning
  - Issue identification and categorization
  - Improvement suggestions and optimizations
  - Step-by-step agent operations

### 4. Evaluation
- Use the `AgentEvaluator` client along with the specialized **TrajectoryEvalWithLLMWithReference** metric from `flotorch-eval`.
- The evaluator processes the trace data to compute trajectory quality statistics using the **trajectory_evaluation_with_reference** metric by comparing against the reference trajectory.

### 5. Analysis
- The notebook displays a thorough trajectory evaluation assessment, including:
  - **Quality Score** (0.0 to 1.0) compared to reference
  - **LLM-based Evaluation Details** explaining how the actual trajectory compares to the reference
  - Assessment of alignment in reasoning, analysis quality, and response structure

## Purpose and Benefits

This evaluation provides **actionable quality metrics** that help developers:

- Identify trajectory quality issues in the Code Review Agent compared to expected behavior  
- Optimize code review methodology to better match reference standards  
- Track quality trends over time  
- Ensure the Code Review Agent delivers **comprehensive and structured code reviews** that match expected quality standards via reference trajectory comparison