# Agent Evaluation for Customer Support: Comparing Models and Parameters

| | |
|-|-|
| Authors | [Anish Shah](https://github.com/ash0ts) |

## Overview

This notebook demonstrates how to generate synthetic evaluation data and compare different models and parameters for a customer support agent. We'll explore three main types of evaluations:

1. Final Response Evaluation: Assessing the agent's final answer
2. Single Step Evaluation: Evaluating individual tool selections
3. Trajectory Evaluation: Analyzing the complete path of actions
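
As a rough illustration of how the three levels differ, the sketch below scores one recorded agent run against a reference. The record layout and the helper names (`score_final_response`, `score_single_step`, `score_trajectory`) are hypothetical and use simple exact matching; the notebook itself relies on `AgentEvaluator` and a judge model instead.

```python
from typing import Dict, List

# Hypothetical record of one agent run and its reference; field names are illustrative only.
reference = {
    "final_response": "Order OD123456 has shipped.",
    "tool_calls": ["OrderStatusTool"],
}
run = {
    "final_response": "Order OD123456 has shipped.",
    "tool_calls": ["ProductSearchTool", "OrderStatusTool"],
}


def score_final_response(run: Dict, reference: Dict) -> float:
    """Final response evaluation: compare only the last answer."""
    return float(run["final_response"].strip() == reference["final_response"].strip())


def score_single_step(run: Dict, reference: Dict, step: int) -> float:
    """Single step evaluation: was the tool chosen at one step the expected one?"""
    actual: List[str] = run["tool_calls"]
    expected: List[str] = reference["tool_calls"]
    if step >= len(actual) or step >= len(expected):
        return 0.0
    return float(actual[step] == expected[step])


def score_trajectory(run: Dict, reference: Dict) -> float:
    """Trajectory evaluation: compare the full ordered sequence of tool calls."""
    return float(run["tool_calls"] == reference["tool_calls"])


scores = {
    "final_response": score_final_response(run, reference),
    "single_step_0": score_single_step(run, reference, 0),
    "trajectory": score_trajectory(run, reference),
}
print(scores)
```

In this toy run the final answer is correct even though the agent took an unnecessary first step, which is why the three levels are worth scoring separately.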

The tutorial uses the following services and resources:
* [Vertex AI Gen AI Evaluation](https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-overview)
* [Weights & Biases Weave](https://wandb.me/tryweave)

The steps performed include:
* Generating synthetic evaluation data
* Setting up evaluation metrics
* Comparing model performance
* Analyzing evaluation results

**NOTE:** Please run the `setup.py` script in this folder before running the notebook.

In [1]:
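IN_COLAB = False
try:
    import google.colab  # only importable inside Colab
    IN_COLAB = True
    # Clone the repository and install its dependencies
    !git clone https://github.com/ash0ts/generative-ai.git
    %cd generative-ai/gemini/evaluation/synthetic-data-evals
    !pip install -qqq uv
    !uv pip install --system --requirements pyproject.toml
except ImportError:
    # Not running in Colab; assume the local environment is already set up
    pass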

In [None]:
from set_env import set_env
set_env("GEMINI_API_KEY")
# set_env("HUGGING_FACE_HUB_TOKEN")
set_env("VERTEX_PROJECT_ID")
set_env("VERTEX_LOCATION")
set_env("VERTEX_MODEL_ID")
set_env("VERTEX_ENDPOINT_ID")
# set_env("DEEPSEEK_ENDPOINT_ID")
print("Set API Keys")

In [None]:
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
from rich.console import Console
from rich.table import Table
import weave
from typing import List, Dict, Any, Tuple
import os

# Import our modules
from evaluator import AgentEvaluator, load_dataset
from dataset_generator import DatasetGenerator, create_customer_support_agent_evaluation_dataset
from customer_support_agent import create_customer_support_agent
from render_evals import render_model_comparison, render_difficulty_analysis, render_temperature_analysis, render_conclusion
from config import WEAVE_PROJECT_NAME

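# Jupyter already runs an event loop, so allow nested asyncio loops there
try:
    from IPython import get_ipython

    in_jupyter = get_ipython() is not None
except ImportError:
    in_jupyter = False
if in_jupyter:
    import nest_asyncio

    nest_asyncio.apply()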

# Initialize console for rich output
console = Console()
console.rule("[bold magenta]Agent Evaluation Framework")

# Initialize Weave for experiment tracking
weave.init(WEAVE_PROJECT_NAME)

In [4]:
# Define model configurations to test
model_configs = [
    {"model_id": "google/gemini-1.5-pro", "temperature": 0.2, "name": "Gemini Pro (Low Temp)"},
    {"model_id": "google/gemini-1.5-pro", "temperature": 0.7, "name": "Gemini Pro (High Temp)"},
    {"model_id": "google/gemini-2.0-flash-lite", "temperature": 0.2, "name": "Gemini Flash Lite (Low Temp)"},
    {"model_id": "google/gemini-2.0-flash-lite", "temperature": 0.7, "name": "Gemini Flash Lite (High Temp)"},
    # Add any other Gemini or open-source model served on Vertex AI here
]

# Customer Support Agent Architecture

This section describes the architecture and implementation of our customer support agent system, which is built using a combination of LLM-powered components and specialized tools.

## Agent Overview

The customer support agent is designed to handle a variety of e-commerce related queries by leveraging:

1. **Foundation Models**: The agent can be powered by different models including Gemini (1.5 Pro, 2.0 Flash, etc.) through Vertex AI.

2. **Specialized Tools**: A collection of purpose-built tools that provide domain-specific functionality:
   - `ProductSearchTool`: Searches the product catalog by name, category, or description
   - `OrderStatusTool`: Checks the status of customer orders
   - `CategoryBrowseTool`: Allows browsing products by category
   - `PriceCheckTool`: Retrieves pricing information for specific products
   - `CustomerOrderHistoryTool`: Retrieves order history for customers

3. **Realistic Data**: The system uses product and order data derived from Amazon reviews, so queries and tool results resemble a real e-commerce support workload.
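
To make the idea of a tool concrete, here is a minimal, framework-free sketch of an order-status lookup. The class name `ToyOrderStatusTool`, its fields, and the in-memory `ORDERS` table are assumptions made for illustration; the real tools live in `customer_support_agent.py` and may be structured differently.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical in-memory order table standing in for the real order data.
ORDERS: Dict[str, str] = {"OD123456": "shipped", "OD654321": "processing"}


@dataclass
class ToyOrderStatusTool:
    """Illustrative tool: a name and description the model can read, plus a callable."""

    name: str = "order_status"
    description: str = "Returns the status of an order given its order ID."

    def __call__(self, order_id: str) -> str:
        status = ORDERS.get(order_id)
        if status is None:
            # Return a readable message so the agent can relay it to the customer
            return f"No order found with ID {order_id}."
        return f"Order {order_id} is {status}."


tool = ToyOrderStatusTool()
print(tool("OD123456"))
print(tool("OD000000"))
```

Returning a message for unknown IDs, rather than raising, is one possible design choice: it gives the agent something it can pass on or act on in its next step.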

## Technical Implementation

The implementation consists of two main components:

1. **`customer_support_agent.py`**: Contains the tool implementations and agent creation logic. The `create_customer_support_agent()` function configures the agent with the specified model, temperature, planning capabilities, and tools.

2. **`vertex_model.py`**: Provides the model implementations that connect to Vertex AI:
   - `VertexAIServerModel`: Base implementation for connecting to Vertex AI endpoints
   - `WeaveVertexAIServerModel`: Extends the base model with Weights & Biases Weave tracking for experiment monitoring

## Configuration Options

The agent can be configured with various parameters:
- Model selection (Gemini 1.5 Pro, 2.0 Flash, etc.)
- Temperature settings for controlling response randomness
- Planning interval to determine how often the agent should plan its actions
- Maximum steps to limit the complexity of interactions

This architecture enables comprehensive evaluation of different model configurations and parameters to determine the optimal setup for customer support scenarios.
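
The effect of the temperature setting can be seen with a small worked example. Sampling temperature divides the model's logits before the softmax, so lower values concentrate probability on the top-scoring token. The logits below are made-up numbers, and `softmax_with_temperature` is an illustrative helper, not part of the notebook's modules.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 0.7)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At 0.2 nearly all probability mass sits on the first token, while at 0.7 the alternatives keep a noticeable share, which is why the higher setting produces more varied outputs.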


In [None]:
console.print("[bold blue]Creating customer support agent[/bold blue]")

# Initialize a customer support agent for generating high-quality evaluation data
base_agent = create_customer_support_agent(
    use_weave=True,                    # Enable Weave for experiment tracking
    model_id="google/gemini-2.0-flash", # Gemini 2.0 Flash for fast generation
    temperature=0.1,                   # Low temperature for more consistent outputs
    planning_interval=2,               # Plan every 2 steps for better reasoning
    max_steps=4                        # Allow up to 4 steps to handle medium-complex queries
)

In [None]:
base_agent.run("What is the best item in the category of book?")

# Synthetic Dataset Generation for Agent Evaluation

This section explains how we generate realistic test data to evaluate our customer support agent across different configurations and scenarios.

## Why We Need Synthetic Evaluation Data

Testing with synthetic data helps us:

1. **Compare Models Fairly**: We can test different models (Gemini Pro vs Flash) and settings (temperature values) on the exact same customer queries.

2. **Save Time and Resources**: Creating test data is faster and cheaper than collecting real customer conversations.

3. **Cover Edge Cases**: We can include challenging scenarios that might be rare in real data but important to test.

4. **Ensure Consistent Quality**: By keeping only examples that meet quality thresholds, we build a reliable benchmark dataset.

## How the Dataset Generator Works

The `DatasetGenerator` class creates evaluation data through several steps:

1. **Creating Realistic Queries**: Generates e-commerce questions using real product IDs, categories, and customer information:
   - "I'm looking for products in the Books category. What do you have?"
   - "Can you check the status of my order OD123456?"
   - "What's your best Electronics product? I need something reliable."

2. **Recording Agent Behavior**: Runs the agent on these queries and captures:
   - Which tools the agent used (ProductSearch, OrderStatus, etc.)
   - The arguments passed to each tool
   - The agent's reasoning at each step
   - The final response to the user

3. **Evaluating Quality**: Uses a judge model to score:
   - Final response quality (0-1 score)
   - Individual step effectiveness (0-1 score per step)
   - Overall trajectory coherence (0-1 score)

4. **Filtering Results**: Only keeps examples that meet quality thresholds (typically 0.7 for each dimension).
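
The filtering step can be sketched as follows. This is a simplified stand-in, not the `DatasetGenerator` implementation: the candidate records are invented, and the score keys are assumed to match the threshold keys, with one aggregated `single_step` score per example.

```python
from typing import Dict, List

# Thresholds mirror the values passed to DatasetGenerator in this notebook.
thresholds = {"final_response": 0.7, "single_step": 0.7, "trajectory": 0.7}

# Hypothetical judged examples; scores are made up for illustration.
candidates: List[Dict] = [
    {
        "prompt": "Can you check the status of my order OD123456?",
        "scores": {"final_response": 0.9, "single_step": 0.8, "trajectory": 0.75},
    },
    {
        "prompt": "What's your best Electronics product?",
        "scores": {"final_response": 0.9, "single_step": 0.6, "trajectory": 0.8},
    },
]


def passes_thresholds(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Keep an example only if every dimension meets its threshold; a missing score fails."""
    return all(scores.get(name, 0.0) >= minimum for name, minimum in thresholds.items())


kept = [ex for ex in candidates if passes_thresholds(ex["scores"], thresholds)]
print([ex["prompt"] for ex in kept])
```

The second example is dropped because one dimension falls below 0.7, even though its final response scored well.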

## Dataset Structure and Applications

Each example in the final dataset includes:
- The original user prompt
- Expected tool usage sequence
- Validation criteria
- Difficulty rating (easy, medium, hard)
- Metadata about model configuration
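
One way to picture a single entry is shown below. The `EvalExample` dataclass and its field names are a hypothetical schema inferred from the list above; the JSON actually written by `generator.save_dataset` may use different keys.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class EvalExample:
    """Hypothetical schema for one dataset entry; the real field names may differ."""

    prompt: str
    expected_tools: List[str]
    validation_criteria: List[str]
    difficulty: str  # "easy", "medium", or "hard"
    metadata: Dict[str, Any] = field(default_factory=dict)


example = EvalExample(
    prompt="Can you check the status of my order OD123456?",
    expected_tools=["OrderStatusTool"],
    validation_criteria=["Mentions the order ID", "States the current order status"],
    difficulty="easy",
    metadata={"model_id": "google/gemini-2.0-flash", "temperature": 0.1},
)

# Serialize to JSON, the format the notebook uses for the saved dataset file
serialized = json.dumps(asdict(example), indent=2)
print(serialized)
```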

This dataset lets us:
1. Determine which model performs best for customer support
2. Find the right temperature settings for different query types
3. Measure whether planning capabilities improve results
4. Identify specific areas where the agent needs improvement

By testing systematically, we can build a more effective customer support agent that handles user queries accurately while using computational resources efficiently.


In [None]:
# Initialize dataset generator
thresholds = {
    "final_response": 0.7,
    "single_step": 0.7,
    "trajectory": 0.7,
}
generator = DatasetGenerator(agent=base_agent, thresholds=thresholds, debug=True)

# Generate comprehensive dataset with different scenarios
console.print("[bold blue]Generating customer support evaluation dataset...[/bold blue]")

dataset = create_customer_support_agent_evaluation_dataset(generator, base_agent, num_prompts=5)  # Adjust number as needed

# Save generated dataset
dataset_path = "customer_support_eval.json"
generator.save_dataset(dataset, dataset_path)

console.print(f"[green]✓[/green] Dataset generation complete! Saved to {dataset_path}")

# Agent Evaluation Framework

This section explains how we evaluate different model configurations using a comprehensive evaluation framework to identify the best-performing agent setup.

## Setting Up the Evaluation Pipeline

The evaluation process begins by initializing the evaluation framework defined in `evaluator.py`.
The code cell at the end of this section then loads the previously generated synthetic dataset and formats it for evaluation with Vertex AI.

## How the Evaluation Process Works

The `AgentEvaluator` class provides a systematic approach to testing agent performance across multiple dimensions:

1. **Multi-dimensional Metrics**: We evaluate the agent on several key aspects:
   - **Tool Selection Accuracy**: How well the agent chooses appropriate tools
   - **Reasoning Quality**: The logical coherence of the agent's thinking process
   - **Response Correctness**: Accuracy and completeness of final answers
   - **Trajectory Match**: How well the agent's path aligns with expected solutions
   - **Coherence**: Overall clarity and consistency of responses

2. **Vertex AI Integration**: The evaluator uses Vertex AI's evaluation capabilities to score agent responses with the same automated criteria for every configuration, reducing the subjectivity of manual review.

3. **Weights & Biases Weave Integration**: Results are automatically logged to Weave, enabling:
   - Interactive visualization of agent performance
   - Comparison between different model configurations
   - Tracking of experiments over time
   - Sharing results with team members

4. **Visualization Tools**: The framework generates charts and tables to help identify patterns:
   - Score distribution plots showing performance across metrics
   - Difficulty heatmaps revealing how the agent handles easy vs. hard queries
   - Correlation analysis between trajectory quality and response accuracy

## Practical Applications

This evaluation framework helps us answer key questions:

1. **Which model performs best?** Compare Gemini Pro vs. Flash models on the same test cases.

2. **Which temperature setting works better?** Test whether low temperature (0.2) or high temperature (0.7) produces better results.

3. **Where are the weaknesses?** Identify specific query types or metrics where the agent underperforms.

4. **Is planning helpful?** Measure whether enabling planning capabilities improves overall performance.

By running all model configurations through this standardized evaluation process, we can make data-driven decisions about which agent setup to deploy for customer support scenarios, balancing performance and efficiency.


In [None]:
# Initialize evaluator
console.print("[bold blue]Initializing evaluator...[/bold blue]")
evaluator = AgentEvaluator(
    verbosity=1,
    project=os.getenv("VERTEX_PROJECT_ID"),
    location=os.getenv("VERTEX_LOCATION"),
)
console.print("[green]✓[/green] Evaluator initialized")

# Load the generated dataset and convert it to the evaluator's input format
all_examples = load_dataset("customer_support_eval.json")
console.print(f"[bold blue]Formatting dataset with {len(all_examples)} examples for evaluation...[/bold blue]")
eval_dataset = evaluator.format_dataset_for_eval(all_examples)
console.print("[green]✓[/green] Dataset formatted successfully")

# Running Model Evaluations

This section shows how we test different model configurations against our evaluation dataset to find the best setup for customer support.

## Testing Multiple Configurations

We evaluate four different configurations:
- Gemini Pro with low temperature (0.2)
- Gemini Pro with high temperature (0.7)
- Gemini Flash Lite with low temperature (0.2)
- Gemini Flash Lite with high temperature (0.7)

Each configuration is tested on the same set of customer queries, ensuring a fair comparison.
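
The four configurations form a 2 x 2 grid of models and temperatures. As a sketch, the same list could be built programmatically, which makes it easier to add a model or a temperature later. The variable names `models`, `temperatures`, and `grid` are illustrative; the model IDs and display names match the configurations defined earlier in this notebook.

```python
from itertools import product
from typing import Any, Dict, List

# Display name -> model ID, matching the configurations defined earlier
models = {
    "Gemini Pro": "google/gemini-1.5-pro",
    "Gemini Flash Lite": "google/gemini-2.0-flash-lite",
}
# Label -> temperature value
temperatures = {"Low Temp": 0.2, "High Temp": 0.7}

# Cross every model with every temperature
grid: List[Dict[str, Any]] = [
    {"model_id": model_id, "temperature": temp, "name": f"{model_name} ({temp_name})"}
    for (model_name, model_id), (temp_name, temp) in product(models.items(), temperatures.items())
]

for config in grid:
    print(config)
```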

In [None]:
@weave.op()
def evaluate_model_config(config: Dict[str, Any], eval_dataset: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Evaluate a specific model configuration and return results"""
    console.print(f"\n[bold blue]Evaluating {config['name']}...[/bold blue]")
    
    # Create agent with this configuration
    agent = create_customer_support_agent(
        use_weave=True,
        model_id=config["model_id"],
        temperature=config["temperature"],
        planning_interval=config.get("planning_interval", 1),
        max_steps=config.get("max_steps", 5)
    )
    
    # Run evaluation with the agent object directly
    results = evaluator.run_evaluation(
        agent=agent, 
        eval_dataset=eval_dataset, 
        output_dir=f"evaluation_results/{config['name'].replace(' ', '_').lower()}",
        weave_project=WEAVE_PROJECT_NAME
    )
    
    # Add configuration details to results
    results["config"] = config
    
    return results

# Run evaluations for all configurations
all_results = []
for config in model_configs:
    results = evaluate_model_config(config, eval_dataset)
    all_results.append(results)
    
    # Display summary results
    console.print(f"\n[bold green]Results for {config['name']}:[/bold green]")
    if "summary_metrics" in results and results["summary_metrics"]:
        table = evaluator._render_summary_table(results["summary_metrics"])
        console.print(table)
    else:
        console.print("[yellow]No summary metrics available[/yellow]")

# Compare Results in Weave

Open the Weave project named by `WEAVE_PROJECT_NAME` in `config.py` to compare the evaluation runs side by side.

# Analyzing Model Performance Patterns

After running evaluations on different model configurations, we analyze the results to uncover key patterns and insights.

## Response Quality by Model Type

We compare how Gemini Pro and Gemini Flash Lite perform on response quality metrics. The analysis reveals which model provides more accurate and helpful customer support responses across our test cases.

## Impact of Temperature Settings

By examining how temperature affects performance, we can determine whether the lower temperature (0.2) produces more consistent, reliable responses or whether the higher temperature (0.7) generates more helpful, creative solutions for customer queries.

## Performance Across Query Difficulty

We analyze how different models handle queries of varying complexity:
- Which model excels at simple, straightforward questions?
- Which configuration best handles complex, multi-part customer issues?
- Are there specific difficulty levels where one model significantly outperforms others?
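
The kind of breakdown behind these questions can be sketched as a group-by over per-example scores. The rows below are made-up placeholder values for two unnamed configurations, not results from this notebook, and `mean_score_by_group` is a hypothetical helper; the actual analysis is produced by the `render_*` functions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Made-up per-example scores purely for illustration; not results from this notebook.
rows = [
    {"config": "A", "difficulty": "easy", "score": 1.0},
    {"config": "A", "difficulty": "hard", "score": 0.5},
    {"config": "A", "difficulty": "hard", "score": 0.7},
    {"config": "B", "difficulty": "easy", "score": 0.8},
    {"config": "B", "difficulty": "hard", "score": 0.8},
]


def mean_score_by_group(rows: List[Dict]) -> Dict[Tuple[str, str], float]:
    """Average the score for each (configuration, difficulty) pair."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for row in rows:
        grouped[(row["config"], row["difficulty"])].append(row["score"])
    return {key: mean(values) for key, values in grouped.items()}


summary = mean_score_by_group(rows)
for (config, difficulty), avg in sorted(summary.items()):
    print(f"{config:<2} {difficulty:<5} {avg:.2f}")
```

A table like this makes it visible when one configuration leads on easy queries but falls behind on hard ones, a pattern an overall average would hide.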


In [None]:
render_model_comparison(all_results, console=console)

In [None]:
render_difficulty_analysis(all_results, console=console)

In [None]:
render_temperature_analysis(all_results, console=console)

# Conclusion

Our systematic evaluation of different model configurations provides a framework for making data-driven decisions about customer support agent implementation.

## Next Steps

With the evaluation results in hand, we can now:

1. **Select the optimal configuration** based on actual performance metrics rather than assumptions or theoretical capabilities.

2. **Understand trade-offs** between different models (Gemini Pro vs. Flash Lite) and temperature settings (0.2 vs. 0.7) for our specific customer support scenarios.

3. **Identify improvement areas** by focusing on the metrics where even our best configuration underperformed.

4. **Expand our evaluation dataset** to include more diverse customer queries and edge cases.

This evaluation framework allows us to continually refine our customer support agent as new models become available or as customer needs evolve. By measuring performance objectively across multiple dimensions, we can ensure we're deploying the most effective solution for our specific use case.


In [None]:
render_conclusion(all_results, console=console)