# The agent as shown on the Noveum.ai platform

![Support agent as shown on the Noveum.ai platform](support_agent.png)

# Final Agent Evaluation Demo with NovaEval

This notebook demonstrates a streamlined approach to agent evaluation using modular utility functions:

1. **Load agent trace data** from JSON datasets
2. **Map trace spans** to the AgentData format using utility functions
3. **Create and analyze** an AgentDataset
4. **Evaluate agent performance** using AgentEvaluator with a Gemini model
5. **Analyze results** and export data



# Scorers Used

**context_relevancy_scorer** - Evaluates whether the agent response is appropriate and relevant given the agent's task and role.

**role_adherence_scorer** - Scores whether the agent's tool calls and response adhere to its assigned role and task.

**task_progression_scorer** - Measures whether the agent has made meaningful progress on the assigned task.

**tool_relevancy_scorer** - Assesses how relevant and appropriate the tool call is given the available tools and the agent's context.

**tool_correctness_scorer** - Compares actual tool calls against expected tool calls to evaluate correctness of tool usage and parameters.

**parameter_correctness_scorer** - Validates whether correct parameters were passed to tool calls by analyzing the tool results.
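
To make the idea behind tool_correctness_scorer concrete, the following is a minimal sketch of comparing actual tool calls with expected ones. It is not NovaEval's implementation: the `ToolCall` dataclass, the `tool_call_match_rate` function, the example tool names, and the exact-match rule are all assumptions made for illustration, and the result is a fraction rather than a score on the scale the notebook reports.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical, simplified record of one tool invocation."""
    name: str
    parameters: dict[str, Any] = field(default_factory=dict)


def tool_call_match_rate(actual: list[ToolCall], expected: list[ToolCall]) -> float:
    """Fraction of expected calls found among the actual calls (name and parameters equal)."""
    if not expected:
        # Nothing was expected: count as correct only if nothing was called
        return 1.0 if not actual else 0.0
    remaining = list(actual)
    matched = 0
    for call in expected:
        if call in remaining:
            remaining.remove(call)  # each actual call may satisfy one expected call
            matched += 1
    return matched / len(expected)


# Illustrative data only; these tool names do not come from the traces
expected_calls = [
    ToolCall("search_kb", {"query": "refund policy"}),
    ToolCall("send_reply", {"ticket_id": 42}),
]
actual_calls = [
    ToolCall("search_kb", {"query": "refund policy"}),
    ToolCall("send_reply", {"ticket_id": 7}),  # wrong parameter value
]
rate = tool_call_match_rate(actual_calls, expected_calls)
print(rate)
```

Here one of the two expected calls is matched exactly, so `rate` is 0.5. An LLM-based scorer can be more lenient than exact matching, which is why this is only a rough mental model.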

## Step 1: Import Dependencies and Utility Functions


In [34]:
# Import our custom utility functions
from demo_utils import (
    list_dataset_files,
    load_and_analyze_dataset,
    convert_spans_to_agent_dataset,
    analyze_dataset_statistics,
    setup_gemini_model,
    setup_agent_evaluator,
    run_evaluation,
    analyze_agent_behavior_patterns,
    export_processed_dataset,
    setup_logging,
    validate_environment,
    print_demo_summary
)

print("✅ All utility functions imported successfully!")


✅ All utility functions imported successfully!


In [None]:
import os
# Placeholder value: replace "GEMINI_KEY" with a real Gemini API key before running
os.environ["GEMINI_API_KEY"] = "GEMINI_KEY"
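
Because the cell above stores a literal placeholder, a later step could start with an unusable key. The sketch below shows one way to fail early instead. The helper name `require_api_key` and its behaviour are assumptions for illustration and are not part of `demo_utils`.

```python
from typing import Mapping


def require_api_key(env: Mapping[str, str], name: str = "GEMINI_API_KEY") -> str:
    """Return the key stored under `name`, failing early if it is missing or a placeholder."""
    value = env.get(name, "").strip()
    if not value or value == "GEMINI_KEY":
        raise RuntimeError(
            f"{name} is not set to a real key; export it before starting the notebook"
        )
    return value


# Demonstration with an in-memory mapping; in the notebook one would pass os.environ
demo_env = {"GEMINI_API_KEY": "example-not-a-real-key"}
key = require_api_key(demo_env)
```

Passing the environment as a mapping keeps the check easy to test without touching the real process environment.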

In [None]:
!python preprocess_filter.py /Users/mramanindia/Documents/Work/NovaEval/noveum_customer_support_bt/traces/traces/dataset.json
!python preprocess_map.py /Users/mramanindia/Documents/Work/NovaEval/noveum_customer_support_bt/traces/traces/dataset_filtered.json
!python preprocess_split_data.py /Users/mramanindia/Documents/Work/NovaEval/noveum_customer_support_bt/traces/traces/dataset_filtered_mapped.json

Input file: /Users/mramanindia/Documents/Work/NovaEval/noveum_customer_support_bt/traces/traces/dataset_filtered_mapped.json
Output directory: split_datasets

Loading dataset from /Users/mramanindia/Documents/Work/NovaEval/noveum_customer_support_bt/traces/traces/dataset_filtered_mapped.json...
Loaded 120 objects
Found 8 unique span names
Using sanitized name: agent.query_routing -> agent.query_routing_dataset.json
  Wrote 20 objects to split_datasets/agent.query_routing_dataset.json
Using sanitized name: agent.web_search_generation -> agent.web_search_generation_dataset.json
  Wrote 8 objects to split_datasets/agent.web_search_generation_dataset.json
Using sanitized name: agent.routing_evaluation_metrics -> agent.routing_evaluation_metrics_dataset.json
  Wrote 20 objects to split_datasets/agent.routing_evaluation_metrics_dataset.json
Using sanitized name: tool-orchestator -> tool-orchestator_dataset.json
  Wrote 20 objects to split_datasets/tool-orchestator_dataset.json
Using sanitize
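
The split step shown in the output above groups spans by name and writes one file per name. A minimal sketch of that idea follows. It is not the code of `preprocess_split_data.py`: the `name` key on each span, the sanitization rule, and the function names are assumptions, chosen so that names such as `agent.query_routing` map to `agent.query_routing_dataset.json` as in the log.

```python
import json
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def sanitize(name: str) -> str:
    """Replace characters that are awkward in file names (assumed rule)."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


def split_by_span_name(spans: list[dict], out_dir: Path) -> dict[str, int]:
    """Write one `<name>_dataset.json` per span name and return the object counts."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for span in spans:
        groups[span.get("name", "unknown")].append(span)

    out_dir.mkdir(parents=True, exist_ok=True)
    counts: dict[str, int] = {}
    for name, items in groups.items():
        path = out_dir / f"{sanitize(name)}_dataset.json"
        path.write_text(json.dumps(items, indent=2), encoding="utf-8")
        counts[path.name] = len(items)
    return counts


# Illustrative spans written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_spans = [
        {"name": "agent.query_routing"},
        {"name": "agent.query_routing"},
        {"name": "tool-orchestator"},
    ]
    counts = split_by_span_name(demo_spans, Path(tmp) / "split_datasets")
    written = sorted(p.name for p in (Path(tmp) / "split_datasets").iterdir())
```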

In [None]:
# Force a fresh import of demo_utils so that the latest changes are picked up
import sys

# Drop the cached module, if any, so the import below re-executes the file
sys.modules.pop('demo_utils', None)

# Import the updated module
from demo_utils import run_complete_agent_evaluation

print("✅ Module reloaded successfully!")


In [75]:
from demo_utils import run_complete_agent_evaluation
run_complete_agent_evaluation(
    '/Users/mramanindia/Documents/Work/NovaEval/noveum_customer_support_bt/split_datasets/agent.rag_evaluation_metrics_dataset.json',
    evaluation_name="agent.rag_evaluation_metrics_dataset",
    output_dir="./demo_results",
)

🚀 Starting Complete Agent Evaluation Pipeline
📁 Processing file: /Users/mramanindia/Documents/Work/NovaEval/noveum_customer_support_bt/split_datasets/agent.rag_evaluation_metrics_dataset.json

📋 Step 1: Environment Setup
✅ Logging configured at INFO level
🔍 Environment validation:
  ✅ gemini_api_key: True
  ✅ pandas_available: True
  ✅ novaeval_available: True
✅ Environment ready for evaluation!

📋 Step 2: Loading Dataset
📊 Loaded 12 spans from /Users/mramanindia/Documents/Work/NovaEval/noveum_customer_support_bt/split_datasets/agent.rag_evaluation_metrics_dataset.json

🔍 Available span types:
  - agent.rag_evaluation_metrics: 12
✅ Dataset loaded: 12 spans

📋 Step 3: Converting to AgentDataset Format
🔄 Converting spans to AgentData objects...

✅ Successfully converted 12 spans to AgentData
📊 AgentDataset created with 12 records
✅ AgentDataset created: 12 records

📋 Step 4: Dataset Analysis
📈 Dataset Statistics:

Agent Types: {'agent': 12}
Records with responses: 12
Records with tool ca

Evaluating samples: 0it [00:00, ?it/s]

2025-10-09 21:13:31 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:32 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:33 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:34 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 1 samples
2025-10-09 21:13:34 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 1it [00:03,  3.61s/it]

2025-10-09 21:13:34 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:36 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:37 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:39 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 2 samples
2025-10-09 21:13:39 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 2it [00:08,  4.08s/it]

2025-10-09 21:13:39 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:40 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:41 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:42 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 3 samples
2025-10-09 21:13:42 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 3it [00:11,  3.61s/it]

2025-10-09 21:13:42 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:43 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:44 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:45 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 4 samples
2025-10-09 21:13:45 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 4it [00:14,  3.45s/it]

2025-10-09 21:13:45 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:46 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:47 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:48 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 5 samples
2025-10-09 21:13:48 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 5it [00:17,  3.40s/it]

2025-10-09 21:13:48 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:49 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:52 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:53 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 6 samples
2025-10-09 21:13:53 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 6it [00:21,  3.72s/it]

2025-10-09 21:13:53 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:54 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:55 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:56 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 7 samples
2025-10-09 21:13:56 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 7it [00:24,  3.48s/it]

2025-10-09 21:13:56 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:57 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:58 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:13:59 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 8 samples
2025-10-09 21:13:59 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 8it [00:28,  3.40s/it]

2025-10-09 21:13:59 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:00 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:01 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:02 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 9 samples
2025-10-09 21:14:02 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 9it [00:31,  3.31s/it]

2025-10-09 21:14:02 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:03 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:04 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:05 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 10 samples
2025-10-09 21:14:05 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 10it [00:34,  3.24s/it]

2025-10-09 21:14:05 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:06 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:07 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:08 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 11 samples
2025-10-09 21:14:08 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 11it [00:37,  3.19s/it]

2025-10-09 21:14:08 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:09 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:10 - INFO - google_genai.models - AFC is enabled with max remote calls: 10.
2025-10-09 21:14:11 - INFO - novaeval.evaluators.agent_evaluator - Saving intermediate results after 12 samples
2025-10-09 21:14:11 - INFO - novaeval.evaluators.agent_evaluator - Intermediate results saved to demo_results/agent.rag_evaluation_metrics_dataset/agent_evaluation_results.csv


Evaluating samples: 12it [00:40,  3.39s/it]

2025-10-09 21:14:11 - INFO - novaeval.evaluators.agent_evaluator - Saving final results
2025-10-09 21:14:11 - INFO - novaeval.evaluators.agent_evaluator - Reloaded 12 results from CSV
2025-10-09 21:14:11 - INFO - novaeval.evaluators.agent_evaluator - Agent evaluation completed

✅ Evaluation completed!

📊 Results Summary:
  - task_progression: 4.14
  - context_relevancy: 7.93
  - role_adherence: 9.00
  - tool_relevancy: 0.00
  - parameter_correctness: 0.00

🔍 Individual Scores:

  Record 1 (Task: f1f37bd7-0851-4659-b493-b80d3800d920):
    - task_progression: 3.8
    - context_relevancy: 7.8
    - role_adherence: 9.0
    - tool_relevancy: 0.0
    - parameter_correctness: 0.0

  Record 2 (Task: 52aacb67-c361-4445-9b72-c157f79f47d6):
    - task_progression: 2.8
    - context_relevancy: 7.8
    - role_adherence: 9.0
    - tool_relevancy: 0.0
    - parameter_correctness: 0.0

  Record 3 (Task: 2218f641-604c-491a-9710-b51a9941b982):
    - task_progression: 4.3
    - context_relevancy: 7.8
   




{'success': True,
 'file_processed': '/Users/mramanindia/Documents/Work/NovaEval/noveum_customer_support_bt/split_datasets/agent.rag_evaluation_metrics_dataset.json',
 'spans_loaded': 12,
 'dataset_created': True,
 'dataset_size': 12,
 'evaluation_completed': True,
 'results_df':     user_id                               task_id  \
 0       NaN  f1f37bd7-0851-4659-b493-b80d3800d920   
 1       NaN  52aacb67-c361-4445-9b72-c157f79f47d6   
 2       NaN  2218f641-604c-491a-9710-b51a9941b982   
 3       NaN  255fd49c-84b4-4b18-887e-6308a412d535   
 4       NaN  dc511122-c0b6-415c-9a49-c7b45132dd87   
 5       NaN  04bebf38-a343-4563-80db-0154bef8d927   
 6       NaN  5e043630-6493-42b5-beb8-79faa19bfa37   
 7       NaN  7da9814d-a2e8-4c4e-b750-68b26bd5fd22   
 8       NaN  16143f74-2831-4753-b33d-ce4b645093c5   
 9       NaN  fc64e6cc-6739-4256-ac4a-7b80c3028233   
 10      NaN  b7945c49-f584-4c70-972d-536a805d8a31   
 11      NaN  f5c40ecf-36c0-45ba-9cc9-dc0329b0324b   
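
The per-scorer numbers in the results summary can be recomputed from the saved CSV. The sketch below assumes that the summary values are plain per-scorer means and that the CSV has one numeric column per scorer; both are assumptions, as is the helper name `mean_scores`. It uses only the standard library and made-up rows.

```python
import csv
import io
from statistics import mean

# Scorer columns assumed to be present in agent_evaluation_results.csv
SCORE_COLUMNS = ["task_progression", "context_relevancy", "role_adherence"]


def mean_scores(csv_file, columns=SCORE_COLUMNS):
    """Average each scorer column over all rows of an open CSV file."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        return {}
    return {col: round(mean(float(row[col]) for row in rows), 2) for col in columns}


# Illustrative in-memory CSV (made-up rows, not the notebook's results)
demo_csv = io.StringIO(
    "task_id,task_progression,context_relevancy,role_adherence\n"
    "a,4.0,8.0,9.0\n"
    "b,2.0,7.0,9.0\n"
)
summary = mean_scores(demo_csv)
print(summary)
```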
 
                

# Analysis of poor scores in the comment generation agent

In [64]:
import pandas as pd
comment_gen = pd.read_csv("demo_results/agent_comment_gen_dataset/agent_evaluation_results.csv")

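# Number of lowest-scoring records to inspect
split_size = 3

# Lowest task_progression scores together with the scorer's reasoning
task_progression = comment_gen.sort_values(by='task_progression', ascending=True).iloc[:split_size][['task_progression', 'task_progression_reasoning']]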

print("Task Progression:")
print()
for idx, row in task_progression.iterrows():
    print(f"Score = {row['task_progression']}")
    print(f"Reasoning = {row['task_progression_reasoning']}")
    print()  # blank line

FileNotFoundError: [Errno 2] No such file or directory: 'demo_results/agent_comment_gen_dataset/agent_evaluation_results.csv'

In [65]:
# Context Relevancy Analysis
context_relevancy = comment_gen.sort_values(by='context_relevancy', ascending=True).iloc[:3][['context_relevancy', 'context_relevancy_reasoning']]

print("Context Relevancy Analysis:")
print("=" * 50)
for idx, row in context_relevancy.iterrows():
    print(f"Score = {row['context_relevancy']}")
    print(f"Reasoning = {row['context_relevancy_reasoning']}")
    print()

NameError: name 'comment_gen' is not defined

In [66]:
# Role Adherence Analysis
role_adherence = comment_gen.sort_values(by='role_adherence', ascending=True).iloc[:3][['role_adherence', 'role_adherence_reasoning']]

print("Role Adherence Analysis:")
print("=" * 50)
for idx, row in role_adherence.iterrows():
    print(f"Score = {row['role_adherence']}")
    print(f"Reasoning = {row['role_adherence_reasoning']}")
    print()

NameError: name 'comment_gen' is not defined
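
The three analysis cells above repeat one pattern, and the two `NameError`s follow from the first cell failing to load the CSV. A minimal sketch of a shared helper with an explicit file check is shown below. The names `load_results` and `lowest_scores` are hypothetical, the `<scorer>` and `<scorer>_reasoning` column names are taken from the cells above, and the demo rows are made up. It uses the standard `csv` module rather than pandas so that it runs without extra dependencies.

```python
import csv
from pathlib import Path


def load_results(csv_path: Path) -> list[dict]:
    """Read an evaluation results CSV, with a clearer error if it is missing."""
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} not found; run the evaluation for this dataset first"
        )
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def lowest_scores(rows: list[dict], scorer: str, n: int = 3) -> list[tuple[float, str]]:
    """Return the n lowest (score, reasoning) pairs for one scorer."""
    ranked = sorted(rows, key=lambda row: float(row[scorer]))
    return [(float(row[scorer]), row[f"{scorer}_reasoning"]) for row in ranked[:n]]


# Illustrative rows (made-up scores and reasoning)
demo_rows = [
    {"task_progression": "5.0", "task_progression_reasoning": "partial answer"},
    {"task_progression": "1.5", "task_progression_reasoning": "no retrieval step"},
    {"task_progression": "7.0", "task_progression_reasoning": "mostly complete"},
]
worst = lowest_scores(demo_rows, "task_progression", n=2)
for score, reasoning in worst:
    print(f"Score = {score}")
    print(f"Reasoning = {reasoning}")
```

With a helper like this, each scorer needs only one call with a different `scorer` argument, and a missing results file is reported once with a hint about the cause.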

In [67]:
from novapilot_utils import recommend_improvements

# Advanced usage with custom parameters
final_analysis, summaries, log_file = recommend_improvements(
    demo_results_dir="demo_results/",
    agent_doc_path="reddit_agent.md",
    log_dir="log",
    verbose=True
)

NOVAPILOT AGENT ANALYSIS - RECOMMEND IMPROVEMENTS
This function runs the complete analysis pipeline equivalent to
running the entire complete_analysis_demo.ipynb notebook.
Setup complete! Log file: log/analysis_log_20251009_191149.txt
Found 5 dataset directories to process:
  - agent.rag_evaluation_metrics_dataset
  - agent_comment_gen_dataset
  - agent.query_routing_dataset
  - agent.llm-rag_dataset
  - agent.web_search_generation_dataset

Processing agent.rag_evaluation_metrics_dataset...
  No CSV files found in agent.rag_evaluation_metrics_dataset, skipping...

Processing agent_comment_gen_dataset...
  No CSV files found in agent_comment_gen_dataset, skipping...

Processing agent.query_routing_dataset...
  No CSV files found in agent.query_routing_dataset, skipping...

Processing agent.llm-rag_dataset...
  No CSV files found in agent.llm-rag_dataset, skipping...

Processing agent.web_search_generation_dataset...
  No CSV files found in agent.web_search_generation_dataset, skipping...

ValueError: No agent documentation provided. Load it first or pass as parameter.
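
The run above skips every dataset directory for lack of CSV files and then stops because no agent documentation was available. A small pre-flight check can surface both conditions before the analysis is started. The sketch below is an assumption about what is worth checking, not part of `novapilot_utils`; the function name `preflight` and the demo directory names are hypothetical, and the real cause of the `ValueError` may differ.

```python
import tempfile
from pathlib import Path


def preflight(results_dir: Path, agent_doc: Path) -> list[str]:
    """Collect human-readable problems that would leave the analysis empty or failing."""
    problems = []
    if not agent_doc.is_file():
        problems.append(f"agent documentation not found: {agent_doc}")
    dataset_dirs = (
        sorted(d for d in results_dir.iterdir() if d.is_dir())
        if results_dir.is_dir()
        else []
    )
    if not dataset_dirs:
        problems.append(f"no dataset directories under {results_dir}")
    for dataset_dir in dataset_dirs:
        if not any(dataset_dir.glob("*.csv")):
            problems.append(f"no CSV files in {dataset_dir.name}")
    return problems


# Illustrative layout in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "demo_results"
    (results / "a_dataset").mkdir(parents=True)
    (results / "a_dataset" / "agent_evaluation_results.csv").write_text("task_id\n")
    (results / "b_dataset").mkdir()
    problems = preflight(results, Path(tmp) / "agent.md")

for problem in problems:
    print(problem)
```

In the notebook, one would call the check with the same `demo_results_dir` and `agent_doc_path` values passed to `recommend_improvements` and only proceed when the returned list is empty.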

In [None]:
print(final_analysis)