# Milestone 7: Automated Evaluation Pipeline

## LLM-as-a-Judge Pattern for Agent Quality Assessment

This notebook implements an automated evaluation pipeline for the FinGuard IntelliAgent using the **LLM-as-a-Judge** pattern.

### ADK Concepts Demonstrated:

1. **Golden Dataset** (Prototype to Production p.12):
   - Curated test cases with expected outputs
   - Covers all tools and edge cases

2. **LLM-as-a-Judge** (Intro to Agents p.29):
   - Uses Gemini to grade probabilistic outputs
   - Structured evaluation criteria

3. **Behavioral Evaluation** (Prototype to Production p.12):
   - Assesses the Trajectory (tool selection)
   - Not just the final answer

4. **Key Metrics**:
   - Tool Selection Accuracy
   - Goal Completion Rate
   - Idempotency Compliance

---

**Author**: Alfred Munga  
**Date**: November 18, 2025  
**Project**: FinGuard IntelliAgent ADK Capstone

## 1. Setup and Imports

In [None]:
import os
import sys
import json
import time
from datetime import datetime
from typing import List, Dict, Any

# Add parent directory to path
sys.path.insert(0, os.path.abspath('..'))

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Import agent components
from agent.orchestrator import FinGuardIntelliAgent
from agent.evaluator import AgentEvaluator, EvaluationResult
from backend.utils.logger import AgentLogger

print("✅ All imports successful")
print(f"📅 Evaluation Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 2. Load Golden Dataset

The golden dataset contains **10 test cases** covering:
- SMS parsing (3 cases)
- Invoice retrieval (2 cases)
- Payment actions (3 cases)
- Financial insights (2 cases)

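Each test case includes:
- `test_id`: Unique identifier for the case
- `category`: Capability group used in the breakdown below
- `query`: User's input
- `expected_tool`: Tool that should be called
- `criteria`: Success criteria for evaluation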
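
A malformed entry would otherwise surface only as a `KeyError` partway through a slow evaluation run, so it can help to validate the dataset first. The following is a minimal sketch, assuming each test case is a flat dictionary. The field names are the ones this notebook reads, while `find_invalid_cases`, `REQUIRED_FIELDS`, and the example entries are hypothetical and not taken from `golden_dataset.json`.

```python
from typing import Any, Dict, List

# Fields this notebook reads from each test case (assumed to be mandatory).
REQUIRED_FIELDS = ("test_id", "category", "query", "expected_tool", "criteria")


def find_invalid_cases(dataset: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each problematic test_id (or positional index) to its missing fields."""
    problems: Dict[str, List[str]] = {}
    for index, case in enumerate(dataset):
        missing = [name for name in REQUIRED_FIELDS if name not in case]
        if missing:
            problems[str(case.get("test_id", f"index_{index}"))] = missing
    return problems


# Hypothetical entries for illustration only.
example_dataset = [
    {
        "test_id": "sms_001",
        "category": "sms_parsing",
        "query": "Parse this payment confirmation SMS.",
        "expected_tool": "parse_sms",
        "criteria": "Extracts the amount and the sender.",
    },
    # Deliberately incomplete: no expected_tool and no criteria.
    {"test_id": "invoice_001", "category": "invoice_retrieval", "query": "Show unpaid invoices."},
]

print(find_invalid_cases(example_dataset))
```

An empty result means every entry carries the fields the later cells depend on.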

In [None]:
# Load golden dataset
dataset_path = '../data/evaluation/golden_dataset.json'
with open(dataset_path, 'r') as f:
    golden_dataset = json.load(f)

print(f"📊 Loaded {len(golden_dataset)} test cases\n")

# Display test case summary
categories = {}
for test in golden_dataset:
    cat = test['category']
    categories[cat] = categories.get(cat, 0) + 1

print("Test Case Breakdown:")
for cat, count in categories.items():
    print(f"  - {cat}: {count} cases")

# Show sample test case
print("\n📋 Sample Test Case:")
sample = golden_dataset[0]
print(json.dumps(sample, indent=2))

## 3. Initialize Agent and Evaluator

We initialize:
1. **FinGuardIntelliAgent**: The agent under test
2. **AgentEvaluator**: The LLM judge

In [None]:
# Initialize agent
api_key = os.getenv('GEMINI_API_KEY')
agent = FinGuardIntelliAgent(api_key=api_key)
print("✅ FinGuardIntelliAgent initialized")

# Initialize evaluator
evaluator = AgentEvaluator(api_key=api_key)
print("✅ AgentEvaluator (Judge) initialized")

print("\n🚀 Ready to run evaluation pipeline")

## 4. Run Batch Evaluation

**Evaluation Process**:
1. For each test case:
   - Run the agent with the query
   - Capture execution traces
   - Record tools called
2. Pass results to LLM Judge
3. Calculate aggregate metrics
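
The "record tools called" step can be sketched in isolation as follows. This assumes trace logs are dictionaries with `step_type` and `tool_name` keys, as the cell below reads them. The helper names, the non-`ACT` step types, and the tool name in the example trace are hypothetical.

```python
from typing import Any, Dict, List


def extract_tools_called(trace_logs: List[Dict[str, Any]]) -> List[str]:
    """Return tool names from ACT steps, in the order they were executed."""
    return [
        log["tool_name"]
        for log in trace_logs
        if log.get("step_type") == "ACT" and log.get("tool_name")
    ]


def expected_tool_was_called(tools_called: List[str], expected_tool: str) -> bool:
    """Simplest trajectory check: the expected tool appears somewhere in the run."""
    return expected_tool in tools_called


# Hypothetical trace for illustration only.
example_trace = [
    {"step_type": "THINK", "content": "The user wants unpaid invoices."},
    {"step_type": "ACT", "tool_name": "get_invoices"},
    {"step_type": "OBSERVE", "content": "2 invoices found"},
]

tools = extract_tools_called(example_trace)
print(tools, expected_tool_was_called(tools, "get_invoices"))
```

A membership test is the loosest form of trajectory evaluation. A stricter variant could also require that no unexpected tool was called, or compare the full ordered sequence.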

⚠️ **Note**: This may take 5-10 minutes due to API rate limits (10 req/min for free tier).
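
One way to reason about pacing under such a quota is sketched below. The 10 requests per minute figure comes from the note above. `seconds_between_calls`, `call_with_backoff`, the safety factor, and the doubling schedule are assumptions for illustration. Unlike the loop in the next cell, which records a failure and moves on, this sketch retries the failed call.

```python
import time
from typing import Callable


def seconds_between_calls(requests_per_minute: int, safety_factor: float = 1.5) -> float:
    """Delay that keeps a sequential loop under the quota, with some headroom."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return 60.0 / requests_per_minute * safety_factor


def call_with_backoff(
    func: Callable[[], object],
    max_attempts: int = 3,
    base_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Retry func on rate-limit errors, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            message = str(exc).lower()
            is_rate_limit = "429" in message or "quota" in message
            # Re-raise anything that is not a rate limit, and the final failure.
            if not is_rate_limit or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    return None  # only reached when max_attempts is 0


print(seconds_between_calls(10))
```

The `sleep` parameter exists so the waiting behaviour can be checked without actually pausing. If a single agent run issues several model calls, the per-run delay would need to be scaled accordingly.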

In [None]:
# Storage for agent execution results
agent_results = []

print("🔄 Running agent on all test cases...\n")
print("="*80)

for i, test_case in enumerate(golden_dataset, 1):
    test_id = test_case['test_id']
    query = test_case['query']
    
    print(f"\n[{i}/{len(golden_dataset)}] Test Case: {test_id}")
    print(f"Query: {query[:60]}..." if len(query) > 60 else f"Query: {query}")
    print(f"Expected Tool: {test_case['expected_tool']}")
    
    try:
        start_time = time.time()
        
        # Run agent
        result = agent.run(
            user_query=query,
            user_id="eval_tester"
        )
        
        execution_time = (time.time() - start_time) * 1000  # Convert to ms
        
        # Extract tools called from trace logs
        trace_logger = result.get('trace_logger')
        trace_logs = trace_logger.logs if trace_logger else []
        
        tools_called = [
            log.get('tool_name') 
            for log in trace_logs 
            if log.get('step_type') == 'ACT' and log.get('tool_name')
        ]
        
        agent_results.append({
            'response': result.get('response', 'No response'),
            'tools_called': tools_called,
            'trace_logs': trace_logs,
            'execution_time_ms': execution_time
        })
        
        print(f"✅ Tools Called: {', '.join(tools_called) if tools_called else 'None'}")
        print(f"⏱️  Execution Time: {execution_time:.0f}ms")
        
        # Rate limiting: Wait between calls to avoid hitting API limits
        if i < len(golden_dataset):
            print("⏳ Waiting 10s to avoid rate limits...")
            time.sleep(10)
    
    except Exception as e:
        print(f"❌ Error: {str(e)[:100]}")
        agent_results.append({
            'response': f"Error: {str(e)}",
            'tools_called': [],
            'trace_logs': [],
            'execution_time_ms': 0
        })
        
        # Wait longer after errors
        if "429" in str(e) or "quota" in str(e).lower():
            print("⚠️ Rate limit hit. Waiting 60s...")
            time.sleep(60)

print("\n" + "="*80)
print(f"\n✅ Agent execution complete: {len(agent_results)}/{len(golden_dataset)} test cases")

## 5. LLM-as-a-Judge Evaluation

Now we pass all agent outputs to the **Judge** (Gemini) for evaluation.

The judge grades based on:
- **Tool Selection**: Did it call the right tool?
- **Goal Achievement**: Was the task completed?
- **Idempotency**: Were safety checks performed?
- **Response Quality**: Is the answer correct and professional?
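
The internals of `AgentEvaluator` are not shown in this notebook, so the following is only a sketch of how a judge step of this kind is commonly structured: build a grading prompt, ask the model for JSON, and parse the reply defensively. The verdict field names mirror the `judge_evaluation` attributes used in later cells. `JudgeVerdict`, `build_judge_prompt`, `parse_judge_reply`, and the prompt wording are hypothetical, and no model call is made here.

```python
import json
from dataclasses import dataclass, field
from typing import List

FENCE = "`" * 3  # Markdown code fence that models often wrap JSON in


@dataclass
class JudgeVerdict:
    score: float
    tool_usage_correct: bool
    goal_achieved: bool
    idempotency_respected: bool
    reasoning: str = ""
    issues: List[str] = field(default_factory=list)


def build_judge_prompt(query: str, expected_tool: str, criteria: str,
                       tools_called: List[str], response: str) -> str:
    """Assemble a grading prompt that asks the judge for a JSON-only verdict."""
    return (
        "You are grading a financial assistant agent.\n"
        f"User query: {query}\n"
        f"Expected tool: {expected_tool}\n"
        f"Success criteria: {criteria}\n"
        f"Tools called: {', '.join(tools_called) or 'None'}\n"
        f"Agent response: {response}\n"
        "Reply with JSON only, using the keys score (0.0 to 1.0), "
        "tool_usage_correct, goal_achieved, idempotency_respected, "
        "reasoning, and issues (a list of strings)."
    )


def parse_judge_reply(raw_reply: str) -> JudgeVerdict:
    """Parse the judge's JSON, tolerating a surrounding Markdown code fence."""
    text = raw_reply.strip()
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[len("json"):]
    data = json.loads(text)
    return JudgeVerdict(
        score=min(1.0, max(0.0, float(data["score"]))),  # clamp out-of-range scores
        # Only a JSON `true` counts, so a missing or malformed flag fails closed.
        tool_usage_correct=data.get("tool_usage_correct") is True,
        goal_achieved=data.get("goal_achieved") is True,
        idempotency_respected=data.get("idempotency_respected") is True,
        reasoning=str(data.get("reasoning", "")),
        issues=[str(item) for item in data.get("issues", [])],
    )
```

Failing closed on the boolean flags is a deliberate choice in this sketch: a judge reply that omits `idempotency_respected` is treated as a violation, not silently passed.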

In [None]:
print("⚖️ Starting LLM-as-a-Judge evaluation...\n")
print("="*80)

# Batch evaluate all results
evaluations = evaluator.batch_evaluate(
    test_cases=golden_dataset,
    agent_results=agent_results
)

print("\n" + "="*80)
print(f"\n✅ Evaluation complete: {len(evaluations)} test cases graded")

## 6. Display Results

Let's examine the evaluation results in detail.

In [None]:
import pandas as pd

# Convert to DataFrame for easy viewing
results_df = pd.DataFrame([
    {
        'Test ID': eval_result.test_id,
        'Category': eval_result.category,
        'Difficulty': eval_result.difficulty,
        'Expected Tool': eval_result.expected_tool,
        'Tools Called': ', '.join(eval_result.tools_called) or 'None',
        'Score': eval_result.judge_evaluation.score,
        'Tool Correct': '✅' if eval_result.judge_evaluation.tool_usage_correct else '❌',
        'Goal Achieved': '✅' if eval_result.judge_evaluation.goal_achieved else '❌',
        'Idempotency': '✅' if eval_result.judge_evaluation.idempotency_respected else '❌',
        'Exec Time (ms)': f"{eval_result.execution_time_ms:.0f}"
    }
    for eval_result in evaluations
])

print("📊 Evaluation Results Summary:\n")
print(results_df.to_string(index=False))

# Display detailed results for failed tests
print("\n" + "="*80)
print("\n🔍 Detailed Analysis of Failed Tests (Score < 0.7):\n")

failed_tests = [e for e in evaluations if e.judge_evaluation.score < 0.7]
if failed_tests:
    for eval_result in failed_tests:
        print(f"Test ID: {eval_result.test_id}")
        print(f"Query: {eval_result.query}")
        print(f"Score: {eval_result.judge_evaluation.score:.2f}")
        print(f"Reasoning: {eval_result.judge_evaluation.reasoning}")
        if eval_result.judge_evaluation.issues:
            print(f"Issues:")
            for issue in eval_result.judge_evaluation.issues:
                print(f"  - {issue}")
        print("\n" + "-"*80 + "\n")
else:
    print("✅ No failed tests! All scores >= 0.7")

## 7. Calculate Aggregate Metrics

Key metrics as defined in the ADK:
- **Tool Selection Accuracy**: % of tests in which the correct tool was called (trajectory evaluation)
- **Goal Completion Rate**: % of tests that achieved their goal
- **Idempotency Compliance**: % of tests in which safety checks were performed
- **Pass Rate**: % of tests with score >= 0.7
- **Average Score**: Mean judge score across all tests, on a 0-1 scale
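
`AgentEvaluator.calculate_metrics` is defined outside this notebook, so the sketch below is a hypothetical reimplementation of the arithmetic behind these metrics, not the project's code. It assumes each evaluation has been flattened into a dictionary. The output keys match the ones the next cell reads, and the example records are made up.

```python
from collections import defaultdict
from typing import Any, Dict, List

PASS_THRESHOLD = 0.7  # same cut-off the notebook uses for a passing score


def summarize(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate per-test verdicts into the rates described above."""
    total = len(records)
    if total == 0:
        raise ValueError("no evaluation records to summarize")

    def rate(flag: str) -> float:
        return sum(1 for r in records if r[flag]) / total

    by_category: Dict[str, List[float]] = defaultdict(list)
    for r in records:
        by_category[r["category"]].append(r["score"])

    return {
        "total_tests": total,
        "tool_selection_accuracy": rate("tool_usage_correct"),
        "goal_completion_rate": rate("goal_achieved"),
        "idempotency_compliance": rate("idempotency_respected"),
        "pass_rate": sum(1 for r in records if r["score"] >= PASS_THRESHOLD) / total,
        "average_score": sum(r["score"] for r in records) / total,
        "category_breakdown": {
            cat: {
                "count": len(scores),
                "pass_rate": sum(1 for s in scores if s >= PASS_THRESHOLD) / len(scores),
                "avg_score": sum(scores) / len(scores),
            }
            for cat, scores in by_category.items()
        },
    }


# Hypothetical records for illustration only.
example_records = [
    {"category": "sms_parsing", "score": 1.0, "tool_usage_correct": True,
     "goal_achieved": True, "idempotency_respected": True},
    {"category": "sms_parsing", "score": 0.5, "tool_usage_correct": True,
     "goal_achieved": False, "idempotency_respected": True},
    {"category": "payment_actions", "score": 0.75, "tool_usage_correct": True,
     "goal_achieved": True, "idempotency_respected": False},
    {"category": "payment_actions", "score": 0.25, "tool_usage_correct": False,
     "goal_achieved": False, "idempotency_respected": True},
]

summary = summarize(example_records)
print(summary["pass_rate"], summary["average_score"])
```

Note that the pass rate and the average score can diverge: two of the four example records pass, yet a single very low score pulls the mean below the threshold.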

In [None]:
# Calculate metrics
metrics = AgentEvaluator.calculate_metrics(evaluations)

print("📊 AGGREGATE METRICS")
print("="*80)
print(f"\nTotal Test Cases: {metrics['total_tests']}")
# round() rather than int(): a product such as 0.29 * 100 evaluates to
# 28.999999999999996, which int() would truncate to 28.
print(f"\n🎯 Tool Selection Accuracy: {metrics['tool_selection_accuracy']:.1%}")
print(f"   → Correct tool called in {round(metrics['tool_selection_accuracy'] * metrics['total_tests'])}/{metrics['total_tests']} cases")

print(f"\n✅ Goal Completion Rate: {metrics['goal_completion_rate']:.1%}")
print(f"   → Task completed successfully in {round(metrics['goal_completion_rate'] * metrics['total_tests'])}/{metrics['total_tests']} cases")

print(f"\n📈 Pass Rate (Score >= 0.7): {metrics['pass_rate']:.1%}")
print(f"   → {round(metrics['pass_rate'] * metrics['total_tests'])}/{metrics['total_tests']} tests passed")

print(f"\n⭐ Average Score: {metrics['average_score']:.2f}/1.00")

print(f"\n🛡️ Idempotency Compliance: {metrics['idempotency_compliance']:.1%}")
print(f"   → Safety checks performed in {round(metrics['idempotency_compliance'] * metrics['total_tests'])}/{metrics['total_tests']} cases")

print("\n" + "="*80)
print("\n📂 Category Breakdown:\n")

for category, data in metrics['category_breakdown'].items():
    print(f"{category}:")
    print(f"  - Tests: {data['count']}")
    print(f"  - Pass Rate: {data['pass_rate']:.1%}")
    print(f"  - Avg Score: {data['avg_score']:.2f}")
    print()

## 8. Save Results to CSV

Save results for tracking improvements over time.
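
A per-run CSV is overwritten on each execution, so tracking over time also needs some form of history. One possible approach, sketched here with the standard library only, is to append a single summary row per run to a separate file. `append_run_summary`, the `history.csv` name, the column choice, and the metric values are assumptions for illustration and are not part of `AgentEvaluator`.

```python
import csv
import os
import tempfile
from datetime import datetime
from typing import Any, Dict

HISTORY_FIELDS = ["run_at", "pass_rate", "average_score", "tool_selection_accuracy"]


def append_run_summary(history_path: str, metrics: Dict[str, Any], run_at: datetime) -> None:
    """Append one row per evaluation run, writing the header on first use."""
    is_new_file = not os.path.exists(history_path) or os.path.getsize(history_path) == 0
    with open(history_path, "a", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops metrics keys that are not history columns.
        writer = csv.DictWriter(f, fieldnames=HISTORY_FIELDS, extrasaction="ignore")
        if is_new_file:
            writer.writeheader()
        writer.writerow({"run_at": run_at.isoformat(timespec="seconds"), **metrics})


# Usage with a throwaway directory and made-up metric values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "history.csv")
    append_run_summary(
        demo_path,
        {"pass_rate": 0.8, "average_score": 0.75, "tool_selection_accuracy": 0.9, "total_tests": 10},
        datetime(2025, 11, 18, 9, 0, 0),
    )
    append_run_summary(
        demo_path,
        {"pass_rate": 0.9, "average_score": 0.8, "tool_selection_accuracy": 0.9, "total_tests": 10},
        datetime(2025, 11, 19, 9, 0, 0),
    )
    with open(demo_path, newline="", encoding="utf-8") as f:
        history_rows = list(csv.DictReader(f))

print(len(history_rows), history_rows[-1]["pass_rate"])
```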

In [None]:
# Save results
output_path = '../data/evaluation/results.csv'
AgentEvaluator.save_results(evaluations, output_path)

print(f"✅ Results saved to: {output_path}")
print(f"\n📁 File size: {os.path.getsize(output_path)} bytes")

# Display first few rows
import pandas as pd
results_csv = pd.read_csv(output_path)
print(f"\n📊 CSV Preview (first 3 rows):\n")
print(results_csv.head(3).to_string(index=False))

## 9. Visualizations

Create visualizations for better understanding of evaluation results.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Set style
plt.style.use('seaborn-v0_8-darkgrid')

# Create figure with subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('FinGuard IntelliAgent - Evaluation Results', fontsize=16, fontweight='bold')

# 1. Score Distribution
scores = [e.judge_evaluation.score for e in evaluations]
axes[0, 0].hist(scores, bins=10, color='skyblue', edgecolor='black', alpha=0.7)
axes[0, 0].axvline(0.7, color='red', linestyle='--', label='Pass Threshold (0.7)')
axes[0, 0].set_xlabel('Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Score Distribution')
axes[0, 0].legend()

# 2. Category Performance
categories = list(metrics['category_breakdown'].keys())
category_scores = [metrics['category_breakdown'][cat]['avg_score'] for cat in categories]
axes[0, 1].bar(range(len(categories)), category_scores, color='lightgreen', edgecolor='black')
axes[0, 1].set_xticks(range(len(categories)))
axes[0, 1].set_xticklabels([cat.replace('_', '\n') for cat in categories], rotation=0, ha='center', fontsize=8)
axes[0, 1].axhline(0.7, color='red', linestyle='--', label='Pass Threshold')
axes[0, 1].set_ylabel('Average Score')
axes[0, 1].set_title('Performance by Category')
axes[0, 1].set_ylim(0, 1.0)
axes[0, 1].legend()

# 3. Key Metrics Comparison
metric_names = ['Tool\nSelection', 'Goal\nCompletion', 'Pass\nRate', 'Idempotency']
metric_values = [
    metrics['tool_selection_accuracy'],
    metrics['goal_completion_rate'],
    metrics['pass_rate'],
    metrics['idempotency_compliance']
]
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A']
axes[1, 0].bar(metric_names, metric_values, color=colors, edgecolor='black')
axes[1, 0].set_ylabel('Rate')
axes[1, 0].set_title('Key Metrics')
axes[1, 0].set_ylim(0, 1.0)
for i, v in enumerate(metric_values):
    axes[1, 0].text(i, v + 0.03, f'{v:.1%}', ha='center', fontweight='bold')

# 4. Pass/Fail by Difficulty
difficulties = ['easy', 'medium', 'hard']
diff_data = {diff: {'pass': 0, 'fail': 0} for diff in difficulties}
for e in evaluations:
    diff = e.difficulty
    if e.judge_evaluation.score >= 0.7:
        diff_data[diff]['pass'] += 1
    else:
        diff_data[diff]['fail'] += 1

pass_counts = [diff_data[d]['pass'] for d in difficulties]
fail_counts = [diff_data[d]['fail'] for d in difficulties]

x = np.arange(len(difficulties))
width = 0.35
axes[1, 1].bar(x - width/2, pass_counts, width, label='Pass', color='lightgreen', edgecolor='black')
axes[1, 1].bar(x + width/2, fail_counts, width, label='Fail', color='lightcoral', edgecolor='black')
axes[1, 1].set_xlabel('Difficulty')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Pass/Fail by Difficulty')
axes[1, 1].set_xticks(x)
axes[1, 1].set_xticklabels([d.capitalize() for d in difficulties])
axes[1, 1].legend()

plt.tight_layout()
plt.savefig('../data/evaluation/evaluation_results.png', dpi=150, bbox_inches='tight')
print("✅ Visualization saved to: ../data/evaluation/evaluation_results.png")
plt.show()

## 10. Conclusion

### Summary

This evaluation pipeline demonstrates:

1. **Golden Dataset**: 10 curated test cases covering all agent capabilities
2. **LLM-as-a-Judge**: Gemini evaluates responses based on structured criteria
3. **Behavioral Evaluation**: Trajectory analysis (tool selection) alongside goal completion
4. **Tracking**: Results saved to CSV for longitudinal monitoring

### Key Takeaways

- **Tool Selection Accuracy** shows how well the agent understands intent
- **Goal Completion Rate** measures task success
- **Idempotency Compliance** ensures safety in production
- **Category Breakdown** identifies strengths and weaknesses

### Production Recommendations

1. **Automated CI/CD**: Run this evaluation on every deployment
2. **Threshold Enforcement**: Require 80%+ pass rate before production
3. **Continuous Monitoring**: Track metrics over time
4. **Expand Dataset**: Add more edge cases as issues are discovered
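
Threshold enforcement in CI usually comes down to turning the metrics dictionary into a process exit code. A minimal sketch follows, assuming metrics shaped like the `metrics` dictionary used earlier in this notebook. The 0.80 pass-rate minimum mirrors recommendation 2, while the tool-selection minimum, `gate_failures`, and `gate_exit_code` are hypothetical.

```python
from typing import Dict, List

# Assumed thresholds: 0.80 for pass_rate follows the recommendation above;
# the tool_selection_accuracy value is purely illustrative.
MINIMUMS: Dict[str, float] = {"pass_rate": 0.80, "tool_selection_accuracy": 0.80}


def gate_failures(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """List every metric that is missing or below its required minimum."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            shown = "missing" if value is None else value
            failures.append(f"{name}: {shown} < {minimum}")
    return failures


def gate_exit_code(metrics: Dict[str, float], minimums: Dict[str, float]) -> int:
    """Return 0 when all minimums are met and 1 otherwise, printing each failure."""
    failures = gate_failures(metrics, minimums)
    for failure in failures:
        print(f"GATE FAILED {failure}")
    return 1 if failures else 0


print(gate_exit_code({"pass_rate": 0.7, "tool_selection_accuracy": 0.85}, MINIMUMS))
```

In a pipeline, the returned value would typically be passed to `sys.exit` so that a failing gate blocks the deployment step.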

---

**Milestone 7 Complete** ✅

References:
- Intro to Agents p.29: "Goal Completion Rate"
- Prototype to Production p.12: "Golden Dataset & Trajectory Evaluation"