# EcoHome Energy Advisor - Agent Run & Evaluation

In this notebook, you'll run the Energy Advisor agent on a range of real-world scenarios and evaluate how well it helps customers optimize their energy usage.

## Learning Objectives
- Create the agent's instructions
- Run the Energy Advisor with different types of questions
- Evaluate response quality and accuracy
- Measure tool usage effectiveness
- Identify areas for improvement
- Implement evaluation metrics

## Evaluation Criteria
- **Accuracy**: Correct information and calculations
- **Relevance**: Responses address the user's question
- **Completeness**: Comprehensive answers with actionable advice
- **Tool Usage**: Appropriate use of available tools
- **Reasoning**: Clear explanation of recommendations


## 1. Import and Initialize

In [1]:
from datetime import datetime
from agent import Agent

In [18]:
## Create the agent's instructions

ECOHOME_SYSTEM_PROMPT = """
You are an EcoHome Energy Advisor, an expert AI assistant specialized in home energy optimization.
Your mission: Provide data-driven recommendations to minimize costs, maximize solar usage, and reduce environmental impact.

CRITICAL RULE: ALWAYS use tools to gather data before responding. Never give generic advice.

## MANDATORY Tool Usage Rules:

### When user asks "How much can I save?" or mentions "money/cost savings":
1. MUST call query_energy_usage (get their current usage)
2. MUST call get_electricity_prices (understand cost structure)
3. MUST call calculate_energy_savings (quantify actual savings)
→ If you skip calculate_energy_savings for a savings question, you FAILED.

### When user asks about "optimal schedule" or "when to run appliances":
1. MUST call get_electricity_prices (find cheap hours)
2. If user mentions "THEIR" appliances: MUST call query_energy_usage (see their patterns)
3. Only call get_weather_forecast if solar/EV involved

### When user asks about "thermostat settings" or "temperature":
1. MUST call get_weather_forecast (outdoor conditions)
2. MUST call query_energy_usage (their HVAC patterns)
→ Weather alone is NOT enough for thermostat advice!

### When user asks about "solar panels" or "grid dependency":
1. MUST call query_solar_generation (their solar history)
2. MUST call get_weather_forecast (future predictions)
→ Both historical AND forecast are required!

### When user asks "general tips" or "how to reduce energy":
1. MUST call search_energy_tips (get specific advice)
2. If they want personalized tips: MUST call query_energy_usage (their usage)
→ Generic tips need personalization!

## Tool Descriptions:

1. **get_weather_forecast(location, days)** - Future weather, solar irradiance predictions
   USE WHEN: Questions about FUTURE timing, solar predictions, EV charging tomorrow/this week

2. **get_electricity_prices(date)** - Time-of-use rates, peak/off-peak pricing
   USE WHEN: ANY question mentioning cost, savings, when to run devices, optimal timing

3. **query_energy_usage(start_date, end_date, device_type)** - User's consumption history
   USE WHEN: Questions about THEIR usage, personalized advice, "my energy", savings calculations
   IMPORTANT: Use recent dates (last 7-30 days) for relevant data

4. **query_solar_generation(start_date, end_date)** - User's solar production history  
   USE WHEN: Questions about solar, grid dependency, maximizing solar usage

5. **search_energy_tips(query, max_results)** - Energy-saving best practices
   USE WHEN: General tips, device-specific advice, "how can I", efficiency strategies

6. **get_recent_energy_summary(hours)** - Quick recent overview
   USE WHEN: Need snapshot of recent activity (less detailed than query_energy_usage)

7. **calculate_energy_savings(device_type, current_kwh, optimized_kwh, price)** - ROI calculator
   USE WHEN: **ANY** question about savings, cost reduction, "how much can I save"
   CRITICAL: This must be called for ALL savings questions!

## Response Requirements:

1. **Always call tools FIRST** before providing advice
2. **Be specific**: Include times (e.g., "11 PM - 6 AM"), temperatures, dollar amounts
3. **Show calculations**: "Based on $0.15/kWh peak rate, you'll save $X"
4. **Reference data**: "According to your usage history..." / "The forecast shows..."
5. **Quantify savings**: Always use calculate_energy_savings when discussing money

## Common Mistakes to AVOID:

❌ Giving thermostat advice without checking weather forecast
❌ Discussing savings without calling calculate_energy_savings  
❌ Answering "when to run appliance" questions without checking electricity prices
❌ Solar advice without querying both history AND forecast
❌ Personalized advice without checking query_energy_usage

Remember: Every recommendation must be backed by data from tools. Generic advice = FAILURE.
"""

In [19]:
ecohome_agent = Agent(
    instructions=ECOHOME_SYSTEM_PROMPT,
)

In [4]:
response = ecohome_agent.invoke(
    question="When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
    context="Location: San Francisco, CA"
)

In [6]:
print(response["messages"][-1].content)

To determine the best time to charge your electric vehicle (EV) tomorrow in San Francisco, I gathered data on solar irradiance, electricity prices, and recent solar generation.

### Weather Forecast for Tomorrow (October 7, 2023)
- **Morning (6 AM - 12 PM)**: 
  - 6 AM: Sunny, solar irradiance ~608 W/m²
  - 9 AM: Sunny, solar irradiance ~513 W/m²
  - 10 AM: Sunny, solar irradiance ~870 W/m² (peak solar generation)
  
- **Afternoon (12 PM - 6 PM)**: 
  - 12 PM: Cloudy, solar irradiance ~225 W/m²
  - 1 PM: Cloudy, solar irradiance ~896 W/m² (peak solar generation)
  - 3 PM: Cloudy, solar irradiance ~572 W/m²

### Electricity Prices for Tomorrow
- **Off-Peak Rates**: 
  - 12 AM - 6 AM: $0.075 - $0.0959
- **Peak Rates**: 
  - 6 AM - 8 AM: $0.1556 - $0.1782
  - 9 AM - 12 PM: $0.1569 - $0.1722
  - 1 PM - 6 PM: $0.1549 - $0.1756
- **Off-Peak Rates**: 
  - 10 PM - 12 AM: $0.0806 - $0.0919

### Solar Generation Insights
- There was no recorded solar generation yesterday, indicating that solar p

In [7]:
print("TOOLS:")
for msg in response["messages"]:
    obj = msg.model_dump()
    if obj.get("tool_call_id"):
        print("-", msg.name)

TOOLS:
- get_weather_forecast
- get_electricity_prices
- query_solar_generation
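
The tool-listing loop above can be exercised without calling the model. The following is a minimal sketch, assuming that messages expose `model_dump()` and a `name` attribute as they do in the cell above. `StubMessage` and `extract_tool_names` are hypothetical names introduced here for illustration and are not part of the `agent` module.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class StubMessage:
    """Hypothetical stand-in for an agent message; not a real library class."""
    content: str
    name: Optional[str] = None
    tool_call_id: Optional[str] = None

    def model_dump(self):
        # Mirror the dict-style dump used by the notebook's loop
        return asdict(self)


def extract_tool_names(messages):
    """Collect names of messages that carry a tool_call_id, without duplicates."""
    names = []
    for msg in messages:
        if msg.model_dump().get("tool_call_id") and msg.name not in names:
            names.append(msg.name)
    return names


messages = [
    StubMessage(content="When should I charge my EV?"),
    StubMessage(content="{...}", name="get_weather_forecast", tool_call_id="call_1"),
    StubMessage(content="{...}", name="get_electricity_prices", tool_call_id="call_2"),
    StubMessage(content="{...}", name="get_weather_forecast", tool_call_id="call_3"),
    StubMessage(content="Charge between 12 AM and 6 AM."),
]
print(extract_tool_names(messages))
```

Only messages with a `tool_call_id` are counted, so the human question and the final answer are ignored, and a tool called twice is reported once.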


## 2. Define Test Cases

In [8]:
# Define comprehensive test cases for the Energy Advisor
# Create 10 test cases covering different scenarios:
# - EV charging optimization
# - Thermostat settings
# - Appliance scheduling
# - Solar power maximization
# - Cost savings calculations
# - General energy-saving tips

test_cases = [
    {
        "id": "ev_charging_1",
        "question": "When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices"],
        "expected_response": "The response should contain time recommendation, cost analysis and solar consideration",
    },
    {
        "id": "ev_charging_2",
        "question": "What's the most cost-effective charging schedule for my EV this week?",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast"],
        "expected_response": "The response should provide a weekly charging schedule with cost estimates",
    },
    {
        "id": "thermostat_1",
        "question": "What temperature should I set my thermostat to save energy while staying comfortable?",
        "expected_tools": ["get_weather_forecast", "query_energy_usage"],
        "expected_response": "The response should recommend temperature settings based on weather and usage patterns",
    },
    {
        "id": "thermostat_2",
        "question": "How much can I save by adjusting my HVAC settings during peak hours?",
        "expected_tools": ["get_electricity_prices", "query_energy_usage", "calculate_energy_savings"],
        "expected_response": "The response should calculate potential savings with specific dollar amounts",
    },
    {
        "id": "appliance_scheduling_1",
        "question": "When should I run my dishwasher and washing machine to minimize electricity costs?",
        "expected_tools": ["get_electricity_prices"],
        "expected_response": "The response should recommend off-peak hours for running appliances",
    },
    {
        "id": "appliance_scheduling_2",
        "question": "Can you suggest an optimal schedule for my high-energy appliances this week?",
        "expected_tools": ["get_electricity_prices", "query_energy_usage"],
        "expected_response": "The response should provide a weekly appliance schedule with cost considerations",
    },
    {
        "id": "solar_maximization_1",
        "question": "How can I maximize the use of my solar panels to reduce grid dependency?",
        "expected_tools": ["get_weather_forecast", "query_solar_generation"],
        "expected_response": "The response should suggest ways to align energy usage with solar generation",
    },
    {
        "id": "solar_maximization_2",
        "question": "What's my solar generation forecast for tomorrow and how should I plan my energy usage?",
        "expected_tools": ["get_weather_forecast", "query_solar_generation"],
        "expected_response": "The response should provide solar forecast and usage recommendations",
    },
    {
        "id": "cost_savings_1",
        "question": "How much money could I save by optimizing my energy usage patterns?",
        "expected_tools": ["query_energy_usage", "get_electricity_prices", "calculate_energy_savings"],
        "expected_response": "The response should calculate potential savings with specific amounts",
    },
    {
        "id": "energy_tips_1",
        "question": "What are some practical tips to reduce my home energy consumption?",
        "expected_tools": ["search_energy_tips", "query_energy_usage"],
        "expected_response": "The response should provide actionable energy-saving tips relevant to the home",
    },
]

if len(test_cases) < 10:
    raise ValueError("You MUST have at least 10 test cases")
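
Before running the suite, it may help to sanity-check the test definitions themselves. The sketch below assumes that the seven tool names listed in the system prompt are the complete tool set. `KNOWN_TOOLS`, `REQUIRED_KEYS`, and `validate_test_cases` are hypothetical names introduced for illustration.

```python
# Assumption: these are all the tools the agent has, copied from the system prompt.
KNOWN_TOOLS = {
    "get_weather_forecast",
    "get_electricity_prices",
    "query_energy_usage",
    "query_solar_generation",
    "search_energy_tips",
    "get_recent_energy_summary",
    "calculate_energy_savings",
}
REQUIRED_KEYS = {"id", "question", "expected_tools", "expected_response"}


def validate_test_cases(cases):
    """Return a list of human-readable problems found in the test cases."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<case {index}>")
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
        # A misspelled tool name would otherwise show up as a "missing tool" later
        unknown = set(case.get("expected_tools", [])) - KNOWN_TOOLS
        if unknown:
            problems.append(f"{label}: unknown tools {sorted(unknown)}")
    return problems


sample_cases = [
    {"id": "ok_case", "question": "q",
     "expected_tools": ["get_electricity_prices"], "expected_response": "r"},
    {"id": "bad_case", "question": "q", "expected_tools": ["get_prices"]},
]
problems = validate_test_cases(sample_cases)
for problem in problems:
    print(problem)
```

Calling `validate_test_cases(test_cases)` on the list defined above should return an empty list if every expected tool name matches the system prompt.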

## 3. Run Agent Tests

In [9]:
CONTEXT = "Location: San Francisco, CA"

In [10]:
# Run the agent tests
# For each test case, call the agent and collect the response
# Store results for evaluation

print("=== Running Agent Tests ===")
test_results_1 = []

for i, test_case in enumerate(test_cases):
    print(f"\nTest {i+1}: {test_case['id']}")
    print(f"Question: {test_case['question']}")
    print("-" * 50)
    
    try:
        # Call the agent
        response = ecohome_agent.invoke(
            question=test_case['question'],
            context=CONTEXT
        )
        
        # Store the result
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': response,
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat()
        }
        test_results_1.append(result)
                
    except Exception as e:
        print(f"Error: {e}")
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': f"Error: {str(e)}",
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat(),
            'error': str(e)
        }
        test_results_1.append(result)

print(f"\nCompleted {len(test_results_1)} tests")


=== Running Agent Tests ===

Test 1: ev_charging_1
Question: When should I charge my electric car tomorrow to minimize cost and maximize solar power?
--------------------------------------------------

Test 2: ev_charging_2
Question: What's the most cost-effective charging schedule for my EV this week?
--------------------------------------------------

Test 3: thermostat_1
Question: What temperature should I set my thermostat to save energy while staying comfortable?
--------------------------------------------------

Test 4: thermostat_2
Question: How much can I save by adjusting my HVAC settings during peak hours?
--------------------------------------------------

Test 5: appliance_scheduling_1
Question: When should I run my dishwasher and washing machine to minimize electricity costs?
--------------------------------------------------

Test 6: appliance_scheduling_2
Question: Can you suggest an optimal schedule for my high-energy appliances this week?
-----------------------------

In [11]:
test_results_1

[{'test_id': 'ev_charging_1',
  'question': 'When should I charge my electric car tomorrow to minimize cost and maximize solar power?',
  'response': {'messages': [SystemMessage(content='Location: San Francisco, CA', additional_kwargs={}, response_metadata={}, id='1d229296-475e-4d65-9713-92ed64096b2b'),
    HumanMessage(content='When should I charge my electric car tomorrow to minimize cost and maximize solar power?', additional_kwargs={}, response_metadata={}, id='784a7416-f402-4696-9942-e5ac8ff83dbc'),
    AIMessage(content='', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 93, 'prompt_tokens': 1565, 'total_tokens': 1658, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 1536}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_51db84afab', 'id': 

## 4. Evaluate Responses

In [20]:
# Response evaluator: scores a single answer using an LLM as the judge
def evaluate_response(question, final_response, expected_response):
    """Evaluate a single response against expected response"""
    
    # Create evaluation instructions for the Agent
    evaluation_instructions = """
    You are an expert evaluator for an AI energy advisor system.
    
    Evaluate responses based on these criteria:
    1. Relevance: Does the response address the user's question?
    2. Accuracy: Is the information provided accurate and reasonable?
    3. Completeness: Does the response provide sufficient detail?
    4. Actionability: Does the response provide actionable recommendations?
    
    Provide your evaluation as a JSON object with:
    - relevance_score (0-10)
    - accuracy_score (0-10)
    - completeness_score (0-10)
    - actionability_score (0-10)
    - overall_score (0-10, average of above)
    - feedback (brief explanation of the scores)
    
    Return only the JSON object, no other text.
    """
    
    # Create an LLM evaluator using the Agent class
    evaluator_agent = Agent(
        instructions=evaluation_instructions,
        model="gpt-4o-mini"
    )
    
    evaluation_prompt = f"""
    User Question: {question}
    
    AI Response: {final_response}
    
    Expected Response Type: {expected_response}
    
    Evaluate this response according to the criteria.
    """
    
    try:
        response = evaluator_agent.invoke(evaluation_prompt)
        import json
        # Extract the content from the response messages
        final_message = response['messages'][-1].content
        evaluation = json.loads(final_message)
        return evaluation
    except Exception as e:
        return {
            "relevance_score": 0,
            "accuracy_score": 0,
            "completeness_score": 0,
            "actionability_score": 0,
            "overall_score": 0,
            "feedback": f"Evaluation failed: {str(e)}"
        }
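
`evaluate_response` passes the evaluator's reply straight to `json.loads`, so a reply wrapped in a Markdown code fence would fall into the all-zero fallback. The sketch below shows one possible way to tolerate that case. `parse_evaluation_json` is a hypothetical helper, and whether the model actually wraps its output this way is an assumption, not something observed in this notebook.

```python
import json
import re

# Markdown code-fence marker, built from single backticks to keep this example readable
FENCE = "`" * 3


def parse_evaluation_json(text):
    """Parse a JSON object from model output, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence (with an optional language tag) and the closing fence
        cleaned = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*", "", cleaned)
        cleaned = re.sub(r"\s*" + FENCE + r"$", "", cleaned)
    return json.loads(cleaned)


raw = FENCE + 'json\n{"relevance_score": 9, "feedback": "ok"}\n' + FENCE
parsed = parse_evaluation_json(raw)
print(parsed["relevance_score"])
```

Plain JSON passes through unchanged, so the helper could replace the direct `json.loads` call without affecting replies that are already well formed.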

In [21]:
# Example: How to call evaluate_response with a test result
# This shows the correct way to extract final_response

# Get a test result (example using the first one)
if test_results_1:
    result = test_results_1[0]
    
    # Extract the final response from the messages
    final_response = result['response']['messages'][-1].content
    
    # Now call evaluate_response
    evaluation = evaluate_response(
        question=result['question'],
        final_response=final_response,
        expected_response=result['expected_response']
    )
    
    print("Evaluation:", evaluation)


Evaluation: {'relevance_score': 10, 'accuracy_score': 9, 'completeness_score': 9, 'actionability_score': 10, 'overall_score': 9, 'feedback': "The response is highly relevant as it directly addresses the user's question about charging an electric car to minimize costs and maximize solar power. The information provided about electricity pricing is accurate and detailed, although the solar generation forecast could have been more explicitly stated as zero. The recommendations are clear and actionable, guiding the user on when to charge their EV effectively."}
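
The evaluator is asked to report `overall_score` as the average of the four criteria, but the model performs that arithmetic itself. In the output above, the four scores (10, 9, 9, 10) average to 9.5, while the reported overall score is 9. A minimal sketch of recomputing the average in code follows. `CRITERIA` and `recompute_overall` are hypothetical names introduced for illustration.

```python
CRITERIA = (
    "relevance_score",
    "accuracy_score",
    "completeness_score",
    "actionability_score",
)


def recompute_overall(evaluation):
    """Return a copy of the evaluation with overall_score set to the criteria mean."""
    scores = [float(evaluation[name]) for name in CRITERIA]
    updated = dict(evaluation)
    updated["overall_score"] = round(sum(scores) / len(scores), 2)
    return updated


sample = {
    "relevance_score": 10,
    "accuracy_score": 9,
    "completeness_score": 9,
    "actionability_score": 10,
    "overall_score": 9,  # value reported by the model
}
print(recompute_overall(sample)["overall_score"])  # 9.5
```

Computing the mean locally keeps the overall score consistent with the per-criterion scores regardless of how the model rounds.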


In [22]:
# Tool usage evaluator
def evaluate_tool_usage(messages, expected_tools):
    """
    Evaluate if the agent used the expected tools during execution.
    
    Args:
        messages: List of messages from the agent execution
        expected_tools: List of tool names that should have been used
    
    Returns:
        Dict with tool usage evaluation metrics
    """
    # Extract tool calls from messages (matching the pattern in Cell 7)
    tools_used = []
    for msg in messages:
        obj = msg.model_dump()
        if obj.get("tool_call_id"):
            tools_used.append(msg.name)
    
    # Remove duplicates while preserving order
    tools_used = list(dict.fromkeys(tools_used))
    
    # Calculate metrics
    expected_set = set(expected_tools)
    used_set = set(tools_used)
    
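    # Build ordered lists (not sets) so reported tool names keep a stable order across runs
    correctly_used = [tool for tool in tools_used if tool in expected_set]
    missing_tools = [tool for tool in dict.fromkeys(expected_tools) if tool not in used_set]
    unnecessary_tools = [tool for tool in tools_used if tool not in expected_set]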
    
    # Calculate scores
    appropriateness = len(correctly_used) / len(used_set) if used_set else 0
    completeness = len(correctly_used) / len(expected_set) if expected_set else 0
    overall_tool_score = (appropriateness + completeness) / 2
    
    # Generate feedback
    feedback = []
    if correctly_used:
        feedback.append(f"✓ Correctly used: {', '.join(correctly_used)}")
    if missing_tools:
        feedback.append(f"✗ Missing: {', '.join(missing_tools)}")
    if unnecessary_tools:
        feedback.append(f"⚠ Extra: {', '.join(unnecessary_tools)}")
    
    return {
        "tools_used": tools_used,
        "tools_expected": expected_tools,
        "correctly_used": list(correctly_used),
        "missing_tools": list(missing_tools),
        "unnecessary_tools": list(unnecessary_tools),
        "appropriateness_score": round(appropriateness, 2),
        "completeness_score": round(completeness, 2),
        "overall_tool_score": round(overall_tool_score * 10, 1),  # Scale to 0-10
        "feedback": " | ".join(feedback) if feedback else "No tools used"
    }
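
As a worked example of the scoring in `evaluate_tool_usage`, the sketch below repeats the arithmetic for the `ev_charging_1` case, where two tools were expected and the run shown earlier used three. The variable names are illustrative only.

```python
expected = {"get_weather_forecast", "get_electricity_prices"}
used = {"get_weather_forecast", "get_electricity_prices", "query_solar_generation"}

correct = expected & used

# Share of the tools actually called that were expected: 2 / 3
appropriateness = len(correct) / len(used)
# Share of the expected tools that were called: 2 / 2
completeness = len(correct) / len(expected)

# Mean of the two ratios, scaled to 0-10 and rounded to one decimal
overall = round((appropriateness + completeness) / 2 * 10, 1)
print(overall)  # 8.3
```

One extra tool call lowers appropriateness but not completeness, so the case still scores 8.3 out of 10.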

In [23]:
# Generate a comprehensive evaluation report
def generate_evaluation_report(test_results):
    """
    Generate a comprehensive evaluation report from test results.
    
    Args:
        test_results: List of test result dictionaries
    
    Returns:
        Dict with overall evaluation metrics and recommendations
    """
    if not test_results:
        return {"error": "No test results provided"}
    
    # Process each test and gather evaluations
    all_evaluations = []
    
    for result in test_results:
        if 'error' in result:
            continue
            
        # Evaluate tool usage
        tool_eval = evaluate_tool_usage(
            result['response']['messages'],
            result['expected_tools']
        )
        
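        # Map the keys returned by evaluate_tool_usage onto the report fields:
        # appropriateness acts as precision, completeness as recall.
        precision = tool_eval['appropriateness_score']
        recall = tool_eval['completeness_score']
        f1_score = (
            2 * precision * recall / (precision + recall)
            if (precision + recall) else 0
        )

        all_evaluations.append({
            'test_id': result['test_id'],
            'tool_score': tool_eval['overall_tool_score'],
            'precision': precision,
            'recall': recall,
            'f1_score': f1_score,
            'tools_missing': tool_eval['missing_tools'],
            'tools_unnecessary': tool_eval['unnecessary_tools']
        })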
    
    if not all_evaluations:
        return {"error": "No successful tests to evaluate"}
    
    # Calculate overall scores
    total_tests = len(all_evaluations)
    avg_tool_score = sum(e['tool_score'] for e in all_evaluations) / total_tests
    avg_precision = sum(e['precision'] for e in all_evaluations) / total_tests
    avg_recall = sum(e['recall'] for e in all_evaluations) / total_tests
    avg_f1 = sum(e['f1_score'] for e in all_evaluations) / total_tests
    
    # Identify strengths and weaknesses
    strengths = []
    weaknesses = []
    
    if avg_tool_score >= 7:
        strengths.append("Effective tool selection and usage")
    elif avg_tool_score >= 5:
        strengths.append("Moderate tool usage")
    else:
        weaknesses.append("Tool selection needs significant improvement")
    
    # Count tests with issues
    missing_tools_count = sum(1 for e in all_evaluations if e['tools_missing'])
    extra_tools_count = sum(1 for e in all_evaluations if e['tools_unnecessary'])
    
    if missing_tools_count > total_tests * 0.3:
        weaknesses.append(f"Missing required tools in {missing_tools_count}/{total_tests} tests")
    
    if extra_tools_count > total_tests * 0.3:
        weaknesses.append(f"Using unnecessary tools in {extra_tools_count}/{total_tests} tests")
    
    # Generate recommendations
    recommendations = []
    if avg_tool_score < 7:
        recommendations.append("Improve system prompt to guide better tool selection")
        recommendations.append("Review tool descriptions for clarity")
    if missing_tools_count > 0:
        recommendations.append("Ensure agent understands when each tool should be used")
    if extra_tools_count > 0:
        recommendations.append("Reduce unnecessary tool calls for efficiency")
    
    # Assign grade
    if avg_tool_score >= 9:
        grade = "A (Excellent)"
    elif avg_tool_score >= 7:
        grade = "B (Good)"
    elif avg_tool_score >= 5:
        grade = "C (Satisfactory)"
    else:
        grade = "D (Needs Improvement)"
    
    return {
        "total_tests": total_tests,
        "average_tool_score": round(avg_tool_score, 2),
        "average_precision": round(avg_precision, 2),
        "average_recall": round(avg_recall, 2),
        "average_f1_score": round(avg_f1, 2),
        "strengths": strengths,
        "weaknesses": weaknesses,
        "recommendations": recommendations,
        "grade": grade,
        "detailed_evaluations": all_evaluations
    }


def display_evaluation_report(report):
    """Display the evaluation report in a readable format"""
    print("=" * 80)
    print(" " * 20 + "ECOHOME ENERGY ADVISOR EVALUATION REPORT")
    print("=" * 80)
    
    if "error" in report:
        print(f"\nERROR: {report['error']}")
        return
    
    print(f"\nSUMMARY")
    print(f"  Total Tests: {report['total_tests']}")
    print(f"  Average Tool Score: {report['average_tool_score']}/10")
    print(f"  Average Precision: {report['average_precision']}")
    print(f"  Average Recall: {report['average_recall']}")
    print(f"  Average F1 Score: {report['average_f1_score']}")
    print(f"  Overall Grade: {report['grade']}")
    
    print(f"\nSTRENGTHS")
    if report['strengths']:
        for strength in report['strengths']:
            print(f"  • {strength}")
    else:
        print("  None identified")
    
    print(f"\nWEAKNESSES")
    if report['weaknesses']:
        for weakness in report['weaknesses']:
            print(f"  • {weakness}")
    else:
        print("  None identified")
    
    print(f"\nRECOMMENDATIONS")
    if report['recommendations']:
        for i, rec in enumerate(report['recommendations'], 1):
            print(f"  {i}. {rec}")
    else:
        print("  System is performing well!")
    
    print("\n" + "=" * 80)
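
The report lists precision, recall, and F1 per test. Assuming precision corresponds to the appropriateness ratio and recall to the completeness ratio from `evaluate_tool_usage` (an interpretation, not something the notebook states), F1 is their harmonic mean. `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ev_charging_1: 2 of 3 used tools were expected, and 2 of 2 expected tools were used
print(round(f1_from(2 / 3, 1.0), 2))  # 0.8
```

Because the harmonic mean is pulled toward the lower value, F1 (0.8) is slightly stricter here than the arithmetic mean of the two ratios (about 0.83).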

In [26]:
# Detailed analysis: See what went wrong in each test
print("=" * 80)
print("DETAILED TEST-BY-TEST ANALYSIS")
print("=" * 80)

for i, result in enumerate(test_results_1, 1):
    print(f"\n{'='*80}")
    print(f"TEST {i}: {result['test_id']}")
    print(f"{'='*80}")
    print(f"Question: {result['question']}")
    
    # Check if there was an error OR if response is not a proper dict
    if 'error' in result or not isinstance(result.get('response'), dict):
        error_msg = result.get('error', 'Unknown error - response format invalid')
        print(f"❌ ERROR: {error_msg}")
        continue

    # Additional safety check for messages
    if 'messages' not in result['response']:
        print(f"❌ ERROR: Response missing 'messages' key")
        continue
    
    try:
        # Get tool evaluation
        tool_eval = evaluate_tool_usage(
            result['response']['messages'],
            result['expected_tools']
        )
        
        print(f"\n📊 Expected Tools: {result['expected_tools']}")
        print(f"🔧 Actually Used: {tool_eval['tools_used']}")
        print(f"✅ Correct: {tool_eval['correctly_used']}")
        print(f"❌ Missing: {tool_eval['missing_tools']}")
        print(f"⚠️  Extra: {tool_eval['unnecessary_tools']}")
        print(f"📈 Score: {tool_eval['overall_tool_score']}/10")
        print(f"\n💬 Feedback: {tool_eval['feedback']}")
        
        # Show a snippet of the agent's response
        final_response = result['response']['messages'][-1].content
        print(f"\n📝 Agent Response (first 200 chars):")
        print(f"   {final_response[:200]}...")
        
    except Exception as e:
        print(f"❌ ERROR during evaluation: {str(e)}")

print("\n" + "=" * 80)


DETAILED TEST-BY-TEST ANALYSIS

TEST 1: ev_charging_1
Question: When should I charge my electric car tomorrow to minimize cost and maximize solar power?

📊 Expected Tools: ['get_weather_forecast', 'get_electricity_prices']
🔧 Actually Used: ['get_weather_forecast', 'get_electricity_prices', 'query_solar_generation']
✅ Correct: ['get_weather_forecast', 'get_electricity_prices']
❌ Missing: []
⚠️  Extra: ['query_solar_generation']
📈 Score: 8.3/10

💬 Feedback: ✓ Correctly used: get_weather_forecast, get_electricity_prices | ⚠ Extra: query_solar_generation

📝 Agent Response (first 200 chars):
   To determine the best time to charge your electric vehicle (EV) tomorrow in San Francisco, CA, we need to consider both the solar power generation forecast and the electricity pricing for the day.

##...

TEST 2: ev_charging_2
Question: What's the most cost-effective charging schedule for my EV this week?

📊 Expected Tools: ['get_electricity_prices', 'get_weather_forecast']
🔧 Actually Used: ['get_ele