# EcoHome Energy Advisor - Agent Run & Evaluation

In this notebook, you'll run the Energy Advisor agent against a set of realistic scenarios, then evaluate how well its answers and tool calls help customers optimize their energy usage.

## Learning Objectives
- Create the agent's instructions
- Run the Energy Advisor with different types of questions
- Evaluate response quality and accuracy
- Measure tool usage effectiveness
- Identify areas for improvement
- Implement evaluation metrics

## Evaluation Criteria
- **Accuracy**: Correct information and calculations
- **Relevance**: Responses address the user's question
- **Completeness**: Comprehensive answers with actionable advice
- **Tool Usage**: Appropriate use of available tools
- **Reasoning**: Clear explanation of recommendations


## 1. Import and Initialize

In [1]:
from datetime import datetime
from agent import Agent

In [2]:
## Agent instructions

ECOHOME_SYSTEM_PROMPT = """
You are EcoHome, a proactive residential energy advisor for homeowners and renters.
Role: deliver actionable, data-backed recommendations that reduce costs and improve energy efficiency.

Steps to follow:
1) Clarify location, timeframe, and devices if missing; state any assumptions.
2) Pull relevant data using tools: weather for solar/thermal context, electricity prices for time-of-use windows, usage and solar history for trends, recent summary when timeframe is unclear, and energy tips for best practices.
3) Analyze patterns (peaks, off-peak windows, forecasted conditions) and decide the best actions.
4) Quantify impact (kWh and USD) with calculate_energy_savings when numbers are available; otherwise give conservative ranges.
5) Present 2-4 prioritized recommendations with reasoning and next steps; note gaps and ask one concise follow-up question if needed.

Key capabilities:
- get_weather_forecast: assess upcoming conditions and solar potential.
- get_electricity_prices: identify off-peak vs peak hours for load shifting.
- query_energy_usage / query_solar_generation: inspect historical consumption and production.
- get_recent_energy_summary: get a quick view when the user provides little context.
- search_energy_tips: retrieve best practices via RAG.
- calculate_energy_savings: quantify savings for proposed actions.

Recommendations guidance:
- Tie every suggestion to retrieved data (price periods, forecast, usage patterns) and make them specific and time-bound.
- Prefer scheduling and load shifting to cheaper hours; suggest thermostat, EV, appliance, and solar-usage tweaks.
- Include expected savings and assumptions; provide quick wins plus one longer-term improvement when relevant.
- If data is missing, state the assumption and request the needed detail succinctly.

Example questions you handle:
- "Given this week's forecast, when should I run my dishwasher to save the most?"
- "How can I cut my EV charging costs in San Diego tomorrow?"
- "Review my past 7 days of usage and suggest ways to reduce peak load."
- "Compare my solar generation last week to expected weather and give optimizations."

Respond concisely, show key tool findings briefly, then deliver the final plan.
"""


In [3]:
ecohome_agent = Agent(
    instructions=ECOHOME_SYSTEM_PROMPT,
)

In [4]:
response = ecohome_agent.invoke(
    question="When should I charge my electric car tomorrow to minimize cost and maximize solar power?",
    context="Location: San Francisco, CA"
)

In [5]:
print(response["messages"][-1].content)

### Summary of Findings for December 29, 2025

1. **Weather Forecast**: The day is expected to be cloudy with no solar irradiance during the day. Solar generation is expected to be minimal.
2. **Electricity Prices**:
   - **Peak Hours (6 AM - 8 PM)**: Rates range from $0.175 to $0.185 per kWh.
   - **Off-Peak Hours (9 PM - 5 AM)**: Rates are significantly lower, ranging from $0.096 to $0.105 per kWh.

### Recommendations for Charging Your Electric Vehicle (EV)

1. **Charge During Off-Peak Hours**:
   - **Best Time**: Charge your EV between **9 PM and 5 AM** when the rates are at their lowest (around $0.096 to $0.105 per kWh).
   - **Reasoning**: Charging during these hours will save you money as the rates are significantly lower compared to peak hours.

2. **Avoid Daytime Charging**:
   - **Peak Hours**: Avoid charging between **6 AM and 8 PM** due to higher rates (up to $0.185 per kWh) and minimal solar generation.

### Next Steps
- Schedule your EV charging to start at **9 PM** on De

In [6]:
print("TOOLS:")
for msg in response["messages"]:
    obj = msg.dict()
    if obj.get("type") == "function":
        print("-", obj.get("name"))

TOOLS:
- get_weather_forecast
- get_electricity_prices


## 2. Define Test Cases

In [7]:
# Comprehensive scenario-based test cases for the Energy Advisor
# Covers EV charging, thermostat, appliance scheduling, solar usage, and cost savings calculations.


In [8]:
test_cases = [
    {
        "id": "ev_charging_peak_avoid",
        "question": "When should I charge my EV tomorrow to avoid peak rates and use my rooftop solar?",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast"],
        "expected_response": "Should recommend off-peak/night or mid-day solar window with rate comparison and solar hours.",
    },
    {
        "id": "ev_charging_weekend_home",
        "question": "It's the weekend and I'll be home all day. What is the cheapest charging window for my EV?",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast"],
        "expected_response": "Should highlight weekend pricing profile and suggest a 2-3 hour window with solar alignment.",
    },
    {
        "id": "thermostat_heatwave_peak",
        "question": "How should I set my thermostat this afternoon during a heatwave to stay comfortable but minimize cost?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices", "search_energy_tips"],
        "expected_response": "Should suggest pre-cooling before peak, target temp band, and ventilation/humidity tips.",
    },
    {
        "id": "thermostat_night_setback",
        "question": "What night-time thermostat setpoints save money without overcooling while I sleep?",
        "expected_tools": ["get_electricity_prices", "search_energy_tips"],
        "expected_response": "Should give a setback range, reference off-peak pricing, and comfort guidance.",
    },
    {
        "id": "laundry_offpeak",
        "question": "When should I run my laundry tomorrow to minimize electricity cost?",
        "expected_tools": ["get_electricity_prices", "search_energy_tips"],
        "expected_response": "Should recommend an off-peak window and mention load shifting benefits.",
    },
    {
        "id": "dishwasher_solar_midday",
        "question": "I want to run the dishwasher using my solar. What time window is best tomorrow?",
        "expected_tools": ["get_weather_forecast", "get_electricity_prices", "query_solar_generation"],
        "expected_response": "Should pick a sunny mid-day slot referencing solar output and any peak price overlap.",
    },
    {
        "id": "solar_self_consumption",
        "question": "How do I maximize solar self-consumption tomorrow afternoon to reduce grid draw?",
        "expected_tools": ["get_weather_forecast", "query_solar_generation", "get_recent_energy_summary"],
        "expected_response": "Should suggest shifting flexible loads into high-irradiance hours with expected kWh impact.",
    },
    {
        "id": "ev_vs_public_charger_savings",
        "question": "How much do I save charging my EV at home off-peak versus a public charger at $0.35/kWh?",
        "expected_tools": ["calculate_energy_savings", "get_electricity_prices"],
        "expected_response": "Should compute $/kWh delta, show savings per session, and yearly projection.",
    },
    {
        "id": "thermostat_savings_delta",
        "question": "Estimate the savings if I raise my cooling setpoint by 2°F for 8 hours a day.",
        "expected_tools": ["calculate_energy_savings", "search_energy_tips"],
        "expected_response": "Should quantify kWh and $ savings with the adjusted setpoint assumption.",
    },
    {
        "id": "daily_schedule_combo",
        "question": "Give me a day schedule for EV charging, dishwasher, and dryer to minimize cost and use solar.",
        "expected_tools": ["get_electricity_prices", "get_weather_forecast", "get_recent_energy_summary", "search_energy_tips"],
        "expected_response": "Should provide a staggered schedule with peak avoidance and solar-aware timing per device.",
    },
]

if len(test_cases) < 10:
    raise ValueError("You MUST have at least 10 test cases")
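
The length check above only guards the number of test cases. A slightly stricter sanity check, sketched below, also confirms that ids are unique, that required fields are filled in, and that every expected tool is one of the tool names listed under "Key capabilities" in the system prompt. `KNOWN_TOOLS`, `validate_test_cases`, and `sample_cases` are hypothetical names introduced for this sketch and are not part of the original notebook.

```python
# Tool names copied from the "Key capabilities" list in ECOHOME_SYSTEM_PROMPT.
KNOWN_TOOLS = {
    "get_weather_forecast",
    "get_electricity_prices",
    "query_energy_usage",
    "query_solar_generation",
    "get_recent_energy_summary",
    "search_energy_tips",
    "calculate_energy_savings",
}


def validate_test_cases(cases, known_tools=KNOWN_TOOLS):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    seen_ids = set()
    for case in cases:
        case_id = case.get("id", "<missing id>")
        if case_id in seen_ids:
            problems.append(f"duplicate id: {case_id}")
        seen_ids.add(case_id)
        # Every case needs a question, a tool list, and an expected-response description.
        for field in ("question", "expected_tools", "expected_response"):
            if not case.get(field):
                problems.append(f"{case_id}: missing or empty '{field}'")
        unknown = set(case.get("expected_tools") or []) - known_tools
        if unknown:
            problems.append(f"{case_id}: unknown tools {sorted(unknown)}")
    return problems


# Tiny illustrative input (not the notebook's real test_cases).
sample_cases = [
    {"id": "a", "question": "q?", "expected_tools": ["get_weather_forecast"], "expected_response": "r"},
    {"id": "a", "question": "q?", "expected_tools": ["get_wether"], "expected_response": ""},
]
for problem in validate_test_cases(sample_cases):
    print(problem)
```

Calling `validate_test_cases(test_cases)` before running the agent would surface a misspelled tool name early, rather than as a permanently "missing" tool in the evaluation report.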



## 3. Run Agent Tests

In [9]:
CONTEXT = "Location: San Francisco, CA"

In [10]:
# Run the agent tests
# For each test case, call the agent and collect the response
# Store results for evaluation

print("=== Running Agent Tests ===")
test_results = []

for i, test_case in enumerate(test_cases, start=1):
    print(f"\nTest {i}: {test_case['id']}")
    print(f"Question: {test_case['question']}")
    print("-" * 50)
    
    try:
        # Call the agent
        response = ecohome_agent.invoke(
            question=test_case['question'],
            context=CONTEXT
        )
        
        # Store the result
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': response,
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat()
        }
        test_results.append(result)
                
    except Exception as e:
        print(f"Error: {e}")
        result = {
            'test_id': test_case['id'],
            'question': test_case['question'],
            'response': f"Error: {str(e)}",
            'expected_tools': test_case['expected_tools'],
            'expected_response': test_case['expected_response'],
            'timestamp': datetime.now().isoformat(),
            'error': str(e)
        }
        test_results.append(result)

print(f"\nCompleted {len(test_results)} tests")


=== Running Agent Tests ===

Test 1: ev_charging_peak_avoid
Question: When should I charge my EV tomorrow to avoid peak rates and use my rooftop solar?
--------------------------------------------------

Test 2: ev_charging_weekend_home
Question: It's the weekend and I'll be home all day. What is the cheapest charging window for my EV?
--------------------------------------------------

Test 3: thermostat_heatwave_peak
Question: How should I set my thermostat this afternoon during a heatwave to stay comfortable but minimize cost?
--------------------------------------------------

Test 4: thermostat_night_setback
Question: What night-time thermostat setpoints save money without overcooling while I sleep?
--------------------------------------------------


Number of requested results 5 is greater than number of elements in index 4, updating n_results = 4



Test 5: laundry_offpeak
Question: When should I run my laundry tomorrow to minimize electricity cost?
--------------------------------------------------

Test 6: dishwasher_solar_midday
Question: I want to run the dishwasher using my solar. What time window is best tomorrow?
--------------------------------------------------

Test 7: solar_self_consumption
Question: How do I maximize solar self-consumption tomorrow afternoon to reduce grid draw?
--------------------------------------------------

Test 8: ev_vs_public_charger_savings
Question: How much do I save charging my EV at home off-peak versus a public charger at $0.35/kWh?
--------------------------------------------------

Test 9: thermostat_savings_delta
Question: Estimate the savings if I raise my cooling setpoint by 2°F for 8 hours a day.
--------------------------------------------------

Test 10: daily_schedule_combo
Question: Give me a day schedule for EV charging, dishwasher, and dryer to minimize cost and use solar.
--

In [11]:
test_results

[{'test_id': 'ev_charging_peak_avoid',
  'question': 'When should I charge my EV tomorrow to avoid peak rates and use my rooftop solar?',
  'response': {'messages': [SystemMessage(content='\nYou are EcoHome, a proactive residential energy advisor for homeowners and renters.\nRole: deliver actionable, data-backed recommendations that reduce costs and improve energy efficiency.\n\nSteps to follow:\n1) Clarify location, timeframe, and devices if missing; state any assumptions.\n2) Pull relevant data using tools: weather for solar/thermal context, electricity prices for time-of-use windows, usage and solar history for trends, recent summary when timeframe is unclear, and energy tips for best practices.\n3) Analyze patterns (peaks, off-peak windows, forecasted conditions) and decide the best actions.\n4) Quantify impact (kWh and USD) with calculate_energy_savings when numbers are available; otherwise give conservative ranges.\n5) Present 2-4 prioritized recommendations with reasoning and ne

## 4. Evaluate Responses

In [12]:
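def evaluate_response(question, final_response, expected_response):
    """Score a single response against its expected-response description.

    All metrics are lexical heuristics in the range [0, 1]:
    - accuracy: character-level SequenceMatcher ratio of expected vs. actual text
    - relevance: share of question tokens that appear in the response
    - completeness: share of expected-response tokens that appear in the response
    - usefulness: 0.3 * accuracy + 0.3 * relevance + 0.4 * completeness
    """
    import re
    from difflib import SequenceMatcher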

    def _normalize(text):
        if not text:
            return ""
        return re.sub(r"\s+", " ", text.strip().lower())

    def _tokens(text):
        return set(re.findall(r"\b\w+\b", _normalize(text)))

    def _coverage(base_tokens, comparison_tokens):
        if not base_tokens:
            return 0.0
        return len(base_tokens & comparison_tokens) / len(base_tokens)

    normalized_final = _normalize(final_response)
    normalized_expected = _normalize(expected_response)

    question_tokens = _tokens(question)
    final_tokens = _tokens(final_response)
    expected_tokens = _tokens(expected_response)

    accuracy = SequenceMatcher(None, normalized_expected, normalized_final).ratio() if (normalized_expected or normalized_final) else 0.0
    relevance = _coverage(question_tokens, final_tokens)
    completeness = _coverage(expected_tokens, final_tokens)
    usefulness = max(0.0, min(1.0, (0.3 * accuracy) + (0.3 * relevance) + (0.4 * completeness)))

    def _describe(score, aspect):
        if score >= 0.85:
            return f"Strong {aspect}"
        if score >= 0.6:
            return f"Moderate {aspect}"
        return f"Weak {aspect}"

    strengths = []
    improvement_areas = []
    if completeness >= 0.7:
        strengths.append("Covers most of the expected points.")
    else:
        improvement_areas.append("Add missing key details from the expected answer.")
    if relevance >= 0.7:
        strengths.append("Response stays focused on the question.")
    else:
        improvement_areas.append("Tighten the answer to better address the user's question.")
    if accuracy >= 0.7:
        strengths.append("Wording aligns well with expected content.")
    else:
        improvement_areas.append("Adjust phrasing to better match the expected response.")

    feedback = {
        "metrics": {
            "accuracy": accuracy,
            "relevance": relevance,
            "completeness": completeness,
            "usefulness": usefulness,
        },
        "summaries": {
            "accuracy": _describe(accuracy, "accuracy"),
            "relevance": _describe(relevance, "relevance"),
            "completeness": _describe(completeness, "completeness"),
            "usefulness": _describe(usefulness, "usefulness"),
        },
        "strengths": strengths,
        "improvements": improvement_areas,
        "notes": {
            "missing_expected_terms": list(expected_tokens - final_tokens),
            "extra_response_terms": list(final_tokens - expected_tokens),
        },
    }

    return feedback
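
To make the scoring concrete, the sketch below reproduces the two building blocks of `evaluate_response` on a toy pair of strings: token coverage and the 0.3/0.3/0.4 weighting. The helper names `tokens`, `coverage`, and `usefulness` are stand-ins for the nested helpers above, and the toy strings are invented for illustration. The last line plugs in the scores reported for the first test case in the evaluation output later in this notebook.

```python
import re


def tokens(text):
    """Lowercase word tokens as a set, mirroring _tokens above."""
    return set(re.findall(r"\b\w+\b", (text or "").lower()))


def coverage(base, other):
    """Share of base tokens that also occur in other (0.0 when base is empty)."""
    return len(base & other) / len(base) if base else 0.0


def usefulness(accuracy, relevance, completeness):
    """Weighted blend used by evaluate_response, clamped to [0, 1]."""
    return max(0.0, min(1.0, 0.3 * accuracy + 0.3 * relevance + 0.4 * completeness))


expected = "Should recommend an off-peak window"
response = "Run laundry in the off-peak window after 9 PM"

# expected tokens: should, recommend, an, off, peak, window (6 tokens)
# shared with the response: off, peak, window (3 tokens)
print(coverage(tokens(expected), tokens(response)))  # 0.5

# Scores reported for ev_charging_peak_avoid in the evaluation output.
print(round(usefulness(0.010126582278481013, 0.6, 0.4666666666666667), 4))  # 0.3697
```

Because each `expected_response` is a one-sentence description of a good answer rather than a full reference answer, the character-level `accuracy` ratio is bound to be small for long responses; the coverage-based metrics are less sensitive to response length.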

In [13]:
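def evaluate_tool_usage(messages, expected_tools):
    """Evaluate whether the right tools were used.

    - tool_appropriateness: share of used tools that were expected (precision)
    - tool_completeness: share of expected tools that were used (recall)
    """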
    expected_set = {t.lower() for t in (expected_tools or [])}
    used_tools = set()

    def _extract_from_message(msg_obj):
        """Pull tool names from various message shapes."""
        record = None
        if hasattr(msg_obj, "dict"):
            record = msg_obj.dict()
        elif isinstance(msg_obj, dict):
            record = msg_obj
        if not record:
            return []

        names = []
        # Direct function/tool message
        if record.get("type") in {"function", "tool"} and record.get("name"):
            names.append(record["name"])
        # OpenAI-style tool_calls
        additional = record.get("additional_kwargs", {}) or {}
        for tc in additional.get("tool_calls", []) or []:
            func = (tc or {}).get("function", {})
            if func.get("name"):
                names.append(func["name"])
        # Nested function_call pattern
        func_call = record.get("function_call", {}) or {}
        if func_call.get("name"):
            names.append(func_call["name"])
        return names

    for msg in messages or []:
        for name in _extract_from_message(msg):
            used_tools.add(name.lower())

    overlap = used_tools & expected_set
    missing = expected_set - used_tools
    unexpected = used_tools - expected_set

    appropriateness = len(overlap) / len(used_tools) if used_tools else 0.0
    completeness = len(overlap) / len(expected_set) if expected_set else 1.0

    def _describe(score, aspect):
        if score >= 0.85:
            return f"Strong {aspect}"
        if score >= 0.6:
            return f"Moderate {aspect}"
        return f"Weak {aspect}"

    strengths = []
    improvements = []
    if overlap:
        strengths.append(f"Used expected tools: {sorted(overlap)}")
    if not missing:
        strengths.append("All required tools were invoked.")
    else:
        improvements.append(f"Missing tools: {sorted(missing)}")
    if unexpected:
        improvements.append(f"Unexpected tools used: {sorted(unexpected)}")

    feedback = {
        "metrics": {
            "tool_appropriateness": appropriateness,
            "tool_completeness": completeness,
        },
        "summaries": {
            "tool_appropriateness": _describe(appropriateness, "tool appropriateness"),
            "tool_completeness": _describe(completeness, "tool completeness"),
        },
        "strengths": strengths,
        "improvements": improvements,
        "details": {
            "expected": sorted(expected_set),
            "used": sorted(used_tools),
            "missing_expected": sorted(missing),
            "unexpected_used": sorted(unexpected),
        },
    }

    return feedback
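
The two tool metrics above amount to set precision and recall over tool names. The sketch below isolates that calculation; `tool_scores` is a hypothetical helper name, and the tool sets are made up to illustrate one case where the agent called only one of two expected tools.

```python
def tool_scores(used, expected):
    """Return (appropriateness, completeness) for two collections of tool names."""
    used, expected = set(used), set(expected)
    overlap = used & expected
    # Precision: how many of the tools that were called were actually expected.
    appropriateness = len(overlap) / len(used) if used else 0.0
    # Recall: how many of the expected tools were actually called.
    completeness = len(overlap) / len(expected) if expected else 1.0
    return appropriateness, completeness


# Illustrative case: the agent called one expected tool and skipped the other.
used = {"get_electricity_prices"}
expected = {"get_electricity_prices", "get_weather_forecast"}
print(tool_scores(used, expected))  # (1.0, 0.5)
```

Note the edge cases inherited from the original definitions: an agent that calls no tools scores 0.0 on appropriateness, while a test with no expected tools scores 1.0 on completeness.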

In [14]:
# Calculate overall scores and metrics
# Identify strengths and weaknesses
# Provide recommendations for improvement
def generate_evaluation_report(test_results):
    """Aggregate per-test evaluations into a structured report."""
    from datetime import datetime

    report = {
        "generated_at": datetime.now().isoformat(),
        "overall": {},
        "per_test": [],
        "strengths": [],
        "weaknesses": [],
        "recommendations": [],
    }

    aggregates = {
        "accuracy": 0.0,
        "relevance": 0.0,
        "completeness": 0.0,
        "usefulness": 0.0,
        "tool_appropriateness": 0.0,
        "tool_completeness": 0.0,
    }

    def _get_final_message(msgs):
        if not msgs:
            return ""
        last = msgs[-1]
        if hasattr(last, "content"):
            return last.content or ""
        if isinstance(last, dict):
            return last.get("content", "")
        return str(last)

    for result in test_results or []:
        messages = result.get("response", {}).get("messages", []) if isinstance(result.get("response"), dict) else []
        final_response = _get_final_message(messages)

        response_eval = evaluate_response(
            result.get("question", ""),
            final_response,
            result.get("expected_response", ""),
        )
        tool_eval = evaluate_tool_usage(messages, result.get("expected_tools", []))

        test_entry = {
            "test_id": result.get("test_id"),
            "question": result.get("question"),
            "response_preview": (final_response or "").strip()[:280],
            "response_metrics": response_eval,
            "tool_metrics": tool_eval,
        }
        report["per_test"].append(test_entry)

        for key in aggregates:
            aggregates[key] += (
                response_eval["metrics"].get(key, 0.0)
                if key in response_eval["metrics"]
                else tool_eval["metrics"].get(key, 0.0)
            )

        tagged_strengths = [f"[{result.get('test_id')}] {s}" for s in response_eval.get("strengths", [])]
        tagged_strengths += [f"[{result.get('test_id')}] {s}" for s in tool_eval.get("strengths", [])]
        tagged_improvements = [f"[{result.get('test_id')}] {s}" for s in response_eval.get("improvements", [])]
        tagged_improvements += [f"[{result.get('test_id')}] {s}" for s in tool_eval.get("improvements", [])]
        
        report["strengths"].extend(tagged_strengths)
        report["weaknesses"].extend(tagged_improvements)

    total_tests = max(len(report["per_test"]), 1)
    overall_metrics = {k: v / total_tests for k, v in aggregates.items()}
    report["overall"] = overall_metrics

    recommendations = []
    if overall_metrics.get("completeness", 0) < 0.7:
        recommendations.append("Increase coverage of expected answer points; ensure key facts are included.")
    if overall_metrics.get("relevance", 0) < 0.7:
        recommendations.append("Tighten responses to directly address the user's question and avoid drift.")
    if overall_metrics.get("tool_completeness", 0) < 0.8:
        recommendations.append("Invoke all required tools per scenario; add guardrails for missing calls.")
    if overall_metrics.get("tool_appropriateness", 0) < 0.8:
        recommendations.append("Prefer expected tools and avoid unnecessary calls; refine tool selection logic.")
    if overall_metrics.get("accuracy", 0) < 0.7:
        recommendations.append("Align phrasing and facts with expected responses; adjust templating or prompts.")
    if not recommendations:
        recommendations.append("Maintain current approach; consider stress-testing with harder edge cases.")
    report["recommendations"] = recommendations

    return report


def display_evaluation_report(report):
    """Pretty-print the evaluation report with clear sectioning."""
    if not report:
        print("No report to display.")
        return

    def _section(title):
        print("\n" + title)
        print("-" * len(title))

    def _fmt_tagged(item):
        if isinstance(item, str) and item.startswith("[") and "]" in item:
            end = item.find("]")
            test_id = item[1:end]
            text = item[end+1:].strip()
            return f"[{test_id}] {text}"
        return item

    def _print_list(title, items):
        _section(title)
        if items:
            for item in items:
                print(f"- {_fmt_tagged(item)}")
        else:
            print("- None noted")

    print("=== Evaluation Report ===")
    print(f"Generated at: {report.get('generated_at')}")

    overall = report.get("overall", {})
    _section("Overall Metrics")
    for k in ("accuracy", "relevance", "completeness", "usefulness", "tool_appropriateness", "tool_completeness"):
        if k in overall:
            print(f"- {k}: {overall[k]:.2f}")

    _print_list("Key Strengths", report.get("strengths", []))
    _print_list("Key Weaknesses", report.get("weaknesses", []))
    _print_list("Recommendations", report.get("recommendations", []))

    _section("Per-Test Breakdown")
    for entry in report.get("per_test", []):
        print(f"\nTest: {entry.get('test_id')} — {entry.get('question')}")
        print(f"Response preview: {entry.get('response_preview')}")
        rm = entry.get("response_metrics", {}).get("metrics", {})
        tm = entry.get("tool_metrics", {}).get("metrics", {})
        print("  Response metrics:")
        for k in ("accuracy", "relevance", "completeness", "usefulness"):
            if k in rm:
                print(f"    - {k}: {rm[k]:.2f}")
        print("  Tool metrics:")
        for k in ("tool_appropriateness", "tool_completeness"):
            if k in tm:
                print(f"    - {k}: {tm[k]:.2f}")
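
For the "identify areas for improvement" objective, it can help to rank tests by a single metric instead of reading the full report. The sketch below assumes the report structure produced by `generate_evaluation_report` (a `per_test` list whose entries carry `response_metrics["metrics"]`). `weakest_tests` is a hypothetical helper, and `sample_report` uses invented scores purely to show the shape of the data.

```python
def weakest_tests(report, metric="usefulness", n=3):
    """Return up to n (test_id, score) pairs with the lowest value of a response metric."""
    scored = []
    for entry in report.get("per_test", []):
        metrics = entry.get("response_metrics", {}).get("metrics", {})
        if metric in metrics:
            scored.append((entry.get("test_id"), metrics[metric]))
    # Sort ascending so the weakest tests come first.
    return sorted(scored, key=lambda pair: pair[1])[:n]


# Hand-made report fragment with invented scores, only to show the shape.
sample_report = {
    "per_test": [
        {"test_id": "t1", "response_metrics": {"metrics": {"usefulness": 0.40}}},
        {"test_id": "t2", "response_metrics": {"metrics": {"usefulness": 0.25}}},
        {"test_id": "t3", "response_metrics": {"metrics": {"usefulness": 0.55}}},
    ]
}
print(weakest_tests(sample_report, n=2))  # [('t2', 0.25), ('t1', 0.4)]
```

With the real report, `weakest_tests(report, metric="completeness")` would point at the scenarios whose answers miss the most expected terms.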

In [15]:
report = generate_evaluation_report(test_results)
report

{'generated_at': '2025-12-28T09:19:35.857594',
 'overall': {'accuracy': 0.01602416358305043,
  'relevance': 0.6949175759740776,
  'completeness': 0.381047286047286,
  'usefulness': 0.36570143628605284,
  'tool_appropriateness': 0.7833333333333333,
  'tool_completeness': 0.55},
 'per_test': [{'test_id': 'ev_charging_peak_avoid',
   'question': 'When should I charge my EV tomorrow to avoid peak rates and use my rooftop solar?',
   'response_preview': '### Summary of Findings\n1. **Electricity Prices for Tomorrow (October 6, 2023)**:\n   - **Off-Peak Rates**: \n     - 12 AM - 5 AM: $0.101 - $0.103 per kWh\n     - 10 PM - 11 PM: $0.101 - $0.103 per kWh\n   - **Peak Rates**: \n     - 6 AM - 9 PM: $0.175 - $0.184 per kWh\n\n2. **Weather F',
   'response_metrics': {'metrics': {'accuracy': 0.010126582278481013,
     'relevance': 0.6,
     'completeness': 0.4666666666666667,
     'usefulness': 0.36970464135021097},
    'summaries': {'accuracy': 'Weak accuracy',
     'relevance': 'Moderate relev

In [16]:
display_evaluation_report(report)

=== Evaluation Report ===
Generated at: 2025-12-28T09:19:35.857594

Overall Metrics
---------------
- accuracy: 0.02
- relevance: 0.69
- completeness: 0.38
- usefulness: 0.37
- tool_appropriateness: 0.78
- tool_completeness: 0.55

Key Strengths
-------------
- [ev_charging_peak_avoid] Used expected tools: ['get_electricity_prices', 'get_weather_forecast']
- [ev_charging_peak_avoid] All required tools were invoked.
- [ev_charging_weekend_home] Used expected tools: ['get_electricity_prices']
- [thermostat_heatwave_peak] Used expected tools: ['get_electricity_prices', 'get_weather_forecast']
- [thermostat_night_setback] Used expected tools: ['search_energy_tips']
- [laundry_offpeak] Response stays focused on the question.
- [laundry_offpeak] Used expected tools: ['get_electricity_prices']
- [dishwasher_solar_midday] Response stays focused on the question.
- [dishwasher_solar_midday] Used expected tools: ['get_electricity_prices', 'get_weather_forecast']
- [solar_self_consumption] Response