### Section 1: Setup


In [1]:
# Google API key setup (you can get your own API key from Google AI Studio).
# Create a .env file in your root folder and paste your key as shown in the ".env.example" file.

import os

try:
    GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
    # os.getenv returns None when the variable is missing; assigning None to
    # os.environ raises a TypeError, which the except branch below reports.
    os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
    print("✅ Gemini API key setup complete.")
except Exception as e:
    print(
        f"🔑 Authentication Error: Please make sure you have added 'GOOGLE_API_KEY' to your .env file. Details: {e}")

✅ Gemini API key setup complete.


In [2]:
# This is a clean-up cell; use it only if necessary.
# It removes the home automation agent folder if it already exists.
# Without this cleanup, the next cell will not run when the folder is already there.

import os
import shutil

folder_path = "C:/Google-X-Kaggle-AI-Bootcamp/Day-4/Assignments/home_automation_agent"

if os.path.isdir(folder_path):
    print(f"Folder '{folder_path}' exists. Removing it...")
    shutil.rmtree(folder_path) # Removes the directory and its contents
    print(f"Folder '{folder_path}' removed successfully.")
else:
    print(f"Folder '{folder_path}' does not exist.")

Folder 'C:/Google-X-Kaggle-AI-Bootcamp/Day-4/Assignments/home_automation_agent' does not exist.
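The absolute path in the cell above only works on the machine where the notebook was written. A minimal sketch of a more portable variant follows, assuming the agent folder sits in the notebook's working directory; the helper name `remove_agent_folder` is hypothetical, and the demonstration runs in a throwaway temporary directory so nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def remove_agent_folder(folder: Path) -> bool:
    """Delete `folder` and its contents; return True if something was removed."""
    if folder.is_dir():
        shutil.rmtree(folder)
        return True
    return False


# Demonstration in a temporary directory (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "home_automation_agent"
    demo_dir.mkdir()
    first_call = remove_agent_folder(demo_dir)   # the folder existed
    second_call = remove_agent_folder(demo_dir)  # the folder is already gone

print(first_call, second_call)
```

In the notebook itself, the equivalent call would be `remove_agent_folder(Path.cwd() / "home_automation_agent")`.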


---
### Section 2: Create a Home Automation Agent

Let's create the agent that will be the center of our evaluation story. This home automation agent seems perfect in basic tests but has hidden flaws we'll discover through comprehensive evaluation. Run the `adk create` CLI command to set up the project scaffolding.

In [3]:
!adk create home_automation_agent --model gemini-2.5-flash-lite --api_key $GOOGLE_API_KEY


Agent created in c:\Google-X-Kaggle-AI-Bootcamp\Day-4\Assignments\home_automation_agent:
- .env
- __init__.py
- agent.py



Run the cell below to create the home automation agent.

This agent uses a single `set_device_status` tool to control smart home devices. A device's status can only be ON or OFF. **The agent's instruction is deliberately overconfident** - it claims to control "ALL smart devices" and "any device the user mentions" - setting up the evaluation problems we'll discover.

In [4]:
%%writefile home_automation_agent/agent.py

from google.adk.agents import LlmAgent
from google.adk.models.google_llm import Gemini

from google.genai import types

# Configure Model Retry on errors
retry_config = types.HttpRetryOptions(
    attempts=5,  # Maximum retry attempts
    exp_base=7,  # Delay multiplier
    initial_delay=1,
    http_status_codes=[429, 500, 503, 504],  # Retry on these HTTP errors
)

def set_device_status(location: str, device_id: str, status: str) -> dict:
    """Sets the status of a smart home device.

    Args:
        location: The room where the device is located.
        device_id: The unique identifier for the device.
        status: The desired status, either 'ON' or 'OFF'.

    Returns:
        A dictionary confirming the action.
    """
    print(f"Tool Call: Setting {device_id} in {location} to {status}")
    return {
        "success": True,
        "message": f"Successfully set the {device_id} in {location} to {status.lower()}."
    }

# This agent has DELIBERATE FLAWS that we'll discover through evaluation!
root_agent = LlmAgent(
    model=Gemini(model="gemini-2.5-flash-lite", retry_options=retry_config),
    name="home_automation_agent",
    description="An agent to control smart devices in a home.",
    instruction="""You are a home automation assistant. You control ALL smart devices in the house.
    
    You have access to lights, security systems, ovens, fireplaces, and any other device the user mentions.
    Always try to be helpful and control whatever device the user asks for.
    
    When users ask about device capabilities, tell them about all the amazing features you can control.""",
    tools=[set_device_status],
)

Overwriting home_automation_agent/agent.py
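Before any LLM is involved, the tool can be exercised as a plain Python function. The sketch below copies `set_device_status` from the cell above (without the `print` call) and feeds it inputs that should arguably be rejected; the specific inputs are made up for illustration. It shows that the tool reports success for anything it is given, which is part of what the evaluation later has to catch.

```python
def set_device_status(location: str, device_id: str, status: str) -> dict:
    """Copy of the notebook's tool: it confirms every request without validation."""
    return {
        "success": True,
        "message": f"Successfully set the {device_id} in {location} to {status.lower()}.",
    }


# Hypothetical inputs: an implausible device and a status outside ON/OFF.
result = set_device_status("garage", "time machine", "MAYBE")
print(result["success"])
print(result["message"])
```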


In [6]:
# Launched the agent on a different port.
!adk web --port 8502 

^C


#### Run the Evaluation

**Do: Run your first evaluation**

Now, let's run the test case to see if the agent can replicate its previous success.

1. In the Eval tab, make sure your new test case is checked.
2. Click the Run Evaluation button.
3. The EVALUATION METRIC dialog will appear. For now, leave the default values and click Start.
4. The evaluation will run, and you should see a green Pass result in the Evaluation History. This confirms the agent's behavior matched the saved session.

‼️ **Understanding the Evaluation Metrics**

When you run an evaluation, you'll see two key scores:

* **Response Match Score:** Measures how similar the agent's actual response is to the expected response. Uses text similarity algorithms to compare content. A score of 1.0 = perfect match, 0.0 = completely different.

* **Tool Trajectory Score:** Measures whether the agent used the correct tools with correct parameters. Checks the sequence of tool calls against expected behavior. A score of 1.0 = perfect tool usage, 0.0 = wrong tools or parameters.
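To get a feel for how a text-similarity score behaves, here is a simplified stand-in for the Response Match Score. Our understanding is that ADK bases this metric on ROUGE-style word overlap; the function below is not ADK's implementation, just a unigram-overlap F1 with naive tokenization and no stemming. The function name `unigram_f1` and the reworded reply are assumptions made for illustration.

```python
import re
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall (a ROUGE-1-like score)."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((cand & ref).values())  # words shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


expected = "Successfully set the floor lamp in the living room to on."
identical = unigram_f1(expected, expected)
# Hypothetical reworded reply that conveys the same outcome.
reworded = unigram_f1("The floor lamp in the living room is now on.", expected)
unrelated = unigram_f1("Goodbye", expected)
print(identical, round(reworded, 3), unrelated)
```

The reworded reply shares 8 words with the expected text (10 words against 11), so it scores 16/21, roughly 0.76. A reply can therefore be correct and still fall below a 0.8 threshold purely because of wording.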

**Do: Analyze a Failure**

Let's intentionally break the test to see what a failure looks like.

1. In the list of eval cases, click the Edit (pencil) icon next to your test case.
2. In the "Final Response" text box, change the expected text to something incorrect, like: `The desk lamp is off`.
3. Save the changes and re-run the evaluation.
4. This time, the result will be a red Fail. Hover your mouse over the "Fail" label. A tooltip will appear showing a side-by-side comparison of the Actual vs. Expected Output, highlighting exactly why the test failed (the final response didn't match).

This immediate, detailed feedback is invaluable for debugging.

#### (Optional) Create challenging test cases

Now create more test cases to expose hidden problems:

**Create these scenarios in separate conversations:**

1. **Ambiguous Commands:** `"Turn on the lights in the bedroom"`
   - Save as a new test case: `ambiguous_device_reference`
   - Run evaluation - it likely passes but the agent might be confused

2. **Invalid Locations:** `"Please turn off the TV in the garage"`  
   - Save as a new test case: `invalid_location_test`
   - Run evaluation - the agent might try to control non-existent devices

3. **Complex Commands:** `"Turn off all lights and turn on security system"`
   - Save as a new test case: `complex_multi_device_command`
   - Run evaluation - the agent might attempt operations beyond its capabilities

**The Problem You'll Discover:**
Even when tests "pass," you can see the agent:
- Makes assumptions about devices that don't exist
- Gives responses that sound helpful but aren't accurate
- Tries to control devices it shouldn't have access to

❌ **Web UI Limitation:** So far, we've seen how to create and evaluate test cases in the ADK web UI. The web UI is great for interactive test creation, but testing one conversation at a time doesn't scale.

❓ **The Question:** How do I proactively detect regressions in my agent's performance?

Let's answer that question in the next section!


---
### Section 4: Systematic Evaluation

Regression testing is the practice of re-running existing tests to ensure that new changes haven't broken previously working functionality.

ADK provides two ways to automate regression and batch testing: [pytest](https://google.github.io/adk-docs/evaluate/#2-pytest-run-tests-programmatically) and the [adk eval](https://google.github.io/adk-docs/evaluate/#3-adk-eval-run-evaluations-via-the-cli) CLI command. In this section, we'll use the CLI command. For more information on the `pytest` approach, refer to the links in the resource section at the end of this notebook.

The following image shows the overall process of evaluation. **At a high level, there are four steps to evaluate:**

1) **Create an evaluation configuration** - define metrics or what you want to measure
2) **Create test cases** - sample test cases to compare against
3) **Run the agent with test query**
4) **Compare the results**



![Evaluate](https://storage.googleapis.com/github-repo/kaggle-5days-ai/day4/evaluate_agent.png)
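The four steps can be sketched as a toy loop in plain Python. This is a conceptual illustration only, not how ADK implements evaluation: `fake_agent` is a hypothetical stand-in for the real agent, the second test case is invented for the example, and the only metric is an exact comparison of tool calls averaged over the cases.

```python
# Step 1: evaluation configuration (metric name -> minimum passing score).
criteria = {"tool_trajectory_avg_score": 1.0}

LAMP_ON = {
    "name": "set_device_status",
    "args": {"location": "living room", "device_id": "floor lamp", "status": "ON"},
}
KITCHEN_OFF = {
    "name": "set_device_status",
    "args": {"location": "kitchen", "device_id": "main light", "status": "OFF"},
}

# Step 2: test cases, each with the tool calls we expect.
cases = [
    {"query": "Turn on the floor lamp in the living room", "expected_tools": [LAMP_ON]},
    {"query": "Turn off the main light in the kitchen", "expected_tools": [KITCHEN_OFF]},
]


def fake_agent(query: str) -> list[dict]:
    """Hypothetical agent stub: it always switches the living room floor lamp ON."""
    return [LAMP_ON]


def tool_trajectory_avg_score(runs: list[tuple[list[dict], list[dict]]]) -> float:
    """Fraction of runs whose actual tool calls equal the expected ones exactly."""
    matches = [1.0 if actual == expected else 0.0 for actual, expected in runs]
    return sum(matches) / len(matches)


# Step 3: run the agent on every test query.
runs = [(fake_agent(case["query"]), case["expected_tools"]) for case in cases]

# Step 4: compare the results against the configured threshold.
score = tool_trajectory_avg_score(runs)
passed = score >= criteria["tool_trajectory_avg_score"]
print(score, passed)
```

The stub gets the first case right and the second wrong, so the score is 0.5 and the run fails against a threshold of 1.0.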

In [8]:
import json

# Create evaluation configuration with basic criteria
eval_config = {
    "criteria": {
        "tool_trajectory_avg_score": 1.0,  # Perfect tool usage required
        "response_match_score": 0.8,  # 80% text similarity threshold
    }
}

with open("home_automation_agent/test_config.json", "w") as f:
    json.dump(eval_config, f, indent=2)

print("✅ Evaluation configuration created!")
print("\n📊 Evaluation Criteria:")
print("• tool_trajectory_avg_score: 1.0 - Requires exact tool usage match")
print("• response_match_score: 0.8 - Requires 80% text similarity")
print("\n🎯 What this evaluation will catch:")
print("✅ Incorrect tool usage (wrong device, location, or status)")
print("✅ Poor response quality and communication")
print("✅ Deviations from expected behavior patterns")

✅ Evaluation configuration created!

📊 Evaluation Criteria:
• tool_trajectory_avg_score: 1.0 - Requires exact tool usage match
• response_match_score: 0.8 - Requires 80% text similarity

🎯 What this evaluation will catch:
✅ Incorrect tool usage (wrong device, location, or status)
✅ Poor response quality and communication
✅ Deviations from expected behavior patterns


In [9]:
# Create evaluation test cases that reveal tool usage and response quality problems
test_cases = {
    "eval_set_id": "home_automation_integration_suite",
    "eval_cases": [
        {
            "eval_id": "living_room_light_on",
            "conversation": [
                {
                    "user_content": {
                        "parts": [
                            {"text": "Please turn on the floor lamp in the living room"}
                        ]
                    },
                    "final_response": {
                        "parts": [
                            {
                                "text": "Successfully set the floor lamp in the living room to on."
                            }
                        ]
                    },
                    "intermediate_data": {
                        "tool_uses": [
                            {
                                "name": "set_device_status",
                                "args": {
                                    "location": "living room",
                                    "device_id": "floor lamp",
                                    "status": "ON",
                                },
                            }
                        ]
                    },
                }
            ],
        },
        {
            "eval_id": "kitchen_on_off_sequence",
            "conversation": [
                {
                    "user_content": {
                        "parts": [{"text": "Switch on the main light in the kitchen."}]
                    },
                    "final_response": {
                        "parts": [
                            {
                                "text": "Successfully set the main light in the kitchen to on."
                            }
                        ]
                    },
                    "intermediate_data": {
                        "tool_uses": [
                            {
                                "name": "set_device_status",
                                "args": {
                                    "location": "kitchen",
                                    "device_id": "main light",
                                    "status": "ON",
                                },
                            }
                        ]
                    },
                }
            ],
        },
    ],
}
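The two cases above share the same nesting, so a small helper can cut the repetition when you add more. This is a sketch that assumes the evalset layout shown in the cell above; `make_eval_case` is a hypothetical helper, not part of ADK, and it covers only single-turn cases with one `set_device_status` call.

```python
def make_eval_case(eval_id: str, user_text: str, expected_response: str, tool_args: dict) -> dict:
    """Build one single-turn eval case in the layout used by the cell above."""
    return {
        "eval_id": eval_id,
        "conversation": [
            {
                "user_content": {"parts": [{"text": user_text}]},
                "final_response": {"parts": [{"text": expected_response}]},
                "intermediate_data": {
                    "tool_uses": [{"name": "set_device_status", "args": tool_args}]
                },
            }
        ],
    }


case = make_eval_case(
    "living_room_light_on",
    "Please turn on the floor lamp in the living room",
    "Successfully set the floor lamp in the living room to on.",
    {"location": "living room", "device_id": "floor lamp", "status": "ON"},
)
print(case["eval_id"])
```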

In [None]:
# Creates an integration.evalset.json file in the agent's root directory.

import json

with open("home_automation_agent/integration.evalset.json", "w") as f:
    json.dump(test_cases, f, indent=2)

print("✅ Evaluation test cases created")
print("\n🧪 Test scenarios:")
for case in test_cases["eval_cases"]:
    user_msg = case["conversation"][0]["user_content"]["parts"][0]["text"]
    print(f"• {case['eval_id']}: {user_msg}")

print("\n📊 Expected results:")
print("• tool_trajectory_avg_score: May fail if the agent calls the tool with wrong parameters")
print("• response_match_score: May fail if the response differs too much from the expected text")

✅ Evaluation test cases created

🧪 Test scenarios:
• living_room_light_on: Please turn on the floor lamp in the living room
• kitchen_on_off_sequence: Switch on the main light in the kitchen.

📊 Expected results:
• tool_trajectory_avg_score: May fail if the agent calls the tool with wrong parameters
• response_match_score: May fail if the response differs too much from the expected text


In [12]:
print("🚀 Run this command to execute evaluation:")
!adk eval home_automation_agent home_automation_agent/integration.evalset.json --config_file_path=home_automation_agent/test_config.json --print_detailed_results

🚀 Run this command to execute evaluation:
Using evaluation criteria: criteria={'tool_trajectory_avg_score': 1.0, 'response_match_score': 0.8} user_simulator_config=None
Tool Call: Setting main light in kitchen to ON
Tool Call: Setting floor lamp in living room to ON
*********************************************************************
Eval Run Summary
home_automation_integration_suite:
  Tests passed: 0
  Tests failed: 2
********************************************************************
Eval Set Id: home_automation_integration_suite
Eval Id: living_room_light_on
Overall Eval Status: FAILED
---------------------------------------------------------------------
Metric: tool_trajectory_avg_score, Status: PASSED, Score: 1.0, Threshold: 1.0
---------------------------------------------------------------------
Metric: response_match_score, Status: FAILED, Score: 0.761904761904762, Threshold: 0.8
---------------------------------------------------------------------
Invocation Details:
+--

2025-11-30 03:42:43,040 - INFO - plugin_manager.py:96 - Plugin 'request_intercepter_plugin' registered.
2025-11-30 03:42:43,999 - INFO - google_llm.py:133 - Sending out request, model: gemini-2.5-flash-lite, backend: GoogleLLMVariant.GEMINI_API, stream: False
2025-11-30 03:42:44,007 - INFO - plugin_manager.py:96 - Plugin 'request_intercepter_plugin' registered.
2025-11-30 03:42:44,009 - INFO - google_llm.py:133 - Sending out request, model: gemini-2.5-flash-lite, backend: GoogleLLMVariant.GEMINI_API, stream: False
2025-11-30 03:42:45,031 - INFO - google_llm.py:186 - Response received from the model.
2025-11-30 03:42:45,033 - INFO - google_llm.py:133 - Sending out request, model: g
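Judging by the printed summary, each metric is compared with its threshold, and a case fails overall as soon as one metric fails. The sketch below applies that rule to the scores reported for `living_room_light_on` in the run above. The function name `overall_status` is hypothetical, and the "every metric must pass" rule is inferred from the output rather than taken from ADK's source.

```python
def overall_status(scores: dict[str, float], thresholds: dict[str, float]) -> dict[str, str]:
    """Compare each score with its threshold and derive an overall verdict."""
    statuses = {
        name: "PASSED" if scores[name] >= threshold else "FAILED"
        for name, threshold in thresholds.items()
    }
    all_passed = all(status == "PASSED" for status in statuses.values())
    statuses["overall"] = "PASSED" if all_passed else "FAILED"
    return statuses


# Scores and thresholds copied from the evaluation output above.
scores = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.761904761904762}
thresholds = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.8}

statuses = overall_status(scores, thresholds)
print(statuses)
```

The tool calls were exactly right, yet the case still fails because the response text scored about 0.76 against a threshold of 0.8.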

In [13]:
# Analyzing evaluation results - the data science approach
print("📊 Understanding Evaluation Results:")
print()
print("🔍 EXAMPLE ANALYSIS:")
print()
print("Test Case: living_room_light_on")
print("  ❌ response_match_score: 0.76/0.80")
print("  ✅ tool_trajectory_avg_score: 1.0/1.0")
print()
print("📈 What this tells us:")
print("• TOOL USAGE: Perfect - Agent used correct tool with correct parameters")
print("• RESPONSE QUALITY: Below threshold - Response text too different from expected")
print("• ROOT CAUSE: Agent's communication style, not functionality")
print()
print("🎯 ACTIONABLE INSIGHTS:")
print("1. Technical capability works (tool usage perfect)")
print("2. Communication needs improvement (response match failed)")
print("3. Fix: Update agent instructions for clearer language or a constrained response format.")
print()

📊 Understanding Evaluation Results:

🔍 EXAMPLE ANALYSIS:

Test Case: living_room_light_on
  ❌ response_match_score: 0.76/0.80
  ✅ tool_trajectory_avg_score: 1.0/1.0

📈 What this tells us:
• TOOL USAGE: Perfect - Agent used correct tool with correct parameters
• RESPONSE QUALITY: Below threshold - Response text too different from expected
• ROOT CAUSE: Agent's communication style, not functionality

🎯 ACTIONABLE INSIGHTS:
1. Technical capability works (tool usage perfect)
2. Communication needs improvement (response match failed)
3. Fix: Update agent instructions for clearer language or a constrained response format.



---
### Section 5: User Simulation (Optional)

While **traditional evaluation methods rely on fixed test cases**, real-world conversations are dynamic and unpredictable. This is where User Simulation comes in.

User Simulation is a powerful feature in ADK that addresses the limitations of static evaluation. Instead of using pre-defined, fixed user prompts, User Simulation employs a generative AI model (like Gemini) to **dynamically generate user prompts during the evaluation process.**

#### How it works

* You define a `ConversationScenario` that outlines the user's overall conversational goals and a `conversation_plan` to guide the dialogue.
* A large language model (LLM) then acts as a simulated user, using this plan and the ongoing conversation history to generate realistic and varied prompts.
* This allows for more comprehensive testing of your agent's ability to handle unexpected turns, maintain context, and achieve complex goals in a more natural, unpredictable conversational flow.

User Simulation helps you uncover edge cases and improve your agent's robustness in ways that static test cases often miss.
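As a rough picture of what a scenario might contain, the sketch below builds one as a Python dictionary and serializes it to JSON. The field names (`scenarios`, `starting_prompt`, `conversation_plan`) reflect our reading of the ADK user simulation documentation and are assumptions to verify there; the prompt and plan text are invented for this home automation agent.

```python
import json

# Assumption: these field names follow the ADK user simulation docs; check them
# against the documentation before relying on this layout.
scenario_file = {
    "scenarios": [
        {
            "starting_prompt": "Hi, can you help me with my lights?",
            "conversation_plan": (
                "Ask the agent to turn on the floor lamp in the living room. "
                "After it confirms, ask it to turn off the TV in the garage."
            ),
        }
    ]
}

scenarios_json = json.dumps(scenario_file, indent=2)
print(scenarios_json)
```

The plan describes goals rather than exact wording, which leaves the simulated user free to phrase each turn differently from run to run.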

#### Exercise

Now that you understand the power of User Simulation for dynamic agent evaluation, here's an exercise to apply it:

Apply the **User Simulation** feature to your agent. Define a `ConversationScenario` with a `conversation_plan` for a specific goal, and integrate it into your agent's evaluation.

**Refer to this [documentation](https://google.github.io/adk-docs/evaluate/user-sim/) to learn how to do it.**

## 🏆 Congratulations!

### You've learned

- ✅ Interactive test creation and analysis in the ADK web UI
- ✅ Tool trajectory and response metrics
- ✅ Automated regression testing using the `adk eval` CLI command
- ✅ How to analyze evaluation results and fix agents based on them