# Agentic Platform: Agent Evaluation

This lab introduces the concept of Agent Evaluation. Effective evaluation frameworks are essential for measuring, comparing and improving agent performance across different implementations and configurations.

There are several approaches to evaluating agentic systems, ranging from simple to complex. In this lab we'll focus on two evaluation approaches:
* Assertion-based evaluations
* Step-based procedural analysis

The most sophisticated evaluation systems incorporate human feedback loops and nuanced assessments of agent behavior. While powerful, these systems often require significant infrastructure and human resources to implement effectively.

To get started, we'll explore a couple of approaches to an evaluation framework focused on two key metrics:
1. Task success rate - measuring whether agents complete their assigned tasks correctly
2. Steps to completion - analyzing the efficiency of agents by tracking the number of steps required to achieve success

Assertions are essentially test cases. Sometimes we're looking for exact answers (What's the capital of France? -> Paris). Sometimes the evaluation criterion is qualitative (did the agent provide a recipe that was gluten free?).
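
As a rough sketch of the two kinds of assertion: an exact-answer assertion can be checked deterministically, while a qualitative one is written as a natural-language statement for a judge model to grade. The helper name `exact_match` and the sample strings are illustrative assumptions, not part of the platform code.

```python
def exact_match(response: str, expected: str) -> bool:
    """Deterministic check: does the response contain the expected answer?"""
    return expected.strip().lower() in response.strip().lower()

# Exact-answer assertion: can be verified with plain string matching.
print(exact_match("The capital of France is Paris.", "Paris"))

# Qualitative assertion: stated in natural language and graded by an LLM judge,
# because no simple string comparison can decide it.
qualitative_assertion = "The recipe in the response is gluten free"
```

String matching is cheap and repeatable, so it is worth using wherever the expected answer is short and unambiguous; the judge-based approach is only needed for criteria like the second one.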

These metrics provide a straightforward but powerful foundation for comparing agent performance across different models, prompting strategies, and tool configurations.
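
To make the two metrics concrete, here is a minimal sketch that computes them for two configurations. The `RunRecord` type, the `summarize` helper, and all of the numbers are made up for illustration; they are not results from this lab.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

@dataclass
class RunRecord:
    """One agent run: did it succeed, and how many steps did it take?"""
    success: bool
    steps: int

def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Return the task success rate and the mean step count of successful runs."""
    successes = [r for r in runs if r.success]
    return {
        "success_rate": len(successes) / len(runs) if runs else 0.0,
        "avg_steps_to_success": mean(r.steps for r in successes) if successes else float("nan"),
    }

# Made-up results for two hypothetical configurations.
config_a = [RunRecord(True, 2), RunRecord(True, 4), RunRecord(False, 6)]
config_b = [RunRecord(True, 3), RunRecord(False, 5), RunRecord(False, 7)]

for name, runs in {"config_a": config_a, "config_b": config_b}.items():
    print(name, summarize(runs))
```

Steps are averaged over successful runs only, on the assumption that the step count of a failed run says little about efficiency.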

First, let's modify a couple of the agents we built in the previous labs.

In [1]:
from pydantic_ai import Agent
from pydantic import BaseModel
from tavily import TavilyClient
from typing import Dict
import os
from dotenv import load_dotenv

from agentic_platform.core.models.memory_models import ToolResult
from mcp import ListToolsResult
from mcp.server import FastMCP
from typing import List, Any

load_dotenv()

# First we wrap our own MCP server in a MCPServerStdio object
from pydantic_ai.mcp import MCPServerStdio as PyAIMCPServerStdio

# Load our API key from the environment variable
TAVILY_API_KEY = os.getenv('TAVILY_API_KEY')

# Now let's create our research tools
class WebSearch(BaseModel):
    query: str

server = FastMCP()

def search_web(query: WebSearch) -> List[Dict[str, Any]]:
    '''Search the web to get back a list of results and content.'''
    # Reuse the key loaded above from the TAVILY_API_KEY environment variable.
    client: TavilyClient = TavilyClient(api_key=TAVILY_API_KEY)
    response: Dict[str, Any] = client.search(query=query.query)
    return response['results']



In [None]:
# Next, let's create our researcher agent.
from pydantic_ai import Agent as PyAIAgent
from agentic_platform.core.models.prompt_models import BasePrompt
from pydantic import BaseModel

RESEARCHER_SYSTEM_PROMPT = """
You are a helpful research agent with web search capabilities. Your job is to:

1. Search for accurate, up-to-date information about any topic
2. Provide clear and concise answers to the user's question. Provide a source annotation at the end of your response that maps to the source links below.
3. Cite your sources with numbered links at the end of your response.

Always be factual and objective in your research. Be clear and concise in your responses.
Only answer the immediate question, you do not need to provide any additional context or commentary.
If someone asks who the CEO of a company is, you should just say their name for example.
"""

# Define the expected output structure
class ResearchResponse(BaseModel):
    content: str
    sources: list[str]

# Create the agent with the simplified prompt
researcher_agent = PyAIAgent(
    'bedrock:us.anthropic.claude-3-5-haiku-20241022-v1:0',
    system_prompt=RESEARCHER_SYSTEM_PROMPT,
    result_type=ResearchResponse,
)

researcher_agent.tool_plain(search_web)

In [None]:
import nest_asyncio
nest_asyncio.apply()

test_case = "Who is the CEO of Amazon?"

result = researcher_agent.run_sync(test_case)

result.output



In [None]:
# Print out the conversation history to see where the results came from. 
result.all_messages()

# Write Assertions.
Now that we have a researcher agent using Tavily web search, let's evaluate how well it does. There are many frameworks that do evaluations for you, including LangChain, Ragas, and Pydantic Evals. However, it's valuable to understand what's going on under the hood by at least seeing it written out from scratch.

Below is a sample implementation of an assertion based evaluator. In practice, you'd want to parallelize these for speed like we did in the module 1 evaluation notebook.
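
One possible way to parallelize the checks is sketched here, assuming the judge call is a blocking function. `check_assertion` is a stand-in that only simulates latency, and `check_all` and `max_concurrency` are hypothetical names; the evaluator in this notebook runs its checks sequentially.

```python
import asyncio
import time
from typing import List

def check_assertion(assertion: str) -> bool:
    """Stand-in for a blocking judge call (for example, a Bedrock request)."""
    time.sleep(0.1)  # simulate network latency
    return "fail" not in assertion

async def check_all(assertions: List[str], max_concurrency: int = 4) -> List[bool]:
    """Run blocking checks in worker threads, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(assertion: str) -> bool:
        async with semaphore:
            return await asyncio.to_thread(check_assertion, assertion)

    # gather preserves input order, so results line up with the assertions.
    return await asyncio.gather(*(guarded(a) for a in assertions))

# In a notebook you can also `await check_all(...)` directly.
results = asyncio.run(check_all(["a passes", "b should fail", "c passes"]))
print(results)
```

The semaphore caps the number of in-flight requests, which helps when the model endpoint enforces rate limits.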

In [5]:
import asyncio
import json
import boto3
from typing import List, Callable, Awaitable, Optional
from pydantic import BaseModel

class TestCase(BaseModel):
    name: str
    query: str
    assertions: List[str]

class AgentResponse(BaseModel):
    content: str
    sources: List[str]
    success: bool
    error: Optional[str] = None  # populated only when the agent run fails

class AssertionResult(BaseModel):
    assertion: str
    passed: bool

class TestResult(BaseModel):
    name: str
    query: str
    assertions_results: List[AssertionResult]
    agent_failed: bool = False
    agent_error: Optional[str] = None  # populated only when the agent run fails
    
    @property
    def total_assertions(self) -> int:
        return len(self.assertions_results)
    
    @property
    def passed_assertions(self) -> int:
        return sum(1 for r in self.assertions_results if r.passed)
    
    @property
    def success_rate(self) -> float:
        if self.agent_failed or self.total_assertions == 0:
            return 0.0
        return self.passed_assertions / self.total_assertions

class AssertionEvaluator:
    def __init__(self, model_id: str = 'us.anthropic.claude-3-5-haiku-20241022-v1:0', region: str = 'us-east-1'):
        self.bedrock = boto3.client('bedrock-runtime', region_name=region)
        self.model_id = model_id
    
    def _check_assertion(self, query: str, response: AgentResponse, assertion: str) -> bool:
        prompt = f"""Check if this assertion is TRUE or FALSE based on the research response.
        
        QUERY: {query}
        RESPONSE: {response.content}
        SOURCES: {response.sources}
        
        ASSERTION TO CHECK: {assertion}
        
        Return only "TRUE" or "FALSE" (no explanation needed)."""
        
        try:
            resp = self.bedrock.converse(
                modelId=self.model_id,
                messages=[{"role": "user", "content": [{"text": prompt}]}],
                inferenceConfig={"maxTokens": 50, "temperature": 0}
            )
            content = resp['output']['message']['content'][0]['text'].strip().upper()
            return "TRUE" in content
        except Exception as e:
            print(f"Error checking assertion: {e}")
            return False
    
    async def evaluate_test_case(self, test_case: TestCase, agent_function: Callable[[str], Awaitable[AgentResponse]]) -> TestResult:
        print(f"🧪 Running: {test_case.name}")
        
        response = await agent_function(test_case.query)
        
        if not response.success:
            print(f"❌ Agent failed: {response.error}")
            return TestResult(
                name=test_case.name,
                query=test_case.query,
                assertions_results=[],
                agent_failed=True,
                agent_error=response.error
            )
        
        print(f"💬 Response: {response.content[:100]}...")
        
        assertion_results = []
        for i, assertion in enumerate(test_case.assertions, 1):
            print(f"  Checking assertion {i}...")
            passed = self._check_assertion(test_case.query, response, assertion)
            assertion_results.append(AssertionResult(assertion=assertion, passed=passed))
        
        return TestResult(
            name=test_case.name,
            query=test_case.query,
            assertions_results=assertion_results
        )
    
    async def evaluate_test_cases(self, test_cases: List[TestCase], agent_function: Callable[[str], Awaitable[AgentResponse]]) -> List[TestResult]:
        results = []
        for test_case in test_cases:
            result = await self.evaluate_test_case(test_case, agent_function)
            results.append(result)
            print()
        return results
    
    @staticmethod
    def print_results(results: List[TestResult]):
        print("\n" + "="*80)
        print("📊 EVALUATION RESULTS")
        print("="*80)
        
        for result in results:
            print(f"\n{result.name}: {result.passed_assertions}/{result.total_assertions} passed")
            
            if result.agent_failed:
                print(f"  ❌ Agent execution failed: {result.agent_error}")
                continue
                
            for j, assertion_result in enumerate(result.assertions_results, 1):
                symbol = "✅" if assertion_result.passed else "❌"
                status = "PASSED" if assertion_result.passed else "FAILED"
                print(f"  {symbol} Assertion {j} ({status}): {assertion_result.assertion}")
        
        successful_results = [r for r in results if not r.agent_failed]
        total_tests = len(results)
        successful_tests = len(successful_results)
        
        if successful_tests > 0:
            total_assertions = sum(r.total_assertions for r in successful_results)
            passed_assertions = sum(r.passed_assertions for r in successful_results)
            
            print(f"\n📈 SUMMARY:")
            print(f"  Tests: {successful_tests}/{total_tests} successful")
            print(f"  Assertions: {passed_assertions}/{total_assertions} passed")
            print(f"  Success Rate: {(passed_assertions/total_assertions)*100:.1f}%")
            
            # Show failed assertions
            failed_assertions = []
            for result in successful_results:
                for i, assertion_result in enumerate(result.assertions_results, 1):
                    if not assertion_result.passed:
                        failed_assertions.append(f"{result.name} - Assertion {i}: {assertion_result.assertion}")
            
            if failed_assertions:
                print(f"\n❌ FAILED ASSERTIONS:")
                for failed in failed_assertions:
                    print(f"  • {failed}")
        else:
            print(f"\n❌ All {total_tests} tests failed")

# Run Our Assertions
Next we'll create a function to evaluate against and pass in a test set. 



In [None]:
# Function to run the researcher agent
async def researcher_agent_function(query: str) -> AgentResponse:
    try:
        async with researcher_agent.run_mcp_servers():
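            # We are already inside a coroutine, so await the async API
            # instead of calling the blocking run_sync wrapper.
            result = await researcher_agent.run(query)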
        return AgentResponse(
            content=result.output.content,
            sources=result.output.sources,
            success=True
        )
    except Exception as e:
        return AgentResponse(
            content="",
            sources=[],
            success=False,
            error=str(e)
        )

# Test cases
test_cases = [
    TestCase(
        name='Amazon CEO Test',
        query='Who is the current CEO of Amazon?',
        assertions=[
            'The response correctly identifies Andy Jassy as the current CEO of Amazon',
            'The response mentions when Andy Jassy became CEO (2021)',
            'The response includes at least one credible source citation',
            'The response is concise and directly answers the question'
        ]
    ),
    TestCase(
        name='Paris Population Test',
        query='What is the population of the capital of France?',
        assertions=[
            'The response correctly identifies Paris as the capital of France',
            'The response provides a specific population figure for Paris',
            'The response distinguishes between city proper and metropolitan area population',
            'The response cites at least two reliable sources',
            'The response indicates when the population data was collected/estimated'
        ]
    )
]

# Usage example
async def run_evaluation():
    print("🚀 Starting Assertion-Based Evaluation\n")
    evaluator = AssertionEvaluator()
    results = await evaluator.evaluate_test_cases(test_cases, researcher_agent_function)
    evaluator.print_results(results)

# Execute
await run_evaluation()

As you can see, we got our results back. Evals are finicky: they rarely pass at 100%, and some failures are expected. Because LLMs are non-deterministic, running the same evaluation multiple times can produce different results. What constitutes a passing run is ultimately up to the owner of the change, since one metric might go up while another goes down.
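
One common way to cope with this variability is to repeat the evaluation and look at per-assertion pass rates rather than a single run. The sketch below uses a fake, randomly flaky evaluation in place of the real agent and judge; `pass_rates` and `fake_run` are hypothetical names.

```python
import random
from typing import Callable, Dict, List

def pass_rates(run_once: Callable[[], Dict[str, bool]], n_runs: int = 5) -> Dict[str, float]:
    """Repeat an evaluation and report the fraction of runs each assertion passed."""
    outcomes: Dict[str, List[bool]] = {}
    for _ in range(n_runs):
        for assertion, passed in run_once().items():
            outcomes.setdefault(assertion, []).append(passed)
    return {a: sum(results) / len(results) for a, results in outcomes.items()}

# Stand-in for one evaluation pass; a real version would call the agent and judge.
rng = random.Random(0)

def fake_run() -> Dict[str, bool]:
    return {
        "identifies the CEO": True,
        "cites a source": rng.random() < 0.7,  # flaky on purpose
    }

print(pass_rates(fake_run, n_runs=10))
```

An assertion that passes in most but not all runs points to a flaky behavior worth investigating, whereas one that never passes is more likely a genuine gap or an unrealistic assertion.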

Lastly, writing all your test cases in code is not ideal. Let's export this dataset to a JSON file so it's more reusable across different platforms and approaches to evaluation. We can write utility functions to import and export them.

In [None]:
from pathlib import Path

# Utility functions for test case management
def export_test_cases(test_cases: List[TestCase], file_path: str):
    """Export test cases to JSON file"""
    path = Path(file_path)
    # Create the parent directory (e.g. data/) if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    data = [test_case.model_dump() for test_case in test_cases]
    path.write_text(json.dumps(data, indent=2))
    print(f"✅ Exported {len(test_cases)} test cases to {file_path}")

def import_test_cases(file_path: str) -> List[TestCase]:
    """Import test cases from JSON file"""
    data = json.loads(Path(file_path).read_text())
    test_cases = [TestCase(**item) for item in data]
    print(f"✅ Imported {len(test_cases)} test cases from {file_path}")
    return test_cases

# Export test cases
test_cases_path = Path('data/test_cases.json')
export_test_cases(test_cases, str(test_cases_path))

# Import test cases  
loaded_test_cases = import_test_cases(str(test_cases_path))

# View the saved file
print(test_cases_path.read_text())

# Tool Use Evaluation
Next, let's go a bit deeper and evaluate tool use. Every evaluation framework has a different method of doing this. Ragas, a popular option, measures tool adherence, tool call accuracy, and so on. We'll opt for a slightly different approach, and because none of these frameworks support it, we'll write it ourselves.

Agents can take nondeterministic paths. The important thing is that they get to the right answer in a reasonable number of steps. In cases where we do know the steps the agent would need to take, we can create logical groupings of tool invocations. Below is an implementation of that.

In [None]:
from agentic_platform.core.models.memory_models import Message, SessionContext
from pydantic_ai.agent import AgentRunResult

test_runs: Dict[str, Any] = {}

# Define the input type for the researcher agent
class ResearchQuery(BaseModel):
    user_query: str

# Updated to use assertion evaluator
async def perform_research_2(query: str) -> AgentResponse:
    try:
        async with researcher_agent.run_mcp_servers():
            # Await the async API; run_sync is meant for code outside a coroutine.
            result: AgentRunResult[ResearchResponse] = await researcher_agent.run(query)
        
        test_runs[query] = result.all_messages()
        
        return AgentResponse(
            content=result.output.content,
            sources=result.output.sources,
            success=True
        )
    except Exception as e:
        return AgentResponse(
            content="",
            sources=[],
            success=False,
            error=str(e)
        )

# Run the assertion evaluation
print("🚀 Starting Assertion-Based Evaluation\n")
evaluator = AssertionEvaluator()
results = await evaluator.evaluate_test_cases(test_cases, perform_research_2)
evaluator.print_results(results)

print('-' * 100)

for k, v in test_runs.items():
    print(k)
    print(v)
    print('-' * 100)

Let's convert our test runs into our own message format so we can run our own evaluation metrics.

In [None]:
# Now let's convert the messages to our types using the converters we've already built.
from typing import List
from agentic_platform.core.converter.pydanticai_converters import PydanticAIMessageConverter

# Convert the messages to our types.
test_runs_formatted: Dict[str, List[Message]] = {}
for k, v in test_runs.items():
    print(v)
    test_runs_formatted[k] = PydanticAIMessageConverter.convert_messages(v)

for k, v in test_runs_formatted.items():
    print(k)
    print(v)
    print('-' * 100)



Lastly, let's write our evaluation harness. To do this we need our test cases. The steps-to-completion metric is computed by creating logical groupings of the steps the agent takes to complete a task. Because agents are autonomous, they might take different paths to get to the same answer, which is why logical groupings are important.
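
As a simplified, deterministic illustration of the grouping idea, the sketch below collapses consecutive calls to the same tool into one logical step and then computes a step delta. This naive rule is an assumption made for illustration only; the harness that follows delegates the grouping to an LLM, and the tool names in the trace are hypothetical.

```python
from itertools import groupby
from typing import List

def group_steps(tool_calls: List[str]) -> List[str]:
    """Collapse consecutive calls to the same tool into one logical step."""
    return [name for name, _ in groupby(tool_calls)]

def step_delta(actual_steps: List[str], expected_steps: List[str]) -> int:
    """Positive means the agent took more logical steps than expected."""
    return len(actual_steps) - len(expected_steps)

# Hypothetical trace: two searches in a row, then a final answer.
trace = ["search_web", "search_web", "final_answer"]
expected = ["Search the web", "Format results into a response"]

steps = group_steps(trace)
print(steps, step_delta(steps, expected))
```

A rule like this cannot tell two purposeful searches apart from a redundant retry, which is one reason to let a judge model name the groupings instead.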

In [10]:
from typing import Literal, Callable
from pydantic import Field
import json
from typing import Tuple
from pydantic_core import to_jsonable_python
from agentic_platform.core.models.llm_models import LLMRequest, LLMResponse
from agentic_platform.core.models.prompt_models import BasePrompt
from agentic_platform.core.models.tool_models import ToolSpec

from agentic_platform.core.converter.llm_request_converters import ConverseRequestConverter
from agentic_platform.core.converter.llm_response_converters import ConverseResponseConverter
from concurrent.futures import ThreadPoolExecutor
import boto3

SYSTEM_PROMPT = """
You are an expert evaluator of agentic systems. You are given a list of steps the agent took to complete a task and a list of expected steps. 
You need to evaluate the agent's performance based on the expected steps.

You should take the steps inputted and create "logical groupings" of those steps. Those grouping names should come from the expected steps if similar. 
If the agent took a different path, you should create a new grouping name so we can evaluate the agent's performance.

When creating logical groupings, group things together. Instead of saying search the web, gather results. Group them into one step even if the message history shows them as separate steps.
"""

USER_PROMPT = """
Here are the steps the agent took to complete the task:
{steps}

Here are the expected steps:
{expected_steps}

Here is the task success criteria:
{success_criteria}

Before answering, think about the steps and how the agent took them to complete the task.
"""

# Types for the evaluation.

class AgentEvalPrompt(BasePrompt):
    system_prompt: str = SYSTEM_PROMPT
    user_prompt: str = USER_PROMPT

class AgentEvalResult(BaseModel):
    '''Evaluation results for the agent.'''
    thoughts: str = Field(
        description="The evaluator's thoughts on the agent's performance."
    )
    
    steps: List[str] = Field(
        description="The logical groupings of actions taken by the agent to complete the task, typically representing tool calls or major decision points."
    )
    
    task_success: bool = Field(
        description="Whether the agent successfully completed the task according to the defined success criteria. True indicates success, False indicates failure."
    )

class AgentEvalSample(BaseModel):
    user_input: str
    expected_steps: List[str]
    expected_output: Any
    success_criteria: Literal['Got Corect Answer', '< 1 step from gold standard']


client = boto3.client('bedrock-runtime')

# Helper function to call the LLM.
def call_bedrock(request: LLMRequest) -> LLMResponse:
    kwargs: Dict[str, Any] = ConverseRequestConverter.convert_llm_request(request)
    converse_response: Dict[str, Any] = client.converse(**kwargs)
    return ConverseResponseConverter.to_llm_response(converse_response)


def eval_function(sample: AgentEvalSample, context: List[Message]) -> AgentEvalResult:
    # Create the input for the prompt.
    inputs={
        'steps': json.dumps(to_jsonable_python(context)),
        'expected_steps': '\n'.join(sample.expected_steps),
        'success_criteria': sample.success_criteria
    }

    # Format the prompt.
    prompt: BasePrompt = AgentEvalPrompt(inputs=inputs)

    # Create a tool spec for the structured output.
    tool_spec: ToolSpec = ToolSpec(
        name='AgentEvalResult',
        description="Structured output for an agent's performance.",
        model=AgentEvalResult,
    )

    # Create the LLM request and force structured output through a tool call.
    llm_request: LLMRequest = LLMRequest(
        system_prompt=prompt.system_prompt,
        model_id='us.anthropic.claude-3-5-haiku-20241022-v1:0',
        messages=[Message(role='user', text=prompt.user_prompt)],
        hyperparams={"temperature": 0.2},
        tools=[tool_spec],
        force_tool=tool_spec.name
    )

    # Call the LLM and get results.
    llm_response: LLMResponse = call_bedrock(llm_request)
    return AgentEvalResult(**llm_response.tool_calls[0].arguments)

def run_function(sample: AgentEvalSample) -> List[Message]:
    result: AgentRunResult[ResearchResponse] = researcher_agent.run_sync(sample.user_input)
    # These are still PydanticAI message objects; they become our Message type below.
    messages: List[Any] = result.all_messages()
    return PydanticAIMessageConverter.convert_messages(messages)

class AgentEvalHarness:

    def __init__(self, 
                 samples: List[AgentEvalSample], 
                 eval_function: Callable[[AgentEvalSample, List[Message]], AgentEvalResult],
                 run_function: Callable[[AgentEvalSample], List[Message]]):
        
        self.samples = samples
        self.eval_function = eval_function
        self.run_function = run_function

    def evaluate_sample(self, sample: AgentEvalSample) -> Tuple[AgentEvalSample, AgentEvalResult]:
        context: List[Message] = self.run_function(sample)
        result: AgentEvalResult = self.eval_function(sample, context)
        return sample, result
    
    
    def evaluate_threaded(self, num_workers: int = 2) -> List[Tuple[AgentEvalSample, AgentEvalResult]]:
        # executor.map returns results in the same order as self.samples.
        with ThreadPoolExecutor(max_workers=num_workers) as executor:
            results: List[Tuple[AgentEvalSample, AgentEvalResult]] = list(executor.map(self.evaluate_sample, self.samples))
        return results





In [None]:
samples = [
    AgentEvalSample(
        user_input='Who is the current CEO of Amazon?',
        expected_steps=['Search the web for the current CEO of Amazon', "Verify the information is correct"],
        expected_output='Andy Jassy',
        success_criteria='Got Corect Answer'
    ),
    AgentEvalSample(
        user_input='What is the population of the capital of France?',
        expected_steps=[
            'Search the web for the capital of France',
            'Search the web for the population of Paris',
            'Format results into a response'
        ],
        expected_output='Roughly 2.1 million',
        success_criteria='Got Corect Answer'
    )
]

harness = AgentEvalHarness(samples, eval_function, run_function)
evaluation_results = harness.evaluate_threaded()
evaluation_results

This is great, but going through each result individually would be painful. Instead, we can create a summary view.

In [12]:
def create_evaluation_summary(eval_pairs: List[Tuple[AgentEvalSample, AgentEvalResult]]) -> str:
    total_samples = len(eval_pairs)
    success_count = 0
    total_step_delta = 0
    
    for sample, result in eval_pairs:
        # Calculate step delta
        expected_step_count = len(sample.expected_steps)
        actual_step_count = len(result.steps)
        step_delta = actual_step_count - expected_step_count
        
        # Update totals
        if result.task_success:
            success_count += 1
        total_step_delta += step_delta
    
    # Calculate averages
    success_rate = success_count / total_samples * 100
    avg_step_delta = total_step_delta / total_samples
    
    # Generate summary
    summary = f"""
    AGENT EVALUATION SUMMARY
    ========================
    Total samples evaluated: {total_samples}
    Success rate: {success_rate:.1f}%
    Average step delta: {avg_step_delta:.2f}

    Step delta interpretation:
    - Negative: Agent used fewer steps than expected (more efficient)
    - Zero: Agent used exactly the expected number of steps
    - Positive: Agent used more steps than expected (less efficient)
    """
    
    return summary

In [None]:
print(create_evaluation_summary(eval_pairs=evaluation_results))

And lastly, let's convert the results to a pandas DataFrame so we can dig into them in more detail.

In [None]:
import pandas as pd
from typing import List, Tuple

def create_evaluation_dataframe(eval_pairs: List[Tuple]):
    """Convert evaluation pairs to a structured pandas DataFrame."""
    
    data = []
    for sample, result in eval_pairs:
        # Calculate step delta
        expected_step_count = len(sample.expected_steps)
        actual_step_count = len(result.steps)
        step_delta = actual_step_count - expected_step_count
        
        # Create row
        row = {
            'user_input': sample.user_input,
            'expected_steps': sample.expected_steps,
            'expected_output': sample.expected_output,
            'success_criteria': sample.success_criteria,
            'actual_steps': result.steps,
            'step_count': actual_step_count,
            'expected_step_count': expected_step_count,
            'step_delta': step_delta,
            'task_success': result.task_success,
            'thoughts': result.thoughts
        }
        data.append(row)
    
    # Create DataFrame
    df = pd.DataFrame(data)
    
    # Calculate summary statistics and add to dataframe attributes
    df.attrs['total_samples'] = len(eval_pairs)
    df.attrs['success_rate'] = df['task_success'].mean() * 100
    df.attrs['avg_step_delta'] = df['step_delta'].mean()
    
    return df

# Example usage:
df = create_evaluation_dataframe(evaluation_results)

# Display basic info
print(f"Evaluation results for {df.attrs['total_samples']} samples:")
print(f"Success rate: {df.attrs['success_rate']:.1f}%")
print(f"Average step delta: {df.attrs['avg_step_delta']:.2f}")

# Display the DataFrame
print("\nDetailed results:")
print(df[['user_input', 'step_delta', 'task_success']])

# Output summary statistics
print("\nSummary by success status:")
print(df.groupby('task_success')['step_delta'].describe())



# Conclusion
In this lab we went through the basics of agent evaluation. This is very much an open research area. For starting out, we suggest using the assertion-based approach that Pydantic Evals provides. The most important thing in evaluation is to identify which metrics make sense for your use case. If a framework provides them, great. If not, you will often have to build your own test harness.

In future labs, we'll discuss how these fit into a CI/CD pipeline.