# Tutorial 20: Multi-Agent Evaluation

In this tutorial, you'll learn how to evaluate agents using **simulated users** and **automated evaluators**. This pattern enables systematic testing of agents through realistic conversations.

**What you'll learn:**
- Creating simulated user agents with personas and goals
- Building automated evaluator agents that score conversations
- Orchestrating evaluation sessions with metrics collection
- Aggregating scores across multiple evaluation runs
- Using evaluation to improve agent quality

By the end, you'll have a working evaluation system that can test any agent through simulated conversations and provide objective quality scores.

## Prerequisites

- Completed Tutorials 14-16 (Multi-Agent Patterns)
- Understanding of multi-agent collaboration
- Ollama running with a capable model (llama3.1:8b or larger recommended)

## Why Multi-Agent Evaluation?

Testing agents is challenging because:

1. **Manual testing doesn't scale**: Testing every change manually is time-consuming
2. **Metrics must be consistent**: Subjective assessment varies between testers
3. **Edge cases matter**: Agents should handle various user behaviors
4. **Quality is multi-dimensional**: Helpfulness, accuracy, empathy all matter

**Multi-agent evaluation** solves this by:
- Using **simulated users** to generate realistic test conversations
- Using **evaluator agents** to score conversations against a fixed set of criteria
- Running **multiple sessions** to cover different scenarios
- **Aggregating metrics** for overall quality assessment

```
┌─────────────────────────────────────────────────────────┐
│                   Evaluation Session                    │
│                                                         │
│  ┌──────────┐       ┌──────────┐       ┌──────────┐     │
│  │  Agent   │◄─────►│Simulated │       │Evaluator │     │
│  │  Under   │       │  User    │──────►│  Agent   │     │
│  │  Test    │       │  Agent   │       │          │     │
│  └──────────┘       └──────────┘       └────┬─────┘     │
│                                             │           │
│                                             ▼           │
│                                        ┌──────────┐     │
│                                        │  Scores  │     │
│                                        │  Metrics │     │
│                                        └──────────┘     │
└─────────────────────────────────────────────────────────┘
```
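
The flow in the diagram can be sketched without any LLM or graph library. The following is a minimal stand-in, assuming three plain callables (`user_fn`, `agent_fn`, `evaluate_fn`, all hypothetical names) in place of the real agents. It only illustrates the control flow, not the tutorial's actual implementation.

```python
from typing import Callable

Transcript = list[tuple[str, str]]  # (role, text) pairs


def run_session(
    user_fn: Callable[[Transcript], str],
    agent_fn: Callable[[Transcript], str],
    evaluate_fn: Callable[[Transcript], dict],
    max_turns: int = 3,
) -> tuple[Transcript, dict]:
    """Alternate user and agent turns, then score the whole transcript."""
    transcript: Transcript = []
    for _ in range(max_turns):
        transcript.append(("user", user_fn(transcript)))
        transcript.append(("agent", agent_fn(transcript)))
    return transcript, evaluate_fn(transcript)


# Stub callables standing in for LLM-backed agents (illustrative only).
def stub_user(transcript: Transcript) -> str:
    return f"question {len(transcript) // 2 + 1}"


def stub_agent(transcript: Transcript) -> str:
    return f"answer to {transcript[-1][1]}"


def stub_evaluator(transcript: Transcript) -> dict:
    return {"turns": len(transcript) // 2}


transcript, scores = run_session(stub_user, stub_agent, stub_evaluator, max_turns=2)
print(transcript)
print(scores)
```

The rest of the tutorial replaces each stub with an LLM-backed node and lets LangGraph handle the looping and state.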

## Step 1: Setup and Verify Environment

In [None]:
# Verify our setup
from langgraph_ollama_local import LocalAgentConfig

config = LocalAgentConfig()
print(f"Ollama URL: {config.ollama.base_url}")
print(f"Model: {config.ollama.model}")
print("Setup verified!")

## Step 2: Define the Evaluation State

Our state tracks:
- The conversation between agent and simulated user
- Scores from the evaluator agent
- Turn count and completion status
- Final aggregated metrics

In [None]:
from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph.message import add_messages
import operator


class EvaluationState(TypedDict):
    """State for agent evaluation sessions."""
    
    # Conversation messages
    messages: Annotated[list, add_messages]
    
    # Formatted conversation for evaluator
    conversation: str
    
    # Accumulated scores - uses operator.add to append
    evaluator_scores: Annotated[list[dict], operator.add]
    
    # Turn tracking
    turn_count: int
    max_turns: int
    
    # Session status
    session_complete: bool
    
    # Final metrics
    final_metrics: dict[str, float]


print("Evaluation state defined!")
print("\nKey fields:")
print("- evaluator_scores: Accumulates scores from evaluator agent")
print("- final_metrics: Aggregated scores (helpfulness, accuracy, etc.)")
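
The `Annotated[..., operator.add]` annotation tells LangGraph to combine a node's returned value with the existing one instead of overwriting it. The sketch below mimics that merge rule in plain Python. `apply_update` and `REDUCERS` are hypothetical names used for illustration, not LangGraph APIs.

```python
import operator

# Fields with a reducer are combined; all other fields are overwritten.
REDUCERS = {"evaluator_scores": operator.add}


def apply_update(state: dict, update: dict) -> dict:
    """Return a new state with `update` merged in, reducer-style."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"evaluator_scores": [{"helpfulness": 4}], "turn_count": 2}
state = apply_update(state, {"evaluator_scores": [{"helpfulness": 5}], "turn_count": 3})
print(state["evaluator_scores"])  # both score dicts are kept
print(state["turn_count"])        # plain field is replaced
```

This is why a node that contributes to `evaluator_scores` should return a one-element list rather than a bare dict: list concatenation needs a list on both sides.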

## Step 3: Configure the Simulated User

The simulated user acts as a realistic test user with:
- A specific persona and situation
- Goals to achieve in the conversation
- A behavioral style (friendly, impatient, confused, etc.)

In [None]:
from pydantic import BaseModel, Field
from typing import Literal


class SimulatedUser(BaseModel):
    """Configuration for simulated user behavior."""
    
    persona: str = Field(
        description="User's background and situation"
    )
    goals: list[str] = Field(
        description="What the user wants to accomplish"
    )
    behavior: Literal["friendly", "impatient", "confused", "technical", "casual"] = Field(
        default="friendly",
        description="Communication style"
    )
    initial_message: str | None = Field(
        default=None,
        description="First message (auto-generated if None)"
    )


# Example: Customer service scenario
frustrated_customer = SimulatedUser(
    persona="Customer who received a damaged product and wants a refund. Has been waiting 2 weeks for resolution.",
    goals=[
        "Get full refund processed",
        "Express dissatisfaction with shipping delay",
        "Ensure future orders won't have same issue"
    ],
    behavior="impatient",
    initial_message="Hi, I'm really frustrated. My order arrived damaged 2 weeks ago and I still haven't gotten a refund!"
)

print("Simulated user configured!")
print(f"\nPersona: {frustrated_customer.persona}")
print(f"Goals: {frustrated_customer.goals}")
print(f"Behavior: {frustrated_customer.behavior}")

## Step 4: Define Evaluation Criteria

The evaluator scores conversations on multiple dimensions using structured output:

In [None]:
class EvaluationCriteria(BaseModel):
    """Structured scores for conversation evaluation."""
    
    helpfulness: int = Field(
        ge=1, le=5,
        description="How helpful were the responses? (1-5)"
    )
    accuracy: int = Field(
        ge=1, le=5,
        description="How accurate was the information? (1-5)"
    )
    empathy: int = Field(
        ge=1, le=5,
        description="How empathetic was the agent? (1-5)"
    )
    efficiency: int = Field(
        ge=1, le=5,
        description="How efficient was the conversation? (1-5)"
    )
    goal_completion: int = Field(
        ge=0, le=1,
        description="Were user's goals achieved? (0=no, 1=yes)"
    )
    reasoning: str = Field(
        description="Explanation of the scores"
    )


print("Evaluation criteria defined!")
print("\nScoring dimensions:")
print("- Helpfulness (1-5)")
print("- Accuracy (1-5)")
print("- Empathy (1-5)")
print("- Efficiency (1-5)")
print("- Goal Completion (0 or 1)")
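
The `ge`/`le` bounds on `EvaluationCriteria` are what reject out-of-range scores. The plain-Python sketch below spells out roughly equivalent checks on a raw dict, which can be handy when inspecting model output by hand. `validate_scores` and `SCORE_RANGES` are hypothetical names, and the empty-reasoning check is stricter than the Pydantic model, which only requires a string.

```python
# Allowed inclusive range per dimension, mirroring EvaluationCriteria.
SCORE_RANGES = {
    "helpfulness": (1, 5),
    "accuracy": (1, 5),
    "empathy": (1, 5),
    "efficiency": (1, 5),
    "goal_completion": (0, 1),
}


def validate_scores(raw: dict) -> list[str]:
    """Return a list of problems; an empty list means the scores are valid."""
    problems = []
    for name, (low, high) in SCORE_RANGES.items():
        value = raw.get(name)
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{name}: expected int, got {value!r}")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} outside {low}-{high}")
    # Assumption: treat a blank explanation as a problem too.
    if not str(raw.get("reasoning", "")).strip():
        problems.append("reasoning: missing or empty")
    return problems


good = {"helpfulness": 4, "accuracy": 5, "empathy": 3, "efficiency": 4,
        "goal_completion": 1, "reasoning": "Refund confirmed with timeline."}
bad = {"helpfulness": 7, "accuracy": 5, "empathy": 3, "efficiency": 4,
       "goal_completion": 1}
print(validate_scores(good))
print(validate_scores(bad))
```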

## Step 5: Create the Simulated User Node

This node generates user messages based on persona and conversation history:

In [None]:
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage


def create_simulated_user_node(llm, user_config: SimulatedUser):
    """Create simulated user agent."""
    
    system_prompt = f"""You are simulating a user in a conversation.

**Your Persona:**
{user_config.persona}

**Your Goals:**
{chr(10).join(f"- {goal}" for goal in user_config.goals)}

**Your Behavior:**
You are {user_config.behavior}.

**Instructions:**
- Stay in character throughout the conversation
- Work towards your goals naturally
- Keep responses concise (1-3 sentences)
- Express satisfaction if goals are met
- Don't mention that you're simulating anything"""
    
    def simulated_user(state: EvaluationState) -> dict:
        """Generate user message."""
        messages = state.get("messages", [])
        turn_count = state.get("turn_count", 0)
        
        # Use initial message for first turn
        if turn_count == 0 and user_config.initial_message:
            user_message = user_config.initial_message
        else:
            # Generate response based on conversation
            prompt_messages = [
                SystemMessage(content=system_prompt),
                HumanMessage(content=f"""Based on the conversation, provide your next message.

Recent conversation:
{chr(10).join(f"{'You' if isinstance(m, HumanMessage) else 'Agent'}: {m.content}" for m in messages[-6:])}

Your next message:""")
            ]
            
            response = llm.invoke(prompt_messages)
            user_message = response.content
        
        return {
            "messages": [HumanMessage(content=user_message)],
            "turn_count": turn_count + 1,
        }
    
    return simulated_user


print("Simulated user node creator defined!")

## Step 6: Create the Evaluator Node

This node reviews conversations and provides structured scores:

In [None]:
def create_evaluator_node(llm):
    """Create evaluator agent."""
    
    # Use structured output for reliable scoring
    structured_llm = llm.with_structured_output(EvaluationCriteria)
    
    system_prompt = """You are an expert evaluator of customer service conversations.

Score conversations objectively on:
- Helpfulness: How helpful were responses? (1-5)
- Accuracy: How accurate was the information? (1-5)
- Empathy: How empathetic was the agent? (1-5)
- Efficiency: How concise and direct? (1-5)
- Goal Completion: Were user goals achieved? (0=no, 1=yes)

Be fair and consider full context."""
    
    def evaluator(state: EvaluationState) -> dict:
        """Evaluate conversation and provide scores."""
        messages = state.get("messages", [])
        
        # Format conversation
        conversation_text = "\n".join([
            f"{'User' if isinstance(m, HumanMessage) else 'Agent'}: {m.content}"
            for m in messages
        ])
        
        eval_messages = [
            SystemMessage(content=system_prompt),
            HumanMessage(content=f"""Evaluate this conversation:

{conversation_text}

Provide scores for all criteria.""")
        ]
        
        criteria = structured_llm.invoke(eval_messages)
        
        scores = {
            "helpfulness": criteria.helpfulness,
            "accuracy": criteria.accuracy,
            "empathy": criteria.empathy,
            "efficiency": criteria.efficiency,
            "goal_completion": criteria.goal_completion,
            "reasoning": criteria.reasoning,
        }
        
        return {
            "evaluator_scores": [scores],
            "conversation": conversation_text,
        }
    
    return evaluator


print("Evaluator node creator defined!")

## Step 7: Create the Agent Under Test

Let's build a simple customer service agent to evaluate:

In [None]:
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model=config.ollama.model,
    base_url=config.ollama.base_url,
    temperature=0.7,
)


def create_customer_service_agent(llm):
    """Create a customer service agent to test."""
    
    system_prompt = """You are a helpful customer service agent.

Your responsibilities:
- Listen to customer concerns with empathy
- Provide clear, accurate information
- Offer solutions to resolve issues
- Be professional and courteous

For refund requests:
- Apologize for the inconvenience
- Confirm you'll process the refund
- Provide a timeline (3-5 business days)
- Offer to help with anything else"""
    
    def agent(state: EvaluationState) -> dict:
        """Respond to customer message."""
        messages = state.get("messages", [])
        
        # Build prompt with conversation history
        prompt_messages = [SystemMessage(content=system_prompt)]
        prompt_messages.extend(messages)
        
        response = llm.invoke(prompt_messages)
        
        return {"messages": [AIMessage(content=response.content)]}
    
    return agent


customer_service_agent = create_customer_service_agent(llm)
print("Customer service agent created!")

## Step 8: Create Helper Nodes

We need nodes to check completion and finalize metrics:

In [None]:
def create_check_completion_node():
    """Check if evaluation session should end."""
    
    def check_completion(state: EvaluationState) -> dict:
        turn_count = state.get("turn_count", 0)
        max_turns = state.get("max_turns", 10)
        
        # Check max turns
        if turn_count >= max_turns:
            return {"session_complete": True}
        
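        # Check if the user seems satisfied. This node runs right after the
        # agent, so inspect the latest HumanMessage instead of messages[-1].
        messages = state.get("messages", [])
        last_user = next(
            (m for m in reversed(messages) if isinstance(m, HumanMessage)), None
        )
        if last_user is not None:
            last_message = last_user.content.lower()
            if any(phrase in last_message for phrase in [
                "thank you", "thanks", "goodbye", "bye",
                "that's all", "all set", "perfect"
            ]):
                return {"session_complete": True}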
        
        return {"session_complete": False}
    
    return check_completion


def aggregate_scores(scores: list[dict]) -> dict[str, float]:
    """Aggregate evaluator scores."""
    if not scores:
        return {
            "helpfulness_avg": 0.0,
            "accuracy_avg": 0.0,
            "empathy_avg": 0.0,
            "efficiency_avg": 0.0,
            "goal_completion_rate": 0.0,
            "num_scores": 0,
        }
    
    num_scores = len(scores)
    
    return {
        "helpfulness_avg": sum(s.get("helpfulness", 0) for s in scores) / num_scores,
        "accuracy_avg": sum(s.get("accuracy", 0) for s in scores) / num_scores,
        "empathy_avg": sum(s.get("empathy", 0) for s in scores) / num_scores,
        "efficiency_avg": sum(s.get("efficiency", 0) for s in scores) / num_scores,
        "goal_completion_rate": sum(s.get("goal_completion", 0) for s in scores) / num_scores,
        "num_scores": num_scores,
    }


def create_finalize_node():
    """Finalize evaluation metrics."""
    
    def finalize(state: EvaluationState) -> dict:
        scores = state.get("evaluator_scores", [])
        metrics = aggregate_scores(scores)
        return {"final_metrics": metrics}
    
    return finalize


print("Helper nodes defined!")
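
The keyword check above is a substring match, which is simple but coarse. The sketch below isolates that heuristic as a plain function (`looks_finished` is a hypothetical name) so its behavior on a few inputs is easy to see, including a false positive.

```python
CLOSING_PHRASES = [
    "thank you", "thanks", "goodbye", "bye",
    "that's all", "all set", "perfect",
]


def looks_finished(message: str) -> bool:
    """True if any closing phrase appears anywhere in the message."""
    text = message.lower()
    return any(phrase in text for phrase in CLOSING_PHRASES)


print(looks_finished("Perfect, that's all I needed."))     # True
print(looks_finished("I still haven't gotten a refund!"))  # False
# Substring matching also fires on messages that are not closings:
print(looks_finished("Thanks for nothing, where is my refund?"))  # True
```

If premature endings become a problem, one option is to have the simulated user emit an explicit end marker and check for that instead of natural-language phrases.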

## Step 9: Build the Evaluation Graph

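Now we assemble all pieces into the evaluation graph. The evaluator runs every `evaluate_every_n_turns` turns. Once the session completes, control goes straight to `finalize`, so any exchanges after the last periodic evaluation are not scored, and a session that ends before the first evaluation reports all-zero metrics: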

In [None]:
from langgraph.graph import StateGraph, START, END


def create_evaluation_graph(
    llm,
    agent_node,
    user_config: SimulatedUser,
    evaluate_every_n_turns: int = 2,
):
    """Build evaluation graph."""
    
    workflow = StateGraph(EvaluationState)
    
    # Add nodes
    workflow.add_node("simulated_user", create_simulated_user_node(llm, user_config))
    workflow.add_node("agent", agent_node)
    workflow.add_node("evaluator", create_evaluator_node(llm))
    workflow.add_node("check_completion", create_check_completion_node())
    workflow.add_node("finalize", create_finalize_node())
    
    # Entry: simulated user starts
    workflow.add_edge(START, "simulated_user")
    
    # User -> Agent
    workflow.add_edge("simulated_user", "agent")
    
    # Agent -> Check completion
    workflow.add_edge("agent", "check_completion")
    
    # Routing after check
    def route_after_check(state: EvaluationState) -> str:
        if state.get("session_complete", False):
            return "finalize"
        
        # Evaluate periodically
        turn_count = state.get("turn_count", 0)
        if turn_count % evaluate_every_n_turns == 0 and turn_count > 0:
            return "evaluator"
        
        return "simulated_user"
    
    workflow.add_conditional_edges(
        "check_completion",
        route_after_check,
        {
            "simulated_user": "simulated_user",
            "evaluator": "evaluator",
            "finalize": "finalize",
        }
    )
    
    # Evaluator loops back
    workflow.add_edge("evaluator", "simulated_user")
    
    # Finalize ends
    workflow.add_edge("finalize", END)
    
    return workflow.compile()


# Build the graph
eval_graph = create_evaluation_graph(
    llm,
    customer_service_agent,
    frustrated_customer,
    evaluate_every_n_turns=2
)

print("Evaluation graph compiled!")
print("\nGraph flow:")
print("  START -> simulated_user")
print("  simulated_user -> agent")
print("  agent -> check_completion")
print("  check_completion -> [simulated_user | evaluator | finalize]")
print("  evaluator -> simulated_user")
print("  finalize -> END")
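
To see which turns trigger the evaluator, the routing rule can be traced without running any model. The sketch below re-implements the same condition as a standalone function (`route` and `trace` are hypothetical helpers) and counts node executions, assuming the session ends only by reaching `max_turns`.

```python
def route(turn_count: int, max_turns: int, every_n: int) -> str:
    """Mirror of check_completion + route_after_check for the max-turns case."""
    if turn_count >= max_turns:
        return "finalize"
    if turn_count > 0 and turn_count % every_n == 0:
        return "evaluator"
    return "simulated_user"


def trace(max_turns: int, every_n: int) -> tuple[list[str], int]:
    """Return the routing decision per turn and the total node executions."""
    decisions = []
    steps = 0
    for turn in range(1, max_turns + 1):
        steps += 3  # simulated_user, agent, check_completion
        decision = route(turn, max_turns, every_n)
        decisions.append(decision)
        if decision in ("evaluator", "finalize"):
            steps += 1  # the extra node that runs before the next turn or END
    return decisions, steps


decisions, steps = trace(max_turns=6, every_n=2)
print(decisions)
print(steps)
```

With `max_turns=6` the trace shows evaluations after turns 2 and 4 and 21 node executions in total. LangGraph's default `recursion_limit` is 25 steps, so noticeably longer sessions may need a higher limit passed through the `config` argument of `invoke`.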

## Step 10: Run an Evaluation Session

Let's test our customer service agent with the frustrated customer scenario:

In [None]:
def run_evaluation_session(graph, max_turns: int = 10):
    """Run complete evaluation session."""
    
    print("="*60)
    print("STARTING EVALUATION SESSION")
    print("="*60)
    
    initial_state: EvaluationState = {
        "messages": [],
        "conversation": "",
        "evaluator_scores": [],
        "turn_count": 0,
        "max_turns": max_turns,
        "session_complete": False,
        "final_metrics": {},
    }
    
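    # Each turn runs up to four nodes (user, agent, check, evaluator), which can
    # exceed LangGraph's default recursion limit of 25 for longer sessions.
    result = graph.invoke(initial_state, config={"recursion_limit": 4 * max_turns + 5})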
    
    # Display conversation
    print("\n" + "="*60)
    print("CONVERSATION")
    print("="*60)
    for msg in result["messages"]:
        role = "User" if isinstance(msg, HumanMessage) else "Agent"
        print(f"\n{role}: {msg.content}")
    
    # Display scores
    print("\n" + "="*60)
    print("EVALUATION SCORES")
    print("="*60)
    for i, score in enumerate(result["evaluator_scores"], 1):
        print(f"\nEvaluation {i}:")
        print(f"  Helpfulness: {score['helpfulness']}/5")
        print(f"  Accuracy: {score['accuracy']}/5")
        print(f"  Empathy: {score['empathy']}/5")
        print(f"  Efficiency: {score['efficiency']}/5")
        print(f"  Goal Completion: {score['goal_completion']}")
        print(f"  Reasoning: {score['reasoning']}")
    
    # Display final metrics
    print("\n" + "="*60)
    print("FINAL METRICS")
    print("="*60)
    metrics = result["final_metrics"]
    print(f"Helpfulness:      {metrics['helpfulness_avg']:.2f}/5.00")
    print(f"Accuracy:         {metrics['accuracy_avg']:.2f}/5.00")
    print(f"Empathy:          {metrics['empathy_avg']:.2f}/5.00")
    print(f"Efficiency:       {metrics['efficiency_avg']:.2f}/5.00")
    print(f"Goal Completion:  {metrics['goal_completion_rate']:.0%}")
    print(f"\nTotal Turns: {result['turn_count']}")
    print(f"Evaluations: {metrics['num_scores']}")
    
    return result


# Run evaluation
result = run_evaluation_session(eval_graph, max_turns=6)
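
As a worked example of how the FINAL METRICS block is derived, the sketch below averages two evaluation dicts the same way `aggregate_scores` does. The score values are made up for illustration, and `mean_of` is a hypothetical helper.

```python
example_scores = [
    {"helpfulness": 3, "accuracy": 4, "empathy": 2, "efficiency": 4, "goal_completion": 0},
    {"helpfulness": 4, "accuracy": 4, "empathy": 4, "efficiency": 3, "goal_completion": 1},
]


def mean_of(key: str, scores: list[dict]) -> float:
    """Average one dimension across all evaluations."""
    return sum(s.get(key, 0) for s in scores) / len(scores)


for key in ["helpfulness", "accuracy", "empathy", "efficiency"]:
    print(f"{key}: {mean_of(key, example_scores):.2f}/5.00")
print(f"goal completion: {mean_of('goal_completion', example_scores):.0%}")
```

Because each evaluation scores the conversation so far, an early evaluation taken before the user's goals are met can pull `goal_completion_rate` below 100% even when the session ends well.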

## Step 11: Test Different User Scenarios

Let's evaluate with different simulated user types:

In [None]:
# Scenario 2: Friendly customer with simple question
friendly_customer = SimulatedUser(
    persona="Customer interested in learning about return policy before making a purchase.",
    goals=["Understand return policy", "Clarify timeframe for returns"],
    behavior="friendly",
    initial_message="Hi! I'm thinking of ordering something, but I'd like to know about your return policy first."
)

friendly_graph = create_evaluation_graph(
    llm,
    customer_service_agent,
    friendly_customer,
)

result2 = run_evaluation_session(friendly_graph, max_turns=4)

In [None]:
# Scenario 3: Confused customer
confused_customer = SimulatedUser(
    persona="Elderly customer who is not tech-savvy and having trouble tracking an order online.",
    goals=["Find out where order is", "Get help with tracking"],
    behavior="confused",
    initial_message="Hello, I'm trying to find my order but I don't understand this website. Can you help me?"
)

confused_graph = create_evaluation_graph(
    llm,
    customer_service_agent,
    confused_customer,
)

result3 = run_evaluation_session(confused_graph, max_turns=4)
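
When the number of scenarios grows, it can help to generate them as a small matrix of personas and behaviors and loop over it. The sketch below uses a plain dataclass (`Scenario`, a hypothetical stand-in for `SimulatedUser`) so it runs without dependencies. In the notebook, each entry would instead be a `SimulatedUser` passed to `create_evaluation_graph`.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    """Stand-in for SimulatedUser, holding the same core fields."""
    persona: str
    goals: tuple[str, ...]
    behavior: str


PERSONAS = {
    "Customer asking about the return policy": ("Understand return policy",),
    "Customer tracking a late order": ("Find out where order is",),
}
BEHAVIORS = ["friendly", "impatient", "confused"]

# Every persona is paired with every behavior.
scenarios = [
    Scenario(persona=persona, goals=goals, behavior=behavior)
    for (persona, goals), behavior in product(PERSONAS.items(), BEHAVIORS)
]

for s in scenarios:
    print(f"{s.behavior:<10} | {s.persona}")
```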

## Step 12: Using the Built-in Module

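The `langgraph_ollama_local.patterns.evaluation` module provides ready-to-use versions of the functions built above. Importing them replaces the notebook's local definitions of the same names: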

In [None]:
from langgraph_ollama_local.patterns.evaluation import (
    SimulatedUser,
    create_evaluation_graph,
    run_evaluation_session,
    run_multiple_evaluations,
)

# Use the module's implementation
user_config = SimulatedUser(
    persona="Customer with billing question",
    goals=["Understand unexpected charge", "Get charge removed if incorrect"],
    behavior="friendly",
)

module_graph = create_evaluation_graph(
    llm,
    customer_service_agent,
    user_config
)

# Run single session
result = run_evaluation_session(module_graph, max_turns=5)

print("\nFinal Metrics:")
for metric, value in result["final_metrics"].items():
    print(f"{metric}: {value}")

## Step 13: Run Multiple Evaluations

For robust testing, run multiple evaluation sessions and aggregate results:

In [None]:
# Run 3 evaluation sessions
multi_results = run_multiple_evaluations(
    module_graph,
    num_sessions=3,
    max_turns=6
)

print("="*60)
print("AGGREGATE METRICS (3 SESSIONS)")
print("="*60)
metrics = multi_results["aggregate_metrics"]
print(f"Helpfulness:      {metrics['helpfulness_avg']:.2f}/5.00")
print(f"Accuracy:         {metrics['accuracy_avg']:.2f}/5.00")
print(f"Empathy:          {metrics['empathy_avg']:.2f}/5.00")
print(f"Efficiency:       {metrics['efficiency_avg']:.2f}/5.00")
print(f"Goal Completion:  {metrics['goal_completion_rate']:.0%}")
print(f"\nSessions Evaluated: {metrics['num_sessions']}")
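
The internals of `run_multiple_evaluations` are not shown in this tutorial. As a rough sketch of what cross-session aggregation could involve, the code below averages per-session metric dicts and also reports the spread, which helps judge whether three sessions are enough. `summarize_sessions` and the sample numbers are hypothetical.

```python
from statistics import mean, pstdev


def summarize_sessions(session_metrics: list[dict[str, float]]) -> dict[str, dict[str, float]]:
    """Compute mean and population standard deviation for each metric."""
    summary = {}
    for key in session_metrics[0]:
        values = [m[key] for m in session_metrics]
        summary[key] = {"mean": mean(values), "stdev": pstdev(values)}
    return summary


# Made-up per-session results, shaped like `final_metrics`.
sessions = [
    {"helpfulness_avg": 4.0, "empathy_avg": 3.0},
    {"helpfulness_avg": 4.0, "empathy_avg": 5.0},
    {"helpfulness_avg": 4.0, "empathy_avg": 4.0},
]

summary = summarize_sessions(sessions)
for key, stats in summary.items():
    print(f"{key}: mean={stats['mean']:.2f} stdev={stats['stdev']:.2f}")
```

A large spread on a metric suggests the average alone is not yet trustworthy and that more sessions, or a lower model temperature, may be worth trying.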

## Complete Code

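Here's a condensed reference cell. It restates the state and configuration models and relies on the built-in module for graph construction and session execution: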

In [None]:
# === Complete Multi-Agent Evaluation Implementation ===

from typing import Annotated, Literal
from typing_extensions import TypedDict
import operator

from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langchain_ollama import ChatOllama
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from pydantic import BaseModel, Field

from langgraph_ollama_local import LocalAgentConfig


# === State ===
class EvaluationState(TypedDict):
    messages: Annotated[list, add_messages]
    conversation: str
    evaluator_scores: Annotated[list[dict], operator.add]
    turn_count: int
    max_turns: int
    session_complete: bool
    final_metrics: dict[str, float]


# === Config ===
class SimulatedUser(BaseModel):
    persona: str
    goals: list[str]
    behavior: Literal["friendly", "impatient", "confused", "technical", "casual"] = "friendly"
    initial_message: str | None = None


class EvaluationCriteria(BaseModel):
    helpfulness: int = Field(ge=1, le=5)
    accuracy: int = Field(ge=1, le=5)
    empathy: int = Field(ge=1, le=5)
    efficiency: int = Field(ge=1, le=5)
    goal_completion: int = Field(ge=0, le=1)
    reasoning: str


# === Quick Example ===
def quick_evaluation_example():
    config = LocalAgentConfig()
    llm = ChatOllama(model=config.ollama.model, base_url=config.ollama.base_url)
    
    # Simple agent
    def agent(state):
        return {"messages": [AIMessage(content="I'm here to help! How can I assist you?")]}
    
    # Simulated user
    user = SimulatedUser(
        persona="Customer with question",
        goals=["Get help"],
        behavior="friendly"
    )
    
    from langgraph_ollama_local.patterns.evaluation import (
        create_evaluation_graph,
        run_evaluation_session,
    )
    
    graph = create_evaluation_graph(llm, agent, user)
    result = run_evaluation_session(graph, max_turns=4)
    
    return result["final_metrics"]


if __name__ == "__main__":
    metrics = quick_evaluation_example()
    print(f"Helpfulness: {metrics['helpfulness_avg']:.2f}/5.00")
    print(f"Empathy: {metrics['empathy_avg']:.2f}/5.00")

## Key Concepts

| Concept | Description |
|---------|-------------|
| **Simulated User** | Agent that mimics a real user with a persona and goals |
| **Evaluator Agent** | Agent that scores conversations against fixed criteria |
| **Evaluation Session** | Full conversation + periodic scoring |
| **Structured Scoring** | Pydantic models constrain scores to a fixed schema and range |
| **Metrics Aggregation** | Combine scores into summary statistics |
| **Multiple Runs** | Test with different scenarios for robustness |
| **Periodic Evaluation** | Score every N turns to track quality over time |
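
Because scores accumulate every N turns, the list in `evaluator_scores` doubles as a small time series. A sketch of reading a trend from it follows. `score_trend` is a hypothetical helper and the values are invented.

```python
def score_trend(scores: list[dict], key: str) -> int:
    """Difference between the last and first recorded score for one dimension."""
    values = [s[key] for s in scores if key in s]
    if len(values) < 2:
        return 0  # not enough evaluations to speak of a trend
    return values[-1] - values[0]


history = [
    {"helpfulness": 2, "empathy": 4},  # evaluation after turn 2
    {"helpfulness": 3, "empathy": 4},  # evaluation after turn 4
    {"helpfulness": 5, "empathy": 3},  # evaluation after turn 6
]

print(score_trend(history, "helpfulness"))
print(score_trend(history, "empathy"))
```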

## Best Practices

1. **Diverse scenarios**: Test with multiple user personas and behaviors
2. **Clear goals**: Define specific, measurable objectives for simulated users
3. **Multiple sessions**: Run 3-5 evaluations per scenario for reliability
4. **Objective criteria**: Use consistent scoring dimensions across tests
5. **Track changes**: Compare metrics before/after agent improvements
6. **Edge cases**: Test impatient, confused, and difficult user behaviors
7. **Reasonable limits**: Set max_turns to prevent endless conversations
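
For the "track changes" practice, comparing two metric dicts can be as simple as a per-key difference. The sketch below assumes both runs used the same scenarios. `compare_metrics` and the numbers are hypothetical.

```python
def compare_metrics(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Return after - before for every metric present in both runs."""
    return {
        key: round(after[key] - before[key], 2)
        for key in before
        if key in after
    }


baseline = {"helpfulness_avg": 3.5, "empathy_avg": 3.0, "goal_completion_rate": 0.5}
candidate = {"helpfulness_avg": 4.0, "empathy_avg": 2.75, "goal_completion_rate": 0.75}

for key, delta in compare_metrics(baseline, candidate).items():
    direction = "improved" if delta > 0 else "regressed" if delta < 0 else "unchanged"
    print(f"{key}: {delta:+.2f} ({direction})")
```

Small differences should be read with care: with only a handful of sessions per scenario, run-to-run variation in LLM output can be as large as the change being measured.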

## What's Next

Congratulations! You've completed the Multi-Agent Patterns series. You now know:
- Multi-agent collaboration with supervisors (Tutorial 14)
- Hierarchical team structures (Tutorial 15)
- Subgraph composition patterns (Tutorial 16)
- Automated agent evaluation (Tutorial 20)

Continue exploring:
- Combine evaluation with other patterns for quality assurance
- Build custom evaluator agents for domain-specific metrics
- Create test suites with diverse simulated users
- Use evaluation to guide agent development and iteration