# Evaluating Agent Decisions

This notebook shows how to systematically evaluate AI agent behavior: measuring tool selection accuracy, detecting escalation errors, and tracking workflow completion. Unlike single-call LLM evaluation, agent evaluation has to account for multi-step decision chains, where an early error compounds in later steps.

**What You'll Learn:**
- Build test datasets that properly evaluate agent behavior
- Create evaluators for tool selection accuracy
- Measure escalation accuracy and over/under-escalation
- Track multi-step workflow completion
- Debug agent failures using traces

**Prerequisites:**
- Python >=3.10, <3.14
- OpenAI API key
- Netra API key ([Get started here](https://docs.getnetra.ai/quick-start/Overview))
- A LangChain or similar agent framework

## Step 0: Install Packages

In [None]:
%pip install netra-sdk openai langchain langchain-openai

## Step 1: Set Environment Variables

In [None]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key:")
os.environ["NETRA_API_KEY"] = getpass("Enter your Netra API Key:")
os.environ["NETRA_OTLP_ENDPOINT"] = getpass("Enter your Netra OTLP Endpoint:")

print("API keys configured!")
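
Because `getpass` accepts an empty entry, a quick check that each variable actually holds a value can save a confusing failure later. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of any SDK.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the cell above.
REQUIRED_VARS = ("OPENAI_API_KEY", "NETRA_API_KEY", "NETRA_OTLP_ENDPOINT")


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names whose values are unset or blank (hypothetical helper)."""
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing or empty: {', '.join(missing)}")
else:
    print("All required variables are set.")
```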


## Step 2: Initialize Netra

In [None]:
from netra import Netra
from netra.instrumentation.instruments import InstrumentSet

Netra.init(
    app_name="agent-evaluation",
    headers=f"x-api-key={os.getenv('NETRA_API_KEY')}",
    environment="testing",
    trace_content=True,
    instruments={InstrumentSet.OPENAI, InstrumentSet.LANGCHAIN},
)

print("Netra initialized for agent evaluation!")

## Step 3: Create Test Dataset Structure

Define test cases with expected agent behavior.

In [None]:
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AgentTestCase:
    """Test case for evaluating agent decisions."""
    test_id: str
    query: str
    category: str  # e.g., "single_tool", "multi_tool", "escalation"
    expected_tools: List[str]  # Tools the agent should use
    forbidden_tools: Optional[List[str]] = None  # Tools the agent should NOT use
    should_escalate: bool = False  # Whether escalation is appropriate
    expected_outcome: str = "success"  # "success", "failure", "escalation"

# Define test dataset
TEST_DATASET = [
    AgentTestCase(
        test_id="single-tool-1",
        query="What is the return policy?",
        category="single_tool",
        expected_tools=["search_kb"],
        forbidden_tools=["check_order_status", "escalate_to_human"],
        should_escalate=False,
        expected_outcome="success"
    ),
    AgentTestCase(
        test_id="multi-tool-1",
        query="I have ticket TKT-001 about a damaged item. Can you check the order?",
        category="multi_tool",
        expected_tools=["lookup_ticket", "check_order_status"],
        forbidden_tools=["escalate_to_human"],
        should_escalate=False,
        expected_outcome="success"
    ),
    AgentTestCase(
        test_id="escalation-1",
        query="I'm FURIOUS! Your product destroyed my business! I need to speak to someone NOW!",
        category="escalation",
        expected_tools=["escalate_to_human"],
        forbidden_tools=[],
        should_escalate=True,
        expected_outcome="escalation"
    ),
    AgentTestCase(
        test_id="no-tool-1",
        query="Thank you for your help!",
        category="no_tool",
        expected_tools=[],
        forbidden_tools=["lookup_ticket", "check_order_status", "escalate_to_human"],
        should_escalate=False,
        expected_outcome="success"
    ),
]

print(f"Test dataset created with {len(TEST_DATASET)} cases")
for case in TEST_DATASET:
    print(f"  - {case.test_id} ({case.category}): {case.query[:40]}...")
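
A test dataset can contradict itself, for example by listing the same tool as both expected and forbidden. The sketch below shows one way to catch such problems before running any evaluation. The function `validate_test_cases` is a hypothetical helper. It assumes only that each case exposes the same attribute names as `AgentTestCase`, so the example uses `SimpleNamespace` stand-ins to stay self-contained.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable, List


def validate_test_cases(cases: Iterable) -> List[str]:
    """Return human-readable problems found in a collection of test cases."""
    cases = list(cases)
    problems = []

    # Duplicate IDs make per-test results ambiguous.
    id_counts = Counter(case.test_id for case in cases)
    for test_id, count in id_counts.items():
        if count > 1:
            problems.append(f"{test_id}: duplicate test_id ({count} cases)")

    for case in cases:
        forbidden = case.forbidden_tools or []
        overlap = set(case.expected_tools) & set(forbidden)
        if overlap:
            problems.append(
                f"{case.test_id}: tools both expected and forbidden: {sorted(overlap)}"
            )
        # Assumption: escalation cases use expected_outcome == "escalation".
        if case.should_escalate != (case.expected_outcome == "escalation"):
            problems.append(
                f"{case.test_id}: should_escalate disagrees with expected_outcome"
            )
    return problems


# Stand-in objects with the same attribute names as AgentTestCase.
sample_cases = [
    SimpleNamespace(test_id="ok-1", expected_tools=["search_kb"],
                    forbidden_tools=["escalate_to_human"],
                    should_escalate=False, expected_outcome="success"),
    SimpleNamespace(test_id="bad-1", expected_tools=["search_kb"],
                    forbidden_tools=["search_kb"],
                    should_escalate=True, expected_outcome="success"),
]

for problem in validate_test_cases(sample_cases):
    print(problem)
```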

## Step 4: Implement Agent Evaluators

In [None]:
from typing import Dict

class AgentEvaluator:
    """Evaluate agent behavior against expected decisions."""

    @staticmethod
    def evaluate_tool_selection(
        expected_tools: List[str],
        actual_tools: List[str],
        forbidden_tools: Optional[List[str]] = None
    ) -> Dict:
        """Evaluate whether agent selected correct tools."""
        if forbidden_tools is None:
            forbidden_tools = []

        # Check for correct tool usage
        correct_tools = set(expected_tools) & set(actual_tools)
        missing_tools = set(expected_tools) - set(actual_tools)
        extra_tools = set(actual_tools) - set(expected_tools)
        forbidden_used = set(forbidden_tools) & set(actual_tools)

        # Calculate accuracy
        if not expected_tools:  # No tools expected
            accuracy = 0 if actual_tools else 1
        else:
            accuracy = len(correct_tools) / len(expected_tools)

        # Penalize forbidden tool usage
        if forbidden_used:
            accuracy *= 0.5

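        # With no tools expected, any tool call counts as a failure
        no_unexpected_calls = bool(expected_tools) or not actual_tools

        return {
            "passed": not missing_tools and not forbidden_used and no_unexpected_calls,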
            "accuracy": accuracy,
            "correct_tools": list(correct_tools),
            "missing_tools": list(missing_tools),
            "extra_tools": list(extra_tools),
            "forbidden_used": list(forbidden_used),
        }

    @staticmethod
    def evaluate_escalation(
        should_escalate: bool,
        did_escalate: bool
    ) -> Dict:
        """Evaluate escalation decision accuracy."""
        if should_escalate == did_escalate:
            return {"passed": True, "type": "correct", "score": 1.0}
        elif should_escalate and not did_escalate:
            # Under-escalation: worse than over-escalation
            return {"passed": False, "type": "under_escalation", "score": 0.2}
        else:
            # Over-escalation: less critical but still wrong
            return {"passed": False, "type": "over_escalation", "score": 0.5}

    @staticmethod
    def evaluate_workflow(
        tool_sequence: List[str],
        expected_first_tool: str
    ) -> Dict:
        """Evaluate multi-step workflow execution."""
        if not tool_sequence:
            return {"passed": False, "sequence_valid": False, "score": 0.0}

        # Check if workflow starts with expected tool
        first_correct = tool_sequence[0] == expected_first_tool
        sequence_valid = len(tool_sequence) > 1  # Multi-step workflow

        return {
            "passed": first_correct and sequence_valid,
            "first_tool_correct": first_correct,
            "sequence_valid": sequence_valid,
            "tool_count": len(tool_sequence),
            "score": 1.0 if (first_correct and sequence_valid) else 0.5 if first_correct else 0.0,
        }


print("Agent evaluators implemented!")
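
The `accuracy` value in `evaluate_tool_selection` is the share of expected tools that were called, so it behaves like recall: extra, non-forbidden tools show up in `extra_tools` but do not lower the score. If that matters for your agent, precision and F1 are one possible complement. The function below is an illustrative sketch, not part of the evaluator class above. Its conventions for empty sets (no calls means precision 1.0, nothing expected means recall 1.0) are assumptions you may want to change.

```python
from typing import Dict, List


def tool_selection_prf(expected_tools: List[str], actual_tools: List[str]) -> Dict[str, float]:
    """Precision, recall, and F1 over the sets of expected and actual tools."""
    expected, actual = set(expected_tools), set(actual_tools)
    true_positives = len(expected & actual)

    # Assumed conventions: no calls -> no wrong calls; nothing expected -> nothing missed.
    precision = true_positives / len(actual) if actual else 1.0
    recall = true_positives / len(expected) if expected else 1.0

    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Both expected tools were called, plus one extra tool.
scores = tool_selection_prf(
    expected_tools=["lookup_ticket", "check_order_status"],
    actual_tools=["lookup_ticket", "search_kb", "check_order_status"],
)
print({name: round(value, 3) for name, value in scores.items()})
```

In this example recall is 1.0 while precision is 2/3, which makes the unnecessary `search_kb` call visible in the score.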

## Step 5: Simulate Agent Execution

In [None]:
class SimulatedAgent:
    """Simulate agent behavior for testing."""

    def __init__(self):
        self.tool_calls = []
        self.escalated = False

    def process_query(self, test_case: AgentTestCase) -> Dict:
        """Simulate agent processing a query."""
        # Simulated tool usage based on query analysis
        if test_case.test_id == "single-tool-1":
            self.tool_calls = ["search_kb"]
            self.escalated = False
        elif test_case.test_id == "multi-tool-1":
            self.tool_calls = ["lookup_ticket", "check_order_status"]
            self.escalated = False
        elif test_case.test_id == "escalation-1":
            # Escalation goes through the escalate_to_human tool, matching expected_tools
            self.tool_calls = ["escalate_to_human"]
            self.escalated = True
        elif test_case.test_id == "no-tool-1":
            self.tool_calls = []
            self.escalated = False
        else:
            # Unknown test IDs: default to no tool calls and no escalation
            self.tool_calls = []
            self.escalated = False

        return {
            "test_id": test_case.test_id,
            "tool_calls": self.tool_calls,
            "escalated": self.escalated
        }


print("Simulated agent created!")
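
To evaluate a real agent instead of the simulation, you need to turn its output into the same `tool_calls` / `escalated` shape. The sketch below assumes a LangChain `AgentExecutor` created with `return_intermediate_steps=True`, whose result dict contains an `intermediate_steps` list of `(action, observation)` pairs where each action has a `.tool` name. Check this against your framework and version. The function name `extract_agent_result` is hypothetical, and the example uses fake objects so that it runs without an API key.

```python
from types import SimpleNamespace
from typing import Any, Dict, List

ESCALATION_TOOL = "escalate_to_human"  # tool name taken from the test dataset


def extract_agent_result(test_id: str, executor_output: Dict[str, Any]) -> Dict[str, Any]:
    """Convert an executor output dict into the shape the evaluators expect."""
    steps = executor_output.get("intermediate_steps", [])
    # Assumption: each step is an (action, observation) pair and action.tool is the tool name.
    tool_calls: List[str] = [action.tool for action, _observation in steps]
    return {
        "test_id": test_id,
        "tool_calls": tool_calls,
        "escalated": ESCALATION_TOOL in tool_calls,
    }


# Fake output standing in for a real agent run.
fake_output = {
    "output": "Connecting you to a human agent.",
    "intermediate_steps": [
        (SimpleNamespace(tool="lookup_ticket"), "TKT-001: damaged item"),
        (SimpleNamespace(tool="escalate_to_human"), "queued"),
    ],
}

print(extract_agent_result("escalation-1", fake_output))
```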

## Step 6: Run Agent Evaluation Tests

In [None]:
# Initialize the agent and a list to collect results
agent = SimulatedAgent()
results = []

print("="*70)
print("Running Agent Evaluation Tests")
print("="*70)

for test_case in TEST_DATASET:
    print(f"\n[{test_case.test_id}] {test_case.query}")
    print("-" * 70)

    # Run agent
    agent_result = agent.process_query(test_case)

    # Evaluate tool selection
    tool_eval = AgentEvaluator.evaluate_tool_selection(
        expected_tools=test_case.expected_tools,
        actual_tools=agent_result["tool_calls"],
        forbidden_tools=test_case.forbidden_tools
    )

    # Evaluate escalation
    escalation_eval = AgentEvaluator.evaluate_escalation(
        should_escalate=test_case.should_escalate,
        did_escalate=agent_result["escalated"]
    )

    # Overall test result
    test_passed = tool_eval["passed"] and escalation_eval["passed"]

    # Print results
    print(f"Category: {test_case.category}")
    print(f"Expected Tools: {test_case.expected_tools}")
    print(f"Actual Tools: {agent_result['tool_calls']}")
    print(f"\nTool Selection:")
    print(f"  Correct: {tool_eval['correct_tools']}")
    print(f"  Missing: {tool_eval['missing_tools']}")
    print(f"  Accuracy: {tool_eval['accuracy']:.1%}")
    print(f"\nEscalation:")
    print(f"  Should Escalate: {test_case.should_escalate}")
    print(f"  Did Escalate: {agent_result['escalated']}")
    print(f"  Type: {escalation_eval['type']}")
    print(f"\nResult: {'✓ PASS' if test_passed else '✗ FAIL'}")

    # Store result
    results.append({
        "test_id": test_case.test_id,
        "category": test_case.category,
        "passed": test_passed,
        "tool_eval": tool_eval,
        "escalation_eval": escalation_eval,
    })
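
The loop above scores tool selection and escalation but does not call `evaluate_workflow`, and the set-based tool comparison ignores call order. For multi-step cases where order matters, one option is to check that the expected tools appear in the actual sequence in the same relative order. The function `evaluate_tool_order` below is a hypothetical sketch of that idea, not part of `AgentEvaluator`.

```python
from typing import Dict, List


def evaluate_tool_order(expected_order: List[str], tool_sequence: List[str]) -> Dict:
    """Check that expected tools occur in tool_sequence in the given relative order.

    Other tools may appear in between; only the relative order is checked.
    """
    position = 0  # index of the next expected tool still to be matched
    matched = []
    for tool in tool_sequence:
        if position < len(expected_order) and tool == expected_order[position]:
            matched.append(tool)
            position += 1

    score = position / len(expected_order) if expected_order else 1.0
    return {
        "passed": position == len(expected_order),
        "matched_in_order": matched,
        "score": score,
    }


expected = ["lookup_ticket", "check_order_status"]

in_order = evaluate_tool_order(expected, ["lookup_ticket", "search_kb", "check_order_status"])
reversed_order = evaluate_tool_order(expected, ["check_order_status", "lookup_ticket"])

print("in order:", in_order)
print("reversed:", reversed_order)
```

The reversed sequence scores 0.5 because only `lookup_ticket` can be matched once `check_order_status` has already been consumed out of order.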

## Step 7: Analyze Test Results

In [None]:
# Calculate summary metrics
print("\n" + "="*70)
print("Test Summary")
print("="*70)

total_tests = len(results)
passed_tests = sum(1 for r in results if r["passed"])
pass_rate = passed_tests / total_tests if total_tests > 0 else 0

print(f"\nTotal Tests: {total_tests}")
print(f"Passed: {passed_tests}")
print(f"Failed: {total_tests - passed_tests}")
print(f"Pass Rate: {pass_rate:.1%}")

# Breakdown by category
categories = {}
for result in results:
    cat = result["category"]
    if cat not in categories:
        categories[cat] = {"total": 0, "passed": 0}
    categories[cat]["total"] += 1
    if result["passed"]:
        categories[cat]["passed"] += 1

print("\nBy Category:")
for cat, stats in sorted(categories.items()):
    rate = stats["passed"] / stats["total"] if stats["total"] > 0 else 0
    print(f"  {cat}: {stats['passed']}/{stats['total']} ({rate:.1%})")

# Failure analysis
failures = [r for r in results if not r["passed"]]
if failures:
    print(f"\nFailed Tests ({len(failures)}):")
    for failure in failures:
        print(f"\n  {failure['test_id']}:")
        if not failure["tool_eval"]["passed"]:
            print(f"    Tool Selection Issue:")
            if failure["tool_eval"]["missing_tools"]:
                print(f"      Missing: {failure['tool_eval']['missing_tools']}")
            if failure["tool_eval"]["forbidden_used"]:
                print(f"      Forbidden: {failure['tool_eval']['forbidden_used']}")
        if not failure["escalation_eval"]["passed"]:
            print(f"    Escalation Issue: {failure['escalation_eval']['type']}")
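
Since under-escalation and over-escalation carry different costs, it can help to report them as separate rates instead of folding them into the overall pass rate. The sketch below assumes result dicts shaped like the ones stored in `results` above, with an `escalation_eval["type"]` field. The function name `summarize_escalation` and the sample data are illustrative.

```python
from collections import Counter
from typing import Dict, List

ESCALATION_TYPES = ("correct", "under_escalation", "over_escalation")


def summarize_escalation(results: List[Dict]) -> Dict[str, float]:
    """Return the share of results falling into each escalation outcome type."""
    total = len(results)
    if total == 0:
        return {kind: 0.0 for kind in ESCALATION_TYPES}
    counts = Counter(r["escalation_eval"]["type"] for r in results)
    return {kind: counts.get(kind, 0) / total for kind in ESCALATION_TYPES}


# Illustrative data only, not output from a real run.
sample_results = [
    {"escalation_eval": {"type": "correct"}},
    {"escalation_eval": {"type": "correct"}},
    {"escalation_eval": {"type": "under_escalation"}},
    {"escalation_eval": {"type": "over_escalation"}},
]

for kind, rate in summarize_escalation(sample_results).items():
    print(f"{kind}: {rate:.1%}")
```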

## Step 8: Debugging with Traces

Use Netra traces to debug failed tests.
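
Once a trace gives you the ordered list of tool calls for a failed test, a loop is one of the easier problems to detect automatically. The sketch below flags a tool that is called more than a chosen number of times in a row. The function name `find_repeated_call` and the default threshold of 3 are assumptions for illustration. The function works on a plain list of tool names and does not use any Netra API.

```python
from typing import List, Optional, Tuple


def find_repeated_call(tool_sequence: List[str], max_repeats: int = 3) -> Optional[Tuple[str, int]]:
    """Return (tool, run_length) as soon as a tool is called more than
    max_repeats times consecutively, or None if that never happens."""
    run_tool: Optional[str] = None
    run_length = 0
    for tool in tool_sequence:
        if tool == run_tool:
            run_length += 1
        else:
            run_tool, run_length = tool, 1
        if run_length > max_repeats:
            return run_tool, run_length
    return None


looping = ["lookup_ticket"] + ["search_kb"] * 5
healthy = ["lookup_ticket", "check_order_status"]

print(find_repeated_call(looping))
print(find_repeated_call(healthy))
```

This only catches consecutive repeats of a single tool. Cycles that alternate between two or more tools would need a different check.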

In [None]:
print("DEBUGGING WITH TRACES")
print("=" * 70)
print("""
When an agent test fails, use Netra traces to investigate:

1. **Tool Call Sequence**
   - Check the order of tool invocations
   - Verify each tool's inputs and outputs
   - Look for unexpected tool calls

2. **LLM Reasoning**
   - Review the exact prompt sent to the LLM
   - Check the model's reasoning in the completion
   - Look for prompt ambiguity or missing instructions

3. **Tool Definitions**
   - Verify tool descriptions are clear and distinct
   - Check if similar tools might confuse the agent
   - Review tool parameters and expected outputs

4. **Context and History**
   - Check if conversation history affects decisions
   - Verify session/user context is properly set
   - Look for carryover from previous queries

5. **Common Failure Patterns**
   - Wrong tool selection → Improve tool descriptions
   - Infinite loops → Adjust loop detection in agent
   - Missing escalation → Refine escalation criteria
   - Extra tool calls → Add tool usage constraints
""")

## Step 9: Improvement Recommendations

Based on evaluation results, implement improvements.

In [None]:
if pass_rate < 1.0:
    print("IMPROVEMENT RECOMMENDATIONS")
    print("=" * 70)
    print("""
Based on your agent evaluation results, consider these improvements:

1. **Refine Tool Descriptions**
   - Make tool descriptions more distinct and specific
   - Include examples of when each tool should be used
   - Add notes about when NOT to use a tool

2. **Improve System Prompts**
   - Add explicit instructions for escalation criteria
   - Include decision trees for complex scenarios
   - Specify tool usage order for multi-step workflows

3. **Add Tool Constraints**
   - Limit tool call sequences to prevent loops
   - Require specific tools before others (dependencies)
   - Set maximum tool calls per query

4. **Implement Pre/Post-Checks**
   - Validate tool selection before execution
   - Check escalation appropriateness before routing
   - Verify workflow completion before returning

5. **Run Continuous Evaluation**
   - Weekly regression tests to detect degradation
   - Evaluate after prompt changes
   - Test after adding new tools
""")
else:
    print("\n✓ All tests passed! Continue monitoring with regular evaluation runs.")
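
The tool-constraint recommendations (a maximum number of calls and dependencies between tools) can also be checked after the fact against a recorded tool sequence. The sketch below is one way to do that. The function `check_tool_constraints`, the dependency rule, and the default limit of 5 calls are all hypothetical and should be replaced with your own rules.

```python
from typing import Dict, List

# Hypothetical rule: an order can only be checked after the ticket was looked up.
DEPENDENCIES: Dict[str, List[str]] = {"check_order_status": ["lookup_ticket"]}


def check_tool_constraints(
    tool_sequence: List[str],
    dependencies: Dict[str, List[str]],
    max_calls: int = 5,
) -> List[str]:
    """Return a list of constraint violations for a recorded tool sequence."""
    violations = []
    if len(tool_sequence) > max_calls:
        violations.append(
            f"{len(tool_sequence)} tool calls exceed the limit of {max_calls}"
        )

    seen = set()  # tools called so far
    for tool in tool_sequence:
        for required in dependencies.get(tool, []):
            if required not in seen:
                violations.append(f"{tool} called before required tool {required}")
        seen.add(tool)
    return violations


print(check_tool_constraints(["lookup_ticket", "check_order_status"], DEPENDENCIES))
print(check_tool_constraints(["check_order_status", "lookup_ticket"], DEPENDENCIES))
```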

---

## Key Metrics for Agent Evaluation

| Metric | Good | Warning | Action Needed |
|--------|------|---------|---------------|
| Tool Accuracy | ≥95% | 80% to <95% | <80% |
| Escalation Accuracy | 100% | 90% to <100% | <90% |
| Workflow Completion | ≥95% | 80% to <95% | <80% |
| Multi-step Success | ≥90% | 70% to <90% | <70% |
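
The thresholds in this table can be turned into a small lookup so that evaluation runs report a status as well as a number. This is a minimal sketch: the metric keys and the function `classify_metric` are hypothetical names, and it assumes a value that sits exactly on a boundary belongs to the better band.

```python
from typing import Dict, Tuple

# (minimum for "Good", minimum for "Warning"), as fractions, from the table above.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "tool_accuracy": (0.95, 0.80),
    "escalation_accuracy": (1.00, 0.90),
    "workflow_completion": (0.95, 0.80),
    "multi_step_success": (0.90, 0.70),
}


def classify_metric(name: str, value: float) -> str:
    """Map a metric value in [0, 1] to Good, Warning, or Action Needed."""
    good_min, warning_min = THRESHOLDS[name]
    if value >= good_min:
        return "Good"
    if value >= warning_min:
        return "Warning"
    return "Action Needed"


for metric, value in [
    ("tool_accuracy", 0.97),
    ("escalation_accuracy", 0.95),
    ("multi_step_success", 0.65),
]:
    print(f"{metric}: {value:.0%} -> {classify_metric(metric, value)}")
```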

## Documentation Links

- [Netra Documentation](https://docs.getnetra.ai)
- [Agent Evaluation](https://docs.getnetra.ai/Evaluation/Agent-Evaluation)
- [Traces for Debugging](https://docs.getnetra.ai/Observability/Traces/Debugging)
- [Evaluation Framework](https://docs.getnetra.ai/Evaluation)

## See Also

- [Tracing LangChain Agents](/Cookbooks/observability/tracing-langchain-agents) - Add observability to agents
- [Custom Evaluator Patterns](/Cookbooks/evaluation/custom-evaluator-patterns) - Build custom evaluators
- [A/B Testing Configurations](/Cookbooks/evaluation/ab-testing-configurations) - Compare agent configurations