### Building & Evaluating Complex Agents with Strands and flotorch-eval

In this notebook, we'll walk through a complete example of building complex agents with Strands Agents and then evaluating their performance.
Steps we'll go through:
- **Define specialist agents**: Create individual agents that have specific tasks based on the use case.
- **Build orchestrator**: Create a master agent that intelligently delegates tasks to the appropriate specialist agents.
- **Tracing**: Use OpenTelemetry to capture the agent's step-by-step reasoning and tool calls.
- **Evaluate performance**: Use the captured traces to score the agent's performance against a ground-truth reference on multiple metrics.

#### Setup and dependencies

In [1]:
# Install dependencies
!pip install agentevals strands-agents strands-agents-tools strands-agents-builder nest_asyncio uv -q
!pip install yfinance matplotlib -q
!pip install beautifulsoup4 pandas requests -q
!pip install pandas numpy scikit-learn matplotlib seaborn joblib -q

In [None]:
# Import libraries
import asyncio
import queue
import warnings
import yfinance as yf
import pandas as pd
from typing import Sequence, List
from IPython.display import display
from strands.models.bedrock import BedrockModel
from strands.telemetry.tracer import get_tracer
from strands import Agent, tool
from langchain_aws import ChatBedrockConverse
from ragas.llms import LangchainLLMWrapper
from flotorch_eval.agent_eval.core.converter import TraceConverter
from opentelemetry.sdk.trace import TracerProvider, ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, BatchSpanProcessor


from flotorch_eval.agent_eval.core.evaluator import Evaluator
from flotorch_eval.agent_eval.metrics.base import BaseMetric
from flotorch_eval.agent_eval.metrics.langchain_metrics import (
    TrajectoryEvalWithLLMMetric
)
from flotorch_eval.agent_eval.metrics.ragas_metrics import (
    AgentGoalAccuracyMetric,
    ToolCallAccuracyMetric,
)
from flotorch_eval.agent_eval.metrics.tool_accuracy import ToolAccuracyMetric
from flotorch_eval.agent_eval.metrics.base import MetricConfig


warnings.filterwarnings("ignore")



#### Setting Up Tracing and Evaluation
To evaluate an agent, we first need to understand what it did. We'll use OpenTelemetry to "trace" the agent's execution. The code below sets up a custom QueueSpanExporter, which captures every step (span) of the agent's process and stores it in a queue for later analysis.

In [3]:
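# Create a custom exporter that stores spans in a queue
from opentelemetry.sdk.trace.export import SpanExportResult

class QueueSpanExporter(SpanExporter):
    """Span exporter that keeps finished spans in an in-memory queue."""

    def __init__(self):
        self.spans = queue.Queue()

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        for span in spans:
            self.spans.put(span)
        # Report success so the span processor knows the batch was handled
        return SpanExportResult.SUCCESS

    def clear(self) -> None:
        # Empty the underlying deque while holding the queue's lock
        with self.spans.mutex:
            self.spans.queue.clear()

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        # Nothing is buffered inside the exporter, so there is nothing to flush
        return True

    def shutdown(self) -> None:
        return None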

# Configure tracing with both console export and our queue export
# Create our queue exporter
queue_exporter = QueueSpanExporter()

# Get the Strands tracer
tracer = get_tracer(
    enable_console_export=True  # Keep console export for visibility
)

# Add our queue processor to the tracer's existing provider
tracer.tracer_provider.add_span_processor(BatchSpanProcessor(queue_exporter))
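
One detail worth keeping in mind: a batching span processor buffers finished spans and exports them later, so the queue can still be empty right after an agent run. The toy below is a standard-library stand-in (the `ToyBatchProcessor` class and the span names are illustrative, not part of OpenTelemetry or Strands) that shows why a flush may be needed before reading the queue. With the real SDK, the equivalent step would be calling `force_flush()` on the tracer provider.

```python
import queue
from typing import List


class ToyBatchProcessor:
    """Hypothetical stand-in: buffers span names and exports them only on flush."""

    def __init__(self, sink: "queue.Queue[str]") -> None:
        self._buffer: List[str] = []
        self._sink = sink

    def on_end(self, span_name: str) -> None:
        # A finished span is buffered, not exported immediately
        self._buffer.append(span_name)

    def force_flush(self) -> None:
        # Push everything buffered so far into the sink queue
        for name in self._buffer:
            self._sink.put(name)
        self._buffer.clear()


sink: "queue.Queue[str]" = queue.Queue()
processor = ToyBatchProcessor(sink)
processor.on_end("Model invoke")
processor.on_end("tool call")  # illustrative span name

before_flush = sink.qsize()  # nothing exported yet
processor.force_flush()
after_flush = sink.qsize()   # both spans are now in the queue
```

If an evaluation ever reports that no spans were captured, a missing flush is one plausible cause to rule out first.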

Next, we'll define a few utility functions to streamline the process of running the agent, converting the captured spans into a structured "trajectory", and displaying the evaluation results neatly.

In [4]:
# Utility functions to handle the spans, convert them to trajectories, and evaluate them

def get_all_spans(exporter: QueueSpanExporter) -> list:
    """Extracts all spans from the queue exporter."""
    spans = []
    while not exporter.spans.empty():
        spans.append(exporter.spans.get())
    return spans

def create_trajectory(spans: list):
    """Converts a list of spans into a structured Trajectory object."""
    converter = TraceConverter()
    trajectory = converter.from_spans(spans)
    return trajectory

def display_evaluation_results(results):
    """
    Displays evaluation results in a clean, readable pandas DataFrame.
    
    Args:
        results: The results object from the evaluator.
    """
    if not results or not results.scores:
        print("No evaluation results were generated.")
        return
    
    # Convert the list of MetricResult objects into a list of dictionaries
    data = []
    for metric in results.scores:
        details = f"Error: {metric.details['error']}" if 'error' in metric.details else "Success"
        data.append({
            "Metric": metric.name,
            "Score": metric.score,
            "Details": details
        })
    df = pd.DataFrame(data)
    display(df)
    
def initialize_evaluator(metrics: List[BaseMetric]) -> Evaluator:
    """Initializes the Evaluator with a given list of metric objects."""
    return Evaluator(metrics=metrics)

# Initialize the LLM used as the evaluation judge
bedrock_sonnet = ChatBedrockConverse(
    region_name="us-east-1",
    model_id="anthropic.claude-3-sonnet-20240229-v1:0"
)
llm_judge = LangchainLLMWrapper(bedrock_sonnet)
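
`get_all_spans` checks `empty()` and then calls the blocking `get()`. That works when only one consumer reads the queue, but a non-blocking drain is a slightly more defensive variant. The sketch below uses plain strings in place of spans, and the `drain` helper is a hypothetical name, not something the notebook or its libraries define.

```python
import queue
from typing import Any, List


def drain(q: "queue.Queue[Any]") -> List[Any]:
    """Remove and return every item currently in the queue without blocking."""
    items: List[Any] = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            # The queue is exhausted; return what was collected
            return items


spans: "queue.Queue[Any]" = queue.Queue()
for name in ("span-a", "span-b", "span-c"):
    spans.put(name)

drained = drain(spans)      # items come back in insertion (FIFO) order
drained_again = drain(spans)  # a second drain finds nothing
```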

Finally, this asynchronous function ties everything together. It takes an agent, a prompt, and a list of metrics, then performs the entire run-trace-evaluate-display cycle.

In [5]:
async def run_and_evaluate_agent(agent: Agent, prompt: str, metrics: List[BaseMetric]):
    """
    Runs an agent with a given prompt, captures its trace, evaluates it,
    and displays the results.

    Args:
        agent: The agent to be evaluated.
        prompt: The input prompt for the agent.
        metrics: A list of configured metrics for evaluation.
    """
    # 1. Clear any leftover spans from the exporter queue, then run the agent
    queue_exporter.clear()
    print("--- Running Agent ---")
    response = agent(prompt)
    print("\n--- Agent Final Response ---")
    print(response.message)

    # 2. Capture and convert the trace
    spans = get_all_spans(queue_exporter)
    if not spans:
        print("\nEvaluation failed: No spans were captured.")
        return

    trajectory = create_trajectory(spans)
    print("\n--- Captured Trajectory ---")
    print(trajectory)

    # 3. Initialize and run the evaluator
    evaluator = initialize_evaluator(metrics)
    print("\n--- Running Evaluation ---")
    results = await evaluator.evaluate(trajectory)

    # 4. Display results
    print("\n--- Evaluation Scores ---")
    display_evaluation_results(results)
    return results
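
Because `run_and_evaluate_agent` is a coroutine, something has to drive it on an event loop. `asyncio.run()` raises `RuntimeError` when a loop is already running in the current thread, which is normally the case inside a Jupyter kernel. There, a top-level `await` or a prior `nest_asyncio.apply()` is the usual workaround. The sketch below is a standard-library illustration with a stand-in coroutine (`fake_evaluation` and `loop_is_running` are hypothetical names).

```python
import asyncio


async def fake_evaluation() -> str:
    """Stand-in for run_and_evaluate_agent; it only yields control once."""
    await asyncio.sleep(0)
    return "evaluated"


def loop_is_running() -> bool:
    """Return True when called from inside an active asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


if loop_is_running():
    # Inside a notebook cell, use top-level await instead:
    #     outcome = await fake_evaluation()
    outcome = None
else:
    # In a plain script there is no running loop, so asyncio.run() is safe
    outcome = asyncio.run(fake_evaluation())
```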

#### Use Case 1: DataFrame manipulation agent

Let's start with a simple, single-agent use case. This agent will act as a data analyst. Its task is to write and execute a Python script using pandas to perform several common data manipulation tasks based on a single prompt.

The Goal: Evaluate how well a general-purpose agent can handle a multi-step, code-generation task from a single instruction.

In [6]:
dataframe_agent = Agent()

dataframe_prompt = """
Write a Python script using the pandas library that performs the following tasks:

- Create a sample DataFrame with the columns: 'Name', 'Age', and 'Salary'.
- Add a new column named 'Bonus' that is 10% of the corresponding 'Salary' value.
- Filter the DataFrame to include only rows where the 'Age' is greater than 30.
- Group the data by age brackets (e.g., 20s, 30s, 40s) and calculate the average 'Salary' and 'Bonus' for each group.

Execute the python script and show the output.

Requirements:
- Include clear inline comments to explain the logic.
- Add a docstring for the function, describing its purpose, parameters, and return value.
- Provide an example of how to use the function.
- List any external libraries that need to be installed with pip (if any).
- Include brief documentation describing how the code works and how to run it.
"""

To evaluate the agent, we need a "golden answer" to compare its output against. `REFERENCE_FINAL_ANSWER` holds that ideal output.

In [7]:
REFERENCE_FINAL_ANSWER = """import pandas as pd

def process_employee_data():

    # Create sample DataFrame
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [28, 32, 45, 31, 29],
        'Salary': [50000, 65000, 80000, 72000, 48000]
    }
    df = pd.DataFrame(data)

    # Add Bonus column (10% of Salary)
    df['Bonus'] = df['Salary'] * 0.10

    # Filter rows where Age > 30 (copy so adding a column later does not raise SettingWithCopyWarning)
    filtered_df = df[df['Age'] > 30].copy()

    # Define age brackets
    def assign_age_bracket(age):
        if age < 30:
            return '20s'
        elif 30 <= age < 40:
            return '30s'
        elif 40 <= age < 50:
            return '40s'
        else:
            return '50+'

    # Apply age bracket and group
    filtered_df['AgeBracket'] = filtered_df['Age'].apply(assign_age_bracket)
    grouped_df = filtered_df.groupby('AgeBracket').agg({
        'Salary': 'mean',
        'Bonus': 'mean'
    }).round(2)

    return grouped_df

# Example usage
result_df = process_employee_data()
print(result_df)"""
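
As a sanity check on the reference answer, the values it should print can be reproduced without pandas. The sketch below reuses the sample rows and bracket rules from the reference script. The helper names are illustrative, and it relies on the mean bonus being 10% of the mean salary, which holds because the bonus is a fixed fraction of salary.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Same sample rows as the reference script: (name, age, salary)
rows = [
    ("Alice", 28, 50000),
    ("Bob", 32, 65000),
    ("Charlie", 45, 80000),
    ("David", 31, 72000),
    ("Eve", 29, 48000),
]


def age_bracket(age: int) -> str:
    """Map an age to the same brackets the reference script uses."""
    if age < 30:
        return "20s"
    if age < 40:
        return "30s"
    if age < 50:
        return "40s"
    return "50+"


# Keep only rows with Age > 30, grouped by bracket
salaries_by_bracket: Dict[str, List[int]] = defaultdict(list)
for _name, age, salary in rows:
    if age > 30:
        salaries_by_bracket[age_bracket(age)].append(salary)

# Average salary and bonus (10% of salary) per bracket
expected = {
    bracket: {
        "Salary": round(mean(salaries), 2),
        "Bonus": round(mean(salaries) * 0.10, 2),
    }
    for bracket, salaries in sorted(salaries_by_bracket.items())
}
```

With these rows, only Bob, Charlie, and David pass the filter, so the 30s bracket averages a salary of 68,500 with a bonus of 6,850, and the 40s bracket has a salary of 80,000 with a bonus of 8,000.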

Now, we define the metrics for evaluation:

- **ToolAccuracyMetric**: Checks whether the agent's tool calls (in this case, code execution) were successful.
- **AgentGoalAccuracyMetric**: Uses our LLM judge to compare the agent's final answer to our reference answer.

In [8]:
dataframe_metrics = [
    ToolAccuracyMetric(),
    AgentGoalAccuracyMetric(
        llm=llm_judge,
        config=MetricConfig(
            metric_params={"reference_answer": REFERENCE_FINAL_ANSWER}
        )
    )
]

async def evaluate_dataframe_agent():
    return await run_and_evaluate_agent(dataframe_agent, dataframe_prompt, dataframe_metrics)

# Execute the evaluation
dataframe_results = asyncio.run(evaluate_dataframe_agent())

--- Running Agent ---
# Data Analysis with Pandas - Example Script

This script demonstrates how to create, manipulate, and analyze data using the pandas library in Python. It includes data creation, transformation, filtering, and grouping operations.

## Required Libraries
```bash
pip install pandas
```

## The Code

```python
import pandas as pd

def analyze_employee_data():
    """
    Creates a sample DataFrame with employee information, adds a bonus column,
    filters by age, and groups data by age brackets to calculate averages.
    
    Returns:
        Tuple containing:
        - The original DataFrame
        - DataFrame with bonus column
        - Filtered DataFrame (age > 30)
        - GroupBy object with average salary and bonus by age bracket
    """
    # Create a sample DataFrame with employee data
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Hannah'],
        'Age': [25, 32, 45, 28, 36, 52, 29, 41],
        'Salary': [500

|   | Metric              | Score | Details |
|---|---------------------|:-----:|---------|
| 0 | tool_accuracy       | 1.0   | Success |
| 1 | agent_goal_accuracy | 1.0   | Success |


#### Use Case 2: Financial advisor orchestrator

Now for a more complex and realistic scenario. We'll build a hierarchical agent system with a master orchestrator that delegates tasks to a team of specialist agents. This is a powerful pattern for building robust and scalable AI systems.

#### Define the Specialist Agent Tools
First, we define our three specialists. Each specialist is an Agent wrapped in a `@tool` decorator. This turns the entire agent into a callable tool that our orchestrator can use.

- **investment_research_assistant**: Knows all about stocks and ETFs. It has its own tool, `get_stock_price`, to fetch live data.
- **budget_optimizer_assistant**: Expert at creating and managing monthly budgets.
- **financial_planner_assistant**: Focuses on long-term strategic goals like retirement and buying a house.

In [9]:
# Define the tools necessary for the agents

model_research = BedrockModel(model_id="us.amazon.nova-pro-v1:0")

# Investment Research Assistant and its tool to get the stock prices
@tool
def get_stock_price(ticker: str) -> str:
    """Gets the current stock price for a given ticker symbol."""
    try:
        stock = yf.Ticker(ticker)
        price = stock.history(period="1d")['Close'].iloc[-1]
        return f"The current price of {ticker} is ${price:.2f}."
    except Exception as e:
        return f"Could not retrieve price for {ticker}. Error: {e}"
    
@tool
def investment_research_assistant(query: str) -> str:
    """Researches stocks, ETFs, bonds, and market trends for an investment query."""
    investment_researcher = Agent(
        model=model_research,
        system_prompt="""
        You are a financial research analyst who helps users explore various investment opportunities. 
        You specialize in providing insights into stocks, ETFs, mutual funds, bonds, and other instruments.
        You also highlight key market trends, risk factors, and historical performance.
        Your goal is to equip users with comprehensive, objective information to support their investment decisions.
        """,
        tools=[get_stock_price]
    )
    return investment_researcher(query).message

# Budget Optimizer Assistant
model_budget = BedrockModel(model_id="us.amazon.nova-lite-v1:0")

@tool
def budget_optimizer_assistant(query: str) -> str:
    """Creates and optimizes a monthly budget for the given query."""
    try:
        budget_coach = Agent(
            model=model_budget,
            system_prompt="""
            You are a smart budgeting assistant who helps users manage and optimize their monthly expenses.
            You analyze income, spending patterns, and savings goals, and suggest personalized recommendations 
            to cut unnecessary costs and improve savings. Your goal is to help users maintain a healthy financial balance.
            """,
        )
        return budget_coach(query).message
    except Exception as e:
        return f"The budget optimizer assistant encountered an error and could not process the request. Error: {e}"

# Financial Planner Assistant
model_planner = BedrockModel(model_id="us.amazon.nova-micro-v1:0")

@tool
def financial_planner_assistant(query: str) -> str:
    """Builds a long-term financial plan (debt, home purchase, retirement) for the given query."""
    financial_planner = Agent(
        model=model_planner,
        system_prompt="""
        You are a certified financial advisor bot who helps users create customized financial plans 
        based on their goals, income, age, and risk tolerance. 
        You guide them through budgeting, saving, debt management, insurance, and retirement planning.
        Your goal is to provide practical, step-by-step advice to help users achieve financial stability and growth.
        """,
    )
    return financial_planner(query).message
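
Stripped of the LLM calls, the agents-as-tools pattern amounts to a registry of named callables that an orchestrator invokes in sequence. The sketch below is a plain-Python illustration only: the stub functions and the `delegate` helper are hypothetical, and the plan is hard-coded, whereas in the notebook the orchestrator's model decides which specialist to call and with what query.

```python
from typing import Callable, Dict, List, Tuple

Specialist = Callable[[str], str]


def budget_stub(query: str) -> str:
    """Stand-in for the budget specialist agent."""
    return f"[budget] {query}"


def research_stub(query: str) -> str:
    """Stand-in for the investment research specialist agent."""
    return f"[research] {query}"


def planner_stub(query: str) -> str:
    """Stand-in for the financial planner specialist agent."""
    return f"[planner] {query}"


# Registry keyed by the same tool names the notebook uses
SPECIALISTS: Dict[str, Specialist] = {
    "budget_optimizer_assistant": budget_stub,
    "investment_research_assistant": research_stub,
    "financial_planner_assistant": planner_stub,
}


def delegate(plan: List[Tuple[str, str]]) -> List[str]:
    """Call each named specialist in order and collect the replies."""
    return [SPECIALISTS[tool_name](query) for tool_name, query in plan]


replies = delegate([
    ("budget_optimizer_assistant", "monthly budget"),
    ("financial_planner_assistant", "retire by 55"),
])
```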

#### Define Reference Data for Evaluation
For a multi-step orchestrator, we need more than just a final answer. We also need `REFERENCE_TRAJECTORY_OUTPUTS`, which defines the ideal path of reasoning and tool calls the agent should take.

In [10]:
REFERENCE_FINAL_ANSWER = """
# Your Comprehensive Financial Plan

Based on your profile as a 30-year-old earning $6,000/month, here is a step-by-step plan to pay off your debt, invest for growth, buy a home, and retire early.

## 1. Monthly Budget Allocation

Here is an optimized budget designed to aggressively pay down your student loan while funding your investments:

| Category              | Allocation | Monthly Amount | Notes                                            |
|-----------------------|:----------:|:--------------:|--------------------------------------------------|
| **Needs** | **57%** | **$3,420** | Housing, Transportation, Food, Insurance         |
| **Debt Repayment** | **20%** | **$1,200** | Aggressively targets your $25k student loan      |
| **Savings & Invest** | **23%** | **$1,380** | For your house, retirement, and emergency fund   |

* **Action:** There is a remaining balance of **$220/month**. It is recommended to apply this directly to your student loan, bringing the total payment to $1,420/month.

## 2. Investment Strategy

With a monthly investment of **$780**, here is a strategy focused on long-term growth.

**ETF Performance Comparison:**
* **VOO (Vanguard S&P 500):** Currently trading at **$485.50**. As a broad market index, it offers stability and has historically provided strong, steady returns. It's an excellent core holding.
* **QQQ (Invesco QQQ Trust):** Currently trading at **$452.10**. Being tech-focused, it offers higher growth potential but comes with greater volatility.

**Recommended Sample Portfolio for a 30-Year-Old:**
* **50% Large-Cap U.S. Stocks:** (e.g., 30% in VOO, 20% in QQQ)
* **20% International Stocks:** (e.g., VXUS or similar)
* **20% Bonds:** (e.g., BND for stability)
* **10% Small-Cap U.S. Stocks:** (e.g., IJR for growth)

## 3. Long-Term Plan & Timeline

This roadmap integrates the budget and investment strategy to meet your goals.

**Goal 1: Pay Off Student Loan (Balance: $25,000 at 5.5%)**
* **Action:** Pay **$512.50/month** (minimum + extra from budget).
* **Timeline:** Your loan will be fully paid off in **~4 years** (by age 34).
* **Benefit:** You will save approximately **$6,875** in interest.

**Goal 2: Buy a House (Target: $450,000 home in 7 years)**
* **Action:** Save **$800/month** towards the **$90,000** (20%) down payment.
* **Timeline:** This will take approximately **9.3 years**.
* **To meet your 7-year goal:** After your student loan is paid off in 4 years, you must redirect that $512.50 payment towards your house fund, increasing your savings to $1,312.50/month. This will allow you to reach your goal on time.

**Goal 3: Retire by Age 55**
* **Action:** Consistently invest **$750/month** in tax-advantaged accounts (401(k), IRA).
* **Projection:** With a 7% average annual return, your dedicated retirement contributions could grow to over **$310,000** in 25 years. This does not include additional investments made after your house is purchased.

**Recommendation:** Review this plan annually to adjust for changes in income or goals.
"""

REFERENCE_TRAJECTORY_OUTPUTS = [
    {
        "role": "user",
        "content": """
        I'm a 30-year-old, earning $6,000 per month after taxes. I have $15,000 in a high-yield savings account and a $25,000 student loan with a 5.5% interest rate.
        I need a comprehensive financial plan with three parts:
        1.  A monthly budget that helps me aggressively pay down my loan while still investing.
        2.  An investment strategy for long-term growth. Please compare the recent performance of the VOO and QQQ ETFs and suggest a sample portfolio for my age.
        3.  A long-term plan outlining the steps to buy a $450,000 house in the next 7 years and retire by age 55.
        Can you help create this for me?
        """,
    },
    {
        "role": "assistant",
        "content": "I will create a comprehensive financial plan by addressing each part of your request. First, I'll generate an optimized monthly budget.",
        "tool_calls": [
            {
                "function": {
                    "name": "budget_optimizer_assistant",
                    "arguments": '{"query": "I\'m 30 years old with $6,000 monthly after-tax income, $15,000 in savings, and a $25,000 student loan at 5.5%. Create a budget to aggressively pay the loan while investing."}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "content": """
        Here is a detailed monthly budget to help you aggressively pay down your student loan while investing:
        - **Income:** $6,000
        - **Housing (30%):** $1,800
        - **Transportation (10%):** $600
        - **Food (12%):** $720
        - **Personal (10%):** $600
        - **Insurance (5%):** $300
        - **Debt Payments (20%):** $1,200 (for student loan)
        - **Savings (10%):** $600 (for emergency fund and house fund)
        - **Investments (13%):** $780 (for retirement and other goals)
        - **Summary:** Your total allocated expenses are $5,780, leaving a $220 balance that can be used for additional debt repayment or savings.
        """,
        "tool_calls": []
    },
    {
        "role": "assistant",
        "content": "With the budget established, you can invest $780 per month. Now, I will analyze the VOO and QQQ ETFs and suggest a portfolio.",
        "tool_calls": [
            {
                "function": {
                    "name": "investment_research_assistant",
                    "arguments": '{"query": "I am 30 with a high risk tolerance and can invest $780/month. Compare the recent performance of VOO and QQQ ETFs and suggest a diversified portfolio."}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "content": """
        <thinking>
        The user wants a comparison of VOO and QQQ and a sample portfolio. I must use the `get_stock_price` tool for both tickers to provide the latest data before giving my analysis.
        </thinking>
        """,
        "tool_calls": [
            {"function": {"name": "get_stock_price", "arguments": '{"ticker": "VOO"}'}},
            {"function": {"name": "get_stock_price", "arguments": '{"ticker": "QQQ"}'}}
        ]
    },
    {
        "role": "tool",
        "content": "The current price of VOO is $485.50.",
        "tool_calls": []
    },
    {
        "role": "tool",
        "content": "The current price of QQQ is $452.10.",
        "tool_calls": []
    },
    {
        "role": "tool",
        "content": """
        Based on the latest data:
        - VOO is trading at $485.50. It tracks the S&P 500, offering broad market exposure and stability, making it a great core holding.
        - QQQ is trading at $452.10. It tracks the Nasdaq-100 and is tech-heavy, offering higher growth potential with higher risk.
        
        For your age and goals, a sample portfolio could be:
        - 50% Large-Cap U.S. Stocks (e.g., 30% VOO, 20% QQQ)
        - 20% International Stocks (e.g., VXUS)
        - 20% Bonds (e.g., BND)
        - 10% Small-Cap U.S. Stocks (e.g., IJR)
        """,
        "tool_calls": []
    },
    {
        "role": "assistant",
        "content": "With the budget and investment strategy set, I will now create a long-term plan to integrate everything and map out your path to buying a house and retiring early.",
        "tool_calls": [
            {
                "function": {
                    "name": "financial_planner_assistant",
                    "arguments": '{"query": "With a $6k/month income, $25k loan, and $1,380/month for savings/investing, create a timeline to buy a $450k house in 7 years and retire by 55."}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "content": """
        Here is a step-by-step plan:
        1. **Emergency Fund:** Your $15,000 savings already covers 3-6 months of expenses. Maintain this.
        2. **Student Loan Payoff:** Paying $512.50/month will clear your $25k loan in approximately 4 years, saving you over $6,800 in interest.
        3. **Home Savings:** Saving $800/month for a $90,000 down payment will take about 9.3 years. To meet your 7-year goal, you must increase this contribution after the student loan is paid off.
        4. **Retirement Savings:** Investing $750/month at a 7% average return can result in a portfolio of over $310,000 by age 55.
        """,
        "tool_calls": []
    },
    {
        "role": "assistant",
        "content": REFERENCE_FINAL_ANSWER
    }
]
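
A reference trajectory this long is easy to get subtly wrong, so it can help to list the tool calls it implies before handing it to a metric. The helper below is a sketch that assumes the message shape used above (`tool_calls` entries with a `function` holding `name` and a JSON string in `arguments`). The `expected_tool_sequence` name and the shortened sample trajectory are illustrative.

```python
import json
from typing import Any, Dict, List


def expected_tool_sequence(trajectory: List[Dict[str, Any]]) -> List[str]:
    """Return tool names in call order, checking that arguments are valid JSON."""
    names: List[str] = []
    for message in trajectory:
        for call in message.get("tool_calls", []):
            function = call["function"]
            json.loads(function["arguments"])  # raises ValueError if malformed
            names.append(function["name"])
    return names


# Shortened sample in the same shape as REFERENCE_TRAJECTORY_OUTPUTS
sample_trajectory = [
    {"role": "user", "content": "Plan my finances."},
    {
        "role": "assistant",
        "content": "Budget first.",
        "tool_calls": [
            {"function": {"name": "budget_optimizer_assistant",
                          "arguments": '{"query": "budget"}'}}
        ],
    },
    {"role": "tool", "content": "Budget ready.", "tool_calls": []},
    {
        "role": "tool",
        "content": "Fetching prices.",
        "tool_calls": [
            {"function": {"name": "get_stock_price", "arguments": '{"ticker": "VOO"}'}},
            {"function": {"name": "get_stock_price", "arguments": '{"ticker": "QQQ"}'}},
        ],
    },
    {"role": "assistant", "content": "Final plan."},
]

tool_names = expected_tool_sequence(sample_trajectory)
```

Running the same helper on `REFERENCE_TRAJECTORY_OUTPUTS` gives a quick view of the delegation order the reference expects.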

#### Build and Evaluate the Orchestrator
Now, we create the orchestrator agent. Its system prompt is crucial: it explicitly tells the agent about its available specialist tools and when to use them.

In [11]:
# Define orchestrator system prompt with clear tool selection guidance

MAIN_SYSTEM_PROMPT = """
You are a master financial orchestrator. Your primary role is to understand a user's financial query and delegate tasks to the appropriate specialist agent.

Here are your available tools and when to use them:
- `budget_optimizer_assistant`: Use for questions about creating, managing, or optimizing a monthly budget, tracking expenses, and analyzing spending.
- `investment_research_assistant`: Use for questions about investment opportunities, stocks, ETFs, bonds, market trends, and historical performance.
- `financial_planner_assistant`: Use for long-term strategic planning like retirement, home buying, debt management, and integrating financial goals.

For complex queries, you MUST call multiple tools in a logical sequence. Synthesize the information from all tool calls into a final, cohesive answer.
"""
financial_orchestrator = Agent(
    system_prompt=MAIN_SYSTEM_PROMPT,
    tools=[investment_research_assistant, budget_optimizer_assistant, financial_planner_assistant]
)

financial_metrics = [
    ToolCallAccuracyMetric(),
    AgentGoalAccuracyMetric(
        llm=llm_judge,
        config=MetricConfig(metric_params={"reference_answer": REFERENCE_FINAL_ANSWER})
    ),
    TrajectoryEvalWithLLMMetric(
        llm=bedrock_sonnet,
        config=MetricConfig(metric_params={"reference_outputs": REFERENCE_TRAJECTORY_OUTPUTS})
    )
]

# Define the complex user query
user_query_financial = """
I'm a 30-year-old, earning $6,000 per month after taxes. I have $15,000 in a high-yield savings account and a $25,000 student loan with a 5.5% interest rate.

I need a comprehensive financial plan with three parts:
1.  A monthly budget that helps me aggressively pay down my loan while still investing.
2.  An investment strategy for long-term growth. Please compare the recent performance of the VOO and QQQ ETFs and suggest a sample portfolio for my age.
3.  A long-term plan outlining the steps to buy a $450,000 house in the next 7 years and retire by age 55.

Can you help create this for me?
"""

# Run the orchestrator with the user's query
async def evaluate_financial_agent():
    return await run_and_evaluate_agent(financial_orchestrator, user_query_financial, financial_metrics)

# Execute the evaluation
financial_results = asyncio.run(evaluate_financial_agent())

--- Running Agent ---
I'll help you create a comprehensive financial plan by consulting with our specialist tools for each part of your request. Let's address each component systematically.
Tool #1: budget_optimizer_assistant
{
    "name": "Model invoke",
    "context": {
        "trace_id": "0xfb8d429ee3291515d711a0bff15d8d54",
        "span_id": "0xe675a00e71a0575a",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": "0x9870f87db60a195f",
    "start_time": "2025-06-11T15:56:59.780638Z",
    "end_time": "2025-06-11T15:57:04.346491Z",
    "status": {
        "status_code": "OK"
    },
    "attributes": {
        "gen_ai.event.start_time": "2025-06-11T15:56:59.780646+00:00",
        "gen_ai.system": "strands-agents",
        "agent.name": "Strands Agent",
        "gen_ai.agent.name": "Strands Agent",
        "gen_ai.prompt": "[{\"role\": \"user\", \"content\": [{\"text\": \"\\nI'm a 30-year-old, earning $6,000 per month after taxes. I have $15,000 in a 

|   | Metric                   | Score | Details     |
|---|--------------------------|:-----:|-------------|
| 0 | tool_call_accuracy       | 2.2   | Success     |
| 1 | agent_goal_accuracy      | 1.0   | Success     |
| 2 | trajectory_eval_with_llm | 0.0   | Error: 'id' |
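
Two rows in this output deserve a second look before the scores are trusted: one metric reports an error, and one score is above 1. The sketch below flags such rows automatically. It assumes scores are meant to fall in the range 0 to 1, which the notebook does not state, and the `needs_review` helper is a hypothetical name.

```python
from typing import Any, Dict, List

# Rows copied from the evaluation output above
rows: List[Dict[str, Any]] = [
    {"Metric": "tool_call_accuracy", "Score": 2.2, "Details": "Success"},
    {"Metric": "agent_goal_accuracy", "Score": 1.0, "Details": "Success"},
    {"Metric": "trajectory_eval_with_llm", "Score": 0.0, "Details": "Error: 'id'"},
]


def needs_review(row: Dict[str, Any]) -> bool:
    """Flag rows whose metric errored or whose score falls outside [0, 1]."""
    out_of_range = not 0.0 <= row["Score"] <= 1.0  # assumed valid range
    errored = row["Details"].startswith("Error")
    return out_of_range or errored


flagged = [row["Metric"] for row in rows if needs_review(row)]
```

Here the check flags `tool_call_accuracy` and `trajectory_eval_with_llm`, so only the goal-accuracy score can be read at face value without further investigation.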
