# Agent Runs - Evaluation

In this guide, we build a simple agent that can use two tools.  
We run the agent against two user queries and monitor the agent runs with Literal AI.  
Finally, we retrieve the agent runs and count the tool calls each one made.

- [Necessary imports](#imports)
- [Create agent prompt](#create-agent-prompt)
- [Define available tools](#define-available-tools)
- [Agent logic](#agent-logic)
- [Run agent against two questions](#run-agent)
- [Create a dataset of agent runs](#create-dataset)
- [Evaluate agent runs](#evaluate-agent-runs)

<a id="imports"></a>
## Necessary imports

Make sure to define `LITERAL_API_KEY` and `OPENAI_API_KEY` in your `.env` file.

In [1]:
from literalai import LiteralClient
from dotenv import load_dotenv
from openai import OpenAI

import asyncio
import json

load_dotenv()

client = LiteralClient()
openai_client = OpenAI()

client.instrument_openai()

<a id="create-agent-prompt"></a>
## Create agent prompt

This prompt template is the starting prompt for our agent.  
It simply passes on the user's question and mentions that the assistant should be mindful of the tools at its disposal.
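
To make the `{{question}}` placeholder concrete, here is a minimal sketch of how such a template could be filled in. The `render_template` helper is hypothetical and written only for illustration; it is not the Literal AI SDK's implementation of `format_messages`, which may behave differently.

```python
import re

def render_template(template_messages, variables):
    """Fill {{name}} placeholders in the content of each message."""
    def substitute(match):
        key = match.group(1).strip()
        # Leave unknown placeholders untouched rather than failing.
        return str(variables.get(key, match.group(0)))

    return [
        {**message, "content": re.sub(r"\{\{(.*?)\}\}", substitute, message["content"])}
        for message in template_messages
    ]

example_messages = [
    {"role": "system", "content": "Keep it short."},
    {"role": "user", "content": "{{question}}"},
]
rendered = render_template(example_messages, {"question": "What's the weather in Paris, TX?"})
print(rendered[1]["content"])
```

The system message has no placeholder and passes through unchanged; only the user message receives the question.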

In [2]:
# TODO: changed the messages 
PROMPT_NAME = "Agent prompt template"
template_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that always answers questions. Keep it short. Answer the question if you can, otherwise leverage tools at your disposal."
    },
    {
        "role": "user",
        "content": "{{question}}"
    }
]
prompt = client.api.get_or_create_prompt(name=PROMPT_NAME, template_messages=template_messages)

<a id="define-available-tools"></a>
## Define available tools

In the next cell, we define the tools, and their JSON definitions, which we provide to the agent. We have two tools:
- `get_current_weather`
- `get_home_town`

We annotate the two functions with Literal AI `step` decorators of type `tool`, which lets us monitor the tool calls made by our agent. We can then evaluate runs on the number and type of tool calls.

In [3]:
@client.step(type="tool")
async def get_current_weather(location, unit="fahrenheit"):
    """Get the current weather in a given location."""
    weather_info = {
        "location": location,
        "temperature": "72",
        "unit": unit,
        "forecast": ["sunny", "windy"],
    }

    return json.dumps(weather_info)

@client.step(type="tool")
async def get_home_town(person: str) -> str:
    """Get the hometown of a person"""
    return "Ajaccio, Corsica"


# JSON tool definitions provided to the LLM.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_home_town",
            "description": "Get the home town of a specific person",
            "parameters": {
                "type": "object",
                "properties": {
                    "person": {
                        "type": "string",
                        "description": "The name of a person (first and last names) to identify."
                    }
                },
                "required": ["person"]
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]
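
Because the JSON definitions and the Python functions are maintained separately, they can drift apart. The following is a rough sketch of a consistency check; `check_tool_definition` is a hypothetical helper, and the example uses an undecorated stand-in for `get_home_town`, since it is an assumption that a decorated function would still expose its original signature to `inspect`.

```python
import inspect

def check_tool_definition(definition, func):
    """Return a list of mismatches between a JSON tool definition and a function."""
    problems = []
    spec = definition["function"]
    params = spec["parameters"]
    signature = inspect.signature(func)

    if spec["name"] != func.__name__:
        problems.append(f"name mismatch: {spec['name']} != {func.__name__}")

    # Every declared property must be an actual parameter of the function.
    for name in params["properties"]:
        if name not in signature.parameters:
            problems.append(f"unknown parameter: {name}")

    # Every parameter without a default must be listed as required.
    for name, parameter in signature.parameters.items():
        has_no_default = parameter.default is inspect.Parameter.empty
        if has_no_default and name not in params.get("required", []):
            problems.append(f"missing required parameter: {name}")
    return problems

# Stand-in with the same signature as the tool above (no Literal AI decorator).
async def get_home_town(person: str) -> str:
    return "Ajaccio, Corsica"

home_town_definition = {
    "type": "function",
    "function": {
        "name": "get_home_town",
        "parameters": {
            "type": "object",
            "properties": {"person": {"type": "string"}},
            "required": ["person"],
        },
    },
}

problems = check_tool_definition(home_town_definition, get_home_town)
print(problems)
```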

<a id="agent-logic"></a>
## Agent logic

For the agent logic, we simply repeat the following pattern (at most 5 times):
- send the conversation to the LLM, with the tools made available
- execute the tools if the LLM requests them; otherwise, return its message
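
The pattern above can be sketched without any network calls. In this illustration, `run_loop`, `fake_llm`, and `fake_tools` are hypothetical stand-ins, and messages are plain dicts rather than OpenAI response objects; it only shows the control flow, not the actual agent.

```python
def run_loop(ask_llm, execute_tools, messages, max_iterations=5):
    """Ask the model, run any requested tools, and repeat until it answers."""
    answer = None
    for _ in range(max_iterations):
        reply = ask_llm(messages)
        messages.append(reply)
        answer = reply.get("content")
        if not reply.get("tool_calls"):
            break  # No tool requested: the reply is the final answer.
        messages.extend(execute_tools(reply["tool_calls"]))
    return answer

def fake_llm(messages):
    # Answer once a tool result is present; otherwise request a tool call.
    if any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "It is sunny.", "tool_calls": None}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [{"id": "call_1", "name": "get_current_weather"}],
    }

def fake_tools(tool_calls):
    return [
        {"role": "tool", "tool_call_id": call["id"], "content": "sunny"}
        for call in tool_calls
    ]

history = [{"role": "user", "content": "What's the weather in Paris, TX?"}]
final_answer = run_loop(fake_llm, fake_tools, history)
print(final_answer)
```

The iteration cap guards against a model that keeps requesting tools and never produces a final message.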

In [4]:
async def run_multiple(tool_calls):
    """
    Execute multiple tool calls asynchronously.
    """
    available_tools = {
        "get_current_weather": get_current_weather,
        "get_home_town": get_home_town
    }

    async def run_single(tool_call):
        function_name = tool_call.function.name
        function_to_call = available_tools[function_name]
        function_args = json.loads(tool_call.function.arguments)

        function_response = await function_to_call(**function_args)
        return {
            "tool_call_id": tool_call.id,
            "role": "tool",
            "name": function_name,
            "content": function_response,
        }

    # Run tool calls in parallel.
    tool_results = await asyncio.gather(
        *(run_single(tool_call) for tool_call in tool_calls)
    )
    return tool_results

@client.step(type="run")
async def run_agent(user_query: str):
    number_iterations = 0
    messages = prompt.format_messages(question=user_query)

    answer_message_content = None

    while number_iterations < 5:
        completion = openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=messages,
            tool_choice="auto",
            tools=tools
        )
        message = completion.choices[0].message
        messages.append(message)
        answer_message_content = message.content
        
        if not message.tool_calls:
            break

        tool_results = await run_multiple(message.tool_calls)
        messages.extend(tool_results)
        
        number_iterations += 1
    return answer_message_content

<a id="run-agent"></a>
## Run agent against two questions

In [5]:
# TODO: why are the to_score at the thread, and not the run.
# The first question should make 2 tool calls (one to get the hometown, the other to get the weather).
# The second question should trigger a single tool call. Both are used for evaluation in the last cell.
questions = [
    "What's the weather in Napoleon's hometown?",
    "What's the weather in Paris, TX?",
]

async def main():
    for idx, question in enumerate(questions):
        with client.thread(name=f"Question {idx+1}", tags=["to_score"]) as thread:
            client.message(content=question, type="user_message", name="User")
            answer = await run_agent(question)
            client.message(content=answer, type="assistant_message", name="My Assistant")

await main()

# Network requests by the SDK are performed asynchronously.
# Invoke flush() to guarantee the completion of all requests prior to the process termination.
# WARNING: If you run a continuous server, you should not use this method.
client.flush()

Here is what the thread details look like in Literal AI:

![image.png](attachment:c9f88345-ceb2-4f18-9857-98ef0a668bc5.png)

<a id="create-dataset"></a>
## Create a dataset of agent runs

Let's fetch the two threads corresponding to the two questions we ran above.

In [6]:
# TODO: why use get_threads, but then only filter the runs.
# TODO:get_steps of type run, with to_score
threads = client.api.get_threads(filters=[{
    "field": "tags",
    "operator": "in",
    "value": ["to_score"]
}], step_types_to_keep=["run"]).data

print(f"Number of fetched threads: {len(threads)}")

Number of fetched threads: 2


In [28]:
threads[0].steps[0].__dict__
# TODO: there are no intermediary steps. How do i get intermediary steps? is it meant? 

{'id': '9c96bc6d-416f-42b6-9dc5-85765c2e2e05',
 'start_time': '2024-05-10T12:50:14.843',
 'name': 'run_agent',
 'type': 'run',
 'processor': None,
 'thread_id': '0e49361e-aef6-49f1-a6fd-4b4b95b75e44',
 'parent_id': None,
 'input': {'args': ["What's the weather in Paris, TX?"], 'kwargs': {}},
 'error': None,
 'output': {'content': 'The current weather in Paris, TX is sunny and windy with a temperature of 72°F.'},
 'metadata': {},
 'tags': None,
 'end_time': '2024-05-10T12:50:19.114',
 'created_at': '2024-05-10T12:50:17.473'}

In [22]:
# TODO: shouldn't there be a __repr__ function
threads[0].steps[0]

<literalai.step.Step at 0x10ca32480>

In [24]:
threads[0].steps[0].to_dict()

{'id': '9c96bc6d-416f-42b6-9dc5-85765c2e2e05',
 'metadata': {},
 'parentId': None,
 'startTime': '2024-05-10T12:50:14.843',
 'endTime': '2024-05-10T12:50:19.114',
 'type': 'run',
 'threadId': '0e49361e-aef6-49f1-a6fd-4b4b95b75e44',
 'error': None,
 'input': {'args': ["What's the weather in Paris, TX?"], 'kwargs': {}},
 'output': {'content': 'The current weather in Paris, TX is sunny and windy with a temperature of 72°F.'},
 'generation': None,
 'name': 'run_agent',
 'tags': None,
 'scores': [],
 'attachments': []}

Each thread contains a single step of type `run`, which we add to a dataset.

In [25]:
dataset = client.api.create_dataset(name="Agent Runs")

for thread in threads:
    # Only a single agent run in each thread.
    dataset.add_step(step_id=thread.steps[0].id)


<a id="evaluate-agent-runs"></a>
## Evaluate agent runs

We can create an experiment referencing the two questions we ran against our agent.  

The agent is incorrect if it calls `get_current_weather` first and `get_home_town` afterwards.
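
This ordering rule can be expressed as a small check. The sketch below assumes the tool names of a run have already been extracted, in call order, into a list of strings; `tool_order_is_valid` is a hypothetical helper and not part of the Literal AI SDK.

```python
def tool_order_is_valid(tool_names):
    """Return False if get_current_weather is called before a later get_home_town."""
    seen_weather = False
    for name in tool_names:
        if name == "get_current_weather":
            seen_weather = True
        elif name == "get_home_town" and seen_weather:
            return False
    return True

print(tool_order_is_valid(["get_home_town", "get_current_weather"]))
print(tool_order_is_valid(["get_current_weather", "get_home_town"]))
```

A run that only calls `get_current_weather`, as expected for the second question, passes the check.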


In [29]:
# TODO: cannot relate.
experiment = dataset.create_experiment(name="Tool use", params={})

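for item in dataset.items:
    # intermediary_steps may be None, and a step dict may lack a "type" key,
    # so use .get() instead of indexing to avoid a KeyError.
    intermediary_tool_calls = [
        step for step in (item.intermediary_steps or [])
        if step.get("type") == "tool"
    ]

    scores = [{
        "name": "Tool calls",
        "type": "AI",
        "value": len(intermediary_tool_calls)
    }]

    experiment_item = {
        "datasetItemId": item.id,
        "scores": scores,
        # Assumption: the dataset item exposes the run's input as `item.input`.
        # Reading it from the item avoids relying on dataset order matching `questions`.
        "input": item.input,
        "output": { "tool_calls": intermediary_tool_calls }
    }
    experiment.log(experiment_item)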

The experiment can then serve as a baseline to compare against when you change the application.

By running the two dataset items' inputs against newer versions of your agent and launching a new experiment, you can
easily compare your runs and quickly see the impact on those two questions:

![exp.png](attachment:60aeab0a-ce3d-4cc1-a835-264650d63385.png)
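
As a rough illustration of such a comparison, the sketch below diffs per-question tool-call counts between two experiments. The baseline counts follow the expectations stated earlier (2 calls for the first question, 1 for the second); the candidate counts are made up for the example, and `compare_tool_calls` is a hypothetical helper rather than an SDK function.

```python
def compare_tool_calls(baseline, candidate):
    """Return the per-question change in tool-call count (candidate minus baseline)."""
    return {
        question: candidate[question] - baseline[question]
        for question in baseline
        if question in candidate
    }

baseline_counts = {
    "What's the weather in Napoleon's hometown?": 2,
    "What's the weather in Paris, TX?": 1,
}
# Illustrative values only, not measured results.
candidate_counts = {
    "What's the weather in Napoleon's hometown?": 3,
    "What's the weather in Paris, TX?": 1,
}

differences = compare_tool_calls(baseline_counts, candidate_counts)
print(differences)
```

A positive difference means the newer agent version made more tool calls for that question than the baseline did.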