# Intro

In this post I will share some thoughts on using LLMs with function calling.
I'm going to do this in the context of a "synthetic" problem with synthetic data.
This "fake" problem is similar to some other projects
I've been working on recently. I'm also going to compare OpenAI and Anthropic in
terms of their APIs, as well as model performance through a custom evaluation.

# Problem Description

All of this data has been synthetically created. I was using Anthropic's claude-3-5-sonnet in the browser and just hacking together some prompts and copying/pasting data. It could have been streamlined, but unfortunately I didn't really document this part.

Suppose there is an ecommerce platform connected to backend APIs, databases, etc. The backend data contains various metrics, brands, and sales channels. Also assume that the platform has a front end, allowing end users to create dashboards and reports with the data. Then suppose we want to add a natural language chat interface so users can ask questions about this data and get answers. Why do this? Because leadership is excited about generative AI and wants to add chatbots and AI to the platform to show off some AI "magic".

Now let's go through some of the data and explain it.

In [13]:
import pandas as pd
from ecommerce import questions, ecommerce_metrics, sales_channels, brands
from itables import show

## Metrics

First there are the ecommerce metrics. When a user asks a question it will be about at least one of these metrics.
The user will reference these metrics with some natural description, possibly using wording similar to the `name` or `description` columns below.
Each metric also has an associated `enum` value, which is used when calling the backend APIs. There are about 150 metrics.

In [16]:
ecommerce_metrics_df = pd.DataFrame(ecommerce_metrics)
show(ecommerce_metrics_df)

name,enum,description


## Brands

Then there is the concept of brands. I have generated around 130 brands, and similar to metrics there are the fields `name`, `description`, and `enum`.

In [18]:
brands_df = pd.DataFrame(brands)
show(brands_df)

name,enum,description


## Sales Channels

Finally, we have the sales channels. These are the different channels through which sales come in. Each sales channel
has the same associated fields. The number of sales channels is designed to be much smaller than the number of metrics and brands.

In [19]:
sales_channels_df = pd.DataFrame(sales_channels)
show(sales_channels_df)

name,enum,description


## Example User Questions

Given this data, I used Anthropic's claude-3-5-sonnet to generate some questions users could ask.
The questions can mention the metrics, brands, sales channels, as well as a time dimension.
For each question there is the "ground truth" expected metrics, brands, sales channels, and time range.
This ground truth will be used later on when we do evaluation. Just scroll to the right in the table below
to see the other fields/columns.

In [23]:
questions_df = pd.DataFrame(questions)
show(questions_df)

question,expected_metric,expected_brands,expected_sales_channels,current_period_start_date,current_period_end_date


# OpenAI and Anthropic APIs

Now that we understand the problem, let's spend a bit of time going through
some similarities and differences between the OpenAI and Anthropic APIs when using their LLMs for
inference and function calling. I am quite familiar with the [OpenAI API](https://platform.openai.com/docs/api-reference/chat) because I have been using it for more than a year. However, Anthropic's API is quite new to me.
I think this is likely true for many people building with LLMs. 
But recently I have been using Anthropic's latest model, claude-3-5-sonnet, and it "feels" great.

Anthropic's [documentation](https://docs.anthropic.com/en/api/getting-started) is great. Another great way to learn
about the Anthropic Python API is to read the source code for [answer.ai](https://www.answer.ai/)'s wrapper [Claudette](https://x.com/jeremyphoward/status/1805062541158343018).
Usually I would not point people to a wrapper to learn the underlying API, but in this case the [source code for Claudette](https://claudette.answer.ai/core.html) is a readable notebook that walks you through all the code. I found this to be an amazing resource for learning the Anthropic Python SDK.

## My Own Wrapper with an Emphasis on the "Tool Calling Loop"

I have written my own wrapper classes around some functionality of OpenAI and Anthropic.
Here are some reasons why I did this.

- The best way to learn is to write it yourself. The best wrapper is the one you write and test yourself, and fully understand. 
- I wanted to write a wrapper that was focused on tool/function calling and provide a similar interface between OpenAI and Anthropic.
- I wanted to add in some features specific to tool calling such as parallel calling, and followup calls (tool calling loop).

You don't need to understand my wrapper to understand this blog post. At the end of the day it's just using the OpenAI and Anthropic
Python libraries to interact with their models. The wrapper is not really what's important here. It is not supposed to be a
generic wrapper with streaming, image support, etc. It has none of that. The focus is on the tool calling loop.

### The Tool Calling Loop

Want to build an "agentic workflow"? Just kidding.

Use tool calling in a for loop. I want the tool calling loop in my wrapper to be able to do the following:

- Call functions in parallel when possible. For example, if the LLM says to call three tools which are independent of each other, then these functions should be executed in parallel (not sequentially).
- Handle followup tool calls if necessary. For example, if the output of one tool is required as the input to another tool, allow the LLM to decide if more followup tool calls are required.
- Work with both Anthropic and OpenAI tool calling, using a similar interface.
- Keep record of all the tool calls, the inputs, the outputs, etc. This is useful for debugging and for the evaluation.

That is essentially the "tool calling loop" with some custom logic for my own use case.

A simple example would be:

**USER:** I want to book a flight for Boston or New York. Pick the location that has the better weather. Thanks!

**ASSISTANT:**

- Calls `get_weather(Boston)` and `get_weather(New York)` independently and in parallel.
- Picks the location with the best weather (Boston).
- Calls `book_flight(Boston)`.
- Provides the final assistant message to the user.

The tool calling loop bundles up this type of logic.
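
As a rough illustration of the flow above, here is a minimal sketch of such a loop. It is not my wrapper's actual code: `fake_llm`, `run_tool`, `tool_calling_loop`, and the simplified message format are all hypothetical stand-ins so the example runs without any API key.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def get_weather(city):
    # Stub tool: Boston is always sunny in this toy example.
    return {"data": "Sunny!" if city == "Boston" else "Rainy!"}


def book_flight(city):
    return {"data": f"Booked a flight to {city}."}


FUNCTIONS = {"get_weather": get_weather, "book_flight": book_flight}


def fake_llm(messages):
    """Hypothetical stand-in for a chat model; it scripts the three turns of the example."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        # Turn 1: request two independent weather lookups.
        return {"role": "assistant", "content": None, "tool_calls": [
            {"id": "1", "name": "get_weather", "arguments": json.dumps({"city": "Boston"})},
            {"id": "2", "name": "get_weather", "arguments": json.dumps({"city": "New York"})},
        ]}
    if len(tool_results) == 2:
        # Turn 2: a followup call that depends on the weather results.
        return {"role": "assistant", "content": None, "tool_calls": [
            {"id": "3", "name": "book_flight", "arguments": json.dumps({"city": "Boston"})},
        ]}
    # Turn 3: no more tools requested, so this is the final answer.
    return {"role": "assistant", "content": tool_results[-1]["content"], "tool_calls": []}


def run_tool(call):
    # Look up the Python function, call it with the parsed arguments, wrap the result.
    result = FUNCTIONS[call["name"]](**json.loads(call["arguments"]))
    return {"role": "tool", "tool_call_id": call["id"], "content": result["data"]}


def tool_calling_loop(messages, llm, max_iterations=5):
    messages = list(messages)
    for _ in range(max_iterations):
        reply = llm(messages)
        messages.append(reply)
        if not reply["tool_calls"]:
            return reply, messages
        # Calls requested in the same turn run in parallel; map preserves their order.
        with ThreadPoolExecutor() as pool:
            messages.extend(pool.map(run_tool, reply["tool_calls"]))
    raise RuntimeError("Tool calling loop did not finish")


final, history = tool_calling_loop([{"role": "user", "content": "Boston or New York?"}], fake_llm)
print(final["content"])
```

The `max_iterations` cap is a design choice in this sketch: it keeps a model that never stops requesting tools from looping forever.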





## API Differences

Let's start with comparing OpenAI chat completions and Anthropic message creations.
I'm using my wrapper class here but all the basics/foundations are the same. The returned objects
are just dictionaries and not the usual Pydantic objects. But it does not matter. If you have used either API before then
all this will be familiar.

The first major difference is that the `system` prompt has its own [field for Anthropic](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts#how-to-give-claude-a-role), whereas with the OpenAI messages format
you provide a message with the role `system` as the first message. It's just personal preference, but since I started with OpenAI I like that way
of working with the system prompt. So the first thing I did with my Anthropic wrapper was implement the system prompt like that.
This way I can pass similar `messages` objects to either wrapper as input, as demonstrated in this example.

In [35]:
from llm import OpenAiLMM, AnthropicLLM


llm_openai = OpenAiLMM()
llm_anthropic = AnthropicLLM()

resp = llm_openai.call(
    messages=[{"role": "system", "content": "Talk like a pirate."}, {"role": "user", "content": "hey"}],
    model="gpt-3.5-turbo-0125",
    temperature=0.6,
    max_tokens=150,
)
resp

{'message': {'content': "Ahoy matey! What be ye needin' help with today, arrr?",
  'role': 'assistant'},
 'model': 'gpt-3.5-turbo-0125',
 'token_usage': {'completion_tokens': 18,
  'prompt_tokens': 17,
  'total_tokens': 35}}

In [36]:
resp = llm_anthropic.call(
    messages=[{"role": "system", "content": "Talk like a pirate."}, {"role": "user", "content": "hey"}],
    model="claude-3-5-sonnet-20240620",
    temperature=0.6,
    max_tokens=150,
)
resp

{'message': {'content': [{'text': "Ahoy there, matey! What brings ye to these digital waters? Be ye seekin' treasure, adventure, or just a friendly chat with an ol' sea dog like meself?",
    'type': 'text'}],
  'role': 'assistant'},
 'model': 'claude-3-5-sonnet-20240620',
 'token_usage': {'completion_tokens': 43,
  'prompt_tokens': 14,
  'total_tokens': 57}}
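
To make the system prompt handling concrete, here is a minimal sketch of how OpenAI-style messages could be split before calling Anthropic. The helper name `split_system_prompt` is hypothetical and this is not necessarily how my wrapper does it internally.

```python
def split_system_prompt(messages):
    """Separate OpenAI-style system messages from the rest of the conversation."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat_messages = [m for m in messages if m["role"] != "system"]
    # Anthropic takes the system prompt as a top-level `system` argument,
    # so multiple system messages are joined into one string here.
    system = "\n\n".join(system_parts) if system_parts else None
    return system, chat_messages


messages = [
    {"role": "system", "content": "Talk like a pirate."},
    {"role": "user", "content": "hey"},
]
system, chat_messages = split_system_prompt(messages)
print(system)
print(chat_messages)
```

The returned `system` string would go into Anthropic's `system` argument and `chat_messages` into `messages`. When `system` is `None`, the argument should simply be left out of the request.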

Also note that even things like `temperature` are different. For OpenAI its range is `[0,2]`, whereas for Anthropic it's `[0,1]`. The token usage outputs come in different formats as well. Here I have made them the same, following the OpenAI token usage format. I don't plan to go through all these differences in this post, just some of the "big" ones.
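
As a small illustration of the token usage point, Anthropic reports `input_tokens` and `output_tokens`, which can be mapped onto the OpenAI-style names. This is a sketch under that assumption; the function name `normalize_anthropic_usage` is made up for this example.

```python
def normalize_anthropic_usage(usage):
    """Map Anthropic's usage fields onto OpenAI-style names."""
    prompt_tokens = usage["input_tokens"]
    completion_tokens = usage["output_tokens"]
    return {
        "completion_tokens": completion_tokens,
        "prompt_tokens": prompt_tokens,
        # Anthropic does not report a total, so compute it here.
        "total_tokens": prompt_tokens + completion_tokens,
    }


normalized = normalize_anthropic_usage({"input_tokens": 14, "output_tokens": 43})
print(normalized)
```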

The next big difference is how the message responses/outputs are returned. By default, OpenAI returns `n=1` response for normal text generation,
in the `{'content': "...", 'role': 'assistant'}` format. If there are tool calls then it returns those in a separate field, `tool_calls`. We will see that shortly. Anthropic instead returns all its content as a list of content blocks. Those blocks can have different types, such as `text` and `tool_use` in responses, while `tool_result` blocks appear in the messages you send back. You can see a `text` block in the pirate example above and a `tool_use` block in the tool calling examples below.

Tool/function calling has some differences too. Let's go through the all too familiar weather example. It's like the "hello world" 
of function calling. Starting with OpenAI API tool format:


In [56]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location city",
                    },
                },
                "required": ["location"],
            },
        },
    }
]
llm_openai.call(messages=[{"role": "user", "content": "What is the weather in Boston?"}], model="gpt-3.5-turbo-0125", tools=tools)

{'message': {'content': None,
  'role': 'assistant',
  'tool_calls': [{'id': 'call_RpQCbcAvrROdSDdEm1wKyWrj',
    'function': {'arguments': '{"location":"Boston"}',
     'name': 'get_current_weather'},
    'type': 'function'}]},
 'model': 'gpt-3.5-turbo-0125',
 'token_usage': {'completion_tokens': 15,
  'prompt_tokens': 61,
  'total_tokens': 76}}

This is just the first step. The LLM says we need to call a tool and OpenAI uses this `'tool_calls'` field.
Also note that OpenAI often leaves `content` as `None` when returning tools to be called. This is not always the case though!

When we make this same request with Anthropic we need to define the tools slightly differently. It does not accept the same format :(. If we try, we get an error.


In [86]:
llm_anthropic.call(messages=[{"role": "user", "content": "What is the weather in Boston?"}], model="claude-3-5-sonnet-20240620", tools=tools)

{'error': {'code': 'invalid_request_error',
  'status_code': 400,
  'type': 'invalid_request_error',
  'message': 'tools.0.name: Field required'}}

I have a simple function to convert from OpenAI tool format to Anthropic tool format.

In [64]:
from copy import deepcopy


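def convert_openai_tool_to_anthropic(open_ai_tool: dict) -> dict:
    """Convert an OpenAI tool definition into Anthropic's tool format."""
    # Copy first so the caller's OpenAI tool definition is left untouched.
    t = deepcopy(open_ai_tool)["function"]
    # Anthropic expects the JSON schema under `input_schema` instead of `parameters`.
    t["input_schema"] = t.pop("parameters")
    return t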

When using Anthropic we need to define the tools like below. It's almost as if they want to make it harder
to switch from OpenAI to Anthropic lol.

In [65]:
convert_openai_tool_to_anthropic(tools[0])

{'name': 'get_current_weather',
 'description': 'Get the current weather in a given location',
 'input_schema': {'type': 'object',
  'properties': {'location': {'type': 'string',
    'description': 'The location city'}},
  'required': ['location']}}

In [66]:
llm_anthropic.call(
    messages=[{"role": "user", "content": "What is the weather in Boston?"}],
    model="claude-3-5-sonnet-20240620",
    tools=[convert_openai_tool_to_anthropic(tools[0])],
)

{'message': {'content': [{'text': "Certainly! I can help you get the current weather information for Boston. To do that, I'll use the available weather tool. Let me fetch that information for you right away.",
    'type': 'text'},
   {'id': 'toolu_01NMMxXzroTLFDGF1D43Mp46',
    'input': {'location': 'Boston'},
    'name': 'get_current_weather',
    'type': 'tool_use'}],
  'role': 'assistant'},
 'model': 'claude-3-5-sonnet-20240620',
 'token_usage': {'completion_tokens': 92,
  'prompt_tokens': 374,
  'total_tokens': 466}}

Here you can see the `content` list, where the tool calls and assistant messages are all together in the same list. Each item has a `type`
to tell them apart. Also note that by default I think sonnet is being prompted to use chain of thought. That's why it explains itself and provides the content block of type `text`. You can try to make it less wordy. It would be interesting to see how this affects evaluation performance though.

In [87]:
llm_anthropic.call(
    messages=[
        {
            "role": "system",
            "content": " Do not talk about the tools you use. Just call the tools without providing any reasoning."
            "Please, do not talk about the tools.",
        },
        {"role": "user", "content": "What is the weather in Boston?"},
    ],
    model="claude-3-5-sonnet-20240620",
    tools=[convert_openai_tool_to_anthropic(tools[0])],
    temperature=0,
)

{'message': {'content': [{'text': "Certainly! I'll check the current weather in Boston for you.",
    'type': 'text'},
   {'id': 'toolu_014iaAYa99njiqWdAZYEGRyp',
    'input': {'location': 'Boston'},
    'name': 'get_current_weather',
    'type': 'tool_use'}],
  'role': 'assistant'},
 'model': 'claude-3-5-sonnet-20240620',
 'token_usage': {'completion_tokens': 69,
  'prompt_tokens': 402,
  'total_tokens': 471}}
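
Since the two providers return tool calls in different shapes, it can help to normalize them into one format. Below is a minimal sketch that works on the dictionary outputs shown above. The helper `extract_tool_calls` and its output format are hypothetical, not part of either SDK.

```python
import json


def extract_tool_calls(message):
    """Return a list of {"id", "name", "input"} dicts from either response shape."""
    calls = []
    # OpenAI shape: a separate `tool_calls` field with JSON-encoded arguments.
    for call in message.get("tool_calls") or []:
        calls.append({
            "id": call["id"],
            "name": call["function"]["name"],
            "input": json.loads(call["function"]["arguments"]),
        })
    # Anthropic shape: `content` is a list of blocks, some of type `tool_use`.
    content = message.get("content")
    if isinstance(content, list):
        for block in content:
            if block.get("type") == "tool_use":
                calls.append({"id": block["id"], "name": block["name"], "input": block["input"]})
    return calls


openai_message = {
    "content": None,
    "role": "assistant",
    "tool_calls": [{
        "id": "call_RpQCbcAvrROdSDdEm1wKyWrj",
        "function": {"arguments": '{"location":"Boston"}', "name": "get_current_weather"},
        "type": "function",
    }],
}
anthropic_message = {
    "content": [
        {"text": "Certainly! I'll check the current weather in Boston for you.", "type": "text"},
        {"id": "toolu_014iaAYa99njiqWdAZYEGRyp", "input": {"location": "Boston"},
         "name": "get_current_weather", "type": "tool_use"},
    ],
    "role": "assistant",
}
print(extract_tool_calls(openai_message))
print(extract_tool_calls(anthropic_message))
```

Note that OpenAI encodes the arguments as a JSON string while Anthropic returns an already parsed `input` dict, which is why only the first branch calls `json.loads`.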

Before moving on to a more interesting problem, let's go through the tool calling loop example.
This tool calling loop is implemented in a function `generate_with_function_calling`. Let's consider the simple example
of using weather and flight booking tools together.

In [94]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location city",
                    },
                },
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "book_flight",
            "description": "Book a flight from one location to another.",
            "parameters": {
                "type": "object",
                "properties": {
                    "departure_city": {
                        "type": "string",
                        "description": "The departure city.",
                    },
                    "arrival_city": {
                        "type": "string",
                        "description": "The arrival city.",
                    },
                },
                "required": ["departure_city", "arrival_city"],
            },
        },
    },
]


def get_current_weather(location):
    if "boston" in location.lower():
        return {"data": "Sunny!"}
    else:
        return {"data": "Rainy!"}


def book_flight(departure_city, arrival_city):
    return {"data": f"I have booked your flight from {departure_city} to {arrival_city}."}


functions_look_up = {"get_current_weather": get_current_weather, "book_flight": book_flight}

Above is what is needed for my wrapper tool calling loop logic. 
The functions must return a dict with at least the key `'data'` which is the content passed to the LLM.
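
To show where that `'data'` value ends up, here is a sketch of how a tool result could be packaged for each provider. The two helper names are hypothetical and the wrapper's real internals may differ, but the message shapes follow each provider's tool result format.

```python
import json


def _as_text(data):
    # Tool results are sent back to the model as text, so non-strings are JSON-encoded.
    return data if isinstance(data, str) else json.dumps(data)


def to_openai_tool_message(tool_call_id, name, result):
    """OpenAI expects one `tool` role message per tool call."""
    return {"tool_call_id": tool_call_id, "role": "tool", "name": name, "content": _as_text(result["data"])}


def to_anthropic_tool_result(tool_use_id, result):
    """Anthropic expects a `tool_result` block inside a user message."""
    return {
        "role": "user",
        "content": [{"type": "tool_result", "tool_use_id": tool_use_id, "content": _as_text(result["data"])}],
    }


result = {"data": "Sunny!"}
openai_tool_message = to_openai_tool_message("call_1", "get_current_weather", result)
anthropic_tool_message = to_anthropic_tool_result("toolu_1", result)
print(openai_tool_message)
print(anthropic_tool_message)
```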

First we will use OpenAI.

In [95]:
resp = llm_openai.generate_with_function_calling(
    messages=[
        {
            "role": "user",
            "content": """I need to book a flight from Halifax to either Boston or New York. 
    But I want to fly to the city with the nicer weather.
    """,
        }
    ],
    tools=tools,
    functions_look_up=functions_look_up,
    model="gpt-3.5-turbo-0125",
)

In [97]:
resp.keys()

dict_keys(['message', 'new_messages', 'model', 'tool_calls_details', 'token_usage', 'execution_time'])

- `resp['message']` is the final assistant message after all the internal looping logic.

In [98]:
resp["message"]

{'content': 'Your flight from Halifax to Boston has been successfully booked. If you need any further assistance or information, feel free to let me know!',
 'role': 'assistant'}

- `resp['new_messages']` is the record of all **new** messages created after the user message and up to and including the final assistant message. It is useful for keeping track of the conversation history. It includes all the tool calls and interactions. It will be in the format expected by either OpenAI or Anthropic.

In [101]:
resp["new_messages"]

[{'content': None,
  'role': 'assistant',
  'tool_calls': [{'id': 'call_CHndlD0V00ThofEsuUEPOTBK',
    'function': {'arguments': '{"location": "Boston"}',
     'name': 'get_current_weather'},
    'type': 'function'},
   {'id': 'call_Ym7kjlLIX2bpX892trzIqOhx',
    'function': {'arguments': '{"location": "New York"}',
     'name': 'get_current_weather'},
    'type': 'function'}]},
 {'tool_call_id': 'call_CHndlD0V00ThofEsuUEPOTBK',
  'role': 'tool',
  'name': 'get_current_weather',
  'content': 'Sunny!'},
 {'tool_call_id': 'call_Ym7kjlLIX2bpX892trzIqOhx',
  'role': 'tool',
  'name': 'get_current_weather',
  'content': 'Rainy!'},
 {'content': 'The current weather in both Boston is sunny and in New York is rainy. \nSince the weather in Boston is sunny, I recommend booking your flight from Halifax to Boston. \n\nI will proceed with booking the flight from Halifax to Boston.',
  'role': 'assistant',
  'tool_calls': [{'id': 'call_qktEzlFMLIvZCTVqZLUrrNNk',
    'function': {'arguments': '{"depa

In [88]:
llm_openai.call(messages=[], model="gpt-3.5-turbo-0125", tools=tools)

{'message': {'content': None,
  'role': 'assistant',
  'tool_calls': [{'id': 'call_wV4oaNtnIXwYKNsGVVtXHp8E',
    'function': {'arguments': '{"location": "Boston"}',
     'name': 'get_current_weather'},
    'type': 'function'},
   {'id': 'call_z2pQybdSOUvDmWbsoW7qivew',
    'function': {'arguments': '{"location": "New York"}',
     'name': 'get_current_weather'},
    'type': 'function'}]},
 'model': 'gpt-3.5-turbo-0125',
 'token_usage': {'completion_tokens': 46,
  'prompt_tokens': 85,
  'total_tokens': 131}}

In [None]:
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
import numpy as np

In [None]:
llm_anthropic = AnthropicLLM()
llm_anthropic.call(
    messages=[{"role": "system", "content": "Talk like a pirate and return JSON"}, {"role": "user", "content": "hey"}], max_tokens=1000
)

In [None]:
def prepare_embedding_input(rec):
    return f'{rec["name"]} {rec["enum"]}'

In [None]:
ecommerce_metrics_embeddings = llm_openai.get_embeddings([prepare_embedding_input(rec) for rec in ecommerce_metrics])
brands_embeddings = llm_openai.get_embeddings([prepare_embedding_input(rec) for rec in brands])

In [None]:
get_backend_metric_tool = {
    "type": "function",
    "function": {
        "name": "get_backend_metric",
        "description": """Takes in the user requested metric and 
        uses ML/AI to return the k nearest neighbors for the most likely related backend ENUM metrics.""",
        "parameters": {
            "type": "object",
            "properties": {
                "user_requested_metric": {
                    "type": "string",
                    "description": "The metric requested by the user.",
                },
            },
            "required": ["user_requested_metric"],
        },
    },
}
get_backend_brands_tool = {
    "type": "function",
    "function": {
        "name": "get_backend_brands",
        "description": """Takes in the user requested brand(s) and 
        uses ML/AI to return the k nearest neighbors for the most likely related backend ENUM brands per requested brands.""",
        "parameters": {
            "type": "object",
            "properties": {
                "user_requested_brands": {
                    "type": "array",
                    "items": {
                        "type": "string",
                    },
                    "default": [],
                    "description": "The list of brand(s) requested by the user.",
                },
            },
            "required": ["user_requested_brands"],
        },
    },
}
get_sales_data_tool = {
    "type": "function",
    "function": {
        "name": "get_sales_data",
        "description": """Get the sales data from the backend system.""",
        "parameters": {
            "type": "object",
            "properties": {
                "backend_metric": {
                    "type": "string",
                    "description": "This is the backend metric ENUM.",
                },
                "backend_brands": {
                    "type": "array",
                    "items": {
                        "type": "string",
                    },
                    "default": [],
                    "description": "The list of backend ENUM brands.",
                },
                "sales_channels": {
                    "type": "array",
                    "items": {
                        "type": "string",
                        "enum": [x["enum"] for x in sales_channels],
                    },
                    "default": [],
                    "description": "The list of sales channels.",
                },
                "current_period_start_date": {
                    "type": "string",
                    "description": "The start of the current reporting period.",
                },
                "current_period_end_date": {
                    "type": "string",
                    "description": "The end of the current reporting period.",
                },
            },
            "required": [
                "backend_metric",
                "backend_brands",
                "current_period_start_date",
                "current_period_end_date",
            ],
        },
    },
}
tools_openai = [get_backend_metric_tool, get_backend_brands_tool, get_sales_data_tool]
tools_anthropic = [convert_openai_tool_to_anthropic(t) for t in tools_openai]

In [None]:
def find_k_nearest_neighbors(embeddings, input_embedding, k):
    # Calculate distances
    distances = np.linalg.norm(embeddings - input_embedding, axis=1)
    # Get indices of k smallest distances
    nearest_indices = np.argpartition(distances, k)[:k]
    # Sort the k nearest indices by distance
    nearest_indices = nearest_indices[np.argsort(distances[nearest_indices])]
    return nearest_indices
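
To sanity check the nearest neighbor logic by hand, here is a dependency-free sketch of the same idea on toy 2D "embeddings". The function name `k_nearest` and the toy vectors are made up for illustration. It uses a plain sort instead of `np.argpartition`, which is fine for small inputs.

```python
import math


def k_nearest(embeddings, query, k):
    """Indices of the k rows of `embeddings` closest to `query` (Euclidean distance)."""
    distances = [math.dist(row, query) for row in embeddings]
    # Sort all indices by distance and keep the first k.
    return sorted(range(len(embeddings)), key=lambda i: distances[i])[:k]


toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
nearest = k_nearest(toy_embeddings, [0.9, 0.1], k=2)
print(nearest)  # [1, 0]
```

The query `[0.9, 0.1]` is closest to row 1 (`[1.0, 0.0]`) and then row 0 (`[0.0, 0.0]`), so those two indices come back in that order.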

In [None]:
def get_backend_metric(user_requested_metric: str):
    # Embed the user's wording, then return the 3 closest backend metrics.
    query_embedding = llm_openai.get_embeddings([user_requested_metric])
    return {"data": [ecommerce_metrics[i] for i in find_k_nearest_neighbors(ecommerce_metrics_embeddings, query_embedding, 3)]}


def get_backend_brands(user_requested_brands: list[str]):
    # Look up the 3 closest backend brands for each requested brand.
    data = dict()
    for brand in user_requested_brands:
        query_embedding = llm_openai.get_embeddings([brand])
        data[brand] = [brands[i] for i in find_k_nearest_neighbors(brands_embeddings, query_embedding, 3)]
    return {"data": data}


def get_sales_data(*args, **kwargs):
    return {"data": 10}


functions_look_up = {"get_backend_metric": get_backend_metric, "get_backend_brands": get_backend_brands, "get_sales_data": get_sales_data}

In [None]:
get_backend_brands(["shopify", "nike"])

In [None]:
system_prompt = """
You will be asked a question by the user about retrieving sales data.
Use the available tools but only call the tools when needed.
If you need further clarification then ask. 

There are hundreds of metrics and hundreds of brands in the backend system.
The user will not know all these metrics and brands, or how to refer to them exactly.
You do not know them either, so I have provided some helper tools for you.

In general you will follow the typical flow when answering questions:
1. Extract the user requested metric and the user requested brand(s).
2. 
    
    a) Pass the user requested metric to the tool get_backend_metric to
    get the list of most likely corresponding backend metric ENUMs. 
    Then choose the most appropriate from this list. 
    
    b) Pass the user requested brand(s) to the tool get_backend_brands to
    get the list of most likely corresponding backend brand ENUMs. 
    Then choose the most appropriate from this list. 
    
3. Extract any sales channels if mentioned.
4. Pass the relevant arguments into the get_sales_data tool.


Today's date is Monday, June 10, 2024
"""

In [None]:
def eval_llm_resp(question: dict, llm_resp: dict):
    if not llm_resp.get("tool_calls_details"):
        args_predicted = dict()
    else:
        args_predicted = [x["input"] for x in llm_resp["tool_calls_details"].values() if x["name"] == "get_sales_data"]
        args_predicted = args_predicted[0] if args_predicted else {}
    return {
        "question": question["question"],
        "expected_metric": question["expected_metric"],
        "predicted_metric": args_predicted.get("backend_metric", ""),
        "metric_correct": question["expected_metric"] == args_predicted.get("backend_metric", ""),
        "expected_brands": sorted(question["expected_brands"]),
        "predicted_brands": sorted(args_predicted.get("backend_brands", [])),
        "brands_correct": sorted(question["expected_brands"]) == sorted(args_predicted.get("backend_brands", [])),
        "expected_sales_channels": sorted(question["expected_sales_channels"]),
        "predicted_sales_channels": sorted(args_predicted.get("sales_channels", [])),
        "sales_channels_correct": sorted(question["expected_sales_channels"]) == sorted(args_predicted.get("sales_channels", [])),
        "expected_current_period_start_date": question["current_period_start_date"],
        "predicted_current_period_start_date": args_predicted.get("current_period_start_date", ""),
        "current_period_start_date_correct": question["current_period_start_date"] == args_predicted.get("current_period_start_date", ""),
        "expected_current_period_end_date": question["current_period_end_date"],
        "predicted_current_period_end_date": args_predicted.get("current_period_end_date", ""),
        "current_period_end_date_correct": question["current_period_end_date"] == args_predicted.get("current_period_end_date", ""),
    }


def eval_questions(llm, model, tools, questions: list[dict], max_workers=10):
    def task(question: dict):
        llm_resp = llm.generate_with_function_calling(
            messages=[
                {"role": "system", "content": system_prompt},
                {
                    "role": "user",
                    "content": question["question"],
                },
            ],
            tools=tools,
            functions_look_up=functions_look_up,
            model=model,
        )
        llm_resp.update(eval_llm_resp(question, llm_resp))
        return llm_resp

    eval_res = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(task, question) for question in questions]

        for future in tqdm(as_completed(futures), total=len(questions), desc="Evaluating questions"):
            eval_res.append(future.result())

    return eval_res

In [None]:
def calculate_accuracies(df):
    accuracies = {}
    for col in df.columns:
        if col.endswith("_correct"):
            accuracy = df[col].sum() / df.shape[0]
            accuracies[col.replace("correct", "accuracy")] = f"{accuracy:.2%}"
    return accuracies
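
For a quick check of the aggregation without pandas, here is a sketch of the same calculation on a plain list of evaluation records. The function `accuracies_from_records` and the toy records are hypothetical. It mirrors the logic above: every key ending in `_correct` becomes a percentage.

```python
def accuracies_from_records(records):
    """Share of True values for every `*_correct` key, formatted as a percentage."""
    keys = [key for key in records[0] if key.endswith("_correct")]
    return {
        key.replace("correct", "accuracy"): f"{sum(r[key] for r in records) / len(records):.2%}"
        for key in keys
    }


toy_records = [
    {"question": "q1", "metric_correct": True, "brands_correct": False},
    {"question": "q2", "metric_correct": True, "brands_correct": True},
]
toy_accuracies = accuracies_from_records(toy_records)
print(toy_accuracies)  # {'metric_accuracy': '100.00%', 'brands_accuracy': '50.00%'}
```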

In [None]:
import pandas as pd

df_openai = pd.DataFrame(eval_questions(llm_openai, "gpt-3.5-turbo-0125", tools_openai, questions[:10], max_workers=10))

In [None]:
df_openai

In [None]:
calculate_accuracies(df_openai)

In [None]:
df_anthropic = pd.DataFrame(eval_questions(llm_anthropic, "claude-3-5-sonnet-20240620", tools_anthropic, questions[:10], max_workers=2))

In [None]:
df_anthropic

In [None]:
calculate_accuracies(df_anthropic)