Function Calling Benchmark by Composio

Welcome to the official GitHub repository for Composio's Function Calling Benchmark. This repository contains a benchmark of 50 function calling problems, each designed to be solved with one of the 8 provided function schemas, which are inspired by some of ClickUp's integration endpoints.

Overview

The benchmark tests how well various models call the correct function for a given prompt and thereby resolve a situation in a ClickUp workspace. Each question presents a scenario that requires one specific function to solve. The provided function schemas outline the structure and parameters of the functions that can be used.

A distinguishing feature of this benchmark is that the problems are designed to test how well models handle real-world API structures, and how their performance responds to different optimizations.
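As a rough illustration of what "correctly calling a function" can mean, here is a minimal scoring sketch. The record shape (`name` plus `arguments`), the exact-match criterion, and the function names `create_space` and `delete_space` are assumptions made for this example; they are not taken from the benchmark files or the notebooks.

```python
from typing import Any


def call_matches(predicted: dict[str, Any], expected: dict[str, Any]) -> bool:
    """Return True when the function name and every argument match exactly."""
    return (
        predicted.get("name") == expected.get("name")
        and predicted.get("arguments", {}) == expected.get("arguments", {})
    )


def accuracy(predictions: list[dict[str, Any]], solutions: list[dict[str, Any]]) -> float:
    """Fraction of problems whose predicted call matches the expected call."""
    if len(predictions) != len(solutions):
        raise ValueError("predictions and solutions must have the same length")
    if not solutions:
        return 0.0
    hits = sum(call_matches(p, s) for p, s in zip(predictions, solutions))
    return hits / len(solutions)


# Hypothetical data for illustration only (not from the benchmark).
solutions = [
    {"name": "create_space", "arguments": {"team_id": 1, "name": "Design"}},
    {"name": "delete_space", "arguments": {"space_id": 7}},
]
predictions = [
    {"name": "create_space", "arguments": {"team_id": 1, "name": "Design"}},
    {"name": "delete_space", "arguments": {"space_id": 8}},  # wrong argument value
]
print(accuracy(predictions, solutions))  # 0.5
```

Exact matching is the strictest choice; a real evaluation might normalize argument types or ignore optional parameters before comparing.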

Publications

Improving GPT 4 Function Calling Accuracy

Repository Structure

prompts/: Prompts used to check & modify the problems and schema.
clickup_space_benchmark.json: The problems and correct solutions.
clickup_space_schema.json: Function schemas that the LLMs use to solve the problems of the benchmark.
*.ipynb (in relevant branches): Different optimization techniques applied to the LLMs to check their performance against the benchmark.

We ran all experiments in notebooks, as it is easier to keep track of the results.
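To get a first look at the two JSON files, a small loader like the one below may help. It is only a sketch: it assumes nothing about the internal layout of the files and simply reports the top-level shape. The helper names `load_json` and `describe` are made up for this example.

```python
import json
from pathlib import Path
from typing import Any, Union


def load_json(path: Union[str, Path]) -> Any:
    """Parse a JSON file and return the decoded Python object."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def describe(data: Any) -> str:
    """Summarize the top-level shape without assuming a particular layout."""
    if isinstance(data, list):
        return f"list with {len(data)} items"
    if isinstance(data, dict):
        return f"dict with keys {sorted(data)}"
    return type(data).__name__


if __name__ == "__main__":
    for name in ("clickup_space_benchmark.json", "clickup_space_schema.json"):
        path = Path(name)
        if path.exists():
            print(name, "->", describe(load_json(path)))
        else:
            print(name, "not found; run this from the repository root")
```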

Running the Benchmark

We have tested different function calling models; the result notebooks for each model are stored in a separate branch.

Currently we have experimented with:

gpt-4o - OpenAI - branch
gpt-4-turbo-preview - OpenAI - branch
gpt-4-turbo - OpenAI - branch
gpt-4-0125-preview - OpenAI - branch
claude-3-haiku-20240307 - Anthropic - branch
claude-3-sonnet-20240229 - Anthropic - branch
claude-3-opus-20240229 - Anthropic - branch

We are planning to add these models in the future:

Functionary Models (MeetKai)
Mistral Models
Open-Gorilla Models
NexusRaven Models

Experiments

All of these optimizations have been tested with the models, and each technique is explained here.

All previous Models:

| # | Optimization Approach | `gpt-4-turbo-preview` | `gpt-4-turbo` | `gpt-4-0125-preview` | `claude-3-haiku-20240307` | `claude-3-sonnet-20240229` | `claude-3-opus-20240229` |
|---|---|---|---|---|---|---|---|
| 1 | No System Prompt | 0.36 | 0.36 | 0.353 | 0.48 | 0.6 | 0.42 |
| 2 | Flattening Schema | 0.527 | 0.487 | 0.533 | 0.5 | 0.58 | 0.5 |
| 3 | Flattened Schema + Simple System Prompt | 0.553 | 0.533 | 0.54 | 0.54 | 0.6 | 0.54 |
| 4 | Flattened Schema + Focused System Prompt | 0.633 | 0.633 | 0.64 | 0.54 | 0.54 | 0.54 |
| 5 | Flattened Schema + Focused System Prompt + Function Name Optimized | 0.553 | 0.607 | 0.587 | 0.52 | 0.62 | 0.52 |
| 6 | Flattened Schema + Focused System Prompt + Function Description Optimized | 0.633 | 0.66 | 0.673 | 0.52 | 0.6 | 0.52 |
| 7 | Flattened Schema + Focused System Prompt containing Schema summary | 0.64 | 0.553 | 0.64 | 0.46 | 0.62 | 0.46 |
| 8 | Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized | 0.70 | 0.707 | 0.686 | 0.5 | 0.64 | 0.46 |
| 9 | Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized | 0.687 | 0.707 | 0.68 | 0.5 | 0.6 | 0.6 |
| 10 | Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized | 0.767 | 0.767 | 0.787 | 0.58 | 0.74 | 0.58 |
| 11 | Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added | 0.693 | 0.6 | 0.707 | 0.6 | 0.76 | 0.64 |
| 12 | Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added | 0.787 | 0.693 | 0.787 | 0.68 | 0.76 | 0.66 |
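Several rows above rely on a flattened schema. One plausible reading of that idea is to lift nested object parameters to the top level so the model only has to fill in a single layer of arguments. The sketch below shows that reading; it is not necessarily how the notebooks implement it. The `__` separator, the helper name `flatten_properties`, and the sample schema are assumptions for illustration, and `required` lists are not handled.

```python
from typing import Any


def flatten_properties(
    properties: dict[str, Any], prefix: str = "", sep: str = "__"
) -> dict[str, Any]:
    """Lift nested object properties to the top level, joining names with `sep`."""
    flat: dict[str, Any] = {}
    for name, spec in properties.items():
        key = f"{prefix}{sep}{name}" if prefix else name
        if spec.get("type") == "object" and "properties" in spec:
            # Recurse into nested objects, carrying the joined name as the prefix.
            flat.update(flatten_properties(spec["properties"], key, sep))
        else:
            flat[key] = spec
    return flat


# Hypothetical nested parameter schema (not copied from clickup_space_schema.json).
nested = {
    "name": {"type": "string"},
    "features": {
        "type": "object",
        "properties": {
            "due_dates": {
                "type": "object",
                "properties": {"enabled": {"type": "boolean"}},
            },
            "time_tracking": {
                "type": "object",
                "properties": {"enabled": {"type": "boolean"}},
            },
        },
    },
}

flat = flatten_properties(nested)
print(sorted(flat))
# ['features__due_dates__enabled', 'features__time_tracking__enabled', 'name']
```

If a flattened schema is used at inference time, the model's flat arguments need to be mapped back to the nested form before calling the real API or comparing against a nested expected solution.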

Contributing

We welcome contributions to this repository. If you have a model that you would like to test against the benchmark, feel free to open a pull request. If you encounter any issues while using the benchmark, please open an issue.

License

This project is licensed under the terms of the MIT license.

About Composio

Composio is an organization dedicated to advancing the field of artificial intelligence. We create benchmarks, develop models, and build tools to push the boundaries of what is possible in AI. Follow us on Twitter for updates on our latest projects.
