Merged
79 changes: 70 additions & 9 deletions README.md
@@ -10,6 +10,74 @@
👋 Join our <a href="https://join.slack.com/t/agentbenchcol-huw1944/shared_invite/zt-20ixabcuv-31cFLBAkqGQxQkJqrWVEVg" target="_blank">Slack</a> for <i>Q & A</i> or <i><b>collaboration</b> on next version of AgentBench</i>!
</p>

## 🔥[2025.10.10] Introducing **AgentBench FC (Function Calling)** based on [AgentRL](https://github.com/THUDM/AgentRL)

The current repository contains the function-calling version of AgentBench, integrated with [AgentRL](https://github.com/THUDM/AgentRL), an end-to-end multi-task, multi-turn RL framework for LLM agents.
If you wish to use an older version, you can revert to [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1) or [v0.2](https://github.com/THUDM/AgentBench/tree/v0.2).

Compared to the original AgentBench, this version uses function-calling-style prompts
and adds fully containerized deployment support for the following tasks:

- `alfworld` (AF)
- `dbbench` (DB)
- `knowledgegraph` (KG)
- `os_interaction` (OS)
- `webshop` (WS)

### Quick Start

We support a quick one-command setup for all the above tasks using Docker Compose.

Before starting, please download or build the following Docker images required by the tasks:

```shell
# dbbench
docker pull mysql:8

# os_interaction
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
```

To run the Freebase server for the KG task, you will also need a copy of the data found [here](https://github.com/dki-lab/Freebase-Setup).
Download, extract and place the data at `./virtuoso_db/virtuoso.db` (or modify `extra/docker-compose.yml` and set the mount point to your data location).

Then, you can bring up the stack with:

```shell
docker compose -f extra/docker-compose.yml up
```

This command will download or build the necessary Docker images and start the following services in Docker:

- AgentRL Controller
- `alfworld` task worker (x1, increase as needed)
- `dbbench` task worker (x1, increase as needed)
- `knowledgegraph` task worker (x1, increase as needed)
- `os_interaction` task worker (x1, increase as needed)
- `webshop` task worker (x1, increase as needed)
- Freebase server (for the `knowledgegraph` task)
- Redis server (for container allocation)

If your machine already has Redis (version 7+) running, you can omit the Redis service from the `docker-compose.yml`.
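
One way to check whether an existing Redis instance meets the version requirement is to read the `redis_version` field from its `INFO server` reply. The following is a minimal Python sketch, not part of AgentBench: it assumes Redis listens on `localhost:6379` without authentication and speaks the default RESP2 protocol, and both helper names are hypothetical.

```python
import socket


def parse_redis_major_version(info_text: str) -> int:
    """Extract the major version number from the text of an INFO reply."""
    for line in info_text.splitlines():
        if line.startswith("redis_version:"):
            version = line.split(":", 1)[1].strip()
            return int(version.split(".")[0])
    raise ValueError("redis_version field not found in INFO reply")


def fetch_redis_info(host: str = "localhost", port: int = 6379,
                     timeout: float = 2.0) -> str:
    """Send an inline `INFO server` command and return the reply payload.

    Assumes no authentication; an error reply (e.g. NOAUTH) raises RuntimeError.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"INFO server\r\n")
        buf = b""
        # A bulk reply starts with a "$<length>\r\n" header line.
        while b"\r\n" not in buf:
            chunk = conn.recv(4096)
            if not chunk:
                raise ConnectionError("connection closed before reply header")
            buf += chunk
        header, _, payload = buf.partition(b"\r\n")
        if not header.startswith(b"$"):
            raise RuntimeError(f"unexpected reply: {header!r}")
        length = int(header[1:])
        # Keep reading until the whole payload has arrived.
        while len(payload) < length:
            chunk = conn.recv(4096)
            if not chunk:
                raise ConnectionError("connection closed before full reply")
            payload += chunk
        return payload[:length].decode("utf-8", errors="replace")


# Example (requires a running Redis instance):
#   major = parse_redis_major_version(fetch_redis_info())
#   print("OK" if major >= 7 else "Redis is older than version 7")
```

If the check passes, the Redis service can be dropped from `extra/docker-compose.yml` as described above.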

> [!WARNING]
> Please note that the `webshop` environment requires ~16GB of RAM to start,
> and the current implementation of `alfworld` leaks memory and disk space until the task worker is restarted.
> Make sure your machine has sufficient resources before running.

### Benchmarking Results

We report the results of various models on the test set of AgentBench FC.

![img.png](assets/fc_leaderboard.png)

Please see our [Leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vRR3Wl7wsCgHpwUw1_eUXW_fptAPLL3FkhnW_rua0O1Ji_GIVrpTjY5LaKAhwO-WeARjnY_KNw0SYNJ/pubhtml) for full results.
Please contact [agentbench_fc&#64;googlegroups.com](mailto:agentbench_fc@googlegroups.com) if you have any questions or would like to contribute your results.

---

## 🔥[2024.08.13] Introducing [VisualAgentBench](https://github.com/THUDM/VisualAgentBench)

VisualAgentBench is designed for evaluating and training visual foundation agents based on large multimodal models (LMMs). We introduce 5 distinct environments spanning
@@ -20,16 +88,9 @@ VisualAgentBench is designed for evaluating and training visual foundation agent

to systematically benchmark 17 LMMs (proprietary & open LMMs). We also provide the trajectory dataset for behavior cloning training on open LMMs for you to develop your own visual foundation agents!

## 📌Introducing AgentBench v0.2🎉

You are now browsing AgentBench v0.2. If you wish to use the older version, you can revert to [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1).

Based on [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1), we:
---

- Updated the framework architecture for easier use and extension
- Adjusted some task settings
- Added test results for more models
- Released the full data for the Dev and Test sets
The following is the introduction to the original AgentBench (v0.2).

# AgentBench: Evaluating LLMs as Agents

Binary file added assets/fc_leaderboard.png
40 changes: 26 additions & 14 deletions configs/tasks/alfworld.yaml
@@ -1,22 +1,34 @@
default:
module: src.server.tasks.alfworld.ALFWorld
docker:
image: longinyu/agentbench-alfworld
command: umask 0; [ -f /root/.setup.sh ] && bash /root/.setup.sh;
parameters:
name: alfworld-std
data_path: "/AgentBench/data/alfworld"
config_path: "src/server/tasks/alfworld/configs/base_config.yaml"
prompts_path: "src/server/tasks/alfworld/prompts/alfworld_multiturn_plan_first.json"
split: "standard"
max_step: 35

alfworld-dev:
parameters:
name: alfworld-dev
split: "dev"
concurrency: 16
data_path: "/app/data/alfworld"
config_path: "/app/src/server/tasks/alfworld/configs/base_config.yaml"
prompts_path: "/app/src/server/tasks/alfworld/prompts/alfworld_multiturn_plan_first.json"
split: "new_std"
max_step: 20
tools:
- type: "function"
function:
name: "take_action"
description: "Take an action."
parameters:
type: "object"
properties:
action:
type: "string"
description: "The action you would like to take"
required:
- "action"
additionalProperties: False

alfworld-std:
parameters:
name: alfworld-std
split: "standard"
split: "new_std"

alfworld-env_train:
parameters:
name: alfworld-env_train
split: "train_valid"
51 changes: 45 additions & 6 deletions configs/tasks/dbbench.yaml
@@ -1,15 +1,54 @@
default:
module: src.server.tasks.dbbench.DBBench
module: src.server.tasks.dbbench.DBBenchTask
parameters:
concurrency: 1
concurrency: 32
max_round: 15

dbbench-dev:
parameters:
name: dbbench-dev
data_file: "data/dbbench/dev.jsonl"
tools:
- type: "function"
function:
name: "execute_sql"
description: "Executes a given SQL statement on the database and returns the result."
parameters:
type: "object"
properties:
query:
type: "string"
description: "The SQL query to be executed."
required:
- "query"
additionalProperties: False
- type: "function"
function:
name: "commit_final_answer"
description: "Commits the final answer after all operations are completed."
parameters:
type: "object"
properties:
answers:
type: "array"
items:
type: "string"
description: "The list of final answers to commit."
required:
- "answers"
additionalProperties: False

env_driver: docker
env_options:
network_name: dbbench_default
state_driver: redis
state_options:
connection:
host: 172.17.0.1

dbbench-std:
parameters:
name: dbbench-std
data_file: "data/dbbench/standard.jsonl"

dbbench-env_train:
parameters:
name: dbbench-env_train
data_file: "data/dbbench/db_out_new.jsonl"
db_file: "data/dbbench/db_train"
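
The `tools` entries in this config describe each function's arguments with a JSON-Schema-style `parameters` object. As an illustration of how a model's tool call could be checked against such a schema, here is a minimal Python sketch. The `validate_tool_call` helper is hypothetical and not part of AgentBench; it handles only the keywords that appear in these configs (`required`, `additionalProperties`, and the `string`/`array` types), and the placement of the `description` under `answers` is an assumption, since the diff view does not preserve indentation.

```python
import json

# The two dbbench tools, transcribed from the config into Python dicts.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "execute_sql",
            "description": "Executes a given SQL statement on the database and returns the result.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The SQL query to be executed."},
                },
                "required": ["query"],
                "additionalProperties": False,
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "commit_final_answer",
            "description": "Commits the final answer after all operations are completed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "answers": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "The list of final answers to commit.",
                    },
                },
                "required": ["answers"],
                "additionalProperties": False,
            },
        },
    },
]

_PY_TYPES = {"string": str, "array": list, "object": dict}


def validate_tool_call(name: str, arguments_json: str, tools=TOOLS) -> dict:
    """Parse and validate tool-call arguments; raise ValueError if invalid."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    if name not in schemas:
        raise ValueError(f"unknown tool: {name}")
    schema = schemas[name]
    args = json.loads(arguments_json)  # JSONDecodeError is a ValueError
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    props = schema["properties"]
    missing = [key for key in schema.get("required", []) if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    if schema.get("additionalProperties") is False:
        extra = sorted(set(args) - set(props))
        if extra:
            raise ValueError(f"unexpected arguments: {extra}")
    for key, value in args.items():
        spec = props.get(key)
        if spec is None:
            continue
        if not isinstance(value, _PY_TYPES[spec["type"]]):
            raise ValueError(f"argument {key!r} must be of type {spec['type']}")
        if spec["type"] == "array" and "items" in spec:
            item_type = _PY_TYPES[spec["items"]["type"]]
            if not all(isinstance(item, item_type) for item in value):
                raise ValueError(f"items of {key!r} must be {spec['items']['type']}")
    return args
```

A well-formed call such as `validate_tool_call("execute_sql", '{"query": "SELECT 1"}')` returns the parsed arguments, while a call with an unknown tool name, a missing field, or an extra field raises `ValueError`.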
34 changes: 27 additions & 7 deletions configs/tasks/kg.yaml
@@ -1,15 +1,35 @@
default:
module: "src.server.tasks.knowledgegraph.KnowledgeGraph"
parameters:
round: 15
sparql_url: "http://164.107.116.56:3093/sparql"
concurrency: 32
max_rounds: 15
one_shot: false
database_file:
env_driver: manual
env_options:
urls:
kg: http://localhost:3001/sparql

kg-dev:
parameters:
name: "KnowledgeGraph-dev"
data_file: "data/knowledgegraph/dev.json"
# alternative configuration - automatically start a SPARQL server in a docker container
# fill-in the database_file parameter with the absolute path to the freebase db file on the host
# and replace the above parameters with the following:
#
# database_file: /path/to/virtuoso_db/virtuoso.db
# env_driver: docker
# env_options:
# network_name: knowledgegraph_default
# state_driver: redis
# state_options:
# connection:
# host: 172.17.0.1

kg-std:
parameters:
name: "KnowledgeGraph-std"
name: "kg-std"
data_file: "data/knowledgegraph/std.json"
one_shot: true

kg-env_train:
parameters:
name: "kg-env_train"
data_file: "data/knowledgegraph/kg_rl_all.json"
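
With the `manual` environment driver, the task expects a SPARQL endpoint to be reachable at the URL given under `env_options.urls.kg`. One way to confirm that the endpoint answers queries is to send a trivial `SELECT` over the standard SPARQL protocol (an HTTP GET with a `query` parameter). The following is a minimal Python sketch using only the standard library; the endpoint URL comes from the config above, while the helper names and the test query are illustrative assumptions.

```python
import json
import urllib.parse
import urllib.request


def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a GET request that carries the query in the `query` parameter."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


def run_sparql(endpoint: str, query: str, timeout: float = 10.0) -> dict:
    """Execute the query and return the decoded JSON result document."""
    request = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires the Freebase SPARQL server to be running):
#   result = run_sparql("http://localhost:3001/sparql",
#                       "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
#   print(result["results"]["bindings"])
```

Requesting `application/sparql-results+json` keeps the response easy to parse; whether a given server honors that header depends on its configuration.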
83 changes: 62 additions & 21 deletions configs/tasks/os.yaml
@@ -1,9 +1,50 @@
os-dev:
default:
module: "src.server.tasks.os_interaction.OSInteraction"
parameters:
name: "os-dev"
concurrency: 24
concurrency: 32
round_limit: 8
tools:
- type: "function"
function:
name: "bash_action"
description: "Execute bash code to perform an operation in the Linux environment."
parameters:
type: "object"
properties:
script:
type: "string"
description: "The bash script to be executed."
required:
- "script"
additionalProperties: False

- type: "function"
function:
name: "finish_action"
description: "Indicate that the task has been finished or need some additional information to be finished."
parameters:
type: "object"
properties:
thought:
type: "string"
description: "The thought or reason indicating the task is finished."
required:
- "thought"
additionalProperties: False

- type: "function"
function:
name: "answer_action"
description: "Provide the answer to the question."
parameters:
type: "object"
properties:
answer:
type: "string"
description: "The answer to the question."
required:
- "answer"
additionalProperties: False

docker_config:
localhost: local-os
@@ -12,29 +53,18 @@ os-dev:
scripts:
directory: data/os_interaction/res/scripts

data_config:
files:
- problem_file: data/os_interaction/data/dev.json
script_dir: data/os_interaction/scripts/dev/
index_prefix: "dev-001-"

bk: [ ]
ignore: [ ]
env_driver: docker
env_options:
network_name: os_interaction_default
state_driver: redis
state_options:
connection:
host: 172.17.0.1

os-std:
module: "src.server.tasks.os_interaction.OSInteraction"
parameters:
name: "os-std"
concurrency: 24
round_limit: 8

docker_config:
localhost: local-os
directory: data/os_interaction/res/dockerfiles

scripts:
directory: data/os_interaction/res/scripts

data_config:
files:
- problem_file: data/os_interaction/data/1/*.json
@@ -61,3 +91,14 @@ os-std:

bk: [ ]
ignore: [ ]

os-env_train:
parameters:
name: "os-env_train"
data_config:
files:
- problem_file: data/os_interaction/train_0317/training.json
script_dir: data/os_interaction/scripts/7/
index_prefix: "train-0223-"
bk: [ ]
ignore: [ ]