THUDM · Xiao9905 · Oct 14, 2025 · Oct 9, 2025
diff --git a/README.md b/README.md
@@ -10,6 +10,74 @@
 👋 Join our <a href="https://join.slack.com/t/agentbenchcol-huw1944/shared_invite/zt-20ixabcuv-31cFLBAkqGQxQkJqrWVEVg" target="_blank">Slack</a>  for <i>Q & A</i> or <i><b>collaboration</b> on next version of AgentBench</i>!
 </p>
 
+## 🔥[2025.10.10] Introducing **AgentBench FC (Function Calling)** based on [AgentRL](https://github.com/THUDM/AgentRL)
+
+The current repository contains the function-calling version of AgentBench, integrated with [AgentRL](https://github.com/THUDM/AgentRL), an end-to-end multitask and multiturn LLM Agent RL framework.
+If you wish to use an older version, you can revert to [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1) or [v0.2](https://github.com/THUDM/AgentBench/tree/v0.2).
+
+Compared to the original AgentBench, this version uses a function-calling style prompt,
+and adds fully-containerized deployment support for the following tasks:
+
+- `alfworld` (AF)
+- `dbbench` (DB)
+- `knowledgegraph` (KG)
+- `os_interaction` (OS)
+- `webshop` (WS)
+
+### Quick Start
+
+We support a quick one-command setup for all the above tasks using Docker Compose.
+
+Before starting, please download or build the following Docker images required by the tasks:
+
+```shell
+# dbbench
+docker pull mysql:8
+
+# os_interaction
+docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
+docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
+docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
+```
+
+To run the KG freebase server, you will also need a copy of the data found [here](https://github.com/dki-lab/Freebase-Setup).
+Download, extract and place the data at `./virtuoso_db/virtuoso.db` (or modify `extra/docker-compose.yml` and set the mount point to your data location).
+
+Then, you can bring up the stack with:
+
+```shell
+docker compose -f extra/docker-compose.yml up
+```
+
+This command will download or build the necessary Docker images and start the following services in Docker:
+
+- AgentRL Controller
+- `alfworld` task worker (x1, increase as needed)
+- `dbbench` task worker (x1, increase as needed)
+- `knowledgegraph` task worker (x1, increase as needed)
+- `os_interaction` task worker (x1, increase as needed)
+- `webshop` task worker (x1, increase as needed)
+- freebase server (for `knowledgegraph` task)
+- Redis server (for container allocation)
+
+If your machine already has Redis (version 7+) running, you can omit the Redis service from the `docker-compose.yml`.
+
+> [!WARNING]  
+> Please note that the `webshop` environment requires ~16GB of RAM to start,
+> and the current implementation of `alfworld` leaks memory and disk space until the task worker is restarted.
+> Make sure your machine has sufficient resources before running.
+
+### Benchmarking Results
+
+We report the results of various models on the test set of AgentBench FC.
+
+![img.png](assets/fc_leaderboard.png)
+
+Please see our [Leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vRR3Wl7wsCgHpwUw1_eUXW_fptAPLL3FkhnW_rua0O1Ji_GIVrpTjY5LaKAhwO-WeARjnY_KNw0SYNJ/pubhtml) for full results.
+Please contact [agentbench_fc&#64;googlegroups.com](mailto:agentbench_fc@googlegroups.com) if you have any questions or would like to contribute your results.
+
+---
+
 ## 🔥[2024.08.13] Introducing [VisualAgentBench](https://github.com/THUDM/VisualAgentBench)
 
 VisualAgentBench is designed for evaluating and training visual foundation agents based on large multimodel models (LMMs). We introduce 5 distinct environments spanning 
@@ -20,16 +88,9 @@ VisualAgentBench is designed for evaluating and training visual foundation agent
 
 to systematically benchmark 17 LMMs (proprietary & open LMMs). We also provide the trajectory dataset for behavior cloning training on open LMMs for you to develop your own visual foundation agents!
 
-## 📌Introducing AgentBench v0.2🎉
-
-You are now browsing AgentBench v0.2. If you wish to use the older version, you can revert to [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1).
-
-Based on [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1), we:
+---
 
--   Updated the framework architecture for easier use and extension
--   Adjusted some task settings
--   Added test results for more models
--   Released the full data for the Dev and Test sets
+The following is the introduction to the original AgentBench (v0.2).
 
 # AgentBench: Evaluating LLMs as Agents
 

diff --git a/assets/fc_leaderboard.png b/assets/fc_leaderboard.png
diff --git a/configs/tasks/alfworld.yaml b/configs/tasks/alfworld.yaml
@@ -1,22 +1,34 @@
 default:
   module: src.server.tasks.alfworld.ALFWorld
-  docker:
-    image: longinyu/agentbench-alfworld
-    command: umask 0; [ -f /root/.setup.sh ] && bash /root/.setup.sh;
   parameters:
     name: alfworld-std
-    data_path: "/AgentBench/data/alfworld"
-    config_path: "src/server/tasks/alfworld/configs/base_config.yaml"
-    prompts_path: "src/server/tasks/alfworld/prompts/alfworld_multiturn_plan_first.json"
-    split: "standard"
-    max_step: 35
-
-alfworld-dev:
-  parameters:
-    name: alfworld-dev
-    split: "dev"
+    concurrency: 16
+    data_path: "/app/data/alfworld"
+    config_path: "/app/src/server/tasks/alfworld/configs/base_config.yaml"
+    prompts_path: "/app/src/server/tasks/alfworld/prompts/alfworld_multiturn_plan_first.json"
+    split: "new_std"
+    max_step: 20
+    tools:
+      - type: "function"
+        function:
+          name: "take_action"
+          description: "Take an action."
+          parameters:
+            type: "object"
+            properties:
+              action:
+                type: "string"
+                description: "The action you would like to take"
+            required:
+              - "action"
+            additionalProperties: False
 
 alfworld-std:
   parameters:
     name: alfworld-std
-    split: "standard"
+    split: "new_std"
+
+alfworld-env_train:
+  parameters:
+    name: alfworld-env_train
+    split: "train_valid"
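Review note: the named entries in this file (`alfworld-std`, `alfworld-env_train`) only set `name` and `split`, so everything else has to come from `default`. The loader itself is not part of this diff; the sketch below assumes `default` is deep-merged into each named entry, and `deep_merge` and `resolve` are hypothetical helpers written for illustration. The values are copied from the hunk above, with the `tools` list left out for brevity.

```python
import copy


def deep_merge(base, override):
    """Return a copy of `base` with `override` applied recursively on nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve(config):
    """Assumed semantics: every named entry inherits from the `default` entry."""
    default = config.get("default", {})
    return {
        name: deep_merge(default, entry)
        for name, entry in config.items()
        if name != "default"
    }


# Abridged from configs/tasks/alfworld.yaml (the `tools` list is omitted).
config = {
    "default": {
        "module": "src.server.tasks.alfworld.ALFWorld",
        "parameters": {
            "name": "alfworld-std",
            "concurrency": 16,
            "split": "new_std",
            "max_step": 20,
        },
    },
    "alfworld-std": {"parameters": {"name": "alfworld-std", "split": "new_std"}},
    "alfworld-env_train": {
        "parameters": {"name": "alfworld-env_train", "split": "train_valid"}
    },
}

resolved = resolve(config)
```

Under that assumption, `resolved["alfworld-env_train"]` keeps `module`, `concurrency` and `max_step` from `default` and overrides only `name` and `split`.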
diff --git a/configs/tasks/dbbench.yaml b/configs/tasks/dbbench.yaml
@@ -1,15 +1,54 @@
 default:
-  module: src.server.tasks.dbbench.DBBench
+  module: src.server.tasks.dbbench.DBBenchTask
   parameters:
-    concurrency: 1
+    concurrency: 32
     max_round: 15
 
-dbbench-dev:
-  parameters:
-    name: dbbench-dev
-    data_file: "data/dbbench/dev.jsonl"
+    tools:
+      - type: "function"
+        function:
+          name: "execute_sql"
+          description: "Executes a given SQL statement on the database and returns the result."
+          parameters:
+            type: "object"
+            properties:
+              query:
+                type: "string"
+                description: "The SQL query to be executed."
+            required:
+              - "query"
+            additionalProperties: False
+      - type: "function"
+        function:
+          name: "commit_final_answer"
+          description: "Commits the final answer after all operations are completed."
+          parameters:
+            type: "object"
+            properties:
+              answers:
+                type: "array"
+                items:
+                  type: "string"
+                description: "The list of final answers to commit."
+            required:
+              - "answers"
+            additionalProperties: False
+
+    env_driver: docker
+    env_options:
+      network_name: dbbench_default
+      state_driver: redis
+      state_options:
+        connection:
+          host: 172.17.0.1
 
 dbbench-std:
   parameters:
     name: dbbench-std
     data_file: "data/dbbench/standard.jsonl"
+
+dbbench-env_train:
+  parameters:
+    name: dbbench-env_train
+    data_file: "data/dbbench/db_out_new.jsonl"
+    db_file: "data/dbbench/db_train"
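Review note: both tools declare `required` arguments and set `additionalProperties: False`, so a model call with a misspelled or extra argument should be rejected rather than silently accepted. The sketch below shows what those constraints imply, using a hand-rolled checker that covers only the small JSON Schema subset appearing in this file. `check_value` and `validate_tool_call` are hypothetical names; how the task worker actually validates calls is not shown in this diff.

```python
import json

# Parameter schemas copied from the `tools` list in configs/tasks/dbbench.yaml.
TOOLS = {
    "execute_sql": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
        "additionalProperties": False,
    },
    "commit_final_answer": {
        "type": "object",
        "properties": {"answers": {"type": "array", "items": {"type": "string"}}},
        "required": ["answers"],
        "additionalProperties": False,
    },
}

# Only the JSON Schema types used in the config above are supported.
_PY_TYPES = {"string": str, "array": list, "object": dict}


def check_value(value, schema):
    """Return a list of problems found in `value` for the given schema subset."""
    if not isinstance(value, _PY_TYPES[schema["type"]]):
        return [f"expected {schema['type']}, got {type(value).__name__}"]
    problems = []
    if schema["type"] == "array" and "items" in schema:
        for i, item in enumerate(value):
            problems += [f"[{i}]: {p}" for p in check_value(item, schema["items"])]
    if schema["type"] == "object":
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"missing required argument '{key}'")
        for key, item in value.items():
            if key in props:
                problems += [f"{key}: {p}" for p in check_value(item, props[key])]
            elif schema.get("additionalProperties") is False:
                problems.append(f"unexpected argument '{key}'")
    return problems


def validate_tool_call(name, arguments_json):
    """Check a tool call whose arguments arrive as a JSON-encoded string."""
    if name not in TOOLS:
        return [f"unknown tool '{name}'"]
    try:
        arguments = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    return check_value(arguments, TOOLS[name])
```

For example, `validate_tool_call("execute_sql", '{"sql": "SELECT 1"}')` reports both the missing `query` argument and the unexpected `sql` argument, while a call with `{"query": "SELECT 1"}` returns an empty list.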
diff --git a/configs/tasks/kg.yaml b/configs/tasks/kg.yaml
@@ -1,15 +1,35 @@
 default:
   module: "src.server.tasks.knowledgegraph.KnowledgeGraph"
   parameters:
-    round: 15
-    sparql_url: "http://164.107.116.56:3093/sparql"
+    concurrency: 32
+    max_rounds: 15
+    one_shot: false
+    database_file:
+    env_driver: manual
+    env_options:
+      urls:
+        kg: http://localhost:3001/sparql
 
-kg-dev:
-  parameters:
-    name: "KnowledgeGraph-dev"
-    data_file: "data/knowledgegraph/dev.json"
+# alternative configuration - automatically start a SPARQL server in a docker container
+# fill in the database_file parameter with the absolute path to the freebase db file on the host
+# and replace the above parameters with the following:
+#
+#   database_file: /path/to/virtuoso_db/virtuoso.db
+#   env_driver: docker
+#   env_options:
+#     network_name: knowledgegraph_default
+#     state_driver: redis
+#     state_options:
+#       connection:
+#         host: 172.17.0.1
 
 kg-std:
   parameters:
-    name: "KnowledgeGraph-std"
+    name: "kg-std"
     data_file: "data/knowledgegraph/std.json"
+    one_shot: true
+
+kg-env_train:
+  parameters:
+    name: "kg-env_train"
+    data_file: "data/knowledgegraph/kg_rl_all.json"
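Review note: with `env_driver: manual`, the task relies on a SPARQL endpoint that is already running at the configured URL. A quick readiness probe might look like the sketch below. It uses only the standard SPARQL protocol (a GET request with a `query` parameter and a JSON results `Accept` header); `build_sparql_request` and `endpoint_is_up` are hypothetical helpers, and the URL is the default from this file, which may differ depending on where the probe runs.

```python
import json
import urllib.parse
import urllib.request

# Default endpoint from configs/tasks/kg.yaml; adjust if probing from another host.
KG_SPARQL_URL = "http://localhost:3001/sparql"


def build_sparql_request(endpoint, query):
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


def endpoint_is_up(endpoint=KG_SPARQL_URL, timeout=5.0):
    """Return True if the endpoint answers a trivial ASK query with `true`."""
    request = build_sparql_request(endpoint, "ASK { ?s ?p ?o }")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return bool(json.load(response).get("boolean"))
    except (OSError, ValueError):
        # Connection failures and non-JSON replies are treated as "not ready".
        return False
```

Calling `endpoint_is_up()` before starting the `knowledgegraph` worker gives an early signal that the freebase server has not finished loading or is not reachable.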
diff --git a/configs/tasks/os.yaml b/configs/tasks/os.yaml
@@ -1,9 +1,50 @@
-os-dev:
+default:
   module: "src.server.tasks.os_interaction.OSInteraction"
   parameters:
-    name: "os-dev"
-    concurrency: 24
+    concurrency: 32
     round_limit: 8
+    tools:
+      - type: "function"
+        function:
+          name: "bash_action"
+          description: "Execute bash code to perform an operation in the Linux environment."
+          parameters:
+            type: "object"
+            properties:
+              script:
+                type: "string"
+                description: "The bash script to be executed."
+            required:
+              - "script"
+            additionalProperties: False
+
+      - type: "function"
+        function:
+          name: "finish_action"
+          description: "Indicate that the task has been finished or needs some additional information to be finished."
+          parameters:
+            type: "object"
+            properties:
+              thought:
+                type: "string"
+                description: "The thought or reason indicating the task is finished."
+            required:
+              - "thought"
+            additionalProperties: False
+
+      - type: "function"
+        function:
+          name: "answer_action"
+          description: "Provide the answer to the question."
+          parameters:
+            type: "object"
+            properties:
+              answer:
+                type: "string"
+                description: "The answer to the question."
+            required:
+              - "answer"
+            additionalProperties: False
 
     docker_config:
       localhost: local-os
@@ -12,29 +53,18 @@ os-dev:
     scripts:
       directory: data/os_interaction/res/scripts
 
-    data_config:
-      files:
-        - problem_file: data/os_interaction/data/dev.json
-          script_dir: data/os_interaction/scripts/dev/
-          index_prefix: "dev-001-"
-
-      bk: [ ]
-      ignore: [ ]
+    env_driver: docker
+    env_options:
+      network_name: os_interaction_default
+      state_driver: redis
+      state_options:
+        connection:
+          host: 172.17.0.1
 
 os-std:
   module: "src.server.tasks.os_interaction.OSInteraction"
   parameters:
     name: "os-std"
-    concurrency: 24
-    round_limit: 8
-
-    docker_config:
-      localhost: local-os
-      directory: data/os_interaction/res/dockerfiles
-
-    scripts:
-      directory: data/os_interaction/res/scripts
-
     data_config:
       files:
         - problem_file: data/os_interaction/data/1/*.json
@@ -61,3 +91,14 @@ os-std:
 
       bk: [ ]
       ignore: [ ]
+
+os-env_train:
+  parameters:
+    name: "os-env_train"
+    data_config:
+      files:
+        - problem_file: data/os_interaction/train_0317/training.json
+          script_dir: data/os_interaction/scripts/7/
+          index_prefix: "train-0223-"
+      bk: [ ]
+      ignore: [ ]