A data-analysis benchmark for OpenClaw-style end-to-end agents. Every task is grounded in real-world data and has a single objective gold answer.
With the emergence of end-to-end agents like OpenClaw, data analysis is no longer equivalent to static QA — "read a passage, output one answer." Real-world data analysis tasks often require agents to locate evidence across heterogeneous files, filter and join entities across tables, perform statistical and normalization calculations, verify intermediate results, and strictly follow output constraints.
This means the core difficulty of a benchmark has shifted from answer generation alone to full agent-driven execution. A truly valuable data-analysis benchmark must test not only whether the final answer is correct, but also whether the agent can reliably complete a series of steps — retrieval, filtering, computation, verification, and constraint compliance — in complex data environments.
DataClaw is designed for exactly this shift. It evaluates not abstract capability divorced from execution, but how OpenClaw-style end-to-end agents actually perform on data analysis tasks under real data conditions, explicit task constraints, and a reproducible execution protocol.
DataClaw is a process-oriented data-analysis benchmark for realistic, complex data environments. Its core goal is not merely to measure agents' end-task performance, but to serve as a high-fidelity testbed that also evaluates, at fine granularity, how agents evolve when facing real-world complexity and multi-step reasoning.
DataClaw simulates at scale the noisy, weakly-semantic, cross-domain data environments found in the real world. Complex data-analysis questions are authored by domain experts in finance and computer science, and each task's process annotations and unique objective answers are cross-verified by human experts with AI assistance. Process annotations include task milestones, human-corrected reference trajectories, and evidence data sources. DataClaw adopts OpenClaw as its unified agent framework.
- From idealized data environments to imperfect real-world data environments. DataClaw contains a mix of structured and unstructured data, covering enterprise profiles, business operating status, regional industry statistics, national industry statistics, and policy texts. All data is collected from the real world and comes with friction such as missing indicators, inconsistent definitions, and inconsistent naming. Tasks face realistic data environments, not over-cleaned single-table lookups.
- From single-shot static queries to multi-step dynamic reasoning. DataClaw tasks typically require agents to complete a multi-stage chain of operations rather than producing a one-shot answer. The challenge for agents comes not only from retrieval but also from cross-source integration, metric construction, aggregation computation, and format constraint compliance.
- From outcome-oriented evaluation to process-oriented evaluation. DataClaw goes beyond simple outcome-accuracy evaluation and dissects how the agent's execution unfolds at fine granularity. Outcome-oriented evaluation paradigms focus only on final accuracy. This black-box approach ignores intermediate reasoning and provides little actionable signal for guiding optimization.
Key directories and scripts:
- `assets/database/`: Benchmark data files, injected wholesale into the container workspace at run time. The root contains `internal_metrics.csv` (internal business-logic knowledge base); `enterprise/`, `industry/`, and `policy/` hold the three theme-domain datasets.
- `assets/qa_raw/`: Raw task source files.
- `assets/qa_gold/`: Minimized gold files derived from `qa_raw`.
- `tasks/`: Generated OpenClaw task spec files.
- `dataclaw/build_tasks.py`: Builder that produces `qa_gold` and `tasks/` from `qa_raw`.
- `dataclaw/eval/run_batch.py`: Host-side evaluation orchestrator; one isolated container per task.
- `dataclaw/utils/docker_utils.py`: Container lifecycle management, OpenClaw onboarding, and model configuration.
- `dataclaw/utils/grading.py`: Outcome scoring (LLM-judged Acc).
- `dataclaw/utils/process_grading.py`: Process scoring (EE on correct tasks; GPR / TPE on incorrect tasks).
- `script/docker_save_image.sh`: Image build and export script.
Each evaluation task runs in its own Docker container. The host orchestrator manages the full lifecycle:
```text
Host (dataclaw/eval/run_batch.py)
|
+-- For each task (parallel via --parallel N):
      1. docker run  -> start isolated container
      2. docker cp   -> inject workspace files
      3. docker exec -> OpenClaw onboard
      4. docker exec -> start gateway (background)
      5. docker exec -> set model and run agent
      6. docker exec -> run llm_judge scoring
      7. docker cp   -> collect logs and results
      8. docker rm   -> remove container
```
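The sketch below mirrors this lifecycle as a minimal host-side Python routine. It is illustrative only: the `docker` invocations follow the numbered steps above, but the in-container commands are shown as placeholders, since the real ones are assembled by `dataclaw/utils/docker_utils.py`.

```python
import subprocess

def sh(*args: str) -> None:
    """Run a host-side command and raise if it fails."""
    subprocess.run(list(args), check=True)

# Minimal sketch of the per-task lifecycle. The exec'd strings are placeholders,
# not real OpenClaw commands; dataclaw/utils/docker_utils.py builds the actual ones.
def run_task(image: str, task_id: str, workspace: str) -> None:
    name = f"dataclaw_{task_id}"
    sh("docker", "run", "-d", "--name", name, image, "sleep", "infinity")      # 1. start isolated container
    try:
        sh("docker", "cp", workspace, f"{name}:/workspace")                    # 2. inject workspace files
        sh("docker", "exec", name, "bash", "-lc", "<openclaw onboard>")        # 3. OpenClaw onboard
        sh("docker", "exec", "-d", name, "bash", "-lc", "<start gateway>")     # 4. gateway in background
        sh("docker", "exec", name, "bash", "-lc", "<set model && run agent>")  # 5. set model and run agent
        sh("docker", "exec", name, "bash", "-lc", "<run llm_judge>")           # 6. outcome scoring
        sh("docker", "cp", f"{name}:/workspace/output", f"output/{task_id}")   # 7. collect logs and results
    finally:
        sh("docker", "rm", "-f", name)                                         # 8. remove container
```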
Download the pre-built image archive from Releases and load it:
```bash
docker load -i <dataclaw-image-archive>.tar
```
After loading, confirm the local image tag matches `DOCKER_IMAGE` in `.env`.
```bash
git clone <repository-url>
cd <repository-dir>
```
Use the actual repository URL shown on the GitHub page.
```bash
pip install pyyaml python-dotenv
```
`pyproject.toml` requires Python >= 3.10. For a fuller local dev setup, install additional dev dependencies as you prefer.
Copy the template:
```bash
cp .env.example .env
```
Edit `.env` and pay attention to at least the following fields:
| Variable | Required | Description |
|---|---|---|
| `DEFAULT_MODEL` | Yes | Model under test, e.g. `openrouter/anthropic/claude-sonnet-4.6` |
| `OPENROUTER_API_KEY` | One of two | Used when the main model or judge is called via OpenRouter |
| `OPENCLAW_CUSTOM_BASE_URL` + `OPENCLAW_CUSTOM_API_KEY` | One of two | Custom OpenAI-compatible API |
| `OPENCLAW_CUSTOM_MODEL_ID` | No | Explicit model id at the custom provider for the main model |
| `JUDGE_MODEL` | No | Judge model; default in `.env.example` |
| `JUDGE_CUSTOM_BASE_URL` + `JUDGE_CUSTOM_API_KEY` | No | Separate custom endpoint for the judge |
| `JUDGE_CUSTOM_MODEL_ID` | No | Explicit model id for the judge custom endpoint |
| `DOCKER_IMAGE` | No | Local image tag; must match the loaded image |
If you do not use OpenRouter, set in `.env`:
```bash
OPENCLAW_CUSTOM_BASE_URL=https://your-api-url/v1
OPENCLAW_CUSTOM_API_KEY=your_api_key
OPENCLAW_CUSTOM_MODEL_ID=your-provider/your-model
DEFAULT_MODEL=your-provider/your-model
```
If the API runs on the host:
```bash
OPENCLAW_CUSTOM_BASE_URL=http://host.docker.internal:8000/v1
```
When the judge uses a separate endpoint:
```bash
JUDGE_CUSTOM_BASE_URL=https://your-judge-api-url/v1
JUDGE_CUSTOM_API_KEY=your_judge_api_key
JUDGE_CUSTOM_MODEL_ID=your-provider/your-judge-model
```

| Scenario | Main model | Judge | Required config |
|---|---|---|---|
| A | Custom API | OpenRouter | OPENCLAW_CUSTOM_* + OPENROUTER_API_KEY |
| B | OpenRouter | OpenRouter | OPENROUTER_API_KEY |
| C | Custom API | Custom API (separate endpoint) | OPENCLAW_CUSTOM_* + JUDGE_CUSTOM_* |
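As a concrete example, a scenario-A configuration (custom API for the main model, OpenRouter for the judge) could look like the following `.env` fragment; all values below are placeholders, and the judge model shown is only an example, not the documented default.

```bash
# Scenario A: main model via a custom OpenAI-compatible API, judge via OpenRouter.
DEFAULT_MODEL=your-provider/your-model
OPENCLAW_CUSTOM_BASE_URL=https://your-api-url/v1
OPENCLAW_CUSTOM_API_KEY=your_api_key
OPENCLAW_CUSTOM_MODEL_ID=your-provider/your-model
OPENROUTER_API_KEY=your_openrouter_key
JUDGE_MODEL=openrouter/anthropic/claude-sonnet-4.6
```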
```bash
# Run all tasks
python dataclaw/eval/run_batch.py --model openrouter/anthropic/claude-sonnet-4.6

# Run selected tasks
python dataclaw/eval/run_batch.py --model ... --suite task_001,task_002

# Run in parallel
python dataclaw/eval/run_batch.py --model ... --parallel 4

# Run a single task file
python dataclaw/eval/run_batch.py --task tasks/task_001_xxx.md

# Or use the convenience script (reads DEFAULT_MODEL from .env)
bash script/run.sh
```

| Flag | Default | Description |
|---|---|---|
| `--model` / `-m` | `DEFAULT_MODEL` in `.env` | Model under test |
| `--judge` | `JUDGE_MODEL` in `.env` | Judge model |
| `--suite` / `-s` | `all` | `all` or comma-separated task IDs |
| `--task` / `-t` | — | Path to a single task `.md` file |
| `--parallel` / `-p` | `1` | Parallel container count |
| `--timeout-multiplier` | `1.0` | Scale all task timeouts |
| `--runs` | `1` | Repeat runs per task |
| `--resume` | — | Resume from last interrupted run |
| `--verbose` / `-v` | — | Enable verbose logging |
After a run completes, results are saved under output/<task_id>/<model_timestamp_runid>/:
```text
output/<task_id>/<suffix>/
├── score.json                # outcome score (Acc)
├── process_score.json        # process scores (EE / GPR / TPE)
├── usage.json                # token usage, cost, elapsed time
├── agent.log                 # agent execution log
├── gateway.log               # gateway log
├── chat.jsonl                # full conversation record
├── judge_chat.jsonl          # outcome-judge conversation
└── judge_process_chat.jsonl  # process-judge conversation
```
A global summary is written to:
```text
output/summary_<model>.json
```
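For quick post-hoc analysis, the per-task result files can be rolled up with a short script. This is only a sketch: it assumes `score.json` exposes an `acc` field and `usage.json` a `total_cost` field, which may differ from the schema your run actually writes; adjust the keys accordingly.

```python
import json
from pathlib import Path

# Minimal sketch: walk output/<task_id>/<suffix>/ and average the outcome scores.
# Field names ("acc", "total_cost") are assumptions; check the JSON your run produced.
def summarize(output_dir: str = "output") -> None:
    accs, costs = [], []
    for score_file in Path(output_dir).glob("*/*/score.json"):
        run_dir = score_file.parent
        accs.append(json.loads(score_file.read_text()).get("acc", 0.0))
        usage_file = run_dir / "usage.json"
        if usage_file.exists():
            costs.append(json.loads(usage_file.read_text()).get("total_cost", 0.0))
    if accs:
        print(f"tasks: {len(accs)}  mean Acc: {sum(accs) / len(accs):.3f}  total cost: {sum(costs):.2f}")

if __name__ == "__main__":
    summarize()
```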
DataClaw scores each run along four metrics.
| Metric | Definition | Scope | Direction |
|---|---|---|---|
| Acc | LLM-judge semantic match between predicted answer â and gold answer a; multi-subquestion tasks return the normalized mean over the L sub-answers. | All tasks | ↑ |
| EE | Execution Efficiency = N / T, where N is the gold reference step count and T is the agent's actual step count. EE > 1 means the agent solved it in fewer steps than the gold trajectory. | Correct tasks only | ↑ |
| GPR | Goal Progress Rate = (1/M) Σⱼ 𝕀(mⱼ achieved); fraction of M annotated milestones the agent reaches anywhere in its trajectory. Captures partial process credit when the final answer is wrong. | Incorrect tasks only | ↑ |
| TPE | Temporal Progress Efficiency = (Σⱼ 𝕀(mⱼ) · γ^max(tⱼ − N, 0)) / Σⱼ 𝕀(mⱼ), γ = 0.9; averages an exponential temporal-decay factor over the milestones the agent did achieve. Milestones reached by step N contribute 1; later ones decay. TPE = 1 means all achieved milestones were on time, lower values indicate later concentration. Range [0, 1]. | Incorrect tasks only | ↑ |
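A minimal sketch of the three process metrics exactly as defined above. The milestone representation used here (a list holding, for each annotated milestone, the step at which it was first achieved, or `None` if it never was) is an assumption for illustration, not the repository's internal format.

```python
from typing import Optional, Sequence

GAMMA = 0.9  # temporal-decay base used by TPE

def execution_efficiency(gold_steps: int, agent_steps: int) -> float:
    """EE = N / T; values > 1 mean fewer steps than the gold trajectory."""
    return gold_steps / agent_steps

def goal_progress_rate(milestone_steps: Sequence[Optional[int]]) -> float:
    """GPR = fraction of the M annotated milestones achieved anywhere in the trajectory."""
    achieved = [s for s in milestone_steps if s is not None]
    return len(achieved) / len(milestone_steps)

def temporal_progress_efficiency(milestone_steps: Sequence[Optional[int]], gold_steps: int) -> float:
    """TPE averages gamma^max(t_j - N, 0) over the achieved milestones only."""
    achieved = [s for s in milestone_steps if s is not None]
    if not achieved:
        return 0.0  # undefined when no milestone is achieved; 0.0 by convention here
    return sum(GAMMA ** max(t - gold_steps, 0) for t in achieved) / len(achieved)

# Example: 4 annotated milestones, 3 achieved (at steps 2, 5, and 9), gold trajectory of 6 steps.
steps = [2, 5, None, 9]
print(goal_progress_rate(steps))               # 0.75
print(temporal_progress_efficiency(steps, 6))  # (1 + 1 + 0.9**3) / 3 ≈ 0.910
```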
Long batch evaluations may be interrupted unexpectedly. The evaluation framework automatically saves a progress file after each task completes:
```text
output/progress_<model>.json
```
Simply append --resume to the original command to resume:
```bash
# Original run (interrupted midway)
python dataclaw/eval/run_batch.py --model openrouter/anthropic/claude-sonnet-4.6 --suite all

# Resume (keep other arguments the same)
python dataclaw/eval/run_batch.py --model openrouter/anthropic/claude-sonnet-4.6 --suite all --resume
```
On resume, the framework automatically verifies that `--model`, `--suite`, and `--runs` match the previous run; if they don't, it exits with an error. Once all tasks complete, the progress file is removed automatically.
To discard previous progress and start fresh, simply delete the progress file:
```bash
rm output/progress_<model>.json
```
Interrupted runs may leave behind uncleaned containers. Clean them up as follows:
```bash
IMAGE=<your-docker-image-tag>
docker ps -a --filter "ancestor=${IMAGE}" -q | xargs -r docker rm -f
```
To preview the containers that would be removed:
```bash
IMAGE=<your-docker-image-tag>
docker ps -a --filter "ancestor=${IMAGE}" --format "{{.Names}}\t{{.Status}}"
```

DataClaw's data does not come from synthetic samples or teaching examples; it is built on the publishing team's long-term, front-line data accumulation and industry insights from research on Chinese enterprises, industries, and policies. The current version is mainly based on data from 2022. After necessary de-identification, tasks are constructed to avoid model knowledge leakage as much as possible while preserving the information noise and data friction found in real business settings. Task authoring and annotation are conducted by a professional team from Lingnan College, Sun Yat-sen University, balancing academic rigor and practical usability.
Under a theme-domain view and a business-oriented taxonomy, the current data environment is organized into 3 theme domains: Enterprise, Industry, and Policy. Subcategories cover enterprise profiles and regional profiles, enterprise core competitiveness, business operating status, regional/national industry statistics, policy releases, and full policy texts, closely aligned with real research and consulting workflows. The 3 theme domains contain 17 independent data sources (each data file counts as one source, all mounted under `assets/database/`), placed under the theme subdirectories `enterprise/`, `industry/`, and `policy/`. In addition, one root-level file, `internal_metrics.csv`, serves as an internal business-logic knowledge base and does not belong to any theme domain. Details below.
| Dimension | Value | Notes |
|---|---|---|
| Theme domains | 3 | Enterprise, Industry, Policy |
| Secondary themes | 7 | Enterprise ×3 (profiles, core competitiveness, business status), Industry ×2 (regional industry, national industry), Policy ×2 (release status, full text) |
| Total data sources | 17 | Injected into the container workspace at run time |
| Format | CSV | Primarily CSV; includes both structured fields and unstructured long-text content |
| Time span | Mainly concentrated in 2022 | Statistical periods vary across sources |
The 17 data sources by theme domain and secondary theme
| Theme domain | Secondary theme | Sources | Files |
|---|---|---|---|
| Enterprise | Enterprise profiles | 5 | `enterprise/company_profile.csv`, `enterprise/company_profile_as.csv`, `enterprise/company_profile_eu.csv`, `enterprise/company_profile_na.csv`, `enterprise/company_profile_oc.csv` |
| Enterprise | Core competitiveness | 1 | `enterprise/company_core.csv` |
| Enterprise | Business status | 3 | `enterprise/company_operation_status.csv`, `enterprise/company_operation_status_detail.csv`, `enterprise/company_operation_yearly_status.csv` |
| Industry | Regional industry | 3 | `industry/regional_industry_status.csv`, `industry/regional_industry_status_detail.csv`, `industry/regional_industry_yearly_status.csv` |
| Industry | National industry | 3 | `industry/national_industry_status.csv`, `industry/national_industry_status_detail.csv`, `industry/national_industry_yearly_status.csv` |
| Policy | Policy release status | 1 | `policy/policy_release_status.csv` |
| Policy | Full policy text | 1 | `policy/policy_resource.csv` |
At execution time, agents typically need to align entities across files, join across tables, normalize definitions, and perform aggregation calculations, rather than simply looking up values in a single file; when needed, they must also consult the business conventions in `internal_metrics.csv`. This is the core value of DataClaw for evaluating real-scenario data understanding and reasoning capabilities.
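As an illustration of the kind of work a task demands, the sketch below joins an enterprise table to a regional industry table and computes a normalized metric. The column names (`company_id`, `region`, `industry_code`, `revenue`, `industry_output`) are purely hypothetical and do not reflect the actual schemas in `assets/database/`.

```python
import pandas as pd

# Hypothetical sketch of a typical cross-source task: the column names below are
# illustrative only and are NOT the actual schemas shipped in assets/database/.
profiles = pd.read_csv("database/enterprise/company_profile.csv")
operations = pd.read_csv("database/enterprise/company_operation_status.csv")
regional = pd.read_csv("database/industry/regional_industry_status.csv")

# 1. Align entities: join company operations onto profiles by a shared key.
merged = operations.merge(profiles, on="company_id", how="inner")

# 2. Join across themes: attach regional industry statistics by region + industry code.
merged = merged.merge(regional, on=["region", "industry_code"], how="left")

# 3. Construct a metric and aggregate: revenue share of regional industry output.
merged["revenue_share"] = merged["revenue"] / merged["industry_output"]
answer = (
    merged.groupby("region")["revenue_share"]
    .mean()
    .sort_values(ascending=False)
    .head(3)
)
print(answer)
```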
The current version contains 492 tasks across 7 categories, with an overall difficulty distribution of 131 easy / 286 medium / 75 hard.
| Category code | Meaning | Count | Difficulty split |
|---|---|---|---|
| `enterprise_industry_analysis` | Enterprise–industry analysis | 226 | easy 115 / medium 111 |
| `enterprise_industry_policy_analysis` | Enterprise–industry–policy linkage analysis | 76 | easy 10 / medium 66 |
| `comprehensive_decision` | Comprehensive decision | 70 | easy 6 / medium 45 / hard 19 |
| `international_comparison` | International comparison | 39 | medium 25 / hard 14 |
| `hypothesis_verification` | Hypothesis verification | 29 | medium 14 / hard 15 |
| `industry_planning` | Industry planning | 28 | medium 14 / hard 14 |
| `risk_assessment` | Risk assessment | 24 | medium 11 / hard 13 |
Except for the 39 `international_comparison` tasks, all others are explicitly restricted in the current task spec to use only `./database/`, with no web search.
DataClaw is jointly released by Prof. Chuan Chen's team at the School of Computer Science, Sun Yat-sen University, and the Southern Weekly Sci-Tech Power Research Center. We sincerely thank the Southern Weekly Sci-Tech Power Research Center for providing invaluable data and tremendous support.
This project also builds on excellent open-source agent ecosystems. We gratefully acknowledge:
