A data-analysis benchmark for OpenClaw-style end-to-end agents. Every task is grounded in real-world data and has a single objective gold answer.
With the emergence of end-to-end agents like OpenClaw, data analysis is no longer equivalent to static QA — "read a passage, output one answer." Real-world data analysis tasks often require agents to locate evidence across heterogeneous files, filter and join entities across tables, perform statistical and normalization calculations, verify intermediate results, and strictly follow output constraints.
This means the core difficulty of a benchmark has shifted from answer generation alone to full agent-driven execution. A truly valuable data-analysis benchmark must test not only whether the final answer is correct, but also whether the agent can reliably complete a series of steps — retrieval, filtering, computation, verification, and constraint compliance — in complex data environments.
DataClaw is designed for exactly this shift. It evaluates not abstract capability divorced from execution, but how OpenClaw-style end-to-end agents actually perform on data analysis tasks under real data conditions, explicit task constraints, and a reproducible execution protocol.
DataClaw is a process-oriented data-analysis benchmark for realistic, complex data environments. Its core goal is not merely to measure agents' end-task performance, but to serve as a high-fidelity testbed that also evaluates, at fine granularity, how agents evolve when facing real-world complexity and multi-step reasoning.
DataClaw simulates at scale the noisy, weakly-semantic, cross-domain data environments found in the real world. Complex data-analysis questions are authored by domain experts in finance and computer science, and each task's process annotations and unique objective answers are cross-verified by human experts with AI assistance. Process annotations include task milestones, human-corrected reference trajectories, and evidence data sources. DataClaw adopts OpenClaw as its unified agent framework.
- From idealized data environments to imperfect real-world data environments. DataClaw contains a mix of structured and unstructured data, covering enterprise profiles, business operating status, regional industry statistics, national industry statistics, and policy texts. All data is collected from the real world and comes with friction such as missing indicators, inconsistent definitions, and inconsistent naming. Tasks face realistic data environments, not over-cleaned single-table lookups.
- From single-shot static queries to multi-step dynamic reasoning. DataClaw tasks typically require agents to complete a multi-stage chain of operations rather than producing a one-shot answer. The challenge for agents comes not only from retrieval but also from cross-source integration, metric construction, aggregation computation, and format constraint compliance.
- From outcome-oriented evaluation to process-oriented evaluation. DataClaw goes beyond simple outcome-accuracy evaluation and dissects how the agent's execution unfolds at fine granularity. Outcome-oriented evaluation paradigms focus only on final accuracy. This black-box approach ignores intermediate reasoning and provides little actionable signal for guiding optimization.
Key directories and scripts:
- `assets/database/`: Benchmark data files, injected wholesale into the container workspace at run time. The root contains `internal_metrics.csv` (internal business-logic knowledge base); `enterprise/`, `industry/`, and `policy/` hold the three theme-domain datasets.
- `assets/qa_raw/`: Raw task source files.
- `assets/qa_gold/`: Minimized gold files derived from `qa_raw`.
- `tasks/`: Generated OpenClaw task spec files.
- `dataclaw/build_tasks.py`: Builder that produces `qa_gold` and `tasks/` from `qa_raw`.
- `dataclaw/eval/run_batch.py`: Host-side evaluation orchestrator; one isolated container per task.
- `dataclaw/utils/docker_utils.py`: Container lifecycle management, OpenClaw onboarding, and model configuration.
- `dataclaw/utils/grading.py`: Outcome scoring (LLM-judged Acc).
- `dataclaw/utils/process_grading.py`: Process scoring (EE on correct tasks; GPR / TPE on incorrect tasks).
- `script/docker_save_image.sh`: Image build and export script.
Each evaluation task runs in its own Docker container. The host orchestrator manages the full lifecycle:
```text
Host (dataclaw/eval/run_batch.py)
|
+-- For each task (parallel via --parallel N):
      1. docker run  -> start isolated container
      2. docker cp   -> inject workspace files
      3. docker exec -> OpenClaw onboard
      4. docker exec -> start gateway (background)
      5. docker exec -> set model and run agent
      6. docker exec -> run llm_judge scoring
      7. docker cp   -> collect logs and results
      8. docker rm   -> remove container
```
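The sketch below mirrors this lifecycle as a minimal host-side Python routine. It is illustrative only: the `docker` invocations follow the numbered steps above, but the in-container commands are shown as placeholders, since the real ones are assembled by `dataclaw/utils/docker_utils.py`.

```python
import subprocess

def sh(*args: str) -> None:
    """Run a host-side command and raise if it fails."""
    subprocess.run(list(args), check=True)

# Minimal sketch of the per-task lifecycle. The exec'd strings are placeholders,
# not real OpenClaw commands; dataclaw/utils/docker_utils.py builds the actual ones.
def run_task(image: str, task_id: str, workspace: str) -> None:
    name = f"dataclaw_{task_id}"
    sh("docker", "run", "-d", "--name", name, image, "sleep", "infinity")      # 1. start isolated container
    try:
        sh("docker", "cp", workspace, f"{name}:/workspace")                    # 2. inject workspace files
        sh("docker", "exec", name, "bash", "-lc", "<openclaw onboard>")        # 3. OpenClaw onboard
        sh("docker", "exec", "-d", name, "bash", "-lc", "<start gateway>")     # 4. gateway in background
        sh("docker", "exec", name, "bash", "-lc", "<set model && run agent>")  # 5. set model and run agent
        sh("docker", "exec", name, "bash", "-lc", "<run llm_judge>")           # 6. outcome scoring
        sh("docker", "cp", f"{name}:/workspace/output", f"output/{task_id}")   # 7. collect logs and results
    finally:
        sh("docker", "rm", "-f", name)                                         # 8. remove container
```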
Download the pre-built image archive from Releases and load it:
```bash
docker load -i <dataclaw-image-archive>.tar
```
After loading, confirm the local image tag matches `DOCKER_IMAGE` in `.env`.
```bash
git clone <repository-url>
cd <repository-dir>
```
Use the actual repository URL shown on the GitHub page.
```bash
pip install pyyaml python-dotenv
```
`pyproject.toml` requires Python >= 3.10. For a fuller local dev setup, install additional dev dependencies as you prefer.
Copy the template:
```bash
cp .env.example .env
```
Edit `.env` and pay attention to at least the following fields:
| Variable | Required | Description |
|---|---|---|
| `DEFAULT_MODEL` | Yes | Model under test, e.g. `openrouter/anthropic/claude-sonnet-4.6` |
| `OPENROUTER_API_KEY` | One of two | Used when the main model or judge is called via OpenRouter |
| `OPENCLAW_CUSTOM_BASE_URL` + `OPENCLAW_CUSTOM_API_KEY` | One of two | Custom OpenAI-compatible API |
| `OPENCLAW_CUSTOM_MODEL_ID` | No | Explicit model id at the custom provider for the main model |
| `JUDGE_MODEL` | No | Judge model; default in `.env.example` |
| `JUDGE_CUSTOM_BASE_URL` + `JUDGE_CUSTOM_API_KEY` | No | Separate custom endpoint for the judge |
| `JUDGE_CUSTOM_MODEL_ID` | No | Explicit model id for the judge custom endpoint |
| `DOCKER_IMAGE` | No | Local image tag; must match the loaded image |
If you do not use OpenRouter, set in `.env`:
```bash
OPENCLAW_CUSTOM_BASE_URL=https://your-api-url/v1
OPENCLAW_CUSTOM_API_KEY=your_api_key
OPENCLAW_CUSTOM_MODEL_ID=your-provider/your-model
DEFAULT_MODEL=your-provider/your-model
```
If the API runs on the host:
```bash
OPENCLAW_CUSTOM_BASE_URL=http://host.docker.internal:8000/v1
```
When the judge uses a separate endpoint:
```bash
JUDGE_CUSTOM_BASE_URL=https://your-judge-api-url/v1
JUDGE_CUSTOM_API_KEY=your_judge_api_key
JUDGE_CUSTOM_MODEL_ID=your-provider/your-judge-model
```

| Scenario | Main model | Judge | Required config |
|---|---|---|---|
| A | Custom API | OpenRouter | OPENCLAW_CUSTOM_* + OPENROUTER_API_KEY |
| B | OpenRouter | OpenRouter | OPENROUTER_API_KEY |
| C | Custom API | Custom API (separate endpoint) | OPENCLAW_CUSTOM_* + JUDGE_CUSTOM_* |
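As a concrete example, a scenario-A configuration (custom API for the main model, OpenRouter for the judge) could look like the following `.env` fragment; all values below are placeholders, and the judge model shown is only an example, not the documented default.

```bash
# Scenario A: main model via a custom OpenAI-compatible API, judge via OpenRouter.
DEFAULT_MODEL=your-provider/your-model
OPENCLAW_CUSTOM_BASE_URL=https://your-api-url/v1
OPENCLAW_CUSTOM_API_KEY=your_api_key
OPENCLAW_CUSTOM_MODEL_ID=your-provider/your-model
OPENROUTER_API_KEY=your_openrouter_key
JUDGE_MODEL=openrouter/anthropic/claude-sonnet-4.6
```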
```bash
# Run all tasks
python dataclaw/eval/run_batch.py --model openrouter/anthropic/claude-sonnet-4.6

# Run selected tasks
python dataclaw/eval/run_batch.py --model ... --suite task_001,task_002

# Run in parallel
python dataclaw/eval/run_batch.py --model ... --parallel 4

# Run a single task file
python dataclaw/eval/run_batch.py --task tasks/task_001_xxx.md

# Or use the convenience script (reads DEFAULT_MODEL from .env)
bash script/run.sh
```

| Flag | Default | Description |
|---|---|---|
| `--model` / `-m` | `DEFAULT_MODEL` in `.env` | Model under test |
| `--judge` | `JUDGE_MODEL` in `.env` | Judge model |
| `--suite` / `-s` | `all` | `all` or comma-separated task IDs |
| `--task` / `-t` | — | Path to a single task `.md` file |
| `--parallel` / `-p` | `1` | Parallel container count |
| `--timeout-multiplier` | `1.0` | Scale all task timeouts |
| `--runs` | `1` | Repeat runs per task |
| `--resume` | — | Resume from last interrupted run |
| `--verbose` / `-v` | — | Enable verbose logging |
After a run completes, results are saved under output/<task_id>/<model_timestamp_runid>/:
```text
output/<task_id>/<suffix>/
├── score.json                # outcome score (Acc)
├── process_score.json        # process scores (EE / GPR / TPE)
├── usage.json                # token usage, cost, elapsed time
├── agent.log                 # agent execution log
├── gateway.log               # gateway log
├── chat.jsonl                # full conversation record
├── judge_chat.jsonl          # outcome-judge conversation
└── judge_process_chat.jsonl  # process-judge conversation
```
A global summary is written to:
```text
output/summary_<model>.json
```
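For quick post-hoc analysis, the per-task result files can be rolled up with a short script. This is only a sketch: it assumes `score.json` exposes an `acc` field and `usage.json` a `total_cost` field, which may differ from the schema your run actually writes; adjust the keys accordingly.

```python
import json
from pathlib import Path

# Minimal sketch: walk output/<task_id>/<suffix>/ and average the outcome scores.
# Field names ("acc", "total_cost") are assumptions; check the JSON your run produced.
def summarize(output_dir: str = "output") -> None:
    accs, costs = [], []
    for score_file in Path(output_dir).glob("*/*/score.json"):
        run_dir = score_file.parent
        accs.append(json.loads(score_file.read_text()).get("acc", 0.0))
        usage_file = run_dir / "usage.json"
        if usage_file.exists():
            costs.append(json.loads(usage_file.read_text()).get("total_cost", 0.0))
    if accs:
        print(f"tasks: {len(accs)}  mean Acc: {sum(accs) / len(accs):.3f}  total cost: {sum(costs):.2f}")

if __name__ == "__main__":
    summarize()
```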
DataClaw scores each run along four metrics.
| Metric | Definition | Scope | Direction |
|---|---|---|---|
| Acc | LLM-judge semantic match between predicted answer â and gold answer a; multi-subquestion tasks return the normalized mean over the L sub-answers. | All tasks | ↑ |
| EE | Execution Efficiency = N / T, where N is the gold reference step count and T is the agent's actual step count. EE > 1 means the agent solved it in fewer steps than the gold trajectory. | Correct tasks only | ↑ |
| GPR | Goal Progress Rate = (1/M) Σⱼ 𝕀(mⱼ achieved); fraction of M annotated milestones the agent reaches anywhere in its trajectory. Captures partial process credit when the final answer is wrong. | Incorrect tasks only | ↑ |
| TPE | Temporal Progress Efficiency = (Σⱼ 𝕀(mⱼ) · γ^max(tⱼ − N, 0)) / Σⱼ 𝕀(mⱼ), γ = 0.9; averages an exponential temporal-decay factor over the milestones the agent did achieve. Milestones reached by step N contribute 1; later ones decay. TPE = 1 means all achieved milestones were on time, lower values indicate later concentration. Range [0, 1]. | Incorrect tasks only | ↑ |
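A minimal sketch of the three process metrics exactly as defined above. The milestone representation used here (a list holding, for each annotated milestone, the step at which it was first achieved, or `None` if it never was) is an assumption for illustration, not the repository's internal format.

```python
from typing import Optional, Sequence

GAMMA = 0.9  # temporal-decay base used by TPE

def execution_efficiency(gold_steps: int, agent_steps: int) -> float:
    """EE = N / T; values > 1 mean fewer steps than the gold trajectory."""
    return gold_steps / agent_steps

def goal_progress_rate(milestone_steps: Sequence[Optional[int]]) -> float:
    """GPR = fraction of the M annotated milestones achieved anywhere in the trajectory."""
    achieved = [s for s in milestone_steps if s is not None]
    return len(achieved) / len(milestone_steps)

def temporal_progress_efficiency(milestone_steps: Sequence[Optional[int]], gold_steps: int) -> float:
    """TPE averages gamma^max(t_j - N, 0) over the achieved milestones only."""
    achieved = [s for s in milestone_steps if s is not None]
    if not achieved:
        return 0.0  # undefined when no milestone is achieved; 0.0 by convention here
    return sum(GAMMA ** max(t - gold_steps, 0) for t in achieved) / len(achieved)

# Example: 4 annotated milestones, 3 achieved (at steps 2, 5, and 9), gold trajectory of 6 steps.
steps = [2, 5, None, 9]
print(goal_progress_rate(steps))               # 0.75
print(temporal_progress_efficiency(steps, 6))  # (1 + 1 + 0.9**3) / 3 ≈ 0.910
```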
Long batch evaluations may be interrupted unexpectedly. The evaluation framework automatically saves a progress file after each task completes:
```text
output/progress_<model>.json
```
Simply append --resume to the original command to resume:
```bash
# Original run (interrupted midway)
python dataclaw/eval/run_batch.py --model openrouter/anthropic/claude-sonnet-4.6 --suite all

# Resume (keep other arguments the same)
python dataclaw/eval/run_batch.py --model openrouter/anthropic/claude-sonnet-4.6 --suite all --resume
```
On resume, the framework automatically verifies that `--model`, `--suite`, and `--runs` match the previous run; if they don't, it exits with an error. Once all tasks complete, the progress file is removed automatically.
To discard previous progress and start fresh, simply delete the progress file:
```bash
rm output/progress_<model>.json
```
Interrupted runs may leave behind uncleaned containers. Clean them up as follows:
```bash
IMAGE=<your-docker-image-tag>
docker ps -a --filter "ancestor=${IMAGE}" -q | xargs -r docker rm -f
```
To preview the containers that would be removed:
```bash
IMAGE=<your-docker-image-tag>
docker ps -a --filter "ancestor=${IMAGE}" --format "{{.Names}}\t{{.Status}}"
```

DataClaw's data does not come from synthetic samples or teaching examples; it is built on the publishing team's long-term, front-line data accumulation and industry insights from research on Chinese enterprises, industries, and policies. The current version is mainly based on data from 2022. After necessary de-identification, tasks are constructed to avoid model knowledge leakage as much as possible while preserving the information noise and data friction found in real business settings. Task authoring and annotation are conducted by a professional team from Lingnan College, Sun Yat-sen University, balancing academic rigor and practical usability.
Under a theme-domain view and a business-oriented taxonomy, the current data environment is organized into 3 theme domains: Enterprise, Industry, and Policy. Subcategories cover enterprise profiles and regional profiles, enterprise core competitiveness, business operating status, regional/national industry statistics, policy releases, and full policy texts, closely aligned with real research and consulting workflows. The 3 theme domains contain 17 independent data sources (each data file counts as one source, all mounted under `assets/database/`), placed under the theme subdirectories `enterprise/`, `industry/`, and `policy/`. In addition, one root-level file, `internal_metrics.csv`, serves as an internal business-logic knowledge base and does not belong to any theme domain. Details below.
| Dimension | Value | Notes |
|---|---|---|
| Theme domains | 3 | Enterprise, Industry, Policy |
| Secondary themes | 7 | Enterprise ×3 (profiles, core competitiveness, business status), Industry ×2 (regional industry, national industry), Policy ×2 (release status, full text) |
| Total data sources | 17 | Injected into the container workspace at run time |
| Format | CSV | Primarily CSV; includes both structured fields and unstructured long-text content |
| Time span | Mainly concentrated in 2022 | Statistical periods vary across sources |
The 17 data sources by theme domain and secondary theme
| Theme domain | Secondary theme | Sources | Files |
|---|---|---|---|
| Enterprise | Enterprise profiles | 5 | `enterprise/company_profile.csv`, `enterprise/company_profile_as.csv`, `enterprise/company_profile_eu.csv`, `enterprise/company_profile_na.csv`, `enterprise/company_profile_oc.csv` |
| Enterprise | Core competitiveness | 1 | `enterprise/company_core.csv` |
| Enterprise | Business status | 3 | `enterprise/company_operation_status.csv`, `enterprise/company_operation_status_detail.csv`, `enterprise/company_operation_yearly_status.csv` |
| Industry | Regional industry | 3 | `industry/regional_industry_status.csv`, `industry/regional_industry_status_detail.csv`, `industry/regional_industry_yearly_status.csv` |
| Industry | National industry | 3 | `industry/national_industry_status.csv`, `industry/national_industry_status_detail.csv`, `industry/national_industry_yearly_status.csv` |
| Policy | Policy release status | 1 | `policy/policy_release_status.csv` |
| Policy | Full policy text | 1 | `policy/policy_resource.csv` |
At execution time, agents typically need to align entities across files, join across tables, normalize definitions, and perform aggregation calculations, rather than simply looking up values in a single file; when needed, they must also consult the business conventions in `internal_metrics.csv`. This is the core value of DataClaw for evaluating real-scenario data understanding and reasoning capabilities.
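As an illustration of the kind of work a task demands, the sketch below joins an enterprise table to a regional industry table and computes a normalized metric. The column names (`company_id`, `region`, `industry_code`, `revenue`, `industry_output`) are purely hypothetical and do not reflect the actual schemas in `assets/database/`.

```python
import pandas as pd

# Hypothetical sketch of a typical cross-source task: the column names below are
# illustrative only and are NOT the actual schemas shipped in assets/database/.
profiles = pd.read_csv("database/enterprise/company_profile.csv")
operations = pd.read_csv("database/enterprise/company_operation_status.csv")
regional = pd.read_csv("database/industry/regional_industry_status.csv")

# 1. Align entities: join company operations onto profiles by a shared key.
merged = operations.merge(profiles, on="company_id", how="inner")

# 2. Join across themes: attach regional industry statistics by region + industry code.
merged = merged.merge(regional, on=["region", "industry_code"], how="left")

# 3. Construct a metric and aggregate: revenue share of regional industry output.
merged["revenue_share"] = merged["revenue"] / merged["industry_output"]
answer = (
    merged.groupby("region")["revenue_share"]
    .mean()
    .sort_values(ascending=False)
    .head(3)
)
print(answer)
```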
The current version contains 492 tasks across 7 categories, with an overall difficulty distribution of 131 easy / 286 medium / 75 hard.
| Category code | Meaning | Count | Difficulty split |
|---|---|---|---|
| `enterprise_industry_analysis` | Enterprise–industry analysis | 226 | easy 115 / medium 111 |
| `enterprise_industry_policy_analysis` | Enterprise–industry–policy linkage analysis | 76 | easy 10 / medium 66 |
| `comprehensive_decision` | Comprehensive decision | 70 | easy 6 / medium 45 / hard 19 |
| `international_comparison` | International comparison | 39 | medium 25 / hard 14 |
| `hypothesis_verification` | Hypothesis verification | 29 | medium 14 / hard 15 |
| `industry_planning` | Industry planning | 28 | medium 14 / hard 14 |
| `risk_assessment` | Risk assessment | 24 | medium 11 / hard 13 |
Except for the 39 `international_comparison` tasks, all others are explicitly restricted in the current task spec to use only `./database/`, with no web search.
DataClaw is jointly released by Prof. Chuan Chen's team at the School of Computer Science, Sun Yat-sen University, and the Southern Weekly Sci-Tech Power Research Center. We sincerely thank the Southern Weekly Sci-Tech Power Research Center for providing invaluable data and tremendous support.
This project also builds on excellent open-source agent ecosystems. We gratefully acknowledge:
