Prefill-heavy, multi-turn tasks for benchmarking LLM inference optimizations on OpenClaw agent workloads. Includes document analysis and iterative coding scenarios.
- OpenClaw installed with Node.js v22+
- SGLang with a model (e.g., Qwen/Qwen3-4B-Instruct-2507)
- ContextPilot or any inference proxy to evaluate
The runner copies documents from data/workspace/ to OpenClaw's workspace (~/.openclaw/workspace/contracts/) automatically.
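For readers who want to stage the workspace by hand, the copy step described above can be approximated with the sketch below. It is not the runner's actual code: `copy_workspace` is a hypothetical helper, and it assumes the step is a plain recursive file copy.

```python
import shutil
from pathlib import Path


def copy_workspace(src: Path, dest: Path) -> list[Path]:
    """Copy every file under src into dest, preserving relative paths.

    Hypothetical helper: assumes the runner's copy is a plain recursive copy.
    """
    copied = []
    for path in sorted(src.rglob("*")):
        if path.is_file():
            target = dest / path.relative_to(src)
            # Create intermediate directories before copying the file.
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(target)
    return copied
```

With the paths from this README, a call would look like `copy_workspace(Path("data/workspace"), Path.home() / ".openclaw" / "workspace" / "contracts")`.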
git clone https://github.com/EfficientContext/ClawTasks.git
cd ClawTasks
# Run all 70 tasks
python scripts/run_bench.py --gpu 0
# Run one category
python scripts/run_bench.py --category commercial
python scripts/run_bench.py --category coding
# Analyze results
python scripts/analyze.py results/results.jsonl
22 synthetic enterprise documents (490 KB) + 1 codebase file (290 lines), 70 tasks across 5 categories, ~290 turns total.
| Category | Tasks | Focus |
|---|---|---|
| commercial/ | 10 | Contract values, SLAs, payments, proposal pricing, cost analysis |
| legal/ | 12 | Liability, IP, termination, confidentiality, NDA alignment, indemnification |
| compliance/ | 18 | Data protection, policies, certifications, proposal compliance, security audit |
| strategic/ | 20 | Vendor selection, procurement review, board briefings, lifecycle reviews |
| coding/ | 10 | Iterative code modification: agent outputs the complete file each turn |
See claw-tasks/ for per-task details and data/ for document sources and construction notes.
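The per-category counts can be checked against the task files with a short script. This is a sketch that assumes each `tasks.json` holds a JSON array with one entry per task; the README does not document the schema, and `count_tasks` is a hypothetical helper.

```python
import json
from pathlib import Path


def count_tasks(task_root: Path) -> dict[str, int]:
    """Return {category: number of tasks} for every <category>/tasks.json found."""
    counts = {}
    for task_file in sorted(task_root.glob("*/tasks.json")):
        with task_file.open(encoding="utf-8") as fh:
            tasks = json.load(fh)
        # Assumption: the file is a JSON array with one entry per task.
        counts[task_file.parent.name] = len(tasks)
    return counts
```

If the assumption holds, `count_tasks(Path("claw-tasks"))` should report the counts listed in the table.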
A technology company (Company A Pte Ltd) manages four vendor relationships for cloud, AI, security, and data services. The document workspace mirrors the full vendor management lifecycle:
NDA signed         Vendors submit        Company establishes    Individual contracts
before discussions    proposals           master framework        signed per vendor
     │                    │                       │                       │
     ▼                    ▼                       ▼                       ▼
┌─────────┐         ┌───────────┐           ┌───────────┐           ┌───────────┐
│  NDAs   │────────▶│ Proposals │──────────▶│    MSA    │──────────▶│ Contracts │
└─────────┘         └───────────┘           └───────────┘           └───────────┘
  2 files              4 files                 1 file                  4 files
  12 KB                242 KB                  9 KB                    182 KB
                                                                          │
                            ┌─────────────────────────────────────────────┤
                            │                                             │
                            ▼                                             ▼
                      ┌───────────┐        Internal policies        ┌───────────┐
                      │Amendments │       govern all vendors        │Assessments│
                      └───────────┘                │                └───────────┘
                         4 files                   ▼                   4 files
                         19 KB               ┌───────────┐             16 KB
                                             │ Policies  │                │
                                             └───────────┘                │
                                                2 files                   ▼
                                                13 KB             ┌───────────────┐
                                                                  │ Board Minutes │
                                                                  └───────────────┘
                                                                    1 file, 8 KB
                                                          Board reviews vendor performance
                                                           and approves renewal strategy
Each stage produces documents that reference earlier ones: proposals respond to requirements, contracts incorporate MSA terms, amendments modify contracts, assessments measure contract SLAs, and board minutes summarize assessment findings. This creates natural cross-document content overlap.
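The reference chain in the paragraph above can be written down as a small graph, which shows how much upstream material a single task may pull in. This is an illustrative sketch: the keys are hypothetical labels for the document types, and only the relationships named in the paragraph are encoded.

```python
# Direct references named in the text: document type -> types it builds on.
# The labels are hypothetical; they are not file or directory names.
REFERENCES = {
    "contracts": ["msa"],
    "amendments": ["contracts"],
    "assessments": ["contracts"],
    "board_minutes": ["assessments"],
}


def upstream(doc_type: str) -> set[str]:
    """All document types reachable by following references from doc_type."""
    seen = set()
    stack = [doc_type]
    while stack:
        for ref in REFERENCES.get(stack.pop(), []):
            if ref not in seen:
                seen.add(ref)
                stack.append(ref)
    return seen
```

Under this model, `upstream("board_minutes")` yields the assessments, the contracts, and the MSA, so a board-briefing task can touch content from all three.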
| Documents | Files | Size | Role in Lifecycle |
|---|---|---|---|
| NDAs | 2 | 12 KB | Signed first — protects confidential information during vendor evaluation |
| Vendor Proposals | 4 | 242 KB | Vendors respond to RFP with methodology, team, pricing, references |
| Master Service Agreement | 1 | 9 KB | Framework terms all vendor contracts inherit |
| Service Agreements | 4 | 182 KB | Per-vendor contracts with shared legal template (Articles 1-16) |
| Amendments | 4 | 19 KB | Modifications to contracts (scope, SLA, sustainability) |
| Vendor Assessments | 4 | 16 KB | Annual performance reviews measuring contract SLA compliance |
| Internal Policies | 2 | 13 KB | Information Security and Data Governance policies vendors must follow |
| Board Minutes | 1 | 8 KB | Board reviews vendor performance and approves renewal strategy |
The 60 document tasks cover every stage — from comparing proposals during vendor selection, to reviewing contracts for legal risk, to auditing compliance against internal policies, to preparing board briefings on vendor performance.
The 10 coding tasks simulate iterative code modification: the agent reads a 290-line Python service (user_service.py), then adds features across 3 turns, printing the complete updated file each time. Each turn's output is ~90% identical to the previous — only the new feature differs.
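One way to check the turn-to-turn similarity on your own runs is a line-level comparison with Python's standard `difflib`. This is a sketch, not the method used to derive the ~90% figure; `turn_similarity` is a hypothetical helper.

```python
import difflib


def turn_similarity(previous: str, current: str) -> float:
    """Line-level similarity between two full-file outputs (0.0 to 1.0)."""
    matcher = difflib.SequenceMatcher(
        None, previous.splitlines(), current.splitlines(), autojunk=False
    )
    # ratio() is 2 * matching_lines / (len(previous_lines) + len(current_lines)).
    return matcher.ratio()
```

A turn that appends one new line to a 9-line file scores 18/19, roughly 0.95; identical outputs score 1.0.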
| Property | Value |
|---|---|
| Avg input tokens (all turns) | ~46K |
| Max input tokens | ~93K |
| Avg output tokens | ~760 |
| Input/output ratio | ~60:1 |
| Content overlap (contracts) | ~70% |
| Content overlap (proposals) | ~66% |
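The README does not say how content overlap was measured. The sketch below shows one plausible line-level estimate, the share of non-empty lines in one document that also appear in another, and checks the ratio row against the token averages. `line_overlap` is a hypothetical helper, and its definition of overlap is an assumption.

```python
def line_overlap(doc_a: str, doc_b: str) -> float:
    """Fraction of non-empty lines in doc_a that also occur in doc_b."""
    lines_a = [ln.strip() for ln in doc_a.splitlines() if ln.strip()]
    lines_b = {ln.strip() for ln in doc_b.splitlines() if ln.strip()}
    if not lines_a:
        return 0.0
    return sum(ln in lines_b for ln in lines_a) / len(lines_a)


# Sanity check on the table: ~46K average input tokens over ~760 output tokens.
ratio = 46_000 / 760  # about 60.5, consistent with the ~60:1 row
```

Two contracts that share a legal template but differ in vendor-specific clauses would score high on this measure, in line with the overlap rows in the table.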
├── README.md
├── data/
│ ├── README.md # Data sources, construction notes, licensing
│ └── workspace/ # 22 enterprise documents (490 KB)
├── claw-tasks/
│ ├── README.md # Per-task details with documents used
│ ├── commercial/tasks.json # 10 tasks
│ ├── legal/tasks.json # 12 tasks
│ ├── compliance/tasks.json # 18 tasks
│ ├── strategic/tasks.json # 20 tasks
│ └── coding/tasks.json # 10 tasks
├── scripts/
│ ├── run_bench.py # Main benchmark runner
│ ├── run_coding_sglang.py # Coding-specific runner for SGLang
│ └── analyze.py
└── results/ # Generated by run_bench.py
Apache 2.0