Before running any benchmarks, you need to set up the environment and ensure the local Agent SDK submodule is initialized.
```bash
make build
```

<details>
<summary>📦 Submodule & Environment Setup</summary>
The Benchmarks project uses a local git submodule for the OpenHands Agent SDK.
This ensures your code runs against a specific, reproducible commit.
Run once after cloning (already done for you by `make build`):
```bash
git submodule update --init --recursive
```

This command will:
- clone the SDK into `vendor/agent-sdk/`
- check out the exact commit pinned by this repo
- make it available for local development (`uv sync` will install from the local folder)
If you ever clone this repository again, remember to re-initialize the submodule with the same command.
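A quick way to check the state of the pin is plain git; a leading `-` in the output means the submodule has not been initialized yet:

```bash
# Show the commit vendor/agent-sdk is pinned to.
# A leading "-" means the submodule still needs to be initialized.
git submodule status vendor/agent-sdk
```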
Once the submodule is set up, install dependencies via uv:
```bash
make build
```

This runs `uv sync` and ensures the `openhands-*` packages (SDK, tools, workspace, agent-server) are installed from the local workspace declared in `pyproject.toml`.
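To double-check that the local packages were picked up, you can list what landed in the environment; this is just a sanity check based on the `openhands-*` naming above:

```bash
# List installed packages and filter for the openhands-* family.
uv pip list | grep -i openhands
```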
If you want to update to a newer version of the SDK:
```bash
cd vendor/agent-sdk
git fetch
git checkout <new_commit_or_branch>
cd ../..
git add vendor/agent-sdk
git commit -m "Update agent-sdk submodule to <new_commit_sha>"
```

Then re-run:

```bash
make build
```

to rebuild your environment with the new SDK code.

</details>
Define your LLM config as a JSON file following the fields of the `LLM` class. For example, you can write the following to `.llm_config/example.json`:
```json
{
  "model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
  "base_url": "https://llm-proxy.eval.all-hands.dev",
  "api_key": "YOUR_API_KEY_HERE"
}
```

You may validate the correctness of your config by running `uv run validate-cfg .llm_config/YOUR_CONFIG_PATH.json`.
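For a quick syntax-only check before the project validator (useful for catching things like trailing commas), the Python standard library is enough:

```bash
# Exits non-zero and prints a parse error if the JSON is malformed.
python -m json.tool .llm_config/example.json
```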
Build all Docker images for SWE-Bench:
```bash
uv run benchmarks/swe_bench/build_images.py \
  --dataset princeton-nlp/SWE-bench_Verified --split test \
  --image ghcr.io/all-hands-ai/agent-server --target binary-minimal
```
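Once the build finishes, you can confirm the images exist locally; the repository name below simply mirrors the `--image` flag above (exact tags may vary by instance):

```bash
# List locally available agent-server images from the previous step.
docker images ghcr.io/all-hands-ai/agent-server
```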
```bash
# Run evaluation with your configured LLM
uv run swebench-infer .llm_config/sonnet-4.json
```

You can run evaluation on a specific subset of instances using the `--select` option:
- Create a text file with one instance ID per line (a sketch for generating one automatically follows these steps), e.g. `instances.txt`:

  ```
  django__django-11333
  astropy__astropy-12345
  requests__requests-5555
  ```
- Run evaluation with the selection file:

  ```bash
  python -m benchmarks.swe_bench.run_infer \
    --agent-cls CodeActAgent \
    --llm-config llm_config.toml \
    --max-iterations 30 \
    --select instances.txt \
    --eval-output-dir ./evaluation_results
  ```

This will only evaluate the instances listed in the file.
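If you'd rather derive the selection from the dataset itself, something like the following works; it assumes the `datasets` library is present in the environment, which is an assumption here rather than a documented dependency:

```bash
# Write the first three instance IDs of SWE-bench_Verified to instances.txt.
uv run python - <<'EOF'
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
with open("instances.txt", "w") as f:
    for instance_id in ds["instance_id"][:3]:
        f.write(instance_id + "\n")
EOF
```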
- Original OpenHands: https://github.com/All-Hands-AI/OpenHands/
- Agent SDK: https://github.com/All-Hands-AI/agent-sdk
- SWE-Bench: https://www.swebench.com/