Before running any benchmarks, you need to set up the environment and ensure the local Agent SDK submodule is initialized.
```bash
make build
```

<details>
<summary>📦 Submodule & Environment Setup</summary>
The Benchmarks project uses a local git submodule for the OpenHands Agent SDK.
This ensures your code runs against a specific, reproducible commit.
Run once after cloning (already done for you by `make build`):
```bash
git submodule update --init --recursive
```

This command will:
- clone the SDK into `vendor/agent-sdk/`
- check out the exact commit pinned by this repo
- make it available for local development (`uv sync` will install from the local folder)
If you ever clone this repository again, remember to re-initialize the submodule with the same command.
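A quick way to check the state of the pin is plain git; a leading `-` in the output means the submodule has not been initialized yet:

```bash
# Show the commit vendor/agent-sdk is pinned to.
# A leading "-" means the submodule still needs to be initialized.
git submodule status vendor/agent-sdk
```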
Once the submodule is set up, install dependencies via uv:
```bash
make build
```

This runs `uv sync` and ensures the `openhands-*` packages (SDK, tools, workspace, agent-server) are installed from the local workspace declared in `pyproject.toml`.
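To double-check that the local packages were picked up, you can list what landed in the environment; this is just a sanity check based on the `openhands-*` naming above:

```bash
# List installed packages and filter for the openhands-* family.
uv pip list | grep -i openhands
```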
If you want to update to a newer version of the SDK:
```bash
cd vendor/agent-sdk
git fetch
git checkout <new_commit_or_branch>
cd ../..
git add vendor/agent-sdk
git commit -m "Update agent-sdk submodule to <new_commit_sha>"
```

Then re-run:

```bash
make build
```

to rebuild your environment with the new SDK code.

</details>
Define your LLM config as a JSON file following the fields of the `LLM` class. For example, you can write the following to `.llm_config/example.json`:
```json
{
  "model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
  "base_url": "https://llm-proxy.eval.all-hands.dev",
  "api_key": "YOUR_API_KEY_HERE"
}
```

You may validate the correctness of your config by running `uv run validate-cfg .llm_config/YOUR_CONFIG_PATH.json`.
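For a quick syntax-only check before the project validator (useful for catching things like trailing commas), the Python standard library is enough:

```bash
# Exits non-zero and prints a parse error if the JSON is malformed.
python -m json.tool .llm_config/example.json
```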
Build all Docker images for SWE-Bench:
```bash
uv run benchmarks/swe_bench/build_images.py \
  --dataset princeton-nlp/SWE-bench_Verified --split test \
  --image ghcr.io/all-hands-ai/agent-server --target binary-minimal
```
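Once the build finishes, you can confirm the images exist locally; the repository name below simply mirrors the `--image` flag above (exact tags may vary by instance):

```bash
# List locally available agent-server images from the previous step.
docker images ghcr.io/all-hands-ai/agent-server
```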
```bash
# Run evaluation with your configured LLM
uv run swebench-infer .llm_config/sonnet-4.json
```

You can run evaluation on a specific subset of instances using the `--select` option:
- Create a text file with one instance ID per line (a sketch for generating one automatically follows these steps), e.g. `instances.txt`:

  ```
  django__django-11333
  astropy__astropy-12345
  requests__requests-5555
  ```
- Run evaluation with the selection file:

  ```bash
  python -m benchmarks.swe_bench.run_infer \
    --agent-cls CodeActAgent \
    --llm-config llm_config.toml \
    --max-iterations 30 \
    --select instances.txt \
    --eval-output-dir ./evaluation_results
  ```

This will only evaluate the instances listed in the file.
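If you'd rather derive the selection from the dataset itself, something like the following works; it assumes the `datasets` library is present in the environment, which is an assumption here rather than a documented dependency:

```bash
# Write the first three instance IDs of SWE-bench_Verified to instances.txt.
uv run python - <<'EOF'
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
with open("instances.txt", "w") as f:
    for instance_id in ds["instance_id"][:3]:
        f.write(instance_id + "\n")
EOF
```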
- Original OpenHands: https://github.com/All-Hands-AI/OpenHands/
- Agent SDK: https://github.com/All-Hands-AI/agent-sdk
- SWE-Bench: https://www.swebench.com/