Codebuff SWE Bench harness

This repo contains the benchmarking harness that was used to run Codebuff on SWE Bench. It closely follows Aider's SWE-bench harness, but doesn't retry when Codebuff fails to solve a problem; we want the results to be as realistic as possible to what a real user would experience.

Installation

# Clone this repo
git clone https://github.com/codebuffai/codebuff-swe-bench

# Clone the SWE Bench docker repo into a subdir of this repo
cd codebuff-swe-bench
git clone https://github.com/aorwall/SWE-bench-docker

# Install pip requirements
pip install -r requirements.txt


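# Install the Codebuff CLI globally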
npm i -g codebuff

See the SWE Bench Docker docs to ensure you have built or pulled all the SWE Bench testbed docker images you'll need.

Running the benchmark and computing results

The general SWE Bench workflow has two steps:

  1. Run your agent on the problems to produce predictions, which are a series of JSON records bundled into a JSONL file (an illustrative record is sketched below).
  2. Evaluate the predictions JSONL file against the acceptance tests. This produces .eval.log files with logs of the testing procedure.
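For reference, each line of the predictions JSONL file is a single JSON record. The minimal sketch below shows how such a file might be assembled; the field names (instance_id, model_name_or_path, model_patch) follow the standard SWE Bench prediction format, and the values are purely illustrative rather than output from this harness.

# Minimal sketch of the predictions JSONL format (illustrative, not the
# harness's actual code). Field names follow the standard SWE Bench
# prediction format.
import json

predictions = [
    {
        "instance_id": "example__repo-1234",  # SWE Bench problem instance id (placeholder)
        "model_name_or_path": "codebuff",     # label that appears in the evaluation report
        "model_patch": "diff --git a/...",    # unified diff proposed by the agent
    },
]

with open("all_preds.jsonl", "w") as f:
    for record in predictions:
        f.write(json.dumps(record) + "\n")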

This repo is for running and evaluating Codebuff on SWE Bench. It consists of three scripts:

  1. The codebuff_harness.py script runs Codebuff on all the problems and produces predictions. You can optionally pass it a --run-id to continue a particular run (the ID is just a timestamp, which appears as the directory name under predictions/codebuff/<ID>).

  2. The report.py script takes the directory containing the diffs as an argument (it should be predictions/codebuff/<ID>). It consumes all those predictions and bundles them into predictions/codebuff/<ID>/all_preds.jsonl, then feeds that JSONL file through the SWE Bench evaluation and reporting scripts to produce logs/codebuff/<ID>/<instance_id>...eval.log files as well as a summary report in predictions/<DIRNAME>/results.json.

  3. The table.py script takes the same directory argument (predictions/codebuff/<ID>). It formats the results nicely and shows the final resolved percentage.
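Putting it together, a typical run might look like the following. This is a hedged sketch: it assumes the scripts are invoked directly with python and that report.py and table.py take the predictions directory as a positional argument, as described above.

# Produce predictions (optionally pass --run-id <ID> to continue an existing run)
python codebuff_harness.py

# Bundle the diffs into all_preds.jsonl and run the SWE Bench evaluation
python report.py predictions/codebuff/<ID>

# Show the formatted results and the final resolved percentage
python table.py predictions/codebuff/<ID>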
